DetectionAI investigates how reliably we can identify AI-generated writing. The notebooks compare multiple detectors across datasets and summarize performance metrics that inform practical detection thresholds.
- Repository Contents
- Prerequisites
- Installation
- Quick Start
- Study Design
- Usage
- AI Policy Compliance
- License
- Citation
- References
notebooks/– Jupyter notebooks for running the detection experiments.results/– CSV tables produced by the notebooks.data/– placeholder directory for raw datasets.requirements.txt– Python dependencies for the project.
This project requires Python 3.10+. The main Python packages used are listed in requirements.txt: pandas, numpy, torch, transformers, scikit-learn, scipy, openai, requests and tqdm.
git clone https://github.com/your-user/DetectionAI.git
cd DetectionAI
pip install -r requirements.txtRun notebooks/Code_statistics_part.ipynb first. It loads the dataset JSON files and
produces summary tables such as results/Table1_overall.csv.
Look for AUROC scores close to 1.0 indicating strong detection.
Our workflow involves several steps:
- Gather pre-2020 human-written texts spanning different genres.
- Create AI-generated versions of these texts using the following prompt [].
- Evaluate multiple AI detectors on the combined dataset, reporting type I and type II error rates overall and by genre.
- Examine how performance changes when detection thresholds vary.
- Repeat the evaluation for short passages (49 words or fewer).
- "Humanize" the AI-generated texts with the StealthGPT tool and re-test them using the best-performing detector, Pengram.
Our evaluation relies on a diverse collection of pre-2020 human-authored texts across multiple genres. The original human-written documents are sourced from publicly available and widely used datasets:
- CC-News – A collection of English-language news articles scraped from Common Crawl. Used to represent journalistic content.
- Bar-Ilan Blog Authorship Corpus – A corpus of personal blog posts written by thousands of authors. Used to evaluate performance on informal first-person writing.
- Resume Dataset (Kaggle) – A compilation of resumes and CVs representing professional self-descriptions. Used to test detector accuracy on structured, formal writing.
- Yelp Review Full – User-generated reviews with sentiment ratings. Used to evaluate performance on consumer-facing narrative and opinionated writing.
- Amazon Reviews (Kaggle) – Product reviews across a wide range of categories. Used to assess generalization across e-commerce and review genres.
- Pre-2000 Novel Corpus – A collection of long-form fictional texts written prior to the year 2000. Included to test how detectors perform on traditional narrative structures and literary language, distinct from modern internet discourse.
The notebooks were developed in Google Colab and GCP, but they can be run locally as well.
- Open the desired notebook on GitHub and choose Open in Colab.
- Mount Google Drive when prompted so the notebooks can access the data files in
/content/drive/....
- Clone the repository and install the requirements as shown above.
- Place the data JSON/CSV files in a directory of your choice and update the paths at the top of each notebook.
- Set any required API keys (e.g.
OPENAI_API_KEY) as environment variables before running cells that call external services.
This project follows the Scientific Policy on AI Use & Reproducibility in Economics (Jabarian, 2025). Key points include:
-
AI-Detection Systems – We evaluate several detectors:
- Pengram
- GPTZero
- Originality
- roberta-base-detector
-
AI-Detector System Choice Closed–source APIs were chosen as primary models because the open-source model, Roberta-base-detector, performs extremely poorly and appears only as a secondary check.
-
AI-Generating Model We used the following models to generate the AI writings
- GPT-3.5-turbo
- GPT-4.1
- Claude Opus 4
- Claude Sonnet 4
- Gemini 2.0 Flash
to generate the AI writings following this parametrization:
- model="gpt-3.5-turbo",
- temperature=0.7,
- max_tokens=min(word_count * 2, 2048)
-
AI Prompts We used the following prompt to generate the AI version of the human pre-2020 texts: """You are a writing assistant. Write an original passage on the topic: '{topic}'. It should be approximately {word_count} words long. Be clear and human-like. Avoid copying or referencing specific texts.
⚠️ Do not include or repeat the topic or instructions in your output. Return only the generated passage text.""" -
Training-Data Separation – We do not upload any confidential data to third-party services.
-
Hallucination & Robustness Diagnostics – Results tables report sensitivity to multiple sampling settings.
-
Citation and Attribution – Any verbatim AI-generated text will cite the model snapshot and date.
-
Reviewer Checklist – The repository contains, to the best of our knowledge, sufficient material to satisfy the policy's reproducibility checklist.
- Co-authors: Alex Imas, Brian Jabarian
- RAs: Eda Congedez, Ziyue Feng, Zlata Krasic, Andrew Rafael James
This project is licensed under the MIT License.
If you build upon this work, please cite it as:
Alex Imas, Brian Jabarian. "DetectionAI: Evaluating AI-generated text detectors." 2025.
Feel free to contact the authors for clarification requests.
alex.imas@chicagobooth.edu brian.jabarian@chicagobooth.edu
Brian Jabarian, 2025, Scientific Policy on AI Use & Reproducibility in Economics, work-in-progress