- Rationale & Vision
- Key Features
- Architecture Overview
- Prerequisites
- Installation & Setup
- Running the System
- How It Works
- Configuration
- Development Workflow
- Troubleshooting
- Additional Documentation
## Rationale & Vision

The scientific community faces a reproducibility crisis: many published research papers cannot be independently replicated due to:
- Missing or incomplete code repositories
- Unavailable or poorly documented datasets
- Ambiguous experimental protocols and hyperparameters
- Lack of detailed methodology descriptions
- Inconsistent reporting of statistical procedures
Manual evaluation of paper reproducibility is:
- Time-consuming: Requires expert reviewers to spend hours per paper
- Inconsistent: Different reviewers may apply different standards
- Not scalable: Impossible to evaluate thousands of papers at conferences
- Subjective: Human bias in interpretation of criteria
PaperSnitch automates the reproducibility assessment process using:
- Multi-Step Retrieval-Augmented Generation (RAG): Intelligently retrieves relevant paper sections for each evaluation criterion using semantic embeddings
- Structured LLM Analysis: Uses gpt-5 with Pydantic schemas for consistent, machine-parseable outputs (see the sketch after this list)
- Programmatic Scoring: Combines LLM-based text analysis with rules-based scoring algorithms
- Code Repository Analysis: Automatically ingests, analyzes, and embeds source code to evaluate reproducibility artifacts
- Workflow Orchestration: LangGraph-based DAG execution with database persistence and fault tolerance
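For concreteness, the structured-output pattern might look like the following sketch. It assumes the OpenAI Python SDK's Responses API, and `CriterionAnalysis` is a hypothetical stand-in for the project's actual Pydantic schemas:

```python
# Illustrative structured-output call (OpenAI Python SDK, Responses API).
# CriterionAnalysis is a hypothetical stand-in for the project's real schemas.
from openai import OpenAI
from pydantic import BaseModel


class CriterionAnalysis(BaseModel):
    present: bool       # is the criterion satisfied in the retrieved sections?
    confidence: float   # model-reported confidence, 0.0-1.0
    evidence: str       # quoted supporting text


client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.responses.parse(
    model="gpt-5",
    input="Criterion: hyperparameters fully reported.\nSections: ...",
    text_format=CriterionAnalysis,
)
analysis = response.output_parsed  # a validated CriterionAnalysis instance
print(analysis.present, analysis.confidence)
```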
This system enables:
- Large-scale conference analysis: Process hundreds of papers efficiently
- Consistent evaluation standards: Same criteria applied uniformly
- Actionable feedback: Specific recommendations for improving reproducibility
- Quantifiable metrics: Numerical scores for comparison and benchmarking
- Research insights: Understanding reproducibility trends across domains
## Key Features

- Paper Type Classification: Automatically identifies papers as dataset/method/both/theoretical
- Adaptive Scoring: Weights criteria based on paper type (e.g., datasets more important for dataset papers)
- Code Intelligence: LLM-guided selection of reproducibility-critical files from repositories
- Multi-Criterion Evaluation: 20 reproducibility criteria + 10 dataset documentation criteria + 6 code analysis components
- Paper-Level Analysis: Evaluates mathematical descriptions, experimental protocols, statistical reporting
- Code-Level Analysis: Checks for training code, evaluation scripts, checkpoints, dependencies
- Dataset Documentation: Assesses data collection, annotation protocols, ethical compliance
- Evidence-Based Scoring: Links each evaluation to specific paper sections
- Distributed Execution: Celery workers for parallel paper processing
- Database-Backed: MySQL persistence for all workflow states and results
- Fault Tolerant: Automatic retries, error isolation, partial result aggregation (see the retry sketch after this list)
- Token Tracking: Fine-grained cost accounting per workflow node
- PDF Upload: Direct paper upload with automatic text extraction (GROBID)
- Conference Scraping: Batch import papers from conference websites (MICCAI, etc.)
- Analysis Dashboard: View results, scores, and detailed criterion evaluations
- User Management: Profile-based tracking of analysis history
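As referenced in the fault-tolerance bullet above, a retry-enabled Celery task might be declared like this. This is a minimal sketch with a hypothetical task name, not the project's actual task code:

```python
# Minimal sketch of a fault-tolerant Celery task (hypothetical task name;
# the real node tasks live in the workflow engine).
from celery import shared_task


@shared_task(
    bind=True,
    autoretry_for=(Exception,),   # retry on any error...
    retry_backoff=True,           # ...with exponential backoff
    retry_kwargs={"max_retries": 3},
    acks_late=True,               # re-deliver if the worker dies mid-task
)
def run_workflow_node(self, workflow_run_id: str, node_id: str) -> None:
    """Execute one DAG node; failures are isolated to this node."""
    ...  # load state from the database, call the node, persist results
```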
## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                  NGINX (Reverse Proxy)                  │
│             Port 80/443 (SSL via Let's Encrypt)         │
└────────────────────┬────────────────────────────────────┘
                     │
        ┌────────────┴───────────────┐
        │                            │
   ┌────▼──────┐              ┌──────▼─────┐
   │  Django   │              │   Static   │
   │   Web     │◄─────────────┤   Files    │
   │  (ASGI)   │              └────────────┘
   └─────┬─────┘
         │
   ┌─────┴────────────┬──────────────┐
   │                  │              │
┌──▼─────┐      ┌─────▼───────┐ ┌────▼───────┐
│ Celery │      │   MySQL     │ │   Redis    │
│Workers │◄────►│  Database   │ │  (Broker)  │
│ (3-5)  │      │  (InnoDB)   │ └────────────┘
└───┬────┘      └─────────────┘
    │
    ├──► GROBID Server (PDF → TEI-XML)
    └──► LLM APIs (OpenAI/LiteLLM)
```
| Component | Technology | Purpose |
|---|---|---|
| Web Framework | Django 5.2.7 | HTTP server, ORM, admin interface |
| Workflow Engine | LangGraph 1.0.6 | DAG-based workflow orchestration |
| Task Queue | Celery 5.x | Distributed async task execution |
| Message Broker | Redis 7 | Celery task queue backend |
| Database | MariaDB 11.7 | Persistent storage (MySQL 8.0 compatible) |
| Document Processing | GROBID 0.8.0 | PDF → structured XML extraction (see the sketch below) |
| Web Scraping | Crawl4AI 0.7.6 | Conference website data extraction |
| Code Ingestion | GitIngest 0.3.1 | Repository cloning and file extraction |
| LLM Integration | OpenAI SDK 2.7.2 | gpt-5 API calls with structured outputs |
| Embeddings | text-embedding-3-small | 1536-dim semantic vectors for RAG |
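For reference, the GROBID step listed in the table above is served over GROBID's standard REST API. The following is a minimal sketch assuming the dev GROBID server on port 8071 (the default dev port used later in this README) and an illustrative file name:

```python
# Illustrative call to a GROBID server's processFulltextDocument endpoint.
import requests

GROBID_URL = "http://localhost:8071/api/processFulltextDocument"

with open("paper.pdf", "rb") as pdf:  # hypothetical input file
    resp = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)
resp.raise_for_status()
tei_xml = resp.text  # TEI-XML with structured sections, references, metadata
print(tei_xml[:200])
```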
The analysis pipeline consists of 8 nodes executed as a directed acyclic graph (DAG):
```
        ┌─────────────────────────────────┐
        │  A. Paper Type Classification   │
        │  (dataset/method/both/          │
        │   theoretical)                  │
        └───────────────┬─────────────────┘
                        │
        ┌───────────────▼─────────────────┐
        │  B. Section Embeddings          │
        │  (text-embedding-3-small)       │
        └───────────────┬─────────────────┘
                        │
     ┌──────────────────┼────────────────────┐
     │                  │                    │
┌────▼──────────────┐ ┌─▼─────────────┐ ┌────▼───────────┐
│ C. Reproducibility│ │ D. Dataset    │ │ E. Code        │
│    Checklist      │ │    Docs Check │ │    Availability│
│    (20 criteria)  │ │    (10 crit.) │ │    Check       │
└────┬──────────────┘ └─┬─────────────┘ └────┬───────────┘
     │                  │                    │
     │                  │          ┌─────────▼───────────┐
     │                  │          │ F. Code Embedding   │
     │                  │          │    (repo ingestion) │
     │                  │          └─────────┬───────────┘
     │                  │                    │
     │                  │          ┌─────────▼───────────┐
     │                  │          │ G. Code Repository  │
     │                  │          │    Analysis         │
     │                  │          └─────────┬───────────┘
     │                  │                    │
     └──────────────────┼────────────────────┘
                        │
        ┌───────────────▼─────────────────┐
        │  H. Final Aggregation           │
        │  (weighted scoring + LLM)       │
        └─────────────────────────────────┘
```
Node Responsibilities:
- Node A: Classify paper type using title + abstract
- Node B: Generate embeddings for all paper sections
- Node C: Evaluate 20 reproducibility criteria via multi-step RAG
- Node D: Evaluate 10 dataset documentation criteria
- Node E: Search for code URLs in paper (GitHub, GitLab, etc.)
- Node F: Ingest code repository, select critical files, embed chunks
- Node G: Analyze code repository structure and reproducibility artifacts
- Node H: Aggregate scores, generate qualitative assessment (a minimal wiring sketch follows this list)
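A minimal sketch of how this DAG could be wired with LangGraph; the state fields and node functions are hypothetical placeholders rather than the project's actual implementations:

```python
# Hypothetical LangGraph wiring for the A→H pipeline (langgraph >= 0.2).
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class PaperState(TypedDict, total=False):
    paper_id: str
    paper_type: str


def classify_paper(state: PaperState) -> dict:
    # Node A placeholder: would call the LLM on title + abstract
    return {"paper_type": "method"}


def noop(state: PaperState) -> dict:
    # Stand-in for Nodes B-H
    return {}


graph = StateGraph(PaperState)
graph.add_node("A_classify", classify_paper)
for name in ["B_embed", "C_checklist", "D_dataset_docs",
             "E_code_search", "F_code_embed", "G_code_analysis", "H_aggregate"]:
    graph.add_node(name, noop)

graph.add_edge(START, "A_classify")
graph.add_edge("A_classify", "B_embed")
# Fan-out: C, D, and the code chain E→F→G all start after B
for branch in ["C_checklist", "D_dataset_docs", "E_code_search"]:
    graph.add_edge("B_embed", branch)
graph.add_edge("E_code_search", "F_code_embed")
graph.add_edge("F_code_embed", "G_code_analysis")
# Fan-in: H waits for all three branches to finish
graph.add_edge(["C_checklist", "D_dataset_docs", "G_code_analysis"], "H_aggregate")
graph.add_edge("H_aggregate", END)

app = graph.compile()
print(app.invoke({"paper_id": "example"}))
```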
The Conferences tab lists the conferences that have already been scraped. Each conference page shows its token-usage statistics and the papers associated with it, each with a summary.
The paper page shows the workflow of the selected analysis at the top and a list of all analyses performed at the bottom.
## Prerequisites

- Docker: 24.0+ with Docker Compose V2
- Git: 2.30+
- Linux/macOS: Tested on Ubuntu 22.04+ and macOS 13+
You'll need an OpenAI API key with access to:
- Any GPT model compatible with the Responses API (used for structured analysis)
- text-embedding-3-small: For semantic embeddings
Minimum:
- 4 CPU cores
- 16 GB RAM
- 50 GB disk space
Recommended:
- 8 CPU cores
- 32 GB RAM
- 100 GB SSD storage
## Installation & Setup

```bash
git clone https://github.com/yourusername/papersnitch.git
cd papersnitch
```

A template_env file is provided. Rename it to .env.local and set all passwords and API keys.
Use the provided script to start all services:
```bash
./create-dev-stack.sh up 8000 dev
```

What this does:
- Finds available ports (8000 for Django, 3306 for MySQL, 6379 for Redis, 8071 for GROBID)
- Creates stack-specific directories (mysql_dev, media_dev, static_dev)
- Generates .env.dev with port configuration (starting from .env.local)
- Starts Docker Compose services:
  - django-web-dev: Django application server
  - mysql: MariaDB 11.7 database
  - redis: Redis message broker
  - celery-worker: Background task processor
  - celery-beat: Periodic task scheduler
Wait for database to be healthy, then run migrations:
```bash
# Check if MySQL is ready
docker exec mysql-dev mariadb -u papersnitch -ppapersnitch -e "SELECT 1"

# Run Django migrations
docker exec django-web-dev python manage.py migrate

# Create superuser for admin access
docker exec -it django-web-dev python manage.py createsuperuser
```

Pre-compute embeddings for the reproducibility criteria (one-time setup):

```bash
docker exec django-web-dev python manage.py initialize_criteria_embeddings
```

What this does:
- Creates embeddings for 20 reproducibility checklist criteria
- Creates embeddings for 10 dataset documentation criteria
- Stores in database for semantic retrieval during analysis
Work in progress.
## Running the System

```bash
# Start the stack
./create-dev-stack.sh up 8000 dev

# Stop the stack (preserves data)
./create-dev-stack.sh stop 8000 dev

# Stop and remove containers (preserves data)
./create-dev-stack.sh down 8000 dev

# View logs (all services)
./create-dev-stack.sh logs 8000 dev
```

- Navigate to http://localhost:8000/analyze
- Log in with superuser credentials
- Upload a PDF or paste arXiv URL
- Click "Analyze Reproducibility"
- View results in real-time as workflow executes
- Navigate to http://localhost:8000/admin
- Go to Papers → Add Paper
- Upload PDF and fill metadata
- Go to Workflow Runs → Add Workflow Run
- Select paper and workflow definition
- Save to trigger analysis
```bash
docker exec -it django-web-dev python manage.py shell
```

```python
from webApp.models import Paper, WorkflowDefinition
from webApp.services.workflow_orchestrator import WorkflowOrchestrator

# Get paper and workflow
paper = Paper.objects.first()
workflow_def = WorkflowDefinition.objects.get(
    name="paper_processing_with_reproducibility",
    version=8
)

# Create workflow run
orchestrator = WorkflowOrchestrator()
workflow_run = orchestrator.create_workflow_run(
    workflow_definition=workflow_def,
    paper=paper,
    context_data={
        "model": "gpt-5-2024-11-20",
        "force_reprocess": False
    }
)
print(f"Workflow run created: {workflow_run.id}")
```

View workflow progress in Django admin at http://localhost:8000/admin/workflow_engine/workflowrun/
Or query the database:
```bash
docker exec -it mysql-dev mariadb -u papersnitch -ppapersnitch papersnitch
```

```sql
# Check workflow status
SELECT id, status, started_at, completed_at
FROM workflow_runs
ORDER BY created_at DESC LIMIT 10;

# Check node status
SELECT node_id, status, duration_seconds, input_tokens, output_tokens
FROM workflow_nodes
WHERE workflow_run_id = 'your-workflow-run-id'
ORDER BY started_at;
```

## How It Works

PaperSnitch uses an 8-node DAG workflow to comprehensively evaluate research paper reproducibility:
1. Paper Type Classification (Node A): Determines whether the paper is dataset/method/both/theoretical using LLM analysis of the title and abstract
2. Section Embeddings (Node B): Generates semantic embeddings for all paper sections (abstract, intro, methods, results, etc.) using text-embedding-3-small
3. Parallel Analysis:
   - Reproducibility Checklist (Node C): Evaluates 20 criteria using multi-step RAG (retrieves relevant sections per criterion, then analyzes with LLM)
   - Dataset Documentation (Node D): Evaluates 10 dataset-specific criteria
   - Code Workflow (Nodes E→F→G):
     - Node E: Searches for code repository URLs
     - Node F: Ingests the repo, LLM selects critical files, embeds all file chunks
     - Node G: Analyzes repository structure, artifacts, and reproducibility
4. Final Aggregation (Node H): Combines all scores with adaptive weighting and generates a qualitative assessment (see the weighting sketch below)
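As a rough illustration of the adaptive weighting in step 4, a minimal sketch; the category names and weight values here are hypothetical, not the project's actual configuration:

```python
# Hypothetical final-aggregation sketch: category scores are combined with
# weights that depend on the paper type (all values illustrative only).
WEIGHTS = {
    "dataset":     {"checklist": 0.30, "dataset_docs": 0.50, "code": 0.20},
    "method":      {"checklist": 0.40, "dataset_docs": 0.10, "code": 0.50},
    "both":        {"checklist": 0.35, "dataset_docs": 0.30, "code": 0.35},
    "theoretical": {"checklist": 0.70, "dataset_docs": 0.10, "code": 0.20},
}


def aggregate(paper_type: str, scores: dict[str, float]) -> float:
    """Weighted average of category scores (each on a 0-10 scale)."""
    weights = WEIGHTS[paper_type]
    return sum(weights[cat] * scores[cat] for cat in weights)


print(aggregate("method", {"checklist": 7.5, "dataset_docs": 5.0, "code": 8.0}))
# → 0.4*7.5 + 0.1*5.0 + 0.5*8.0 = 7.5
```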
Multi-Step RAG for Criterion Evaluation:
```
# For each criterion:
1. Retrieve top-3 most relevant paper sections via cosine similarity
2. Provide sections + criterion description to LLM
3. Get structured analysis (present/absent, confidence, evidence)
4. Aggregate 20 criterion analyses → category scores → overall score
```
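A minimal sketch of the retrieval step (step 1 above), assuming section embeddings are stored as lists of floats; numpy and the variable names are illustrative:

```python
# Illustrative top-k retrieval by cosine similarity (step 1 above).
import numpy as np


def top_k_sections(criterion_emb: list[float],
                   section_embs: dict[str, list[float]],
                   k: int = 3) -> list[str]:
    """Return the names of the k paper sections most similar to a criterion."""
    q = np.asarray(criterion_emb)
    q = q / np.linalg.norm(q)
    scored = []
    for name, emb in section_embs.items():
        v = np.asarray(emb)
        scored.append((float(q @ v / np.linalg.norm(v)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# e.g. top_k_sections(criterion.embedding, {"methods": [...], "results": [...]})
```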
Adaptive Code Scoring:

```python
# Scoring adapts to the research methodology
if methodology == "deep_learning":
    # Requires: training code + checkpoints + datasets
    max_score_components = {
        "code_completeness": 3.0,
        "artifacts": 2.5,        # Checkpoints critical
        "dataset_splits": 2.0
    }
elif methodology == "theoretical":
    # Requires: implementation code only
    max_score_components = {
        "code_completeness": 2.5,
        "artifacts": 0.5,        # Checkpoints not applicable
        "dataset_splits": 0.5
    }
```

LLM-Guided Code File Selection:
```
# Instead of embedding the entire repository:
1. Extract README + file tree
2. LLM selects reproducibility-critical files (within 100k token budget)
3. Only embed selected files (20k char chunks)
4. Use embeddings for evidence-based component analysis
```

For detailed technical documentation, see TECHNICAL_DESCRIPTION_FOR_PAPER.md.
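As a rough illustration of step 3 in the outline above (the 20k-character chunks), a sketch using a simple character-based splitter; the overlap size is an assumption:

```python
# Hypothetical sketch of the file-chunking step: fixed-size 20k-character
# chunks with a small overlap (overlap value is an assumption).
def chunk_file(text: str, chunk_size: int = 20_000, overlap: int = 500) -> list[str]:
    """Split a file's text into ~20k-char chunks for embedding."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk would then be embedded with text-embedding-3-small, e.g.:
# client.embeddings.create(model="text-embedding-3-small", input=chunk)
```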
## Configuration

Key variables in .env.local:
```bash
# OpenAI API
OPENAI_API_KEY=sk-proj-...
DEFAULT_LLM_MODEL=gpt-5-2024-11-20
EMBEDDING_MODEL=text-embedding-3-small

# Database
MYSQL_DATABASE=papersnitch
MYSQL_USER=papersnitch
MYSQL_PASSWORD=your_password

# Celery
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_CONCURRENCY=8            # Tasks per worker
CELERY_MAX_TASKS_PER_CHILD=1    # Restart after 1 task

# Security
DJANGO_SECRET_KEY=...
DJANGO_DEBUG=True
DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1
```

Modify criteria or scoring weights in Django admin or via shell:
```python
from webApp.models import ReproducibilityChecklistCriterion

criterion = ReproducibilityChecklistCriterion.objects.get(
    criterion_id="mathematical_description"
)
criterion.description = "Updated description..."
criterion.save()

# Regenerate embedding after modification
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=f"{criterion.criterion_name}\n{criterion.description}"
)
criterion.embedding = response.data[0].embedding
criterion.save()
```

## Development Workflow

Support for parallel development environments:
```bash
# Main dev stack on port 8000
./create-dev-stack.sh up 8000 dev

# Feature branch on port 8001
./create-dev-stack.sh up 8001 feature-x

# Personal stack on port 8002
./create-dev-stack.sh up 8002 my-name

# Each stack has an isolated database, media files, and Redis
```

Django auto-reloads on code changes via Docker Compose watch mode.
```bash
# Create migration
docker exec django-web-dev python manage.py makemigrations

# Apply migrations
docker exec django-web-dev python manage.py migrate

# Rollback
docker exec django-web-dev python manage.py migrate workflow_engine 0001
```
```bash
# All tests
docker exec django-web-dev python manage.py test

# Specific app
docker exec django-web-dev python manage.py test webApp.tests

# With coverage
docker exec django-web-dev coverage run manage.py test
docker exec django-web-dev coverage html
```

## Troubleshooting

Port Already in Use:
```bash
# Use a different port
./create-dev-stack.sh up 8001 dev
```

MySQL Connection Refused:
```bash
# Check MySQL health
docker exec mysql-dev mariadb -u papersnitch -ppapersnitch -e "SELECT 1"

# Restart MySQL
docker restart mysql-dev
```

Celery Workers Not Processing:
```bash
# Check worker status
docker exec django-web-dev celery -A web inspect active

# Restart workers
docker restart celery-worker-dev
```

OpenAI Rate Limits:
```yaml
# Reduce concurrency in compose.dev.yml:
command: celery -A web worker --concurrency=2
```

Out of Memory:
```yaml
# Increase the Docker memory limit (Docker Desktop → Settings → Resources)
# or reduce Celery concurrency:
command: celery -A web worker --concurrency=2 --max-tasks-per-child=1
```
```bash
# Check retrieval for a specific paper
python debug_aspect_retrieval.py --paper-id 123 --aspect methodology

# List papers with embeddings
python debug_aspect_retrieval.py --list-papers

# Verify workflow installation
python verify_workflow_installation.py
```

## Additional Documentation

- TECHNICAL_DESCRIPTION_FOR_PAPER.md: Complete technical specification for academic paper
- WORKFLOW_ENGINE_DELIVERY.md: Workflow engine implementation details
- CODE_REPRODUCIBILITY_ANALYSIS.md: Code analysis node documentation
- DEPLOYMENT_CHECKLIST.md: Production deployment guide
- DOMAIN_SETUP_GUIDE.md: SSL and domain configuration
This project is licensed under the MIT License.
- GROBID: PDF text extraction
- LangGraph: Workflow orchestration
- OpenAI: LLM APIs
- Crawl4AI: Conference scraping
- GitIngest: Code repository ingestion
Built with ❤️ for the research community
Making reproducibility the norm, not the exception