PluRule

PluRule is a multilingual, multimodal benchmark for detecting rule violations when moderating pluralistic communities on social media: 13,371 discussion instances drawn from the Pushshift archives, each pairing a rule-violating thread with a compliant thread from the same submission, labeled against the community's own rules.

This repository contains the full construction pipeline, the scripts used to hydrate the released dataset from IDs, and the evaluation harness used in the paper.

Citation

@inproceedings{plurule2025,
  title  = {{PluRule: A Benchmark for Moderating Pluralistic Communities
            on Social Media}},
  author = {Kachwala, Zoher and Truong, Bao Tran and Muralidharan, Rasika and
            Kwak, Haewoon and An, Jisun and Menczer, Filippo},
  year   = {2026},
  booktitle = {Proc. ACL},
  note   = {Forthcoming},
}

A PluRule example: GPT-5.2 (high reasoning) is given the target comment with full context — subreddit description, rules, submission, and discussion thread — and asked to pick which rule, if any, was violated. The correct answer is (e); GPT-5.2 picks (c).

At a glance

Split	Instances	Comments	Images	Subreddits / Clusters	Rules / Clusters	Languages
Train	9,155	51,968	2,077	861 / 25	1,336 / 27	9
Val	1,382	7,631	376	537 / 25	586 / 27	9
Test	2,834	13,076	1,190	1,989 / 25	2,039 / 27	9
Total	13,371	72,675	3,643	1,989 / 25	2,885 / 27	9

Every instance contains (a) a root-to-leaf discussion thread where a moderator cited a rule on the leaf comment, (b) a compliant sibling thread from the same submission, (c) the submission itself with any images, and (d) the subreddit's full rule set.
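The four parts of an instance can be pictured as a small record type. The sketch below is illustrative only: the class and field names (`Comment`, `Instance`, `violating_thread`, and so on) are hypothetical and are not taken from the released files, whose actual schema is documented in hydrate/README.md.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    # Hypothetical minimal comment record.
    comment_id: str
    body: str


@dataclass
class Instance:
    # (a) root-to-leaf thread; a moderator cited a rule on the last comment
    violating_thread: list[Comment]
    # (b) compliant sibling thread from the same submission
    compliant_thread: list[Comment]
    # (c) the submission itself
    submission_title: str
    # (d) the subreddit's full rule set
    subreddit_rules: list[str]
    # index into subreddit_rules of the rule the moderator cited
    violated_rule_index: int
    # (c) image paths attached to the submission, if any
    submission_images: list[str] = field(default_factory=list)

    def target_comment(self) -> Comment:
        """Return the leaf comment that the moderator flagged."""
        return self.violating_thread[-1]

    def violated_rule(self) -> str:
        """Return the text of the cited rule."""
        return self.subreddit_rules[self.violated_rule_index]


# Made-up example data, for illustration only.
example = Instance(
    violating_thread=[Comment("c1", "I disagree."), Comment("c2", "You are an idiot.")],
    compliant_thread=[Comment("c3", "Here is a source.")],
    submission_title="Weekly discussion",
    subreddit_rules=["Be civil", "No self-promotion"],
    violated_rule_index=0,
)
print(example.target_comment().body, "->", example.violated_rule())
```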

What it covers

Left: 2D UMAP projections of (a) 1,989 subreddits and (b) 2,885 rules, colored by HDBSCAN cluster; grey points are unclustered ("other"). Right: distributions of the 13,371 instances across (c) 25 subreddit clusters and (d) 27 rule clusters.

Main results

Accuracy (%) across models and context levels on the test set. Numbers in parentheses show the delta from the previous row. Bold is best per model. 95% CIs are within ±1.3% everywhere. The "No rules broken" baseline is 50%.

Context	4B Inst.	4B Think.	8B Inst.	8B Think.	30B Inst.	30B Think.	GPT-5.2 Low	GPT-5.2 High
Comment only	49.6	37.4	51.0	40.3	50.2	46.1	54.1	55.0
+ Discussion	49.2 (−0.4)	39.8 (+2.4)	50.7 (−0.3)	43.9 (+3.6)	51.0 (+0.8)	48.2 (+2.1)	55.3 (+1.2)	56.2 (+1.2)
+ Submission	48.3 (−0.9)	44.9 (+5.1)	49.2 (−1.5)	47.2 (+3.3)	51.1 (+0.1)	49.1 (+0.9)	56.8 (+1.5)	57.3 (+1.1)
+ User	48.9 (+0.6)	45.0 (+0.1)	50.0 (+0.8)	46.7 (−0.5)	52.4 (+1.3)	49.4 (+0.3)	57.4 (+0.6)	57.7 (+0.4)
+ Images	48.4 (−0.5)	45.0 (+0.0)	49.8 (−0.2)	44.9 (−1.8)	52.3 (−0.1)	49.5 (+0.1)	57.4 (+0.0)	57.6 (−0.1)

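Even the best result (GPT-5.2 high reasoning: 57.7% with user context, 57.6% with full context) is less than 8 points above the trivial baseline. For GPT-5.2, adding context (discussion thread, submission, user identifiers, images) helps by about 3 points in total. The open-weight models (Qwen3-VL-Instruct / -Thinking) stay at or near the baseline: the Instruct variants peak at 52.4%, and the Thinking variants gain up to 7.6 points from context but never reach 50%.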
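The 50% baseline and the size of the confidence intervals can be sanity-checked with a few lines of arithmetic. This is a rough sketch under two assumptions that the README does not state: that accuracy is scored per thread (one violating and one compliant thread for each of the 2,834 test instances), and that the intervals are normal-approximation binomial intervals. The function names are hypothetical.

```python
import math


def always_compliant_accuracy(n_instances: int) -> float:
    """Accuracy of always answering "no rules broken".

    Assumes each instance contributes exactly one violating thread
    and one compliant thread, as described for the dataset.
    """
    labels = ["violation", "none"] * n_instances
    correct = sum(1 for label in labels if label == "none")
    return correct / len(labels)


def normal_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


n_test_instances = 2834
n_predictions = 2 * n_test_instances  # assumption: one prediction per thread

baseline = always_compliant_accuracy(n_test_instances)
half_width = normal_ci_half_width(0.5, n_predictions)

print(f"baseline accuracy: {baseline:.1%}")
print(f"widest 95% CI half-width: {half_width:.2%}")
```

Under these assumptions the widest half-width (at an accuracy of 50%) comes out at roughly 1.3 percentage points, which is in line with the bound quoted for the results table.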

Per-cluster breakdown (GPT-5.2 high reasoning, full context)

Accuracy by (a) subreddit cluster and (b) rule cluster with 95% CI. Dashed line is the 50% baseline. Universal violations (civility, self-promotion) are solved well; context-dependent rules (low-effort, evidence-based, relevance) fall below baseline.

What do you want to do?

▶︎ Run the benchmark on the released dataset

Start here if you want to evaluate a model on PluRule.

1. Download the three dehydrated split files from huggingface.co/datasets/osome-iu/PluRule and place them under ./data/.
2. Follow hydrate/README.md to fill in comments, submissions, and media from the Pushshift archives (a few hours, no GPU needed).
3. Follow eval/README.md to run your model. The harness supports the Qwen3-VL models configured in eval/config.py via vLLM, and OpenAI API models via the two-stage evaluator.
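The download step can also be scripted. The sketch below uses `snapshot_download` from the `huggingface_hub` package, which is assumed to be installed; it is not necessarily part of the provided conda environments. It fetches the whole dataset repository snapshot, which should include the three split files, rather than selecting them by name, since the file names are not listed here. The helper name `fetch_dehydrated_splits` is hypothetical.

```python
from pathlib import Path


def fetch_dehydrated_splits(data_dir: str = "./data") -> Path:
    """Download the PluRule dataset repo snapshot into data_dir.

    Returns the directory the files were written to.
    """
    # Imported lazily so this module still loads without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id="osome-iu/PluRule",
        repo_type="dataset",  # dataset repo, not a model repo
        local_dir=str(target),
    )
    return target
```

Calling `fetch_dehydrated_splits()` from the repository root would populate ./data/, after which the hydration scripts can be run as described in hydrate/README.md.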

▶︎ Rebuild PluRule from scratch

Start here if you want to reproduce the dataset end to end, tweak thresholds, or extend the pipeline.

Follow pipeline/README.md. Budget 1–2 days and multiple GPUs: embedding matcher (Qwen3-Embedding-8B), LLM judge (Qwen3-30B-A3B-Instruct), and cluster labeler (Qwen3-30B-A3B-Thinking) are all run locally via vLLM.

▶︎ Reproduce the human evaluation

See eval/human_eval/ for the Google Forms annotation protocol used in Section 5.4 of the paper (96% overall agreement with the pipeline's labels on a 100-instance audit).

Install

git clone https://github.com/osome-iu/PluRule.git
cd PluRule

# Pick the env that matches your goal:
conda env create -f environment-hydrate.yml   # minimal, hydration only (no GPU)
conda env create -f environment-pipeline.yml  # end-to-end reconstruction (GPUs)
conda env create -f environment-eval.yml      # benchmark evaluation (GPU or API keys)

For API-model evaluation, make sure the relevant API key is available in the process environment. credentials/.env.template is provided as a template, but the current evaluator reads keys from the environment used to launch Python.
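A quick pre-flight check can confirm that the key is visible to the Python process before a long evaluation run starts. This is a minimal sketch: the helper `require_api_key` is hypothetical, and `OPENAI_API_KEY` is the variable conventionally read by the OpenAI Python SDK; the exact variable the evaluator expects is an assumption to verify against eval/README.md.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of an environment variable, or fail early if unset.

    The default variable name is an assumption; pass the name your
    evaluator actually reads.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set in the environment used to launch Python; "
            "export it in your shell before running the evaluator."
        )
    return key
```

Failing early like this avoids discovering a missing key only after models and data have been loaded.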

Repo layout

PluRule/
├── hydrate/          # 3 scripts to reconstitute the released dataset
├── pipeline/         # end-to-end reconstruction from Pushshift (paper §5)
├── eval/             # benchmark evaluation harness
│   └── human_eval/   # human annotation reproduction
├── utils/            # shared helpers (zst I/O, Pushshift torrent, media, …)
├── config.py         # base paths + thresholds (edit before running)
├── credentials/      # API key templates (.env, Reddit, Google)
├── environment-hydrate.yml    # hydration-only conda env
├── environment-pipeline.yml   # reconstruction conda env
└── environment-eval.yml       # evaluation conda env

License

Code in this repository is released under the MIT License — see LICENSE. The released PluRule dataset files on HuggingFace are also MIT-licensed in their dehydrated form: IDs, metadata, labels, rules, and placeholders. Hydrated moderator comments, submissions, and media are not redistributed by PluRule; they are reconstructed from the publicly archived Pushshift Reddit corpus and remain bound by Reddit's terms of service.
