Zoher15/PluRule

PluRule

PluRule is a multilingual, multimodal benchmark for detecting rule violations when moderating pluralistic communities on social media: 13,371 discussion instances drawn from the Pushshift archives, each pairing a rule-violating thread with a compliant thread from the same submission, labeled against the community's own rules.

This repository contains the full construction pipeline, the scripts used to hydrate the released dataset from IDs, and the evaluation harness used in the paper.

Citation

@inproceedings{plurule2025,
  title  = {{PluRule: A Benchmark for Moderating Pluralistic Communities
            on Social Media}},
  author = {Kachwala, Zoher and Truong, Bao Tran and Muralidharan, Rasika and
            Kwak, Haewoon and An, Jisun and Menczer, Filippo},
  year   = {2026},
  booktitle = {Proc. ACL},
  note   = {Forthcoming},
}

A PluRule example

A PluRule example: GPT-5.2 (high reasoning) is given the target comment with full context — subreddit description, rules, submission, and discussion thread — and asked to pick which rule, if any, was violated. The correct answer is (e); GPT-5.2 picks (c).

At a glance

| Split | Instances | Comments | Images | Subreddits / Clusters | Rules / Clusters | Languages |
|---|---|---|---|---|---|---|
| Train | 9,155 | 51,968 | 2,077 | 861 / 25 | 1,336 / 27 | 9 |
| Val | 1,382 | 7,631 | 376 | 537 / 25 | 586 / 27 | 9 |
| Test | 2,834 | 13,076 | 1,190 | 1,989 / 25 | 2,039 / 27 | 9 |
| Total | 13,371 | 72,675 | 3,643 | 1,989 / 25 | 2,885 / 27 | 9 |

Every instance contains (a) a root-to-leaf discussion thread where a moderator cited a rule on the leaf comment, (b) a compliant sibling thread from the same submission, (c) the submission itself with any images, and (d) the subreddit's full rule set.
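As a rough illustration of the four parts listed above, an instance could be modeled as below. This is a minimal sketch, not the released schema: every class and field name here (Comment, Instance, violating_thread, and so on) is hypothetical, and the assumption that the leaf of each thread is the comment being judged is ours.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    comment_id: str  # hypothetical field name
    body: str


@dataclass
class Instance:
    """Illustrative container for one instance; all field names are assumptions."""
    violating_thread: List[Comment]  # (a) root-to-leaf, moderator cited a rule on the leaf
    compliant_thread: List[Comment]  # (b) sibling thread from the same submission
    submission_title: str            # (c) the submission itself
    submission_images: List[str] = field(default_factory=list)  # (c) image paths, if any
    rules: List[str] = field(default_factory=list)              # (d) the subreddit's full rule set
    violated_rule: Optional[int] = None  # index into `rules` for the cited rule


def targets(inst: Instance):
    """Return the leaf comment of the violating and of the compliant thread."""
    return inst.violating_thread[-1], inst.compliant_thread[-1]


example = Instance(
    violating_thread=[Comment("c1", "root"), Comment("c2", "leaf")],
    compliant_thread=[Comment("c3", "root"), Comment("c4", "sibling leaf")],
    submission_title="Example submission",
    rules=["Be civil", "No self-promotion"],
    violated_rule=0,
)
violating_leaf, compliant_leaf = targets(example)
print(violating_leaf.comment_id, compliant_leaf.comment_id)
```

Consult the dataset card and hydrate/README.md for the actual field names before relying on any of these.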

What it covers

Cluster landscape

Left: 2D UMAP projections of (a) the 1,989 subreddits and (b) the 2,885 rules, colored by HDBSCAN cluster; grey points are unclustered ("other"). Right: distribution of the 13,371 instances across (c) the 25 subreddit clusters and (d) the 27 rule clusters.

Main results

Accuracy (%) across models and context levels on the test set. Numbers in parentheses show the delta from the previous row. Bold is best per model. 95% CIs are within ±1.3% everywhere. The "No rules broken" baseline is 50%.

| Context | 4B Inst. | 4B Think. | 8B Inst. | 8B Think. | 30B Inst. | 30B Think. | GPT-5.2 Low | GPT-5.2 High |
|---|---|---|---|---|---|---|---|---|
| Comment only | 49.6 | 37.4 | 51.0 | 40.3 | 50.2 | 46.1 | 54.1 | 55.0 |
| + Discussion | 49.2 (−0.4) | 39.8 (+2.4) | 50.7 (−0.3) | 43.9 (+3.6) | 51.0 (+0.8) | 48.2 (+2.1) | 55.3 (+1.2) | 56.2 (+1.2) |
| + Submission | 48.3 (−0.9) | 44.9 (+5.1) | 49.2 (−1.5) | 47.2 (+3.3) | 51.1 (+0.1) | 49.1 (+0.9) | 56.8 (+1.5) | 57.3 (+1.1) |
| + User | 48.9 (+0.6) | 45.0 (+0.1) | 50.0 (+0.8) | 46.7 (−0.5) | 52.4 (+1.3) | 49.4 (+0.3) | 57.4 (+0.6) | 57.7 (+0.4) |
| + Images | 48.4 (−0.5) | 45.0 (+0.0) | 49.8 (−0.2) | 44.9 (−1.8) | 52.3 (−0.1) | 49.5 (+0.1) | 57.4 (+0.0) | 57.6 (−0.1) |

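Even the best configuration (GPT-5.2 with high reasoning and context up to user identifiers) reaches only 57.7%, less than 8 points above the trivial baseline. For GPT-5.2, adding context (discussion thread, submission, user identifiers, images) helps by about 3 points at most. The open-weight models (Qwen3-VL-Instruct / -Thinking) barely clear the baseline or miss it: the Thinking variants gain up to 7.6 points from context yet stay below 50% at every level, and no Instruct variant exceeds 52.4%.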
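To get a feel for the ±1.3% confidence bound quoted with the results, the sketch below computes the half-width of a normal-approximation (Wald) interval for an accuracy. Two things here are assumptions rather than facts from the paper: that each of the 2,834 test instances contributes two scored threads (the violating one and the compliant one), and that a Wald interval is a reasonable stand-in for whatever interval method the authors actually used.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% CI for a proportion p over n trials."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Assumption: two scored threads per test instance (not stated explicitly in the README).
n_test = 2 * 2834

for acc in (0.50, 0.577):
    hw = wald_half_width(acc, n_test)
    print(f"accuracy={acc:.3f}  half-width={100 * hw:.2f} points")
```

Under these assumptions the half-width comes out at roughly 1.3 points for both the 50% baseline and the 57.7% best score, which is consistent with the stated bound.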

Per-cluster breakdown (GPT-5.2 high reasoning, full context)

GPT-5.2 high forest plot

Accuracy by (a) subreddit cluster and (b) rule cluster with 95% CI. Dashed line is the 50% baseline. Universal violations (civility, self-promotion) are solved well; context-dependent rules (low-effort, evidence-based, relevance) fall below baseline.
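A per-cluster breakdown of this kind boils down to grouping predictions by cluster label and comparing each group's accuracy with the 50% baseline. The sketch below shows that grouping step only; the record format and the toy data are invented for illustration and are not PluRule results or the harness's actual output format.

```python
from collections import defaultdict


def accuracy_by_cluster(records):
    """Map cluster label -> (accuracy, n) from (cluster, is_correct) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cluster, is_correct in records:
        totals[cluster] += 1
        hits[cluster] += int(is_correct)
    return {c: (hits[c] / totals[c], totals[c]) for c in totals}


# Toy records, made up for this example only.
records = [
    ("civility", True), ("civility", True), ("civility", False),
    ("relevance", False), ("relevance", True), ("relevance", False),
]

for cluster, (acc, n) in sorted(accuracy_by_cluster(records).items()):
    side = "above" if acc > 0.5 else "at or below"
    print(f"{cluster:10s} n={n}  acc={acc:.2f}  ({side} the 50% baseline)")
```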

What do you want to do?

▶︎ Run the benchmark on the released dataset

Start here if you want to evaluate a model on PluRule.

  1. Grab the three dehydrated split files from huggingface.co/datasets/osome-iu/PluRule and place them under ./data/.
  2. Follow hydrate/README.md to fill in comments, submissions, and media from the Pushshift archives (a few hours, no GPU required).
  3. Evaluate your model by following eval/README.md. The harness supports the Qwen3-VL models configured in eval/config.py via vLLM, and OpenAI API models via the two-stage evaluator.

▶︎ Rebuild PluRule from scratch

Start here if you want to reproduce the dataset end to end, tweak thresholds, or extend the pipeline.

Follow pipeline/README.md. Budget 1–2 days and multiple GPUs: the embedding matcher (Qwen3-Embedding-8B), the LLM judge (Qwen3-30B-A3B-Instruct), and the cluster labeler (Qwen3-30B-A3B-Thinking) all run locally via vLLM.

▶︎ Reproduce the human evaluation

See eval/human_eval/ for the Google Forms annotation protocol used in Section 5.4 of the paper (96% overall agreement with the pipeline's labels on a 100-instance audit).

Install

git clone https://github.com/osome-iu/PluRule.git
cd PluRule

# Pick the env that matches your goal:
conda env create -f environment-hydrate.yml   # minimal, hydration only (no GPU)
conda env create -f environment-pipeline.yml  # end-to-end reconstruction (GPUs)
conda env create -f environment-eval.yml      # benchmark evaluation (GPU or API keys)

For API-model evaluation, make sure the relevant API key is available in the process environment. credentials/.env.template is provided as a template, but the current evaluator reads keys from the environment used to launch Python.
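Because the evaluator reads keys from the launching environment, a quick pre-flight check can save a failed run. The sketch below is one way to do that. The variable name OPENAI_API_KEY is an assumption (it is the name the OpenAI SDK reads by default), and missing_keys is a hypothetical helper, not part of this repository.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumption: the evaluator expects OPENAI_API_KEY; check eval/README.md for the actual name.
REQUIRED = ["OPENAI_API_KEY"]

missing = missing_keys(REQUIRED)
print("missing keys:", ", ".join(missing) if missing else "none")
```

Run it from the same shell you will use to launch the evaluator, so that it sees the same environment.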

Repo layout

PluRule/
├── hydrate/          # 3 scripts to reconstitute the released dataset
├── pipeline/         # end-to-end reconstruction from Pushshift (paper §5)
├── eval/             # benchmark evaluation harness
│   └── human_eval/   # human annotation reproduction
├── utils/            # shared helpers (zst I/O, Pushshift torrent, media, …)
├── config.py         # base paths + thresholds (edit before running)
├── credentials/      # API key templates (.env, Reddit, Google)
├── environment-hydrate.yml    # hydration-only conda env
├── environment-pipeline.yml   # reconstruction conda env
└── environment-eval.yml       # evaluation conda env

License

Code in this repository is released under the MIT License — see LICENSE. The released PluRule dataset files on HuggingFace are also MIT-licensed in their dehydrated form: IDs, metadata, labels, rules, and placeholders. Hydrated moderator comments, submissions, and media are not redistributed by PluRule; they are reconstructed from the publicly archived Pushshift Reddit corpus and remain bound by Reddit's terms of service.
