ARC - Automated Review Checking with Machine Learning

by Rudy's Rangers, for TikTok TechJam 2025 *FINALS*

Chosen Problem Statement: Filtering the Noise: ML for Trustworthy Location Reviews

Authors

Soo Weng Kit
Tian Fengyao (Kyrie)
Lee Chun Wayne
Shane Vivek Bharathan
Chen Runjia (Rudy)

Project Overview

This project tackles the challenge of distinguishing between trustworthy and untrustworthy reviews. Our approach combines machine learning and deep learning models in an ensemble framework. By stacking these models in ascending order of computational cost, the system can quickly filter obvious spam with lightweight models, while reserving more expensive deep learning methods for the harder cases. This layered strategy amortizes the overall cost of prediction, ensuring both efficiency and accuracy in filtering reviews.
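
The layered strategy described above can be sketched as a simple cascade. This is a minimal illustration, not the project's actual pipeline: the stage functions, their rules, and the verdict labels are hypothetical stand-ins for the real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A stage returns a verdict when it is confident, or None to defer
# to the next, more expensive stage.
Stage = Callable[[str], Optional[str]]


@dataclass
class CascadeResult:
    verdict: str
    decided_by: str  # name of the stage that made the decision


def run_cascade(review: str, stages: List[Tuple[str, Stage]],
                default: str = "trustworthy") -> CascadeResult:
    """Try stages from cheapest to most expensive; stop at the first verdict."""
    for name, stage in stages:
        verdict = stage(review)
        if verdict is not None:
            return CascadeResult(verdict, name)
    return CascadeResult(default, "default")


# Hypothetical stand-ins for the real models, ordered by cost.
def cheap_keyword_filter(review: str) -> Optional[str]:
    # Catches only obvious spam; defers on everything else.
    return "untrustworthy" if "buy now" in review.lower() else None


def expensive_model(review: str) -> Optional[str]:
    # Placeholder for a deep model; here, very short reviews are flagged.
    return "untrustworthy" if len(review.split()) < 3 else "trustworthy"


STAGES = [("keyword_filter", cheap_keyword_filter), ("deep_model", expensive_model)]
```

Because most obvious spam exits at the first stage, the expensive stage only runs on the reviews that the cheap stage could not decide.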

Setup Instructions

For this project we used the uv package manager, which is efficient and easy to use. Please install uv to get started. For more instructions on installing uv for your system, please refer to this link: https://docs.astral.sh/uv/getting-started/installation/

To support large file storage, please also install Git Large File Storage (LFS). Once it is installed, enable it for your user account with the following command:

git lfs install

To initialise the virtual environment, follow these steps:

uv venv
source .venv/bin/activate
uv sync

For those who prefer to use a requirements.txt, we have you covered as well. Simply run these commands instead:

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Please also create a Hugging Face access token and place it in your .env file.
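
As a rough sketch of how a token in a .env file could be read, the snippet below parses simple KEY=VALUE lines using only the standard library. The variable name HF_TOKEN is an assumption (the project does not state which name it expects), and the helper names are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path: str = ".env", var_name: str = "HF_TOKEN"):
    """Prefer the process environment, then fall back to the .env file.

    HF_TOKEN is an assumed variable name, not confirmed by the project.
    """
    return os.environ.get(var_name) or load_env_file(path).get(var_name)
```

Under this assumption, the .env file would contain a single line such as `HF_TOKEN=<your token>`.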

How to Reproduce Results

We have created several scripts (both .py and .sh) to allow users to run inference with our pipeline. In this README, we will cover the usage of the overarching inference pipeline, which is the culmination of all our work. For more details on how to run each of the components, please refer to the "sub-READMEs" in src/encoder, src/fasttext and src/safety. For now, here are instructions on how to run inference, for which the Python script is located in src/pipelines/inference_pipeline.py.

Please make sure you run this from the repository root. Unless otherwise stated, all our scripts are meant to be run from the root so that users don't have to keep navigating between folders.

uv run -m src.pipelines.inference_pipeline

Alternatively, for those who prefer to configure everything on the command line, you can run:

./run.sh

from the repository root. Below is a description of the arguments you can configure:

| Argument | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `--safety-model` | str | No | `models/safety-model-test.pkl` | Path to the safety model `.pkl` file |
| `--encoder-model` | str | No | `lora_sft_encoder.pth` | Path to encoder model weights (`.pth`) |
| `--review-file` | str | No | `data/for_model/review_1.json` | Path to JSON review file (with name, category, description, review, rating) |
| `--threshold` | float | No | `0.7` | Threshold for fasttext heads |
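
For illustration, the arguments above map naturally onto an `argparse` parser. The sketch below mirrors the documented flags, types, and defaults; it is an assumption about how such a parser might look, not a copy of the project's code, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Run ARC inference on a single review")
    parser.add_argument("--safety-model", type=str,
                        default="models/safety-model-test.pkl",
                        help="Path to the safety model .pkl file")
    parser.add_argument("--encoder-model", type=str,
                        default="lora_sft_encoder.pth",
                        help="Path to encoder model weights (.pth)")
    parser.add_argument("--review-file", type=str,
                        default="data/for_model/review_1.json",
                        help="Path to JSON review file")
    parser.add_argument("--threshold", type=float, default=0.7,
                        help="Threshold for fasttext heads")
    return parser


# Override one value and keep the remaining defaults.
args = build_parser().parse_args(["--threshold", "0.8"])
```

Note that `argparse` converts dashes to underscores, so `--safety-model` is read back as `args.safety_model`.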

Note: the pipeline was designed to take only one review (i.e. one dictionary) at a time. This was a specific design choice, as logically reviews should be evaluated the moment they are posted, not after some time until enough samples are curated for batched inference. We have taken care to ensure our pipeline is efficient at inference for each sample.
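
To make the one-review-per-file format concrete, here is a small sketch that loads a review file and checks for the five fields listed in the table. The exact key names are assumed to match that list, the example content is made up, and `load_review` is a hypothetical helper rather than part of the pipeline.

```python
import json
from pathlib import Path

# Field names taken from the --review-file description; exact keys are assumed.
REQUIRED_FIELDS = ("name", "category", "description", "review", "rating")


def load_review(path: str) -> dict:
    """Load a single review dictionary and check the expected fields exist."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected one review (a JSON object), not a list or scalar")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"review is missing fields: {missing}")
    return data


# Made-up example content; the real samples live under data/for_model.
EXAMPLE_REVIEW = {
    "name": "Example Cafe",
    "category": "Cafe",
    "description": "A small neighbourhood coffee shop.",
    "review": "Great flat white and friendly staff.",
    "rating": 5,
}
```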

Web Applications

We have included two web applications for testing the ARC review analysis system:

1. Streamlit App (Basic)

Simple web interface for basic testing:

# Launch Streamlit app
./launch_app.sh

2. React Frontend (Advanced)

Modern React/Next.js application with Google Maps integration:

# Launch React app with backend
./launch_react_app.sh

React Frontend Features:

🗺️ Interactive Google Maps with location picker
🔍 Smart business search and auto-complete
📝 Auto-fill category and description for recognised businesses
🎨 Professional dark theme UI
📱 Responsive design
⚡ Real-time form validation

Requirements for React Frontend:

Node.js >=18.18.0
Google Maps API key (Maps JavaScript API, Places API, Geocoding API)
See frontend/README.md for detailed setup instructions

Both applications connect to the same FastAPI backend for ML analysis. The React frontend provides a more polished user experience with smart location detection.

Performance

Our pipeline components performed quite well as isolated components. You may refer to the model performance metrics included in the "sub-READMEs" in src/encoder, src/fasttext and src/safety for more details.

For your own tests, we have included three sample reviews in JSON format, which you can find under data/for_model. Feel free to use them in our web app, and to experiment with small misspellings or censored words.

Trigger Warning

To aid our model implementation, which utilises a lexicon-based component, we have included a set of toxic words under src/utils/toxic_lexicon.py, which can be offensive to some.
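
As a rough idea of how a lexicon-based check can tolerate misspellings and censored characters, the sketch below normalises each token before looking it up. It uses harmless placeholder words instead of the real lexicon, and the substitution map and matching rules are illustrative assumptions, not the project's implementation.

```python
import re

# Harmless placeholder terms; the project's real list lives in src/utils/toxic_lexicon.py.
PLACEHOLDER_LEXICON = {"badword", "nasty"}

# Common character swaps used to dodge filters (an illustrative, incomplete map).
SUBSTITUTIONS = str.maketrans({"@": "a", "4": "a", "3": "e", "1": "i",
                               "0": "o", "$": "s", "5": "s"})


def normalise(token: str) -> str:
    """Lowercase, undo simple swaps, keep letters and '*', collapse long letter runs."""
    token = token.lower().translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z*]", "", token)
    return re.sub(r"([a-z])\1{2,}", r"\1", token)  # "naaasty" -> "nasty"


def matches(token: str, lexicon) -> bool:
    """Exact lookup, treating each '*' as one censored letter."""
    if "*" in token and any(c.isalpha() for c in token):
        pattern = re.compile(re.escape(token).replace(r"\*", "[a-z]"))
        return any(pattern.fullmatch(word) for word in lexicon)
    return token in lexicon


def find_lexicon_hits(text: str, lexicon=PLACEHOLDER_LEXICON) -> list:
    """Return the raw tokens whose normalised form matches a lexicon entry."""
    return [raw for raw in text.split() if matches(normalise(raw), lexicon)]
```

For example, `find_lexicon_hits("The staff were n@sty, total b*dw0rd")` flags both `n@sty,` and `b*dw0rd`, while ordinary text produces no hits.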

Citations

FastText

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

Google Maps Datasets

https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/
https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews

Internet Violence Study (InViS) Dataset

@data{DVN/ANGOX0_2025,
author = {Golbeck, Jen},
publisher = {Harvard Dataverse},
title = {{Internet Violence Study (InViS) Dataset}},
year = {2025},
version = {V1},
doi = {10.7910/DVN/ANGOX0},
url = {https://doi.org/10.7910/DVN/ANGOX0}
}

ToxiGen Dataset

@inproceedings{hartvigsen2022toxigen,
  title={ToxiGen: A Large-Scale Machine-Generated Dataset for Implicit and Adversarial Hate Speech Detection},
  author={Hartvigsen, Thomas and Gabriel, Saadia and Palangi, Hamid and Sap, Maarten and Ray, Dipankar and Kamar, Ece},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  year={2022}
}

Twitter Toxic Comments Dataset

https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset?utm_source=chatgpt.com
