Collect Twi text generated via Gemini from English news paragraphs.
- Python 3.8+
- Internet connection
Install dependencies:
pip install huggingface_hub pandas
Linux only:
sudo apt install python3-tk xclip wl-clipboard
git clone https://github.com/GhanaNLP/twi-text-collector.git
cd twi-text-collector
pip install huggingface_hub pandas
python collector.py
Paste your volunteer code when prompted and click "Download & Start →".
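Before launching, it can help to confirm that the requirements above are in place. The following is a minimal sketch, not part of collector.py: the function name `preflight` is hypothetical, and it assumes that on Linux a single clipboard tool (xclip for X11, or wl-copy from wl-clipboard for Wayland) is enough for a given session.

```python
import importlib.util
import shutil
import sys


def preflight(min_version=(3, 8)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d+ is required" % min_version)
    # find_spec only locates the module; it does not import it.
    for module in ("tkinter", "huggingface_hub", "pandas"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing Python module: {module}")
    if sys.platform.startswith("linux"):
        # xclip serves X11 sessions; wl-copy ships with wl-clipboard (Wayland).
        if not (shutil.which("xclip") or shutil.which("wl-copy")):
            problems.append("no clipboard tool found (xclip or wl-clipboard)")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["All checks passed."]:
        print(line)
```

Each reported problem maps to one of the install commands listed above.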
- App shows one English news paragraph at a time
- Click "✦ Gemini Prompt" → paste into gemini.google.com → send
- Copy Gemini's Twi response → paste into the app textarea
- App automatically:
  - Removes consecutive repeated sentences
  - Validates length (8,000-18,000 chars expected)
  - Checks for duplicate submissions
- Click "Save & Next →" — auto-saves and moves to next paragraph
- Every 10 texts are automatically uploaded to HuggingFace as a structured dataset
- If Gemini's output is too short/long, try regenerating or enable Thinking Mode
- Click "Skip ⇥" if you can't get valid output after several tries
- Click "⬆ Push Now" to manually upload before reaching 10 texts
- Your progress is saved — restart anytime and it will resume where you left off
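The automatic checks described above (removing consecutive repeated sentences, validating length, and detecting duplicate submissions) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in collector.py: all function names are hypothetical, the sentence splitter is deliberately naive, and duplicates are assumed to be detected by hashing normalised text.

```python
import hashlib
import re

MIN_CHARS, MAX_CHARS = 8_000, 18_000  # expected range stated in this README


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def drop_consecutive_repeats(text):
    """Remove a sentence when it is identical to the one just before it."""
    kept = []
    for sentence in split_sentences(text):
        if not kept or sentence != kept[-1]:
            kept.append(sentence)
    return " ".join(kept)


def fingerprint(text):
    # Hash of whitespace-normalised, case-folded text for duplicate detection.
    normalised = " ".join(text.split()).casefold()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def check_submission(text, seen):
    """Return (cleaned_text, error); error is None when the text is accepted."""
    cleaned = drop_consecutive_repeats(text)
    if not MIN_CHARS <= len(cleaned) <= MAX_CHARS:
        return cleaned, f"length {len(cleaned)} outside {MIN_CHARS}-{MAX_CHARS}"
    digest = fingerprint(cleaned)
    if digest in seen:
        return cleaned, "duplicate submission"
    seen.add(digest)
    return cleaned, None
```

Here `seen` is a plain `set` of fingerprints that the caller keeps across submissions. Cleaning runs before the length check, so a response padded with repeated sentences is measured at its de-duplicated length.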
- Download fails: Re-run python collector.py with your code. Already downloaded data is cached.
- Push fails: The app will retry automatically on the next save. Click "⬆ Push Now" once the connection is restored.
- App won't open (Linux): Install Tk with sudo apt install python3-tk.
- Lost your code: Contact the project coordinator.
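The upload behaviour described in this README (a push every 10 texts, a manual "Push Now", and an automatic retry on the next save after a failed push) can be modelled with a small queue. The sketch below is an assumption about how such logic might look, not the project's implementation: the class `PendingBatch` and its `push` callable are hypothetical, and catching `OSError` for network failures is a guess that should be adjusted to whatever the real uploader raises.

```python
class PendingBatch:
    """Collect saved texts and hand them to `push` every `batch_size` saves."""

    def __init__(self, push, batch_size=10):
        self.push = push            # hypothetical callable taking a list of texts
        self.batch_size = batch_size
        self.pending = []

    def save(self, text):
        """Queue one text; push automatically once the batch is full."""
        self.pending.append(text)
        if len(self.pending) >= self.batch_size:
            return self.push_now()
        return False

    def push_now(self):
        """Try to upload everything pending; keep it queued on failure."""
        if not self.pending:
            return False
        try:
            self.push(list(self.pending))
        except OSError:
            # Assumed network error: leave texts queued so the next save retries.
            return False
        self.pending.clear()
        return True
```

Because a failed push leaves the queue untouched, the next call to `save` finds the batch still full and tries again, which mirrors the "retry on next save" behaviour. Nothing is dropped while the connection is down.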