GhanaNLP/text-collector

Twi Text Collector

Collect Twi text generated via Gemini from English news paragraphs.


Requirements

  • Python 3.8+
  • Internet connection

Install dependencies:

pip install huggingface_hub pandas

Linux only:

sudo apt install python3-tk xclip wl-clipboard

Setup

git clone https://github.com/GhanaNLP/twi-text-collector.git
cd twi-text-collector
pip install huggingface_hub pandas
python collector.py

Paste your volunteer code when prompted and click "Download & Start →".


How it works

  1. App shows one English news paragraph at a time
  2. Click "✦ Gemini Prompt" → paste into gemini.google.com → send
  3. Copy Gemini's Twi response → paste into the app textarea
  4. App automatically:
    • Removes consecutive repeated sentences
    • Validates length (8,000-18,000 chars expected)
    • Checks for duplicate submissions
  5. Click "Save & Next →" — auto-saves and moves to next paragraph
  6. Every 10 texts are automatically uploaded to HuggingFace as a structured dataset
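The automatic checks in step 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in collector.py: the function names, the sentence-splitting rule, and the use of a SHA-256 fingerprint for duplicate detection are all hypothetical. Only the 8,000-18,000 character range comes from the list above.

```python
import hashlib
import re

MIN_CHARS = 8_000   # lower bound quoted in "How it works"
MAX_CHARS = 18_000  # upper bound quoted in "How it works"


def drop_consecutive_repeats(text: str) -> str:
    """Remove a sentence when it is identical to the one right before it."""
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    kept = []
    for sentence in sentences:
        if not kept or sentence != kept[-1]:
            kept.append(sentence)
    return " ".join(kept)


def length_ok(text: str) -> bool:
    """True when the text length falls inside the expected range."""
    return MIN_CHARS <= len(text) <= MAX_CHARS


def fingerprint(text: str) -> str:
    """Hash of lower-cased, whitespace-normalised text (hypothetical duplicate key)."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def check_submission(text: str, seen: set) -> tuple:
    """Return (cleaned_text, problems); remember the text only if it passes."""
    cleaned = drop_consecutive_repeats(text)
    problems = []
    if not length_ok(cleaned):
        problems.append(f"length {len(cleaned)} outside {MIN_CHARS}-{MAX_CHARS}")
    digest = fingerprint(cleaned)
    if digest in seen:
        problems.append("duplicate submission")
    if not problems:
        seen.add(digest)
    return cleaned, problems
```

An empty problems list would correspond to a text that "Save & Next →" accepts. A non-empty list is the kind of case where regenerating in Gemini is the likely next step.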

Tips

  • If Gemini's output is too short/long, try regenerating or enable Thinking Mode
  • Click "Skip ⇥" if you can't get valid output after several tries
  • Click "⬆ Push Now" to manually upload before reaching 10 texts
  • Your progress is saved — restart anytime and it will resume where you left off
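One plausible way to get the resume behaviour described in the last tip is a small progress file that is rewritten after every save. The sketch below is an assumption about how this could work, not a description of collector.py: the file layout, the keys next_index and skipped, and both function names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_progress(path: Path, state: dict) -> None:
    """Write state to a temp file, then swap it in, so a crash cannot leave a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, ensure_ascii=False)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_progress(path: Path) -> dict:
    """Return the saved state, or a fresh one when no progress file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"next_index": 0, "skipped": []}
```

On startup the app would call load_progress and show the paragraph at next_index. After each "Save & Next →" or "Skip ⇥" it would call save_progress with the updated state.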

Troubleshooting

Download fails: Re-run python collector.py with your code. Already downloaded data is cached.

Push fails: The app will retry automatically on the next save. Click "⬆ Push Now" once the connection is restored.

App won't open (Linux): Install Tk with sudo apt install python3-tk

Lost your code: Contact the project coordinator.
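The upload behaviour (a push every 10 texts, a manual "⬆ Push Now", and an automatic retry on the next save after a failure) can be modelled with a small pending queue. This is a minimal sketch, assuming the upload step is a callable passed in by the caller. PushQueue, add, and push_now are hypothetical names, and the real app may be organised differently.

```python
from typing import Callable, List

BATCH_SIZE = 10  # "every 10 texts" in "How it works"


class PushQueue:
    """Hold saved texts until they have been uploaded successfully."""

    def __init__(self, upload: Callable[[List[str]], None]):
        self.upload = upload          # hypothetical upload function; raises on failure
        self.pending: List[str] = []  # texts saved locally but not yet pushed

    def add(self, text: str) -> bool:
        """Queue a saved text; return True only if a push ran and succeeded."""
        self.pending.append(text)
        if len(self.pending) >= BATCH_SIZE:
            return self.push_now()
        return False

    def push_now(self) -> bool:
        """Upload everything pending; on failure keep the texts for a later retry."""
        if not self.pending:
            return True
        try:
            self.upload(list(self.pending))
        except Exception:
            return False  # nothing is dropped; the next save tries again
        self.pending.clear()
        return True
```

Because a failed push leaves the queue at or above the batch size, the next call to add triggers another attempt, which matches the retry described under "Push fails". In a real implementation the upload callable would presumably wrap a huggingface_hub call, but that detail is not stated here and is left open.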

About

A lightweight app for helping humans collect local-language text from Gemini.
