This scraper enriches UPSC topper profiles with detailed bios, strategies, and insights, using an LLM and Google search data.
- Fetches research snippets from Google using the Serper API
- Generates a rich, story-format strategy and bio using an LLM (Groq API)
- Stores results in MongoDB with proper schema mapping
- Handles markdown formatting for frontend rendering
- Validates and logs errors for debugging
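
The snippet-gathering step listed above could look roughly like the sketch below. It is not taken from `scraper.js`: the helper names `buildSerperRequest` and `extractSnippets` are hypothetical, and the endpoint, the `X-API-KEY` header, and the `organic` response array are assumptions about the Serper API that should be checked against its documentation.

```javascript
// Hypothetical helper: build the URL and fetch options for a Serper search.
// The endpoint and header name are assumptions, not confirmed by this README.
function buildSerperRequest(query, apiKey) {
  return {
    url: "https://google.serper.dev/search",
    options: {
      method: "POST",
      headers: { "X-API-KEY": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ q: query }),
    },
  };
}

// Hypothetical helper: turn a search response into short text snippets.
// Assumes results arrive in an `organic` array with `title` and `snippet`.
function extractSnippets(response, limit = 5) {
  const results = Array.isArray(response?.organic) ? response.organic : [];
  return results
    .filter((r) => r && typeof r.snippet === "string" && r.snippet.trim() !== "")
    .slice(0, limit)
    .map((r) => `${r.title ?? "Untitled"}: ${r.snippet.trim()}`);
}
```

The two pieces would be combined with `fetch(url, options)` and the parsed JSON body passed to `extractSnippets`; the resulting strings are what would be handed to the LLM prompt.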
- Install dependencies:
pnpm install
- Configure environment:
Create a `.env` file in the `scraper/` directory with:
    MONGO_URL=your_mongo_url
    GROQ_API_KEY=your_groq_api_key
    GROQ_API_URL=your_groq_api_url
    SERPER_API_KEY=your_serper_api_key
    DB_NAME=toppersjournal
    COLLECTION=toppers
    CONCURRENCY=2
    TEST_LIMIT=10
    MODEL_NAME=llama3-8b
- Run the scraper:
node scraper.js
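
A minimal sketch of how the environment variables above might be read and checked at startup. The function name `loadConfig` is hypothetical, the split into required and optional variables is an assumption, and the fallback values simply reuse the example values from the configuration step. It assumes the variables are already present in `process.env` (for example, loaded from `.env` by a package such as dotenv).

```javascript
// Hypothetical config loader; scraper.js may handle this differently.
function loadConfig(env = process.env) {
  // Assumption: these four have no sensible default and must be set.
  const required = ["MONGO_URL", "GROQ_API_KEY", "GROQ_API_URL", "SERPER_API_KEY"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }

  // Parse a positive integer, falling back when the value is absent or invalid.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
  };

  return {
    mongoUrl: env.MONGO_URL,
    groqApiKey: env.GROQ_API_KEY,
    groqApiUrl: env.GROQ_API_URL,
    serperApiKey: env.SERPER_API_KEY,
    dbName: env.DB_NAME || "toppersjournal",
    collection: env.COLLECTION || "toppers",
    concurrency: toInt(env.CONCURRENCY, 2),
    testLimit: toInt(env.TEST_LIMIT, 10),
    modelName: env.MODEL_NAME || "llama3-8b",
  };
}
```

Failing fast on a missing key gives a clearer message than an authentication error surfacing later from one of the APIs.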
- Updates MongoDB documents with `bio`, `strategy` (markdown), and `insights` fields
- Logs errors and raw LLM output for debugging
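
As an illustration of the document update described above, the sketch below builds a MongoDB `$set` update from an enrichment result. The function name `buildUpdate` and the shape of `result` are assumptions; only the three field names come from this README.

```javascript
// Hypothetical helper: map an enrichment result onto the stored fields.
// Only bio, strategy, and insights are written; other fields are left alone.
function buildUpdate(result) {
  return {
    $set: {
      bio: result.bio,
      strategy: result.strategy, // markdown string rendered by the frontend
      insights: result.insights,
    },
  };
}
```

With the official MongoDB Node.js driver, this object would be passed as the second argument to `collection.updateOne(filter, update)`, for example with a filter such as `{ _id: topper._id }`.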
- Adjust prompt and validation logic in `scraper.js` for different output styles
- Change concurrency and limits in `.env` for performance tuning
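
To show what a concurrency setting like `CONCURRENCY` typically controls, here is a small promise-pool sketch. It is an illustration only: `runWithConcurrency` is a hypothetical name, and the actual scraper may use a library or a different scheme.

```javascript
// Hypothetical helper: process items with at most `limit` workers in flight.
// Failures are captured per item so one bad profile does not stop the run.
async function runWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  // Start up to `limit` runners (at least one); each pulls items until none remain.
  const count = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: count }, () => runner()));
  return results;
}
```

A higher limit finishes sooner but makes API rate limits more likely to trigger, so a low value such as 2 is a cautious starting point.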
- Check logs for invalid JSON or API errors
- Ensure all API keys and URLs are correct
- Validate MongoDB connection and schema
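
When tracking down invalid JSON from the LLM, a defensive parser along the following lines can help. This is a sketch under assumptions: `parseLlmOutput` is a hypothetical name, the model is assumed to return a JSON object that may be wrapped in a markdown code fence, and the checks on `bio`, `strategy`, and `insights` are a guess at what the validation step requires.

```javascript
// Assumption: bio and strategy must be non-empty strings.
const REQUIRED_TEXT_FIELDS = ["bio", "strategy"];

// Hypothetical helper: parse and validate raw LLM output.
// Returns { ok, data, errors } on success or { ok, errors, raw } on failure,
// so the raw text is available for logging.
function parseLlmOutput(raw) {
  const text = String(raw).trim();
  // Unwrap a surrounding markdown code fence (three backticks) if present.
  const fenced = text.match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/i);
  const candidate = fenced ? fenced[1] : text;

  let data;
  try {
    data = JSON.parse(candidate);
  } catch (error) {
    return { ok: false, errors: [`Invalid JSON: ${error.message}`], raw };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, errors: ["Expected a JSON object"], raw };
  }

  const errors = [];
  for (const field of REQUIRED_TEXT_FIELDS) {
    if (typeof data[field] !== "string" || data[field].trim() === "") {
      errors.push(`Missing or empty "${field}"`);
    }
  }
  if (data.insights === undefined || data.insights === null) {
    errors.push('Missing "insights"');
  }
  return errors.length === 0 ? { ok: true, data, errors } : { ok: false, errors, raw };
}
```

Logging `errors` together with `raw` for each failed profile makes it easier to tell whether the prompt or the model output needs adjusting.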
MIT