The TenderMatch system is a tender matching pipeline that scrapes government procurement data, filters relevant tenders, and matches them with manufacturers who can supply the needed equipment.
The architecture follows a clear pipeline:
run.py (orchestrator)
├── scrapers/cppp.py (data collection)
├── data/db.py (storage)
├── matching/filter.py (classification)
└── matching/matcher.py (semantic matching)
The main orchestrator script coordinates the entire process:
- Data Collection: Scrapes tenders from government portals (central, state, GEM)
- Deduplication: Prevents processing the same tender multiple times using content hashing
- Classification: Filters tenders into blocked/low_signal/high_signal categories
- Matching: For high-signal tenders, finds relevant manufacturers using semantic matching
- Storage: Updates database with processing flags
This module scrapes government procurement portals:
Data Sources:
- Central portal: https://eprocure.gov.in/cppp/latestactivetendersnew/cpppdata
- State portal: https://eprocure.gov.in/cppp/latestactivetendersnew/mmpdata
- GEM portal: https://eprocure.gov.in/cppp/latestactivetendersnew/gemdata
Scraping Logic:
- Fetches pages with retry logic and rate limiting
- Parses tender rows with source-specific parsers
- Handles both GEM-specific and general tender formats
- Stops scraping when reaching old tenders (24-hour cutoff)
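The retry and cutoff behaviour listed above might look roughly like the following. This is a minimal sketch, not the code in scrapers/cppp.py: the names fetch_with_retry and take_recent are hypothetical, the fetch function is injected so the example runs without network access, and it assumes rows arrive newest-first with published_date already parsed into a datetime.

```python
import time
from datetime import datetime, timedelta
from typing import Callable, Iterable, Iterator


def fetch_with_retry(fetch: Callable[[str], str], url: str,
                     retries: int = 3, delay: float = 1.0) -> str:
    """Call fetch(url), retrying on any exception with a linearly growing pause."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < retries:
                time.sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


def take_recent(rows: Iterable[dict], now: datetime,
                max_age: timedelta = timedelta(hours=24)) -> Iterator[dict]:
    """Yield rows until the first one older than the cutoff, then stop.

    Assumes rows are ordered newest-first, so everything after the first
    old row is also old and need not be inspected.
    """
    cutoff = now - max_age
    for row in rows:
        if row["published_date"] < cutoff:
            break
        yield row
```

Stopping at the first old row, rather than filtering every row, is what lets a scraper skip the remaining pages entirely. It is only safe if the portal really lists tenders in reverse chronological order.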
Data Model:
{
"tender_id": "Unique identifier",
"title": "Tender title",
"organization": "Issuing organization",
"published_date": "Publication date",
"closing_date": "Closing date",
"source_url": "Direct link to tender",
"raw_text": "Combined searchable text",
"source_portal": "central/state/gem",
"scraped_at": "Timestamp"
}
SQLite-based storage with automatic deduplication:
Schema:
- tenders table with fields for all tender data
- content_hash for deduplication (title+organization+date)
- Unique constraint on tender_id + content_hash
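One way the schema and content-hash deduplication described above could be written with Python's built-in sqlite3 module is sketched below. The column list is abbreviated, and the processed flag, the SHA-256 hash, the "|" separator, and the insert_tender helper are all assumptions for illustration; the source only states that the hash covers title, organization, and date.

```python
import hashlib
import sqlite3

# Abbreviated, assumed schema: the real table holds all tender fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tenders (
    tender_id      TEXT NOT NULL,
    title          TEXT,
    organization   TEXT,
    published_date TEXT,
    content_hash   TEXT NOT NULL,
    processed      INTEGER DEFAULT 0,  -- hypothetical processing flag
    UNIQUE (tender_id, content_hash)
)
"""


def content_hash(title: str, organization: str, published_date: str) -> str:
    """Hash the three fields the text names as the deduplication key."""
    key = "|".join([title, organization, published_date])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def insert_tender(conn: sqlite3.Connection, tender: dict) -> bool:
    """Insert a tender; return False if the unique constraint rejected it."""
    digest = content_hash(
        tender["title"], tender["organization"], tender["published_date"]
    )
    cur = conn.execute(
        "INSERT OR IGNORE INTO tenders "
        "(tender_id, title, organization, published_date, content_hash) "
        "VALUES (?, ?, ?, ?, ?)",
        (tender["tender_id"], tender["title"], tender["organization"],
         tender["published_date"], digest),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means a duplicate was ignored
```

With this design the database itself enforces deduplication, so the caller only has to check the boolean result to decide whether a tender still needs processing.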
Two-stage filtering system:
Blocklist Filtering (substring-based):
- Words like "road", "construction", "civil", "repair", etc.
- Immediately blocks irrelevant tenders
Positive Signal Detection (hybrid matching):
- Phrase matching for "testing equipment", "analytical instrument", etc.
- Categorizes as blocked/low_signal/high_signal
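The two-stage filter above can be illustrated with a small sketch. The blocklist terms and positive phrases are the ones quoted in the text; the POSITIVE_WORDS entries are hypothetical. The rule that any positive hit yields high_signal and everything else falls through to low_signal is also an assumption, since the source does not spell out the thresholds.

```python
import re

BLOCKLIST = ("road", "construction", "civil", "repair")
POSITIVE_PHRASES = ("testing equipment", "analytical instrument")
POSITIVE_WORDS = {"spectrometer", "chromatograph"}  # hypothetical examples


def classify(text: str) -> str:
    """Return 'blocked', 'high_signal', or 'low_signal' for a tender's text."""
    lowered = text.lower()

    # Stage 1: substring blocklist. Any hit rejects the tender outright.
    if any(term in lowered for term in BLOCKLIST):
        return "blocked"

    # Stage 2a: multi-word phrases are matched as substrings.
    if any(phrase in lowered for phrase in POSITIVE_PHRASES):
        return "high_signal"

    # Stage 2b: single keywords are matched against whole words only.
    words = set(re.findall(r"[a-z0-9]+", lowered))
    if words & POSITIVE_WORDS:
        return "high_signal"

    return "low_signal"
```

Because the blocklist is substring-based, a term such as "road" also matches inside "broadband", so a tender for broadband testing equipment would be blocked in this sketch. Whole-word matching on the blocklist would avoid that at the cost of missing inflected forms such as "roads".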
Advanced manufacturer matching using sentence transformers:
Embedder:
- Uses the BAAI/bge-small-en-v1.5 model for text embeddings
- Pre-processes manufacturer profiles into searchable vectors
- Handles manufacturer aliases and product categories
Matcher:
- Converts tender text to embeddings
- Computes cosine similarity with manufacturer profiles
- Returns top 3 most relevant manufacturers with confidence scores
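The ranking step above reduces to cosine similarity plus a sort. The sketch below shows only that part in plain Python, using toy two-dimensional vectors as placeholders; in the real pipeline the vectors would come from the BAAI/bge-small-en-v1.5 model, and the function names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(tender_vec: Sequence[float],
                profiles: Dict[str, Sequence[float]],
                k: int = 3) -> List[Tuple[str, float]]:
    """Return the k manufacturer ids most similar to the tender, best first."""
    scored = [
        (manufacturer_id, cosine_similarity(tender_vec, vec))
        for manufacturer_id, vec in profiles.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The similarity value doubles as the confidence score. Since manufacturer vectors do not change between runs, they can be computed once and reused, which is presumably why the profiles are pre-processed.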
Comprehensive database of 20+ specialized equipment manufacturers:
Profile Structure:
{
"id": "Unique identifier",
"name": "Company name",
"aliases": "Alternative names",
"country": "Location",
"product_categories": "Specialized products",
"embedding_text": "Detailed description for semantic matching",
"keywords": "Search terms",
"website": "Company website"
}
- Scraping: Collect latest tenders from government portals
- Deduplication: Skip already-processed tenders using content hashing
- Classification: Filter tenders using keyword-based rules
- Semantic Matching: For high-signal tenders, find relevant manufacturers
- Storage: Update database with results and flags
- Reporting: Output classification statistics
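Put together, the workflow above amounts to one loop over scraped tenders. The following is a minimal sketch rather than the actual run.py: run_pipeline and format_report are hypothetical names, the deduplication, classification, and matching steps are passed in as callables, and the returned results list stands in for the database update.

```python
from collections import Counter
from typing import Callable, Iterable


def run_pipeline(
    tenders: Iterable[dict],
    is_new: Callable[[dict], bool],
    classify: Callable[[str], str],
    match: Callable[[str], list],
):
    """Deduplicate, classify, and match tenders, tallying each outcome."""
    stats = Counter()
    results = []
    for tender in tenders:
        if not is_new(tender):
            stats["duplicate"] += 1
            continue
        label = classify(tender["raw_text"])
        stats[label] += 1
        # Only high-signal tenders pay the cost of semantic matching.
        matches = match(tender["raw_text"]) if label == "high_signal" else []
        results.append(
            {"tender_id": tender["tender_id"], "label": label, "matches": matches}
        )
    return results, stats


def format_report(stats: Counter) -> str:
    """Render the classification counts as a single summary line."""
    return ", ".join(f"{label}={count}" for label, count in sorted(stats.items()))
```

Passing the steps in as functions keeps the orchestrator independent of the scraper, filter, and matcher modules, which also makes it straightforward to test with stubs.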
Intelligent Filtering:
- Blocklist prevents processing irrelevant tenders
- Positive keyword detection identifies opportunities
- Hybrid matching for precision (phrases vs. words)
Semantic Intelligence:
- Embedding-based manufacturer matching
- Confidence scoring for match quality
- Specialized equipment domain knowledge
Efficiency:
- Automatic deduplication prevents redundant processing
- Date-based cutoff (24 hours) stops scraping at old tenders
- Database storage for historical tracking