A Python-based tool for scraping and analyzing pharmacy data from various sources, with a focus on identifying independent, non-hospital pharmacies.
Latest Update (v2.1.0): Production-ready pipeline with secure API key handling, automated testing, and Perplexity `sonar` model integration. Now includes robust budget enforcement and extensive test coverage for the orchestrator (88%).
The system processes data through the following phases in order:

- **Data Collection (Phase 1)**
  - Scrapes pharmacy data using Apify and the Google Places API
  - Handles rate limiting and API quotas
  - Implements caching to minimize redundant requests
- **Deduplication (Phase 1.5)**
  - Smart duplicate removal using fuzzy matching
  - Self-healing capabilities for data gaps
  - Maintains data integrity across sources
- **Classification (Phase 2a)**
  - AI-powered classification using Perplexity's `sonar` LLM model
  - Enhanced identification of independent vs. chain pharmacies with improved accuracy
  - Supports both rule-based and LLM-based classification with automatic fallback
  - Implements intelligent caching for cost efficiency
- **Verification (Phase 2b, optional)**
  - Address and business verification using Google Places
  - Ensures data accuracy and completeness
  - Validates pharmacy details against trusted sources

Key features:

- Budget Management: Tracks and manages API usage and costs
- Error Handling: Robust error recovery and retry mechanisms
- Performance: Optimized for large-scale data processing
- Extensibility: Modular design for easy integration of new data sources
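The fuzzy-matching deduplication in Phase 1.5 can be illustrated with a small standard-library sketch. This is a minimal illustration, not the pipeline's actual matcher: the similarity threshold, the compared fields, and the function names here are all assumptions.

```python
from difflib import SequenceMatcher

def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Treat two pharmacy records as duplicates if name+address are highly similar.

    Illustrative only; the real matcher may compare more fields or use
    a different similarity measure.
    """
    key_a = f"{a['name']} {a['address']}".lower()
    key_b = f"{b['name']} {b['address']}".lower()
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each fuzzy-duplicate group."""
    unique: list[dict] = []
    for rec in records:
        if not any(is_duplicate(rec, kept) for kept in unique):
            unique.append(rec)
    return unique

records = [
    {"name": "Downtown Pharmacy", "address": "123 Main St"},
    {"name": "Downtown Pharmacy Inc", "address": "123 Main St."},  # near-duplicate
    {"name": "CVS Pharmacy #1234", "address": "456 Market St"},
]
```

The quadratic pairwise scan is fine for illustration; at scale a real implementation would typically block candidates (e.g. by zip code) before fuzzy comparison.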
Detailed documentation is available for each module:
- Orchestrator Module - Pipeline coordination and cache management
- Classification Module - AI-based pharmacy classification
1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/pharmacy-scraper.git
   cd pharmacy-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -e .
   ```

4. Create a `.env` file in the project root:

   ```bash
   # Create .env file (this file should never be committed to version control)
   touch .env
   ```

5. Add your API keys to the `.env` file:

   ```bash
   # API Keys
   GOOGLE_MAPS_API_KEY=your_google_api_key_here
   APIFY_API_TOKEN=your_apify_token_here
   PERPLEXITY_API_KEY=your_perplexity_api_key_here
   ```

6. Use the secure production configuration:

   ```bash
   # Make the setup script executable
   chmod +x scripts/setup_env_and_run.sh

   # Run the pipeline securely
   ./scripts/setup_env_and_run.sh
   ```
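In code, these keys are typically read from the environment once the setup script has exported them. A minimal standard-library sketch of that idea follows; the variable names match the `.env` entries above, but `load_api_keys` is a hypothetical helper, not part of the project's API.

```python
import os

# The key names expected in the environment (matching the .env file above).
REQUIRED_KEYS = ["GOOGLE_MAPS_API_KEY", "APIFY_API_TOKEN", "PERPLEXITY_API_KEY"]

def load_api_keys() -> dict[str, str]:
    """Read all required API keys, failing fast if any is missing or empty.

    Failing at startup is preferable to a mid-pipeline crash after
    API credits have already been spent.
    """
    missing = [k for k in REQUIRED_KEYS if not os.getenv(k)]
    if missing:
        raise KeyError(f"Missing required API keys: {', '.join(missing)}")
    return {k: os.environ[k] for k in REQUIRED_KEYS}
```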
1. Copy the example config file and update it with your API keys:

   ```bash
   cp config/example_config.json config/config.json
   ```

2. Update the following in `config/config.json`:
   - Apify API token
   - Google Places API key
   - Perplexity API key
   - Other configuration parameters as needed
The pharmacy scraper supports the following execution modes:

- **Test Pipeline (no API calls):**

  ```bash
  python scripts/run_test_pipeline.py
  ```

  This runs the pipeline with mocked API services, which is ideal for testing changes without spending API credits.

- **Production Pipeline (real API calls):**

  ```bash
  # Run with default configuration
  python scripts/run_production_pipeline.py

  # Reset state and start fresh
  python scripts/run_production_pipeline.py --reset

  # Use a specific configuration
  python scripts/run_production_pipeline.py --config config/production/custom_config.json

  # Validate configuration without making API calls
  python scripts/run_production_pipeline.py --dry-run
  ```

- **Secure Environment Setup (recommended):**

  ```bash
  # This uses environment variables from the .env file
  ./scripts/setup_env_and_run.sh
  ```

The pipeline enforces the API budget limits specified in your configuration:
```json
"max_budget": 50.0,
"api_cost_limits": {
  "apify": 0.5,
  "google_places": 0.3,
  "perplexity": 0.2
}
```

This prevents unexpected API costs while ensuring data quality.
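The enforcement logic can be sketched as a small cost tracker. This is a minimal illustration, not the project's actual budget manager: the class and method names are hypothetical, and it assumes the per-API values are fractions of the total budget, which may not match the real configuration semantics.

```python
class BudgetExceededError(Exception):
    """Raised when a recorded cost would exceed a configured limit."""

class BudgetTracker:
    def __init__(self, max_budget: float, api_cost_limits: dict[str, float]):
        self.max_budget = max_budget
        # Assumption: each per-API value is a fraction of the total budget.
        self.api_limits = {api: frac * max_budget for api, frac in api_cost_limits.items()}
        self.spent: dict[str, float] = {api: 0.0 for api in api_cost_limits}

    def record(self, api: str, cost: float) -> None:
        """Record one call's cost, raising before any limit is exceeded."""
        if self.spent[api] + cost > self.api_limits[api]:
            raise BudgetExceededError(f"{api} budget exhausted")
        if sum(self.spent.values()) + cost > self.max_budget:
            raise BudgetExceededError("total budget exhausted")
        self.spent[api] += cost

tracker = BudgetTracker(50.0, {"apify": 0.5, "google_places": 0.3, "perplexity": 0.2})
tracker.record("apify", 10.0)  # within the apify limit of 0.5 * 50 = 25.0
```

Checking limits before mutating state means a rejected call leaves the tracker unchanged, so the pipeline can stop cleanly at the boundary.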
The primary entry point for classifying pharmacies is the `Classifier` class. It provides a simple interface that handles rule-based classification, LLM fallback, and caching automatically.
Here's how to classify a single pharmacy:
```python
from pharmacy_scraper.classification import Classifier
from pharmacy_scraper.classification.data_models import PharmacyData

# 1. Instantiate the classifier
# The PerplexityClient is created and managed internally.
classifier = Classifier()

# 2. Define the pharmacy data
pharmacy_data = {
    "name": "Downtown Pharmacy",
    "address": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105",
    "phone": "(555) 123-4567",
}

# 3. Classify the pharmacy
# The `use_llm` flag is True by default, enabling LLM fallback for low-confidence results.
result = classifier.classify_pharmacy(pharmacy_data, use_llm=True)

# 4. Inspect the result
print(f"Is Chain: {result.is_chain}")
print(f"Confidence: {result.confidence}")
print(f"Reason: {result.reason}")
print(f"Method: {result.method.value}")  # 'rule-based', 'llm', or 'cached'
```

To classify multiple pharmacies, simply iterate and call `classify_pharmacy`. The internal cache will prevent redundant API calls for duplicate entries.
```python
pharmacies = [
    {"name": "Downtown Pharmacy", "address": "123 Main St"},
    {"name": "CVS Pharmacy #1234", "address": "456 Market St"},
    # ... more pharmacies
]

results = [classifier.classify_pharmacy(p) for p in pharmacies]
for r in results:
    print(f"{r.reason} (Method: {r.method.value})")
```

Caching is enabled by default and managed internally by the `Classifier`. Results are stored in a local cache to minimize API costs and improve performance. There is no need for manual cache configuration.
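Conceptually, this internal caching behaves like a memo keyed on the pharmacy record. The sketch below illustrates the idea only; it is not the library's actual cache code, and `cached_classify` is a hypothetical name.

```python
import json
from typing import Callable

def cached_classify(classify_fn: Callable[[dict], str]) -> Callable[[dict], str]:
    """Wrap a classification function with a dict-backed cache keyed on the record."""
    cache: dict[str, str] = {}

    def wrapper(pharmacy: dict) -> str:
        # Serialize deterministically so identical records map to the same key.
        key = json.dumps(pharmacy, sort_keys=True)
        if key not in cache:
            cache[key] = classify_fn(pharmacy)  # only invoked on a cache miss
        return cache[key]

    return wrapper

calls = []
def fake_classify(p: dict) -> str:
    calls.append(p["name"])
    return "independent"

classify = cached_classify(fake_classify)
classify({"name": "Downtown Pharmacy"})
classify({"name": "Downtown Pharmacy"})  # served from cache; no second call
```

A persistent implementation would write the dict to disk so cached results survive across runs, which is how re-running the pipeline avoids repeat API charges.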
To bypass the cache and force fresh classification, the `PerplexityClient` can be created directly with the `force_reclassification` flag:

```python
client = PerplexityClient(
    api_key="your_api_key",
    force_reclassification=True
)
```
## Usage

### Running the Pipeline

```bash
python -m pharmacy_scraper.run_pipeline --config config/your_config.json
```

### Testing

```bash
pytest tests/ -v
```

Common testing commands (see the detailed guide in docs/TESTING.md):

- Full suite:

  ```bash
  make test
  ```

- QA suites (integration/contract/property):

  ```bash
  make test-qa
  ```

- Performance benchmarks (opt-in):

  ```bash
  make test-perf         # measure only (PERF=1)
  make test-perf-strict  # enforce thresholds (PERF=1 PERF_STRICT=1)
  ```
The project currently has 73% overall test coverage, with key modules having excellent coverage:
- Orchestrator Module: 88% coverage
  - Comprehensive cache functionality tests
  - Pipeline state management tests
  - Stage execution tests
- Classification Module: 100% coverage for the classifier, 92% for the Perplexity client
- State Manager: 96% coverage
```
pharmacy-scraper/
├── config/               # Configuration files
├── data/                 # Data files (ignored in git)
├── docs/                 # Documentation
├── scripts/              # Scripts for data collection and processing
├── src/                  # Source code
│   └── pharmacy_scraper/ # Main package
│       ├── api/            # API clients
│       ├── classification/ # AI classification
│       ├── config/         # Configuration
│       └── ...
├── tests/                # Test files
├── .github/              # GitHub workflows and templates
├── .gitignore
├── pyproject.toml
└── README.md
```
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
For consistent environments, runtime dependencies are pinned in requirements.txt, and development/test tools are pinned in requirements-dev.txt.
Run basic security and dependency audits locally:
```bash
pip install -r requirements.txt -r requirements-dev.txt
bash scripts/dev_security_check.sh
```

This script runs `pip-audit` to flag known vulnerabilities and `detect-secrets` to scan for accidentally committed secrets.
This project is licensed under the MIT License - see the LICENSE file for details.