A Python-based tool for scraping and analyzing pharmacy data from various sources, with a focus on identifying independent, non-hospital pharmacies.
Latest Update (v2.1.0): Production-ready pipeline with secure API key handling, automated testing, and Perplexity `sonar` model integration. Now includes robust budget enforcement and extensive test coverage for the orchestrator (88%).
The system processes data through the following phases in order:

- **Data Collection (Phase 1)**
  - Scrapes pharmacy data using Apify and the Google Places API
  - Handles rate limiting and API quotas
  - Implements caching to minimize redundant requests
- **Deduplication (Phase 1.5)**
  - Smart duplicate removal using fuzzy matching
  - Self-healing capabilities for data gaps
  - Maintains data integrity across sources
- **Classification (Phase 2a)**
  - AI-powered classification using Perplexity's `sonar` LLM model
  - Enhanced identification of independent vs. chain pharmacies with improved accuracy
  - Supports both rule-based and LLM-based classification with automatic fallback
  - Implements intelligent caching for cost efficiency
- **Verification (Phase 2b, optional)**
  - Address and business verification using Google Places
  - Ensures data accuracy and completeness
  - Validates pharmacy details against trusted sources

Key features:

- Budget Management: Tracks and manages API usage and costs
- Error Handling: Robust error recovery and retry mechanisms
- Performance: Optimized for large-scale data processing
- Extensibility: Modular design for easy integration of new data sources
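The fuzzy-matching deduplication in Phase 1.5 can be illustrated with a small standard-library sketch. This is a minimal illustration, not the pipeline's actual matcher: the similarity threshold, the compared fields, and the function names here are all assumptions.

```python
from difflib import SequenceMatcher

def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Treat two pharmacy records as duplicates if name+address are highly similar.

    Illustrative only; the real matcher may compare more fields or use
    a different similarity measure.
    """
    key_a = f"{a['name']} {a['address']}".lower()
    key_b = f"{b['name']} {b['address']}".lower()
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each fuzzy-duplicate group."""
    unique: list[dict] = []
    for rec in records:
        if not any(is_duplicate(rec, kept) for kept in unique):
            unique.append(rec)
    return unique

records = [
    {"name": "Downtown Pharmacy", "address": "123 Main St"},
    {"name": "Downtown Pharmacy Inc", "address": "123 Main St."},  # near-duplicate
    {"name": "CVS Pharmacy #1234", "address": "456 Market St"},
]
```

The quadratic pairwise scan is fine for illustration; at scale a real implementation would typically block candidates (e.g. by zip code) before fuzzy comparison.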
Detailed documentation is available for each module:
- Orchestrator Module - Pipeline coordination and cache management
- Classification Module - AI-based pharmacy classification
1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/pharmacy-scraper.git
   cd pharmacy-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -e .
   ```

4. Create a `.env` file in the project root:

   ```bash
   # Create .env file (this file should never be committed to version control)
   touch .env
   ```

5. Add your API keys to the `.env` file:

   ```bash
   # API Keys
   GOOGLE_MAPS_API_KEY=your_google_api_key_here
   APIFY_API_TOKEN=your_apify_token_here
   PERPLEXITY_API_KEY=your_perplexity_api_key_here
   ```

6. Use the secure production configuration:

   ```bash
   # Make the setup script executable
   chmod +x scripts/setup_env_and_run.sh

   # Run the pipeline securely
   ./scripts/setup_env_and_run.sh
   ```
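In code, these keys are typically read from the environment once the setup script has exported them. A minimal standard-library sketch of that idea follows; the variable names match the `.env` entries above, but `load_api_keys` is a hypothetical helper, not part of the project's API.

```python
import os

# The key names expected in the environment (matching the .env file above).
REQUIRED_KEYS = ["GOOGLE_MAPS_API_KEY", "APIFY_API_TOKEN", "PERPLEXITY_API_KEY"]

def load_api_keys() -> dict[str, str]:
    """Read all required API keys, failing fast if any is missing or empty.

    Failing at startup is preferable to a mid-pipeline crash after
    API credits have already been spent.
    """
    missing = [k for k in REQUIRED_KEYS if not os.getenv(k)]
    if missing:
        raise KeyError(f"Missing required API keys: {', '.join(missing)}")
    return {k: os.environ[k] for k in REQUIRED_KEYS}
```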
1. Copy the example config file and update it with your API keys:

   ```bash
   cp config/example_config.json config/config.json
   ```

2. Update the following in `config/config.json`:
   - Apify API token
   - Google Places API key
   - Perplexity API key
   - Other configuration parameters as needed
The pharmacy scraper supports the following execution modes:

- **Test Pipeline (no API calls):**

  ```bash
  python scripts/run_test_pipeline.py
  ```

  This runs the pipeline with mocked API services, which is ideal for testing changes without spending API credits.

- **Production Pipeline (real API calls):**

  ```bash
  # Run with default configuration
  python scripts/run_production_pipeline.py

  # Reset state and start fresh
  python scripts/run_production_pipeline.py --reset

  # Use a specific configuration
  python scripts/run_production_pipeline.py --config config/production/custom_config.json

  # Validate configuration without making API calls
  python scripts/run_production_pipeline.py --dry-run
  ```

- **Secure Environment Setup (recommended):**

  ```bash
  # This uses environment variables from the .env file
  ./scripts/setup_env_and_run.sh
  ```

The pipeline enforces the API budget limits specified in your configuration:
```json
"max_budget": 50.0,
"api_cost_limits": {
  "apify": 0.5,
  "google_places": 0.3,
  "perplexity": 0.2
}
```

This prevents unexpected API costs while ensuring data quality.
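The enforcement logic can be sketched as a small cost tracker. This is a minimal illustration, not the project's actual budget manager: the class and method names are hypothetical, and it assumes the per-API values are fractions of the total budget, which may not match the real configuration semantics.

```python
class BudgetExceededError(Exception):
    """Raised when a recorded cost would exceed a configured limit."""

class BudgetTracker:
    def __init__(self, max_budget: float, api_cost_limits: dict[str, float]):
        self.max_budget = max_budget
        # Assumption: each per-API value is a fraction of the total budget.
        self.api_limits = {api: frac * max_budget for api, frac in api_cost_limits.items()}
        self.spent: dict[str, float] = {api: 0.0 for api in api_cost_limits}

    def record(self, api: str, cost: float) -> None:
        """Record one call's cost, raising before any limit is exceeded."""
        if self.spent[api] + cost > self.api_limits[api]:
            raise BudgetExceededError(f"{api} budget exhausted")
        if sum(self.spent.values()) + cost > self.max_budget:
            raise BudgetExceededError("total budget exhausted")
        self.spent[api] += cost

tracker = BudgetTracker(50.0, {"apify": 0.5, "google_places": 0.3, "perplexity": 0.2})
tracker.record("apify", 10.0)  # within the apify limit of 0.5 * 50 = 25.0
```

Checking limits before mutating state means a rejected call leaves the tracker unchanged, so the pipeline can stop cleanly at the boundary.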
The primary entry point for classifying pharmacies is the `Classifier` class. It provides a simple interface that handles rule-based classification, LLM fallback, and caching automatically.
Here's how to classify a single pharmacy:
```python
from pharmacy_scraper.classification import Classifier
from pharmacy_scraper.classification.data_models import PharmacyData

# 1. Instantiate the classifier
# The PerplexityClient is created and managed internally.
classifier = Classifier()

# 2. Define the pharmacy data
pharmacy_data = {
    "name": "Downtown Pharmacy",
    "address": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105",
    "phone": "(555) 123-4567",
}

# 3. Classify the pharmacy
# The `use_llm` flag is True by default, enabling LLM fallback for low-confidence results.
result = classifier.classify_pharmacy(pharmacy_data, use_llm=True)

# 4. Inspect the result
print(f"Is Chain: {result.is_chain}")
print(f"Confidence: {result.confidence}")
print(f"Reason: {result.reason}")
print(f"Method: {result.method.value}")  # 'rule-based', 'llm', or 'cached'
```

To classify multiple pharmacies, simply iterate and call `classify_pharmacy`. The internal cache will prevent redundant API calls for duplicate entries.
```python
pharmacies = [
    {"name": "Downtown Pharmacy", "address": "123 Main St"},
    {"name": "CVS Pharmacy #1234", "address": "456 Market St"},
    # ... more pharmacies
]

results = [classifier.classify_pharmacy(p) for p in pharmacies]
for r in results:
    print(f"{r.reason} (Method: {r.method.value})")
```

Caching is enabled by default and managed internally by the `Classifier`. Results are stored in a local cache to minimize API costs and improve performance. There is no need for manual cache configuration.
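Conceptually, this internal caching behaves like a memo keyed on the pharmacy record. The sketch below illustrates the idea only; it is not the library's actual cache code, and `cached_classify` is a hypothetical name.

```python
import json
from typing import Callable

def cached_classify(classify_fn: Callable[[dict], str]) -> Callable[[dict], str]:
    """Wrap a classification function with a dict-backed cache keyed on the record."""
    cache: dict[str, str] = {}

    def wrapper(pharmacy: dict) -> str:
        # Serialize deterministically so identical records map to the same key.
        key = json.dumps(pharmacy, sort_keys=True)
        if key not in cache:
            cache[key] = classify_fn(pharmacy)  # only invoked on a cache miss
        return cache[key]

    return wrapper

calls = []
def fake_classify(p: dict) -> str:
    calls.append(p["name"])
    return "independent"

classify = cached_classify(fake_classify)
classify({"name": "Downtown Pharmacy"})
classify({"name": "Downtown Pharmacy"})  # served from cache; no second call
```

A persistent implementation would write the dict to disk so cached results survive across runs, which is how re-running the pipeline avoids repeat API charges.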
To bypass the cache and force fresh classification, the `PerplexityClient` can be created directly with the `force_reclassification` flag:

```python
client = PerplexityClient(
    api_key="your_api_key",
    force_reclassification=True
)
```
## Usage

### Running the Pipeline

```bash
python -m pharmacy_scraper.run_pipeline --config config/your_config.json
```

### Testing

```bash
pytest tests/ -v
```

Common testing commands (see the detailed guide in docs/TESTING.md):

- Full suite:

  ```bash
  make test
  ```

- QA suites (integration/contract/property):

  ```bash
  make test-qa
  ```

- Performance benchmarks (opt-in):

  ```bash
  make test-perf         # measure only (PERF=1)
  make test-perf-strict  # enforce thresholds (PERF=1 PERF_STRICT=1)
  ```
The project currently has 73% overall test coverage, with key modules having excellent coverage:
- Orchestrator Module: 88% coverage
  - Comprehensive cache functionality tests
  - Pipeline state management tests
  - Stage execution tests
- Classification Module: 100% coverage for the classifier, 92% for the Perplexity client
- State Manager: 96% coverage
```
pharmacy-scraper/
├── config/               # Configuration files
├── data/                 # Data files (ignored in git)
├── docs/                 # Documentation
├── scripts/              # Scripts for data collection and processing
├── src/                  # Source code
│   └── pharmacy_scraper/ # Main package
│       ├── api/            # API clients
│       ├── classification/ # AI classification
│       ├── config/         # Configuration
│       └── ...
├── tests/                # Test files
├── .github/              # GitHub workflows and templates
├── .gitignore
├── pyproject.toml
└── README.md
```
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
For consistent environments, runtime dependencies are pinned in requirements.txt, and development/test tools are pinned in requirements-dev.txt.
Run basic security and dependency audits locally:
```bash
pip install -r requirements.txt -r requirements-dev.txt
bash scripts/dev_security_check.sh
```

This script runs `pip-audit` to flag known vulnerabilities and `detect-secrets` to scan for accidentally committed secrets.
This project is licensed under the MIT License - see the LICENSE file for details.