Data Extractor Ants

This project is a scalable web crawler architecture that uses a master-worker model to explore web pages, manage links, and store extracted data.

Architecture Overview

Level 1: High-Level Overview

graph TD
    Master[Master Crawler] --> Worker[Worker Crawler]
    Worker --> Hub[Central Hub]
    Hub --> DataStore[Data Store]

At the highest level, the master crawler assigns tasks to workers, and the data flows into a central hub and then into a data store.

Level 2: Detailed Components

graph TD
    Master[Master Crawler] -->|Assign URLs| Worker[Worker Crawlers]
    Worker -->|Fetches Pages| WorkerProcess[Processing & Extraction]
    WorkerProcess -->|Sends Data| CentralHub[Central Hub]
    CentralHub -->|Marks Links| DataStore[Data Store]
    CentralHub -->|Spawns New Crawlers| Worker

This level shows how the master assigns URLs, workers fetch and process pages, and the central hub manages link tracking and spawning new crawlers.
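
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: `PAGES`, `worker`, `CentralHub`, and `crawl` are hypothetical names, the "web" is an in-memory dictionary instead of real HTTP, and workers run one after another rather than in parallel.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
PAGES = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["c"]),
    "c": ("page C", ["a"]),
}


def worker(url):
    """Fetch and process one page; return the extracted data and links."""
    text, links = PAGES.get(url, ("", []))
    return {"url": url, "text": text, "links": links}


class CentralHub:
    """Receives worker results, tracks links, and holds the data store."""

    def __init__(self):
        self.visited = set()   # link tracking
        self.data_store = {}   # stands in for the data store

    def receive(self, result):
        """Store a worker's data and return only links not seen before."""
        self.data_store[result["url"]] = result["text"]
        new_links = []
        for link in result["links"]:
            if link not in self.visited:
                self.visited.add(link)
                new_links.append(link)
        return new_links


def crawl(seed):
    hub = CentralHub()
    hub.visited.add(seed)
    queue = deque([seed])                  # master's assignment queue
    while queue:
        url = queue.popleft()              # master assigns a URL
        result = worker(url)               # worker fetches and processes it
        queue.extend(hub.receive(result))  # hub schedules new crawls
    return hub
```

Calling `crawl("a")` visits each of the three pages exactly once, even though the pages link to each other in a cycle, because the hub only returns links it has not seen before.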

Level 3: Worker Crawler Details

flowchart TD
    subgraph Worker[Worker Crawler]
        FetchPage[Fetch Assigned Page] --> ExtractLinks[Extract Links & Data]
        ExtractLinks --> SendToMaster[Send Data to Master]
        ExtractLinks --> UpdateVisited[Mark Links as Visited]
        UpdateVisited --> NewLinks[Add New Links to Queue]
    end

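Inside each worker crawler, the assigned page is fetched and its links and data are extracted. The extracted data is sent to the master, while the links are marked as visited and any new ones are added to the queue.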
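
As a rough sketch of the extraction step, the snippet below pulls the page title and all outgoing links from an HTML string using only the Python standard library. The names `LinkAndTitleParser` and `process_page` are hypothetical, the choice of "title" as the extracted data is an assumption made for illustration, and fetching the page over HTTP is deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects absolute link URLs and the <title> text of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def process_page(url, html):
    """Return the data a worker would report for one fetched page."""
    parser = LinkAndTitleParser(url)
    parser.feed(html)
    return {"url": url, "title": parser.title.strip(), "links": parser.links}
```

The returned dictionary is the kind of payload a worker could hand back for storage, with `links` feeding the visited-link check and the queue.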

Level 3: Master Crawler Details

flowchart TD
    Master[Master Crawler]
    Master -->|Assign URLs| Workers[Manage Workers]
    Master -->|Track Visited Links| LinkTracker[Link Tracker]
    Master -->|Store Data| Hub[Central Hub]

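The master crawler has three responsibilities: assigning URLs to the workers it manages, tracking visited links, and storing data through the central hub.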
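
One possible way to combine link tracking with URL assignment is sketched below. It assumes round-robin distribution and a simple normalization (dropping the `#fragment` and any trailing slash); neither choice is stated in this README, and `LinkTracker` and `assign` are hypothetical names.

```python
from itertools import cycle
from urllib.parse import urldefrag


class LinkTracker:
    """Remembers which URLs have already been handed out."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        # Assumed normalization: ignore the fragment and a trailing slash.
        key = urldefrag(url).url.rstrip("/")
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def assign(urls, worker_ids, tracker):
    """Distribute unseen URLs across workers in round-robin order."""
    assignments = {worker_id: [] for worker_id in worker_ids}
    workers = cycle(worker_ids)
    for url in urls:
        if tracker.add_if_new(url):
            assignments[next(workers)].append(url)
    return assignments
```

Because the tracker is consulted before a worker is picked, duplicate URLs never consume a slot in the rotation, which keeps the load roughly even across workers.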

Level 3: Data Store Details

flowchart TD
    CentralHub[Central Hub] -->|Sends Data| DataStore[Data Store]
    DataStore -->|Store Raw Data| Raw[Raw Data]
    DataStore -->|Process & Organize| Processed[Processed Data]

The data store receives data from the central hub, keeps it in raw form, and also processes and organizes it into processed data.
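
The raw and processed split could look something like the following sketch, which uses SQLite from the Python standard library. The storage backend, the table layout, and the fields of the processed record (`length`, `words`) are all assumptions made for illustration, and `DataStore` is a hypothetical name.

```python
import json
import sqlite3


class DataStore:
    """Keeps each page's raw body alongside a derived, organized record."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS raw (url TEXT PRIMARY KEY, body TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS processed (url TEXT PRIMARY KEY, record TEXT)"
        )

    def save(self, url, body):
        """Store the raw body, then derive and store a processed record."""
        self.db.execute("INSERT OR REPLACE INTO raw VALUES (?, ?)", (url, body))
        # Assumed processing step: a few simple statistics about the body.
        record = {"url": url, "length": len(body), "words": len(body.split())}
        self.db.execute(
            "INSERT OR REPLACE INTO processed VALUES (?, ?)",
            (url, json.dumps(record)),
        )
        self.db.commit()

    def processed(self, url):
        """Return the processed record for a URL, or None if it is unknown."""
        row = self.db.execute(
            "SELECT record FROM processed WHERE url = ?", (url,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

Keeping the raw body untouched means the processing step can be changed and re-run later without crawling the pages again.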