Tim-Alpha/Data-Extractor-Ants

Data Extractor Ants

This project is a scalable web crawler architecture that uses a master-worker model to explore web pages, manage links, and store extracted data.

Architecture Overview

Level 1: High-Level Overview

graph TD
    Master[Master Crawler] --> Worker[Worker Crawler]
    Worker --> Hub[Central Hub]
    Hub --> DataStore[Data Store]

At the highest level, the master crawler assigns tasks to worker crawlers. The data the workers extract flows into a central hub and from there into a data store.
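The flow above can be sketched in a few lines of Python. This is only an illustrative, in-memory sketch: the class and function names (`DataStore`, `CentralHub`, `worker`, `master`) are hypothetical and not taken from the repository, and the worker fabricates a record instead of fetching anything.

```python
from collections import deque


class DataStore:
    """Hypothetical data store: keeps every record it is given."""

    def __init__(self):
        self.records = []

    def save(self, record):
        self.records.append(record)


class CentralHub:
    """Hypothetical hub: receives worker output and forwards it to the store."""

    def __init__(self, store):
        self.store = store

    def receive(self, record):
        self.store.save(record)


def worker(url, hub):
    # Stand-in for real fetching and extraction (assumption: no network here).
    hub.receive({"url": url, "data": f"content of {url}"})


def master(urls, hub):
    # The master hands out one URL at a time until the queue is empty.
    queue = deque(urls)
    while queue:
        worker(queue.popleft(), hub)


store = DataStore()
master(["https://example.com/a", "https://example.com/b"], CentralHub(store))
```

After this runs, `store.records` holds one record per assigned URL, in assignment order.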

Level 2: Detailed Components

graph TD
    Master[Master Crawler] -->|Assign URLs| Worker[Worker Crawlers]
    Worker -->|Fetches Pages| WorkerProcess[Processing & Extraction]
    WorkerProcess -->|Sends Data| CentralHub[Central Hub]
    CentralHub -->|Marks Links| DataStore[Data Store]
    CentralHub -->|Spawns New Crawlers| Worker

At this level, the master assigns URLs to the workers, the workers fetch and process pages and send the extracted data to the central hub, and the central hub tracks links and spawns new crawlers.
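One way to picture the hub's link tracking and crawler spawning is the loop below. It is a sketch under stated assumptions: `FAKE_WEB` is a hypothetical in-memory stand-in for real pages, "spawning a crawler" is modelled as adding a URL to a pending list, and none of the names come from the project itself.

```python
# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


class CentralHub:
    def __init__(self):
        self.seen = set()       # links the hub has already marked
        self.data_store = {}    # url -> data extracted from that page

    def mark(self, url):
        """Mark a link; return True only the first time it is seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

    def receive(self, url, links):
        """Store a worker's result and return the links that need a new crawler."""
        self.data_store[url] = links
        return [link for link in links if self.mark(link)]


def worker(url):
    # Stand-in for fetching and extraction: look the page up in FAKE_WEB.
    return FAKE_WEB.get(url, [])


def crawl(seed):
    hub = CentralHub()
    hub.mark(seed)
    pending = [seed]            # URLs waiting for a worker
    while pending:
        url = pending.pop()
        pending.extend(hub.receive(url, worker(url)))
    return hub


hub = crawl("/")
```

Because a link is marked before a crawler is spawned for it, each page is processed at most once even though the fake pages link back to each other.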

Level 3: Worker Crawler Details

flowchart TD
    subgraph Worker[Worker Crawler]
        FetchPage[Fetch Assigned Page] --> ExtractLinks[Extract Links & Data]
        ExtractLinks --> SendToMaster[Send Data to Master]
        ExtractLinks --> UpdateVisited[Mark Links as Visited]
        UpdateVisited --> NewLinks[Add New Links to Queue]
    end

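Inside each worker crawler, the assigned page is fetched and its links and data are extracted. The extracted data is sent to the master, and the links are marked as visited before new links are added to the queue.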
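A single worker step might look like the following sketch, which uses only the Python standard library. Several things here are assumptions rather than the project's design: the `fetch` callable is injected so the example runs offline, the "data" extracted is just the page title, and "mark as visited" is interpreted as marking the fetched page and queueing only links not yet seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects absolute link URLs and the page title."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def run_worker(url, fetch, visited, queue):
    """Fetch one page, extract links and data, update visited set and queue."""
    parser = LinkAndTitleParser(url)
    parser.feed(fetch(url))
    visited.add(url)
    for link in parser.links:
        if link not in visited and link not in queue:
            queue.append(link)
    # This dict is what the worker would send onward.
    return {"url": url, "title": parser.title.strip(), "links": parser.links}


def fake_fetch(url):
    # Hypothetical page used instead of a real HTTP request.
    return ('<html><head><title>Home</title></head><body>'
            '<a href="/about">About</a>'
            '<a href="https://example.com/">Self</a></body></html>')


visited, queue = set(), []
result = run_worker("https://example.com/", fake_fetch, visited, queue)
```

In a real crawler, `fetch` would perform the HTTP request; passing it in as a parameter keeps the extraction logic testable on its own.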

Level 3: Master Crawler Details

flowchart TD
    Master[Master Crawler]
    Master -->|Assign URLs| Workers[Manage Workers]
    Master -->|Track Visited Links| LinkTracker[Link Tracker]
    Master -->|Store Data| Hub[Central Hub]

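The master crawler has three responsibilities: assigning URLs to the workers it manages, tracking visited links through a link tracker, and storing data through the central hub.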
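These three responsibilities could be combined roughly as follows. This is a sketch, not the repository's implementation: `Master`, `LinkTracker`, the thread pool, and the use of a plain list as the hub are all assumptions made for illustration, and the worker function is expected to return a `(data, links)` pair.

```python
from concurrent.futures import ThreadPoolExecutor


class LinkTracker:
    """Hypothetical link tracker: remembers every URL already handed out."""

    def __init__(self):
        self._visited = set()

    def claim(self, url):
        """Return True if the URL is new, and record it as visited."""
        if url in self._visited:
            return False
        self._visited.add(url)
        return True


class Master:
    def __init__(self, worker_fn, hub, max_workers=4):
        self.worker_fn = worker_fn    # callable: url -> (data, links)
        self.hub = hub                # here simply a list standing in for the hub
        self.tracker = LinkTracker()
        self.max_workers = max_workers

    def run(self, seeds):
        frontier = [u for u in seeds if self.tracker.claim(u)]
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            while frontier:
                # Assign the whole frontier to workers; map() keeps input order.
                results = list(pool.map(self.worker_fn, frontier))
                next_frontier = []
                for url, (data, links) in zip(frontier, results):
                    self.hub.append((url, data))          # store data via the hub
                    next_frontier.extend(
                        link for link in links if self.tracker.claim(link)
                    )
                frontier = next_frontier


# Hypothetical link graph used in place of real pages.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def fake_worker(url):
    return f"data:{url}", GRAPH.get(url, [])


hub = []
Master(fake_worker, hub).run(["a"])
```

The link tracker is only touched from the master's own thread, so the sketch needs no locking; a design in which workers update shared state directly would.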

Level 3: Data Store Details

flowchart TD
    CentralHub[Central Hub] -->|Sends Data| DataStore[Data Store]
    DataStore -->|Store Raw Data| Raw[Raw Data]
    DataStore -->|Process & Organize| Processed[Processed Data]

The data store receives data from the central hub, stores it as raw data, and processes and organizes it into processed data.
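The raw and processed split can be illustrated with a small in-memory class. The processing step shown here (a word count and a short preview) is purely a placeholder assumption, since the source does not say what processing the data store performs, and the names are hypothetical.

```python
class DataStore:
    """Hypothetical store that keeps raw input alongside a processed view."""

    def __init__(self):
        self.raw = []          # records exactly as received from the hub
        self.processed = {}    # url -> organized summary of the record

    def receive(self, url, text):
        self.raw.append({"url": url, "body": text})
        self.processed[url] = self._process(text)

    @staticmethod
    def _process(text):
        # Placeholder processing: normalize whitespace and summarize.
        words = text.split()
        return {"word_count": len(words), "preview": " ".join(words[:5])}


store = DataStore()
store.receive("https://example.com/", "  hello   world  ")
```

Keeping the raw record untouched means the processing step can be changed and re-run later without fetching the page again.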

