This project implements a scalable web crawler architecture that uses a master-worker model to explore web pages, manage links, and store extracted data.
```mermaid
graph TD
    Master[Master Crawler] --> Worker[Worker Crawler]
    Worker --> Hub[Central Hub]
    Hub --> DataStore[Data Store]
```
At the highest level, the master crawler assigns tasks to workers, and the extracted data flows into a central hub and from there into a data store.
```mermaid
graph TD
    Master[Master Crawler] -->|Assign URLs| Worker[Worker Crawlers]
    Worker -->|Fetches Pages| WorkerProcess[Processing & Extraction]
    WorkerProcess -->|Sends Data| CentralHub[Central Hub]
    CentralHub -->|Marks Links| DataStore[Data Store]
    CentralHub -->|Spawns New Crawlers| Worker
```
This level shows how the master assigns URLs, workers fetch and process pages, and the central hub tracks visited links and spawns new crawlers as work arrives.
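The assign/fetch/report loop above can be sketched as a single-process simulation. This is a minimal illustration, not the project's actual code: `fetch_page` is a hypothetical stand-in that returns hard-coded links instead of making HTTP requests, and the shared `visited` set plays the role of the central hub's link tracking.

```python
import queue
import threading

def fetch_page(url):
    # Hypothetical fetcher: a real worker would issue an HTTP request
    # and parse the response; fake links keep the sketch self-contained.
    fake_links = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": []}
    return fake_links.get(url, [])

def crawl(seed, num_workers=3):
    frontier = queue.Queue()   # URLs the master hands out to workers
    frontier.put(seed)
    visited = set()            # central hub's record of seen links
    results = {}               # data store: url -> extracted links
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get(timeout=0.5)   # idle workers wind down
            except queue.Empty:
                return
            with lock:
                if url in visited:
                    continue
                visited.add(url)                  # mark link as visited
            links = fetch_page(url)               # fetch & extract
            with lock:
                results[url] = links              # send data to the hub
                for link in links:
                    if link not in visited:
                        frontier.put(link)        # hub queues new work

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(sorted(crawl("/")))   # → ['/', '/a', '/b', '/c']
```

The timeout on `frontier.get` is a simple shutdown heuristic for the sketch; a production crawler would use an explicit sentinel or coordination from the master.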
```mermaid
flowchart TD
    subgraph Worker[Worker Crawler]
        FetchPage[Fetch Assigned Page] --> ExtractLinks[Extract Links & Data]
        ExtractLinks --> SendToMaster[Send Data to Master]
        ExtractLinks --> UpdateVisited[Mark Links as Visited]
        UpdateVisited --> NewLinks[Add New Links to Queue]
    end
```
This shows the detailed flow inside each worker crawler.
```mermaid
flowchart TD
    Master[Master Crawler]
    Master -->|Assign URLs| Workers[Manage Workers]
    Master -->|Track Visited Links| LinkTracker[Link Tracker]
    Master -->|Store Data| Hub[Central Hub]
```
This focuses on the master crawler's responsibilities.
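The master's bookkeeping can be sketched as a small class, assuming three pieces of state that match the diagram: a frontier of URLs to assign, a visited-link tracker, and a hand-off list standing in for the central hub. The method names `assign` and `report` are illustrative, not a fixed API.

```python
class Master:
    """Sketch of the master crawler's responsibilities."""

    def __init__(self, seeds):
        self.frontier = list(seeds)   # URLs waiting to be assigned
        self.visited = set()          # Link Tracker
        self.hub = []                 # stand-in for the Central Hub

    def assign(self):
        # Assign URLs: hand out the next unvisited URL, or None if done.
        while self.frontier:
            url = self.frontier.pop(0)
            if url not in self.visited:
                self.visited.add(url)
                return url
        return None

    def report(self, url, data, new_links):
        # Store Data in the hub and queue any links not yet visited.
        self.hub.append((url, data))
        self.frontier.extend(l for l in new_links if l not in self.visited)
```

A worker would call `assign()` to get a URL and `report()` with its results; the visited check on both paths keeps the crawl from looping on already-seen links.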
```mermaid
flowchart TD
    CentralHub[Central Hub] -->|Sends Data| DataStore[Data Store]
    DataStore -->|Store Raw Data| Raw[Raw Data]
    DataStore -->|Process & Organize| Processed[Processed Data]
```
This part shows how the data store organizes and processes the incoming data.
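A two-tier store along these lines might look like the sketch below. The record shape (`title`, `links`) and the processing step (deriving a per-page summary) are assumptions made for illustration; any schema and transformation could sit behind the same raw/processed split.

```python
class DataStore:
    """Keeps raw payloads untouched alongside a processed, queryable view."""

    def __init__(self):
        self.raw = {}        # Raw Data: url -> record as received
        self.processed = {}  # Processed Data: url -> derived summary

    def store(self, url, record):
        self.raw[url] = record                       # Store Raw Data
        self.processed[url] = {                      # Process & Organize
            "title": record.get("title", ""),
            "link_count": len(record.get("links", [])),
        }

store = DataStore()
store.store("/", {"title": "Home", "links": ["/a", "/b"]})
print(store.processed["/"])   # → {'title': 'Home', 'link_count': 2}
```

Keeping the raw record means the processed view can be rebuilt later if the processing logic changes, without re-crawling any pages.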