ai-crawlflow is a web crawler framework written in C/C++ with a modular, extensible architecture. The design focuses on simplicity and on laying a foundation for AI-ready data pipelines.
```
Spider
   │
   ▼
Scheduler
   │
   ▼
Downloader
   │
   ▼
Parser
   │
   ▼
Pipeline
   │
   ├──> CleanStage
   ├──> FilterStage
   ├──> ChunkStage (optional AI)
   ├──> EmbeddingStage (optional AI)
   │
   ▼
Storage
```
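As a rough illustration of this flow, the sketch below wires stub components into a crawl loop. Every class, function, and field name here is an illustrative assumption; the real APIs declared under `include/ddkcrawler` may look different.

```cpp
// Hypothetical wiring of the crawl loop; every name in this sketch is an
// illustrative assumption, not the actual ddkcrawler API.
#include <deque>
#include <iostream>
#include <optional>
#include <string>

struct Document { std::string url, html, text; };

// Scheduler: the queue of URLs waiting to be crawled.
class Scheduler {
public:
    void enqueue(std::string url) { urls_.push_back(std::move(url)); }
    std::optional<std::string> next() {
        if (urls_.empty()) return std::nullopt;
        std::string url = std::move(urls_.front());
        urls_.pop_front();
        return url;
    }
private:
    std::deque<std::string> urls_;
};

// Stand-ins for the Downloader, Parser, Pipeline, and Storage components.
Document fetch(const std::string& url) { return {url, "<html>...</html>", ""}; }
void parse(Document& doc, Scheduler&)  { doc.text = "extracted text"; }
void run_pipeline(Document&)           { /* clean -> filter -> chunk -> embed */ }
void store(const Document& doc)        { std::cout << "stored " << doc.url << '\n'; }

int main() {
    Scheduler scheduler;
    scheduler.enqueue("https://example.com/");  // Spider seeds the start URL
    while (auto url = scheduler.next()) {       // Scheduler hands out work
        Document doc = fetch(*url);             // Downloader fetches the page
        parse(doc, scheduler);                  // Parser extracts text (and links)
        run_pipeline(doc);                      // Pipeline stages process the doc
        store(doc);                             // Storage persists the result
    }
}
```

In the real crawler the loop would stay alive because the Parser feeds newly discovered links back into the Scheduler's queue; here it simply drains the seed URL.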
```
DDKCrawler
├── CMakeLists.txt
│
├── include
│   └── ddkcrawler
│       ├── spider.h
│       ├── scheduler.h
│       ├── downloader.h
│       ├── parser.h
│       ├── pipeline.h
│       └── storage.h
│
├── src
│   ├── main.cpp
│   └── core
│       ├── spider.cpp
│       ├── scheduler.cpp
│       ├── downloader.cpp
│       ├── parser.cpp
│       ├── pipeline.cpp
│       └── storage.cpp
│
├── tests
│
├── legacy
│
├── README.md
└── .gitignore
```
- Spider: Defines crawling logic and starting URLs.
- Scheduler: Manages the queue of URLs waiting to be crawled.
- Downloader: Handles network connections and downloads web pages or resources.
- Parser: Extracts links or data from downloaded content.
- Pipeline: Processes Documents through a series of modular stages, including optional AI-related stages such as text cleaning, chunking, and embedding (a minimal stage interface is sketched after this list).
- Storage: Stores processed Documents with support for structured data formats; designed to be extensible for future storage backends (e.g., JSON, databases, vector storage).
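One plausible shape for the pluggable pipeline is sketched below. This is a minimal sketch under assumed names (`Document`, `PipelineStage`, `Pipeline::add_stage`); the actual interface in `pipeline.h` is not shown here and may differ.

```cpp
// Minimal pluggable-stage interface; a sketch only, not the real pipeline.h.
#include <memory>
#include <string>
#include <vector>

struct Document {
    std::string url;
    std::string text;
    std::vector<std::string> chunks;   // filled by an optional ChunkStage
};

class PipelineStage {
public:
    virtual ~PipelineStage() = default;
    virtual void process(Document& doc) = 0;   // transform the document in place
};

class Pipeline {
public:
    void add_stage(std::unique_ptr<PipelineStage> stage) {
        stages_.push_back(std::move(stage));
    }
    void run(Document& doc) {
        for (auto& stage : stages_) stage->process(doc);   // stages run in order
    }
private:
    std::vector<std::unique_ptr<PipelineStage>> stages_;
};
```

With this shape, optional AI stages (cleaning, filtering, chunking, embedding) are just additional `PipelineStage` subclasses registered at startup, so the crawler core never needs to change.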
- Modular crawler architecture
- URL queue system (a thread-safe sketch follows this list)
- Parallel downloading support
- Extensible and pluggable pipeline design
- AI-ready data processing pipeline
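The URL queue and parallel downloading features imply a frontier that several download workers can share safely. Below is a minimal thread-safe queue sketch using only the standard library; the class name `UrlQueue` and its `push`/`pop`/`close` methods are assumptions, and the real `scheduler.h` may use a different design (priorities, politeness delays, deduplication).

```cpp
// Hypothetical thread-safe URL frontier; a sketch, not the real scheduler.h.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

class UrlQueue {
public:
    void push(std::string url) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(url));
        }
        cv_.notify_one();
    }

    // Blocks until a URL is available; returns nullopt once the queue is
    // closed and fully drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        std::string url = std::move(queue_.front());
        queue_.pop();
        return url;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::queue<std::string> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool closed_ = false;
};
```

Download worker threads loop on `pop()` until `close()` is called, at which point they drain the remaining URLs and exit.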
The project is evolving towards an AI-ready data pipeline framework:
- Core crawler architecture
- Data processing pipeline enhancements
- AI-ready pipeline stages (chunking, embedding); a chunking sketch follows this list
- Vector storage integration
- Semantic search support
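To make the chunking item concrete, here is one plausible `ChunkStage`, reusing the hypothetical `PipelineStage`/`Document` interface sketched earlier. The fixed-size character policy is an assumption; a real implementation might split on sentences, tokens, or HTML structure.

```cpp
#include <cstddef>

// Hypothetical fixed-size chunker built on the PipelineStage/Document
// sketch above; the chunking policy here is an assumption.
class ChunkStage : public PipelineStage {
public:
    explicit ChunkStage(std::size_t chunk_size) : chunk_size_(chunk_size) {}

    void process(Document& doc) override {
        // Slice doc.text into windows of at most chunk_size_ characters.
        for (std::size_t pos = 0; pos < doc.text.size(); pos += chunk_size_) {
            doc.chunks.push_back(doc.text.substr(pos, chunk_size_));
        }
    }

private:
    std::size_t chunk_size_;   // maximum characters per chunk
};
```

An embedding stage could follow the same pattern, turning each chunk into a vector for the planned vector storage and semantic search backends.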
Written in C / C++.

Work in progress; actively evolving towards an AI-ready data pipeline framework.