MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
Updated Mar 27, 2026 - Python
Self-hosted search + markdown harvester for AI agents. SearXNG (100+ engines) + FastAPI + trafilatura. Tavily-compatible /search plus /extract with size presets and pagination. One-command Docker Compose.
pebkac Chrome Nonautomation - A Local LLM-Driven Web Co-Browser using Smolagents, Zendriver, Trafilatura.
Fast, accurate web content extraction in Rust. ML page-type classification, per-type extraction, confidence scoring. F1=0.966 on ScrapingHub (#1), F1=0.859 across 2,008 annotated pages (1,497 development + 511 held-out test)
Web scraper in Python
Tools for LLMs to anonymously search and browse the web
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
Telegram Mini App that saves internet articles to read them later
Selective web content extraction for AI agents — URL + query returns only the chunks that matter (Python library + MCP server)
A pipe-based news article scraping and metadata extraction library for Python
ChatGPT AI Clone
Real-time AI search and chat backend with WebSocket streaming, powered by Tavily web search and Google Gemini for Flutter apps.
Tools for LLMs to anonymously search and browse the web
Protocol for collecting and analyzing Wayback Machine archives for textual and statistical analysis
🕷️ Clean, chunked documentation crawler optimized for RAG & AnythingLLM. Dockerized.
🕵️‍♂️ Enable anonymous web searches for your LLM with the first-ever Model Context Protocol server utilizing Tor for secure and private information retrieval.
Trafilatura API for extracting HTML content information
A web scraper with an LLM-powered document suggestion system that combines web crawling, data extraction, and advanced AI capabilities to recommend relevant documents.