# trafilatura

Here are 29 public repositories matching this topic...

opendatalab / MinerU-HTML

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

nlp scraping text-extraction web-scraping corpus-tools article-extractor rag trafilatura webagent

Updated Mar 27, 2026
Python

vakovalskii / searcharvester

Self-hosted search + markdown harvester for AI agents. SearXNG (100+ engines) + FastAPI + trafilatura. Tavily-compatible /search plus /extract with size presets and pagination. One-command Docker Compose.

markdown docker-compose self-hosted web-scraping search-api ai-agents rag fastapi searxng llm llm-tools trafilatura tavily agentic-ai

Updated Apr 24, 2026
Python

michaelsoftmd / pebkac-chrome

pebkac Chrome Nonautomation - A Local LLM-Driven Web Co-Browser using Smolagents, Zendriver, Trafilatura.

chrome automation ai openai webscraping atlas claude llms claude-ai trafilatura nodriver ai-browser-automation smolagents ai-browser-control aibrowser zendriver ai-browser pebkac

Updated Apr 14, 2026
Python

Murrough-Foley / rs-trafilatura

Fast, accurate web content extraction in Rust. ML page-type classification, per-type extraction, confidence scoring. F1=0.966 on ScrapingHub (#1), F1=0.859 across 2,008 annotated pages (1,497 development + 511 held-out test

nlp rust machine-learning web-scraping content-extraction search-engine-optimization trafilatura

Updated Apr 3, 2026
Rust

Gdi87 / Webscrapper

Web scraper in Python

scraper web pandas python3 scrapping scrapping-python scrapper-script trafilatura

Updated Sep 6, 2023
Python

maximilianromer / Onion-Search-MCP

Tools for LLMs to anonymously search and browse the web

mcp selenium tor duckduckgo selenium-webdriver tor-network geckodriver tor-hidden-services trafilatura mcp-server fastmcp ddgs tbselenium

Updated Mar 17, 2026
Python

mazzasaverio / url2md4ai

Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

html-to-markdown text-extraction openai playwright html-to-markdown-converter trafilatura

Updated Jul 6, 2025
HTML

brzvsk / longreader

Telegram Mini App that saves internet articles to read them later

telegram nextjs scraping telegrambot fastapi readitlater trafilatura telegramminiapp

Updated Feb 19, 2026
Python

bbulb / trawl

Selective web content extraction for AI agents — URL + query returns only the chunks that matter (Python library + MCP server)

python mcp embeddings web-scraping content-extraction ai-agents rag playwright llm trafilatura bge-m3 mcp-server

Updated Apr 22, 2026
Python

Yasser03 / pipescraper

A pipe-based news article scraping and metadata extraction library for Python

python crawler data-science scraper spider data-collection news-scraper osint-python llms article-extraction trafilatura newspaper4k

Updated Mar 20, 2026
Python

vedantvisoliya / Flutter-ChatGPT-Clone

ChatGPT AI Clone

python dart websockets gemini web-scraping flutter fastapi sentence-transformers flutter-web-app trafilatura tavily ai-clones

Updated Jul 4, 2025
Dart

MUGISHA-Pascal / Flutter-Perplexity-FastAPI

Real-time AI search and chat backend with WebSocket streaming, powered by Tavily web search and Google Gemini for Flutter apps.

gemini-api fastapi trafilatura tavily-api

Updated Aug 2, 2025
Python

fvanevski / trafilatura_mcp

Trafilatura MCP Server

fetch web-scraping trafilatura mcp-server

Updated Oct 1, 2025
Python

maximilianromer / Tor-Search-MCP

Tools for LLMs to anonymously search and browse the web

mcp selenium tor duckduckgo selenium-webdriver tor-network geckodriver tor-hidden-services trafilatura mcp-server fastmcp ddgs tbselenium

Updated Mar 17, 2026

GeoffroyZhang / wayback_protocole

Protocol for collecting and analyzing Wayback Machine archives for textual and statistical analysis

archive digital-humanities wayback-machine playwright trafilatura waybackmachine-api

Updated Apr 20, 2026
Python

warezfr / doc-crawler-rag

🕷️ Clean, chunked documentation crawler optimized for RAG & AnythingLLM. Dockerized.

python docker crawler self-hosted dynatrace rag streamlit llm trafilatura anythingllm notebookllm

Updated Jan 13, 2026
Python

10kseok / BlogToBook

A service that turns blog posts into e-books

pdf ebook calibre fastapi trafilatura

Updated Aug 12, 2025
Python

BigBang142 / Tor-Search-MCP

🕵️‍♂️ Enable anonymous web searches for your LLM with the first-ever Model Context Protocol server utilizing Tor for secure and private information retrieval.

api wrapper server mcp selenium tor duckduckgo geckodriver nyaa thepiratebay ygg trafilatura mcp-server fastmcp ddgs tbselenium lacale

Updated Apr 26, 2026
Python

elvismdev / trafilatura-api

Trafilatura API for extracting content and metadata from HTML

python nlp docker flask rest-api text-extraction web-scraping content-extraction metadata-extraction news-scraper article-extraction trafilatura

Updated Dec 2, 2025
Python

Pookie-n-Rookie / Crawlr

A web scraper with an LLM-powered document suggestion system that combines web crawling, data extraction, and advanced AI capabilities to recommend relevant documents.

multiagent llm langchain trafilatura crewai tavily agentic-rag

Updated May 10, 2025
Python
