


ThreeFish-AI/data-extractor


Data Extractor is a commercial-grade MCP Server built on FastMCP, offering robust capabilities to read, extract, and localize (into Markdown) content from web pages and PDFs with both text and images. It is purpose-built for long-term deployment in enterprise environments.

🛠️ MCP Server Core Tools (14)

Web Page

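| Tool name | Description | Main parameters |
| --- | --- | --- |
| `scrape_webpage` | Single-page scraping | `url`, `method` (auto-selected), `extract_config` (selector configuration), `wait_for_element` (CSS selector) |
| `scrape_multiple_webpages` | Batch page scraping | `urls` (list), `method` (one method for all pages), `extract_config` (global configuration) |
| `scrape_with_stealth` | Anti-detection scraping | `url`, `method` (selenium/playwright), `scroll_page` (scroll to trigger loading), `wait_for_element` |
| `fill_and_submit_form` | Form automation | `url`, `form_data` (selector: value), `submit` (whether to submit), `submit_button_selector` |
| `extract_links` | Specialized link extraction | `url`, `filter_domains` (domains to include), `exclude_domains` (domains to exclude), `internal_only` (internal links only) |
| `extract_structured_data` | Structured data extraction | `url`, `data_type` (all/contact/social/content/products/addresses) |
| `get_page_info` | Page information retrieval | `url` (target URL); returns the title, status code, and metadata |
| `check_robots_txt` | Crawler rules check | `url` (domain URL); checks the site's robots.txt rules |
| `convert_webpage_to_markdown` | Web page to Markdown conversion | `url`, `method`, `extract_main_content` (extract main content), `embed_images` (embed images), `formatting_options` |
| `batch_convert_webpages_to_markdown` | Batch Markdown conversion | `urls` (list), `method`, `extract_main_content`, `embed_images`, `embed_options` |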
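
The `check_robots_txt` tool checks a site's robots.txt rules. The sketch below is not the server's implementation; it only illustrates the kind of check involved, using Python's standard-library `urllib.robotparser`. The rules text, the user agent `example-bot`, the helper name `is_allowed`, and the URLs are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download it from the site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/docs/intro",
        "https://example.com/private/report",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "example-bot", url))
```

Running a check like this before calling any of the scraping tools is one way to stay within a site's published crawling rules.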

PDF Document

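| Tool name | Description | Main parameters |
| --- | --- | --- |
| `convert_pdf_to_markdown` | PDF to Markdown conversion | `pdf_source` (URL or path), `method` (auto/pymupdf/pypdf), `page_range`, `output_format` |
| `batch_convert_pdfs_to_markdown` | Batch PDF conversion | `pdf_sources` (list), `method`, `page_range`, `output_format`, `include_metadata` |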

Service Management

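| Tool name | Description | Main parameters |
| --- | --- | --- |
| `get_server_metrics` | Performance metrics monitoring | None; returns request statistics, performance metrics, and cache status |
| `clear_cache` | Cache management | None; clears all cached data |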
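
MCP clients invoke tools through a JSON-RPC 2.0 `tools/call` request whose `params` carry the tool name and an `arguments` object. As a rough sketch of how the parameters listed in the tool tables map onto such a request, the hypothetical helper `build_tool_call` below serializes one. The argument values (URLs, selector, file path) are assumptions for illustration, and in practice an MCP client library would normally construct and send these messages for you.

```python
import json
from typing import Any, Dict, Optional


def build_tool_call(
    request_id: int,
    tool_name: str,
    arguments: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize an MCP `tools/call` JSON-RPC 2.0 request as a JSON string."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # Tools that take no parameters receive an empty arguments object.
        "params": {"name": tool_name, "arguments": arguments or {}},
    }
    return json.dumps(request, ensure_ascii=False)


if __name__ == "__main__":
    # Parameter names come from the tool tables; the values are made up.
    print(build_tool_call(1, "scrape_webpage", {
        "url": "https://example.com",
        "wait_for_element": "article",
    }))
    print(build_tool_call(2, "convert_pdf_to_markdown", {
        "pdf_source": "https://example.com/sample.pdf",
        "method": "pymupdf",
    }))
    print(build_tool_call(3, "get_server_metrics"))
```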


🤝 Contribution

Issues and Pull Requests that improve this project are welcome.

📄 License

MIT License. See the LICENSE file for details.


Note: Use this tool responsibly. Comply with each website's terms of use and robots.txt rules, and respect its intellectual property.
