This is a Flask-based web application that transforms web content into an AI-ready format for RAG (Retrieval-Augmented Generation) systems. The application extracts links, content, and images from websites and generates structured PDF/CSV documents optimized for AI applications and LLM consumption.
Preferred communication style: Simple, everyday language.
- Framework: Flask (Python web framework)
- Web Scraping: Combination of BeautifulSoup, requests, and Trafilatura for content extraction
- PDF Generation: ReportLab for creating PDF documents
- Deployment: WSGI-compatible with ProxyFix middleware for reverse proxy support
- Template Engine: Jinja2 (Flask's default)
- CSS Framework: Bootstrap 5 with dark theme support
- JavaScript: Vanilla JavaScript for form validation and UI interactions
- Icons: Font Awesome for iconography
- app.py: Application factory and configuration
- routes.py: Flask route handlers for web endpoints
- web_scraper.py: Main scraping logic using multiple libraries
- link_extractor.py: Specialized link extraction functionality
- pdf_generator.py: PDF document creation and formatting
- URL validation and preprocessing
- Content extraction using Trafilatura for clean text
- Link extraction using BeautifulSoup
- Error handling and retry mechanisms
- User-agent spoofing to avoid blocking
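The validation and link-extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `web_scraper.py` / `link_extractor.py` code; the function names are assumptions:

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def normalize_url(raw: str) -> str:
    """Preprocessing step: strip whitespace and prepend a scheme if missing."""
    url = raw.strip()
    if not urlparse(url).scheme:
        url = "https://" + url
    return url


def extract_links(html: str, base_url: str) -> list[str]:
    """Collect absolute link URLs from a page using BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    # Resolve relative hrefs against the page URL so every link is absolute.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```

In the real pipeline, the fetched HTML would also be passed to `trafilatura.extract()` to pull out clean body text, with retries and a spoofed User-Agent header on the `requests.get()` call.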
- ReportLab-based PDF creation
- Custom styling and formatting
- Support for structured content layout
- Error PDF generation for failed scrapes
- User Input: URL submission through web form
- Validation: URL format validation and preprocessing
- Scraping: Content and link extraction from target website
- Presentation: Results displayed in web interface
- PDF Export: On-demand PDF generation from scraped data
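The flow above maps onto a Flask route roughly like this. The route name and the inline `scrape()` stub are illustrative stand-ins for the real handlers in `routes.py`:

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)


def scrape(url: str) -> dict:
    """Stub standing in for web_scraper.py; returns scraped content and links."""
    return {"url": url, "text": "...", "links": []}


@app.route("/scrape", methods=["POST"])
def scrape_route():
    # Validation: reject empty submissions before scraping.
    url = request.form.get("url", "").strip()
    if not url:
        return render_template_string("<p>Please enter a URL.</p>"), 400
    # Scraping, then presentation of the results.
    result = scrape(url)
    return render_template_string(
        "<h1>{{ r.url }}</h1><p>{{ r.links|length }} links found</p>", r=result
    )
```

PDF and CSV export would be separate routes that re-run or reuse the scrape result on demand, since the app is stateless.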
- Flask: Web framework and routing
- BeautifulSoup4: HTML parsing and navigation
- Trafilatura: Clean text extraction from web pages
- ReportLab: PDF document generation
- Requests: HTTP client for web scraping
- Bootstrap 5: CSS framework with dark theme
- Font Awesome: Icon library
- Vanilla JavaScript: Client-side functionality
- Werkzeug: WSGI utilities and development server
- Standard Library: logging, urllib, datetime modules
- Session secrets loaded from environment variables for production security
- ProxyFix middleware for reverse proxy deployments
- Error logging and debugging capabilities
- Session management and security
- Request timeout configurations
- Memory-conscious content processing (500KB limit)
- Graceful error handling and recovery
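Taken together, these deployment concerns suggest an app factory along these lines (the `SESSION_SECRET` variable name and single-hop proxy assumption are mine, not confirmed by the source):

```python
import logging
import os

from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix


def create_app() -> Flask:
    """App factory: env-based secret, reverse-proxy support, debug logging."""
    app = Flask(__name__)
    # The session secret comes from the environment in production;
    # the fallback value here is for local development only.
    app.secret_key = os.environ.get("SESSION_SECRET", "dev-only-secret")
    # Trust X-Forwarded-* headers set by a single reverse proxy hop.
    app.wsgi_app = ProxyFix(app.wsgi_app, x_proto=1, x_host=1)
    logging.basicConfig(level=logging.DEBUG)
    return app
```

Request timeouts and the 500KB content cap live in the scraping layer rather than here, since Flask itself does not enforce outbound limits.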
The application is designed to be easily deployable on various platforms with minimal configuration changes, supporting both development and production environments.
- Fixed XSS Vulnerability: Replaced unsafe innerHTML usage in showAlert function with safe DOM manipulation methods using createElement and textContent
- Removed History Feature: Removed the history viewing page and navigation link per user request
- Removed Database: Completely removed SQLAlchemy and all database dependencies for a simpler, stateless architecture
- Enhanced Security: All user-controlled data displayed in alerts is now properly sanitized through DOM text nodes
- Added Image Scraping: New feature to extract image URLs, titles/alt text, and display images from websites
- Image Upload Feature: New capability to upload multiple images, host them online, and generate PDFs with hosted URLs
- Enhanced PDF Generation: PDFs now include actual images (up to 20) with titles and URLs
- Image Display in Results: Web interface shows thumbnails of found images with direct URL display
- CSV Export with Images: CSV files now include a section for all found images with metadata
- Comprehensive Image Collection: During website scanning, images are collected from all visited pages
- Smart Image Handling: Automatic image format conversion, resizing, and error handling for broken images
- LLM-Friendly Display: All URLs (links and images) now displayed as plain text instead of hyperlinks for better LLM comprehension
- Direct URL Display: Image addresses shown in input fields with copy functionality instead of download links
- Visual Progress Bar: Added animated progress bar with status updates during PDF generation
- Interactive Site Preview: Added modal with iframe to preview websites before PDF generation
- Image Hosting: Uploaded images are now hosted on Cloudinary's professional CDN for permanent, reliable external URLs
- Added CSV Export: Users can now export scraped data as CSV files in addition to PDF
- Added Comprehensive Website Scraping: New feature to scrape entire websites by following internal links
- True Multi-Level Scanning: Scraper now goes 3 levels deep, following ALL internal links found on each page
- Deployment Optimizations:
  - Balanced scanning: max 30 pages, depth 3, 2-minute timeout
  - Increased link collection limit to 5000 unique links
  - Added runtime monitoring and intelligent link filtering
- Enhanced UI: Added dual scraping options (single page vs entire website) with synchronized inputs
- Complete Link Collection: Both PDF and CSV exports include ALL links found across all scanned pages
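The multi-level scanning described above (depth 3, max 30 pages, internal links only) amounts to a breadth-first crawl. This sketch injects the page-fetching function so the traversal logic is visible on its own; the real scraper would pass in its `requests`/BeautifulSoup fetcher and add the runtime timeout:

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, fetch_links, max_pages=30, max_depth=3):
    """Breadth-first crawl of internal links, bounded by page count and depth.

    fetch_links(url) must return the raw hrefs found on that page;
    injecting it keeps this sketch testable without network access.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in fetch_links(url):
            absolute = urljoin(url, href)
            # Follow only internal links that have not been queued yet.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

Because `seen` accumulates every discovered URL (not just visited pages), the same structure supports the "complete link collection" behavior: the exports can report all links found, even on pages skipped by the page limit.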