RAG Format

Overview

This Flask-based web application converts web content into an AI-ready format for RAG (Retrieval-Augmented Generation) systems. It extracts links, content, and images from websites and generates structured PDF and CSV documents suited to AI applications and LLM consumption.

User Preferences

Preferred communication style: Simple, everyday language.

System Architecture

Backend Architecture

Framework: Flask (Python web framework)
Web Scraping: Combination of BeautifulSoup, requests, and Trafilatura for content extraction
PDF Generation: ReportLab for creating PDF documents
Deployment: WSGI-compatible with ProxyFix middleware for reverse proxy support

Frontend Architecture

Template Engine: Jinja2 (Flask's default)
CSS Framework: Bootstrap 5 with dark theme support
JavaScript: Vanilla JavaScript for form validation and UI interactions
Icons: Font Awesome for iconography

Key Components

Core Modules

app.py: Application factory and configuration
routes.py: Flask route handlers for web endpoints
web_scraper.py: Main scraping logic using multiple libraries
link_extractor.py: Specialized link extraction functionality
pdf_generator.py: PDF document creation and formatting

Web Scraping Pipeline

URL validation and preprocessing
Content extraction using Trafilatura for clean text
Link extraction using BeautifulSoup
Error handling and retry mechanisms
User-agent spoofing to avoid blocking
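
As an illustration of the first pipeline step, here is a minimal sketch of URL validation and preprocessing using only the standard library. The function name `normalize_url`, the choice to default bare hosts to https, and the fragment stripping are assumptions made for this example, not the project's actual behavior.

```python
from urllib.parse import urlparse, urlunparse


def normalize_url(raw: str) -> str:
    """Return a cleaned http(s) URL, or raise ValueError if it is unusable."""
    candidate = raw.strip()
    if not candidate:
        raise ValueError("URL is empty")
    # Assumption: a bare host such as "example.com" is treated as https.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parts = urlparse(candidate)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    # Drop the fragment: it is never sent to the server, so it adds nothing for scraping.
    return urlunparse(parts._replace(fragment=""))
```

Rejecting anything other than http and https up front keeps later stages (fetching, link extraction) from having to handle schemes such as `ftp:` or `file:`.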

PDF Generation

ReportLab-based PDF creation
Custom styling and formatting
Support for structured content layout
Error PDF generation for failed scrapes

Data Flow

User Input: URL submission through web form
Validation: URL format validation and preprocessing
Scraping: Content and link extraction from target website
Presentation: Results displayed in web interface
PDF Export: On-demand PDF generation from scraped data

External Dependencies

Python Libraries

Flask: Web framework and routing
BeautifulSoup4: HTML parsing and navigation
Trafilatura: Clean text extraction from web pages
ReportLab: PDF document generation
Requests: HTTP client for web scraping

Frontend Dependencies

Bootstrap 5: CSS framework with dark theme
Font Awesome: Icon library
Vanilla JavaScript: Client-side functionality

Development Dependencies

Werkzeug: WSGI utilities and development server
Standard Library: logging, urllib, datetime modules

Deployment Strategy

Configuration

Environment-based configuration for session secrets, so production deployments can set their own value
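
A minimal sketch of reading the session secret from the environment might look like the following. The variable name `SESSION_SECRET` and the helper `load_session_secret` are assumptions for illustration; the project may use a different name or fall back to a default.

```python
import os
from typing import Mapping, Optional


def load_session_secret(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the session secret from the environment, failing loudly if unset."""
    env = os.environ if env is None else env
    # SESSION_SECRET is an assumed variable name, not taken from the project.
    secret = env.get("SESSION_SECRET", "").strip()
    if not secret:
        raise RuntimeError("SESSION_SECRET must be set")
    return secret
```

In a Flask app the result would typically be assigned to `app.secret_key`. Failing at startup when the value is missing avoids silently running production with a predictable secret.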

Production Considerations

ProxyFix middleware for reverse proxy deployments
Error logging and debugging capabilities
Session management and security

Scalability Features

Request timeout configurations
Memory-conscious content processing (500KB limit)
Graceful error handling and recovery
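
The 500KB limit can be enforced while a response is still streaming, so an oversized page is never held in memory in full. The sketch below shows one way to do that; `read_capped` is a hypothetical helper, and whether the project truncates or rejects oversized pages is not stated here, so truncation is an assumption.

```python
from typing import Iterable

MAX_CONTENT_BYTES = 500 * 1024  # the 500KB limit mentioned above


def read_capped(chunks: Iterable[bytes], limit: int = MAX_CONTENT_BYTES) -> bytes:
    """Collect streamed chunks, stopping once `limit` bytes have been kept."""
    buffer = bytearray()
    for chunk in chunks:
        remaining = limit - len(buffer)
        if remaining <= 0:
            break  # cap reached: stop consuming the stream
        buffer.extend(chunk[:remaining])
    return bytes(buffer)
```

With Requests, the chunks could come from `response.iter_content(chunk_size=8192)` on a response opened with `stream=True` and a `timeout`, which also covers the request timeout point above.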

The application is designed to be easily deployable on various platforms with minimal configuration changes, supporting both development and production environments.

Recent Changes

August 9, 2025 - Security Fix, History Feature Removal, and Database Removal

Fixed XSS Vulnerability: Replaced unsafe innerHTML usage in showAlert function with safe DOM manipulation methods using createElement and textContent
Removed History Feature: Removed the history viewing page and navigation link per user request
Removed Database: Completely removed SQLAlchemy and all database dependencies for a simpler, stateless architecture
Enhanced Security: All user-controlled data displayed in alerts is now properly sanitized through DOM text nodes

August 9, 2025 - Image Scraping, Upload Feature, and LLM-Friendly Display Updates

Added Image Scraping: New feature to extract image URLs, titles/alt text, and display images from websites
Image Upload Feature: New capability to upload multiple images, host them online, and generate PDFs with hosted URLs
Enhanced PDF Generation: PDFs now include actual images (up to 20) with titles and URLs
Image Display in Results: Web interface shows thumbnails of found images with direct URL display
CSV Export with Images: CSV files now include a section for all found images with metadata
Comprehensive Image Collection: During website scanning, images are collected from all visited pages
Smart Image Handling: Automatic image format conversion, resizing, and error handling for broken images
LLM-Friendly Display: All URLs (links and images) now displayed as plain text instead of hyperlinks for better LLM comprehension
Direct URL Display: Image addresses shown in input fields with copy functionality instead of download links
Visual Progress Bar: Added animated progress bar with status updates during PDF generation
Interactive Site Preview: Added modal with iframe to preview websites before PDF generation
Image Hosting: Uploaded images are now hosted on Cloudinary's professional CDN for permanent, reliable external URLs
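
To make the CSV export with images concrete, here is a small sketch that writes links and images into one CSV with a column marking which section each row belongs to. The column layout (`section`, `url`, `title`) and the function name `build_csv` are assumptions; the actual export format may differ.

```python
import csv
import io
from typing import Dict, List


def build_csv(links: List[str], images: List[Dict[str, str]]) -> str:
    """Return CSV text with one row per link, then one row per image."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["section", "url", "title"])  # assumed header
    for url in links:
        writer.writerow(["link", url, ""])
    for image in images:
        # Title falls back to an empty string when no alt text was found.
        writer.writerow(["image", image["url"], image.get("title", "")])
    return out.getvalue()
```

Using the `csv` module rather than joining strings by hand means titles that contain commas or quotes are escaped correctly.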

July 26, 2025 - Major Feature Additions and Deployment Optimizations

Added CSV Export: Users can now export scraped data as CSV files in addition to PDF
Added Comprehensive Website Scraping: New feature to scrape entire websites by following internal links
True Multi-Level Scanning: Scraper now goes 3 levels deep, following ALL internal links found on each page
Deployment Optimizations:
- Balanced scanning: max 30 pages, depth 3, 2-minute timeout
- Increased link collection limit to 5000 unique links
- Added runtime monitoring and intelligent link filtering
Enhanced UI: Added dual scraping options (single page vs entire website) with synchronized inputs
Complete Link Collection: Both PDF and CSV exports include ALL links found across all scanned pages
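
The scanning limits above (30 pages, depth 3, 2-minute timeout, 5000 links) can be pictured as a bounded breadth-first crawl. The sketch below is an illustration under stated assumptions, not the project's scraper: it uses the standard library's `html.parser` in place of BeautifulSoup so it has no dependencies, takes a caller-supplied `fetch` function instead of making HTTP requests itself, and treats the start page as depth 0.

```python
import time
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List, Set
from urllib.parse import urldefrag, urljoin, urlparse

MAX_PAGES, MAX_DEPTH, MAX_SECONDS, MAX_LINKS = 30, 3, 120, 5000


class _HrefCollector(HTMLParser):
    """Collects href values from <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def crawl(start_url: str, fetch: Callable[[str], str]) -> dict:
    """Breadth-first crawl of internal links within the limits defined above."""
    host = urlparse(start_url).netloc
    deadline = time.monotonic() + MAX_SECONDS
    queue = deque([(start_url, 0)])
    visited: Set[str] = set()
    queued: Set[str] = {start_url}
    links: Set[str] = set()
    while queue and len(visited) < MAX_PAGES and time.monotonic() < deadline:
        url, depth = queue.popleft()
        visited.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # a failed page should not abort the whole scan
        parser = _HrefCollector()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if len(links) < MAX_LINKS:
                links.add(absolute)
            internal = urlparse(absolute).netloc == host
            if internal and depth < MAX_DEPTH and absolute not in queued:
                queued.add(absolute)
                queue.append((absolute, depth + 1))
    return {"pages": sorted(visited), "links": sorted(links)}
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, so the limits can be exercised against an in-memory dictionary of pages.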
