Add documentation overview and improvement notes #2
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| # Tracking Progress | ||
|
|
||
| This file documents ongoing plans and milestones for the project. Update it as new features are implemented or bugs are addressed. | ||
|
|
||
| - [ ] Initial documentation written | ||
| - [ ] Automated test suite | ||
| - [ ] Authentication and access control |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| # Suggested Additional Features | ||
|
|
||
| ## 1. Audit Logging | ||
| - **Description**: Record all API actions and downloads with user identifiers. | ||
| - **Rationale**: Provides traceability for compliance and troubleshooting. | ||
| - **Integration**: Extend the existing logger in `backend/src/utils/logger.py` to write structured logs, possibly to an external system. | ||
| - **Potential Impact**: Minimal if logging is asynchronous; storage requirements will increase. | ||
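The structured-log idea above could look roughly like the following. This is a minimal sketch using only the standard library; the `audit` logger name and the `user_id` and `resource` fields are assumptions for illustration and do not come from `backend/src/utils/logger.py`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonAuditFormatter(logging.Formatter):
    """Render each audit record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "action": record.getMessage(),
            # Values passed through `extra=` become attributes on the record.
            # Both field names are hypothetical.
            "user_id": getattr(record, "user_id", None),
            "resource": getattr(record, "resource", None),
        }
        return json.dumps(entry)


def build_audit_logger(handler: logging.Handler) -> logging.Logger:
    """Attach the JSON formatter to a dedicated (hypothetical) 'audit' logger.

    Call this once at startup; each call adds another handler.
    """
    handler.setFormatter(JsonAuditFormatter())
    logger = logging.getLogger("audit")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep audit lines out of the application log
    return logger
```

A call such as `build_audit_logger(logging.StreamHandler()).info("download", extra={"user_id": "u-123", "resource": "result/42"})` would emit one JSON line. To keep the I/O off the request path, the handler could be wrapped with the standard library's `logging.handlers.QueueHandler` and `QueueListener`.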
|
|
||
| ## 2. Plugin Architecture | ||
| - **Description**: Allow custom processors (e.g., summarizers, translators) to be plugged into the clipping pipeline. | ||
| - **Rationale**: Increases extensibility and encourages community contributions. | ||
| - **Integration**: `WebClipper` could load processors defined via entry points or a config file. | ||
| - **Potential Impact**: Adds complexity but keeps core clean when optional features are isolated. | ||
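One way the processor loading described above might be sketched is a small name-to-callable registry. Everything here is hypothetical: the registry, the decorator, and both example processors are illustrative names, and the source does not show how `WebClipper` would actually invoke them.

```python
from typing import Callable, Dict, List

# A processor takes Markdown text and returns transformed Markdown text.
Processor = Callable[[str], str]

# Hypothetical registry; nothing like it is shown in the project source.
_PROCESSORS: Dict[str, Processor] = {}


def register_processor(name: str) -> Callable[[Processor], Processor]:
    """Decorator that records a processor under a unique name."""

    def decorator(func: Processor) -> Processor:
        if name in _PROCESSORS:
            raise ValueError(f"processor {name!r} is already registered")
        _PROCESSORS[name] = func
        return func

    return decorator


def run_pipeline(text: str, enabled: List[str]) -> str:
    """Apply the enabled processors in order; unknown names fail loudly."""
    for name in enabled:
        if name not in _PROCESSORS:
            raise KeyError(f"unknown processor {name!r}")
        text = _PROCESSORS[name](text)
    return text


@register_processor("strip_trailing_whitespace")
def strip_trailing_whitespace(text: str) -> str:
    """Example processor: remove trailing whitespace from every line."""
    return "\n".join(line.rstrip() for line in text.splitlines())


@register_processor("truncate")
def truncate(text: str, limit: int = 200) -> str:
    """Example processor: keep only the first `limit` characters."""
    return text[:limit]
```

The `enabled` list could come from a config file. If third-party packages should contribute processors, the registry could instead be filled from `importlib.metadata.entry_points()`, at the cost of a packaging convention that contributors must follow.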
|
|
||
| ## 3. User Authentication | ||
| - **Description**: Add login and role-based access to protect private content and settings. | ||
| - **Rationale**: Necessary for multi-user deployments or SaaS offerings. | ||
| - **Integration**: FastAPI provides OAuth2 helpers in `fastapi.security`; the results and organization endpoints would need an authentication dependency (declared with `Depends`). | ||
| - **Potential Impact**: Significant changes to API and frontend but improves security. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| # Feature List | ||
|
|
||
| ## Clipping & Processing | ||
| - **URL and Sitemap Clipping**: `/clip` endpoint accepts single URLs or sitemap links and processes them through `WebClipper`【F:backend/src/web_clipper.py†L24-L103】. | ||
| - **Content Extraction**: `ContentProcessor` uses `readability` and `html2text` to convert HTML to cleaned Markdown【F:backend/src/processors/content_processor.py†L1-L77】. | ||
| - **Marketing & Duplicate Removal**: `ContentCleaner` filters promotional sections and `SemanticContentCleaner` removes near-duplicate sections using embeddings【F:backend/src/utils/content_cleaner.py†L1-L46】【F:backend/src/utils/deduplication.py†L1-L23】. | ||
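To illustrate embedding-based near-duplicate removal as described in the bullet above, here is a minimal sketch. It is not the logic in `backend/src/utils/deduplication.py`: the function names, the greedy strategy, and the `0.9` threshold are all assumptions, and the embeddings are taken as already computed.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(
    sections: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cut-off; tune per embedding model
) -> List[str]:
    """Greedily keep a section only if it is not too similar to any kept one."""
    kept_sections: List[str] = []
    kept_vectors: List[Sequence[float]] = []
    for section, vector in zip(sections, embeddings):
        if all(cosine_similarity(vector, k) < threshold for k in kept_vectors):
            kept_sections.append(section)
            kept_vectors.append(vector)
    return kept_sections
```

The greedy pass compares each section against every kept section, so it is quadratic in the worst case. That is usually acceptable for the section count of a single page.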
|
|
||
| ## File Management | ||
| - **Markdown & PDF Output**: Processed content is saved via `FileManager`, generating styled Markdown and optional PDF files【F:backend/src/utils/file_manager.py†L1-L99】. | ||
| - **Upload Local Files**: `/upload_file` endpoint stores user-uploaded files for processing later【F:backend/src/main.py†L61-L75】. | ||
|
|
||
| ## Organization & Metadata | ||
| - **Tagging and Organizations**: Results store optional tags and organization IDs, managed through `/organizations` and `/tags` endpoints【F:backend/src/main.py†L79-L115】【F:backend/src/main.py†L134-L151】. | ||
| - **Result CRUD**: Endpoints to list, update, delete, and download clipped results【F:backend/src/main.py†L83-L158】. | ||
| - **Statistics API**: `/stats` returns totals for clips, organizations, active projects, and storage usage【F:backend/src/main.py†L160-L162】【F:backend/src/database.py†L73-L119】. | ||
|
|
||
| ## Frontend Features | ||
| - **Upload Workflow**: Drag-and-drop or URL input interface for clipping content, with preview dialog on success【F:frontend/src/pages/Upload.tsx†L1-L214】. | ||
| - **Results Management**: Search, filter, pagination, edit, and download options for clipped documents【F:frontend/src/pages/Results.tsx†L1-L207】【F:frontend/src/pages/Results.tsx†L200-L292】. | ||
| - **Organization Dashboard**: Create, edit, and delete organizations, including basic stats display【F:frontend/src/pages/Organizations.tsx†L1-L199】. | ||
| - **User Settings**: Preferences for default formats, storage location, and appearance are editable in the settings page【F:frontend/src/pages/Settings.tsx†L1-L182】. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| # Opportunities for Improvement | ||
|
|
||
| ## Code Quality and Organization | ||
| - **Duplicate Main Scripts**: There is a commented-out prototype in `main.py` that can be removed or moved to examples for clarity【F:main.py†L1-L66】. | ||
| - **Testing Coverage**: No automated tests are present. Adding unit tests for utilities and API endpoints would greatly improve reliability. | ||
| - **Configuration Management**: Sensitive paths and constants are defined directly in `config.py`. Consider using environment variables or a settings library to support multiple environments【F:backend/config.py†L1-L33】. | ||
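The environment-variable approach suggested for configuration could be sketched as below. The setting names, variable names, and defaults are hypothetical, since the contents of `backend/config.py` are not shown here.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Runtime settings; all three fields are illustrative assumptions."""

    output_dir: Path
    database_path: Path
    enable_pdf: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to local defaults."""
    pdf_flag = env.get("CLIPPER_ENABLE_PDF", "true").strip().lower()
    return Settings(
        output_dir=Path(env.get("CLIPPER_OUTPUT_DIR", "./output")),
        database_path=Path(env.get("CLIPPER_DB_PATH", "./clipper.db")),
        enable_pdf=pdf_flag in {"1", "true", "yes"},
    )
```

Because `load_settings` takes the mapping as a parameter, tests can pass a plain dictionary without touching the real environment.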
|
|
||
| ## Dependency Management | ||
| - **Large Model Dependencies**: The backend installs heavy NLP models like `en_core_web_lg` via `requirements.txt`, increasing build times. Evaluate whether smaller models suffice or provide an option to skip installation when not needed【F:backend/requirements.txt†L23-L27】. | ||
| - **wkhtmltopdf Runtime**: PDF generation relies on the `wkhtmltopdf` binary installed in the Docker image. Ensure deployments include this dependency and handle errors when unavailable【F:backend/Dockerfile†L1-L19】. | ||
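Handling a missing `wkhtmltopdf` binary could start with a lookup on `PATH` before any PDF work begins. This is a minimal sketch with the standard library's `shutil.which`; the exception class and function names are hypothetical and are not part of `FileManager`.

```python
import shutil
from typing import Optional


class PdfBackendUnavailable(RuntimeError):
    """Raised when the external PDF renderer cannot be found (hypothetical)."""


def find_wkhtmltopdf(binary: str = "wkhtmltopdf") -> Optional[str]:
    """Return the absolute path of the binary, or None if it is not on PATH."""
    return shutil.which(binary)


def require_wkhtmltopdf(binary: str = "wkhtmltopdf") -> str:
    """Return the binary path or raise an error with an actionable message."""
    path = find_wkhtmltopdf(binary)
    if path is None:
        raise PdfBackendUnavailable(
            f"{binary!r} was not found on PATH; install it or disable PDF output"
        )
    return path
```

A caller could catch `PdfBackendUnavailable` and fall back to Markdown-only output instead of failing the whole clip.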
|
|
||
| ## Backend Design | ||
| - **Database Abstraction**: `database.py` directly builds SQL queries with SQLite. Introducing an ORM (e.g., SQLAlchemy) would improve maintainability and make migrations easier【F:backend/src/database.py†L1-L149】. | ||
| - **Asynchronous Fetching**: `WebClipper` fetches URLs with a new `aiohttp` session per call. Reusing a single `ClientSession`, which pools connections internally, could improve performance【F:backend/src/web_clipper.py†L25-L37】. | ||
|
|
||
| ## Frontend Enhancements | ||
| - **Form Validation**: Upload and organization forms currently lack validation for required fields. Adding client-side and server-side validation would prevent invalid data. | ||
| - **Accessibility**: Ensure components meet accessibility standards (ARIA labels, keyboard navigation) for a wider range of users. | ||
|
|
||
| ## Documentation | ||
| - The project README is minimal. Expanding it with setup instructions and linking to these docs would help newcomers. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| # Codebase Overview | ||
|
|
||
| ## Purpose and Scope | ||
| This repository contains a web application that aggregates website content and converts it into downloadable documentation. Users can clip single pages or entire sitemap structures and store the results with metadata. The project includes a FastAPI backend, a React frontend, and a set of utilities for crawling, cleaning, and exporting content. | ||
|
|
||
| ## Architecture Diagram | ||
| ```mermaid | ||
| flowchart LR | ||
| A["Frontend (React)"] -->|REST API| B(Backend FastAPI) | ||
| B --> C[WebClipper] | ||
| C --> D[Content Processor] | ||
| C --> E[File Manager] | ||
| C --> F[Input Handler] | ||
| D --> G[Content Cleaner] | ||
| D --> H[Semantic Deduper] | ||
| B --> I[SQLite DB] | ||
| E --> J[Markdown/PDF files] | ||
| ``` | ||
|
|
||
| ## Key Modules | ||
| - `backend/src/main.py`: FastAPI application with endpoints for clipping content and for managing results and organizations. | ||
| - `backend/src/web_clipper.py`: Orchestrates fetching, processing, and saving content. | ||
| - `backend/src/processors/content_processor.py`: Extracts readable content from HTML and converts it to Markdown. | ||
| - `backend/src/utils`: Helper utilities such as the crawler, sitemap parser, input handler, deduplication, and file manager. | ||
| - `frontend/src`: React application providing pages for upload, results, organizations, and settings. | ||
| - `backend/src/database.py`: Simple SQLite wrapper for persisting clips and organizations. | ||
|
|
||
| ## Directory Structure | ||
| ``` | ||
| / (root) | ||
| ├── backend/ # FastAPI service and utilities | ||
| │ ├── src/ | ||
| │ │ ├── processors/ | ||
| │ │ ├── utils/ | ||
| │ │ ├── database.py | ||
| │ │ ├── main.py | ||
| │ │ └── web_clipper.py | ||
| │ ├── config.py | ||
| │ ├── Dockerfile | ||
| │ └── requirements.txt | ||
| ├── frontend/ # React user interface | ||
| │ ├── src/ | ||
| │ ├── Dockerfile | ||
| │ └── package.json | ||
| ├── docker-compose.yml | ||
| └── main.py # legacy CLI prototype | ||
| ``` | ||
|
|
||
| ## Dependencies & Prerequisites | ||
| - **Backend**: Python 3.10+, FastAPI, Uvicorn, wkhtmltopdf for PDF generation. | ||
| - **Frontend**: Node.js 18+, React with Material UI. | ||
| - Docker is optional for running services via `docker-compose.yml`. |
The file reference `backend/src/web_clipper.py` links to the entire file, not the specific clipping functionality. Consider narrowing the reference to the relevant lines that implement the clipping logic.