7 changes: 7 additions & 0 deletions .codex/tracking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Tracking Progress

This file documents ongoing plans and milestones for the project. Update it as new features are implemented or bugs are addressed.

- [ ] Initial documentation written
- [ ] Automated test suite
- [ ] Authentication and access control
19 changes: 19 additions & 0 deletions docs/additional_features.md
@@ -0,0 +1,19 @@
# Suggested Additional Features

## 1. Audit Logging
- **Description**: Record all API actions and downloads with user identifiers.
- **Rationale**: Provides traceability for compliance and troubleshooting.
- **Integration**: Extend the existing logger in `backend/src/utils/logger.py` to write structured logs, possibly to an external system.
- **Potential Impact**: Minimal if logging is asynchronous; storage requirements will increase.
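
A minimal sketch of what structured audit records could look like, using only the standard library. The logger name `audit`, the `audit_event` helper, and the field names are assumptions made for illustration; nothing here reflects the actual contents of `backend/src/utils/logger.py`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        payload.update(getattr(record, "audit", {}))
        return json.dumps(payload)


def build_audit_logger(handler: logging.Handler) -> logging.Logger:
    """Create a dedicated logger so audit records stay separate from app logs."""
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("audit")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def audit_event(logger: logging.Logger, user_id: str, action: str, resource: str) -> None:
    """Record who did what to which resource (hypothetical field names)."""
    logger.info(
        "audit",
        extra={"audit": {"user_id": user_id, "action": action, "resource": resource}},
    )
```

To keep request latency low, the handler passed to `build_audit_logger` could be a `logging.handlers.QueueHandler` paired with a `QueueListener`, so that writes happen on a background thread.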

## 2. Plugin Architecture
- **Description**: Allow custom processors (e.g., summarizers, translators) to be plugged into the clipping pipeline.
- **Rationale**: Increases extensibility and encourages community contributions.
- **Integration**: `WebClipper` could load processors defined via entry points or a config file.
- **Potential Impact**: Adds complexity but keeps core clean when optional features are isolated.
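
One possible shape for the plugin mechanism is a small in-process registry, sketched below. The `register` decorator, `run_pipeline`, and the two example processors are hypothetical; how `WebClipper` would actually call into them is left open.

```python
from typing import Callable, Dict, Iterable

# A processor takes Markdown text and returns transformed Markdown text.
Processor = Callable[[str], str]

_REGISTRY: Dict[str, Processor] = {}


def register(name: str) -> Callable[[Processor], Processor]:
    """Decorator that adds a processor to the registry under `name`."""

    def decorator(func: Processor) -> Processor:
        if name in _REGISTRY:
            raise ValueError(f"processor {name!r} is already registered")
        _REGISTRY[name] = func
        return func

    return decorator


def run_pipeline(text: str, enabled: Iterable[str]) -> str:
    """Apply the enabled processors in order; unknown names fail loudly."""
    for name in enabled:
        try:
            processor = _REGISTRY[name]
        except KeyError:
            raise KeyError(f"unknown processor: {name}") from None
        text = processor(text)
    return text


@register("strip")
def strip_whitespace(text: str) -> str:
    """Remove leading and trailing whitespace."""
    return text.strip()


@register("truncate")
def truncate(text: str) -> str:
    """Stand-in for a real summarizer: keep the first 40 characters."""
    return text[:40]
```

The list of enabled names could come from a config file. Third-party packages could also populate the registry through `importlib.metadata.entry_points`, with a group name chosen by the project.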

## 3. User Authentication
- **Description**: Add login and role-based access to protect private content and settings.
- **Rationale**: Necessary for multi-user deployments or SaaS offerings.
- **Integration**: FastAPI provides OAuth2 helpers (for example, `OAuth2PasswordBearer` in `fastapi.security`); the results and organization endpoints would require an authentication dependency injected with `Depends`.
- **Potential Impact**: Significant changes to API and frontend but improves security.
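
The role-based part of this feature can be sketched independently of the web framework. The `Role` levels, the `User` type, and `require_role` below are all hypothetical names. In a FastAPI application this check would most likely be called from the function that authenticates the request; that wiring is omitted so the sketch runs with the standard library alone.

```python
from dataclasses import dataclass
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical roles, ordered so that higher values include lower ones."""

    VIEWER = 1
    EDITOR = 2
    ADMIN = 3


@dataclass(frozen=True)
class User:
    username: str
    role: Role


class PermissionDenied(Exception):
    """Raised when a user lacks the role an operation requires."""


def require_role(user: User, minimum: Role) -> User:
    """Return the user if their role is sufficient, otherwise raise."""
    if user.role < minimum:
        raise PermissionDenied(
            f"{user.username} needs role {minimum.name}, has {user.role.name}"
        )
    return user
```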
21 changes: 21 additions & 0 deletions docs/features.md
@@ -0,0 +1,21 @@
# Feature List

## Clipping & Processing
- **URL and Sitemap Clipping**: `/clip` endpoint accepts single URLs or sitemap links and processes them through `WebClipper`【F:backend/src/web_clipper.py†L24-L103】.

medium

The file reference backend/src/web_clipper.py links to the entire file, not the specific clipping functionality. Consider narrowing the reference to the relevant lines that implement the clipping logic.

- **Content Extraction**: `ContentProcessor` uses `readability` and `html2text` to convert HTML to cleaned Markdown【F:backend/src/processors/content_processor.py†L1-L77】.

medium

The file reference backend/src/processors/content_processor.py links to the entire file. It would be more helpful to link directly to the extract_content function or the relevant section where HTML is converted to Markdown.

- **Marketing & Duplicate Removal**: `ContentCleaner` filters promotional sections and `SemanticContentCleaner` removes near-duplicate sections using embeddings【F:backend/src/utils/content_cleaner.py†L1-L46】【F:backend/src/utils/deduplication.py†L1-L23】.

medium

Consider adding specific line numbers for the ContentCleaner and SemanticContentCleaner references to guide readers directly to the relevant code sections.


## File Management
- **Markdown & PDF Output**: Processed content is saved via `FileManager`, generating styled Markdown and optional PDF files【F:backend/src/utils/file_manager.py†L1-L99】.

medium

The file reference backend/src/utils/file_manager.py links to the entire file. It would be more helpful to link directly to the functions responsible for saving Markdown and PDF files.

- **Upload Local Files**: `/upload_file` endpoint stores user-uploaded files for processing later【F:backend/src/main.py†L61-L75】.

medium

The file reference backend/src/main.py links to the entire file. It would be more helpful to link directly to the /upload_file endpoint definition.


## Organization & Metadata
- **Tagging and Organizations**: Results store optional tags and organization IDs, managed through `/organizations` and `/tags` endpoints【F:backend/src/main.py†L79-L115】【F:backend/src/main.py†L134-L151】.

medium

The file reference backend/src/main.py links to the entire file. It would be more helpful to link directly to the /organizations and /tags endpoint definitions.

- **Result CRUD**: Endpoints to list, update, delete, and download clipped results【F:backend/src/main.py†L83-L158】.

medium

The file reference backend/src/main.py links to the entire file. It would be more helpful to link directly to the CRUD endpoint definitions.

- **Statistics API**: `/stats` returns totals for clips, organizations, active projects, and storage usage【F:backend/src/main.py†L160-L162】【F:backend/src/database.py†L73-L119】.

medium

The file reference backend/src/main.py links to the entire file. It would be more helpful to link directly to the /stats endpoint definition.


## Frontend Features
- **Upload Workflow**: Drag-and-drop or URL input interface for clipping content, with preview dialog on success【F:frontend/src/pages/Upload.tsx†L1-L214】.
- **Results Management**: Search, filter, pagination, edit, and download options for clipped documents【F:frontend/src/pages/Results.tsx†L1-L207】【F:frontend/src/pages/Results.tsx†L200-L292】.
- **Organization Dashboard**: Create, edit, and delete organizations, including basic stats display【F:frontend/src/pages/Organizations.tsx†L1-L199】.
- **User Settings**: Preferences for default formats, storage location, and appearance are editable in the settings page【F:frontend/src/pages/Settings.tsx†L1-L182】.
21 changes: 21 additions & 0 deletions docs/improvements.md
@@ -0,0 +1,21 @@
# Opportunities for Improvement

## Code Quality and Organization
- **Duplicate Main Scripts**: The root-level `main.py` is a commented-out prototype that duplicates the role of `backend/src/main.py`; remove it or move it to an examples directory for clarity【F:main.py†L1-L66】.
- **Testing Coverage**: No automated tests are present. Adding unit tests for utilities and API endpoints would greatly improve reliability.
- **Configuration Management**: Sensitive paths and constants are defined directly in `backend/config.py`. Consider using environment variables or a settings library to support multiple environments【F:backend/config.py†L1-L33】.
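
As a rough illustration of the configuration point, settings can be read from environment variables with defaults as a fallback. The variable names (`CLIPPER_OUTPUT_DIR`, `CLIPPER_DB_PATH`, `CLIPPER_ENABLE_PDF`) and the `Settings` fields are invented for this sketch and do not correspond to the real contents of `backend/config.py`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings; the real fields depend on the project's config."""

    output_dir: Path
    database_path: Path
    enable_pdf: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    truthy = {"1", "true", "yes"}
    return Settings(
        output_dir=Path(env.get("CLIPPER_OUTPUT_DIR", "output")),
        database_path=Path(env.get("CLIPPER_DB_PATH", "clips.db")),
        enable_pdf=env.get("CLIPPER_ENABLE_PDF", "true").lower() in truthy,
    )
```

Accepting the environment as a parameter keeps the function easy to test, since a plain dictionary can stand in for `os.environ`.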

## Dependency Management
- **Large Model Dependencies**: The backend installs heavy NLP models like `en_core_web_lg` via `requirements.txt`, increasing build times. Evaluate whether smaller models suffice or provide an option to skip installation when not needed【F:backend/requirements.txt†L23-L27】.
- **wkhtmltopdf Runtime**: PDF generation relies on the `wkhtmltopdf` binary installed in the Docker image. Ensure deployments include this dependency and handle errors when unavailable【F:backend/Dockerfile†L1-L19】.
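
One way to handle a missing `wkhtmltopdf` binary is to check for it before attempting PDF output, as in the sketch below. `PdfUnavailableError` and `choose_output_formats` are hypothetical helpers; only `shutil.which` is a real standard-library call.

```python
import shutil
from typing import List, Optional


class PdfUnavailableError(RuntimeError):
    """Raised when PDF output is requested but the binary is not installed."""


def find_wkhtmltopdf(binary: str = "wkhtmltopdf") -> Optional[str]:
    """Return the full path to the binary, or None if it is not on PATH."""
    return shutil.which(binary)


def choose_output_formats(want_pdf: bool, binary: str = "wkhtmltopdf") -> List[str]:
    """Decide which formats can be produced, failing early with a clear error."""
    formats = ["markdown"]
    if want_pdf:
        if find_wkhtmltopdf(binary) is None:
            raise PdfUnavailableError(
                f"{binary} was not found on PATH; install it or disable PDF output"
            )
        formats.append("pdf")
    return formats
```

A caller could catch `PdfUnavailableError` and fall back to Markdown-only output, or surface the message to the user, depending on how strict the deployment should be.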

## Backend Design
- **Database Abstraction**: `database.py` directly builds SQL queries with SQLite. Introducing an ORM (e.g., SQLAlchemy) would improve maintainability and make migrations easier【F:backend/src/database.py†L1-L149】.
- **Asynchronous Fetching**: `WebClipper` fetches URLs with a new `aiohttp` session per call. Reusing sessions or implementing connection pooling could improve performance【F:backend/src/web_clipper.py†L25-L37】.

## Frontend Enhancements
- **Form Validation**: Upload and organization forms currently lack validation for required fields. Adding client-side and server-side validation would prevent invalid data.
- **Accessibility**: Ensure components meet accessibility standards (ARIA labels, keyboard navigation) for a wider range of users.

## Documentation
- **README**: The project README is minimal. Expanding it with setup instructions and links to these docs would help newcomers.
50 changes: 50 additions & 0 deletions docs/overview.md
@@ -0,0 +1,50 @@
# Codebase Overview

## Purpose and Scope
This repository contains a web application that aggregates website content and converts it into downloadable documentation. Users can clip single pages or entire sitemap structures and store the results with metadata. The project includes a FastAPI backend, a React frontend, and a set of utilities for crawling, cleaning, and exporting content.

## Architecture Diagram
```mermaid
flowchart LR
A["Frontend (React)"] -->|REST API| B(Backend FastAPI)
B --> C[WebClipper]
C --> D[Content Processor]
C --> E[File Manager]
C --> F[Input Handler]
D --> G[Content Cleaner]
D --> H[Semantic Deduper]
B --> I[SQLite DB]
E --> J[Markdown/PDF files]
```

## Key Modules
- `backend/src/main.py`: FastAPI application with endpoints for clipping content and for managing results and organizations.
- `backend/src/web_clipper.py`: Orchestrates fetching, processing, and saving content.
- `backend/src/processors/content_processor.py`: Extracts readable content from HTML and converts it to Markdown.
- `backend/src/utils`: Helper utilities such as the crawler, sitemap parser, input handler, deduplication, and file manager.
- `frontend/src`: React application providing pages for upload, results, organizations, and settings.
- `backend/src/database.py`: Simple SQLite wrapper for persisting clips and organizations.

## Directory Structure
```
/ (root)
├── backend/ # FastAPI service and utilities
│ ├── src/
│ │ ├── processors/
│ │ ├── utils/
│ │ ├── database.py
│ │ └── main.py
│ ├── Dockerfile
│ └── requirements.txt
├── frontend/ # React user interface
│ ├── src/
│ ├── Dockerfile
│ └── package.json
├── docker-compose.yml
└── main.py # legacy CLI prototype
```

## Dependencies & Prerequisites
- **Backend**: Python 3.10+, FastAPI, Uvicorn, wkhtmltopdf for PDF generation.
- **Frontend**: Node.js 18+, React with Material UI.
- Docker is optional for running services via `docker-compose.yml`.