30 changes: 11 additions & 19 deletions backend/src/web_clipper.py
@@ -69,21 +69,16 @@ async def clip(
             tags=tags,
         )
 
-        base_filename = self._generate_filename(
-            "aggregated_document", timestamp
-        )
-        markdown_filename = f"{base_filename}.md"
-        markdown_path = os.path.join(self.output_dir, markdown_filename)
-        with open(markdown_path, "w", encoding="utf-8") as f:
-            f.write(final_doc)
-
-        pdf_filename = f"{base_filename}.pdf"  # PDF stub
+        markdown_info = self.file_manager.save_markdown(
+            final_doc, timestamp, url
+        )
+        pdf_info = self.file_manager.save_pdf(final_doc, timestamp, url)
Comment on lines +72 to +75
Severity: high

The save_markdown and save_pdf methods perform synchronous file I/O, which will block the asyncio event loop when called from an async function like clip. This can degrade application performance and responsiveness, especially as PDF generation can be a slow process. Consider executing these blocking calls in a thread pool executor. You can use asyncio.gather to run them concurrently for better performance.

This same issue exists in the single-URL processing logic block as well (lines 110-113).

```python
loop = asyncio.get_running_loop()
markdown_info, pdf_info = await asyncio.gather(
    loop.run_in_executor(
        None, self.file_manager.save_markdown, final_doc, timestamp, url
    ),
    loop.run_in_executor(
        None, self.file_manager.save_pdf, final_doc, timestamp, url
    ),
)
```
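On Python 3.9+, the same pattern can be written more compactly with `asyncio.to_thread`, which wraps `run_in_executor` with the default thread pool. A minimal, self-contained sketch of the pattern — the `FileManager` stub below is hypothetical and only stands in for the real class, whose `save_markdown`/`save_pdf` perform actual blocking file I/O:

```python
import asyncio


class FileManager:
    # Hypothetical stand-in for the real FileManager; the actual methods
    # perform blocking file I/O (and slow PDF rendering).
    def save_markdown(self, doc: str, timestamp: str, url: str) -> dict:
        return {"relative_path": f"{timestamp}.md"}

    def save_pdf(self, doc: str, timestamp: str, url: str) -> dict:
        return {"relative_path": f"{timestamp}.pdf"}


async def save_artifacts(fm: FileManager, final_doc: str, timestamp: str, url: str):
    # Offload both blocking calls to worker threads and await them concurrently,
    # so the event loop stays free to serve other requests.
    return await asyncio.gather(
        asyncio.to_thread(fm.save_markdown, final_doc, timestamp, url),
        asyncio.to_thread(fm.save_pdf, final_doc, timestamp, url),
    )


md_info, pdf_info = asyncio.run(
    save_artifacts(FileManager(), "# doc", "20240101-120000", "https://example.com")
)
print(md_info["relative_path"], pdf_info["relative_path"])
```

`asyncio.to_thread` and `loop.run_in_executor(None, ...)` are equivalent here; the explicit-executor form is useful if a dedicated `ThreadPoolExecutor` should be shared across the clipper.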


         return {
             "title": "Aggregated content",
             "url": url,
-            "markdown_path": markdown_filename,
-            "pdf_path": pdf_filename,
+            "markdown_path": markdown_info["relative_path"],
+            "pdf_path": pdf_info["relative_path"],
             "timestamp": timestamp,
             "status": "completed",
             "preview": (
@@ -112,19 +107,16 @@ async def clip(
             tags=tags,
         )
 
-        base_filename = self._generate_filename(title, timestamp)
-        markdown_filename = f"{base_filename}.md"
-        markdown_path = os.path.join(self.output_dir, markdown_filename)
-        with open(markdown_path, "w", encoding="utf-8") as f:
-            f.write(final_doc)
-
-        pdf_filename = f"{base_filename}.pdf"  # PDF stub
+        markdown_info = self.file_manager.save_markdown(
+            final_doc, timestamp, url
+        )
+        pdf_info = self.file_manager.save_pdf(final_doc, timestamp, url)
Comment on lines +110 to +113
Severity: medium

This block of code for saving files is nearly identical to the one for handling sitemaps on lines 72-75. To improve maintainability and adhere to the Don't Repeat Yourself (DRY) principle, consider refactoring this duplicated logic into a private helper method. This would also be a good place to apply the fix for the blocking I/O issue mentioned in the other comment, solving it in one central place.

```python
async def _save_artifacts_and_get_result(
    self, final_doc: str, timestamp: str, url: str, title: str
) -> dict:
    loop = asyncio.get_running_loop()
    markdown_info, pdf_info = await asyncio.gather(
        loop.run_in_executor(
            None, self.file_manager.save_markdown, final_doc, timestamp, url
        ),
        loop.run_in_executor(
            None, self.file_manager.save_pdf, final_doc, timestamp, url
        ),
    )

    return {
        "title": title,
        "url": url,
        "markdown_path": markdown_info["relative_path"],
        "pdf_path": pdf_info["relative_path"],
        "timestamp": timestamp,
        "status": "completed",
        "preview": (
            final_doc[:500] + "..." if len(final_doc) > 500 else final_doc
        ),
    }
```


         return {
             "title": title,
             "url": url,
-            "markdown_path": markdown_filename,
-            "pdf_path": pdf_filename,
+            "markdown_path": markdown_info["relative_path"],
+            "pdf_path": pdf_info["relative_path"],
             "timestamp": timestamp,
             "status": "completed",
             "preview": (
2 changes: 1 addition & 1 deletion docs/PLAN.md
@@ -20,7 +20,7 @@
 - **Phase 5**: Prepare production deployment and monitoring setup.
 
 ## Code Review Findings
-- **PDF generation** is stubbed in `WebClipper.clip` and needs integration with `FileManager.save_pdf`.
+- **PDF generation** is now integrated using `FileManager.save_pdf`.
 - **Marketing detection** (`utils/marketing_detector.py`) is unused; integrate into content processing to filter promotional sections.
 - **Semantic deduplication** (`SemanticContentCleaner`) is implemented but never applied when aggregating multiple pages.
 - **File uploads** are stored but not processed. Uploaded markdown or sitemap files should trigger clipping logic.
11 changes: 6 additions & 5 deletions docs/TASKS.md
@@ -11,12 +11,12 @@
   - **Status**: Complete
 
 ### Core Development
-- [ ] **TASK-002**: Finalise clipping logic with PDF generation
+- [x] **TASK-002**: Finalise clipping logic with PDF generation
   - **Priority**: High
   - **Effort**: 4h
   - **Dependencies**: TASK-001
   - **Acceptance Criteria**: Markdown and PDF files generated for each clip
-  - **Status**: In Progress
+  - **Status**: Complete
 - [ ] **TASK-003**: Integrate marketing detection and semantic deduplication
   - **Priority**: Medium
   - **Effort**: 4h
@@ -78,10 +78,11 @@
 
 ## Progress Summary
 - **Total Tasks**: 11
-- **Completed**: 1
-- **In Progress**: 1
+- **Completed**: 2
+- **In Progress**: 0
 - **Remaining**: 9
-- **Overall Progress**: 9%
+- **Overall Progress**: 18%
 
 ## Notes
 - Tasks expanded based on code review. Progress will be tracked in future updates.
+- PDF generation integrated via `FileManager.save_pdf` completing TASK-002.