-
Notifications
You must be signed in to change notification settings - Fork 0
Integrate PDF generation #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -69,21 +69,16 @@ async def clip( | |
| tags=tags, | ||
| ) | ||
|
|
||
| base_filename = self._generate_filename( | ||
| "aggregated_document", timestamp | ||
| markdown_info = self.file_manager.save_markdown( | ||
| final_doc, timestamp, url | ||
| ) | ||
| markdown_filename = f"{base_filename}.md" | ||
| markdown_path = os.path.join(self.output_dir, markdown_filename) | ||
| with open(markdown_path, "w", encoding="utf-8") as f: | ||
| f.write(final_doc) | ||
|
|
||
| pdf_filename = f"{base_filename}.pdf" # PDF stub | ||
| pdf_info = self.file_manager.save_pdf(final_doc, timestamp, url) | ||
|
|
||
| return { | ||
| "title": "Aggregated content", | ||
| "url": url, | ||
| "markdown_path": markdown_filename, | ||
| "pdf_path": pdf_filename, | ||
| "markdown_path": markdown_info["relative_path"], | ||
| "pdf_path": pdf_info["relative_path"], | ||
| "timestamp": timestamp, | ||
| "status": "completed", | ||
| "preview": ( | ||
|
|
@@ -112,19 +107,16 @@ async def clip( | |
| tags=tags, | ||
| ) | ||
|
|
||
| base_filename = self._generate_filename(title, timestamp) | ||
| markdown_filename = f"{base_filename}.md" | ||
| markdown_path = os.path.join(self.output_dir, markdown_filename) | ||
| with open(markdown_path, "w", encoding="utf-8") as f: | ||
| f.write(final_doc) | ||
|
|
||
| pdf_filename = f"{base_filename}.pdf" # PDF stub | ||
| markdown_info = self.file_manager.save_markdown( | ||
| final_doc, timestamp, url | ||
| ) | ||
| pdf_info = self.file_manager.save_pdf(final_doc, timestamp, url) | ||
|
Comment on lines
+110
to
+113
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This block of code for saving files is nearly identical to the one for handling sitemaps on lines 72-75. To improve maintainability and adhere to the Don't Repeat Yourself (DRY) principle, consider refactoring this duplicated logic into a private helper method. This would also be a good place to apply the fix for the blocking I/O issue mentioned in the other comment, solving it in one central place. async def _save_artifacts_and_get_result(self, final_doc: str, timestamp: str, url: str, title: str) -> dict:
loop = asyncio.get_running_loop()
markdown_info, pdf_info = await asyncio.gather(
loop.run_in_executor(
None, self.file_manager.save_markdown, final_doc, timestamp, url
),
loop.run_in_executor(
None, self.file_manager.save_pdf, final_doc, timestamp, url
),
)
return {
"title": title,
"url": url,
"markdown_path": markdown_info["relative_path"],
"pdf_path": pdf_info["relative_path"],
"timestamp": timestamp,
"status": "completed",
"preview": (
final_doc[:500] + "..." if len(final_doc) > 500 else final_doc
),
} |
||
|
|
||
| return { | ||
| "title": title, | ||
| "url": url, | ||
| "markdown_path": markdown_filename, | ||
| "pdf_path": pdf_filename, | ||
| "markdown_path": markdown_info["relative_path"], | ||
| "pdf_path": pdf_info["relative_path"], | ||
| "timestamp": timestamp, | ||
| "status": "completed", | ||
| "preview": ( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
save_markdownandsave_pdfmethods perform synchronous file I/O, which will block the asyncio event loop when called from anasyncfunction likeclip. This can degrade application performance and responsiveness, especially as PDF generation can be a slow process. Consider executing these blocking calls in a thread pool executor. You can useasyncio.gatherto run them concurrently for better performance.This same issue exists in the single-URL processing logic block as well (lines 110-113).