diff --git a/CLAUDE.md b/CLAUDE.md
index 7d45b96d..822c8d58 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -11,6 +11,33 @@ fresh worktree. Update it when you discover a convention worth recording.
 - **License:** MIT (package) + BSD-3-Clause (bundled PDFium binary).
 - **CRAN target:** v0.1.0 ships to CRAN. Every change preserves CRAN-cleanliness.

+## Scope — wrap PDFium, don't invent helpers
+
+The package's job is to expose Google's PDFium C API to R idiomatically.
+Every public function should ultimately call into PDFium (perhaps via a
+chain of internal helpers) or be unambiguously tied to PDF-format
+concepts (`pdf_parse_date()` parses the PDF date-string format).
+
+What does **not** belong:
+
+- Filesystem walking (`list.files()` loops over `pdf_doc_*`).
+- Network plumbing beyond what PDFium itself does. `pdf_doc_open(path)`
+  accepting a URL is fine — the URL becomes raw bytes which go straight
+  into PDFium's `FPDF_LoadMemDocument64`. A function whose body is
+  mostly `httr2::request(...)` is not.
+- Bulk / batch wrappers ("apply this PDFium function to every file in
+  a folder"). Users have `lapply` and `purrr` for that.
+- Cross-PDF analysis ("compare these two PDFs"). Out of scope.
+
+When in doubt, ask: *what PDFium symbol does this wrap?* If the answer
+is "none — it's a convenience over base R", the function belongs in
+user code or a separate utility package, not here.
+
+This is recorded as a deletion-justification in `NEWS.md` for the
+`pdf_dir_summary` / `pdf_doc_open_url` retraction. Future contributors
+shouldn't re-add functions whose job is to glue base R primitives
+together around pdfium calls.
+
 ## Layering — never bypass

 ```
diff --git a/NAMESPACE b/NAMESPACE
index 9b1b7001..118daaf5 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -42,8 +42,14 @@ S3method(print,pdfium_obj_list)
 S3method(print,pdfium_page)
 S3method(print,pdfium_signature)
 S3method(print,pdfium_signature_list)
+S3method(summary,pdfium_annot_list)
+S3method(summary,pdfium_attachment_list)
+S3method(summary,pdfium_bookmark_list)
 S3method(summary,pdfium_doc)
+S3method(summary,pdfium_form_field_list)
+S3method(summary,pdfium_obj_list)
 S3method(summary,pdfium_page)
+S3method(summary,pdfium_signature_list)
 export(as_pdfium_annot_list)
 export(as_pdfium_attachment_list)
 export(as_pdfium_bookmark_list)
@@ -105,7 +111,6 @@ export(pdf_bookmark_title)
 export(pdf_bookmark_uri)
 export(pdf_clip_path_count)
 export(pdf_clip_path_segments)
-export(pdf_dir_summary)
 export(pdf_doc_bookmark_find)
 export(pdf_doc_bookmarks)
 export(pdf_doc_close)
@@ -120,7 +125,6 @@ export(pdf_doc_named_dest_by_name)
 export(pdf_doc_named_dests)
 export(pdf_doc_new)
 export(pdf_doc_open)
-export(pdf_doc_open_url)
 export(pdf_doc_page_mode)
 export(pdf_doc_permissions)
 export(pdf_doc_security)
diff --git a/NEWS.md b/NEWS.md
index 9b1414a9..454926bb 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -11,10 +11,12 @@ PDFs created with `pdf_doc_new()` are also writable).
 * `pdf_doc_open()` / `pdf_doc_close()`, `pdf_doc_new()`,
   `pdf_save()` / `pdf_save_to_raw()` — open existing PDFs (optionally
   with `readwrite = TRUE`), build new ones in memory, and persist
-  the result. `pdf_doc_open_url(url)` is a convenience wrapper that
-  fetches a `http://` / `https://` / `ftp://` / `file://` URL via
-  `url()` + `readBin()` and loads the bytes through PDFium's
-  in-memory path — no temporary file on disk.
+  the result. The `path =` argument of `pdf_doc_open()` accepts
+  either a local filesystem path or a URL (any scheme `base::url()`
+  recognises — typically `http://` / `https://` / `ftp://` /
+  `file://`); URL input is fetched into raw bytes via `url()` +
+  `readBin()` and loaded through PDFium's `FPDF_LoadMemDocument64`,
+  with no temporary file on disk.
 * `pdf_doc_info()`, `pdf_doc_meta()`, `pdf_doc_text()`,
   `pdf_doc_fonts()`, `pdf_doc_file_id()`, `pdf_doc_page_mode()`,
   `pdf_doc_viewer_preferences()`, `pdf_doc_viewer_preference_by_name()`,
@@ -40,13 +42,29 @@ PDFs created with `pdf_doc_new()` are also writable).
   dispatch to the matching tibble — `summary(page)` adds the
   page-loaded counts (annotation count, page-object count,
   text-run count, link count) since the page is already loaded.
-* `pdf_dir_summary(dir)` — scans a directory for PDF files and
-  returns one row per file in the `pdf_doc_summary()` shape.
-  Recursive scan via `recursive = TRUE`; pattern-matches `.pdf`
-  case-insensitively by default. The `errors` argument selects
-  one of `"warn"` (default — surface broken files but don't
-  abort), `"skip"` (silently drop), or `"stop"` (abort on the
-  first failure).
+* `summary()` S3 methods for every `pdfium_*_list` class:
+  `pdfium_obj_list`, `pdfium_annot_list`, `pdfium_attachment_list`,
+  `pdfium_signature_list`, `pdfium_bookmark_list`, and
+  `pdfium_form_field_list`. Each dispatches to the matching
+  `as_tibble.*` method so `summary(x)` returns the same tibble
+  view `tibble::as_tibble(x)` would — matching the R idiom of
+  `print()` for the one-line summary and `summary()` for the deep
+  dive.
+
+## Scope retraction
+
+Two functions added during 0.1.0 development were retracted before
+release on scope grounds (see `CLAUDE.md` §"Scope"):
+
+* **`pdf_doc_open_url()`** — folded into `pdf_doc_open(path = ...)`.
+  The URL-fetching layer is just `base::url()` + `readBin()` ahead
+  of PDFium's existing in-memory path, so a separate exported
+  symbol added surface for no PDFium-specific behaviour.
+* **`pdf_dir_summary()`** — removed. Its body was `list.files()` +
+  `lapply(pdf_doc_summary)`; users with bulk-triage needs can
+  write the loop themselves in three lines. Keeping it set a
+  precedent for "convenience over a base R loop" creep that the
+  package's PDFium-wrapper mandate doesn't want.

 ## Page objects, paths, and text

diff --git a/R/annotations.R b/R/annotations.R
index a2d369cb..894cb7c1 100644
--- a/R/annotations.R
+++ b/R/annotations.R
@@ -246,6 +246,20 @@ as_tibble.pdfium_annot_list <- function(x, ...) {
   )
 }

+#' Tibble-shaped summary of an annotation list
+#'
+#' `summary()` method for `pdfium_annot_list`. Defers to
+#' [as_tibble.pdfium_annot_list()] for the standard tibble view.
+#'
+#' @param object A `pdfium_annot_list` from [pdf_annotations()].
+#' @param ... Forwarded to [as_tibble.pdfium_annot_list()].
+#' @return The tibble returned by [as_tibble.pdfium_annot_list()].
+#' @method summary pdfium_annot_list
+#' @export
+summary.pdfium_annot_list <- function(object, ...) {
+  tibble::as_tibble(object, ...)
+}
+
 # Internal: zero-row tibble matching as_tibble.pdfium_annot_list's
 # schema. Used when the page has no annotations.
 empty_annot_tibble <- function(src_page) {
diff --git a/R/attachments.R b/R/attachments.R
index 39d28ee7..e677f2bf 100644
--- a/R/attachments.R
+++ b/R/attachments.R
@@ -75,6 +75,22 @@ as_tibble.pdfium_attachment_list <- function(x, ...) {
   )
 }

+#' Tibble-shaped summary of an attachment list
+#'
+#' `summary()` method for `pdfium_attachment_list`. Defers to
+#' [as_tibble.pdfium_attachment_list()] for the standard tibble
+#' view — matches the R idiom of `print()` for the one-line summary
+#' and `summary()` for the deep dive.
+#'
+#' @param object A `pdfium_attachment_list` from [pdf_attachments()].
+#' @param ... Forwarded to [as_tibble.pdfium_attachment_list()].
+#' @return The tibble returned by [as_tibble.pdfium_attachment_list()].
+#' @method summary pdfium_attachment_list
+#' @export
+summary.pdfium_attachment_list <- function(object, ...) {
+  tibble::as_tibble(object, ...)
+}
+
 # Internal: zero-row tibble matching as_tibble.pdfium_attachment_list.
 empty_attachment_tibble <- function() {
   tibble::tibble(
diff --git a/R/doc.R b/R/doc.R
index 61141873..2404d0a8 100644
--- a/R/doc.R
+++ b/R/doc.R
@@ -110,6 +110,20 @@ as_tibble.pdfium_bookmark_list <- function(x, ...) {
   )
 }

+#' Tibble-shaped summary of a bookmark list
+#'
+#' `summary()` method for `pdfium_bookmark_list`. Defers to
+#' [as_tibble.pdfium_bookmark_list()] for the standard tibble view.
+#'
+#' @param object A `pdfium_bookmark_list` from [pdf_doc_bookmarks()].
+#' @param ... Forwarded to [as_tibble.pdfium_bookmark_list()].
+#' @return The tibble returned by [as_tibble.pdfium_bookmark_list()].
+#' @method summary pdfium_bookmark_list
+#' @export
+summary.pdfium_bookmark_list <- function(object, ...) {
+  tibble::as_tibble(object, ...)
+}
+
 empty_bookmark_tibble <- function() {
   tibble::tibble(
     bookmark_index = integer(),
@@ -642,116 +656,6 @@ summary.pdfium_doc <- function(object, ...) {
   pdf_doc_summary(object)
 }

-#' Summarise every PDF in a directory in one call
-#'
-#' Scans a directory for PDF files and returns a tibble whose rows
-#' are the [pdf_doc_summary()] output for each file. The natural
-#' replacement for the standard "loop over a folder of PDFs and
-#' triage" workflow — encrypted-which / has-forms-which /
-#' has-attachments-which.
-#'
-#' Files that fail to open (corrupt, wrong format, password
-#' protected) are handled per the `errors` argument:
-#'
-#' * `"warn"` (default) — a `warning()` per failed file; the file
-#' is dropped from the result tibble.
-#' * `"skip"` — silently dropped.
-#' * `"stop"` — the first failed file raises an error and the
-#' function aborts.
-#'
-#' @param dir Character scalar. Path to the directory to scan.
-#' @param pattern Regular expression filtering filenames. Defaults
-#' to `"\\.pdf$"` (case-insensitive).
-#' @param recursive Logical. When `TRUE`, descend into
-#' subdirectories. Defaults `FALSE`.
-#' @param password Optional password applied to every file. `NULL`
-#' (default) tries each file without a password. Useful when all
-#' files share the same password.
-#' @param errors One of `"warn"`, `"skip"`, `"stop"` — see Details.
-#' @return A tibble with the same columns as [pdf_doc_summary()].
-#' Zero rows when the directory has no PDFs (or every PDF failed
-#' to open under `errors = "skip"` / `"warn"`).
-#' @seealso [pdf_doc_summary()] for the single-file companion.
-#' @examples
-#' fixture_dir <- system.file("extdata", "fixtures",
-#'   package = "pdfium")
-#' if (nzchar(fixture_dir)) {
-#'   pdf_dir_summary(fixture_dir)
-#' }
-#' @export
-pdf_dir_summary <- function(dir = ".", pattern = "\\.pdf$",
-                            recursive = FALSE, password = NULL,
-                            errors = c("warn", "skip", "stop")) {
-  checkmate::assert_directory_exists(dir)
-  checkmate::assert_string(pattern)
-  checkmate::assert_flag(recursive)
-  errors <- match.arg(errors)
-
-  files <- list.files(dir, pattern = pattern, recursive = recursive,
-                      full.names = TRUE, ignore.case = TRUE)
-  if (length(files) == 0L) {
-    return(pdf_doc_summary_empty())
-  }
-
-  rows <- lapply(files, function(f) {
-    tryCatch(
-      pdf_doc_summary(f, password = password),
-      error = function(e) {
-        if (errors == "stop") {
-          stop(sprintf("pdf_dir_summary: failed to read '%s': %s",
-                       f, conditionMessage(e)), call. = FALSE)
-        }
-        if (errors == "warn") {
-          warning(sprintf("pdf_dir_summary: failed to read '%s': %s",
-                          f, conditionMessage(e)), call. = FALSE)
-        }
-        NULL
-      }
-    )
-  })
-  ok <- !vapply(rows, is.null, logical(1L))
-  if (!any(ok)) {
-    return(pdf_doc_summary_empty())
-  }
-  out <- do.call(rbind, rows[ok])
-  tibble::as_tibble(out)
-}
-
-# Internal: zero-row tibble matching pdf_doc_summary's column shape.
-# Used by pdf_dir_summary() when the directory is empty (or every
-# file failed under `errors = "skip"` / `"warn"`).
-pdf_doc_summary_empty <- function() {
-  tibble::tibble(
-    path = character(),
-    page_count = integer(),
-    file_version = integer(),
-    title = character(),
-    author = character(),
-    subject = character(),
-    keywords = character(),
-    creator = character(),
-    producer = character(),
-    creation_date = character(),
-    mod_date = character(),
-    trapped = character(),
-    creation_date_parsed = as.POSIXct(character(), tz = "UTC"),
-    mod_date_parsed = as.POSIXct(character(), tz = "UTC"),
-    is_tagged = logical(),
-    is_encrypted = logical(),
-    security_revision = integer(),
-    xref_valid = logical(),
-    bookmark_count = integer(),
-    attachment_count = integer(),
-    signature_count = integer(),
-    form_field_count = integer(),
-    javascript_count = integer(),
-    named_dest_count = integer(),
-    has_page_labels = logical(),
-    file_id_permanent = character(),
-    file_id_changing = character()
-  )
-}
-
 # Internal: convert pdf_doc_file_id()'s raw return to a hex string,
 # or NA_character_ when empty. Hoisted from pdf_doc_summary so its
 # two branches can be unit-tested without a fixture that carries an
diff --git a/R/document.R b/R/document.R
index 7710229a..4d594786 100644
--- a/R/document.R
+++ b/R/document.R
@@ -1,22 +1,34 @@
 #' Open a PDF document
 #'
-#' Loads a PDF from disk or from an in-memory byte buffer. The
-#' returned `pdfium_doc` carries an external pointer to a PDFium
-#' `FPDF_DOCUMENT` handle along with a finalizer that calls
-#' `FPDF_CloseDocument()` when the R object is garbage-collected.
-#' Call [pdf_doc_close()] explicitly when you need deterministic
-#' release.
+#' Loads a PDF from disk, from a URL, or from an in-memory byte
+#' buffer. The returned `pdfium_doc` carries an external pointer to
+#' a PDFium `FPDF_DOCUMENT` handle along with a finalizer that
+#' calls `FPDF_CloseDocument()` when the R object is
+#' garbage-collected. Call [pdf_doc_close()] explicitly when you
+#' need deterministic release.
 #'
-#' Two input forms are supported. Pass `path` to load from disk
-#' (via PDFium's `FPDF_LoadDocument`), or pass `source` for an
-#' in-memory raw vector (via `FPDF_LoadMemDocument64`). The
-#' in-memory path is useful for documents downloaded via
-#' `httr2::resp_body_raw()`, `curl::curl_fetch_memory()`, or read
-#' with `readBin()` straight into RAM. Exactly one of `path` or
-#' `source` must be provided.
+#' Two input forms are supported:
 #'
-#' @param path Character scalar. Path to a PDF file. The file must
-#' exist and be readable. Mutually exclusive with `source`.
+#' * **`path`** — either a local filesystem path (loaded via
+#' PDFium's `FPDF_LoadDocument`) or a URL string (any scheme
+#' base R's [base::url()] knows how to open — typically
+#' `http://`, `https://`, `ftp://`, `file://`; the actual list
+#' depends on the R build's `libcurl` / internal handlers).
+#' Detection is purely syntactic: a string matching
+#' `scheme://` per RFC 3986 is routed to `url() + readBin()`
+#' and loaded via PDFium's in-memory path; anything else is
+#' treated as a local path. No allowlist — we trust `url()` to
+#' know what it supports and error cleanly when it doesn't.
+#' * **`source`** — a raw vector containing the PDF byte stream
+#' (loaded via `FPDF_LoadMemDocument64`). Useful for documents
+#' downloaded via `httr2::resp_body_raw()`,
+#' `curl::curl_fetch_memory()`, or read with `readBin()` straight
+#' into RAM.
+#'
+#' Exactly one of `path` or `source` must be provided.
+#'
+#' @param path Character scalar. Either a local path or a URL (see
+#' Details). Mutually exclusive with `source`.
 #' @param source Raw vector containing the PDF byte stream. PDFium
 #' keeps an internal reference to the bytes for the document's
 #' lifetime, so the wrapper makes its own copy on the C++ side
@@ -52,6 +64,13 @@
 #'   pdf_page_count(doc)
 #'   pdf_doc_close(doc)
 #' }
+#'
+#' # `path` can be a URL - any scheme R's url() recognises.
+#' if (nzchar(fixture)) {
+#'   doc <- pdf_doc_open(paste0("file://", fixture))
+#'   pdf_page_count(doc)
+#'   pdf_doc_close(doc)
+#' }
 #' @export
 pdf_doc_open <- function(path = NULL, source = NULL, password = NULL,
                          readwrite = FALSE) {
@@ -62,6 +81,15 @@ pdf_doc_open <- function(path = NULL, source = NULL, password = NULL,
     ptr <- cpp_open_document_from_memory(source, pwd)
     return(new_pdfium_doc(ptr, "", readwrite = readwrite))
   }
+  if (looks_like_url(path)) {
+    con <- base::url(path, open = "rb")
+    on.exit(close(con), add = TRUE)
+    # readBin needs an upper bound; .Machine$integer.max is the
+    # documented "unbounded" sentinel.
+    bytes <- readBin(con, what = "raw", n = .Machine$integer.max)
+    ptr <- cpp_open_document_from_memory(bytes, pwd)
+    return(new_pdfium_doc(ptr, path, readwrite = readwrite))
+  }
   ptr <- cpp_open_document(path.expand(path), pwd)
   new_pdfium_doc(
     ptr,
@@ -70,58 +98,18 @@ pdf_doc_open <- function(path = NULL, source = NULL, password = NULL,
   )
 }

-#' Open a PDF document from a URL
-#'
-#' Convenience wrapper around [pdf_doc_open()] that fetches the
-#' bytes of a remote (or `file://`) URL via base R's [`url()`] +
-#' [`readBin()`] and loads the result through PDFium's in-memory
-#' path (`FPDF_LoadMemDocument64`). No temporary file is left on
-#' disk; the bytes live in R memory for the document's lifetime.
-#'
-#' Network errors propagate from [`url()`] / [`readBin()`] (typical
-#' shape: `cannot open URL '...'` from `connection failed`). The
-#' returned `pdfium_doc`'s `$path` field is the URL string itself,
-#' so `print()` and [pdf_doc_summary()] surface
-#' the source even though no local path exists.
-#'
-#' @param url Character scalar. Must start with one of `http://`,
-#' `https://`, `ftp://`, or `file://`.
-#' @param password Optional password for encrypted PDFs. `NULL`
-#' (the default) passes no password to PDFium.
-#' @param readwrite Logical. As for [pdf_doc_open()].
-#' @return A `pdfium_doc`.
-#' @seealso [pdf_doc_open()] for the doc-open primitive.
-#' @examples
-#' fixture <- system.file("extdata", "fixtures", "minimal.pdf",
-#'   package = "pdfium"
-#' )
-#' if (nzchar(fixture)) {
-#'   doc <- pdf_doc_open_url(paste0("file://", fixture))
-#'   pdf_page_count(doc)
-#'   pdf_doc_close(doc)
-#' }
-#' @export
-pdf_doc_open_url <- function(url, password = NULL, readwrite = FALSE) {
-  checkmate::assert_string(url, min.chars = 1L)
-  if (!grepl("^(https?|ftp|file)://", url)) {
-    stop(
-      "`url` must start with http://, https://, ftp://, or file://. ",
-      "Got: ", url,
-      call. = FALSE
-    )
-  }
-  con <- base::url(url, open = "rb")
-  on.exit(close(con), add = TRUE)
-  # readBin needs a max-size hint; .Machine$integer.max is the
-  # documented "unbounded" sentinel.
-  bytes <- readBin(con, what = "raw", n = .Machine$integer.max)
-  doc <- pdf_doc_open(source = bytes, password = password,
-                      readwrite = readwrite)
-  # Override the "" path with the source URL so
-  # downstream printing / pdf_doc_summary() shows where it came
-  # from.
-  doc$path <- url
-  doc
+# Internal: does `path` look like a URL? Permissive scheme-detection
+# matching RFC 3986's `URI = scheme ":" hier-part` form
+# (`scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`). Includes
+# the `://` discriminator so plain `foo:bar` paths don't accidentally
+# trigger URL handling.
+#
+# We don't enforce an allowlist of schemes — that's `url()`'s job.
+# If R's `url()` doesn't recognise the scheme, it errors at open;
+# that error propagates up to the caller unchanged.
+looks_like_url <- function(path) {
+  is.character(path) && length(path) == 1L && !is.na(path) &&
+    grepl("^[A-Za-z][A-Za-z0-9+.\\-]*://", path)
 }

 # Internal: validate the three pdf_doc_open() arguments. Split into
@@ -158,6 +146,12 @@ validate_pdf_open_source <- function(source) {

 validate_pdf_open_path <- function(path) {
   checkmate::assert_string(path, min.chars = 1L)
+  # URL strings are handled by the url() + readBin() branch in
+  # pdf_doc_open(); they're not expected to exist on the local
+  # filesystem.
+  if (looks_like_url(path)) {
+    return(invisible())
+  }
   if (!file.exists(path)) {
     stop("PDF file not found: ", path, call. = FALSE)
   }
diff --git a/R/form_fields.R b/R/form_fields.R
index b1fdbd7a..879c0f6d 100644
--- a/R/form_fields.R
+++ b/R/form_fields.R
@@ -220,6 +220,20 @@ as_tibble.pdfium_form_field_list <- function(x, ...) {
   )
 }

+#' Tibble-shaped summary of a form-field list
+#'
+#' `summary()` method for `pdfium_form_field_list`. Defers to
+#' [as_tibble.pdfium_form_field_list()] for the standard tibble view.
+#'
+#' @param object A `pdfium_form_field_list` from [pdf_form_fields()].
+#' @param ... Forwarded to [as_tibble.pdfium_form_field_list()].
+#' @return The tibble returned by [as_tibble.pdfium_form_field_list()].
+#' @method summary pdfium_form_field_list
+#' @export
+summary.pdfium_form_field_list <- function(object, ...) {
+  tibble::as_tibble(object, ...)
+}
+
 # Internal: zero-row tibble matching as_tibble.pdfium_form_field_list.
 empty_form_field_tibble <- function(src_doc) {
   tibble::tibble(
diff --git a/R/objects.R b/R/objects.R
index 96ad4298..20f31658 100644
--- a/R/objects.R
+++ b/R/objects.R
@@ -110,6 +110,23 @@ as_tibble.pdfium_obj_list <- function(x, ...) {
   )
 }

+#' Tibble-shaped summary of a page-object list
+#'
+#' `summary()` method for `pdfium_obj_list`. Defers to
+#' [as_tibble.pdfium_obj_list()] so users can call
+#' `summary(pdf_page_objects(page))` for the standard tibble view —
+#' matches the R idiom of `print()` for the one-line summary and
+#' `summary()` for the deep dive.
+#'
+#' @param object A `pdfium_obj_list` from [pdf_page_objects()].
+#' @param ... Forwarded to [as_tibble.pdfium_obj_list()].
+#' @return The tibble returned by [as_tibble.pdfium_obj_list()].
+#' @method summary pdfium_obj_list
+#' @export
+summary.pdfium_obj_list <- function(object, ...) {
+  tibble::as_tibble(object, ...)
+}
+
 empty_obj_tibble <- function() {
   tibble::tibble(
     object_index = integer(),
diff --git a/R/signatures.R b/R/signatures.R
index 192df864..2a9270a2 100644
--- a/R/signatures.R
+++ b/R/signatures.R
@@ -76,6 +76,20 @@ as_tibble.pdfium_signature_list <- function(x, ...) {
   )
 }

+#' Tibble-shaped summary of a signature list
+#'
+#' `summary()` method for `pdfium_signature_list`. Defers to
+#' [as_tibble.pdfium_signature_list()] for the standard tibble view.
+#'
+#' @param object A `pdfium_signature_list` from [pdf_signatures()].
+#' @param ... Forwarded to [as_tibble.pdfium_signature_list()].
+#' @return The tibble returned by [as_tibble.pdfium_signature_list()].
+#' @method summary pdfium_signature_list
+#' @export
+summary.pdfium_signature_list <- function(object, ...) {
+  tibble::as_tibble(object, ...)
+}
+
 empty_signature_tibble <- function() {
   tibble::tibble(
     signature_index = integer(),
diff --git a/README.Rmd b/README.Rmd
index ae349685..954bd193 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -46,8 +46,14 @@ same library that powers Chrome's PDF viewer. It has two halves:
   just text.
 * **Filling** AcroForm fields programmatically and flattening the
   result for downstream tooling.
-* **Authoring** small programmatic PDFs (think: figure callouts,
-  table reports, annotated source documents).
+* **Authoring** programmatic PDFs from vector graphics, text, and
+  annotations (think: figure callouts, table reports, annotated
+  source documents). v0.1.0 ships paths / text in the 14 standard
+  PDF fonts / annotations. Image embedding and custom-font loading
+  are both wrapping gaps (PDFium already exposes the symbols; we
+  just haven't wrapped them yet) and come in a later release;
+  `/Info`-dict writes and on-save encryption need upstream PDFium
+  changes that we've proposed but Google hasn't shipped yet.
 * Anything you'd otherwise drop into Python with `pypdfium2`.

 See [`vignette("mutating-pdfs")`](https://humanpred.github.io/rpdfium/articles/mutating-pdfs.html)
diff --git a/README.md b/README.md
index 06cbc2b3..0698b761 100644
--- a/README.md
+++ b/README.md
@@ -39,8 +39,14 @@ powers Chrome’s PDF viewer. It has two halves:
   text.
 - **Filling** AcroForm fields programmatically and flattening the
   result for downstream tooling.
-- **Authoring** small programmatic PDFs (think: figure callouts, table
-  reports, annotated source documents).
+- **Authoring** programmatic PDFs from vector graphics, text, and
+  annotations (think: figure callouts, table reports, annotated source
+  documents). v0.1.0 ships paths / text in the 14 standard PDF fonts /
+  annotations. Image embedding and custom-font loading are both wrapping
+  gaps (PDFium already exposes the symbols; we just haven’t wrapped them
+  yet) and come in a later release; `/Info`-dict writes and on-save
+  encryption need upstream PDFium changes that we’ve proposed but Google
+  hasn’t shipped yet.
 - Anything you’d otherwise drop into Python with `pypdfium2`.

 See
diff --git a/_pkgdown.yml b/_pkgdown.yml
index 57e4679e..b798857c 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -26,13 +26,11 @@ reference:
 - title: Documents
   contents:
   - pdf_doc_open
-  - pdf_doc_open_url
   - pdf_doc_close
   - pdf_page_count
   - pdf_doc_info
   - pdf_doc_meta
   - pdf_doc_summary
-  - pdf_dir_summary
   - summary.pdfium_doc
   - pdf_parse_date
   - pdf_doc_text
@@ -60,6 +58,7 @@ reference:
   - pdf_attachments
   - as_pdfium_attachment_list
   - as_tibble.pdfium_attachment_list
+  - summary.pdfium_attachment_list
   - pdf_attachment_name
   - pdf_attachment_mime_type
   - pdf_attachment_size_bytes
@@ -70,6 +69,7 @@ reference:
   - pdf_signatures
   - as_pdfium_signature_list
   - as_tibble.pdfium_signature_list
+  - summary.pdfium_signature_list
   - pdf_signature_sub_filter
   - pdf_signature_reason
   - pdf_signature_time
@@ -80,6 +80,7 @@ reference:
   contents:
   - as_pdfium_bookmark_list
   - as_tibble.pdfium_bookmark_list
+  - summary.pdfium_bookmark_list
   - pdf_bookmark_title
   - pdf_bookmark_page_num
   - pdf_bookmark_action_type
@@ -111,6 +112,7 @@ reference:
   - pdf_annot_at
   - as_pdfium_annot_list
   - as_tibble.pdfium_annot_list
+  - summary.pdfium_annot_list
   - pdf_annot_subtype
   - pdf_annot_subtype_code
   - pdf_annot_flags
@@ -135,6 +137,7 @@ reference:
   - pdf_form_fields
   - as_pdfium_form_field_list
   - as_tibble.pdfium_form_field_list
+  - summary.pdfium_form_field_list
   - pdf_form_field_type
   - pdf_form_field_type_code
   - pdf_form_field_page_num
@@ -155,6 +158,7 @@ reference:
   - pdf_page_objects
   - as_pdfium_obj_list
   - as_tibble.pdfium_obj_list
+  - summary.pdfium_obj_list
   - pdf_obj_type
   - pdf_obj_bounds
   - pdf_obj_rotated_bounds
diff --git a/man/pdf_dir_summary.Rd b/man/pdf_dir_summary.Rd
deleted file mode 100644
index aeafc80e..00000000
--- a/man/pdf_dir_summary.Rd
+++ /dev/null
@@ -1,62 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/doc.R
-\name{pdf_dir_summary}
-\alias{pdf_dir_summary}
-\title{Summarise every PDF in a directory in one call}
-\usage{
-pdf_dir_summary(
-  dir = ".",
-  pattern = "\\\\.pdf$",
-  recursive = FALSE,
-  password = NULL,
-  errors = c("warn", "skip", "stop")
-)
-}
-\arguments{
-\item{dir}{Character scalar. Path to the directory to scan.}
-
-\item{pattern}{Regular expression filtering filenames. Defaults
-to \code{"\\\\.pdf$"} (case-insensitive).}
-
-\item{recursive}{Logical. When \code{TRUE}, descend into
-subdirectories. Defaults \code{FALSE}.}
-
-\item{password}{Optional password applied to every file. \code{NULL}
-(default) tries each file without a password. Useful when all
-files share the same password.}
-
-\item{errors}{One of \code{"warn"}, \code{"skip"}, \code{"stop"} — see Details.}
-}
-\value{
-A tibble with the same columns as \code{\link[=pdf_doc_summary]{pdf_doc_summary()}}.
-Zero rows when the directory has no PDFs (or every PDF failed
-to open under \code{errors = "skip"} / \code{"warn"}).
-}
-\description{
-Scans a directory for PDF files and returns a tibble whose rows
-are the \code{\link[=pdf_doc_summary]{pdf_doc_summary()}} output for each file. The natural
-replacement for the standard "loop over a folder of PDFs and
-triage" workflow — encrypted-which / has-forms-which /
-has-attachments-which.
-}
-\details{
-Files that fail to open (corrupt, wrong format, password
-protected) are handled per the \code{errors} argument:
-\itemize{
-\item \code{"warn"} (default) — a \code{warning()} per failed file; the file
-is dropped from the result tibble.
-\item \code{"skip"} — silently dropped.
-\item \code{"stop"} — the first failed file raises an error and the
-function aborts.
-}
-}
-\examples{
-fixture_dir <- system.file("extdata", "fixtures",
-  package = "pdfium")
-if (nzchar(fixture_dir)) {
-  pdf_dir_summary(fixture_dir)
-}
-}
-\seealso{
-\code{\link[=pdf_doc_summary]{pdf_doc_summary()}} for the single-file companion.
-}
diff --git a/man/pdf_doc_open.Rd b/man/pdf_doc_open.Rd
index a94e894e..c101c640 100644
--- a/man/pdf_doc_open.Rd
+++ b/man/pdf_doc_open.Rd
@@ -7,8 +7,8 @@
 pdf_doc_open(path = NULL, source = NULL, password = NULL, readwrite = FALSE)
 }
 \arguments{
-\item{path}{Character scalar. Path to a PDF file. The file must
-exist and be readable. Mutually exclusive with \code{source}.}
+\item{path}{Character scalar. Either a local path or a URL (see
+Details). Mutually exclusive with \code{source}.}

 \item{source}{Raw vector containing the PDF byte stream. PDFium
 keeps an internal reference to the bytes for the document's
@@ -33,21 +33,34 @@ edits inside long pipelines.}
 A \code{pdfium_doc} object.
 }
 \description{
-Loads a PDF from disk or from an in-memory byte buffer. The
-returned \code{pdfium_doc} carries an external pointer to a PDFium
-\code{FPDF_DOCUMENT} handle along with a finalizer that calls
-\code{FPDF_CloseDocument()} when the R object is garbage-collected.
-Call \code{\link[=pdf_doc_close]{pdf_doc_close()}} explicitly when you need deterministic
-release.
+Loads a PDF from disk, from a URL, or from an in-memory byte
+buffer. The returned \code{pdfium_doc} carries an external pointer to
+a PDFium \code{FPDF_DOCUMENT} handle along with a finalizer that
+calls \code{FPDF_CloseDocument()} when the R object is
+garbage-collected. Call \code{\link[=pdf_doc_close]{pdf_doc_close()}} explicitly when you
+need deterministic release.
 }
 \details{
-Two input forms are supported. Pass \code{path} to load from disk
-(via PDFium's \code{FPDF_LoadDocument}), or pass \code{source} for an
-in-memory raw vector (via \code{FPDF_LoadMemDocument64}). The
-in-memory path is useful for documents downloaded via
-\code{httr2::resp_body_raw()}, \code{curl::curl_fetch_memory()}, or read
-with \code{readBin()} straight into RAM. Exactly one of \code{path} or
-\code{source} must be provided.
+Two input forms are supported:
+\itemize{
+\item \strong{\code{path}} — either a local filesystem path (loaded via
+PDFium's \code{FPDF_LoadDocument}) or a URL string (any scheme
+base R's \code{\link[base:url]{base::url()}} knows how to open — typically
+\verb{http://}, \verb{https://}, \verb{ftp://}, \verb{file://}; the actual list
+depends on the R build's \code{libcurl} / internal handlers).
+Detection is purely syntactic: a string matching
+\verb{scheme://} per RFC 3986 is routed to \code{url() + readBin()}
+and loaded via PDFium's in-memory path; anything else is
+treated as a local path. No allowlist — we trust \code{url()} to
+know what it supports and error cleanly when it doesn't.
+\item \strong{\code{source}} — a raw vector containing the PDF byte stream
+(loaded via \code{FPDF_LoadMemDocument64}). Useful for documents
+downloaded via \code{httr2::resp_body_raw()},
+\code{curl::curl_fetch_memory()}, or read with \code{readBin()} straight
+into RAM.
+}
+
+Exactly one of \code{path} or \code{source} must be provided.
 }
 \examples{
 fixture <- system.file("extdata", "fixtures", "minimal.pdf",
@@ -66,4 +79,11 @@ if (nzchar(fixture)) {
   pdf_page_count(doc)
   pdf_doc_close(doc)
 }
+
+# `path` can be a URL - any scheme R's url() recognises.
+if (nzchar(fixture)) {
+  doc <- pdf_doc_open(paste0("file://", fixture))
+  pdf_page_count(doc)
+  pdf_doc_close(doc)
+}
 }
diff --git a/man/pdf_doc_open_url.Rd b/man/pdf_doc_open_url.Rd
deleted file mode 100644
index 99e86c81..00000000
--- a/man/pdf_doc_open_url.Rd
+++ /dev/null
@@ -1,47 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/document.R
-\name{pdf_doc_open_url}
-\alias{pdf_doc_open_url}
-\title{Open a PDF document from a URL}
-\usage{
-pdf_doc_open_url(url, password = NULL, readwrite = FALSE)
-}
-\arguments{
-\item{url}{Character scalar. Must start with one of \verb{http://},
-\verb{https://}, \verb{ftp://}, or \verb{file://}.}
-
-\item{password}{Optional password for encrypted PDFs. \code{NULL}
-(the default) passes no password to PDFium.}
-
-\item{readwrite}{Logical. As for \code{\link[=pdf_doc_open]{pdf_doc_open()}}.}
-}
-\value{
-A \code{pdfium_doc}.
-}
-\description{
-Convenience wrapper around \code{\link[=pdf_doc_open]{pdf_doc_open()}} that fetches the
-bytes of a remote (or \verb{file://}) URL via base R's \code{\link[=url]{url()}} +
-\code{\link[=readBin]{readBin()}} and loads the result through PDFium's in-memory
-path (\code{FPDF_LoadMemDocument64}). No temporary file is left on
-disk; the bytes live in R memory for the document's lifetime.
-}
-\details{
-Network errors propagate from \code{\link[=url]{url()}} / \code{\link[=readBin]{readBin()}} (typical
-shape: \verb{cannot open URL '...'} from \verb{connection failed}). The
-returned \code{pdfium_doc}'s \verb{$path} field is the URL string itself,
-so \code{print()} and \code{\link[=pdf_doc_summary]{pdf_doc_summary()}} surface
-the source even though no local path exists.
-}
-\examples{
-fixture <- system.file("extdata", "fixtures", "minimal.pdf",
-  package = "pdfium"
-)
-if (nzchar(fixture)) {
-  doc <- pdf_doc_open_url(paste0("file://", fixture))
-  pdf_page_count(doc)
-  pdf_doc_close(doc)
-}
-}
-\seealso{
-\code{\link[=pdf_doc_open]{pdf_doc_open()}} for the doc-open primitive.
-}
diff --git a/man/summary.pdfium_annot_list.Rd b/man/summary.pdfium_annot_list.Rd
new file mode 100644
index 00000000..d06db339
--- /dev/null
+++ b/man/summary.pdfium_annot_list.Rd
@@ -0,0 +1,20 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/annotations.R
+\name{summary.pdfium_annot_list}
+\alias{summary.pdfium_annot_list}
+\title{Tibble-shaped summary of an annotation list}
+\usage{
+\method{summary}{pdfium_annot_list}(object, ...)
+} +\arguments{ +\item{object}{A \code{pdfium_annot_list} from \code{\link[=pdf_annotations]{pdf_annotations()}}.} + +\item{...}{Forwarded to \code{\link[=as_tibble.pdfium_annot_list]{as_tibble.pdfium_annot_list()}}.} +} +\value{ +The tibble returned by \code{\link[=as_tibble.pdfium_annot_list]{as_tibble.pdfium_annot_list()}}. +} +\description{ +\code{summary()} method for \code{pdfium_annot_list}. Defers to +\code{\link[=as_tibble.pdfium_annot_list]{as_tibble.pdfium_annot_list()}} for the standard tibble view. +} diff --git a/man/summary.pdfium_attachment_list.Rd b/man/summary.pdfium_attachment_list.Rd new file mode 100644 index 00000000..236628af --- /dev/null +++ b/man/summary.pdfium_attachment_list.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/attachments.R +\name{summary.pdfium_attachment_list} +\alias{summary.pdfium_attachment_list} +\title{Tibble-shaped summary of an attachment list} +\usage{ +\method{summary}{pdfium_attachment_list}(object, ...) +} +\arguments{ +\item{object}{A \code{pdfium_attachment_list} from \code{\link[=pdf_attachments]{pdf_attachments()}}.} + +\item{...}{Forwarded to \code{\link[=as_tibble.pdfium_attachment_list]{as_tibble.pdfium_attachment_list()}}.} +} +\value{ +The tibble returned by \code{\link[=as_tibble.pdfium_attachment_list]{as_tibble.pdfium_attachment_list()}}. +} +\description{ +\code{summary()} method for \code{pdfium_attachment_list}. Defers to +\code{\link[=as_tibble.pdfium_attachment_list]{as_tibble.pdfium_attachment_list()}} for the standard tibble +view — matches the R idiom of \code{print()} for the one-line summary +and \code{summary()} for the deep dive. 
+} diff --git a/man/summary.pdfium_bookmark_list.Rd b/man/summary.pdfium_bookmark_list.Rd new file mode 100644 index 00000000..ca83c034 --- /dev/null +++ b/man/summary.pdfium_bookmark_list.Rd @@ -0,0 +1,20 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/doc.R +\name{summary.pdfium_bookmark_list} +\alias{summary.pdfium_bookmark_list} +\title{Tibble-shaped summary of a bookmark list} +\usage{ +\method{summary}{pdfium_bookmark_list}(object, ...) +} +\arguments{ +\item{object}{A \code{pdfium_bookmark_list} from \code{\link[=pdf_doc_bookmarks]{pdf_doc_bookmarks()}}.} + +\item{...}{Forwarded to \code{\link[=as_tibble.pdfium_bookmark_list]{as_tibble.pdfium_bookmark_list()}}.} +} +\value{ +The tibble returned by \code{\link[=as_tibble.pdfium_bookmark_list]{as_tibble.pdfium_bookmark_list()}}. +} +\description{ +\code{summary()} method for \code{pdfium_bookmark_list}. Defers to +\code{\link[=as_tibble.pdfium_bookmark_list]{as_tibble.pdfium_bookmark_list()}} for the standard tibble view. +} diff --git a/man/summary.pdfium_form_field_list.Rd b/man/summary.pdfium_form_field_list.Rd new file mode 100644 index 00000000..44408941 --- /dev/null +++ b/man/summary.pdfium_form_field_list.Rd @@ -0,0 +1,20 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/form_fields.R +\name{summary.pdfium_form_field_list} +\alias{summary.pdfium_form_field_list} +\title{Tibble-shaped summary of a form-field list} +\usage{ +\method{summary}{pdfium_form_field_list}(object, ...) +} +\arguments{ +\item{object}{A \code{pdfium_form_field_list} from \code{\link[=pdf_form_fields]{pdf_form_fields()}}.} + +\item{...}{Forwarded to \code{\link[=as_tibble.pdfium_form_field_list]{as_tibble.pdfium_form_field_list()}}.} +} +\value{ +The tibble returned by \code{\link[=as_tibble.pdfium_form_field_list]{as_tibble.pdfium_form_field_list()}}. +} +\description{ +\code{summary()} method for \code{pdfium_form_field_list}. 
Defers to +\code{\link[=as_tibble.pdfium_form_field_list]{as_tibble.pdfium_form_field_list()}} for the standard tibble view. +} diff --git a/man/summary.pdfium_obj_list.Rd b/man/summary.pdfium_obj_list.Rd new file mode 100644 index 00000000..f4288fde --- /dev/null +++ b/man/summary.pdfium_obj_list.Rd @@ -0,0 +1,23 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/objects.R +\name{summary.pdfium_obj_list} +\alias{summary.pdfium_obj_list} +\title{Tibble-shaped summary of a page-object list} +\usage{ +\method{summary}{pdfium_obj_list}(object, ...) +} +\arguments{ +\item{object}{A \code{pdfium_obj_list} from \code{\link[=pdf_page_objects]{pdf_page_objects()}}.} + +\item{...}{Forwarded to \code{\link[=as_tibble.pdfium_obj_list]{as_tibble.pdfium_obj_list()}}.} +} +\value{ +The tibble returned by \code{\link[=as_tibble.pdfium_obj_list]{as_tibble.pdfium_obj_list()}}. +} +\description{ +\code{summary()} method for \code{pdfium_obj_list}. Defers to +\code{\link[=as_tibble.pdfium_obj_list]{as_tibble.pdfium_obj_list()}} so users can call +\code{summary(pdf_page_objects(page))} for the standard tibble view — +matches the R idiom of \code{print()} for the one-line summary and +\code{summary()} for the deep dive. +} diff --git a/man/summary.pdfium_signature_list.Rd b/man/summary.pdfium_signature_list.Rd new file mode 100644 index 00000000..d7644969 --- /dev/null +++ b/man/summary.pdfium_signature_list.Rd @@ -0,0 +1,20 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/signatures.R +\name{summary.pdfium_signature_list} +\alias{summary.pdfium_signature_list} +\title{Tibble-shaped summary of a signature list} +\usage{ +\method{summary}{pdfium_signature_list}(object, ...) 
+} +\arguments{ +\item{object}{A \code{pdfium_signature_list} from \code{\link[=pdf_signatures]{pdf_signatures()}}.} + +\item{...}{Forwarded to \code{\link[=as_tibble.pdfium_signature_list]{as_tibble.pdfium_signature_list()}}.} +} +\value{ +The tibble returned by \code{\link[=as_tibble.pdfium_signature_list]{as_tibble.pdfium_signature_list()}}. +} +\description{ +\code{summary()} method for \code{pdfium_signature_list}. Defers to +\code{\link[=as_tibble.pdfium_signature_list]{as_tibble.pdfium_signature_list()}} for the standard tibble view. +} diff --git a/tests/testthat/test-dir-summary.R b/tests/testthat/test-dir-summary.R deleted file mode 100644 index 437584b2..00000000 --- a/tests/testthat/test-dir-summary.R +++ /dev/null @@ -1,125 +0,0 @@ -# Tests for pdf_dir_summary() — the bulk-triage helper that wraps -# pdf_doc_summary() over every PDF in a directory. - -# Helper to expose the shipped fixture directory. -fixture_dir <- function() { - system.file("extdata", "fixtures", package = "pdfium") -} - -test_that("pdf_dir_summary returns a tibble with one row per PDF", { - s <- pdf_dir_summary(fixture_dir()) - expect_s3_class(s, "tbl_df") - files <- list.files(fixture_dir(), pattern = "\\.pdf$") - expect_equal(nrow(s), length(files)) -}) - -test_that("pdf_dir_summary column shape matches pdf_doc_summary", { - bulk <- pdf_dir_summary(fixture_dir()) - one <- pdf_doc_summary(fixture_path("shapes")) - expect_named(bulk, names(one)) -}) - -test_that("pdf_dir_summary preserves the path column", { - s <- pdf_dir_summary(fixture_dir()) - expect_true(all(grepl("\\.pdf$", s$path))) - expect_true(all(file.exists(s$path))) -}) - -test_that("pdf_dir_summary recursive descent works", { - # Create a nested temp dir with two PDFs, one in a subdir. 
- tmp <- withr::local_tempdir() - file.copy(fixture_path("minimal"), file.path(tmp, "top.pdf")) - sub <- file.path(tmp, "subdir") - dir.create(sub) - file.copy(fixture_path("minimal"), file.path(sub, "nested.pdf")) - - flat <- pdf_dir_summary(tmp, recursive = FALSE) - expect_equal(nrow(flat), 1L) - - deep <- pdf_dir_summary(tmp, recursive = TRUE) - expect_equal(nrow(deep), 2L) -}) - -test_that("pdf_dir_summary returns zero rows for an empty dir", { - tmp <- withr::local_tempdir() - s <- pdf_dir_summary(tmp) - expect_s3_class(s, "tbl_df") - expect_equal(nrow(s), 0L) -}) - -test_that("pdf_dir_summary's empty tibble has the right shape", { - empty <- pdfium:::pdf_doc_summary_empty() - expect_s3_class(empty, "tbl_df") - expect_equal(nrow(empty), 0L) - one <- pdf_doc_summary(fixture_path("shapes")) - expect_named(empty, names(one)) -}) - -test_that("pdf_dir_summary case-insensitive PDF pattern matches .PDF too", { - tmp <- withr::local_tempdir() - file.copy(fixture_path("minimal"), file.path(tmp, "upper.PDF")) - file.copy(fixture_path("minimal"), file.path(tmp, "lower.pdf")) - s <- pdf_dir_summary(tmp) - expect_equal(nrow(s), 2L) -}) - -test_that("pdf_dir_summary errors = stop aborts on a bad file", { - tmp <- withr::local_tempdir() - file.copy(fixture_path("minimal"), file.path(tmp, "good.pdf")) - writeLines("not a pdf", file.path(tmp, "bad.pdf")) - expect_error( - pdf_dir_summary(tmp, errors = "stop"), - "failed to read" - ) -}) - -test_that("pdf_dir_summary errors = warn drops bad files with a warning", { - tmp <- withr::local_tempdir() - file.copy(fixture_path("minimal"), file.path(tmp, "good.pdf")) - writeLines("not a pdf", file.path(tmp, "bad.pdf")) - s <- suppressWarnings(pdf_dir_summary(tmp, errors = "warn")) - expect_equal(nrow(s), 1L) - expect_warning( - pdf_dir_summary(tmp, errors = "warn"), - "failed to read" - ) -}) - -test_that("pdf_dir_summary errors = skip silently drops bad files", { - tmp <- withr::local_tempdir() - file.copy(fixture_path("minimal"), 
file.path(tmp, "good.pdf")) - writeLines("not a pdf", file.path(tmp, "bad.pdf")) - expect_no_warning(s <- pdf_dir_summary(tmp, errors = "skip")) - expect_equal(nrow(s), 1L) -}) - -test_that("pdf_dir_summary returns zero rows when every file fails", { - tmp <- withr::local_tempdir() - writeLines("not a pdf", file.path(tmp, "bad1.pdf")) - writeLines("also not a pdf", file.path(tmp, "bad2.pdf")) - s <- suppressWarnings(pdf_dir_summary(tmp, errors = "skip")) - expect_equal(nrow(s), 0L) -}) - -test_that("pdf_dir_summary forwards the password argument", { - s <- pdf_dir_summary(fixture_dir(), password = NULL) - expect_gt(nrow(s), 0L) -}) - -test_that("pdf_dir_summary rejects bad inputs", { - expect_error(pdf_dir_summary("/this/path/does/not/exist"), - "Assertion on") - expect_error(pdf_dir_summary(fixture_dir(), pattern = NA_character_), - "Assertion on") - expect_error(pdf_dir_summary(fixture_dir(), recursive = "yes"), - "Assertion on") - expect_error(pdf_dir_summary(fixture_dir(), errors = "bogus"), - "'arg' should be one of") -}) - -test_that("pdf_dir_summary respects a custom pattern", { - # Only match the annotated fixture. - s <- pdf_dir_summary(fixture_dir(), pattern = "^annotated\\.pdf$") - expect_equal(nrow(s), 1L) - expect_match(s$path[[1L]], "annotated\\.pdf$") -}) diff --git a/tests/testthat/test-doc-open-url.R b/tests/testthat/test-doc-open-url.R deleted file mode 100644 index 989fd2f0..00000000 --- a/tests/testthat/test-doc-open-url.R +++ /dev/null @@ -1,73 +0,0 @@ -# Tests for pdf_doc_open_url(). The network test paths are -# necessarily skipped on CRAN — they use the `file://` scheme -# against a shipped fixture, which exercises the same url() + -# readBin() code path as a real `https://` URL without needing -# network access. 
- -test_that("pdf_doc_open_url opens a file:// URL", { - url <- paste0("file://", fixture_path("minimal")) - doc <- pdf_doc_open_url(url) - on.exit(pdf_doc_close(doc), add = TRUE) - expect_s3_class(doc, "pdfium_doc") - expect_identical(pdf_page_count(doc), 1L) -}) - -test_that("pdf_doc_open_url stores the URL as the doc path", { - url <- paste0("file://", fixture_path("minimal")) - doc <- pdf_doc_open_url(url) - on.exit(pdf_doc_close(doc), add = TRUE) - expect_identical(doc$path, url) -}) - -test_that("pdf_doc_open_url forwards password + readwrite flags", { - url <- paste0("file://", fixture_path("minimal")) - doc <- pdf_doc_open_url(url, password = NULL, readwrite = TRUE) - on.exit(pdf_doc_close(doc), add = TRUE) - expect_true(doc$readwrite) -}) - -test_that("pdf_doc_open_url rejects non-URL strings", { - expect_error(pdf_doc_open_url("not-a-url"), - "must start with http://") - expect_error(pdf_doc_open_url("/path/to/file.pdf"), - "must start with http://") - expect_error(pdf_doc_open_url(""), "Assertion on") -}) - -test_that("pdf_doc_open_url rejects bad input types", { - expect_error(pdf_doc_open_url(42L), "Assertion on") - expect_error(pdf_doc_open_url(NULL), "Assertion on") - expect_error(pdf_doc_open_url(c("a", "b")), "Assertion on") -}) - -test_that("pdf_doc_open_url surfaces URL connection errors", { - bad_url <- "file:///definitely-not-a-file-on-this-system.pdf" - suppressWarnings(expect_error(pdf_doc_open_url(bad_url))) -}) - -test_that("pdf_doc_open_url accepts http(s) URLs structurally", { - # We can't actually fetch http(s) without network access, but the - # URL-shape validation should accept these prefixes and only fail - # later at the network step. base::url() emits a warning then - # errors on unreachable hosts; suppressWarnings so the test - # output isn't noisy. 
- suppressWarnings({ - expect_error(pdf_doc_open_url("https://example.invalid/x.pdf")) - expect_error(pdf_doc_open_url("http://example.invalid/x.pdf")) - # Neither error should be the URL-shape error. - err1 <- tryCatch( - pdf_doc_open_url("https://example.invalid/x.pdf"), - error = function(e) conditionMessage(e) - ) - }) - expect_false(grepl("must start with", err1)) -}) - -test_that("pdf_doc_open_url round-trips through pdf_doc_summary", { - url <- paste0("file://", fixture_path("annotated")) - doc <- pdf_doc_open_url(url) - on.exit(pdf_doc_close(doc), add = TRUE) - s <- pdf_doc_summary(doc) - expect_identical(s$path, url) - expect_gt(s$form_field_count, 0L) # annotated.pdf has form fields -}) diff --git a/tests/testthat/test-document.R b/tests/testthat/test-document.R index 22d6519c..f5e7eeaf 100644 --- a/tests/testthat/test-document.R +++ b/tests/testthat/test-document.R @@ -77,3 +77,68 @@ test_that("auto-finalizer releases handles dropped without explicit close", { invisible(gc(verbose = FALSE)) succeed() }) + +# URL paths -------------------------------------------------------- +# The `path =` argument auto-detects URLs (anything matching the RFC +# 3986 scheme://host shape) and routes them through base::url() + +# readBin() before handing the bytes to PDFium's in-memory loader. +# We don't maintain a scheme allowlist — whatever R's url() handles +# is what we handle. 
+ +test_that("pdf_doc_open accepts a file:// URL", { + url <- paste0("file://", fixture_path("minimal")) + doc <- pdf_doc_open(url) + on.exit(pdf_doc_close(doc), add = TRUE) + expect_s3_class(doc, "pdfium_doc") + expect_identical(pdf_page_count(doc), 1L) +}) + +test_that("pdf_doc_open stores the URL as the doc path", { + url <- paste0("file://", fixture_path("minimal")) + doc <- pdf_doc_open(url) + on.exit(pdf_doc_close(doc), add = TRUE) + expect_identical(doc$path, url) +}) + +test_that("pdf_doc_open passes URL bytes through to readwrite mode", { + url <- paste0("file://", fixture_path("minimal")) + doc <- pdf_doc_open(url, readwrite = TRUE) + on.exit(pdf_doc_close(doc), add = TRUE) + expect_true(doc$readwrite) +}) + +test_that("pdf_doc_open surfaces base::url() errors on unreachable hosts", { + # base::url() emits a warning then errors when the host is + # unreachable; we suppress the warning to keep the test output + # clean but assert the error still propagates. + suppressWarnings( + expect_error(pdf_doc_open("https://example.invalid/x.pdf")) + ) +}) + +test_that("pdf_doc_open treats non-URL strings as local paths", { + # A string with a colon but no `://` is a path on this system, + # not a URL — should not trigger url() handling. 
+ expect_error(pdf_doc_open("foo:bar"), "PDF file not found") +}) + +test_that("looks_like_url accepts every RFC 3986 scheme shape", { + for (u in c("http://x", "https://x", "ftp://x", "file:///x", + "FILE:///x", "git+ssh://x")) { + expect_true(pdfium:::looks_like_url(u), info = u) + } + for (nu in list("/absolute/path", "relative/path", "x.pdf", + "1http://no", NA_character_, c("a", "b"), 42L, NULL)) { + expect_false(pdfium:::looks_like_url(nu), + info = deparse(nu, control = NULL)) + } +}) + +test_that("pdf_doc_open's URL path round-trips through pdf_doc_summary", { + url <- paste0("file://", fixture_path("annotated")) + doc <- pdf_doc_open(url) + on.exit(pdf_doc_close(doc), add = TRUE) + s <- pdf_doc_summary(doc) + expect_identical(s$path, url) + expect_gt(s$form_field_count, 0L) +}) diff --git a/tests/testthat/test-summary-list.R b/tests/testthat/test-summary-list.R new file mode 100644 index 00000000..a7cfc5b8 --- /dev/null +++ b/tests/testthat/test-summary-list.R @@ -0,0 +1,64 @@ +# Tests for `summary()` methods on the six `pdfium_*_list` classes. +# Every one is a thin dispatcher to its companion `as_tibble.*` +# method; the test surface is therefore "does dispatch happen?" not +# "are the columns right?"; column-shape tests live in each list +# type's own test file.
+ +test_that("summary(pdf_page_objects(...)) dispatches to as_tibble", { + doc <- pdf_doc_open(fixture_path("shapes")) + on.exit(pdf_doc_close(doc), add = TRUE) + page <- pdf_page_load(doc, 1L) + on.exit(pdf_page_close(page), add = TRUE, after = FALSE) + objs <- pdf_page_objects(page) + expect_identical(summary(objs), tibble::as_tibble(objs)) +}) + +test_that("summary(pdf_annotations(...)) dispatches to as_tibble", { + doc <- pdf_doc_open(fixture_path("annotated")) + on.exit(pdf_doc_close(doc), add = TRUE) + page <- pdf_page_load(doc, 1L) + on.exit(pdf_page_close(page), add = TRUE, after = FALSE) + annots <- pdf_annotations(page) + expect_identical(summary(annots), tibble::as_tibble(annots)) +}) + +test_that("summary(pdf_attachments(...)) dispatches to as_tibble", { + doc <- pdf_doc_open(fixture_path("attachments")) + on.exit(pdf_doc_close(doc), add = TRUE) + atts <- pdf_attachments(doc) + expect_identical(summary(atts), tibble::as_tibble(atts)) +}) + +test_that("summary(pdf_signatures(...)) dispatches to as_tibble", { + doc <- pdf_doc_open(fixture_path("signed")) + on.exit(pdf_doc_close(doc), add = TRUE) + sigs <- pdf_signatures(doc) + expect_identical(summary(sigs), tibble::as_tibble(sigs)) +}) + +test_that("summary(pdf_doc_bookmarks(...)) dispatches to as_tibble", { + doc <- pdf_doc_open(fixture_path("outline")) + on.exit(pdf_doc_close(doc), add = TRUE) + bms <- pdf_doc_bookmarks(doc) + expect_identical(summary(bms), tibble::as_tibble(bms)) +}) + +test_that("summary(pdf_form_fields(...)) dispatches to as_tibble", { + doc <- pdf_doc_open(fixture_path("annotated")) + on.exit(pdf_doc_close(doc), add = TRUE) + fields <- pdf_form_fields(doc) + expect_identical(summary(fields), tibble::as_tibble(fields)) +}) + +test_that("summary() on every list class returns a tibble", { + doc <- pdf_doc_open(fixture_path("annotated")) + on.exit(pdf_doc_close(doc), add = TRUE) + page <- pdf_page_load(doc, 1L) + on.exit(pdf_page_close(page), add = TRUE, after = FALSE) + 
expect_s3_class(summary(pdf_page_objects(page)), "tbl_df") + expect_s3_class(summary(pdf_annotations(page)), "tbl_df") + expect_s3_class(summary(pdf_attachments(doc)), "tbl_df") + expect_s3_class(summary(pdf_signatures(doc)), "tbl_df") + expect_s3_class(summary(pdf_doc_bookmarks(doc)), "tbl_df") + expect_s3_class(summary(pdf_form_fields(doc)), "tbl_df") +}) diff --git a/vignettes/comparison.Rmd b/vignettes/comparison.Rmd index d65d626d..c461ca34 100644 --- a/vignettes/comparison.Rmd +++ b/vignettes/comparison.Rmd @@ -30,7 +30,7 @@ contributor-facing inventory lives in | **Inspect path geometry** (segments, Bezier control points, stroke/fill, transform matrices) | **`pdfium`** — no other CRAN package surfaces this | | **Fill AcroForm fields without a JRE** | **`pdfium`** (`staplr` requires Java + pdftk) | | **Edit annotations** (read + write) | **`pdfium`** | -| **Programmatically build small PDFs** with paths, text, images, annotations | **`pdfium`** (also `minipdf` for a pure-R writer with no native dependency) | +| **Programmatically build PDFs** — any page count, vector paths, standard-font text, annotations | **`pdfium`** (also `minipdf` for a pure-R writer that additionally supports image embedding today) | | Edit XMP metadata or bookmarks | `xmpdf` (orchestrates `exiftool` / `ghostscript` / `pdftk`) | ## What `pdfium` adds @@ -92,6 +92,51 @@ pdf_save(doc, "annotated.pdf") No other CRAN package surfaces annotations at all. The full list of supported subtypes lives in `?pdf_annot_new`. +### 4. 
Programmatic PDF authoring (with v0.1.0 limits) + +[`pdf_doc_new()`](https://humanpred.github.io/rpdfium/reference/pdf_doc_new.html) +and the page-object creators ([`pdf_path_new`](https://humanpred.github.io/rpdfium/reference/pdf_path_new.html), +[`pdf_rect_new`](https://humanpred.github.io/rpdfium/reference/pdf_rect_new.html), +[`pdf_text_new`](https://humanpred.github.io/rpdfium/reference/pdf_text_new.html), +and the path-geometry appenders) let you build PDFs from scratch +in R. The mutating-pdfs vignette walks through the workflow. + +**What scales fine.** Page count is unlimited (PDFium handles +thousand-page docs efficiently); objects per page are unlimited; +the full vector-graphics surface (paths, Bezier curves, dash +patterns, transformation matrices, blend modes, opacity, clip +paths) is exposed; annotations are richly covered. The R↔C +boundary cost is microseconds per call, so 10⁶ object writes take +seconds, not minutes. + +**v0.1.0 limits worth knowing about.** Four authoring axes have +real gaps in the current release.
They split cleanly into two +groups by what's blocking closure: + +*Blocked on this package's roadmap* (PDFium exposes the symbols; +we just haven't wrapped them yet — closure timing is in our +hands): + +| Gap | Unwrapped PDFium symbols | Workaround today | +|---|---|---| +| **Image embedding** | `FPDFImageObj_LoadJpegFile`, `FPDFImageObj_LoadJpegFileInline`, `FPDFImageObj_SetBitmap` — all in `inst/include/fpdf_edit.h`; deferred from Phase 5 pending an `FPDF_BITMAP` lifecycle design | Use `minipdf` for image-heavy programmatic PDFs, or render images separately and embed via a post-process | +| **Custom fonts** | `FPDFText_LoadFont`, `FPDFText_LoadStandardFont`, `FPDFText_LoadCidType2Font` — all in `inst/include/fpdf_edit.h`; not yet wrapped, needs a `pdfium_font` S3 class with `FPDFFont_Close` finalizer | For non-Latin scripts or branded type, render text as paths via an external tool and embed the resulting vector paths | + +*Blocked on upstream PDFium* (the symbols don't exist yet — we've +proposed them but they need to ship through Google's Gerrit review +cycle, land in a PDFium release, and propagate to a `bblanchon` +binary before we can wrap them): + +| Gap | Missing PDFium symbol(s) | Workaround today | +|---|---|---| +| **`/Info` dict writes** | `FPDF_SetMetaText` — [drafted patch](https://github.com/humanpred/rpdfium/blob/main/dev/upstream-patches/pdfium-FPDF_SetMetaText.patch) awaiting Gerrit upload | Use `xmpdf` to patch the Info dict after `pdf_save()` | +| **Encryption on save** | `FPDF_SetEncryption` — listed as CL 5 in [`dev/upstream-api-gaps.md`](https://github.com/humanpred/rpdfium/blob/main/dev/upstream-api-gaps.md); not yet drafted | Use `qpdf::pdf_encrypt()` as a post-process step | + +The full upstream-PDFium gap inventory lives in +[`dev/upstream-api-gaps.md`](https://github.com/humanpred/rpdfium/blob/main/dev/upstream-api-gaps.md). 
+The "what scales" claims above hold today; the limits all have a +known path to closure, with the per-table timing differences noted. + ## Where `pdfium` deliberately doesn't compete * **Structural split / merge / compress** — `qpdf` is the right