27 changes: 27 additions & 0 deletions CLAUDE.md
@@ -11,6 +11,33 @@ fresh worktree. Update it when you discover a convention worth recording.
- **License:** MIT (package) + BSD-3-Clause (bundled PDFium binary).
- **CRAN target:** v0.1.0 ships to CRAN. Every change preserves CRAN-cleanliness.

## Scope — wrap PDFium, don't invent helpers

The package's job is to expose Google's PDFium C API to R idiomatically.
Every public function should ultimately call into PDFium (perhaps via a
chain of internal helpers) or be unambiguously tied to PDF-format
concepts (`pdf_parse_date()` parses the PDF date-string format).

What does **not** belong:

- Filesystem walking (`list.files()` loops over `pdf_doc_*`).
- Network plumbing beyond what PDFium itself does. `pdf_doc_open(path)`
accepting a URL is fine — the URL becomes raw bytes which go straight
into PDFium's `FPDF_LoadMemDocument64`. A function whose body is
mostly `httr2::request(...)` is not.
- Bulk / batch wrappers ("apply this PDFium function to every file in
a folder"). Users have `lapply` and `purrr` for that.
- Cross-PDF analysis ("compare these two PDFs"). Out of scope.
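
The URL case in the list above can be sketched in base R. This is a minimal illustration, not the package's implementation: `read_url_bytes()` is a hypothetical helper name, and the hand-off of the bytes to PDFium is only indicated in a comment.

```r
# Hypothetical helper: read everything behind a URL into a raw vector
# using only base R connections (no temporary file on disk).
read_url_bytes <- function(u, chunk = 65536L) {
  con <- url(u, open = "rb")
  on.exit(close(con), add = TRUE)

  chunks <- list()
  repeat {
    buf <- readBin(con, what = "raw", n = chunk)
    if (length(buf) == 0L) break          # zero bytes read: end of stream
    chunks[[length(chunks) + 1L]] <- buf
  }
  if (length(chunks) == 0L) return(raw()) # empty resource
  do.call(c, chunks)
}

# The resulting raw vector is what would be handed to PDFium's
# in-memory loader; that step is package-internal and not shown here.
```

The point of the sketch is that the fetch is a thin prelude to a PDFium call, which is why it can live inside `pdf_doc_open()` rather than as its own exported function.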

When in doubt, ask: *what PDFium symbol does this wrap?* If the answer
is "none — it's a convenience over base R", the function belongs in
user code or a separate utility package, not here.

This rule is recorded in `NEWS.md` as the justification for retracting
`pdf_dir_summary()` and `pdf_doc_open_url()`. Future contributors
shouldn't re-add functions whose job is to glue base R primitives
together around PDFium calls.

## Layering — never bypass

```
8 changes: 6 additions & 2 deletions NAMESPACE
@@ -42,8 +42,14 @@ S3method(print,pdfium_obj_list)
S3method(print,pdfium_page)
S3method(print,pdfium_signature)
S3method(print,pdfium_signature_list)
S3method(summary,pdfium_annot_list)
S3method(summary,pdfium_attachment_list)
S3method(summary,pdfium_bookmark_list)
S3method(summary,pdfium_doc)
S3method(summary,pdfium_form_field_list)
S3method(summary,pdfium_obj_list)
S3method(summary,pdfium_page)
S3method(summary,pdfium_signature_list)
export(as_pdfium_annot_list)
export(as_pdfium_attachment_list)
export(as_pdfium_bookmark_list)
@@ -105,7 +111,6 @@ export(pdf_bookmark_title)
export(pdf_bookmark_uri)
export(pdf_clip_path_count)
export(pdf_clip_path_segments)
export(pdf_dir_summary)
export(pdf_doc_bookmark_find)
export(pdf_doc_bookmarks)
export(pdf_doc_close)
@@ -120,7 +125,6 @@ export(pdf_doc_named_dest_by_name)
export(pdf_doc_named_dests)
export(pdf_doc_new)
export(pdf_doc_open)
export(pdf_doc_open_url)
export(pdf_doc_page_mode)
export(pdf_doc_permissions)
export(pdf_doc_security)
40 changes: 29 additions & 11 deletions NEWS.md
@@ -11,10 +11,12 @@ PDFs created with `pdf_doc_new()` are also writable).
* `pdf_doc_open()` / `pdf_doc_close()`, `pdf_doc_new()`,
`pdf_save()` / `pdf_save_to_raw()` — open existing PDFs (optionally
with `readwrite = TRUE`), build new ones in memory, and persist
the result. `pdf_doc_open_url(url)` is a convenience wrapper that
fetches a `http://` / `https://` / `ftp://` / `file://` URL via
`url()` + `readBin()` and loads the bytes through PDFium's
in-memory path — no temporary file on disk.
the result. The `path =` argument of `pdf_doc_open()` accepts
either a local filesystem path or a URL (any scheme `base::url()`
recognises — typically `http://` / `https://` / `ftp://` /
`file://`); URL input is fetched into raw bytes via `url()` +
`readBin()` and loaded through PDFium's `FPDF_LoadMemDocument64`,
with no temporary file on disk.
* `pdf_doc_info()`, `pdf_doc_meta()`, `pdf_doc_text()`,
`pdf_doc_fonts()`, `pdf_doc_file_id()`, `pdf_doc_page_mode()`,
`pdf_doc_viewer_preferences()`, `pdf_doc_viewer_preference_by_name()`,
@@ -40,13 +42,29 @@ PDFs created with `pdf_doc_new()` are also writable).
dispatch to the matching tibble — `summary(page)` adds the
page-loaded counts (annotation count, page-object count,
text-run count, link count) since the page is already loaded.
* `pdf_dir_summary(dir)` — scans a directory for PDF files and
returns one row per file in the `pdf_doc_summary()` shape.
Recursive scan via `recursive = TRUE`; pattern-matches `.pdf`
case-insensitively by default. The `errors` argument selects
one of `"warn"` (default — surface broken files but don't
abort), `"skip"` (silently drop), or `"stop"` (abort on the
first failure).
* `summary()` S3 methods for every `pdfium_*_list` class:
`pdfium_obj_list`, `pdfium_annot_list`, `pdfium_attachment_list`,
`pdfium_signature_list`, `pdfium_bookmark_list`, and
`pdfium_form_field_list`. Each dispatches to the matching
`as_tibble.*` method so `summary(x)` returns the same tibble
view `tibble::as_tibble(x)` would — matching the R idiom of
`print()` for the one-line summary and `summary()` for the deep
dive.
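
The delegation pattern described above can be sketched with a toy class. Everything here is an assumption made for illustration: `toy_list` is a hypothetical stand-in for a `pdfium_*_list`, and base R's `as.data.frame()` stands in for `tibble::as_tibble()` so the example runs without extra packages.

```r
# Hypothetical toy class standing in for a pdfium_*_list.
new_toy_list <- function(names, sizes) {
  structure(list(name = names, size = sizes), class = "toy_list")
}

# The "table view" method (the package uses as_tibble.* for this role).
as.data.frame.toy_list <- function(x, ...) {
  data.frame(name = x$name, size = x$size, stringsAsFactors = FALSE)
}

# summary() adds no logic of its own; it defers to the table view.
summary.toy_list <- function(object, ...) {
  as.data.frame(object, ...)
}

x <- new_toy_list(c("a.txt", "b.txt"), c(10L, 20L))
s <- summary(x)
```

Because `summary()` only forwards to the table method, the two views cannot drift apart.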

## Scope retraction

Two functions added during 0.1.0 development were retracted before
release on scope grounds (see `CLAUDE.md` §"Scope"):

* **`pdf_doc_open_url()`** — folded into `pdf_doc_open(path = ...)`.
The URL-fetching layer is just `base::url()` + `readBin()` ahead
of PDFium's existing in-memory path, so a separate exported
symbol added surface for no PDFium-specific behaviour.
* **`pdf_dir_summary()`** — removed. Its body was `list.files()`
+ `lapply(pdf_doc_summary)`; users with bulk-triage needs can
  write the loop themselves in three lines. Keeping it would have set
  a precedent for "convenience over a base R loop" creep, which runs
  counter to the package's PDFium-wrapper mandate.
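
A user-side version of that loop might look like the sketch below. `summarise_dir()` is a hypothetical name, and the per-file function is passed in as an argument so the example does not depend on the package being installed; in practice a user would presumably pass `pdf_doc_summary`.

```r
# Hypothetical user-side replacement for the retracted directory helper.
# `summarise_one` is any function mapping a file path to a one-row
# data frame (for example pdf_doc_summary from this package).
summarise_dir <- function(dir, summarise_one) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  rows <- lapply(files, summarise_one)
  do.call(rbind, rows)  # NULL when no files matched
}
```

Error handling (skip, warn, or stop on unreadable files) is left to the caller, for instance by wrapping `summarise_one` in `tryCatch()`.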

## Page objects, paths, and text

14 changes: 14 additions & 0 deletions R/annotations.R
@@ -246,6 +246,20 @@ as_tibble.pdfium_annot_list <- function(x, ...) {
)
}

#' Tibble-shaped summary of an annotation list
#'
#' `summary()` method for `pdfium_annot_list`. Defers to
#' [as_tibble.pdfium_annot_list()] for the standard tibble view.
#'
#' @param object A `pdfium_annot_list` from [pdf_annotations()].
#' @param ... Forwarded to [as_tibble.pdfium_annot_list()].
#' @return The tibble returned by [as_tibble.pdfium_annot_list()].
#' @method summary pdfium_annot_list
#' @export
summary.pdfium_annot_list <- function(object, ...) {
  tibble::as_tibble(object, ...)
}

# Internal: zero-row tibble matching as_tibble.pdfium_annot_list's
# schema. Used when the page has no annotations.
empty_annot_tibble <- function(src_page) {
16 changes: 16 additions & 0 deletions R/attachments.R
@@ -75,6 +75,22 @@ as_tibble.pdfium_attachment_list <- function(x, ...) {
)
}

#' Tibble-shaped summary of an attachment list
#'
#' `summary()` method for `pdfium_attachment_list`. Defers to
#' [as_tibble.pdfium_attachment_list()] for the standard tibble
#' view — matches the R idiom of `print()` for the one-line summary
#' and `summary()` for the deep dive.
#'
#' @param object A `pdfium_attachment_list` from [pdf_attachments()].
#' @param ... Forwarded to [as_tibble.pdfium_attachment_list()].
#' @return The tibble returned by [as_tibble.pdfium_attachment_list()].
#' @method summary pdfium_attachment_list
#' @export
summary.pdfium_attachment_list <- function(object, ...) {
  tibble::as_tibble(object, ...)
}

# Internal: zero-row tibble matching as_tibble.pdfium_attachment_list.
empty_attachment_tibble <- function() {
tibble::tibble(
124 changes: 14 additions & 110 deletions R/doc.R
@@ -110,6 +110,20 @@ as_tibble.pdfium_bookmark_list <- function(x, ...) {
)
}

#' Tibble-shaped summary of a bookmark list
#'
#' `summary()` method for `pdfium_bookmark_list`. Defers to
#' [as_tibble.pdfium_bookmark_list()] for the standard tibble view.
#'
#' @param object A `pdfium_bookmark_list` from [pdf_doc_bookmarks()].
#' @param ... Forwarded to [as_tibble.pdfium_bookmark_list()].
#' @return The tibble returned by [as_tibble.pdfium_bookmark_list()].
#' @method summary pdfium_bookmark_list
#' @export
summary.pdfium_bookmark_list <- function(object, ...) {
  tibble::as_tibble(object, ...)
}

empty_bookmark_tibble <- function() {
tibble::tibble(
bookmark_index = integer(),
@@ -642,116 +656,6 @@ summary.pdfium_doc <- function(object, ...) {
  pdf_doc_summary(object)
}

#' Summarise every PDF in a directory in one call
#'
#' Scans a directory for PDF files and returns a tibble whose rows
#' are the [pdf_doc_summary()] output for each file. The natural
#' replacement for the standard "loop over a folder of PDFs and
#' triage" workflow — encrypted-which / has-forms-which /
#' has-attachments-which.
#'
#' Files that fail to open (corrupt, wrong format, password
#' protected) are handled per the `errors` argument:
#'
#' * `"warn"` (default) — a `warning()` per failed file; the file
#' is dropped from the result tibble.
#' * `"skip"` — silently dropped.
#' * `"stop"` — the first failed file raises an error and the
#' function aborts.
#'
#' @param dir Character scalar. Path to the directory to scan.
#' @param pattern Regular expression filtering filenames. Defaults
#' to `"\\.pdf$"` (case-insensitive).
#' @param recursive Logical. When `TRUE`, descend into
#' subdirectories. Defaults `FALSE`.
#' @param password Optional password applied to every file. `NULL`
#' (default) tries each file without a password. Useful when all
#' files share the same password.
#' @param errors One of `"warn"`, `"skip"`, `"stop"` — see Details.
#' @return A tibble with the same columns as [pdf_doc_summary()].
#' Zero rows when the directory has no PDFs (or every PDF failed
#' to open under `errors = "skip"` / `"warn"`).
#' @seealso [pdf_doc_summary()] for the single-file companion.
#' @examples
#' fixture_dir <- system.file("extdata", "fixtures",
#' package = "pdfium")
#' if (nzchar(fixture_dir)) {
#' pdf_dir_summary(fixture_dir)
#' }
#' @export
pdf_dir_summary <- function(dir = ".", pattern = "\\.pdf$",
                            recursive = FALSE, password = NULL,
                            errors = c("warn", "skip", "stop")) {
  checkmate::assert_directory_exists(dir)
  checkmate::assert_string(pattern)
  checkmate::assert_flag(recursive)
  errors <- match.arg(errors)

  files <- list.files(dir, pattern = pattern, recursive = recursive,
                      full.names = TRUE, ignore.case = TRUE)
  if (length(files) == 0L) {
    return(pdf_doc_summary_empty())
  }

  rows <- lapply(files, function(f) {
    tryCatch(
      pdf_doc_summary(f, password = password),
      error = function(e) {
        if (errors == "stop") {
          stop(sprintf("pdf_dir_summary: failed to read '%s': %s",
                       f, conditionMessage(e)), call. = FALSE)
        }
        if (errors == "warn") {
          warning(sprintf("pdf_dir_summary: failed to read '%s': %s",
                          f, conditionMessage(e)), call. = FALSE)
        }
        NULL
      }
    )
  })
  ok <- !vapply(rows, is.null, logical(1L))
  if (!any(ok)) {
    return(pdf_doc_summary_empty())
  }
  out <- do.call(rbind, rows[ok])
  tibble::as_tibble(out)
}

# Internal: zero-row tibble matching pdf_doc_summary's column shape.
# Used by pdf_dir_summary() when the directory is empty (or every
# file failed under `errors = "skip"` / `"warn"`).
pdf_doc_summary_empty <- function() {
  tibble::tibble(
    path = character(),
    page_count = integer(),
    file_version = integer(),
    title = character(),
    author = character(),
    subject = character(),
    keywords = character(),
    creator = character(),
    producer = character(),
    creation_date = character(),
    mod_date = character(),
    trapped = character(),
    creation_date_parsed = as.POSIXct(character(), tz = "UTC"),
    mod_date_parsed = as.POSIXct(character(), tz = "UTC"),
    is_tagged = logical(),
    is_encrypted = logical(),
    security_revision = integer(),
    xref_valid = logical(),
    bookmark_count = integer(),
    attachment_count = integer(),
    signature_count = integer(),
    form_field_count = integer(),
    javascript_count = integer(),
    named_dest_count = integer(),
    has_page_labels = logical(),
    file_id_permanent = character(),
    file_id_changing = character()
  )
}

# Internal: convert pdf_doc_file_id()'s raw return to a hex string,
# or NA_character_ when empty. Hoisted from pdf_doc_summary so its
# two branches can be unit-tested without a fixture that carries an