Examine all PDF files in lookup directories, identify them using regular expressions, rename them, and copy them to organized directories.

    gem install pdfh

You need to install pdftotext to extract text from PDF files.

    brew install xpdf
    sudo dnf install -y poppler-utils
    sudo pacman -S poppler

After installing this gem, create your configuration file in one of the following directories:
- ~/.config/pdfh.yml
- ~/pdfh.yml
- or configure the PDFH_CONFIG_FILE environment variable
Then run:

    pdfh

The tool will:
- Scan all PDFs in the configured lookup_dirs
- Extract text from each PDF using pdftotext
- Match the extracted text from each PDF against your configured document_types (via re_id)
- Copy matched documents to organized directories within destination_base_path
- Rename files according to your name_template
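
The matching step above can be illustrated with a minimal Ruby sketch. It is not the gem's actual implementation: `DocumentType` and `identify` are hypothetical names, and treating the name fallback as a literal text match is an assumption.

```ruby
# Hypothetical stand-in for one entry of document_types (not the gem's own class).
DocumentType = Struct.new(:name, :re_id, keyword_init: true) do
  # Pattern used to recognize a document. Assumption: when re_id is missing,
  # the name itself is searched for as literal text.
  def id_pattern
    re_id ? Regexp.new(re_id) : Regexp.new(Regexp.escape(name))
  end
end

# Returns the first document type whose pattern matches the text, or nil.
def identify(text, types)
  types.find { |type| type.id_pattern.match?(text) }
end

types = [
  DocumentType.new(name: "My Bank", re_id: 'Account ID: 12334-\w{3}'),
  DocumentType.new(name: "Water Bill") # no re_id, so the name is used
]

match = identify("Statement\nAccount ID: 12334-ABC\n31 de julio de 2025", types)
puts match ? match.name : "no match" # prints "My Bank"
```

Returning the first match keeps the sketch simple; how the tool resolves a PDF that matches several document types is not stated here.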
Example configuration:

    ---
    lookup_dirs: # Directories where all PDFs will be analyzed
      - ~/Downloads
    destination_base_path: ~/PDFs # Directory where all matching documents will be copied (MUST exist)
    document_types:
      - name: My Bank # Description (type)
        re_id: 'Account ID: 12334-\w{3}' # [OPTIONAL (uses name as fallback)] RegEx to match from PDF content as document identifier
        re_date: '\d{1,2} de (\w+) de (\d+)' # Date RegEx (to extract from PDF content)
        store_path: "{year}/bank_docs" # Relative path to copy this document
        name_template: '{period} {name}' # [OPTIONAL] Template for new filename when copied

Store Path and Name Template support the following placeholders:
| Placeholder | Description | Example |
|---|---|---|
| {original} | Original filename | MyBankDocument2.pdf |
| {period} | Year-Month | 2022-07 |
| {year} | Year | 2022 |
| {month} | Month | 07 |
| {day} | Day (if captured) | 01 |
| {quarter} | Quarter (Q1-Q4) | Q3 |
| {bimester} | Bimester (B1-B6) | B4 |
| {name} | Document type name | My Bank |
The period, year, month, day, quarter and bimester placeholders are calculated from the date captured by the re_date regular expression.
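
As a rough illustration of how these values could be derived and substituted, here is a small Ruby sketch. The helper names `placeholders_for` and `render` are hypothetical, and the sketch assumes a complete date is available, so the "day if captured" case is not modeled.

```ruby
require "date"

# Builds the placeholder values for one document.
# Quarter and bimester are computed with integer division on the month.
def placeholders_for(date, name:, original:)
  {
    "original" => original,
    "period"   => date.strftime("%Y-%m"),
    "year"     => date.year.to_s,
    "month"    => format("%02d", date.month),
    "day"      => format("%02d", date.day),
    "quarter"  => "Q#{(date.month - 1) / 3 + 1}",
    "bimester" => "B#{(date.month - 1) / 2 + 1}",
    "name"     => name
  }
end

# Replaces every {placeholder} in the template; unknown ones are left untouched.
def render(template, values)
  template.gsub(/\{(\w+)\}/) { |token| values.fetch(token[1..-2], token) }
end

values = placeholders_for(Date.new(2022, 7, 1), name: "My Bank", original: "MyBankDocument2.pdf")
puts render("{year}/bank_docs", values) # prints "2022/bank_docs"
puts render("{period} {name}", values)  # prints "2022-07 My Bank"
```

For July, the integer division gives Q3 and B4, which agrees with the examples in the table.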
The re_date regex extracts date information from the PDF content:
| Date text | RegEx | Captured |
|---|---|---|
| 01/02/2025 | (?<d>\d{2})\/(?<m>\d{2})\/(?<y>\d{4}) | d: 01, m: 02, y: 2025 |
| 072025 - | (?<m>\d{2})(?<y>\d{4}) - | m: 07, y: 2025 |
| 31 de julio de 2025 | \d{1,2} de (\w+) de (\d+) | month: julio, year: 2025 |
Named captures supported: y for year, m for month, d for day.
If named captures are not used, the regex groups will be matched in order: month, year.
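
The capture rules can be sketched in Ruby as follows. This is an illustration under assumptions, not the gem's code: `extract_date_parts`, `month_number`, and the `MONTHS` table are hypothetical, and the set of month names the tool actually understands is not stated here.

```ruby
# Hypothetical month-name lookup (Spanish names, to fit the example above).
MONTHS = %w[enero febrero marzo abril mayo junio julio agosto
            septiembre octubre noviembre diciembre].freeze

# Converts "07" or "julio" to a month number; returns nil for unknown names.
def month_number(value)
  return value.to_i if value.match?(/\A\d+\z/)

  index = MONTHS.index(value.downcase)
  index && index + 1
end

# Applies re_date to the text and returns year, month and day (day may be nil).
# Named captures y, m, d are used when present; otherwise the groups are read
# in order: month, year.
def extract_date_parts(text, re_date)
  match = Regexp.new(re_date).match(text)
  return nil unless match

  if match.names.empty?
    month, year = match.captures
    { year: year.to_i, month: month_number(month), day: nil }
  else
    day = match.names.include?("d") ? match[:d]&.to_i : nil
    { year: match[:y].to_i, month: month_number(match[:m]), day: day }
  end
end

p extract_date_parts("31 de julio de 2025", '\d{1,2} de (\w+) de (\d+)')
p extract_date_parts("01/02/2025", '(?<d>\d{2})\/(?<m>\d{2})\/(?<y>\d{4})')
```

The named branch assumes that both y and m are defined whenever named captures are used, as in the table rows above.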
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run rake install. To release a new version, run rake bump, and then run rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

    rake install
    # step by step
    gem build pdfh.gemspec
    gem install pdfh-*

To release a new version, run:

    rake bump
    rake release

This will create a git tag for the version, push git commits and tags, and upload the .gem file to rubygems.org.

    npm install -g @commitlint/cli @commitlint/config-conventional
    commitlint --from origin --to @

Bug reports and pull requests are welcome on GitHub at https://github.com/iax7/pdfh. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
The gem is available as open source under the terms of the MIT License.
Everyone interacting in the Pdfh project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.
Run with verbose output:

    pdfh -v

Run in dry-run mode (no files will be moved):

    pdfh --dry

Show version:

    pdfh --version