MdEhsanAhsan/CustomTextParser

📂 CustomTextParser

🔄 Concordance .DAT File Toolkit

Easily convert and manipulate Concordance .DAT load files for legal e-discovery, metadata extraction, and bulk processing.


🛠 What It Does

A powerful Python CLI tool designed to handle complex .DAT files with custom delimiters (þ, control characters), broken encodings, and Excel-incompatible data.

This tool can:

  • ✅ Convert .DAT to .CSV and .CSV back to .DAT
  • 🔍 Compare two .DAT files (with optional field mapping)
  • 🧠 Replace or remap headers
  • 🔗 Merge multiple .DAT files intelligently
  • 🧹 Delete rows based on field values
  • 🎯 Extract and export selected fields


⚡ Cython Acceleration (v1.1+)

This tool now uses Cython-compiled quote-aware parsing for maximum speed on large .DAT files.

🚀 Performance Gain

| File Size | Rows  | Before (Pure Python) | Now (Cython) |
|-----------|-------|----------------------|--------------|
| 131 MB    | ~90k  | ~17 sec              | 3.45 sec     |
| 204 MB    | ~1.1M | ~52 sec              | 13.56 sec    |
| 1.06 GB   | ~5.7M | ~300 sec             | 64.39 sec    |

✅ Quote-safe, newline-tolerant, and 4–5× faster than the previous version.

🧱 How It Works

A custom parser module (quote_split_chunked.pyx) is written in Cython and compiled to a native .pyd extension, enabling fast, chunked line processing while preserving quote-state logic.
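
The idea of preserving quote state can be illustrated in pure Python. The sketch below is not the code in quote_split_chunked.pyx; it only shows the general technique under the assumption that þ (U+00FE) is the text qualifier. A physical line with an odd number of qualifiers flips the "inside a quoted field" state, and a record is emitted only once that state is closed again. The name iter_records is hypothetical.

```python
QUOTE = "\u00fe"  # þ, assumed text qualifier


def iter_records(lines, quote=QUOTE):
    """Yield logical records, joining physical lines that end inside a quoted field."""
    buffer = []
    inside_quotes = False
    for line in lines:
        # An odd number of qualifiers on this line toggles the quote state.
        if line.count(quote) % 2 == 1:
            inside_quotes = not inside_quotes
        buffer.append(line)
        if not inside_quotes:
            yield "".join(buffer)
            buffer = []
    if buffer:
        # Unterminated quote at end of input: emit what is left.
        yield "".join(buffer)


if __name__ == "__main__":
    physical_lines = [
        "\u00feD1\u00fe\x14\u00fefirst line\n",
        "second line\u00fe\n",
        "\u00feD2\u00fe\x14\u00feplain\u00fe\n",
    ]
    for record in iter_records(physical_lines):
        print(repr(record))
```

Any iterable of lines works as input, including an open file handle, so the whole file never has to be held in memory.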

🛠 Compiling the Cython Module

Install a C compiler first:

  • Windows: Visual C++ Build Tools
  • Linux: sudo apt install build-essential python3-dev
  • macOS: xcode-select --install

Then build:

python setup.py build_ext --inplace

⚙️ Key Features

  • Handles Concordance .DAT files with embedded line breaks
  • Supports various encodings: UTF-8, UTF-16, Windows-1252, and more
  • Flags fields that exceed Excel's 32,767-character cell limit
  • CLI-first design, ideal for automation and scripting

🚀 Use Cases

  • Legal eDiscovery processing
  • Metadata cleanup and normalization
  • Custom conversions and field extraction
  • Comparing vendor-delivered load files

📦 Installation

Clone the repo

git clone https://github.com/yourusername/dat-file-tool.git
cd dat-file-tool

Install dependencies (optional)

  • Python 3.7+
  • Requires: chardet
  • Optional: cython for native-speed parsing

✨ Features

  • ✅ Convert .dat to .csv, .csv to .dat, or keep as .dat
  • 🔀 Compare two .dat files (with optional header mapping)
  • 🧹 Delete specific rows from .dat using a value list
  • 🔁 Merge .dat files by common headers
  • 🔤 Auto-detect encoding (UTF-8, UTF-16, Windows-1252, Latin-1)
  • 💬 Smart line reader handles embedded newlines and quoted fields
  • 📁 Output directory support via -o DIR
  • ⚠️ Excel field-length warning for long text fields (>32,767 chars)
  • 🎯 Select only specific fields from a DAT file using --select

| Flag | Description |
|------|-------------|
| --csv | Export DAT file to CSV format (comma-separated values) |
| --tsv | Export DAT file to TSV format (tab-separated values) |
| --dat | Export to DAT format (default if none specified) |
| --c, --compare | Compare two DAT files line by line |
| --r, --replace-header | Replace headers using a mapping file (old_header,new_header) |
| --merge | Merge multiple DAT files grouped by matching headers |
| --delete | Delete rows based on field values listed in a file |
| --select | Export only selected fields from the DAT file |
| --join | Strictly join two DAT files using a key field, with duplicate header conflict resolution |
| --key | Key field required to perform a join |
| -o, --output-dir | Specify output directory for generated files |
| --reorder-header, --reorder | Reorder headers based on a specified order file |
| --split | Split converted output into N files (even split) |
| --max-rows | Maximum rows per output file (e.g., 10000) |
| --group-by | Keep groups (by FIELD) intact when splitting |

🧪 Usage Examples

🔁 Convert DAT to CSV / TSV

python Main.py input.dat --csv
# Output: input_converted.csv

python Main.py input.dat --tsv
# Output: input_converted.tsv
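
For readers curious what the CSV step involves, the following is a rough sketch using Python's standard csv module, which takes care of quoting values that contain commas, quotes, or line breaks. It assumes the DAT file has already been parsed into a list of rows; rows_to_csv is a hypothetical name and this is not the tool's own writer.

```python
import csv
import io


def rows_to_csv(rows, delimiter=","):
    """Serialize parsed rows (lists of strings) to CSV text; pass "\\t" for TSV."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    rows = [
        ["DocID", "Subject"],
        ["DOC001", "Re: contract, draft 2"],
        ["DOC002", "line one\nline two"],
    ]
    print(rows_to_csv(rows))
```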

✂️ Split Output into Multiple Files ✅

Split the converted output into multiple files either by number of files or by maximum rows per file. Use --group-by to keep related rows (families) intact.

# 1) Evenly split into 3 files
python Main.py input.dat --csv --split 3
# Output: input_part1.csv, input_part2.csv, input_part3.csv

# 2) Split into files containing up to 10,000 rows each
python Main.py input.dat --csv --max-rows 10000
# Output: input_part1.csv, input_part2.csv, ... (each up to 10000 rows)

# 3) Keep families intact while splitting into 3 files (group by 'Family' header)
python Main.py input.dat --csv --split 3 --group-by Family
# Output: each file contains whole families; no family is split across files

# Note: If a single family's row count exceeds the requested --max-rows, that family will be placed alone in a file with a warning.
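
The family-preserving behaviour described in the note above can be sketched as follows. This is an illustration only, assuming rows of one family are adjacent in the file; split_keep_groups is a hypothetical helper, not the tool's implementation. An oversized family ends up alone in its own chunk.

```python
from itertools import groupby


def split_keep_groups(rows, key_index, max_rows):
    """Pack rows into chunks of at most max_rows without splitting a group."""
    chunks = []
    current = []
    # groupby only merges adjacent rows, hence the adjacency assumption.
    for _, group in groupby(rows, key=lambda row: row[key_index]):
        family = list(group)
        # Start a new chunk if this family would push the current one over the limit.
        if current and len(current) + len(family) > max_rows:
            chunks.append(current)
            current = []
        current.extend(family)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    rows = [["1", "A"], ["2", "A"], ["3", "B"], ["4", "C"], ["5", "C"], ["6", "C"]]
    for number, chunk in enumerate(split_keep_groups(rows, 1, 3), start=1):
        print(number, chunk)
```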

You can also specify custom output paths:

python Main.py input.dat --csv output.csv


2. Compare Two Files

Compare two DAT files and generate a detailed difference report:

# Simple comparison
python Main.py file1.dat file2.dat --compare

# With header mapping (useful for comparing files with different headers)
python Main.py file1.dat file2.dat --compare --mapping mapping.txt

Mapping File Format (mapping.txt):

OldHeader1,NewHeader1
OldHeader2,NewHeader2

Output: Creates file1_diff.csv containing all differences with SHA256 hashes for verification.
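
As an illustration of hash-based comparison, the sketch below hashes each row with SHA-256 and reports the row positions where the two files disagree. The report layout and the names row_hash and diff_rows are assumptions made for this example; the actual columns of file1_diff.csv are not documented here.

```python
import hashlib


def row_hash(fields):
    """Return the SHA-256 hex digest of a row, joining fields with DC4."""
    joined = "\x14".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def diff_rows(rows_a, rows_b):
    """Compare rows by position; return (row_number, hash_a, hash_b) for mismatches."""
    diffs = []
    for i in range(max(len(rows_a), len(rows_b))):
        # A missing row on either side is represented by an empty hash.
        hash_a = row_hash(rows_a[i]) if i < len(rows_a) else ""
        hash_b = row_hash(rows_b[i]) if i < len(rows_b) else ""
        if hash_a != hash_b:
            diffs.append((i + 1, hash_a, hash_b))
    return diffs


if __name__ == "__main__":
    a = [["D1", "x"], ["D2", "y"]]
    b = [["D1", "x"], ["D2", "z"], ["D3", "w"]]
    for row_number, hash_a, hash_b in diff_rows(a, b):
        print(row_number, hash_a[:12], hash_b[:12])
```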


3. Replace Headers

Replace or rename column headers using a mapping file:

python Main.py data.dat --replace-header mapping.txt

Mapping File Format:

OldName,NewName
Age,PersonAge
Score,TestScore

Output: Creates data_Replaced.dat with renamed headers.
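
Conceptually, header replacement is a dictionary lookup. The sketch below assumes each mapping line has the form old,new and that headers without a mapping entry are left unchanged; parse_mapping and rename_headers are hypothetical names used only for illustration.

```python
def parse_mapping(text):
    """Parse 'old,new' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        old, new = line.split(",", 1)
        mapping[old.strip()] = new.strip()
    return mapping


def rename_headers(headers, mapping):
    """Return headers with mapped names substituted; unmapped names pass through."""
    return [mapping.get(name, name) for name in headers]


if __name__ == "__main__":
    mapping = parse_mapping("Age,PersonAge\nScore,TestScore\n")
    print(rename_headers(["Name", "Age", "Score"], mapping))
```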


4. Select Specific Fields

Extract only selected columns from a file:

python Main.py data.dat --select fields.txt

Select File Format (fields.txt):

Name
Email
Age

Output: Creates data_selected.dat containing only the specified fields.
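
Field selection amounts to projecting each row onto a list of column indexes. A minimal sketch, assuming the file is already parsed into a header list and row lists (select_fields is a hypothetical helper, and raising on unknown fields is a design choice of this example, not documented behaviour of the tool):

```python
def select_fields(headers, rows, wanted):
    """Keep only the wanted columns, in the order they are requested."""
    missing = [name for name in wanted if name not in headers]
    if missing:
        raise KeyError("Fields not found: " + ", ".join(missing))
    indexes = [headers.index(name) for name in wanted]
    return list(wanted), [[row[i] for i in indexes] for row in rows]


if __name__ == "__main__":
    headers = ["ID", "Name", "Email", "Age"]
    rows = [["1", "Ann", "ann@example.com", "41"]]
    print(select_fields(headers, rows, ["Name", "Email", "Age"]))
```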


5. Delete Rows

Remove rows matching specific field values:

python Main.py data.dat --delete delete_list.txt

Delete File Format (delete_list.txt):

Status
Inactive
Deleted
Suspended

First line specifies the field, subsequent lines are values to delete.

Output:

  • Creates data{kept}.dat (rows to keep)
  • Creates data{removed}.dat (rows deleted)
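
The kept/removed split can be pictured as a single pass that partitions rows by whether the chosen field's value appears in the delete list. The sketch below follows the delete-file format described above (first line is the field name, the rest are values); the function names are hypothetical and exact, case-sensitive matching is an assumption.

```python
def parse_delete_list(text):
    """Return (field_name, set_of_values) from the delete-list text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[0], set(lines[1:])


def partition_rows(headers, rows, field, values):
    """Split rows into (kept, removed) based on the value in the given field."""
    column = headers.index(field)
    kept, removed = [], []
    for row in rows:
        if row[column] in values:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


if __name__ == "__main__":
    field, values = parse_delete_list("Status\nInactive\nDeleted\nSuspended\n")
    headers = ["ID", "Status"]
    rows = [["1", "Active"], ["2", "Inactive"], ["3", "Deleted"]]
    print(partition_rows(headers, rows, field, values))
```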

6. Join Two Files

Perform a strict inner join on two DAT files based on key fields:

python Main.py file1.dat file2.dat --join --key "UserID"

# Multiple key fields
python Main.py file1.dat file2.dat --join --key "UserID Department"

Features:

  • Validates key field existence in both files
  • Detects and handles duplicate headers with three resolution modes:
    1. Suffix mode: Adds _2 to file2 column names
    2. File1 mode: Keeps file1 values (default)
    3. File2 mode: Overwrites with file2 values
  • Detects and reports duplicate keys with error handling

Output: Creates file1_joined.dat containing merged data.
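
The join behaviour can be approximated with a dictionary keyed on the key fields. This sketch implements only the suffix mode (conflicting file2 headers get _2 appended) and raises on duplicate keys; it is an illustration under those assumptions, and inner_join is a hypothetical name rather than the tool's API.

```python
def inner_join(headers1, rows1, headers2, rows2, keys):
    """Inner-join two parsed files on the given key fields (suffix mode only)."""
    for key in keys:
        if key not in headers1 or key not in headers2:
            raise KeyError("Key field missing from one of the files: " + key)

    def index_rows(headers, rows):
        positions = [headers.index(key) for key in keys]
        table = {}
        for row in rows:
            value = tuple(row[p] for p in positions)
            if value in table:
                raise ValueError("Duplicate key: " + repr(value))
            table[value] = row
        return table

    left = index_rows(headers1, rows1)
    right = index_rows(headers2, rows2)
    # Non-key columns of file2 are appended; name clashes get a _2 suffix.
    extra = [i for i, name in enumerate(headers2) if name not in keys]
    out_headers = list(headers1) + [
        headers2[i] + "_2" if headers2[i] in headers1 else headers2[i] for i in extra
    ]
    out_rows = [
        list(row) + [right[value][i] for i in extra]
        for value, row in left.items()
        if value in right
    ]
    return out_headers, out_rows


if __name__ == "__main__":
    h1, r1 = ["UserID", "Name"], [["1", "Ann"], ["2", "Bob"]]
    h2, r2 = ["UserID", "Name", "Dept"], [["2", "Robert", "Ops"], ["3", "Cy", "HR"]]
    print(inner_join(h1, r1, h2, r2, ["UserID"]))
```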


7. Merge Multiple Files

Merge multiple DAT files with automatic header validation:

python Main.py merge_list.txt --merge

Merge List Format (merge_list.txt):

/path/to/file1.dat
/path/to/file2.dat
/path/to/file3.dat

Features:

  • Groups files by header hash
  • Creates separate output files for each group
  • Generates merge log with file counts and row statistics
  • Validates file existence and readability
  • Excludes problematic files with detailed warnings

Output:

  • Creates merge_list_group_1.dat, merge_list_group_2.dat, etc.
  • Creates merge_list_merge_log.csv with merge statistics
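
Grouping by header hash can be sketched as below: files whose header rows are identical produce the same digest and land in the same group. The choice of SHA-256 and the names header_hash and group_by_header are assumptions for this example; the tool's actual hash function is not stated here.

```python
import hashlib


def header_hash(headers):
    """Digest of the header row; identical headers give identical digests."""
    return hashlib.sha256("\x14".join(headers).encode("utf-8")).hexdigest()


def group_by_header(file_headers):
    """Group file paths by header hash. file_headers maps path -> header list."""
    groups = {}
    for path, headers in file_headers.items():
        groups.setdefault(header_hash(headers), []).append(path)
    return list(groups.values())


if __name__ == "__main__":
    files = {
        "file1.dat": ["DocID", "Custodian"],
        "file2.dat": ["DocID", "Custodian"],
        "file3.dat": ["DocID", "Author", "Date"],
    }
    for number, group in enumerate(group_by_header(files), start=1):
        print("group", number, group)
```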

⚙️ Optional Arguments

| Flag | Description |
|------|-------------|
| -o DIR | Set output directory |
| --help | Show help message |

📦 Output Files

  • All exports go to the directory specified by -o, or default to the input file's folder.
  • Output filenames include tags like {kept}, {removed}, or _Replaced.

💡 Encoding Detection Logic

Handles common encodings reliably:

  • ✅ UTF-8
  • ✅ UTF-8 with BOM
  • ✅ UTF-16 LE / BE (BOM detection)
  • 🔍 Falls back to chardet for uncertain cases (based on confidence)
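
One plausible way to arrange the checks listed above is: BOM first, then a strict UTF-8 decode, then chardet if it is installed. The sketch below is an assumption about ordering, not the tool's code; the 0.7 confidence threshold and the final cp1252 fallback are illustrative choices, and detect_encoding is a hypothetical name.

```python
import codecs


def detect_encoding(raw, min_confidence=0.7):
    """Guess the text encoding of a byte string sampled from a file."""
    # 1) Byte-order marks are unambiguous, so check them first.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # 2) Bytes that decode strictly as UTF-8 are very likely UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3) Otherwise ask chardet, if available, and trust only confident guesses.
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(raw)
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            return guess["encoding"]
    # 4) Assumed last resort for legacy Windows exports.
    return "cp1252"


if __name__ == "__main__":
    print(detect_encoding(codecs.BOM_UTF8 + b"abc"))
    print(detect_encoding("\u00feDocID\u00fe".encode("utf-8")))
```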

🧪 Excel Limit Check

Warns if any field exceeds Excel's max cell limit (32,767 chars).
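
A check of this kind can be written in a few lines. The sketch below assumes parsed headers and rows and returns the offending cells instead of printing warnings; find_long_fields is a hypothetical helper, not the tool's function.

```python
EXCEL_CELL_LIMIT = 32767  # maximum characters Excel stores in one cell


def find_long_fields(headers, rows, limit=EXCEL_CELL_LIMIT):
    """Return (row_number, field_name, length) for every value over the limit."""
    hits = []
    for row_number, row in enumerate(rows, start=1):
        for name, value in zip(headers, row):
            if len(value) > limit:
                hits.append((row_number, name, len(value)))
    return hits


if __name__ == "__main__":
    headers = ["DocID", "ExtractedText"]
    rows = [["D1", "x" * 32767], ["D2", "x" * 32768]]
    for row_number, name, length in find_long_fields(headers, rows):
        print("Row", row_number, "field", name, "has", length, "characters")
```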


📝 Requirements

  • Python 3.7+
  • Dependencies: see requirements.txt

🧰 Development Tips

VS Code Debug Setup (optional)

Add .vscode/launch.json:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Merge Example",
      "type": "python",
      "request": "launch",
      "program": "${workspaceFolder}/Main.py",
      "console": "integratedTerminal",
      "args": ["--merge", "File_list.csv", "--csv", "-o", "merged/"]
    }
  ]
}

🤝 Contributing

Feel free to fork, enhance, or report issues! Contributions are welcome 💬


👤 Author

Md Ehsan Ahsan 📧 MyGitHub 🛠️ Built with love using Python 🐍


⚠️ Disclaimer

This tool is provided as-is without any warranties.
Use it at your own risk.
I am not responsible if it eats your files, breaks your computer, or ruins your spreadsheet.

🚀 But hey, if it helps you automate the boring stuff, you're welcome! 😄


📝 License

This project is free to use under the MIT License.