Easily convert and manipulate Concordance .DAT load files: perfect for legal e-discovery, metadata extraction, and bulk processing.
A powerful Python CLI tool designed to handle complex .DAT files with custom delimiters (þ, control characters), broken encodings, and Excel-incompatible data.
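For context: Concordance load files conventionally use ASCII 20 (`\x14`) as the field separator and ASCII 254 (`þ`) as the text qualifier. A minimal illustration of splitting one such line (a sketch for readers unfamiliar with the format, not this tool's parser):

```python
# Sketch: split one Concordance .DAT line, assuming the conventional
# delimiters \x14 (field separator) and \xfe (þ, the text qualifier).
def split_dat_line(line):
    fields = line.rstrip("\r\n").split("\x14")
    # Strip the surrounding þ qualifiers from each field.
    return [f.strip("\xfe") for f in fields]

row = "\xfeDOCID001\xfe\x14\xfeSmith, John\xfe\x14\xfeActive\xfe"
print(split_dat_line(row))  # ['DOCID001', 'Smith, John', 'Active']
```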
This tool can:
- ✅ Convert `.DAT` to `.CSV` to `.DAT`
- 🔍 Compare two `.DAT` files (with optional field mapping)
- 🔧 Replace or remap headers
- 🔀 Merge multiple `.DAT` files intelligently
- 🧹 Delete rows based on field values
- 🎯 Extract and export selected fields
This tool now uses Cython-compiled quote-aware parsing for maximum speed on large .DAT files.
| File Size | Rows | Before (Pure Python) | Now (Cython) |
|---|---|---|---|
| 131 MB | ~90k | ~17 sec | 3.45 sec |
| 204 MB | ~1.1M | ~52 sec | 13.56 sec |
| 1.06 GB | ~5.7M | ~300 sec | 64.39 sec |
✅ Quote-safe, newline-tolerant, and 4–5× faster than the previous version.
A custom parser module (`quote_split_chunked.pyx`) is written in Cython and compiled to a native extension (`.pyd` on Windows, `.so` on Linux/macOS), enabling fast, chunked line processing while preserving quote-state logic.
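The quote-state idea can be sketched in pure Python (a simplified stand-in for the compiled module, ignoring `\r` handling and chunking): a newline ends a record only when the reader is outside a þ-quoted value, so embedded line breaks survive.

```python
# Sketch: yield logical records from text that may contain embedded
# newlines inside \xfe-quoted fields. Tracks quote state per character.
QUOTE = "\xfe"

def iter_records(text):
    in_quotes = False
    buf = []
    for ch in text:
        if ch == QUOTE:
            in_quotes = not in_quotes
        if ch == "\n" and not in_quotes:
            yield "".join(buf)
            buf = []
        else:
            buf.append(ch)
    if buf:
        yield "".join(buf)

data = "\xfeA\xfe\x14\xfeline1\nline2\xfe\n\xfeB\xfe\x14\xfeC\xfe\n"
print(list(iter_records(data)))  # two logical records, embedded \n kept
```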
Install a C compiler first:
- Windows: Visual C++ Build Tools
- Linux: `sudo apt install build-essential python3-dev`
- macOS: `xcode-select --install`
Then build:
```
python setup.py build_ext --inplace
```

- Handles Concordance `.DAT` files with embedded line breaks
- Supports various encodings: `UTF-8`, `UTF-16`, `Windows-1252`, and more
- Robust parsing even with Excel's 32,767-character cell limit
- CLI-first design β ideal for automation and scripting
- Legal eDiscovery processing
- Metadata cleanup and normalization
- Custom conversions and field extraction
- Comparing vendor-delivered load files
- 📥 Download EXE
```
git clone https://github.com/yourusername/dat-file-tool.git
cd dat-file-tool
```

- Python 3.7+
- Requires: `chardet`
- Optional: `cython` for native-speed parsing
- ✅ Convert `.dat` to `.csv` | `.csv` to `.dat`, or keep as `.dat`
- 🔍 Compare two `.dat` files (with optional header mapping)
- 🧹 Delete specific rows from `.dat` using a value list
- 🔀 Merge `.dat` files by common headers
- 🔤 Auto-detect encoding (UTF-8, UTF-16, Windows-1252, Latin-1)
- 🔬 Smart line reader handles embedded newlines and quoted fields
- 📁 Output directory support via `-o DIR`
- ⚠️ Excel field-length warning for long text fields (>32,767 chars)
- 🎯 Select only specific fields from a DAT file using `--select`
| Feature | Description |
|---|---|
| `--csv` | Export DAT file to CSV format (comma-separated values) |
| `--tsv` | Export DAT file to TSV format (tab-separated values) |
| `--dat` | Export to DAT format (default if none specified) |
| `--c, --compare` | Compare two DAT files line by line |
| `--r, --replace-header` | Replace headers using a mapping file (`old_header,new_header`) |
| `--merge` | Merge multiple DAT files grouped by matching headers |
| `--delete` | Delete rows based on field values listed in a file |
| `--select` | Export only selected fields from the DAT file |
| `--join` | Strictly join two DAT files using a key field, with duplicate-header conflict resolution |
| `--key` | Key field required to perform the join |
| `-o, --output-dir` | Specify output directory for generated files |
| `--reorder-header, --reorder` | Reorder headers based on a specified order file |
| `--split` | Split converted output into N files (even split) |
| `--max-rows` | Maximum rows per output file (e.g., 10000) |
| `--group-by` | Keep groups (by FIELD) intact when splitting |
```
python Main.py input.dat --csv
# Output: input_converted.csv

python Main.py input.dat --tsv
# Output: input_converted.tsv
```

Split the converted output into multiple files either by number of files or by maximum rows per file. Use `--group-by` to keep related rows (families) intact.
```
# 1) Evenly split into 3 files
python Main.py input.dat --csv --split 3
# Output: input_part1.csv, input_part2.csv, input_part3.csv

# 2) Split into files containing up to 10,000 rows each
python Main.py input.dat --csv --max-rows 10000
# Output: input_part1.csv, input_part2.csv, ... (each up to 10000 rows)

# 3) Keep families intact while splitting into 3 files (group by 'Family' header)
python Main.py input.dat --csv --split 3 --group-by Family
# Output: each file contains whole families; no family is split across files
# Note: If a single family's row count exceeds the requested --max-rows,
# that family will be placed alone in a file with a warning.
```

You can also specify custom output paths:
```
python Main.py input.dat --csv output.csv
```

Compare two DAT files and generate a detailed difference report:
```
# Simple comparison
python Main.py file1.dat file2.dat --compare

# With header mapping (useful for comparing files with different headers)
python Main.py file1.dat file2.dat --compare --mapping mapping.txt
```

Mapping File Format (`mapping.txt`):
```
OldHeader1,NewHeader1
OldHeader2,NewHeader2
```
Output: Creates file1_diff.csv containing all differences with SHA256 hashes for verification.
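The hash-based verification can be sketched as follows (illustrative only; `row_hash` and `diff_rows` are hypothetical names, not the tool's internals): each row is hashed with SHA256 so differing rows can be detected and later re-verified.

```python
import hashlib

def row_hash(row):
    # Stable SHA256 over the joined field values of one row.
    return hashlib.sha256("\x14".join(row).encode("utf-8")).hexdigest()

def diff_rows(rows1, rows2):
    # Pair rows positionally; report rows whose hashes differ.
    diffs = []
    for i, (a, b) in enumerate(zip(rows1, rows2), start=1):
        if row_hash(a) != row_hash(b):
            diffs.append((i, a, b, row_hash(a), row_hash(b)))
    return diffs

r1 = [["1", "Alice"], ["2", "Bob"]]
r2 = [["1", "Alice"], ["2", "Bobby"]]
print(diff_rows(r1, r2)[0][:3])  # (2, ['2', 'Bob'], ['2', 'Bobby'])
```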
Replace or rename column headers using a mapping file:
```
python Main.py data.dat --replace-header mapping.txt
```

Mapping File Format:
```
OldName,NewName
Age,PersonAge
Score,TestScore
```
Output: Creates data_Replaced.dat with renamed headers.
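Conceptually, header replacement rewrites only the header record; a minimal sketch (hypothetical helper names, not the tool's code):

```python
def load_mapping(lines):
    # Each non-empty line: "old_header,new_header".
    pairs = (line.strip().split(",", 1) for line in lines if line.strip())
    return {old: new for old, new in pairs}

def replace_headers(headers, mapping):
    # Unmapped headers pass through unchanged.
    return [mapping.get(h, h) for h in headers]

mapping = load_mapping(["Age,PersonAge", "Score,TestScore"])
print(replace_headers(["Name", "Age", "Score"], mapping))
# ['Name', 'PersonAge', 'TestScore']
```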
Extract only selected columns from a file:
```
python Main.py data.dat --select fields.txt
```

Select File Format (`fields.txt`):
```
Name
Email
Age
```
Output: Creates data_selected.dat containing only the specified fields.
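Field selection boils down to index lookups on the header row; a minimal sketch (hypothetical helper, not the tool's implementation):

```python
def select_fields(headers, rows, wanted):
    # Map each wanted header to its column index;
    # raises ValueError if a header is missing.
    idx = [headers.index(h) for h in wanted]
    return [[row[i] for i in idx] for row in rows]

headers = ["Name", "Email", "Age", "Status"]
rows = [["Alice", "a@x.com", "30", "Active"]]
print(select_fields(headers, rows, ["Name", "Email", "Age"]))
# [['Alice', 'a@x.com', '30']]
```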
Remove rows matching specific field values:
```
python Main.py data.dat --delete delete_list.txt
```

Delete File Format (`delete_list.txt`):
```
Status
Inactive
Deleted
Suspended
```
First line specifies the field, subsequent lines are values to delete.
Output:
- Creates `data{kept}.dat` (rows to keep)
- Creates `data{removed}.dat` (rows deleted)
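The kept/removed partition can be sketched like this (hypothetical helper, not the tool's implementation):

```python
def split_rows(headers, rows, field, values):
    # Partition rows into (kept, removed) by the value in `field`.
    i = headers.index(field)
    values = set(values)
    kept = [r for r in rows if r[i] not in values]
    removed = [r for r in rows if r[i] in values]
    return kept, removed

headers = ["Name", "Status"]
rows = [["A", "Active"], ["B", "Inactive"], ["C", "Deleted"]]
kept, removed = split_rows(headers, rows, "Status", ["Inactive", "Deleted"])
print(len(kept), len(removed))  # 1 2
```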
Perform a strict inner join on two DAT files based on key fields:
```
python Main.py file1.dat file2.dat --join --key "UserID"

# Multiple key fields
python Main.py file1.dat file2.dat --join --key "UserID Department"
```

Features:
- Validates key field existence in both files
- Detects and handles duplicate headers with three resolution modes:
  - Suffix mode: adds `_2` to file2 column names
  - File1 mode: keeps file1 values (default)
  - File2 mode: overwrites with file2 values
- Detects and reports duplicate keys with error handling
Output: Creates file1_joined.dat containing merged data.
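The strict join can be sketched as follows (a simplified version that raises on duplicate keys and skips the duplicate-header resolution modes described above):

```python
def inner_join(h1, rows1, h2, rows2, key):
    # Strict join: the key must exist in both header rows;
    # duplicate keys in file2 are treated as an error.
    i1, i2 = h1.index(key), h2.index(key)
    index = {}
    for r in rows2:
        if r[i2] in index:
            raise ValueError("duplicate key in file2: " + r[i2])
        index[r[i2]] = r
    joined_headers = h1 + [h for h in h2 if h != key]
    out = []
    for r in rows1:
        match = index.get(r[i1])
        if match is not None:
            extra = [v for h, v in zip(h2, match) if h != key]
            out.append(r + extra)
    return joined_headers, out

h, rows = inner_join(["UserID", "Name"], [["1", "A"], ["2", "B"]],
                     ["UserID", "Dept"], [["1", "IT"]], "UserID")
print(h, rows)  # ['UserID', 'Name', 'Dept'] [['1', 'A', 'IT']]
```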
Merge multiple DAT files with automatic header validation:
```
python Main.py merge_list.txt --merge
```

Merge List Format (`merge_list.txt`):
```
/path/to/file1.dat
/path/to/file2.dat
/path/to/file3.dat
```
Features:
- Groups files by header hash
- Creates separate output files for each group
- Generates merge log with file counts and row statistics
- Validates file existence and readability
- Excludes problematic files with detailed warnings
Output:
- Creates `merge_list_group_1.dat`, `merge_list_group_2.dat`, etc.
- Creates `merge_list_merge_log.csv` with merge statistics
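Grouping by header hash can be sketched like this (hypothetical helper; the real tool additionally validates readability and writes a merge log):

```python
import hashlib

def group_by_header(files):
    # files: {filename: header_list}. Group filenames whose headers
    # match exactly, keyed by a hash of the joined header row.
    groups = {}
    for name, headers in files.items():
        h = hashlib.sha256("\x14".join(headers).encode("utf-8")).hexdigest()
        groups.setdefault(h, []).append(name)
    return list(groups.values())

files = {
    "a.dat": ["ID", "Name"],
    "b.dat": ["ID", "Name"],
    "c.dat": ["ID", "Email"],
}
print(group_by_header(files))  # [['a.dat', 'b.dat'], ['c.dat']]
```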
| Flag | Description |
|---|---|
| `-o DIR` | Set output directory |
| `--help` | Show help message |
- All exports go to the directory specified by `-o`, or default to the input file's folder.
- Output filenames include tags like `{kept}`, `{removed}`, or `_Replaced`.
Handles common encodings reliably:
- ✅ UTF-8
- ✅ UTF-8 with BOM
- ✅ UTF-16 LE / BE (BOM detection)
- 🔍 Uses `chardet` fallback for uncertain cases (based on confidence)
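BOM-based detection can be sketched like this (stdlib-only sketch; the `chardet` fallback is omitted to keep the example self-contained, and the default encoding here is an assumption):

```python
# Sniff an encoding from a file's leading bytes via BOM signatures.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_encoding(raw, default="windows-1252"):
    for bom, enc in BOMS:
        if raw.startswith(bom):
            return enc
    # No BOM: a real tool would hand off to chardet here.
    return default

print(sniff_encoding(b"\xef\xbb\xbfName\x14Age"))  # utf-8-sig
print(sniff_encoding(b"Name\x14Age"))              # windows-1252
```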
Warns if any field exceeds Excel's max cell limit (32,767 chars).
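Checking for Excel-unsafe fields is straightforward; a sketch (hypothetical helper name):

```python
EXCEL_CELL_LIMIT = 32767  # Excel's maximum characters per cell

def oversized_fields(headers, row):
    # Return (header, length) for any field Excel would truncate.
    return [(h, len(v)) for h, v in zip(headers, row)
            if len(v) > EXCEL_CELL_LIMIT]

row = ["ok", "x" * 40000]
print(oversized_fields(["A", "B"], row))  # [('B', 40000)]
```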
- Python 3.7+
- Dependencies (see `requirements.txt`)
Add .vscode/launch.json:
```json
{
  "name": "Debug Merge Example",
  "type": "python",
  "request": "launch",
  "program": "${workspaceFolder}/Main.py",
  "console": "integratedTerminal",
  "args": [
    "--merge", "File_list.csv", "--csv", "-o", "merged/"
  ]
}
```

Feel free to fork, enhance, or report issues! Contributions are welcome 💬
Md Ehsan Ahsan 📧 MyGitHub 🛠️ Built with love using Python 🐍
This tool is provided as-is without any warranties.
Use it at your own risk.
I am not responsible if it eats your files, breaks your computer, or ruins your spreadsheet. 😅 But hey, if it helps you automate the boring stuff, you're welcome! 😉
This project is free to use under the MIT License.
