Filo - Forensic Intelligence & Learning Operator

Battle-tested file forensics platform for security professionals

Filo transforms unknown binary blobs into classified, repairable, and explainable artifacts with offline ML learning capabilities.

Features

  • 🔍 Deep File Analysis: Multi-layered signature, structural, and ZIP container analysis
  • 🎯 Smart Format Detection: Distinguishes DOCX/XLSX/PPTX, ODT/ODP/ODS, ZIP, JAR, APK, EPUB
  • 🧠 Enhanced ML Learning: Discriminative pattern extraction, rich statistical features, n-gram profiling
  • 🔧 Intelligent Repair: Reconstruct corrupted headers automatically with 21 repair strategies
  • 📊 Flexible Output: Concise evidence display (top 3 by default), full details with -a/--all-evidence
  • 😎 Confidence Breakdown: Auditable detection with --explain flag (court-ready transparency)
  • 🛡️ Contradiction Detection: Identifies malware, polyglots, structural anomalies (malware triage)
  • 🕵️ Embedded Detection: Find files hidden inside files - ZIP in EXE, PNG after EOF (malware hunter candy)
  • 🔧 Tool Fingerprinting: Identify how/when/with what tools a file was created (forensic attribution)
  • ⚠️ Polyglot Detection (NEW v0.2.5): Detect dual-format files (GIFAR, PNG+ZIP, PDF+JS) with risk assessment
  • 🖥️ CPU Architecture Detection (NEW v0.2.8): Automatic detection of CPU architecture for executables (90+ architectures: x86, ARM, RISC-V, Xtensa, MIPS, etc.)
  • 🎨 zsteg-Compatible Steganography (v0.2.7): 60+ bit plane LSB/MSB extraction (PNG/BMP), auto base64 decoding, file type detection, CTF-optimized
  • 🌐 PCAP Analysis (v0.2.6): Network capture file analysis with protocol detection, string extraction, base64 decoding, flag hunting
  • 🚀 Batch Processing: Parallel directory analysis with configurable workers
  • 🔗 Hash Lineage Tracking: Cryptographic chain-of-custody for court evidence
  • 📦 Container Detection: Deep ZIP-based format inspection for Office and archive formats
  • ⚡ Performance Profiling: Identify bottlenecks in large-scale analysis
  • 🎨 Enhanced CLI: Color-coded output, hex dumps, repair suggestions
  • 🧹 Easy Maintenance: Reset ML model and lineage database with simple commands

Quick Start

Option 1: Easy Install (.deb package)

# Clone and build
git clone https://github.com/supunhg/Filo
cd Filo
./build-deb.sh

# Install
sudo dpkg -i filo-forensics_0.2.8_all.deb

Option 2: From Source

git clone https://github.com/supunhg/Filo
cd Filo
pip install -e .

Usage:

# Analyze unknown file
filo analyze suspicious.bin

# Identify CPU architecture (ELF/PE/Mach-O executables)
filo analyze binary  # Shows: x86-64, ARM64, Xtensa, etc.

# Detect steganography (zsteg-compatible with auto base64 decoding)
filo stego challenge.png  # CTF flag hunting
filo stego image.png --all  # Show all 60+ bit plane results
filo stego image.png --extract="b1,rgba,lsb,xy" -o flag.txt

# Analyze PCAP network capture files
filo pcap capture.pcap

# Show detailed confidence breakdown (forensic-grade)
filo analyze --explain file.bin

# Show all detection evidence and embedded artifacts
filo analyze -a -e file.bin

# Analyze with JSON output
filo analyze --json file.bin > report.json

# Teach ML about a file format
filo teach correct_file.zip -f zip

# Batch process directory
filo batch ./directory

# Repair corrupted file
filo repair --format=png broken_image.bin

# Reset ML model or lineage database
filo reset-ml -y
filo reset-lineage -y

Installation

📦 Easy Install (Recommended) - Debian/Ubuntu

The easiest way to install Filo is to build and install the .deb package:

# Clone repository
git clone https://github.com/supunhg/Filo
cd Filo

# Build .deb package
./build-deb.sh

# Install
sudo dpkg -i filo-forensics_0.2.8_all.deb

# Start using immediately
filo --version
filo analyze file.bin

Features:

  • ✅ Isolated installation at /opt/filo/ (no system conflicts)
  • ✅ Automatic dependency management
  • ✅ Global filo command (works from anywhere)
  • ✅ No manual virtual environment activation
  • ✅ Clean uninstall: sudo dpkg -r filo-forensics

Supported: Ubuntu 20.04+, Debian 11+, and compatible distributions

Note: All user data is stored in the .filo/ directory under the user's home directory (shown here as /home/user/.filo/):

  • ML model: /home/user/.filo/learned_patterns.pkl
  • Lineage database: /home/user/.filo/lineage.db

From Source (Development)

git clone https://github.com/supunhg/Filo
cd Filo
pip install -e .

Development Setup

# Clone and install with dev dependencies
git clone https://github.com/supunhg/Filo
cd Filo
pip install -e ".[dev]"

# Run tests
pytest

Usage Examples

Python API

from filo import Analyzer, RepairEngine
from filo.batch import analyze_directory
from filo.export import export_to_file
from filo.container import analyze_archive

# Analyze file with ML enabled
analyzer = Analyzer(use_ml=True)
result = analyzer.analyze_file("unknown.bin")
print(f"Detected: {result.primary_format} ({result.confidence:.0%})")
print(f"Alternatives: {result.alternative_formats[:3]}")

# View detection evidence
for evidence in result.evidence_chain[:3]:
    print(f"  {evidence['module']}: {evidence['confidence']:.0%}")

# Teach ML about correct format
with open("sample.zip", "rb") as f:
    analyzer.teach(f.read(), "zip")

# Batch process directory
batch_result = analyze_directory("./data", recursive=True)
print(f"Analyzed {batch_result.analyzed_count} files")

# Export to JSON/SARIF
export_to_file(result, "report.json", format="json")

# Analyze container (DOCX, ZIP, etc.)
container = analyze_archive("document.docx")
for entry in container.entries:
    print(f"{entry.path}: {entry.format}")

# Repair file
repair = RepairEngine()
repaired_data, report = repair.repair_file("corrupt.png")

CLI

# Analysis with limited evidence (default: top 3)
filo analyze suspicious.bin

# Show all evidence and embedded artifacts
filo analyze -a -e suspicious.bin

# Show detailed confidence breakdown (auditable, court-ready)
filo analyze --explain file.bin

# Combine for full transparency
filo analyze --explain -a -e file.bin

# Disable ML for pure signature detection
filo analyze --no-ml file.bin

# Analysis with JSON output
filo analyze --json suspicious.bin

# Detect embedded files (ZIP in EXE, PNG after EOF)
filo analyze malware.exe -e

# Identify tool/creator fingerprints
filo analyze document.pdf  # Automatically fingerprints

# Batch processing with export
filo batch ./directory --export=sarif --output=scan.sarif

# Teach ML about file formats
filo teach correct_file.zip -f zip
filo teach image.png -f png

# Reset ML model or lineage database
filo reset-ml -y
filo reset-lineage -y

# Export to JSON for scripting
filo analyze --json file.bin | jq '.primary_format'

# Security: Detect embedded malware in documents
filo analyze suspicious.docx  # Automatically checks for contradictions

# Automation: Filter files with critical contradictions
filo analyze *.docx --json | \
  jq 'select(.contradictions[]? | .severity == "critical")'

# Check for hidden files
filo analyze *.png --json | \
  jq 'select(.embedded_objects | length > 0)'

# Chain-of-custody: Query file transformation lineage
filo lineage $(sha256sum repaired.png | cut -d' ' -f1)

# View lineage history
filo lineage-history --operation repair

# Export lineage for court
filo lineage $FILE_HASH --format json --output chain-of-custody.json
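
Where jq is not available, the same filters can be approximated in Python. This is a minimal sketch: it assumes each --json report is a single object carrying the contradictions (with a severity key) and embedded_objects fields used in the jq filters above. The helper names are hypothetical, and the rest of the report schema is not assumed.

```python
import json


def has_critical_contradiction(report: dict) -> bool:
    """True if any contradiction entry has severity "critical"."""
    return any(
        entry.get("severity") == "critical"
        for entry in report.get("contradictions") or []
    )


def has_embedded_objects(report: dict) -> bool:
    """True if the report lists at least one embedded object."""
    return len(report.get("embedded_objects") or []) > 0


# Hypothetical report, shaped only after the fields the jq filters reference.
sample = json.loads("""
{
  "primary_format": "docx",
  "contradictions": [{"severity": "critical"}],
  "embedded_objects": []
}
""")

print(has_critical_contradiction(sample))
print(has_embedded_objects(sample))
```

In practice the report would come from a file written with filo analyze --json file.bin > report.json and loaded with json.load.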

Key Improvements

ZIP-Based Format Detection

Filo now accurately distinguishes between ZIP-based formats by inspecting container contents:

  • Office Open XML: DOCX, PPTX, XLSX (via [Content_Types].xml)
  • OpenDocument: ODT, ODP, ODS (via mimetype file)
  • Archives: JAR, APK, EPUB, plain ZIP
  • Large files: Efficient handling of files >10MB using file path access
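
The idea of classifying a ZIP container by its contents can be sketched with the standard library alone. This is an illustration, not Filo's implementation: the function name classify_zip, the use of top-level directory prefixes to tell DOCX/XLSX/PPTX apart, and the APK/JAR marker entries are assumptions made for the example.

```python
import io
import zipfile

# Assumed heuristic: OOXML packages keep their parts under these directories.
OOXML_DIRS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}
# OpenDocument and EPUB declare their type in a top-level "mimetype" entry.
MIMETYPES = {
    "application/vnd.oasis.opendocument.text": "odt",
    "application/vnd.oasis.opendocument.presentation": "odp",
    "application/vnd.oasis.opendocument.spreadsheet": "ods",
    "application/epub+zip": "epub",
}


def classify_zip(source) -> str:
    """Classify a ZIP container by its entries; source is a path or file object."""
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" in names:
            for prefix, fmt in OOXML_DIRS.items():
                if any(name.startswith(prefix) for name in names):
                    return fmt
        if "mimetype" in names:
            mime = zf.read("mimetype").decode("ascii", "replace").strip()
            if mime in MIMETYPES:
                return MIMETYPES[mime]
        # Check APK before JAR: APKs usually carry a JAR-style manifest too.
        if "AndroidManifest.xml" in names:
            return "apk"
        if "META-INF/MANIFEST.MF" in names:
            return "jar"
    return "zip"


# Build a tiny in-memory ODT-like container to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
print(classify_zip(buf))
```

Passing a file path instead of a bytes buffer lets zipfile read only the entries it needs, which is one way to keep memory use low on large archives.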

Enhanced ML Features

Three major improvements to machine learning detection:

  1. Discriminative Pattern Extraction: Automatically discovers format-specific byte sequences
  2. Rich Feature Analysis: 8 statistical features including compression ratio, entropy, byte distribution
  3. N-gram Profiling: Fuzzy matching using top 100 byte trigrams for similarity detection
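
To make these feature types concrete, here is a minimal standard-library sketch of three of them: byte entropy, compression ratio, and a top-100 trigram profile. The function names and the Jaccard overlap used for fuzzy matching are assumptions chosen for illustration; the README does not specify how Filo computes or compares these features.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; low values mean redundancy."""
    if not data:
        return 1.0
    return len(zlib.compress(data)) / len(data)


def top_trigrams(data: bytes, k: int = 100) -> set[bytes]:
    """The k most frequent 3-byte sequences in data."""
    counts = Counter(data[i:i + 3] for i in range(len(data) - 2))
    return {gram for gram, _ in counts.most_common(k)}


def trigram_similarity(a: bytes, b: bytes, k: int = 100) -> float:
    """Jaccard overlap of the two top-k trigram sets (an assumed measure)."""
    ta, tb = top_trigrams(a, k), top_trigrams(b, k)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


text = b"the quick brown fox jumps over the lazy dog " * 20
print(round(shannon_entropy(text), 2))
print(round(compression_ratio(text), 2))
print(trigram_similarity(text, text[:200]))
```

High entropy combined with a compression ratio near 1.0 typically points at compressed or encrypted content, while trigram overlap helps separate formats whose headers are missing or damaged.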

Cleaner Output

Evidence display now shows only the top 3 most relevant items by default:

# Concise output (default)
filo analyze file.zip

# Full evidence when needed
filo analyze --all-evidence file.zip

Documentation

What's New in v0.2.6

🎨 Steganography Detection

Detect hidden data in image files and documents:

filo stego image.png

# Output:
# 🔍 Steganography Analysis: image.png
# 
# ✓ Potential Hidden Data Found (3 methods)
# 
# Method: b1,rgb,lsb,xy
#   Confidence: 95% (FLAG PATTERN DETECTED)
#   Data: picoCTF{h1dd3n_1n_LSB}

Features:

  • ✅ LSB/MSB Detection: Extract data from least/most significant bits (PNG, BMP)
  • ✅ Multiple Channels: Test RGB, RGBA, individual channels (r, g, b, a), BGR
  • ✅ Bit Orders: Both LSB and MSB with row/column-major ordering
  • ✅ PDF Metadata: Extract hidden flags from Author, Title, Subject, Keywords
  • ✅ Trailing Data: Detect data after JPEG EOI, PNG IEND, PDF EOF markers
  • ✅ Flag Recognition: Automatic CTF flag pattern detection (picoCTF{}, flag{}, HTB{})
  • ✅ Auto-Decode: Automatic base64 and zlib decompression
  • ✅ Extraction: Save specific channels/methods to files
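
The core of 1-bit LSB extraction can be sketched in a few lines. The example below works on a flat sequence of already-decoded channel samples (for instance R, G, B values in row-major order) rather than on a PNG file, and it packs the first extracted bit into the most significant bit of each output byte. Both choices, and the helper names, are assumptions for illustration; the sketch is not verified against Filo's or zsteg's exact bit ordering.

```python
def embed_lsb(samples: bytes, payload: bytes) -> bytes:
    """Demo helper: hide payload bits in the lowest bit of successive samples."""
    bits = [(byte >> (7 - i)) & 1 for byte in payload for i in range(8)]
    if len(bits) > len(samples):
        raise ValueError("payload too large for cover data")
    out = bytearray(samples)
    for index, bit in enumerate(bits):
        out[index] = (out[index] & 0xFE) | bit
    return bytes(out)


def extract_lsb(samples: bytes, n_bytes: int) -> bytes:
    """Collect the lowest bit of each sample and pack 8 bits per output byte."""
    n_bytes = min(n_bytes, len(samples) // 8)
    out = bytearray()
    for start in range(0, n_bytes * 8, 8):
        value = 0
        for sample in samples[start:start + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


# Stand-in for decoded pixel data: 256 channel samples.
cover = bytes(range(256))
stego = embed_lsb(cover, b"flag{demo}")
print(extract_lsb(stego, 10))
```

A full scanner repeats this over many combinations of channel, bit index, bit order, and pixel traversal order, then checks each result for flag patterns or known file signatures.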

Full Guide: Steganography Detection Documentation

🌐 PCAP Network Analysis

Quick triage for network capture files:

filo pcap dump.pcap

# Output:
# 📊 Statistics
#   Packets: 1,234
#   Protocols: TCP (800), UDP (400), ICMP (34)
# 
# 🚩 FLAGS FOUND (2)
#   picoCTF{n3tw0rk_f0r3n51c5}
#   flag{hidden_in_packets}
# 
# 📝 Base64 Data
#   cGljb0NURnsuLi59 → picoCTF{...}

Features:

  • ✅ Protocol Detection: IPv4, IPv6, TCP, UDP, ICMP, ARP
  • ✅ String Extraction: ASCII strings from packet payloads
  • ✅ Base64 Decoding: Automatic detection and decoding
  • ✅ Flag Hunting: CTF flag pattern search across all packets
  • ✅ HTTP Extraction: GET/POST requests and headers
  • ✅ Lightweight: No Wireshark/tshark dependency for quick triage
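
A dependency-free triage pass of this kind can be sketched as follows. The code reads the classic pcap layout (24-byte global header, 16-byte per-packet headers), detects endianness from the magic number, and searches each packet for flag patterns, both raw and after base64 decoding. The regular expressions and function names are assumptions for the example; pcapng parsing and TCP stream reassembly are out of scope here.

```python
import base64
import re
import struct

FLAG_RE = re.compile(rb"(?:picoCTF|flag|HTB)\{[^}]*\}")
B64_RE = re.compile(rb"[A-Za-z0-9+/]{16,}={0,2}")


def iter_pcap_packets(data: bytes):
    """Yield raw packet bytes from a classic (non-pcapng) capture."""
    magic = data[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"  # little-endian, microsecond or nanosecond timestamps
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"  # big-endian variants
    else:
        raise ValueError("not a classic pcap file")
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        _sec, _frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        yield data[offset:offset + incl_len]
        offset += incl_len


def find_flags(data: bytes) -> set[bytes]:
    """Collect flag-like strings from packets, including base64-encoded ones."""
    flags = set()
    for packet in iter_pcap_packets(data):
        flags.update(FLAG_RE.findall(packet))
        for candidate in B64_RE.findall(packet):
            try:
                decoded = base64.b64decode(candidate, validate=True)
            except ValueError:
                continue  # not valid base64 after all
            flags.update(FLAG_RE.findall(decoded))
    return flags


# Build a two-packet capture in memory to exercise the parser.
payloads = [
    b"GET /?q=flag{hidden_in_packets} HTTP/1.1",
    b"data: " + base64.b64encode(b"picoCTF{n3tw0rk_f0r3n51c5}"),
]
capture = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
for p in payloads:
    capture += struct.pack("<IIII", 0, 0, len(p), len(p)) + p
print(sorted(find_flags(capture)))
```

Because the search runs over whole packets, link-layer and IP headers do not need to be parsed for flag hunting; protocol statistics would require decoding those headers as a separate step.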

New Format Support:

  • 📦 PCAP/PCAPNG: Network capture files (little/big-endian)
  • 📜 Shell Archives (shar): Self-extracting shell script archives

Release History

v0.2.8 - CPU Architecture Detection (Latest)

🖥️ Major Enhancement: CPU Architecture Detection

Filo now automatically detects and reports CPU architecture for executable files:

filo analyze astronaut

# Output:
# 🖥️  CPU Architecture:
#   • Tensilica Xtensa Architecture (32-bit, Little-endian)
#     Format: ELF | Machine Code: 0x005E

Key Features:

  • ✅ 90+ architectures supported: x86, x86-64, ARM, ARM64, RISC-V, MIPS, PowerPC, Xtensa, SPARC, AVR, Alpha, IA-64, and many more
  • ✅ Three executable formats: ELF (Linux/Unix), PE/COFF (Windows), Mach-O (macOS/iOS)
  • ✅ Complete information: Architecture name, address width (32/64-bit), endianness, machine code
  • ✅ CTF-optimized: Instantly solve architecture identification challenges
  • ✅ Comprehensive testing: 24 tests covering all major architectures

Supported Architectures Include:

  • Common: x86, x86-64, ARM (32/64-bit), RISC-V, MIPS, PowerPC
  • Embedded: Xtensa (IoT/WiFi), AVR (Atmel), SuperH, M68k
  • Specialized: SPARC, Alpha AXP, IA-64 (Itanium), S390 (mainframe)
  • Exotic: VAX, PDP-10/11, TMS320C6000, Elbrus e2k, BPF
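
For ELF files, the reported fields come straight from the file header: byte 4 gives the address width, byte 5 the endianness, and the 16-bit e_machine field at offset 18 the architecture. The sketch below reads those fields with the standard library. The function name and the deliberately short machine table are assumptions for illustration; PE and Mach-O use different header layouts that are not covered here.

```python
import struct

# Small excerpt of ELF e_machine values; the full registry is much longer.
EM_NAMES = {
    0x03: "x86",
    0x08: "MIPS",
    0x14: "PowerPC",
    0x28: "ARM",
    0x3E: "x86-64",
    0x5E: "Tensilica Xtensa",
    0xB7: "ARM64 (AArch64)",
    0xF3: "RISC-V",
}


def elf_architecture(header: bytes) -> dict:
    """Decode class, endianness, and machine from the first 20 bytes of an ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(header[4])       # EI_CLASS
    endian = {1: "<", 2: ">"}.get(header[5])   # EI_DATA
    if bits is None or endian is None:
        raise ValueError("invalid ELF identification bytes")
    (machine,) = struct.unpack_from(endian + "H", header, 18)  # e_machine
    return {
        "architecture": EM_NAMES.get(machine, "unknown"),
        "bits": bits,
        "endianness": "little" if endian == "<" else "big",
        "machine_code": f"0x{machine:04X}",
    }


# Synthetic 32-bit little-endian header with e_machine = 0x5E (Xtensa).
demo = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9) + struct.pack("<HH", 2, 0x5E)
print(elf_architecture(demo))
```

For a real binary, reading the first 20 bytes with open(path, "rb").read(20) is enough input for this function.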

Documentation: See docs/ARCHITECTURE_DETECTION.md for complete guide

📊 Test Coverage: 24 new tests (100% passing)
🎯 CTF Ready: Solves architecture challenges in one command

v0.2.7 - zsteg-Compatible Steganography

✨ Major Enhancement: zsteg Algorithm Compatibility

Filo's steganography detection now matches the industry-standard zsteg tool exactly:

Key Features:

  • ✅ 60+ bit plane configurations tested per image
  • ✅ Byte-for-byte identical extraction compared to zsteg
  • ✅ Multi-bit extraction (b1, b2, b4) with correct nibble/byte packing
  • ✅ Auto base64 decoding - shows decoded flags directly (improvement over zsteg!)
  • ✅ File type detection - OpenPGP keys, Targa, Applesoft BASIC, Alliant
  • ✅ Smart result filtering - hides metadata noise by default
  • ✅ zsteg-style output - familiar format for CTF players

Also in v0.2.7:

  • Reduced embedded object false positives (confidence threshold 0.70 → 0.80)
  • Added format exclusion rules (skip WASM/ICO patterns in ELF/PE binaries)
  • Parent format awareness in embedded detection

Testing:

  • Validated on CTF challenge images (picoCTF)
  • Algorithm verification against zsteg reference output
  • Multi-bit extraction tested (b2, b4 bit planes)

📊 Test Coverage: 85%+ (all tests passing)

Full Details: RELEASE_v0.2.7.md

v0.2.6 - Steganography & PCAP Analysis

✨ New Features:

  • Steganography detection (LSB/MSB analysis, PDF metadata, trailing data)
  • PCAP network capture analysis with flag hunting
  • Enhanced output filtering

v0.2.5 - Polyglot & Dual-Format Detection

⚠️ Major New Feature: Polyglot & Dual-Format Detection

Filo can now detect files that are simultaneously valid in multiple formats:

filo analyze suspicious_image.gif

# Output:
# ⚠ Polyglot Detected:
#   • GIF + JAR - GIF + JAR hybrid (GIFAR attack) (91%)
#     Risk: HIGH | Pattern: gifar

Supported Polyglot Patterns:

  • GIFAR (GIF+JAR) - HIGH RISK: Classic attack vector for bypassing image filters
  • PDF + JavaScript - HIGH RISK: Malicious PDFs with embedded JS payloads
  • PE + ZIP - HIGH RISK: Windows executables that are also ZIP archives
  • PNG + ZIP - MEDIUM RISK: Images with hidden ZIP archives
  • JPEG + ZIP - MEDIUM RISK: JPEG files with embedded archives
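
Image+ZIP polyglots work because image parsers read from the start of a file while ZIP readers locate the archive from the end. A minimal sketch of that check, using only the standard library, is shown below. The function name and the three image signatures are assumptions for the example; it does not assign risk levels or confidence scores, and it does not separate JAR from plain ZIP (which would additionally require a META-INF/MANIFEST.MF entry).

```python
import io
import zipfile

IMAGE_MAGICS = {
    "gif": (b"GIF87a", b"GIF89a"),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
}


def detect_image_zip_polyglot(data: bytes):
    """Return e.g. "gif+zip" if data starts like an image and also parses as ZIP."""
    for name, magics in IMAGE_MAGICS.items():
        if data.startswith(magics):
            # zipfile searches for the end-of-central-directory record near
            # the end of the data, so bytes before the archive are tolerated.
            if zipfile.is_zipfile(io.BytesIO(data)):
                return f"{name}+zip"
            return None
    return None


# Build a GIFAR-style sample: GIF header bytes followed by a real ZIP archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("payload.txt", "hello")
gifar = b"GIF89a" + bytes(32) + archive.getvalue()
print(detect_image_zip_polyglot(gifar))
```

A stricter check would also validate the image structure itself, since a matching signature alone does not prove the file renders as an image.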

Key Features:

  • ✅ Multi-format validation (PNG, GIF, JPEG, ZIP, JAR, RAR, PDF, PE, ELF)
  • ✅ Security risk assessment (HIGH, MEDIUM, LOW)
  • ✅ Confidence scoring (70-98%)
  • ✅ JavaScript payload detection in PDFs
  • ✅ Demo polyglot files for testing
  • ✅ Comprehensive test suite (26 new tests)

Documentation: See docs/POLYGLOT_DETECTION.md for complete guide

📊 Test Coverage: 67% overall (173/173 tests passing, +26 polyglot tests)
🎯 Supported Formats: 60+ file formats
🔬 Detection Accuracy: 95%+ on clean files, 70%+ on corrupted files

v0.2.4 - Embedded Detection & Tool Fingerprinting (Previous)

✨ Enhancements:

  1. Embedded Object Detection - Find files hidden inside files (ZIP in EXE, PNG after EOF, polyglots)
  2. Tool Fingerprinting - Identify creation tools, versions, OS, timestamps (forensic attribution)
  3. Short Flags - -a for all evidence, -e for all embedded artifacts
  4. Reset Commands - filo reset-ml and filo reset-lineage for easy maintenance
  5. Demo Files - Sophisticated test files in demo/ directory
  6. Hash Lineage Tracking - Cryptographic chain-of-custody for all transformations
  7. Format Contradiction Detection - Identifies malware, polyglots, embedded executables
  8. Confidence Decomposition - Auditable detection with --explain flag
  9. ZIP Container Analysis - Accurate DOCX/XLSX/PPTX/ODT/ODP/ODS detection
  10. Enhanced ML Learning - Pattern extraction, rich features, n-gram profiling

📊 147/147 tests passing

Contributing

We welcome contributions! Priority areas:

  • Format specifications (YAML)
  • Analysis plugins
  • Test corpus samples
  • Performance optimizations

Security & Safety

Filo is designed with security in mind:

  • Non-destructive analysis (unless explicitly requested with repair commands)
  • Resource-limited processing
  • Input validation at all layers
  • No external network calls (fully offline ML)

Author

Supun Hewagamage (@supunhg)


When you need to know not just what something is, but why it's that, and how to fix it.

About

Forensic file intelligence & repair. When file says "data," Filo knows what it really is and can fix it. Built for CTFs, forensics, and security professionals.
