Automated File Classification and Malicious Intent Scoring using Python and Machine Learning for Digital Forensic Triage
- Overview
- Key Features
- Problem Statement
- Installation
- Usage
- How It Works
- Technology Stack
- Project Architecture
- Configuration
- Contributing
- Research Background
- Roadmap
- License
- Contact
Scanlytic-ForensicAI is an AI-driven forensic triage system for digital forensic analysis. It automatically classifies files and assigns each one a malicious intent score based on extracted features, with the goal of improving the speed and accuracy of digital forensic investigations.
Note: This project is currently under active development. The documentation below describes the planned architecture, features, and functionality that are being built based on applied research in digital forensics.
Grounded in applied research on the security aspects of digital forensics, Scanlytic-ForensicAI is intended to help security personnel, forensic analysts, and incident responders quickly identify and prioritize potentially malicious files in large datasets.
In modern digital forensics, analysts must examine massive volumes of data within tight timeframes. Manual analysis is time-consuming and prone to human error. Scanlytic-ForensicAI aims to address these challenges by:
- Automating the initial triage process
- Prioritizing files based on malicious intent scores
- Reducing analysis time from hours to minutes
- Improving detection accuracy through machine learning
- Enabling forensic teams to focus on high-priority threats
Intelligent File Classification
- Automatic categorization of files based on type, content, and behavior
- Support for multiple file formats (executables, documents, archives, scripts, etc.)
- Binary and multi-class classification models
Malicious Intent Scoring
- Scoring algorithm that assigns each file a malicious intent score from 0 to 100
- Real-time threat assessment based on extracted features
- Configurable threshold settings for different security policies
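As a rough illustration of configurable thresholds, the sketch below maps a 0-100 score to a coarse risk label. The function name `risk_level` and the label strings are hypothetical; the default cut-offs of 50 and 75 mirror the `malicious_threshold` and `high_risk_threshold` values in the sample configuration and are not a confirmed implementation.

```python
def risk_level(score: float,
               malicious_threshold: float = 50,
               high_risk_threshold: float = 75) -> str:
    """Map a 0-100 malicious intent score to a coarse risk label (illustrative)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in the range 0-100")
    if score >= high_risk_threshold:
        return "high"
    if score >= malicious_threshold:
        return "malicious"
    return "low"


if __name__ == "__main__":
    for value in (12.0, 50.0, 87.3):
        print(value, risk_level(value))
```

A stricter security policy would simply pass lower thresholds, for example `risk_level(score, malicious_threshold=30, high_risk_threshold=60)`.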
Feature Extraction Engine
- Static analysis of file properties (size, entropy, headers, metadata)
- PE header analysis for executables
- String extraction and pattern matching
- Hash computation (MD5, SHA1, SHA256)
- File signature detection
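Two of the static features listed above can be computed with the standard library alone. The following is a minimal sketch rather than the project's extractor: `shannon_entropy` and `file_hashes` are illustrative names, and entropy is measured in bits per byte (0 to 8), where values near 8 often indicate compressed or encrypted content.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def file_hashes(data: bytes) -> dict:
    """Compute MD5, SHA-1, and SHA-256 hex digests of data."""
    return {name: hashlib.new(name, data).hexdigest()
            for name in ("md5", "sha1", "sha256")}


if __name__ == "__main__":
    sample = b"MZ" + bytes(range(256))
    print(f"entropy: {shannon_entropy(sample):.2f} bits/byte")
    print(file_hashes(sample))
```

For large files, reading in chunks and calling `update()` on each hash object would avoid loading the whole file into memory.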
High-Performance Processing
- Batch processing capabilities for large datasets
- Parallel processing support
- Optimized for speed without compromising accuracy
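One way batch and parallel processing could be combined is with a worker pool from `concurrent.futures`. This is a sketch under assumptions: `analyze_batch` is a hypothetical helper, and `file_size` is only a stand-in for a real per-file analysis function.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_size(path: Path) -> int:
    """Stand-in for a real per-file analysis step (assumption)."""
    return path.stat().st_size


def analyze_batch(paths, analyze=file_size, workers: int = 4) -> dict:
    """Run analyze() over every path with a thread pool; returns {path: result}."""
    paths = list(paths)  # materialize so the input can be iterated twice
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs
        return dict(zip(paths, pool.map(analyze, paths)))


if __name__ == "__main__":
    results = analyze_batch(Path(".").glob("*.py"), workers=4)
    for path, size in results.items():
        print(path, size)
```

Threads suit I/O-bound work such as reading files; for CPU-heavy feature extraction, `ProcessPoolExecutor` offers the same `map` interface and may scale better.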
Reporting & Visualization
- Detailed analysis reports in multiple formats (JSON, CSV, HTML)
- Visual dashboards for threat distribution
- Timeline analysis of suspicious activities
- Export capabilities for integration with SIEM systems
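To illustrate the JSON and CSV report formats, the sketch below serializes a list of result dictionaries with the standard library. The helper names `to_json` and `to_csv` are hypothetical, and the field names follow the sample output shown in this README rather than a finalized schema.

```python
import csv
import io
import json


def to_json(results: list) -> str:
    """Serialize analysis results as pretty-printed JSON."""
    return json.dumps(results, indent=2)


def to_csv(results: list,
           fields=("file", "classification", "malicious_score")) -> str:
    """Serialize selected fields as CSV; keys not listed in fields are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


if __name__ == "__main__":
    results = [{"file": "suspicious.exe", "classification": "malicious",
                "malicious_score": 87.3, "confidence": 0.94}]
    print(to_json(results))
    print(to_csv(results))
```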
Continuous Learning
- Model retraining capabilities
- Integration with threat intelligence feeds
- Adaptive learning from analyst feedback
Digital forensic investigations are increasingly challenging due to:
- Volume: Massive amounts of data requiring analysis
- Velocity: Time-sensitive nature of incident response
- Variety: Diverse file types and attack vectors
- Complexity: Sophisticated evasion techniques used by malware
Traditional manual triage methods cannot keep pace with these challenges. Security personnel need intelligent tools that can:
- Quickly identify suspicious files in large datasets
- Accurately assess threat levels
- Provide actionable intelligence
- Scale with growing data volumes
Scanlytic-ForensicAI solves these problems through machine learning-powered automation, enabling faster and more accurate forensic triage.
Before installing Scanlytic-ForensicAI, ensure you have:
- Python 3.8 or higher
- pip (Python package manager)
- Virtual environment tool (recommended: venv or conda)
- Git
- At least 4GB RAM (8GB recommended)
- 2GB free disk space
- Clone the Repository
git clone https://github.com/rohteemie/Scanlytic-ForensicAI.git
cd Scanlytic-ForensicAI
- Create Virtual Environment
# Using venv
python -m venv venv
# Activate on Linux/Mac
source venv/bin/activate
# Activate on Windows
venv\Scripts\activate
- Install Dependencies
# Install core dependencies
pip install -r requirements.txt
# Install development dependencies (optional)
pip install -r requirements-dev.txt
- Download Pre-trained Models (if available)
python scripts/download_models.py
- Verify Installation
python -m scanlytic --version
python -m scanlytic --health-check

Docker provides the easiest way to get started with Scanlytic-ForensicAI:
# Build Docker image
docker build -t scanlytic-forensicai .
# Run analysis on a file
docker run -v /path/to/files:/data scanlytic-forensicai analyze /data/file.exe
# Run analysis on a directory with report output
docker run -v /path/to/files:/data -v /path/to/reports:/reports \
    scanlytic-forensicai analyze /data -o /reports/report.json

For detailed Docker usage, see the Docker Guide.
New to Docker or digital forensics? Check out our comprehensive guides:
- Beginner's Guide - For students and new programmers
- Non-Technical User Guide - For non-technical users
- Architecture Guide - Understanding the system design
- Development Process Guide - How we built this
Analyze a Single File
python -m scanlytic analyze /path/to/suspicious_file.exe
Analyze a Directory
python -m scanlytic analyze /path/to/directory --recursive
Batch Processing with Custom Output
python -m scanlytic analyze /path/to/files \
    --output report.json \
    --format json \
    --threshold 50 \
    --verbose

python -m scanlytic analyze /path/to/files \
    --config custom_config.yaml \
    --model custom_model.pkl \
    --workers 4

python -m scanlytic analyze /path/to/files \
    --report-type detailed \
    --output-format html \
    --include-visuals

from scanlytic import ForensicAnalyzer
# Initialize analyzer
analyzer = ForensicAnalyzer(
    model_path='models/classifier.pkl',
    config='config.yaml'
)

# Analyze single file
result = analyzer.analyze_file('/path/to/file.exe')
print(f"Classification: {result.classification}")
print(f"Malicious Score: {result.score}")
print(f"Features: {result.features}")

# Analyze directory
results = analyzer.analyze_directory(
    '/path/to/directory',
    recursive=True,
    threshold=50
)

# Generate report
analyzer.generate_report(
    results,
    output_path='report.html',
    format='html'
)

{
  "file": "suspicious.exe",
  "classification": "malicious",
  "malicious_score": 87.3,
  "confidence": 0.94,
  "features": {
    "file_size": 245760,
    "entropy": 7.2,
    "file_type": "PE32 executable",
    "sections": 5,
    "imports": ["kernel32.dll", "advapi32.dll"],
    "suspicious_strings": ["cmd.exe", "powershell", "download"]
  },
  "threat_indicators": [
    "High entropy suggests packing/encryption",
    "Suspicious API calls detected",
    "Contains obfuscated strings"
  ],
  "recommendation": "Quarantine and investigate further"
}

┌─────────────────┐
│   Input Files   │
│  (Any Format)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   File Intake   │
│  & Validation   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Feature     │
│   Extraction    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ML Model     │
│ Classification  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Malicious Intent│
│     Scoring     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Report      │
│   Generation    │
└─────────────────┘

- File Ingestion: System accepts individual files or directories for analysis
- Preprocessing: File validation, format detection, and metadata extraction
- Feature Engineering: Extraction of relevant features including:
- Static properties (size, type, timestamps)
- Content-based features (entropy, byte distribution)
- Structural features (PE headers, file signatures)
- Behavioral indicators (strings, imports, exports)
- ML Classification: Trained models classify files into categories:
- Benign
- Suspicious
- Malicious
- Unknown
- Scoring Algorithm: Assigns a malicious intent score (0-100) based on:
- Classification confidence
- Feature weights
- Known threat patterns
- Historical data
- Result Aggregation: Compiles findings into comprehensive reports with:
- File classifications
- Risk scores
- Detailed analysis
- Recommendations
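The stages above can be sketched as a chain of small functions that fill in one result object. Everything here is a placeholder chosen for illustration: `TriageResult`, the two stand-in features, the rule-based `classify` (which replaces a trained model), and the scoring rule are assumptions, not the project's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TriageResult:
    """Accumulates what each pipeline stage learns about one file."""
    file: str
    features: dict = field(default_factory=dict)
    classification: str = "unknown"
    confidence: float = 0.0
    score: float = 0.0


def extract_features(data: bytes) -> dict:
    # Stand-in features: size plus a check for the "MZ" magic bytes of DOS/PE files.
    return {"file_size": len(data), "is_pe": data.startswith(b"MZ")}


def classify(features: dict) -> tuple:
    # Placeholder rule standing in for a trained model (assumption).
    if features["is_pe"]:
        return "suspicious", 0.6
    return "benign", 0.9


def score(classification: str, confidence: float) -> float:
    # A confident benign verdict scores low; any other verdict scales with confidence.
    risk = 1.0 - confidence if classification == "benign" else confidence
    return round(100 * risk, 1)


def run_pipeline(name: str, data: bytes) -> TriageResult:
    """Run feature extraction, classification, and scoring in order."""
    result = TriageResult(file=name)
    result.features = extract_features(data)
    result.classification, result.confidence = classify(result.features)
    result.score = score(result.classification, result.confidence)
    return result


if __name__ == "__main__":
    print(run_pipeline("sample.exe", b"MZ\x90\x00"))
```

Keeping each stage as a separate function makes it possible to swap the placeholder classifier for a real model without touching the rest of the chain.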
- Language: Python 3.8+
- Machine Learning:
- scikit-learn (Classification algorithms)
- TensorFlow/Keras (Deep learning models)
- XGBoost (Gradient boosting)
- pandas (Data manipulation)
- NumPy (Numerical computing)
- Static Analysis:
- pefile (PE file analysis)
- python-magic (File type detection)
- yara-python (Pattern matching)
- Feature Extraction:
- ssdeep (Fuzzy hashing)
- entropy calculation libraries
- regex patterns for string analysis
- Database: SQLite/PostgreSQL (Analysis results storage)
- Caching: Redis (Performance optimization)
- Serialization: pickle/joblib (Model persistence)
- Visualization:
- Matplotlib
- Seaborn
- Plotly (Interactive dashboards)
- Reporting:
- Jinja2 (HTML reports)
- ReportLab (PDF generation)
- Testing: pytest, unittest
- Code Quality: pylint, black, flake8
- Documentation: Sphinx
- Version Control: Git
Scanlytic-ForensicAI/
│
├── scanlytic/                  # Main package
│   ├── __init__.py
│   ├── analyzer.py             # Core analysis engine
│   ├── classifier.py           # ML classification models
│   ├── feature_extractor.py    # Feature extraction logic
│   ├── scorer.py               # Malicious intent scoring
│   ├── preprocessor.py         # Data preprocessing
│   └── utils/                  # Utility functions
│       ├── file_handler.py
│       ├── hash_utils.py
│       └── logger.py
│
├── models/                     # Pre-trained ML models
│   ├── classifier_v1.pkl
│   └── feature_scaler.pkl
│
├── config/                     # Configuration files
│   ├── default_config.yaml
│   └── logging_config.yaml
│
├── scripts/                    # Utility scripts
│   ├── train_model.py
│   ├── download_models.py
│   └── benchmark.py
│
├── tests/                      # Test suite
│   ├── test_analyzer.py
│   ├── test_classifier.py
│   └── test_features.py
│
├── docs/                       # Documentation
│   ├── API.md
│   ├── ARCHITECTURE.md
│   └── TRAINING.md
│
├── examples/                   # Example usage scripts
│   └── basic_analysis.py
│
├── data/                       # Sample data (not in repo)
│   ├── benign/
│   └── malicious/
│
├── requirements.txt            # Python dependencies
├── requirements-dev.txt        # Development dependencies
├── setup.py                    # Package setup
├── .gitignore
├── LICENSE
└── README.md

Central orchestrator that coordinates the analysis pipeline, managing data flow between components.
Implements various feature extraction techniques:
- Static file properties
- PE header analysis
- String extraction
- Entropy calculation
- Hash generation
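String extraction and signature checks are simple enough to sketch with the standard library. The snippet below is illustrative only: `extract_strings` and `detect_signature` are hypothetical names, the default minimum length of 4 matches `string_min_length` in the sample configuration, and the signature table covers just a few well-known magic byte sequences.

```python
import re

# A few well-known magic byte prefixes; a real table would be much larger.
SIGNATURES = {
    b"MZ": "PE/DOS executable",
    b"\x7fELF": "ELF executable",
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def extract_strings(data: bytes, min_length: int = 4) -> list:
    """Return runs of printable ASCII characters at least min_length long."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def detect_signature(data: bytes) -> str:
    """Identify a file type from its leading magic bytes, or 'unknown'."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


if __name__ == "__main__":
    blob = b"MZ\x90\x00\x03\x00cmd.exe\x00ab\x00powershell\xff"
    print(detect_signature(blob))
    print(extract_strings(blob))
```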
Machine learning models for file classification:
- Random Forest
- Gradient Boosting
- Neural Networks
- Ensemble methods
Sophisticated scoring algorithm that combines:
- Model predictions
- Feature weights
- Threat intelligence
- Historical patterns
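One plausible way to combine a model prediction with feature weights is a weighted blend, sketched below. This is an assumption for illustration, not the project's scoring algorithm: the function name, the indicator names, the weights, and the 0.7 `model_weight` are all hypothetical, and threat intelligence or historical patterns would enter as further indicators.

```python
def malicious_intent_score(model_probability: float,
                           indicators: dict,
                           weights: dict,
                           model_weight: float = 0.7) -> float:
    """Blend a model probability (0-1) with weighted indicator hits into a 0-100 score."""
    total_weight = sum(weights.values())
    # Sum the weights of the indicators that fired for this file.
    hit_weight = sum(w for name, w in weights.items() if indicators.get(name))
    indicator_ratio = hit_weight / total_weight if total_weight else 0.0
    blended = model_weight * model_probability + (1 - model_weight) * indicator_ratio
    # Clamp to [0, 1] before scaling so bad inputs cannot leave the 0-100 range.
    return round(100 * min(max(blended, 0.0), 1.0), 1)


if __name__ == "__main__":
    weights = {"high_entropy": 2.0, "suspicious_imports": 1.0, "obfuscated_strings": 1.0}
    indicators = {"high_entropy": True, "suspicious_imports": True}
    print(malicious_intent_score(0.9, indicators, weights))
```

A linear blend keeps the score easy to explain to an analyst, since each indicator's contribution can be read directly from its weight.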
Create a config.yaml file to customize behavior:
# Analysis Configuration
analysis:
  max_file_size: 104857600  # 100MB
  timeout: 300  # seconds
  parallel_workers: 4

# Feature Extraction
features:
  extract_strings: true
  string_min_length: 4
  calculate_entropy: true
  extract_pe_headers: true
  compute_hashes: ["md5", "sha1", "sha256"]

# Classification
classification:
  model_path: "models/classifier_v1.pkl"
  confidence_threshold: 0.7
  enable_ensemble: true

# Scoring
scoring:
  malicious_threshold: 50
  high_risk_threshold: 75
  weight_features: true

# Output
output:
  format: "json"  # json, csv, html
  verbose: true
  include_features: true
  save_samples: false

# Logging
logging:
  level: "INFO"
  file: "scanlytic.log"
  console: true

export SCANLYTIC_CONFIG=/path/to/config.yaml
export SCANLYTIC_MODEL_PATH=/path/to/models
export SCANLYTIC_LOG_LEVEL=DEBUG
export SCANLYTIC_WORKERS=8

We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, your help is appreciated.
- Fork the Repository
git clone https://github.com/rohteemie/Scanlytic-ForensicAI.git
cd Scanlytic-ForensicAI
- Create a Feature Branch
git checkout -b feature/your-feature-name
- Make Your Changes
  - Write clean, documented code
  - Follow PEP 8 style guidelines
  - Add tests for new features
  - Update documentation as needed
- Test Your Changes
pytest tests/
pylint scanlytic/
- Commit Your Changes
git add .
git commit -m "Add: brief description of changes"
- Push and Create Pull Request
git push origin feature/your-feature-name
# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
# Run tests
pytest tests/ -v --cov=scanlytic
# Check code quality
pylint scanlytic/
black scanlytic/
flake8 scanlytic/

- Follow Python PEP 8 style guide
- Write comprehensive tests for new features
- Document all public APIs
- Keep commits atomic and well-described
- Update README if adding new features
- Ensure all tests pass before submitting PR
Please read and follow our Code of Conduct to maintain a welcoming and inclusive community.
Machine Learning in Forensics
- Application of supervised learning to malware detection
- Feature engineering for file classification
- Model interpretability in security contexts
Automated Triage Systems
- Reducing manual analysis overhead
- Prioritization algorithms for forensic investigations
- Real-time threat assessment methodologies
Security Perspective
- Threat modeling and risk assessment
- Attack vector analysis
- Evasion technique detection
The system is being developed using:
- Analysis of real-world forensic datasets
- Collaboration with security professionals
- Iterative testing and validation
- Benchmarking against existing tools
- Core file classification engine
- Basic feature extraction
- Malicious intent scoring
- Command-line interface
- JSON/CSV report generation
- Web-based dashboard
- Enhanced PE analysis
- Support for archive files
- Integration with VirusTotal API
- Real-time monitoring capabilities
- Dynamic analysis integration
- Memory forensics support
- Network traffic analysis
- Behavioral analysis engine
- Custom YARA rule support
- Enterprise features
- Multi-user support
- REST API
- Plugin architecture
- Cloud deployment options
- Advanced reporting and analytics
- Distributed processing for large-scale investigations
- Integration with major SIEM platforms
- Mobile application for field analysis
- Automated threat hunting capabilities
- Community-driven threat intelligence sharing
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Rotimi Owolabi
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

Author: Rotimi Owolabi
- GitHub: @rohteemie
- Project Link: https://github.com/rohteemie/Scanlytic-Forensic-AI
- Email: Contact via GitHub
- Issues: Report bugs or request features via GitHub Issues
- Discussions: Join community discussions on GitHub Discussions
- Security: Report security vulnerabilities privately via GitHub Security Advisories
This project was inspired by my love for security systems and by the need for accessible, efficient forensic tools that keep pace with modern cyber threats and make investigations easier for security professionals at all levels.
Note: This project is under active development. Features and documentation are subject to change.
If you find this project useful, please consider giving it a star! ⭐
Made with ❤️ for the digital forensics and cybersecurity community