86 changes: 86 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/.dlp-scanner.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# ©AngelaMos | 2026
# .dlp-scanner.yml

scan:
  file:
    max_file_size_mb: 100
    recursive: true
    exclude_patterns:
      - "*.pyc"
      - "__pycache__"
      - ".git"
      - "node_modules"
      - ".venv"
    include_extensions:
      - ".pdf"
      - ".docx"
      - ".xlsx"
      - ".xls"
      - ".csv"
      - ".json"
      - ".xml"
      - ".yaml"
      - ".yml"
      - ".txt"
      - ".log"
      - ".eml"
      - ".msg"
      - ".parquet"
      - ".avro"
      - ".tar.gz"
      - ".tar.bz2"
      - ".zip"

  database:
    sample_percentage: 5
    max_rows_per_table: 10000
    timeout_seconds: 30
    exclude_tables: []
    include_tables: []

  network:
    bpf_filter: ""
    entropy_threshold: 7.2
    dns_label_entropy_threshold: 4.0
    max_packets: 0

detection:
  min_confidence: 0.20
  severity_threshold: "low"
  context_window_tokens: 10
  enable_rules:
    - "*"
  disable_rules: []
  allowlists:
    values:
      - "123-45-6789"
      - "000-00-0000"
      - "4111111111111111"
    domains:
      - "example.com"
      - "test.com"
    file_patterns:
      - "test_*"
      - "*_fixture*"
      - "mock_*"

compliance:
  frameworks:
    - "HIPAA"
    - "PCI_DSS"
    - "GDPR"
    - "CCPA"
    - "SOX"
    - "GLBA"

output:
  format: "console"
  output_file: ""
  redaction_style: "partial"
  verbose: false
  color: true

logging:
  level: "INFO"
  json_output: false
  log_file: ""
24 changes: 24 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/.env.example
@@ -0,0 +1,24 @@
# ©AngelaMos | 2026
# .env.example

# PostgreSQL
PGHOST=localhost
PGPORT=5432
PGUSER=dlp_scanner
PGPASSWORD=changeme
PGDATABASE=target_db

# MySQL
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=dlp_scanner
MYSQL_PASSWORD=changeme
MYSQL_DATABASE=target_db

# MongoDB
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=target_db

# Logging
DLP_LOG_LEVEL=INFO
DLP_LOG_JSON=false
11 changes: 11 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/.gitignore
@@ -0,0 +1,11 @@
docs/
__pycache__/
*.pyc
.env
.venv/
*.egg-info/
dist/
build/
.mypy_cache/
.ruff_cache/
.pytest_cache/
46 changes: 46 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/.style.yapf
@@ -0,0 +1,46 @@
[style]
based_on_style = pep8
column_limit = 75
indent_width = 4
continuation_indent_width = 4
indent_closing_brackets = false
dedent_closing_brackets = true
indent_blank_lines = false
spaces_before_comment = 2
spaces_around_power_operator = false
spaces_around_default_or_named_assign = true
space_between_ending_comma_and_closing_bracket = false
space_inside_brackets = false
spaces_around_subscript_colon = true
blank_line_before_nested_class_or_def = false
blank_line_before_class_docstring = false
blank_lines_around_top_level_definition = 2
blank_lines_between_top_level_imports_and_variables = 2
blank_line_before_module_docstring = false
split_before_logical_operator = true
split_before_first_argument = true
split_before_named_assigns = true
split_complex_comprehension = true
split_before_expression_after_opening_paren = false
split_before_closing_bracket = true
split_all_comma_separated_values = true
split_all_top_level_comma_separated_values = false
coalesce_brackets = false
each_dict_entry_on_separate_line = true
allow_multiline_lambdas = false
allow_multiline_dictionary_keys = false
split_penalty_import_names = 0
join_multiple_lines = false
align_closing_bracket_with_visual_indent = true
arithmetic_precedence_indication = false
split_penalty_for_added_line_split = 275
use_tabs = false
split_before_dot = false
split_arguments_when_comma_terminated = true
i18n_function_call = ['_', 'N_', 'gettext', 'ngettext']
i18n_comment = ['# Translators:', '# i18n:']
split_penalty_comprehension = 80
split_penalty_after_opening_bracket = 280
split_penalty_before_if_expr = 0
split_penalty_bitwise_operator = 290
split_penalty_logical_operator = 0
112 changes: 112 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/README.md
@@ -0,0 +1,112 @@
```ruby
██████╗ ██╗ ██████╗ ███████╗ ██████╗ █████╗ ███╗ ██╗
██╔══██╗██║ ██╔══██╗ ██╔════╝██╔════╝██╔══██╗████╗ ██║
██║ ██║██║ ██████╔╝█████╗███████╗██║ ███████║██╔██╗ ██║
██║ ██║██║ ██╔═══╝ ╚════╝╚════██║██║ ██╔══██║██║╚██╗██║
██████╔╝███████╗██║ ███████║╚██████╗██║ ██║██║ ╚████║
╚═════╝ ╚══════╝╚═╝ ╚══════╝ ╚═════╝╚═╝ ╚═╝╚═╝ ╚═══╝
```

[![Cybersecurity Projects](https://img.shields.io/badge/Cybersecurity--Projects-intermediate-red?style=flat&logo=github)](https://github.com/CarterPerez-dev/Cybersecurity-Projects/tree/main/PROJECTS/intermediate/dlp-scanner)
[![Python](https://img.shields.io/badge/Python-3.12+-3776AB?style=flat&logo=python&logoColor=white)](https://python.org)
[![License: AGPLv3](https://img.shields.io/badge/License-AGPL_v3-purple.svg)](https://www.gnu.org/licenses/agpl-3.0)

> Data Loss Prevention scanner for files, databases, and network traffic.

*This is a quick overview. Security theory, architecture, and full walkthroughs are in the [learn modules](#learn).*

## What It Does

- Scans files (PDF, DOCX, XLSX, CSV, JSON, XML, YAML, Parquet, Avro, archives, emails) for PII, credentials, financial data, and PHI
- Scans databases (PostgreSQL, MySQL, MongoDB, SQLite) with schema introspection and sampling
- Scans network captures (PCAP/PCAPNG) with protocol parsing, TCP reassembly, and DNS exfiltration detection
- Confidence scoring pipeline: regex match, checksum validation (Luhn, Mod-97, Mod-11), context keyword proximity, entity co-occurrence
- Maps findings to compliance frameworks (HIPAA, PCI-DSS, GDPR, CCPA, SOX, GLBA, FERPA)
- Reports in console (Rich tables), JSON, SARIF 2.1.0, or CSV
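The checksum step in the scoring pipeline is easy to sketch. Below is a minimal Luhn (mod-10) check of the kind used to validate credit card candidates before a finding is reported — `luhn_valid` is an illustrative name, not this scanner's actual API:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn mod-10 check."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    # Walk right-to-left, doubling every second digit; a doubled
    # digit greater than 9 has 9 subtracted (digit-sum shortcut)
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A regex match like `4111111111111112` fails this check and can be suppressed or heavily penalized, which is why checksum validation cuts false positives so sharply.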

## Quick Start

```bash
bash install.sh
dlp-scan file ./data
```

## Usage

```bash
dlp-scan file ./data/employees/ # scan a directory
dlp-scan file ./report.pdf -f json # scan a file, JSON output
dlp-scan db postgres://user:pass@host/db # scan PostgreSQL
dlp-scan db sqlite:///path/to/local.db # scan SQLite
dlp-scan network capture.pcap # scan network traffic
dlp-scan file ./data -f sarif -o results.sarif # SARIF for CI/CD
dlp-scan report convert results.json -f csv # convert report format
dlp-scan report summary results.json # print summary stats
```

### Global Options

```
--config, -c Path to YAML config file
--verbose, -v Enable debug logging
--version Show version
```

### Output Formats

| Format | Flag | Use Case |
|--------|------|----------|
| Console | `-f console` | Interactive review with Rich tables |
| JSON | `-f json` | Structured analysis and archival |
| SARIF | `-f sarif` | GitHub code scanning, CI/CD integration |
| CSV | `-f csv` | Compliance team export, spreadsheet import |

## Stack

**Language:** Python 3.12+

**CLI:** Typer 0.15+ with Rich integration

**Detection:** Regex + checksum validators + Shannon entropy + context keyword scoring

**File Formats:** PyMuPDF, python-docx, openpyxl, xlrd, defusedxml, lxml, pyarrow, fastavro, extract-msg

**Databases:** asyncpg (PostgreSQL), aiomysql (MySQL), pymongo async (MongoDB), aiosqlite (SQLite)

**Network:** dpkt (PCAP parsing), TCP reassembly, DPI protocol identification, DNS exfiltration heuristics

**Config:** Pydantic 2.10+ models with YAML config loading (ruamel.yaml)

**Quality:** ruff, mypy (strict), yapf, pytest + hypothesis, structlog

## Configuration

Copy `.dlp-scanner.yml` to your project root and customize. Key settings:

```yaml
detection:
  min_confidence: 0.20        # minimum score to report
  enable_rules: ["*"]         # glob patterns for rule IDs
  allowlists:
    values: ["123-45-6789"]   # suppress known test values

output:
  format: "console"           # console, json, sarif, csv
  redaction_style: "partial"  # partial, full, none
```
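As an assumption about what the three `redaction_style` values mean, here is a sketch of a partial/full/none redactor (the `redact` helper and its signature are hypothetical, not the scanner's API):

```python
def redact(value: str, style: str = "partial", keep: int = 4) -> str:
    """Mask a matched value; 'partial' keeps only the trailing characters."""
    if style == "none":
        return value
    if style == "full":
        return "*" * len(value)
    # partial: mask alphanumerics, preserve separators, keep the tail
    masked = ["*" if c.isalnum() else c for c in value[:-keep]]
    return "".join(masked) + value[-keep:]
```

For example, a partial redaction of an SSN would render as `***-**-6789`, enough for an analyst to confirm the finding without re-exposing the value.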

## Learn

This project includes step-by-step learning materials covering security theory, architecture, and implementation.

| Module | Topic |
|--------|-------|
| [00 - Overview](learn/00-OVERVIEW.md) | Prerequisites and quick start |
| [01 - Concepts](learn/01-CONCEPTS.md) | DLP theory and real-world breaches |
| [02 - Architecture](learn/02-ARCHITECTURE.md) | System design and data flow |
| [03 - Implementation](learn/03-IMPLEMENTATION.md) | Code walkthrough |
| [04 - Challenges](learn/04-CHALLENGES.md) | Extension ideas and exercises |

## License

[AGPLv3](https://www.gnu.org/licenses/agpl-3.0)
26 changes: 26 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/install.sh
@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# ©AngelaMos | 2026
# install.sh

set -euo pipefail

command -v uv >/dev/null 2>&1 || {
    echo "Installing uv..."
    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$PATH"
}

echo "Syncing dependencies..."
uv sync

echo "Downloading spaCy model (optional, for NLP-based detection)..."
uv run python -m spacy download en_core_web_sm 2>/dev/null || true

echo ""
echo "Setup complete. Run the scanner with:"
echo " uv run dlp-scan --help"
echo ""
echo "Quick start:"
echo " uv run dlp-scan scan file ./path/to/scan"
echo " uv run dlp-scan scan db sqlite:///path/to/db.sqlite3"
echo " uv run dlp-scan scan network ./capture.pcap"
76 changes: 76 additions & 0 deletions PROJECTS/intermediate/dlp-scanner/learn/00-OVERVIEW.md
@@ -0,0 +1,76 @@
# 00-OVERVIEW.md

# DLP Scanner

## What This Is

A command-line Data Loss Prevention scanner that detects sensitive data across three surfaces: files (PDF, DOCX, XLSX, CSV, JSON, XML, YAML, Parquet, Avro, archives, emails), databases (PostgreSQL, MySQL, MongoDB, SQLite), and network captures (PCAP/PCAPNG with protocol parsing and TCP reassembly). It uses a confidence scoring pipeline combining regex matching, checksum validation (Luhn for credit cards, Mod-97 for IBANs, Mod-11 for NHS numbers), keyword proximity analysis, and Shannon entropy detection. Findings are classified by severity and mapped to compliance frameworks (HIPAA, PCI-DSS, GDPR, CCPA, SOX, GLBA, FERPA). Output supports console Rich tables, JSON, SARIF 2.1.0 for CI/CD, and CSV for compliance teams.

## Why This Matters

Data breaches involving PII keep happening because organizations cannot protect sensitive data they do not know exists. The 2017 Equifax breach exposed 147 million SSNs through an unpatched Apache Struts application, but the underlying problem was that SSNs were stored in plaintext across multiple database tables without anyone tracking where that data lived. In 2019, Capital One lost 100 million credit applications from an S3 bucket after a misconfigured WAF allowed server-side request forgery, and nobody had scanned those files to realize unencrypted SSNs and credit card numbers sat in flat CSV exports. The 2018 Marriott breach exposed 500 million records, including 5.25 million unencrypted passport numbers, partly because the Starwood reservation system was merged without a data inventory that would have flagged those fields as sensitive.

These are not failure-of-firewall problems. They are failure-of-visibility problems. DLP tools exist to answer "where is our sensitive data?" before attackers answer it for you. Commercial solutions (Symantec DLP, Microsoft Purview, Netskope) cost six figures and require enterprise deployment, but the core detection logic is straightforward: pattern matching with validation, context analysis to reduce false positives, and compliance framework mapping to prioritize remediation.

This project builds a DLP engine from scratch, teaching you the same detection techniques that power production systems.

**Real world scenarios where this applies:**
- Security engineers scanning file shares before a cloud migration to find PII that needs encryption
- Compliance teams auditing database tables for HIPAA-regulated PHI that should not be in plaintext
- SOC analysts inspecting PCAP captures for credentials or PII transmitted in the clear
- DevOps teams running DLP checks in CI/CD pipelines to catch secrets before they reach production
- Incident responders determining what sensitive data was accessible from a compromised network segment

## What You'll Learn

**Security Concepts:**
- Data classification tiers and how PII, PHI, PCI, and credential data map to regulatory requirements
- Confidence scoring: why regex alone produces false positives and how checksum validation, context keywords, and entity co-occurrence reduce them
- Compliance framework mapping: HIPAA's 18 identifiers, PCI-DSS cardholder data, GDPR personal data categories, CCPA consumer information
- Network DLP: detecting sensitive data in transit, DNS exfiltration via high-entropy subdomain labels, base64-encoded payloads in HTTP bodies
- Redaction strategies: why you never store the raw matched content in findings
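The entropy-based detections above reduce to one formula. Here is a minimal Shannon entropy function of the kind compared against a threshold such as the config's `dns_label_entropy_threshold: 4.0` — an illustrative sketch, not this project's code:

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    """Bits of entropy per symbol over the string's character distribution."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    # H = -sum(p * log2(p)) over observed symbol frequencies
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A benign label like `www` scores near 0 bits, while a base32-encoded exfiltration chunk approaches 5 bits per character, so thresholding the per-label entropy separates the two populations cheaply.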

**Technical Skills:**
- Building a multi-format text extraction pipeline that handles 14+ file formats through a unified Protocol interface
- Database schema introspection across 4 database engines with statistical sampling (TABLESAMPLE BERNOULLI, $sample aggregation)
- TCP stream reassembly from raw packets using sequence-number ordering and bidirectional flow key normalization
- Confidence scoring pipeline: base scores, checksum boosts, context keyword proximity windows, entity co-occurrence
- SARIF 2.1.0 output for GitHub code scanning integration
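The bidirectional flow key normalization mentioned above can be sketched as follows — a hypothetical helper that orders the two endpoints so both directions of a TCP conversation map to the same reassembly bucket:

```python
def flow_key(
    src_ip: str, src_port: int, dst_ip: str, dst_port: int
) -> tuple[tuple[str, int], tuple[str, int]]:
    """Canonical flow key: sort the endpoints so A->B and B->A collide."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)
```

With this normalization, packets from both sides of a connection accumulate in one per-flow buffer, where sequence-number ordering then reconstructs each byte stream.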

**Tools and Techniques:**
- Typer CLI with Annotated-style parameters and global option propagation through Click context
- Pydantic 2.x for configuration validation with YAML loading
- structlog with stdlib integration for structured JSON logging
- orjson for high-performance JSON serialization
- asyncpg, aiomysql, pymongo async, aiosqlite for async database access
- dpkt for fast PCAP parsing (100x faster than Scapy)
- pytest with hypothesis for property-based testing of detection rules
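In the spirit of the pytest + hypothesis bullet, a property-based test might assert a Luhn invariant — appending the correct check digit always yields a valid number. Both `luhn_total` and the test are illustrative, not the project's real detection rules:

```python
from hypothesis import given, strategies as st


def luhn_total(digits: list[int]) -> int:
    """Luhn running total: double every second digit from the right."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total


@given(st.text(alphabet="0123456789", min_size=12, max_size=18))
def test_appended_check_digit_validates(body: str) -> None:
    digits = [int(c) for c in body]
    # the check digit lands at reversed index 0, so it is never doubled
    check = (10 - luhn_total(digits + [0]) % 10) % 10
    assert luhn_total(digits + [check]) % 10 == 0
```

Hypothesis generates hundreds of digit strings per run, which surfaces parity and edge-length bugs that a handful of hand-picked fixtures would miss.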

## Prerequisites

**Required knowledge:**
- Python fundamentals: dataclasses, type hints, list comprehensions, context managers
- Basic networking: TCP/IP, ports, packets, what PCAP files contain
- Basic SQL: SELECT, WHERE, table schemas, column types
- Security basics: what PII is, why SSNs and credit card numbers need protection, what compliance frameworks exist

**Tools you'll need:**
- Python 3.12+ (uses modern generic syntax and `from __future__ import annotations`)
- uv package manager (install: `curl -LsSf https://astral.sh/uv/install.sh | sh`)
- A terminal with UTF-8 support (for Rich console output)

**Helpful but not required:**
- Experience with regex and pattern matching
- Familiarity with dpkt or Scapy for packet analysis
- Knowledge of database URIs and connection strings
- Understanding of SARIF format for CI/CD security tooling

## Quick Start

```bash
bash install.sh
dlp-scan file ./data
dlp-scan file ./data -f json -o results.json
dlp-scan db sqlite:///path/to/database.db
dlp-scan report summary results.json
```