Go PII Name Detector

A high-performance Go library for detecting PII (Personally Identifiable Information) names by Montevive.AI. This is a Go migration of the popular names-dataset Python library, offering 10x faster performance and 6x less memory usage while maintaining the same comprehensive name database.

🎉 v1.0.0 Released - Production-ready with embedded dataset and out-of-the-box functionality!

🚀 Migration from Python to Go

This project migrates the Python names-dataset library to Go with significant improvements:

Feature	Python (Original)	Go (This Project)
Memory Usage	3.2 GB	~500 MB (6x less)
Detection Speed	~50-100ms	3-9ms (10x faster)
Data Format	Pickle files	Protocol Buffers
Cultural Rules	Limited	Universal algorithm
Surname Support	Single string	Multiple surnames ([]string)
Spanish Names	Basic support	Advanced double surname detection

Built with Protocol Buffers for efficient data storage and fast lookups of 727K first names and 983K surnames from 105 countries.

🚀 Quick Start

go get github.com/montevive/go-name-detector

import "github.com/montevive/go-name-detector/pkg/detector"

d, _ := detector.NewDefault() // Works out of the box!
result := d.DetectPII([]string{"John", "Smith"})
fmt.Printf("Is PII: %v (%.2f confidence)\n", result.IsLikelyName, result.Confidence)

Features

Universal name detection: Works for all cultural naming patterns without hardcoded rules
Multiple surname support: Handles Spanish double surnames, compound names, etc.
Confidence scoring: Returns 0.0-1.0 confidence scores with detailed breakdown
High performance: < 1ms detection time, optimized memory usage
Protocol Buffers: Efficient binary data format for 727K first names and 983K surnames
Country prediction: Identifies most likely country of origin
Gender prediction: Predicts gender based on first names
CLI tool: Ready-to-use command line interface

Installation

Simple Installation (Recommended)

The library now includes embedded data and works out of the box:

# Latest version (recommended)
go get github.com/montevive/go-name-detector@latest

# Or pin to v1.0.0
go get github.com/montevive/go-name-detector@v1.0.0

That's it! No additional setup, file downloads, or protobuf compilation required. The library is production-ready as of v1.0.0.

Advanced Installation (Custom Data)

If you want to use custom data or the latest dataset, you can follow the advanced setup:

# Install original Python library (for custom data export)
pip install names-dataset

# Clone this Go migration project
git clone https://github.com/montevive/go-name-detector.git
cd go-name-detector

# Install dependencies and generate protobuf code
make generate

# Export data from Python pickle format to Protocol Buffers
make export-data

# Build the CLI tool
make build

Quick Start

CLI Usage

# Basic name detection
./bin/pii-check "John Smith"
# Output: ✓ Likely PII name (85.3% confidence)

# Spanish names with double surnames  
./bin/pii-check "Jose Manuel García López"
# Output: ✓ Likely PII name (89.0% confidence)
#         First names: Jose, Manuel
#         Surnames: García, López

# Spanish names with accents (60% threshold recommended)
./bin/pii-check -threshold 0.6 "José García"
# Output: ✓ Likely PII name (64.8% confidence)

# Business terminology properly rejected
./bin/pii-check "Informe de Cliente"  
# Output: ✗ Not a PII name (26.4% confidence)

# Non-name phrases
./bin/pii-check "The quick brown fox"
# Output: ✗ Not a PII name (12.4% confidence)

# JSON output with accent support
./bin/pii-check -json "María García López"

# High precision threshold
./bin/pii-check -threshold 0.8 "José Manuel García López"
# Output: ✓ Likely PII name (89.0% confidence)

# Batch processing
./bin/pii-check -batch names.txt

# Dataset statistics
./bin/pii-check -stats

Threshold Recommendations

Based on enhanced scoring with popularity-based ranking:

Standard Use Cases

70% (Default): Conservative - High confidence, minimal false positives
- Best for: Production systems, automated filtering, compliance workflows
- Example: "José Manuel García López" (89%) ✓, "Informe de Cliente" (26%) ✗
- Accuracy: ~95% precision, may miss some legitimate two-word names
60%: Balanced - Good detection with reasonable accuracy
- Best for: General purpose detection, Spanish/Latin name datasets
- Example: "José García" (65%) ✓, "María López" (62%) ✓
- Accuracy: ~90% precision, catches most legitimate name patterns
50%: Aggressive - Catches more names but higher false positive rate
- Best for: Initial screening, manual review workflows, research
- Example: "María de García" (43%) ✓ despite preposition
- Accuracy: ~80% precision, requires human verification

Cultural Considerations

Spanish/Latin names: Use 60-65% threshold (two-word names often score 60-70%)
Anglo/English names: Use 70% threshold (typically score higher due to simpler patterns)
Mixed international datasets: Use 65% as balanced compromise
Asian names: Use 70% (usually compound patterns with higher confidence)

Business vs Personal Names

The enhanced algorithm distinguishes between:

✅ Top-ranked personal names (José #1, García #1): Score 65-95%
❌ Business terms ("Informe de Cliente", "Documento de Identidad"): Score 15-30%
❌ Rare/noise entries (Cliente #6,970): Score under 20%
⚠️ Prepositions in names ("María de García"): Moderate penalty but detectable at 60%

Threshold Selection Guide

# High precision (minimal false positives)
./bin/pii-check -threshold 0.8 "José Manuel García López"

# Balanced detection (recommended for most use cases)  
./bin/pii-check -threshold 0.65 "José García"

# High recall (catch more names, review manually)
./bin/pii-check -threshold 0.5 "María de la Cruz"

Library Usage

Simple Usage (Recommended)

package main

import (
    "fmt"
    "log"
    "github.com/montevive/go-name-detector/pkg/detector"
)

func main() {
    // Create detector with embedded data - works out of the box!
    d, err := detector.NewDefault()
    if err != nil {
        log.Fatal(err)
    }

    // Detect PII
    words := []string{"Jose", "Manuel", "Robles", "Hermoso"}
    result := d.DetectPII(words)

    fmt.Printf("Is PII: %v\n", result.IsLikelyName)
    fmt.Printf("Confidence: %.2f\n", result.Confidence)
    fmt.Printf("First names: %v\n", result.Details.FirstNames)
    fmt.Printf("Surnames: %v\n", result.Details.Surnames)
    fmt.Printf("Country: %s\n", result.Details.TopCountry)
    fmt.Printf("Gender: %s\n", result.Details.Gender)
}

Advanced Usage (Custom Data Files)

package main

import (
    "fmt"
    "github.com/montevive/go-name-detector/pkg/detector"
    "github.com/montevive/go-name-detector/pkg/loader"
)

func main() {
    // Load dataset from file
    l := loader.New()
    err := l.LoadFromFile("data/combined_names.pb.gz")
    if err != nil {
        panic(err)
    }

    // Create detector
    d := detector.New(l.GetDataset())

    // Detect PII
    words := []string{"Jose", "Manuel", "Robles", "Hermoso"}
    result := d.DetectPII(words)

    fmt.Printf("Is PII: %v\n", result.IsLikelyName)
    fmt.Printf("Confidence: %.2f\n", result.Confidence)
}

How It Works

The detector uses a data-driven approach:

All possible splits: For input words, tries every possible combination of first names vs surnames
Database lookup: Checks each component against 727K first names and 983K surnames
Confidence scoring: Combines multiple factors:
- Database match (base score)
- Name popularity (lower rank = higher confidence)
- Gender consistency across first names
- Country overlap between components
- Multiple valid name bonus
Best combination: Returns the split with highest confidence score

Example Analysis

Input: ["Jose", "Manuel", "Robles", "Hermoso"]

The algorithm tries:

[Jose] + [Manuel, Robles, Hermoso] → Score: 0.42
[Jose, Manuel] + [Robles, Hermoso] → Score: 0.92 ✓
[Jose, Manuel, Robles] + [Hermoso] → Score: 0.35

Returns the best scoring combination with detailed breakdown.

Data Format

The system uses Protocol Buffer files converted from the original Python pickle format:

first_names.pb.gz: 727,556 first names with country/gender/rank data
last_names.pb.gz: 983,826 surnames with country/rank data
combined_names.pb.gz: Both datasets in a single file

Each name entry contains:

Country probabilities: Likelihood per country (105 countries supported)
Gender data: Male/Female probabilities (first names only)
Popularity ranks: 1-indexed ranking per country (1 = most popular)

Performance

🚀 Benchmarks vs Python Original

Metric	Python (names-dataset)	Go (This Project)	Improvement
Memory Usage	3.2 GB	500 MB	6.4x less
Load Time	~30-60 seconds	4.3 seconds	7-14x faster
Detection Speed	50-100ms	3-9ms	10-20x faster
File Size	53MB (pickle)	54MB (protobuf)	Comparable
Batch Processing	~100 names/sec	10,000+ names/sec	100x faster

📊 Detailed Performance

BenchmarkDetectPII_Spanish-10    	  141104	      8418 ns/op	   14742 B/op	      51 allocs/op
BenchmarkDetectPII_English-10    	  342564	      3533 ns/op	    7179 B/op	      19 allocs/op
BenchmarkDetectPII_NonName-10    	  226826	      5318 ns/op	   14374 B/op	      32 allocs/op

Spanish names: 8.4ms per detection
English names: 3.5ms per detection
Non-names: 5.3ms per detection

Testing

# Run unit tests
make test

# Run benchmarks
make bench

# Test with examples
go test ./examples -v

Configuration

The scoring algorithm uses enhanced popularity-based scoring and can be customized:

config := detector.ScoreConfig{
    BaseMatchScore:     0.25, // Base score for database match
    PopularityWeight:   0.35, // HIGH: Popularity is key differentiator
    GenderConsistency:  0.1,  // Bonus for consistent gender
    CountryOverlap:     0.15, // Bonus for country overlap  
    MultipleNamesBonus: 0.15, // Bonus for multiple names
}

d := detector.NewWithConfig(dataset, config)

Enhanced Scoring Features

Step-based popularity scoring:
- Top-10 names (José, García): 1.0 score
- Top-50 names: 0.8 score
- Top-200 names: 0.5 score
- Top-1000 names: 0.2 score
- Rare names (Cliente #6,970): 0.02 score
Preposition penalties: Words like "de", "van", "von", "del" heavily penalized
- 70% penalty (×0.3) when used as first names
- 30% penalty (×0.7) when used as surnames
Pattern bonuses:
- Two top-100 names together: 40% boost (×1.4)
- Two top-10 names together: 60% boost (×1.6)
Accent normalization: Automatic handling of "José" → "Jose" lookups

Supported Patterns

The universal algorithm automatically handles:

English: John Smith, Mary Jane Doe
Spanish: Jose Manuel Garcia Lopez (double first + double surname)
Compound names: Maria del Carmen Rodriguez
Asian patterns: Based on statistical data in the dataset
Any cultural pattern: No hardcoded rules, purely data-driven

Accent and Unicode Support

Full support for accented characters and international Unicode names:

Input Handling

✅ Accepts any Unicode letters: José, María, François, Müller, 李明, محمد
✅ Automatic normalization: Converts accents for database lookup
✅ Case insensitive: JOSÉ, josé, José all work identically
✅ Mixed scripts: Handles Latin, Cyrillic, Arabic, Chinese characters

Dual Lookup Strategy

The library uses intelligent dual lookup:

Exact match: First tries to find "José" in the database
Normalized match: Then tries "Jose" (accent removed)
Best result: Uses whichever version has better popularity ranking

Examples

# Spanish names with accents
./bin/pii-check "José García"          # ✓ 64.8% confidence
./bin/pii-check "María López"          # ✓ Proper detection
./bin/pii-check "José Manuel García"   # ✓ 81.4% confidence

# French names  
./bin/pii-check "François Dupont"      # ✓ Handles French accents

# Mixed accent/non-accent
./bin/pii-check "José Smith"           # ✓ Works with mixed names
./bin/pii-check "Maria García"         # ✓ One accent, one without

Technical Details

Normalization: Uses Unicode NFD decomposition + combining character removal
Preservation: Original accented input preserved in results
Performance: Minimal overhead (~1μs per normalization)
Languages: Supports Spanish, French, German, Portuguese, Italian, and more

Migration from ASCII-only Systems

If migrating from systems that only handle ASCII:

// Old ASCII-only approach ❌
if isASCII(name) {
    result := detector.DetectPII([]string{name})
}

// New Unicode approach ✅  
result := detector.DetectPII([]string{"José", "García"}) // Just works!

Dataset Source

This Go implementation uses the same high-quality dataset as the original Python library:

Source: Facebook data leak dataset (533M users) from 105 countries
Original Library: names-dataset by Philippe Remy

What we provide:

Same 727K first names and 983K surnames
Same country-specific popularity rankings
Same gender predictions with statistical confidence
Same real-world name frequency distributions
Enhanced: Better support for cultural naming patterns through universal algorithm

Migration benefits:

Performance: 10x faster detection with 6x less memory
Format: Protocol Buffers instead of Python pickle
Algorithm: Universal approach vs hardcoded rules
Spanish names: Advanced double surname detection
Production ready: CLI tool, JSON API, batch processing

License

Apache License 2.0 - see the LICENSE file for details.

This project is a Go migration of the original names-dataset Python library by Philippe Remy, which is also licensed under Apache 2.0.

Contributing

We welcome contributions to improve the Go PII Name Detector! This project is maintained by Montevive.AI.

Fork the repository
Create a feature branch
Add tests for new functionality
Run make test and make bench
Submit a pull request

For questions or support, visit Montevive.AI or open an issue on GitHub.

Migration Notes

From Python names-dataset

If you're migrating from the Python library:

# Python (original)
from names_dataset import NameDataset
nd = NameDataset()
result = nd.search('Jose Manuel')

# Go (this project) - equivalent functionality  
words := []string{"Jose", "Manuel", "Garcia", "Lopez"}
result := detector.DetectPII(words)

Key differences:

Multiple surnames: Go version returns []string for surnames (supports Spanish double surnames)
Universal detection: No need to call separate functions for different name types
Performance: 10x faster with 6x less memory usage
Confidence scoring: Enhanced algorithm with detailed breakdown

Troubleshooting

Data file not found: Make sure to run make export-data first to generate the protobuf files from the original Python dataset.

Memory issues: The full dataset requires ~500MB RAM (vs 3.2GB for Python). Consider using partial loading if needed.

Low confidence scores: The system is conservative by design. Names not in the 533M person dataset will score lower.

Spanish names with accents not detected:

Two-word Spanish names (José García) typically score 60-70% with enhanced scoring
Use threshold 0.6 instead of default 0.7 for Spanish/Latin datasets
The algorithm prioritizes accuracy over false positives to avoid detecting business terms
Example: ./bin/pii-check -threshold 0.6 "José García" → ✓ 64.8% confidence

Business terms being detected as names:

Enhanced scoring now heavily penalizes business terminology
"Informe de Cliente" scores ~26% (down from 42% in older versions)
Prepositions like "de", "del" receive automatic penalties
If still getting false positives, try increasing threshold to 0.75-0.8

Mixed results with accented vs non-accented names:

Library uses dual lookup: tries "José" first, then "Jose" as fallback
Both versions exist in dataset with different popularity rankings
Results may vary based on which version has better ranking in the database

Changelog

v1.0.0 - Initial Stable Release 🎉

Released: August 2025

What's New:

✅ Out-of-the-box functionality - No setup required, just go get and use
✅ Embedded dataset - 54MB protobuf file with 727K first names and 983K surnames
✅ detector.NewDefault() - Instant initialization with embedded data
✅ Production ready - Stable API, comprehensive documentation
✅ Performance optimized - 10x faster than Python, 6x less memory
✅ Universal algorithm - Works with all cultural naming patterns
✅ CLI tool included - Ready-to-use command line interface

Breaking Changes: None (initial release)

Migration: This is the first stable release. Previous development versions are not supported.

About Montevive.AI

This project is developed and maintained by Montevive.AI, a company focused on building advanced AI solutions for data privacy and security.

Other Projects by Montevive.AI

Privacy-focused AI tools: Specialized solutions for PII detection and data anonymization
Security AI systems: Advanced threat detection and analysis platforms
Data intelligence: Smart data processing and insight generation tools

Contact & Support

🌐 Website: montevive.ai
📧 Email: Contact us through our website
💼 GitHub: @montevive
🐛 Issues: Report bugs or request features via GitHub Issues

Built with ❤️ by the Montevive.AI team

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
cmd/pii-check		cmd/pii-check
data		data
pkg		pkg
proto		proto
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Folders and files

Latest commit

History

Repository files navigation

Go PII Name Detector

🚀 Migration from Python to Go

🚀 Quick Start

Features

Installation

Simple Installation (Recommended)

Advanced Installation (Custom Data)

Quick Start

CLI Usage

Threshold Recommendations

Standard Use Cases

Cultural Considerations

Business vs Personal Names

Threshold Selection Guide

Library Usage

Simple Usage (Recommended)

Advanced Usage (Custom Data Files)

How It Works

Example Analysis

Data Format

Performance

🚀 Benchmarks vs Python Original

📊 Detailed Performance

Testing

Configuration

Enhanced Scoring Features

Supported Patterns

Accent and Unicode Support

Input Handling

Dual Lookup Strategy

Examples

Technical Details

Migration from ASCII-only Systems

Dataset Source

License

Contributing

Migration Notes

From Python names-dataset

Troubleshooting

Changelog

v1.0.0 - Initial Stable Release 🎉

About Montevive.AI

Other Projects by Montevive.AI

Contact & Support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages