A high-performance Go library for detecting PII (Personally Identifiable Information) names by Montevive.AI. This is a Go migration of the popular names-dataset Python library, offering 10x faster performance and 6x less memory usage while maintaining the same comprehensive name database.
🎉 v1.0.0 Released - Production-ready with embedded dataset and out-of-the-box functionality!
This project migrates the Python names-dataset library to Go with significant improvements:
| Feature | Python (Original) | Go (This Project) |
|---|---|---|
| Memory Usage | 3.2 GB | ~500 MB (6x less) |
| Detection Speed | ~50-100ms | 3-9ms (10x faster) |
| Data Format | Pickle files | Protocol Buffers |
| Cultural Rules | Limited | Universal algorithm |
| Surname Support | Single string | Multiple surnames ([]string) |
| Spanish Names | Basic support | Advanced double surname detection |
Built with Protocol Buffers for efficient data storage and fast lookups of 727K first names and 983K surnames from 105 countries.
go get github.com/montevive/go-name-detectorimport "github.com/montevive/go-name-detector/pkg/detector"
d, _ := detector.NewDefault() // Works out of the box!
result := d.DetectPII([]string{"John", "Smith"})
fmt.Printf("Is PII: %v (%.2f confidence)\n", result.IsLikelyName, result.Confidence)- Universal name detection: Works for all cultural naming patterns without hardcoded rules
- Multiple surname support: Handles Spanish double surnames, compound names, etc.
- Confidence scoring: Returns 0.0-1.0 confidence scores with detailed breakdown
- High performance: < 1ms detection time, optimized memory usage
- Protocol Buffers: Efficient binary data format for 727K first names and 983K surnames
- Country prediction: Identifies most likely country of origin
- Gender prediction: Predicts gender based on first names
- CLI tool: Ready-to-use command line interface
The library now includes embedded data and works out of the box:
# Latest version (recommended)
go get github.com/montevive/go-name-detector@latest
# Or pin to v1.0.0
go get github.com/montevive/go-name-detector@v1.0.0That's it! No additional setup, file downloads, or protobuf compilation required. The library is production-ready as of v1.0.0.
If you want to use custom data or the latest dataset, you can follow the advanced setup:
# Install original Python library (for custom data export)
pip install names-dataset
# Clone this Go migration project
git clone https://github.com/montevive/go-name-detector.git
cd go-name-detector
# Install dependencies and generate protobuf code
make generate
# Export data from Python pickle format to Protocol Buffers
make export-data
# Build the CLI tool
make build# Basic name detection
./bin/pii-check "John Smith"
# Output: ✓ Likely PII name (85.3% confidence)
# Spanish names with double surnames
./bin/pii-check "Jose Manuel García López"
# Output: ✓ Likely PII name (89.0% confidence)
# First names: Jose, Manuel
# Surnames: García, López
# Spanish names with accents (60% threshold recommended)
./bin/pii-check -threshold 0.6 "José García"
# Output: ✓ Likely PII name (64.8% confidence)
# Business terminology properly rejected
./bin/pii-check "Informe de Cliente"
# Output: ✗ Not a PII name (26.4% confidence)
# Non-name phrases
./bin/pii-check "The quick brown fox"
# Output: ✗ Not a PII name (12.4% confidence)
# JSON output with accent support
./bin/pii-check -json "María García López"
# High precision threshold
./bin/pii-check -threshold 0.8 "José Manuel García López"
# Output: ✓ Likely PII name (89.0% confidence)
# Batch processing
./bin/pii-check -batch names.txt
# Dataset statistics
./bin/pii-check -statsBased on enhanced scoring with popularity-based ranking:
-
70% (Default): Conservative - High confidence, minimal false positives
- Best for: Production systems, automated filtering, compliance workflows
- Example: "José Manuel García López" (89%) ✓, "Informe de Cliente" (26%) ✗
- Accuracy: ~95% precision, may miss some legitimate two-word names
-
60%: Balanced - Good detection with reasonable accuracy
- Best for: General purpose detection, Spanish/Latin name datasets
- Example: "José García" (65%) ✓, "María López" (62%) ✓
- Accuracy: ~90% precision, catches most legitimate name patterns
-
50%: Aggressive - Catches more names but higher false positive rate
- Best for: Initial screening, manual review workflows, research
- Example: "María de García" (43%) ✓ despite preposition
- Accuracy: ~80% precision, requires human verification
- Spanish/Latin names: Use 60-65% threshold (two-word names often score 60-70%)
- Anglo/English names: Use 70% threshold (typically score higher due to simpler patterns)
- Mixed international datasets: Use 65% as balanced compromise
- Asian names: Use 70% (usually compound patterns with higher confidence)
The enhanced algorithm distinguishes between:
- ✅ Top-ranked personal names (José #1, García #1): Score 65-95%
- ❌ Business terms ("Informe de Cliente", "Documento de Identidad"): Score 15-30%
- ❌ Rare/noise entries (Cliente #6,970): Score under 20%
⚠️ Prepositions in names ("María de García"): Moderate penalty but detectable at 60%
# High precision (minimal false positives)
./bin/pii-check -threshold 0.8 "José Manuel García López"
# Balanced detection (recommended for most use cases)
./bin/pii-check -threshold 0.65 "José García"
# High recall (catch more names, review manually)
./bin/pii-check -threshold 0.5 "María de la Cruz"package main
import (
"fmt"
"log"
"github.com/montevive/go-name-detector/pkg/detector"
)
func main() {
// Create detector with embedded data - works out of the box!
d, err := detector.NewDefault()
if err != nil {
log.Fatal(err)
}
// Detect PII
words := []string{"Jose", "Manuel", "Robles", "Hermoso"}
result := d.DetectPII(words)
fmt.Printf("Is PII: %v\n", result.IsLikelyName)
fmt.Printf("Confidence: %.2f\n", result.Confidence)
fmt.Printf("First names: %v\n", result.Details.FirstNames)
fmt.Printf("Surnames: %v\n", result.Details.Surnames)
fmt.Printf("Country: %s\n", result.Details.TopCountry)
fmt.Printf("Gender: %s\n", result.Details.Gender)
}package main
import (
"fmt"
"github.com/montevive/go-name-detector/pkg/detector"
"github.com/montevive/go-name-detector/pkg/loader"
)
func main() {
// Load dataset from file
l := loader.New()
err := l.LoadFromFile("data/combined_names.pb.gz")
if err != nil {
panic(err)
}
// Create detector
d := detector.New(l.GetDataset())
// Detect PII
words := []string{"Jose", "Manuel", "Robles", "Hermoso"}
result := d.DetectPII(words)
fmt.Printf("Is PII: %v\n", result.IsLikelyName)
fmt.Printf("Confidence: %.2f\n", result.Confidence)
}The detector uses a data-driven approach:
-
All possible splits: For input words, tries every possible combination of first names vs surnames
-
Database lookup: Checks each component against 727K first names and 983K surnames
-
Confidence scoring: Combines multiple factors:
- Database match (base score)
- Name popularity (lower rank = higher confidence)
- Gender consistency across first names
- Country overlap between components
- Multiple valid name bonus
-
Best combination: Returns the split with highest confidence score
Input: ["Jose", "Manuel", "Robles", "Hermoso"]
The algorithm tries:
[Jose] + [Manuel, Robles, Hermoso]→ Score: 0.42[Jose, Manuel] + [Robles, Hermoso]→ Score: 0.92 ✓[Jose, Manuel, Robles] + [Hermoso]→ Score: 0.35
Returns the best scoring combination with detailed breakdown.
The system uses Protocol Buffer files converted from the original Python pickle format:
- first_names.pb.gz: 727,556 first names with country/gender/rank data
- last_names.pb.gz: 983,826 surnames with country/rank data
- combined_names.pb.gz: Both datasets in a single file
Each name entry contains:
- Country probabilities: Likelihood per country (105 countries supported)
- Gender data: Male/Female probabilities (first names only)
- Popularity ranks: 1-indexed ranking per country (1 = most popular)
| Metric | Python (names-dataset) | Go (This Project) | Improvement |
|---|---|---|---|
| Memory Usage | 3.2 GB | 500 MB | 6.4x less |
| Load Time | ~30-60 seconds | 4.3 seconds | 7-14x faster |
| Detection Speed | 50-100ms | 3-9ms | 10-20x faster |
| File Size | 53MB (pickle) | 54MB (protobuf) | Comparable |
| Batch Processing | ~100 names/sec | 10,000+ names/sec | 100x faster |
BenchmarkDetectPII_Spanish-10 141104 8418 ns/op 14742 B/op 51 allocs/op
BenchmarkDetectPII_English-10 342564 3533 ns/op 7179 B/op 19 allocs/op
BenchmarkDetectPII_NonName-10 226826 5318 ns/op 14374 B/op 32 allocs/op
- Spanish names: 8.4ms per detection
- English names: 3.5ms per detection
- Non-names: 5.3ms per detection
# Run unit tests
make test
# Run benchmarks
make bench
# Test with examples
go test ./examples -vThe scoring algorithm uses enhanced popularity-based scoring and can be customized:
config := detector.ScoreConfig{
BaseMatchScore: 0.25, // Base score for database match
PopularityWeight: 0.35, // HIGH: Popularity is key differentiator
GenderConsistency: 0.1, // Bonus for consistent gender
CountryOverlap: 0.15, // Bonus for country overlap
MultipleNamesBonus: 0.15, // Bonus for multiple names
}
d := detector.NewWithConfig(dataset, config)-
Step-based popularity scoring:
- Top-10 names (José, García): 1.0 score
- Top-50 names: 0.8 score
- Top-200 names: 0.5 score
- Top-1000 names: 0.2 score
- Rare names (Cliente #6,970): 0.02 score
-
Preposition penalties: Words like "de", "van", "von", "del" heavily penalized
- 70% penalty (×0.3) when used as first names
- 30% penalty (×0.7) when used as surnames
-
Pattern bonuses:
- Two top-100 names together: 40% boost (×1.4)
- Two top-10 names together: 60% boost (×1.6)
-
Accent normalization: Automatic handling of "José" → "Jose" lookups
The universal algorithm automatically handles:
- English:
John Smith,Mary Jane Doe - Spanish:
Jose Manuel Garcia Lopez(double first + double surname) - Compound names:
Maria del Carmen Rodriguez - Asian patterns: Based on statistical data in the dataset
- Any cultural pattern: No hardcoded rules, purely data-driven
Full support for accented characters and international Unicode names:
- ✅ Accepts any Unicode letters: José, María, François, Müller, 李明, محمد
- ✅ Automatic normalization: Converts accents for database lookup
- ✅ Case insensitive: JOSÉ, josé, José all work identically
- ✅ Mixed scripts: Handles Latin, Cyrillic, Arabic, Chinese characters
The library uses intelligent dual lookup:
- Exact match: First tries to find "José" in the database
- Normalized match: Then tries "Jose" (accent removed)
- Best result: Uses whichever version has better popularity ranking
# Spanish names with accents
./bin/pii-check "José García" # ✓ 64.8% confidence
./bin/pii-check "María López" # ✓ Proper detection
./bin/pii-check "José Manuel García" # ✓ 81.4% confidence
# French names
./bin/pii-check "François Dupont" # ✓ Handles French accents
# Mixed accent/non-accent
./bin/pii-check "José Smith" # ✓ Works with mixed names
./bin/pii-check "Maria García" # ✓ One accent, one without- Normalization: Uses Unicode NFD decomposition + combining character removal
- Preservation: Original accented input preserved in results
- Performance: Minimal overhead (~1μs per normalization)
- Languages: Supports Spanish, French, German, Portuguese, Italian, and more
If migrating from systems that only handle ASCII:
// Old ASCII-only approach ❌
if isASCII(name) {
result := detector.DetectPII([]string{name})
}
// New Unicode approach ✅
result := detector.DetectPII([]string{"José", "García"}) // Just works!This Go implementation uses the same high-quality dataset as the original Python library:
Source: Facebook data leak dataset (533M users) from 105 countries
Original Library: names-dataset by Philippe Remy
What we provide:
- Same 727K first names and 983K surnames
- Same country-specific popularity rankings
- Same gender predictions with statistical confidence
- Same real-world name frequency distributions
- Enhanced: Better support for cultural naming patterns through universal algorithm
Migration benefits:
- Performance: 10x faster detection with 6x less memory
- Format: Protocol Buffers instead of Python pickle
- Algorithm: Universal approach vs hardcoded rules
- Spanish names: Advanced double surname detection
- Production ready: CLI tool, JSON API, batch processing
Copyright 2024 Montevive.AI
Apache License 2.0 - see the LICENSE file for details.
This project is a Go migration of the original names-dataset Python library by Philippe Remy, which is also licensed under Apache 2.0.
We welcome contributions to improve the Go PII Name Detector! This project is maintained by Montevive.AI.
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Run
make testandmake bench - Submit a pull request
For questions or support, visit Montevive.AI or open an issue on GitHub.
If you're migrating from the Python library:
# Python (original)
from names_dataset import NameDataset
nd = NameDataset()
result = nd.search('Jose Manuel')
# Go (this project) - equivalent functionality
words := []string{"Jose", "Manuel", "Garcia", "Lopez"}
result := detector.DetectPII(words)Key differences:
- Multiple surnames: Go version returns
[]stringfor surnames (supports Spanish double surnames) - Universal detection: No need to call separate functions for different name types
- Performance: 10x faster with 6x less memory usage
- Confidence scoring: Enhanced algorithm with detailed breakdown
Data file not found: Make sure to run make export-data first to generate the protobuf files from the original Python dataset.
Memory issues: The full dataset requires ~500MB RAM (vs 3.2GB for Python). Consider using partial loading if needed.
Low confidence scores: The system is conservative by design. Names not in the 533M person dataset will score lower.
Spanish names with accents not detected:
- Two-word Spanish names (José García) typically score 60-70% with enhanced scoring
- Use threshold 0.6 instead of default 0.7 for Spanish/Latin datasets
- The algorithm prioritizes accuracy over false positives to avoid detecting business terms
- Example:
./bin/pii-check -threshold 0.6 "José García"→ ✓ 64.8% confidence
Business terms being detected as names:
- Enhanced scoring now heavily penalizes business terminology
- "Informe de Cliente" scores ~26% (down from 42% in older versions)
- Prepositions like "de", "del" receive automatic penalties
- If still getting false positives, try increasing threshold to 0.75-0.8
Mixed results with accented vs non-accented names:
- Library uses dual lookup: tries "José" first, then "Jose" as fallback
- Both versions exist in dataset with different popularity rankings
- Results may vary based on which version has better ranking in the database
Released: August 2025
What's New:
- ✅ Out-of-the-box functionality - No setup required, just
go getand use - ✅ Embedded dataset - 54MB protobuf file with 727K first names and 983K surnames
- ✅
detector.NewDefault()- Instant initialization with embedded data - ✅ Production ready - Stable API, comprehensive documentation
- ✅ Performance optimized - 10x faster than Python, 6x less memory
- ✅ Universal algorithm - Works with all cultural naming patterns
- ✅ CLI tool included - Ready-to-use command line interface
Breaking Changes: None (initial release)
Migration: This is the first stable release. Previous development versions are not supported.
This project is developed and maintained by Montevive.AI, a company focused on building advanced AI solutions for data privacy and security.
- Privacy-focused AI tools: Specialized solutions for PII detection and data anonymization
- Security AI systems: Advanced threat detection and analysis platforms
- Data intelligence: Smart data processing and insight generation tools
- 🌐 Website: montevive.ai
- 📧 Email: Contact us through our website
- 💼 GitHub: @montevive
- 🐛 Issues: Report bugs or request features via GitHub Issues
Built with ❤️ by the Montevive.AI team