1 change: 1 addition & 0 deletions .vscode/cspell.json
@@ -6,6 +6,7 @@
"CODEOWNER",
"cooldown",
"esbenp",
"firmographics",
"FORCEPURGE",
"ICLA",
"kernelsam",
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,21 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
[markdownlint](https://dlaa.me/markdownlint/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.2.0] - 2026-03-14

### Added in 1.2.0

- Alternate truth set key for ER audit comparison with a competing algorithm
- README with documentation for data sources, demo usage, and ER auditing
- Purge confirmation prompt in all shell scripts

### Changed in 1.2.0

- Converted demo JSONL files to V4 FEATURES array format
- Renamed `truthset_key.csv` to `actual_truthset_key.csv`
- Consolidated demo scripts into a single `truthset_demo.sh` with audit workflow
- Snapshot and audit output files now written to current working directory

## [1.1.2] - 2025-04-25

### Changed in 1.1.2
112 changes: 111 additions & 1 deletion README.md
@@ -1 +1,111 @@
# truth-sets

Curated data files that Senzing uses for demos, testing, and auditing entity resolution (ER) results. Available in JSON/JSONL and CSV formats.

## Repository Structure

```
truthsets/
  demo/
    truthset_demo.sh            # Shell script to load, snapshot, audit, and explore
    truthset_config.g2c         # Senzing engine configuration
    customers.jsonl             # Customer records
    watchlist.jsonl             # Watchlist records
    reference.jsonl             # Reference/third-party records
    customers.csv               # Customer records (CSV version)
    watchlist.csv               # Watchlist records (CSV version)
    reference.csv               # Reference/third-party records (CSV version)
    actual_truthset_key.csv     # Expected correct ER results (ground truth)
    alternate_truthset_key.csv  # Simulated results from a competing algorithm
  demo_v3/                      # V3 format (older)
```

## Data Sources

The truth set includes three data sources that represent common real-world scenarios:

- **Customers** — Your subjects of interest. These could be employees for insider threat detection, vendors for supply chain management, or whatever entities your organization tracks.
- **Watchlist** — Entities you don't want near your organization. These could be entities that have defrauded you in the past or entities you are mandated not to do business with, such as known terrorists and money launderers.
- **Reference** — External data you might purchase about people (demographics, past addresses, contact methods) or companies (firmographics, corporate structure, executives, and ownership).

## File Formats

The V4 demo files use the Senzing FEATURES array format in JSONL (one record per line). CSV versions of each data file are also provided. The V3 demo files use the older flat JSON format.

## Running the Demo

### Prerequisites

Senzing must be installed on a Linux server. See [Explore Senzing Entity Resolution](https://senzing.com/explore-senzing-entity-resolution/) for installation options. If using Docker, run a Senzing tools image with a volume mapped to the directory containing these files.

### Usage

From an initialized Senzing environment, run the demo script:

```bash
./truthset_demo.sh
```

The script performs the following steps:

1. **Purges the database** — Make sure you are OK with this before running!
2. **Loads configuration** — Applies the truth set config (`truthset_config.g2c`)
3. **Loads data** — Loads the customers, watchlist, and reference JSONL files
4. **Takes a snapshot** — Exports matches and calculates reports
5. **Performs an audit** — Compares the snapshot with the alternate truth set key to identify differences
6. **Opens sz_explorer** — An interactive tool for viewing matching statistics and drilling into entity examples, including how records matched or why they did not

## Truth Set Keys

The demo includes two key files that map records to expected entity clusters:

- **`actual_truthset_key.csv`** — The expected correct ER results (ground truth)
- **`alternate_truthset_key.csv`** — Simulated results from a legacy or competing algorithm, used to demonstrate ER auditing

## ER Auditing

An ER audit compares the results of two different systems or algorithms that resolve records to entities. By comparing how each system groups records, you can identify where they agree and where they differ — specifically which records one system merges that the other keeps separate, and vice versa. This highlights the strengths, weaknesses, and philosophical differences between the two approaches.
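
As a rough illustration of the comparison described above, the following Python sketch reduces each system's output to the set of record pairs it places in the same entity, then diffs the two sets. It is a conceptual sketch only, not how Senzing's audit tooling is implemented; the function names and the toy record IDs (`R1` to `R4`) are hypothetical.

```python
from itertools import combinations


def merged_pairs(assignments):
    """Return the set of record pairs that share a cluster.

    `assignments` maps a record key to the cluster ID one system gave it.
    """
    clusters = {}
    for record, cluster in assignments.items():
        clusters.setdefault(cluster, []).append(record)
    pairs = set()
    for members in clusters.values():
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def audit(system_a, system_b):
    """Split merged pairs into agreed, A-only, and B-only sets."""
    pairs_a = merged_pairs(system_a)
    pairs_b = merged_pairs(system_b)
    return {
        "both": pairs_a & pairs_b,
        "only_a": pairs_a - pairs_b,
        "only_b": pairs_b - pairs_a,
    }


# Hypothetical toy data: cluster IDs differ between systems on purpose,
# because only the groupings are compared.
system_a = {"R1": 1, "R2": 1, "R3": 2, "R4": 2}
system_b = {"R1": "x", "R2": "x", "R3": "x", "R4": "y"}
result = audit(system_a, system_b)
```

Here both systems merge `R1` with `R2`, only the first merges `R3` with `R4`, and only the second pulls `R3` into the `R1`/`R2` entity. Pairwise comparison scales quadratically with cluster size, which is acceptable for small truth sets like this one.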

### Alternate Key Format

The alternate key represents results from a legacy or competing algorithm as a simple CSV:

| Column | Description | Required |
|---|---|---|
| CLUSTER_ID | The entity/cluster identifier assigned by the alternate algorithm | Yes |
| RECORD_ID | The source record identifier | Yes |
| DATA_SOURCE | The data source the record came from | Only if multiple data sources are present |

Records sharing the same `CLUSTER_ID` were merged into the same entity. The `CLUSTER_ID` values do not need to match Senzing's entity IDs — the audit process compares groupings, not IDs.
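
To make the format concrete, here is a minimal Python sketch that reads a key in this layout and groups records by `CLUSTER_ID`. The sample rows are copied from the demo alternate key; the `load_key` helper and the `"DEFAULT"` placeholder used when the `DATA_SOURCE` column is absent are assumptions for illustration, not part of any Senzing tool.

```python
import csv
import io
from collections import defaultdict


def load_key(lines, default_source="DEFAULT"):
    """Group (DATA_SOURCE, RECORD_ID) pairs by CLUSTER_ID.

    `lines` is any iterable of CSV text lines, such as an open file.
    `default_source` is a hypothetical placeholder for keys that omit
    the optional DATA_SOURCE column.
    """
    clusters = defaultdict(set)
    for row in csv.DictReader(lines):
        source = row.get("DATA_SOURCE") or default_source
        clusters[row["CLUSTER_ID"]].add((source, row["RECORD_ID"]))
    return dict(clusters)


sample = io.StringIO(
    "CLUSTER_ID,RECORD_ID,DATA_SOURCE\n"
    "2,1001,CUSTOMERS\n"
    "2,1002,CUSTOMERS\n"
    "1,1005,CUSTOMERS\n"
    "1,1006,WATCHLIST\n"
)
clusters = load_key(sample)

# A single-source key may leave out DATA_SOURCE entirely.
single = load_key(io.StringIO("CLUSTER_ID,RECORD_ID\n7,A\n7,B\n"))
```

Keying each record by data source and record ID together avoids collisions when two sources reuse the same `RECORD_ID`.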

### Creating an Alternate Key

To generate an alternate key from another ER system, query its results database for the cluster-to-record mapping:

```sql
-- With multiple data sources:
SELECT cluster_id, record_id, data_source
FROM entity_resolution_results
ORDER BY cluster_id, data_source, record_id;

-- With a single data source (DATA_SOURCE column can be omitted):
SELECT cluster_id, record_id
FROM entity_resolution_results
ORDER BY cluster_id, record_id;
```
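
The query result then needs to be written out as a CSV with the header described earlier. The Python sketch below shows one way to do that, assuming the other system's results sit in a SQLite database with the same hypothetical `entity_resolution_results` table used in the SQL above; the in-memory database and its three rows are made up for the example.

```python
import csv
import io
import sqlite3


def export_alternate_key(conn, out):
    """Write the cluster-to-record mapping to `out` as an alternate key CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["CLUSTER_ID", "RECORD_ID", "DATA_SOURCE"])
    cursor = conn.execute(
        "SELECT cluster_id, record_id, data_source "
        "FROM entity_resolution_results "
        "ORDER BY cluster_id, data_source, record_id"
    )
    writer.writerows(cursor)


# Hypothetical stand-in for the other ER system's results database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_resolution_results "
    "(cluster_id INTEGER, record_id TEXT, data_source TEXT)"
)
conn.executemany(
    "INSERT INTO entity_resolution_results VALUES (?, ?, ?)",
    [(2, "B2", "WATCHLIST"), (1, "A1", "CUSTOMERS"), (2, "B1", "CUSTOMERS")],
)

buffer = io.StringIO()
export_alternate_key(conn, buffer)
key_csv = buffer.getvalue()
```

In real use, `out` would be a file opened with `newline=""` rather than an in-memory buffer, and the connection would come from whichever database driver the other system requires.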

### About the Demo Alternate Key

The alternate key included here was derived from Senzing's own results and then modified to simulate a legacy or competing algorithm with a different matching philosophy. Two types of changes were made:

**More aggressive name matching.** The alternate algorithm treats close name variants with a matching date of birth as definitive matches, even when Senzing considers them only possible matches. For example, "Darla Anderson" and "Darlene Anderson" sharing the same DOB are merged by the alternate algorithm but kept separate by Senzing. This reflects an algorithm that prioritizes recall over precision for name similarity.

**No employer-based matching.** The alternate algorithm does not use employer as a matching feature. Where Senzing merges records based on name + employer (e.g., "Howard Hughes" at "Universal Exports" across REFERENCE and WATCHLIST sources), the alternate algorithm keeps these as separate entities. This is common in algorithms that view employer as too volatile or unreliable to contribute to identity resolution.

### Why Audit?

These are simple examples meant to illustrate the concept. In practice, an audit like this brings algorithmic differences to light, quantifies them, and points toward a resolution. If the alternate algorithm's results are preferred in certain cases, Senzing can be tuned to match that behavior. Conversely, you may find that the alternate method was too optimistic (merging records that shouldn't be together) or too pessimistic (keeping apart records that clearly belong to the same entity). Either way, Senzing is tunable: its matching rules, thresholds, and feature usage can all be adjusted to align with your organization's requirements.

## V3 Demo

The `demo_v3` directory contains the V3 format files and multi-stage load scripts (`truthset-load1.sh`, `truthset-load2.sh`, `truthset-load3.sh`) that demonstrate incremental loading and auditing.
160 changes: 160 additions & 0 deletions truthsets/demo/alternate_truthset_key.csv
@@ -0,0 +1,160 @@
CLUSTER_ID,RECORD_ID,DATA_SOURCE
2,1001,CUSTOMERS
2,1002,CUSTOMERS
2,1003,CUSTOMERS
2,1004,CUSTOMERS
1,1005,CUSTOMERS
1,1006,WATCHLIST
155,1007,WATCHLIST
145,1008,WATCHLIST
7,1009,CUSTOMERS
7,1010,CUSTOMERS
7,1011,CUSTOMERS
7,1012,WATCHLIST
7,1014,WATCHLIST
3,1015,CUSTOMERS
3,1016,CUSTOMERS
3,1017,CUSTOMERS
3,1018,CUSTOMERS
13,1019,CUSTOMERS
16,1020,CUSTOMERS
16,1021,WATCHLIST
14,1022,CUSTOMERS
14,1023,CUSTOMERS
14,1024,WATCHLIST
17,1025,CUSTOMERS
17,1026,CUSTOMERS
17,1027,WATCHLIST
19,1028,CUSTOMERS
149,1029,WATCHLIST
20,1030,CUSTOMERS
20,1031,CUSTOMERS
22,1032,CUSTOMERS
22,1033,CUSTOMERS
24,1034,CUSTOMERS
24,1035,CUSTOMERS
24,1036,CUSTOMERS
24,1037,WATCHLIST
24,1038,WATCHLIST
25,1039,CUSTOMERS
27,1040,CUSTOMERS
153,1041,WATCHLIST
153,1042,WATCHLIST
29,1043,CUSTOMERS
30,1044,CUSTOMERS
29,1045,CUSTOMERS
30,1046,CUSTOMERS
32,1047,CUSTOMERS
32,1048,CUSTOMERS
32,1049,CUSTOMERS
36,1050,CUSTOMERS
36,1051,CUSTOMERS
36,1052,CUSTOMERS
39,1053,CUSTOMERS
40,1054,CUSTOMERS
39,1055,CUSTOMERS
40,1056,CUSTOMERS
43,1057,CUSTOMERS
44,1058,CUSTOMERS
45,1059,CUSTOMERS
45,1060,CUSTOMERS
47,1061,CUSTOMERS
47,1062,CUSTOMERS
49,1063,CUSTOMERS
49,1064,CUSTOMERS
49,1065,CUSTOMERS
49,1066,CUSTOMERS
49,1067,CUSTOMERS
49,1068,CUSTOMERS
55,1069,CUSTOMERS
55,1070,CUSTOMERS
57,1071,CUSTOMERS
57,1072,CUSTOMERS
59,1073,CUSTOMERS
59,1074,CUSTOMERS
61,1075,CUSTOMERS
61,1076,CUSTOMERS
63,1077,CUSTOMERS
63,1078,CUSTOMERS
64,1079,CUSTOMERS
64,1080,CUSTOMERS
67,1081,CUSTOMERS
67,1082,CUSTOMERS
69,1083,CUSTOMERS
69,1084,CUSTOMERS
71,1085,CUSTOMERS
71,1086,CUSTOMERS
73,1087,CUSTOMERS
73,1088,CUSTOMERS
76,1089,CUSTOMERS
76,1090,CUSTOMERS
77,1091,CUSTOMERS
77,1092,CUSTOMERS
79,1093,CUSTOMERS
79,1094,CUSTOMERS
82,1095,CUSTOMERS
82,1096,CUSTOMERS
83,1097,CUSTOMERS
83,1098,CUSTOMERS
86,1099,CUSTOMERS
86,1100,CUSTOMERS
87,1101,CUSTOMERS
87,1102,CUSTOMERS
90,1103,CUSTOMERS
90,1104,CUSTOMERS
92,2011,CUSTOMERS
92,2012,REFERENCE
55,2013,REFERENCE
57,2014,REFERENCE
94,2031,CUSTOMERS
94,2032,CUSTOMERS
97,2041,REFERENCE
97,2042,CUSTOMERS
126,2051,REFERENCE
126,2052,WATCHLIST
93,2061,REFERENCE
93,2062,WATCHLIST
93,2063,CUSTOMERS
96,2071,REFERENCE
96,2072,CUSTOMERS
98,2073,CUSTOMERS
98,2074,REFERENCE
134,2081,REFERENCE
9082,2082,WATCHLIST
131,2091,REFERENCE
9092,2092,WATCHLIST
127,2101,REFERENCE
127,2102,REFERENCE
135,2111,REFERENCE
135,2112,REFERENCE
137,2121,REFERENCE
137,2122,REFERENCE
139,2131,REFERENCE
139,2132,REFERENCE
99,2141,REFERENCE
99,2142,CUSTOMERS
101,2151,REFERENCE
101,2152,CUSTOMERS
143,2161,REFERENCE
143,2162,REFERENCE
100,2171,CUSTOMERS
102,2172,CUSTOMERS
103,2181,CUSTOMERS
106,2182,CUSTOMERS
108,2191,CUSTOMERS
104,2192,CUSTOMERS
104,2193,CUSTOMERS
107,2201,CUSTOMERS
109,2202,CUSTOMERS
110,2203,CUSTOMERS
111,2204,CUSTOMERS
112,2205,CUSTOMERS
113,2206,CUSTOMERS
114,2207,CUSTOMERS
115,2208,CUSTOMERS
116,2209,CUSTOMERS
117,2210,CUSTOMERS
118,2211,CUSTOMERS
119,2212,CUSTOMERS
114,2213,CUSTOMERS
121,2214,CUSTOMERS