[DE-6999] Enable image deduplication within nucleus sdk by edwinpav · Pull Request #452 · scaleapi/nucleus-python-client

edwinpav · 2026-02-23T20:52:05Z

See title.

Merge in after sister pr is deployed: https://github.com/scaleapi/scaleapi/pull/134861

Testing
Will add some tests to this repo

From root of this repo, create a venv and run pip install -e . to have this venv connect to the local version of the sdk (this repo). Make sure the venv is created with python11. Currently, some of the sdk code doesn't support newer versions of python (I can look upgrading it). Based on the client installation tests, it only supports python3.7-3.11.
Run test scripts within this venv (can run from any repo).

Example test script of valid usage:

import nucleus

# define variables
corp_api_key="<SCALE_API_KEY>"
customer_id="68921622befbf26f9e535024"
SCALE_API_KEY=f"{corp_api_key}|{customer_id}"
endpoint="http://localhost:3000/v1/nucleus"
dataset_id = "ds_d6ccka5zks5g0bheab8g"

# initialize client
client = nucleus.NucleusClient(SCALE_API_KEY, endpoint=endpoint)
print(client)
dataset = client.get_dataset(dataset_id)
print(dataset)

entire_dataset_dedup = dataset.deduplicate(threshold=30)
print(entire_dataset_dedup)
print()

ref_ids_dedup = dataset.deduplicate(threshold=10, reference_ids=["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"])
print(ref_ids_dedup)
print(ref_ids_dedup.stats)
print()

dataset_item_ids_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mab0", "di_d6ccmm2mc93g23g1mabg", "di_d6ccmm2mc93g23g1mac0", "di_d6ccmm2mc93g23g1macg", "di_d6ccmm2mc93g23g1mad0"])
print(ref_ids_dedup)
print(ref_ids_dedup.stats)

Output:

NucleusClient(api_key='scaleint_c5477527b28e4911b887ac2ede355eac|68921622befbf26f9e535024', use_notebook=False, endpoint='http://localhost:3000/v1/nucleus')
Dataset(name='test-phash-scene-1, dataset_id='ds_d6ccka5zks5g0bheab8g', is_scene='True')
DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mb1g', 'di_d6ccmm2mc93g23g1mbyg', 'di_d6ccmm2mc93g23g1mchg', 'di_d6ccmm2mc93g23g1mfk0', 'di_d6ccmm2mc93g23g1mgeg', 'di_d6ccmm2mc93g23g1mh10', 'di_d6ccp5rmc93g200pr6n0', 'di_d6ccp70mc93g1yqr4pw0'], unique_reference_ids=['video1/0', 'video1/46', 'video1/104', 'video1/142', 'video1/337', 'video1/392', 'video1/429', 'video3/108', 'video2/397'], stats=DeduplicationStats(threshold=30, original_count=1802, deduplicated_count=9))

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

Examples of invalid usage:

# not passing threshold
entire_dataset_dedup = dataset.deduplicate()

# output
    entire_dataset_dedup = dataset.deduplicate()
                           ^^^^^^^^^^^^^^^^^^^^^
TypeError: Dataset.deduplicate() missing 1 required positional argument: 'threshold'

# invalid threshold
entire_dataset_dedup = dataset.deduplicate(threshold=70) 
# or
entire_dataset_dedup = dataset.deduplicate(threshold=-5)

# output (for both)
Tried to post http://localhost:3000/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate, but received 400: Bad Request.
The detailed error is:
{"error":"An unexpected internal error occured: threshold must be an integer between 0 and 64","route":"/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate","request_id":"51bb8a48-0a63-44f8-8ef5-5954493b0edb","status_code":400}

# empty list for ref_ids instead of just not passing in that param
entire_dataset_dedup = dataset.deduplicate(threshold=10, reference_ids=[])

# output
ValueError: reference_ids cannot be empty. Omit reference_ids parameter to deduplicate entire dataset.

# empty list for dataset_item_ids
entire_dataset_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=[])

# output
ValueError: dataset_item_ids must be non-empty. Use deduplicate() for entire dataset.

Enable deduplication in nucleus sdk

b8ff516

edwinpav self-assigned this Feb 23, 2026

edwinpav added 2 commits February 23, 2026 15:54

Lint fixes

436fbaf

Fix import order

0a1c8d2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[DE-6999] Enable image deduplication within nucleus sdk#452

[DE-6999] Enable image deduplication within nucleus sdk#452
edwinpav wants to merge 3 commits intomasterfrom
edwinpav/dedup

edwinpav commented Feb 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

edwinpav commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

edwinpav commented Feb 23, 2026 •

edited

Loading