Skip to content

refactor!: Align DatasetClient and KeyValueStoreClient interfaces with their Python counterparts#3627

Open
janbuchar wants to merge 10 commits into
v4from
simplify-storage-subclients
Open

refactor!: Align DatasetClient and KeyValueStoreClient interfaces with their Python counterparts#3627
janbuchar wants to merge 10 commits into
v4from
simplify-storage-subclients

Conversation

@janbuchar
Copy link
Copy Markdown
Contributor

@janbuchar janbuchar commented May 5, 2026

  • closes Rework the storage client system #3075
  • follow up to refactor!: Overhaul the StorageClient interface #3570 which updated the StorageClient interface - this handles the "subclients"
  • Aligning the RequestQueueClient is left for Move Apify RequestQueue-specific logic to SDK #3328
    • Straightforward changes such as removing update were done here, the likes of listHead and fetchNextRequest will have to wait
  • The dual iterable return values (AsyncIterator<T> | Promise<Page<T>>) for iterating dataset items and key-value store keys (values, key-value pairs) were left only in storage "frontends" so that individual storage clients do not need to implement this - they only have to provide a fetch-single-page method. The "awaitable" part of the dual iterable always returns a single page of the results - for key-value stores, fetching everything is potentially very heavy on memory and we want to keep the interface consistent.

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label May 5, 2026
@janbuchar janbuchar changed the title refactor!: Trim the storage subclient interfaces refactor!: Align DatasetClient and KeyValueStoreClient interfaces with their Python counterparts May 7, 2026
@janbuchar janbuchar requested review from B4nan and barjin May 7, 2026 10:39
@janbuchar janbuchar marked this pull request as ready for review May 7, 2026 10:39
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @janbuchar ! Few ideas below ⬇️

Comment thread packages/core/src/storages/dataset.ts Outdated
Comment thread packages/core/src/storages/key_value_store.ts Outdated
Base automatically changed from refactor-storage-manager to v4 May 11, 2026 11:18
@janbuchar janbuchar force-pushed the simplify-storage-subclients branch from d998bc7 to 1e349f5 Compare May 12, 2026 11:20
const persistStateIntervalMillis = serviceLocator.getConfiguration().persistStateIntervalMillis;
const timeoutSecs = persistStateIntervalMillis / 2_000;
await this.keyValueStore
.setValue(this.persistStateKey, this.toJSON(), {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a drastic change, but as part of #3652, we will unify and re-introduce the timeout handling.


ow(data, 'data', ow.object);
const dispatch = async (payload: string) => this.client.pushItems(payload);
const limit = MAX_PAYLOAD_SIZE_BYTES - Math.ceil(MAX_PAYLOAD_SIZE_BYTES * SAFETY_BUFFER_PERCENT);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling of oversize payloads will be reintroduced in SDK as a part of apify/apify-sdk-js#603

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The methods removed from this file were reincarnated in packages/core/src/storages/utils.ts


test('can be awaited directly (backward compatibility)', async () => {
const result = await dataset.listItems({ limit: 10 });
test('getData returns a paginated result', async () => {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a BC break, but for consistency. Thoughts, @barjin, @B4nan, @l2ysho (you implemented this after all)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here deserve most attention - they define the interface for future storage implementations.

@janbuchar janbuchar requested a review from barjin May 13, 2026 16:34
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @janbuchar! I have a few more questions/ideas ⬇️

Comment thread packages/types/src/storages.ts
Comment on lines +615 to +619
return createDualIterable({
createPages: () => this.fetchPages(options),
extractItems: (page) => page.items,
mapFirstPage: (page) => page,
});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that await Dataset.values() will return the first page (of size dependent on the storage client implementation, e.g., 1000)

Let's assume a dataset of size, e.g., 3000

for await (const value of Dataset.values()) {
     // iterates over all the 3000 items 
}

for (const value of await Array.fromAsync(Dataset.values())) {
     // blocks while fetching the entire dataset, then iterates over all the 3000 items 
}

for (const value of await Dataset.values()) {
    // explodes, returns a { items: [1000 items], limit: 1000, offset: 0, total: 3000 }
}

This imo breaks the whole idea of the dual iterables. Fetching pages can be done (and has historically been done) with getData. imo the values / keys / entries methods should hide the backend pagination and allow the user to work directly with the data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to make sure what you're proposing:

  • Dataset.getData - one page of backend-dependent length, in a wrapper container with some metadata, no dual iterator
  • Dataset.export - all items in the dataset, as a simple raw array
  • Dataset.values - dual iterator, just items (Data[]), no metadata
  • Dataset.entries - same as the above, but with indexes
  • KeyValueStore.keys, KeyValueStore.values, KeyValueStore.entries - dual iterator, no metadata
  • all dual iterators follow the "await to get one page, iterate to go through everything" paradigm

Or did I miss the point completely?

Copy link
Copy Markdown
Member

@barjin barjin May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree with all points except for the last one

all dual iterators follow the "await to get one page, iterate to go through everything" paradigm

Imo awaiting the dual iterators should return all the items (same as Dataset.export)... the need for pagination comes from Apify API, Crawlee users imo shouldn't care for this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things that led me to this conclusion - the result of KeyValueStore.values() will easily be very big and allowing pulling it into memory with a single await feels like a major footgun to me. And if we make one dual iterator return a single page, I'd prefer to keep it consistent accross all of them.

Which of these two do you consider worth breaking? 😁

Copy link
Copy Markdown
Member

@barjin barjin May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd break the first one, 100% 😄

Imo if a user drops tons of data in the KVS, then calls console.log(JSON.stringify(await kvs.values()))... it's on them, and we shouldn't stop them from doing this. The performance flop will be fully their responsibility (and quite obvious to spot).

Imo pagination has its place in the lower abstraction levels (Apify API, AWS API, etc.), but Crawlee shouldn't leak this (Dataset.getData I'm willing to accept, as the name is rather technical.) Key-Value store and Map are well-known concepts, and I don't think many ppl connect these with pagination (which might be, therefore, surprising for them).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's do it this way, accepting the footgun. 6e70953 (this PR)

Comment thread packages/core/src/storages/utils.ts Outdated
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more minuscule ideas, but I like where this is going!

I'd love a second (third) pair of eyes here, though; I'm getting tunnel vision already :)

batchAddRequests(requests: RequestSchema[], options?: RequestOptions): Promise<BatchAddRequestsResult>;
getRequest(id: string): Promise<RequestOptions | undefined>;
updateRequest(request: UpdateRequestSchema, options?: RequestOptions): Promise<QueueOperationInfo>;
deleteRequest(id: string): Promise<unknown>;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this removal intentional? I see we have this method in API. What is the alternative?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the crawlers don't really need this method.

Comment on lines +69 to +70
/** Remove all items from the dataset but keep the dataset itself. */
purge(): Promise<void>;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is purge possible with the Apify Platform Dataset implementation?

I suppose we could drop + create in case of named datasets (but it will likely still change the dataset id), but you won't be able to do this with unnamed / default datasets imo.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you should be able to drop any storage by ID and recreate it. And yes, it will change the ID. In the Python version, we log a warning if someone does this on a storage backend that doesn't have native support for purge

* ```javascript
* const dataset = await Dataset.open('my-results');
*
* // Stream all items (memory-efficient for large datasets)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

"stream" all items

This might be slightly confusing for some (we're async iterating over the items, but loading the actual items is "blocking" - you have to load the entire item from the storage before processing it)

@B4nan B4nan requested a review from l2ysho May 20, 2026 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants