refactor!: Align `DatasetClient` and `KeyValueStoreClient` interfaces with their Python counterparts by janbuchar · Pull Request #3627 · apify/crawlee

janbuchar · 2026-05-05T16:38:11Z

closes Rework the storage client system #3075
follow up to refactor!: Overhaul the StorageClient interface #3570 which updated the StorageClient interface - this handles the "subclients"
Aligning the RequestQueueClient is left for Move Apify RequestQueue-specific logic to SDK #3328
- Straightforward changes such as removing update were done here, the likes of listHead and fetchNextRequest will have to wait
The dual iterable return values (AsyncIterator<T> | Promise<Page<T>>) for iterating dataset items and key-value store keys (values, key-value pairs) were left only in storage "frontends" so that individual storage clients do not need to implement this - they only have to provide a fetch-single-page method. The "awaitable" part of the dual iterable always returns a single page of the results - for key-value stores, fetching everything is potentially very heavy on memory and we want to keep the interface consistent.

barjin

Thank you @janbuchar ! Few ideas below ⬇️

janbuchar · 2026-05-13T16:17:38Z

-        const persistStateIntervalMillis = serviceLocator.getConfiguration().persistStateIntervalMillis;
-        const timeoutSecs = persistStateIntervalMillis / 2_000;
        await this.keyValueStore
-            .setValue(this.persistStateKey, this.toJSON(), {


This seems like a drastic change, but as part of #3652, we will unify and re-introduce the timeout handling.

janbuchar · 2026-05-13T16:19:59Z


        ow(data, 'data', ow.object);
-        const dispatch = async (payload: string) => this.client.pushItems(payload);
-        const limit = MAX_PAYLOAD_SIZE_BYTES - Math.ceil(MAX_PAYLOAD_SIZE_BYTES * SAFETY_BUFFER_PERCENT);


Handling of oversize payloads will be reintroduced in SDK as a part of apify/apify-sdk-js#603

janbuchar · 2026-05-13T16:23:34Z

The methods removed from this file were reincarnated in packages/core/src/storages/utils.ts

janbuchar · 2026-05-13T16:28:49Z


-        test('can be awaited directly (backward compatibility)', async () => {
-            const result = await dataset.listItems({ limit: 10 });
+        test('getData returns a paginated result', async () => {


So this is a BC break, but for consistency. Thoughts, @barjin, @B4nan, @l2ysho (you implemented this after all)

janbuchar · 2026-05-13T16:30:11Z

The changes here deserve most attention - they define the interface for future storage implementations.

barjin

Thank you @janbuchar! I have a few more questions/ideas ⬇️

barjin · 2026-05-14T09:30:38Z

+        return createDualIterable({
+            createPages: () => this.fetchPages(options),
+            extractItems: (page) => page.items,
+            mapFirstPage: (page) => page,
+        });


This means that await Dataset.values() will return the first page (of size dependent on the storage client implementation, e.g., 1000)

Let's assume a dataset of size, e.g., 3000

for await (const value of Dataset.values()) { // iterates over all the 3000 items } for (const value of await Array.fromAsync(Dataset.values())) { // blocks while fetching the entire dataset, then iterates over all the 3000 items } for (const value of await Dataset.values()) { // explodes, returns a { items: [1000 items], limit: 1000, offset: 0, total: 3000 } }

This imo breaks the whole idea of the dual iterables. Fetching pages can be done (and has historically been done) with getData. imo the values / keys / entries methods should hide the backend pagination and allow the user to work directly with the data.

So, just to make sure what you're proposing:

Dataset.getData - one page of backend-dependent length, in a wrapper container with some metadata, no dual iterator

Dataset.export - all items in the dataset, as a simple raw array

Dataset.values - dual iterator, just items (Data[]), no metadata

Dataset.entries - same as the above, but with indexes

KeyValueStore.keys, KeyValueStore.values, KeyValueStore.entries - dual iterator, no metadata

all dual iterators follow the "await to get one page, iterate to go through everything" paradigm

Or did I miss the point completely?

Yes, I agree with all points except for the last one

all dual iterators follow the "await to get one page, iterate to go through everything" paradigm

Imo awaiting the dual iterators should return all the items (same as Dataset.export)... the need for pagination comes from Apify API, Crawlee users imo shouldn't care for this.

There are two things that led me to this conclusion - the result of KeyValueStore.values() will easily be very big and allowing pulling it into memory with a single await feels like a major footgun to me. And if we make one dual iterator return a single page, I'd prefer to keep it consistent accross all of them.

Which of these two do you consider worth breaking? 😁

I'd break the first one, 100% 😄

Imo if a user drops tons of data in the KVS, then calls console.log(JSON.stringify(await kvs.values()))... it's on them, and we shouldn't stop them from doing this. The performance flop will be fully their responsibility (and quite obvious to spot).

Imo pagination has its place in the lower abstraction levels (Apify API, AWS API, etc.), but Crawlee shouldn't leak this (Dataset.getData I'm willing to accept, as the name is rather technical.) Key-Value store and Map are well-known concepts, and I don't think many ppl connect these with pagination (which might be, therefore, surprising for them).

OK, let's do it this way, accepting the footgun. 6e70953 (this PR)

barjin

a few more minuscule ideas, but I like where this is going!

I'd love a second (third) pair of eyes here, though; I'm getting tunnel vision already :)

barjin · 2026-05-19T15:03:28Z

    batchAddRequests(requests: RequestSchema[], options?: RequestOptions): Promise<BatchAddRequestsResult>;
    getRequest(id: string): Promise<RequestOptions | undefined>;
    updateRequest(request: UpdateRequestSchema, options?: RequestOptions): Promise<QueueOperationInfo>;
-    deleteRequest(id: string): Promise<unknown>;


Was this removal intentional? I see we have this method in API. What is the alternative?

Yeah, the crawlers don't really need this method.

barjin · 2026-05-19T15:05:49Z

+    /** Remove all items from the dataset but keep the dataset itself. */
+    purge(): Promise<void>;


Is purge possible with the Apify Platform Dataset implementation?

I suppose we could drop + create in case of named datasets (but it will likely still change the dataset id), but you won't be able to do this with unnamed / default datasets imo.

Actually, you should be able to drop any storage by ID and recreate it. And yes, it will change the ID. In the Python version, we log a warning if someone does this on a storage backend that doesn't have native support for purge

barjin · 2026-05-19T15:09:52Z

     * ```javascript
     * const dataset = await Dataset.open('my-results');
+     *
+     * // Stream all items (memory-efficient for large datasets)


nit:

"stream" all items

This might be slightly confusing for some (we're async iterating over the items, but loading the actual items is "blocking" - you have to load the entire item from the storage before processing it)

janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label May 5, 2026

janbuchar mentioned this pull request May 5, 2026

Re-introduce functionality removed from crawlee in v4 apify/apify-sdk-js#603

Open

janbuchar changed the title ~~refactor!: Trim the storage subclient interfaces~~ refactor!: Align DatasetClient and KeyValueStoreClient interfaces with their Python counterparts May 7, 2026

janbuchar requested review from B4nan and barjin May 7, 2026 10:39

janbuchar marked this pull request as ready for review May 7, 2026 10:39

barjin reviewed May 11, 2026

View reviewed changes

Comment thread packages/core/src/storages/dataset.ts Outdated

Comment thread packages/core/src/storages/key_value_store.ts Outdated

Base automatically changed from refactor-storage-manager to v4 May 11, 2026 11:18

refactor!: Trim the storage subclient interfaces

1e349f5

janbuchar force-pushed the simplify-storage-subclients branch from d998bc7 to 1e349f5 Compare May 12, 2026 11:20

janbuchar added 7 commits May 12, 2026 15:07

Address review comments

f724635

Improve dual iterator semantics

1786bcd

Pass limit to first-page call in KVS

b999174

Remove obsolete options

07d6793

Throw on missing exclusiveStartKey in memory storage

76c2543

Simplify

1d6129b

Remove timeout from state persistence callbacks

9469b2e

janbuchar commented May 13, 2026

View reviewed changes

Update tests

10ee5a4

janbuchar requested a review from barjin May 13, 2026 16:34

barjin reviewed May 14, 2026

View reviewed changes

awaiting a dual iterator yields all items

6e70953

barjin reviewed May 19, 2026

View reviewed changes

B4nan requested a review from l2ysho May 20, 2026 13:10

		/** Remove all items from the dataset but keep the dataset itself. */
		purge(): Promise<void>;

Conversation

janbuchar commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

barjin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

barjin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

barjin May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

barjin May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

barjin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

janbuchar commented May 5, 2026 •

edited

Loading

barjin May 15, 2026 •

edited

Loading

barjin May 19, 2026 •

edited

Loading