`MemPool.datastore` memory utilization keeps increasing when using DTables with multiple processes #60

I have the following setup:
- Start Julia with `julia --project -t1 --heap-size-hint=3G`
- Add 4 processes with `addprocs(4; exeflags = "--heap-size-hint=3G")`
- Worker 1 receives a query request and then tells worker 2 to do the work

The actual query includes loading a table from a .csv file into a `DTable` (with a `DataFrame` table type). Operations include `select`ing columns, `fetch`ing the table into a `DataFrame` for adding/removing rows/columns and other processing as needed, and re-wrapping the table in a `DTable` to later be processed further. At the end of processing, the result is returned as a `DataFrame`.

The .csv file contains a table with 233930 rows and 102 columns: 1 column of `InlineStrings.String15`, 2 columns of `InlineStrings.String1`, 45 columns of `Int64`, and 54 columns of `Float64`.
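For a sense of scale, the sketch below estimates the size of one in-memory copy of this table. It assumes that `String15` occupies 16 bytes and `String1` occupies 2 bytes per element (my understanding of the fixed sizes InlineStrings.jl uses), and it ignores container overhead, so treat the result as a rough estimate rather than a measurement.

```julia
# Rough size of one in-memory copy of the table described above.
nrows = 233_930

# Assumed per-element sizes for the InlineStrings column types (not measured).
string15_bytes = 16
string1_bytes = 2

# 1 String15 column, 2 String1 columns, 45 Int64 columns, 54 Float64 columns.
bytes_per_row = 1 * string15_bytes + 2 * string1_bytes +
                45 * sizeof(Int64) + 54 * sizeof(Float64)
total_bytes = nrows * bytes_per_row
total_mib = total_bytes / 2^20

println("bytes per row: ", bytes_per_row)
println("approx. size of one copy: ", round(total_mib; digits = 1), " MiB")
```

Under these assumptions one copy is roughly 181 MiB, so it takes only a modest number of retained copies to approach the 3G heap size hint.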

**The issue:** If I run the same query repeatedly, the `MemPool.datastore` on worker 2 consumes more and more memory, as measured by:
```julia
remotecall_fetch(2) do
    Base.summarysize(MyPackage.Dagger.MemPool.datastore)
end
```
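To make the growth easy to show, a loop like the following could record the measurement after each run. This is a minimal sketch: `track_growth` is a hypothetical helper, and `query` and `measure` are caller-supplied functions, not part of DTables.jl, Dagger.jl, or MemPool.jl.

```julia
using Distributed

# Run `query()` `n` times and, after each run, evaluate `measure()` on
# process `pid`. Returns the vector of recorded values (e.g. sizes in bytes).
# Both `query` and `measure` are hypothetical, caller-supplied functions.
function track_growth(query, measure; pid::Int = 2, n::Int = 10)
    sizes = Int[]
    for _ in 1:n
        query()
        push!(sizes, remotecall_fetch(measure, pid))
    end
    return sizes
end
```

For the setup above, `measure` would be `() -> Base.summarysize(MyPackage.Dagger.MemPool.datastore)` and `query` whatever triggers one run of the query. `diff(sizes)` then gives the change per run.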
Eventually, the memory usage grows enough to cause my WSL 2 Linux OOM manager to kill worker 2, crashing my program.

Notably, I do *not* observe this growth in memory usage in the following scenarios:
- when running everything on a single process (i.e., not calling `addprocs`), or
- when using `DataFrame`s exclusively (i.e., not using DTables.jl at all).

I *do* observe this growth in memory usage in the following additional scenarios:
- when using `NamedTuple` as the table type for the `DTable`s, or
- when running everything on a single process, but with multiple processes available. (To clarify, my code exclusively uses worker 1 in this scenario, but DTables.jl/Dagger.jl appears to use the other available workers. In this case, it is the `MemPool.datastore` on worker 1, not worker 2, that consumes more and more memory. However, I never ran into any issues with the OOM manager killing my processes.)
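Since the process that accumulates data differs between scenarios, it can help to take the same measurement on every process at once. A minimal sketch, where `measure_everywhere` is a hypothetical helper and `measure` is a caller-supplied function:

```julia
using Distributed

# Evaluate `measure()` on every process (including process 1) and return a
# Dict mapping process id to the measured value.
# `measure` is a hypothetical, caller-supplied zero-argument function.
function measure_everywhere(measure)
    return Dict(pid => remotecall_fetch(measure, pid) for pid in procs())
end
```

Calling it with `() -> Base.summarysize(MyPackage.Dagger.MemPool.datastore)` after each query run would show which data store is growing, provided the package is loaded on all processes.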

I'm posting this issue in DTables.jl in case there's something DTables.jl is doing that somehow causes the MemPool.jl data store to keep references around longer than expected, but of course please transfer this issue to Dagger.jl or MemPool.jl as needed.

Please let me know if there is any other information that would help with finding the root cause of this issue.
