Mini-batching time cost #4

@ASharmaML

Firstly, this is a great library for clustering unit-normed feature spaces fast and coherently!

I have a question about mini-batching. I expected each call to partial_fit to take roughly the same amount of time (or even less on later batches). Instead, each mini-batch takes longer than the last, with the per-batch time growing almost linearly in the number of batches already processed. This is on real structured data with hierarchies and clusters to be found, not on randomly generated matrices.

Pseudo-behaviour
First 10,000 data-points: 10 seconds to run
Second 10,000 data-points: 20 seconds to run
Third 10,000 data-points: 30 seconds to run
Total time taken for 30,000 data-points: 60 seconds.

Is this expected behaviour? As a side note, I found the implementation details in the accompanying SCC paper and could follow them, but I cannot find any details about how the mini-batching works.
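
For anyone wanting to reproduce the pattern, here is a minimal sketch of how the per-batch timing could be measured. It does not use this library: `AccumulatingEstimator` is a hypothetical stand-in whose `partial_fit` deliberately does work over every point seen so far, and `time_batches` is an illustrative helper name. The only assumption about a real estimator is that it exposes a `partial_fit(batch)` method.

```python
import time


class AccumulatingEstimator:
    """Hypothetical stand-in, NOT this library's estimator.

    Each partial_fit call stores the new batch and then does work
    proportional to the total number of points seen so far.
    """

    def __init__(self):
        self.seen = []
        self.total = 0.0

    def partial_fit(self, batch):
        self.seen.extend(batch)
        # Recompute over all stored points, so cost grows with every batch.
        self.total = sum(x * x for x in self.seen)
        return self


def time_batches(estimator, batches):
    """Return a list of (points_seen_so_far, seconds) per partial_fit call."""
    timings = []
    seen = 0
    for batch in batches:
        start = time.perf_counter()
        estimator.partial_fit(batch)
        elapsed = time.perf_counter() - start
        seen += len(batch)
        timings.append((seen, elapsed))
    return timings


# Three batches of 10,000 points, mirroring the example above.
batches = [[float(i) for i in range(10_000)] for _ in range(3)]
timings = time_batches(AccumulatingEstimator(), batches)
for seen, seconds in timings:
    print(f"{seen} points seen: {seconds:.4f} s")
```

If each call really does work proportional to all points seen so far, batch k costs about k times the first batch, and n batches cost n(n+1)/2 times the first in total. For three batches at 10 seconds that gives 10 + 20 + 30 = 60 seconds, which matches the timings above. Swapping the stand-in for the real estimator would show whether the same growth appears there.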
