Motivation
Prevent unbounded memory growth when profiling columns with very high cardinality while preserving actionable profile outputs.
Provide higher-level multi-file analysis, reporting and UI/desktop affordances to make multi-CSV workflows approachable from CLI and browser/desktop.
Description
Replace per-column set-based unique tracking with a fixed-size bitmap approximation (UNIQUE_BITMAP_SIZE) and add _update_unique_bitmap and _estimate_unique_count in bitnet_tools/multi_csv.py to bound the memory used for distinct-count estimation.
Introduce a bounded top-value counter (TOP_VALUE_TRACK_CAP) and _update_bounded_counter that aggregates overflow into an __OTHER__ bucket and exposes top_values_capped in column profiles.
Add/expand multi-file features and UI/CLI integrations including analyze_multiple_csv, multi-analyze CLI command, /api/multi-analyze web endpoint, dashboard rendering in ui/ assets, desktop launcher (bitnet_tools/desktop.py, bitnet_desktop.pyw), environment doctor (bitnet_tools/doctor.py), visualization helpers (bitnet_tools/visualize.py), and report helpers (build_markdown_report, multi-report builder).
Refactor streaming summarization in bitnet_tools/analysis.py to avoid materializing entire readers (summarize_reader) and add markdown report generation. Also update pyproject.toml to expose the bitnet-desktop script, update README.md, and add small UI/CSS/JS enhancements for the multi-analyze dashboard.
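For reviewers, the bounded top-value counter described in the list above can be pictured roughly as follows. This is a minimal sketch and not the code in bitnet_tools/multi_csv.py: the function name `update_bounded_counter`, its signature, its return value, and the tiny cap are assumptions made for illustration. Only TOP_VALUE_TRACK_CAP, the __OTHER__ bucket, and the top_values_capped flag are named in this PR.

```python
from collections import Counter

# Assumed value, kept tiny so the overflow is visible; the real cap is not stated here.
TOP_VALUE_TRACK_CAP = 3
OTHER_BUCKET = "__OTHER__"


def update_bounded_counter(counter, value, cap=TOP_VALUE_TRACK_CAP):
    """Count `value`, folding unseen values into OTHER_BUCKET once `cap` keys are tracked.

    Returns True when the value overflowed into the bucket (hypothetical contract).
    Note: a literal "__OTHER__" value in the data would be merged with the bucket.
    """
    if value in counter or len(counter) < cap:
        counter[value] += 1
        return False
    counter[OTHER_BUCKET] += 1
    return True


# Usage: memory stays bounded at cap + 1 keys no matter how many distinct values arrive.
counts = Counter()
top_values_capped = False
for v in ["a", "b", "a", "c", "d", "e", "a"]:
    top_values_capped = update_bounded_counter(counts, v) or top_values_capped
```

The trade-off is that values first seen after the cap is reached are never promoted, even if they later become frequent, so the reported top values depend on arrival order.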
Testing
Ran the test suite with pytest -q and all tests passed (20 passed).
Added test_multi_csv_top_values_capped_marker in tests/test_analysis.py which forces a small TOP_VALUE_TRACK_CAP and asserts top_values_capped is set and __OTHER__ appears, and the test passed under the suite run.
Existing multi-file/CLI related tests were updated/extended and passed as part of the full pytest -q run.
Sanitize uploaded filenames before writing temp CSVs
The /api/multi-analyze handler trusts files[*].name from the request body and writes it with Path(td) / name; crafted values like ../../somefile or absolute paths escape the temp directory and let the request overwrite arbitrary files writable by the server process. This is a concrete path traversal/write primitive on the new endpoint (even if the UI sends safe names, the API is still directly callable), so filenames should be normalized to a basename and validated before writing.
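One way to apply the suggested normalization is sketched below. The helper name `safe_upload_path` and its error handling are hypothetical, and the handler's actual structure is not shown in this comment, so treat it as an illustration of the basename-and-validate approach rather than a drop-in patch.

```python
import tempfile
from pathlib import Path, PurePosixPath, PureWindowsPath


def safe_upload_path(temp_dir: Path, raw_name: str) -> Path:
    """Reduce a client-supplied name to a basename and keep it inside temp_dir."""
    # Strip directories under both POSIX and Windows separator rules.
    base = PureWindowsPath(PurePosixPath(raw_name).name).name
    if base in ("", ".", "..") or "\x00" in base:
        raise ValueError(f"invalid upload filename: {raw_name!r}")
    root = temp_dir.resolve()
    target = (root / base).resolve()
    # Defense in depth: the resolved path must be a direct child of temp_dir.
    if target.parent != root:
        raise ValueError(f"upload filename escapes temp dir: {raw_name!r}")
    return target


# Usage: traversal attempts collapse to a plain filename inside the temp directory.
with tempfile.TemporaryDirectory() as td:
    traversal_target = safe_upload_path(Path(td), "../../somefile")
    stays_inside = traversal_target.parent == Path(td).resolve()
```

Rejecting empty and dot-only names outright, instead of silently renaming them, keeps malformed requests visible to the caller.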
Preserve cardinality estimate after bitmap saturation
When all bits are set, _estimate_unique_count returns UNIQUE_BITMAP_SIZE (65,536) regardless of how many distinct values were seen, so very high-cardinality columns are massively undercounted once the bitmap saturates. That undercount feeds unique_ratio and semantic typing in _profile_csv_stream, which can misclassify mostly-unique string identifiers as low-cardinality categories and distort multi-file drift outputs on large datasets.
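A possible direction, sketched under assumptions: the PR does not say which estimator _estimate_unique_count uses, so the version below swaps in standard linear counting (estimate = -m * ln(zero_bits / m)) and returns an explicit saturation flag so callers can treat the number as a lower bound instead of feeding it into unique_ratio. The function names without leading underscores, the hash choice, and the tuple return type are all hypothetical.

```python
import hashlib
import math
from typing import Tuple

UNIQUE_BITMAP_SIZE = 65_536  # bits, matching the figure quoted in this comment


def new_bitmap() -> bytearray:
    return bytearray(UNIQUE_BITMAP_SIZE // 8)


def update_unique_bitmap(bitmap: bytearray, value: str) -> None:
    # Stable hash (assumed choice) so results do not vary between processes.
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    idx = int.from_bytes(digest, "big") % UNIQUE_BITMAP_SIZE
    bitmap[idx >> 3] |= 1 << (idx & 7)


def estimate_unique_count(bitmap: bytearray) -> Tuple[int, bool]:
    """Return (estimate, saturated). When saturated, the estimate is only a lower bound."""
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    zero_bits = UNIQUE_BITMAP_SIZE - set_bits
    if zero_bits == 0:
        # Linear counting is undefined here; report the bound and flag it.
        return UNIQUE_BITMAP_SIZE, True
    estimate = -UNIQUE_BITMAP_SIZE * math.log(zero_bits / UNIQUE_BITMAP_SIZE)
    return int(round(estimate)), False


# Usage: 1,000 distinct identifiers in a fresh bitmap.
bitmap = new_bitmap()
for i in range(1000):
    update_unique_bitmap(bitmap, f"id-{i}")
estimate, saturated = estimate_unique_count(bitmap)
```

With a flag like this, _profile_csv_stream could skip ratio-based semantic typing for saturated columns, or fall back to a sketch with a wider range such as HyperLogLog. Either choice is a design decision for the PR author.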