Optimize high-cardinality multi-CSV profiling and add multi-analyze / desktop features #25

Merged
rad1092 merged 2 commits into main from codex/evaluate-current-project-completion-level on Feb 15, 2026

@rad1092 (Owner) commented Feb 15, 2026

Motivation

  • Prevent unbounded memory growth when profiling columns with very high cardinality while preserving actionable profile outputs.
  • Provide higher-level multi-file analysis, reporting and UI/desktop affordances to make multi-CSV workflows approachable from CLI and browser/desktop.

Description

  • Replace per-column set unique tracking with a fixed-size bitmap approximation (UNIQUE_BITMAP_SIZE) and add _update_unique_bitmap/_estimate_unique_count in bitnet_tools/multi_csv.py to bound memory for distinct-count estimation.
  • Introduce a bounded top-value counter (TOP_VALUE_TRACK_CAP) and _update_bounded_counter that aggregates overflow into an __OTHER__ bucket and exposes top_values_capped in column profiles.
  • Add/expand multi-file features and UI/CLI integrations including analyze_multiple_csv, multi-analyze CLI command, /api/multi-analyze web endpoint, dashboard rendering in ui/ assets, desktop launcher (bitnet_tools/desktop.py, bitnet_desktop.pyw), environment doctor (bitnet_tools/doctor.py), visualization helpers (bitnet_tools/visualize.py), and report helpers (build_markdown_report, multi-report builder).
  • Refactor streaming summarization in bitnet_tools/analysis.py (summarize_reader) so that it no longer materializes the entire reader, and add markdown report generation; update pyproject.toml to expose the bitnet-desktop script, update README.md, and make small UI/CSS/JS enhancements for the multi-analyze dashboard.
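
A minimal sketch of how a fixed-size bitmap can bound memory for distinct-count estimation, using linear counting. This is not the code from bitnet_tools/multi_csv.py: the function names, the BLAKE2b hash, and the bytearray layout are assumptions made for illustration. Only the 65,536-bit size and the saturation clamp come from this PR's discussion.

```python
import hashlib
import math

UNIQUE_BITMAP_SIZE = 65_536  # number of bits (size quoted in the review of this PR)


def new_unique_bitmap() -> bytearray:
    """Allocate an all-zero bitmap: 65,536 bits = 8,192 bytes per column."""
    return bytearray(UNIQUE_BITMAP_SIZE // 8)


def update_unique_bitmap(bitmap: bytearray, value: str) -> None:
    """Hash the value to one bit position and set that bit (hypothetical helper)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    idx = int.from_bytes(digest, "big") % UNIQUE_BITMAP_SIZE
    bitmap[idx >> 3] |= 1 << (idx & 7)


def estimate_unique_count(bitmap: bytearray) -> int:
    """Linear-counting estimate: n is roughly -m * ln(fraction of zero bits)."""
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    if set_bits >= UNIQUE_BITMAP_SIZE:
        # No zero bits left, so the logarithm is undefined; clamp to the size.
        return UNIQUE_BITMAP_SIZE
    zero_fraction = 1 - set_bits / UNIQUE_BITMAP_SIZE
    return round(-UNIQUE_BITMAP_SIZE * math.log(zero_fraction))


if __name__ == "__main__":
    bitmap = new_unique_bitmap()
    for i in range(1000):
        update_unique_bitmap(bitmap, f"user-{i}")
    print(estimate_unique_count(bitmap))  # an estimate close to 1000
```

In this sketch memory stays at 8 KiB per column however many rows are read; the trade-off is that the estimate loses precision as the bitmap fills up.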

Testing

  • Ran the test suite with pytest -q and all tests passed (20 passed).
  • Added test_multi_csv_top_values_capped_marker in tests/test_analysis.py, which forces a small TOP_VALUE_TRACK_CAP and asserts that top_values_capped is set and that __OTHER__ appears; it passed as part of the suite run.
  • Existing multi-file/CLI related tests were updated/extended and passed as part of the full pytest -q run.
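
The capping behaviour that the new test exercises can be sketched roughly as follows. The helper below is a hypothetical stand-in, not the PR's _update_bounded_counter: its signature, its return value, and the cap of 3 are assumptions chosen to keep the example small.

```python
from collections import Counter

TOP_VALUE_TRACK_CAP = 3      # deliberately tiny, as a test might force it
OTHER_BUCKET = "__OTHER__"   # overflow bucket name used in this PR


def update_bounded_counter(counter: Counter, value: str,
                           cap: int = TOP_VALUE_TRACK_CAP) -> bool:
    """Count `value`; return True if it was folded into the overflow bucket."""
    # Number of distinct values tracked so far, not counting the overflow bucket.
    tracked = len(counter) - (1 if OTHER_BUCKET in counter else 0)
    if value in counter or tracked < cap:
        counter[value] += 1
        return False
    # The cap is reached and the value is new: aggregate it instead of growing.
    counter[OTHER_BUCKET] += 1
    return True


counter: Counter = Counter()
top_values_capped = False
for value in ["a", "b", "a", "c", "d", "e", "a"]:
    if update_bounded_counter(counter, value):
        top_values_capped = True

print(dict(counter))       # {'a': 3, 'b': 1, 'c': 1, '__OTHER__': 2}
print(top_values_capped)   # True
```

Note that this sketch does not distinguish a real data value that happens to equal "__OTHER__" from the overflow bucket; a production version would need to decide how to handle that collision.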

Codex Task

@rad1092 merged commit 35b3b84 into main Feb 15, 2026
4 checks passed
@rad1092 deleted the codex/evaluate-current-project-completion-level branch February 15, 2026 00:36
@chatgpt-codex-connector

💡 Codex Review

path = Path(td) / name
path.write_text(text, encoding="utf-8")

P1: Sanitize uploaded filenames before writing temp CSVs

The /api/multi-analyze handler trusts files[*].name from the request body and writes it with Path(td) / name; crafted values like ../../somefile or absolute paths escape the temp directory and let the request overwrite arbitrary files writable by the server process. This is a concrete path traversal/write primitive on the new endpoint (even if the UI sends safe names, the API is still directly callable), so filenames should be normalized to a basename and validated before writing.
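
One possible shape for the suggested normalization is sketched below. The helper name safe_csv_path is hypothetical and is not part of the PR; the approach (reduce the name to a basename, reject empty or dot-only names, then confirm the resolved path stays inside the temp directory) is one reasonable option rather than the project's actual fix.

```python
import tempfile
from pathlib import Path, PureWindowsPath


def safe_csv_path(base_dir: Path, name: str) -> Path:
    """Map an untrusted upload name to a path directly inside base_dir."""
    # PureWindowsPath treats both "/" and "\" as separators, so .name drops
    # directory components and drive letters from POSIX and Windows style input.
    base = PureWindowsPath(name).name
    if base in {"", ".", ".."} or "\x00" in base:
        raise ValueError(f"unsafe upload filename: {name!r}")
    root = base_dir.resolve()
    candidate = (root / base).resolve()
    # Defence in depth: the final path must sit directly inside base_dir.
    if candidate.parent != root:
        raise ValueError(f"upload filename escapes temp dir: {name!r}")
    return candidate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as td:
        path = safe_csv_path(Path(td), "../../somefile")
        path.write_text("a,b\n1,2\n", encoding="utf-8")
        print(path.name)  # somefile
```

With this sketch a traversal attempt is reduced to its final component instead of being rejected outright; rejecting any name that contains a separator would be a stricter alternative.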


if set_bits >= UNIQUE_BITMAP_SIZE:
    return UNIQUE_BITMAP_SIZE

P2: Preserve cardinality estimate after bitmap saturation

When all bits are set, _estimate_unique_count returns UNIQUE_BITMAP_SIZE (65,536) regardless of how many distinct values were seen, so very high-cardinality columns are massively undercounted once the bitmap saturates. That undercount feeds unique_ratio and semantic typing in _profile_csv_stream, which can misclassify mostly-unique string identifiers as low-cardinality categories and distort multi-file drift outputs on large datasets.
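
To illustrate one way the saturated case could be surfaced rather than silently clamped, here is a hedged sketch. The UniqueEstimate type, the estimate_unique function, and the non_null_rows fallback are assumptions introduced for this example and do not appear in the PR. Once every bit is set, the bitmap only shows that the true count is at least the bitmap size, so the sketch reports the non-null row count as an upper bound and flags the result as saturated.

```python
import math
from typing import NamedTuple


class UniqueEstimate(NamedTuple):
    count: int
    saturated: bool  # True when the bitmap can no longer discriminate


def estimate_unique(set_bits: int, bitmap_size: int,
                    non_null_rows: int) -> UniqueEstimate:
    """Linear-counting estimate with an explicit saturation signal."""
    if set_bits >= bitmap_size:
        # All bits set: -m * ln(0) diverges. Fall back to the row count,
        # which is an upper bound on the number of distinct values.
        return UniqueEstimate(non_null_rows, True)
    estimate = round(-bitmap_size * math.log(1 - set_bits / bitmap_size))
    # A column cannot have more distinct values than non-null rows.
    return UniqueEstimate(min(estimate, non_null_rows), False)


if __name__ == "__main__":
    print(estimate_unique(65_536, 65_536, 1_000_000))
    # UniqueEstimate(count=1000000, saturated=True)
```

Downstream code that computes unique_ratio could then treat a saturated estimate as "very high cardinality" instead of dividing 65,536 by the row count. Switching to a sketch designed for large cardinalities, such as HyperLogLog, would be another option.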
