Modern analytical databases use data skipping to avoid scanning irrelevant data blocks. Your task is to design a data skipping index for a simplified scenario. For more context, see the TUMuchData Blog — Data Skipping.
Design a custom data skipping index structure that helps decide whether a data block can be skipped for a given query. The goal is to minimize the total workload cost consisting of storage cost and access cost.
Each query is of the form:
SELECT COUNT(*) FROM table WHERE column = x;
The table consists of a single column of unsigned 32-bit integers.
For each block of data, the index must report how many times the value x appears.
The index may return the exact count or indicate "unknown" — in which case the query engine computes the result from the base data.
A single wrong response invalidates the entire run.
Your implementation goes in code/User.hpp. You need to implement two functions:
std::vector<std::byte> build_idx(std::span<const uint32_t> data, Parameters config)build_idx receives the contents of a data block and a workload configuration. The configuration includes
Hint: use reinterpret_cast.
std::optional<size_t> query_idx(uint32_t predicate, const std::vector<std::byte>& index)query_idx receives the index previously built for a block and decides whether scanning that block can be skipped. If the index can answer the query, return the count of predicate in the block; otherwise return std::nullopt.
Total Workload Cost = Access Cost + Storage Cost
Normalized score (100 points per test case):
Correctly skipping blocks earns points; index storage costs deduct points.
Your solution is evaluated on 2 test cases with different cost parameters:
| Test Case | Description | ||
|---|---|---|---|
| frequent | 3 | 1 | Storage is cheap. Larger, more accurate indices pay off. |
| normal | 1 | 5 | Storage is moderately expensive. Be selective about what to store. |
Both test cases use 512 data blocks × 131,072 unsigned 32-bit integers each, with 512 queries.
$1 \leq F_A, F_S \leq 100$ - Each data block contains 131,072 unsigned 32-bit integers
Step 1. Generate test data (requires Python + NumPy):
make generate_dataStep 2. Implement your solution in code/User.hpp — this is the only file you need to modify.
Step 3. Build and evaluate:
make run_allYour total score is the sum of normalized scores across both test cases (max 100 per case, 200 total).
You can also run individual test cases: make run_frequent, make run_normal.
Complete the assignment by following the steps below. Each step builds on the previous one.
Score requirements (total = frequent + normal, out of 200):
| Step | Points | Minimum Score Required |
|---|---|---|
| Step 1 — Min-Max Index | 30 pts | ≥ 2 |
| Step 2 — Bloom Filter | 30 pts | ≥ 80 |
| Step 3 — Exact Count for Frequent Values | 40 pts | ≥ 105 |
| Bonus: Step 4 — Compression | +10 pts | ≥ 120 |
The simplest data skipping index. This is the baseline used by real-world systems such as Snowflake and Databricks.
What to do:
- In
build_idx, store the minimum and maximum value of the block (just 8 bytes). - In
query_idx, ifpredicate < min || predicate > max, the value cannot exist in this block — return0to skip it. Otherwise, returnstd::nullopt.
What you learn: Even this minimal structure can skip a significant fraction of blocks at negligible storage cost.
A Bloom filter is a compact probabilistic data structure that tests set membership. It has no false negatives (if it says "not present", the value is definitely absent) but may have false positives.
What to do:
- In
build_idx, construct a Bloom filter from all values in the block, and store it alongside the min-max values. - In
query_idx, first apply the min-max check. Then, if the predicate is in range, query the Bloom filter. If the filter says "not present", return0. Otherwise, returnstd::nullopt. - Consider how to size the filter: a larger filter has fewer false positives (more skipping) but higher storage cost. The ratio
$F_A / F_S$ should guide this trade-off.
What you learn: Probabilistic data structures let you trade a small amount of storage for substantially better skip rates.
When some values appear very frequently in a block, storing their exact counts lets you answer queries without scanning.
What to do:
- In
build_idx, identify the top-K most frequent values in the block. Store each value and its count alongside your existing index structures. - In
query_idx, if the predicate matches a stored value, return the exact count. Otherwise, fall back to the Bloom filter / min-max check. - Choose K based on the storage budget: when
$F_A / F_S$ is large (storage is cheap), you can afford to store more entries.
What you learn: Combining multiple index strategies yields much better results than any single approach. Data-dependent decisions (choosing K) are important for real-world index design.
When storage is cheap, the ideal index stores exact counts for all values in the block — but a naive implementation wastes space.
What to do:
- Implement at least one compression technique to reduce index size. Options include:
- Delta encoding — store sorted values as successive differences
- Varint encoding — encode small integers using fewer bytes
- Bitpacking — pack values using only as many bits as needed
- Run-length encoding (RLE) — exploit repeated values
- When
$F_A / F_S$ is high enough, switch from TopK to a full compressed count map.
What you learn: Encoding schemes are critical in database systems — the same logical data can vary enormously in physical size.
.
├── README.md # This file
├── Makefile # Build, run, submit targets
├── code/
│ ├── User.hpp # ★ The file you need to modify
│ ├── main.cpp # Evaluation harness
│ ├── Parameters.hpp # Config (f_a, f_s)
│ └── FileUtils.hpp # File I/O
└── data/
├── eval.json # Test case configs
├── generate.py # Data generator
└── generate_all.py # Generate all test cases
make submitThis cleans all generated data, build artifacts, and creates submission.zip in the project root.
Then submit submission.zip to Web Learning (网络学堂).
Note for Windows users: The
zipcommand is not available on Windows by default. You may need to pack the files intosubmission.zipmanually.
If you have any questions, you may ask them in the WeChat group or Web Learning.
This assignment is adapted from the TUMuchData Coding Challenge 2025. Thanks to TUMuchData for the excellent problem framework.