Merged
159 commits
b7821f6
Merge remote-tracking branch 'upstream/master'
hmusta Dec 9, 2024
c15b1c9
inline sshash::perc
hmusta Dec 9, 2024
c62d1ae
suppress uninitialized warning
hmusta Dec 9, 2024
c86c661
Grant access to underlying dictionary (#11)
hmusta Mar 17, 2025
bedff07
Grant access to minimizers (#12)
hmusta Aug 11, 2025
b042476
some comments on random lookup benchmark
jermp Sep 3, 2025
5a47677
another comment on random lookup benchmark
jermp Sep 3, 2025
fd6a813
Add Bioconda installation badge to README
jermp Sep 4, 2025
41071fe
some results on random lookup benchmark
jermp Sep 4, 2025
e894045
updated pthash; simplified hash utils
jermp Sep 8, 2025
21c2e85
updated hash utils
jermp Sep 8, 2025
1f507a7
point to benchmarks folder
jermp Sep 10, 2025
d0e39db
tripartition of offsets
jermp Sep 12, 2025
550dbd9
fix
jermp Sep 12, 2025
80c9d00
using 32-bit words for buckets.start_lists_of_size
jermp Sep 12, 2025
a0140aa
lookup for canonical indexes
jermp Sep 12, 2025
34af717
a note on presence of minimizers when lookup is resolved via the skew…
jermp Sep 13, 2025
42c3d1d
fixed constants for skew index; merge parse and skew index construction
jermp Sep 14, 2025
c49df82
new results taken on 14/09/25: slightly faster construction, faster q…
jermp Sep 14, 2025
2c2ccc0
new results.png
jermp Sep 14, 2025
7c2e9c2
new results.png
jermp Sep 14, 2025
2155fff
a note on SIMD for encoding in dictionary::lookup; optimized string_t…
jermp Sep 16, 2025
c907f6d
a note about loop-unrolling in string_to_uint_kmer
jermp Sep 16, 2025
041d6d5
removed useless line
jermp Sep 16, 2025
c716fe7
minor fix to num. partitions in skew index; better access
jermp Sep 19, 2025
1ec6110
use a bits::compact_vector for (iteration to be fixed)
jermp Sep 19, 2025
bbcc2b6
updated external/bits
jermp Sep 20, 2025
27d8b72
updated external/bits and using bits::endpoints_sequence
jermp Sep 22, 2025
6b48d47
added missing include for compilation on Linux
jermp Sep 22, 2025
dd1a7d2
added missing include for compilation on Linux
jermp Sep 22, 2025
053f012
results 22-09-25 for k=31
jermp Sep 23, 2025
a972335
a note in readme
jermp Sep 23, 2025
cfc22a2
perf lookup by list size
jermp Sep 23, 2025
3c698d7
updated results to 22/09/25
jermp Sep 25, 2025
43dd436
added endpoints.hpp
jermp Sep 26, 2025
4390b13
minor
jermp Oct 1, 2025
df7438b
using encoded offsets
jermp Oct 3, 2025
3ccbbf4
clean up
jermp Oct 4, 2025
2cc6339
Fix processor check in CMakeLists.txt
adamant-pwn Oct 4, 2025
500747a
Merge upstream/master from jermp/sshash
adamant-pwn Oct 4, 2025
2efac6c
clean up and implemented endpoints::id_to_offset
jermp Oct 4, 2025
cfccdcc
Merge pull request #81 from adamant-pwn/patch-2
jermp Oct 4, 2025
6e4d9aa
fixed CMakeLists.txt
jermp Oct 4, 2025
77f8e59
include_directories -> target_include_directories
adamant-pwn Oct 4, 2025
0d329a6
Merge pull request #14 from jermp/master
adamant-pwn Oct 4, 2025
238e817
Add canonicalize_basepair_reverse_map to Protein kmers
adamant-pwn Oct 4, 2025
d86f2eb
Fix Clang compilation and update pthash submodule
adamant-pwn Oct 4, 2025
0ea5aca
Fix Clang compilation
adamant-pwn Oct 4, 2025
d1abe66
static -> static inline
adamant-pwn Oct 4, 2025
452972f
maybe_unused
adamant-pwn Oct 4, 2025
424dea0
Check if AVX2 is enabled instead of checking for x86_64
adamant-pwn Oct 4, 2025
841661d
Update pthash
adamant-pwn Oct 4, 2025
7758aa8
Update pthash
adamant-pwn Oct 5, 2025
a0859be
revert pthash to upstream
adamant-pwn Oct 5, 2025
f66ed13
fixed endpoints and parallel correctness check
jermp Oct 5, 2025
9399b62
Support any number of threads
adamant-pwn Oct 5, 2025
4a80d20
Fix parallel_sort.hpp
adamant-pwn Oct 5, 2025
571e3d4
added bioconda badge
jermp Oct 5, 2025
d87e11f
Multithreading fixes
adamant-pwn Oct 5, 2025
5380d4d
Merge pull request #82 from ratschlab/for-upstream
jermp Oct 6, 2025
9dc43a4
Try safe-guarding offsets_builder.set with mutex
adamant-pwn Oct 6, 2025
a01f0f8
fix kmer_t processing for uint_kmer_bits > 64
adamant-pwn Oct 6, 2025
b8f589c
implemented all miscellaneous fixes by Oleksandr Kulkov
jermp Oct 6, 2025
eb4d1c4
updated external/pthash
jermp Oct 6, 2025
ac4abe6
set offsets using a single thread
jermp Oct 7, 2025
26f48a5
removed unused code
jermp Oct 7, 2025
6578ddd
Single-threaded build_sparse_index
adamant-pwn Oct 7, 2025
b47bec1
fixes per review
adamant-pwn Oct 7, 2025
96cecad
return newline after pragma once
adamant-pwn Oct 7, 2025
e12fc8d
minor
jermp Oct 7, 2025
4e86487
Merge pull request #83 from ratschlab/for-upstream
jermp Oct 7, 2025
d22d01d
back to previous scheme
jermp Oct 10, 2025
4c07bde
more
jermp Oct 11, 2025
91677e7
more (needs fixing)
jermp Oct 12, 2025
14c832f
fix
jermp Oct 12, 2025
4733dae
fix perf test iterator
jermp Oct 12, 2025
a9055e2
big refactoring
jermp Oct 15, 2025
0f31776
minor
jermp Oct 15, 2025
13360a4
optimized num. locate queries
jermp Oct 16, 2025
858e71b
optimized num. locate queries
jermp Oct 16, 2025
127ca04
minor
jermp Oct 16, 2025
091f244
minor
jermp Oct 16, 2025
64c8443
XXH128 does not work on AMD processor: rewritten hashers for minimize…
jermp Oct 18, 2025
f5215ef
added cityhash
jermp Oct 19, 2025
09244aa
parallel checks
jermp Oct 21, 2025
5bb6ef3
print cmd; build and bench scripts updated
jermp Oct 21, 2025
d5987b2
build and bench scripts updated
jermp Oct 21, 2025
46d2118
new benchmarks logs: 21/10/25
jermp Oct 22, 2025
1b373f3
cap kmers to scan in perf_test_iterator to 10^8
jermp Oct 22, 2025
813b9bc
updated scripts
jermp Oct 22, 2025
2e42570
minor
jermp Oct 22, 2025
f66ce60
fixed build script and new results (22/10/25); also, noted that encod…
jermp Oct 22, 2025
a028972
added results
jermp Oct 23, 2025
db02c17
compute min by scan is actually faster than using a min-heap
jermp Oct 24, 2025
e67257e
scripts updated
jermp Oct 25, 2025
e9a525d
simplified file_merging_iterator
jermp Oct 25, 2025
ff33ec7
optimized merging with a loser tree (faster than a min-heap because …
jermp Oct 25, 2025
7f0b05d
avoid branch in tight loop
jermp Oct 27, 2025
ac04609
wrong namespace
jermp Oct 27, 2025
0ee2aa7
minor
jermp Oct 31, 2025
33020f4
quiet build
jermp Oct 31, 2025
70ceef1
quiet build
jermp Oct 31, 2025
2ae21fc
refactoring of build steps
jermp Nov 2, 2025
007ca31
json stats and refactored dictionary_builder
jermp Nov 3, 2025
c31d22f
minor
jermp Nov 3, 2025
e275d51
prefetching experiment: a little gain
jermp Nov 3, 2025
c41bdb8
json stats for perf benchmark
jermp Nov 3, 2025
2efb5d4
prefetching helps indeed random lookup
jermp Nov 3, 2025
7530305
prefetching also for canonical lookup
jermp Nov 4, 2025
0c53a23
updated external/pthash and refactored offsets.hpp
jermp Nov 4, 2025
e644a94
step 7.1 and 7.2 timed as well
jermp Nov 4, 2025
6fb7925
minor
jermp Nov 4, 2025
7d29302
examples in the readme updated
jermp Nov 5, 2025
fe05a41
minor
jermp Nov 6, 2025
f264b9b
minor
jermp Nov 6, 2025
5a12f40
build.py
jermp Nov 6, 2025
9219105
bench.py
jermp Nov 6, 2025
3e01643
build.py
jermp Nov 6, 2025
d5fb57c
deleted old scripts
jermp Nov 6, 2025
2b751b2
fix build.py script
jermp Nov 7, 2025
bd8be44
fix build.py script
jermp Nov 7, 2025
8c8562a
fix script
jermp Nov 7, 2025
eefc24e
updated essentials; fixed script
jermp Nov 10, 2025
efc3212
fix streaming query multiline fasta
jermp Nov 10, 2025
b7f815f
more stats to json
jermp Nov 10, 2025
2e6c05b
bench results 10/11/25
jermp Nov 10, 2025
99900cb
updated results; better streaming query script
jermp Nov 11, 2025
c4218d1
different query file for SE
jermp Nov 11, 2025
0502751
results updated
jermp Nov 12, 2025
71a6a93
benchmarks subfolder refactored
jermp Nov 12, 2025
98ee7a8
print version number in main tool
jermp Nov 12, 2025
3ed53ad
a note on benchmarks
jermp Nov 12, 2025
4365223
minor
jermp Nov 15, 2025
c769e05
sbwt results for k=63
jermp Nov 16, 2025
8c6ac62
prefetching does not actually help but writing offsets to an array fi…
jermp Nov 16, 2025
6f225a2
added results for sshash-v3 to compare against
jermp Nov 28, 2025
b59a3e4
removed empty json files
jermp Dec 8, 2025
32fd510
minor name cleanup
jermp Dec 18, 2025
de450db
minor name cleanup
jermp Dec 18, 2025
ad0cac9
removed some old comments
jermp Dec 18, 2025
1fbb593
README UPDATED
jermp Dec 18, 2025
4d9786f
Merge branch 'master' into bench
jermp Dec 19, 2025
ae61108
resolved some conflicts for merging into master
jermp Dec 19, 2025
63f5927
resolved some conflicts for merging into master
jermp Dec 19, 2025
aa95d7f
removed some old comments
jermp Dec 19, 2025
05a2626
Merge pull request #84 from jermp/bench
jermp Dec 19, 2025
be07ab7
swap locate query with string checking in lookup
jermp Jan 21, 2026
bf3a72b
Merge pull request #85 from jermp/lookup-logic-alt
jermp Jan 21, 2026
f3a35a7
results updated
jermp Jan 21, 2026
c4f2d43
readme and license updated
jermp Jan 23, 2026
e65dcb2
minor
jermp Jan 23, 2026
4c086b5
removed param
Jan 26, 2026
1152708
point to Rust port in README
jermp Feb 12, 2026
bf1ee85
Fix sparse index bug and improve library integration
adamant-pwn Feb 16, 2026
7fdd93a
Refine sparse index fix: track only buckets with size > 1
adamant-pwn Feb 16, 2026
946c8c4
Merge pull request #87 from ratschlab/upstream-bug-fixes
jermp Feb 23, 2026
4cc6255
clang format
jermp Feb 23, 2026
96314d2
updated external/pthash
jermp Feb 24, 2026
d2aa799
using cmov in the update loop of tournament tree
jermp Mar 6, 2026
2 changes: 0 additions & 2 deletions .clang-format
@@ -148,5 +148,3 @@ StatementMacros:
TabWidth: 8
UseTab: Never
...


75 changes: 45 additions & 30 deletions CMakeLists.txt
@@ -8,9 +8,9 @@ endif ()

set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

-if (UNIX AND (CMAKE_HOST_SYSTEM_PROCESSOR STREQUAL "x86_64"))
+if (UNIX AND (CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64"))
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mbmi2 -mavx2")
-if (SSHASH_USE_ARCH_NATIVE)
+if (SSHASH_USE_ARCH_NATIVE AND NOT CMAKE_CROSSCOMPILING)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
endif()
endif()
@@ -33,7 +33,7 @@ if (UNIX)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb")
-set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Wno-missing-braces -Wno-unknown-attributes -Wno-unused-function")
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Werror -Wno-missing-braces -Wno-unknown-attributes -Wno-unused-function")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread")

if (SSHASH_USE_SANITIZERS)
@@ -48,58 +48,73 @@ else()
set(CONDA_BUILD FALSE)
endif()

option(SSHASH_BUILD_EXECUTABLES "Build sshash executables" ON)
MESSAGE(STATUS "Build type: ${CMAKE_BUILD_TYPE}")
MESSAGE(STATUS "Conda build: ${CONDA_BUILD}")
MESSAGE(STATUS "Installation prefix: ${CMAKE_INSTALL_PREFIX}")
-MESSAGE(STATUS "Compiling for processor: ${CMAKE_HOST_SYSTEM_PROCESSOR}")
+MESSAGE(STATUS "Compiling for processor: ${CMAKE_SYSTEM_PROCESSOR}")
MESSAGE(STATUS "Compiling with flags:${CMAKE_CXX_FLAGS}")

include_directories(.) # all include paths relative to parent directory
include_directories(external/pthash/include)
include_directories(external/pthash/external/bits/include)
include_directories(external/pthash/external/fastmod)
include_directories(external/pthash/external/bits/external/essentials/include)
include_directories(external/pthash/external/xxHash)
include_directories(external/pthash/external/mm_file/include)
MESSAGE(STATUS "Build executables: ${SSHASH_BUILD_EXECUTABLES}")

set(Z_LIB_SOURCES
external/gz/zip_stream.cpp
)

set(CITYHASH_SOURCES
external/cityhash/cityhash.cpp
)

set(SSHASH_SOURCES
src/build.cpp
src/dictionary.cpp
src/query.cpp
src/info.cpp
src/statistics.cpp
)

set(SSHASH_INCLUDE_DIRS
external/pthash/include
external/pthash/external/bits/include
external/pthash/external/fastmod
external/pthash/external/bits/external/essentials/include
external/pthash/external/xxHash
external/pthash/external/mm_file/include
${CMAKE_CURRENT_SOURCE_DIR}
${CMAKE_CURRENT_SOURCE_DIR}/include
)

# Create a static lib
add_library(sshash_static STATIC
${Z_LIB_SOURCES}
${CITYHASH_SOURCES}
${SSHASH_SOURCES}
)

add_executable(sshash tools/sshash.cpp)
target_link_libraries(sshash
z
)
target_include_directories(sshash_static PUBLIC ${SSHASH_INCLUDE_DIRS})

# tests:
if(SSHASH_BUILD_EXECUTABLES)
add_executable(sshash tools/sshash.cpp)
target_include_directories(sshash PUBLIC ${SSHASH_INCLUDE_DIRS})
target_link_libraries(sshash
sshash_static
z
)

add_executable(test_alphabet test/test_alphabet.cpp)
target_link_libraries(test_alphabet
sshash_static
)
# tests:

add_executable(check test/check.cpp)
target_link_libraries(check
sshash_static
z
)
add_executable(test_alphabet test/test_alphabet.cpp)
target_link_libraries(test_alphabet
sshash_static
)

if (CONDA_BUILD)
install(TARGETS sshash
RUNTIME DESTINATION bin
add_executable(check test/check.cpp)
target_link_libraries(check
sshash_static
z
)

if (CONDA_BUILD)
install(TARGETS sshash
RUNTIME DESTINATION bin
)
endif()
endif()
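The refactored CMakeLists.txt above derives a bits-per-k-mer style message from build-time options; more generally, the PR's benchmark tooling reports index space as `index_size_in_bytes * 8 / num_kmers`. A minimal sketch of that conversion (illustrative values only; the field names match the JSONL records consumed by `print_csv.py` later in this PR):

```python
import json

# One line of the build JSONL log (illustrative values, not real results).
record = '{"num_kmers": 1000000, "index_size_in_bytes": 625000}'
d = json.loads(record)

# Space per k-mer, in bits: size in bytes times 8, divided by the k-mer count.
bits_per_kmer = d["index_size_in_bytes"] * 8 / d["num_kmers"]
print(f"{bits_per_kmer:.2f} bits/k-mer")  # 5.00 bits/k-mer
```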
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright 2021-2025 Giulio Ermanno Pibiri, Oleksandr Kulkov, and COMBINE Lab
+Copyright 2021-2026 Giulio Ermanno Pibiri, Oleksandr Kulkov, and COMBINE Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
103 changes: 48 additions & 55 deletions README.md

Large diffs are not rendered by default.

39 changes: 17 additions & 22 deletions benchmarks/README.md
@@ -1,34 +1,29 @@
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7239205.svg)](https://doi.org/10.5281/zenodo.7239205)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17582116.svg)](https://doi.org/10.5281/zenodo.17582116)

Benchmarks
----------

For these benchmarks we used the whole genomes of the following organisms:
For these benchmarks we used the datasets available here
[https://zenodo.org/records/17582116](https://zenodo.org/records/17582116).

- Gadus Morhua ("Cod")
- Falco Tinnunculus ("Kestrel")
- Homo Sapiens ("Human")

for k = 31 and 63.
To run the benchmarks, from within the `build` directory, run

The datasets and queries used in these benchmarks can be downloaded
by running the script
python3 ../script/build.py <log_label> <input_datasets_dir> <output_index_dir>
python3 ../script/bench.py <log_label> <input_index_dir>
python3 ../script/streaming-query-high-hit.py <log_label> <input_index_dir> <input_queries_dir>

```
bash download-datasets.sh
```
where `<log_label>` should be replaced by a suitable basename, e.g., the current date.

To run the benchmarks, from within the `build` directory, run
These are the results obtained on 21/01/26 (see logs [here](results-21-01-26))
on a machine equipped with an AMD Ryzen Threadripper PRO 7985WX processor clocked at 5.40GHz.
The code was compiled with `gcc` 13.3.0.

```
bash ../script/build.sh [prefix]
bash ../script/bench.sh [prefix]
bash ../script/streaming-query-high-hit.sh [prefix]
bash ../script/streaming-query-low-hit.sh [prefix]
```
The indexes were built with a max RAM usage of 16 GB and 64 threads.
Queries, instead, were run with a single thread.

where `[prefix]` should be replaced by a suitable basename, e.g., the current date.
![](results-21-01-26/results.png)

These are the results obtained on 22/08/25 (see logs [here](results-22-08-25)).
The results can be exported to CSV format with

![](results-22-08-25/results.png)
python3 ../script/print_csv.py ../benchmarks/results-10-11-25/k31
python3 ../script/print_csv.py ../benchmarks/results-10-11-25/k63
16 changes: 0 additions & 16 deletions benchmarks/download-datasets.sh

This file was deleted.

175 changes: 175 additions & 0 deletions benchmarks/print_csv.py
@@ -0,0 +1,175 @@
#!/usr/bin/env python3

import sys
import json
import os
from statistics import mean, StatisticsError
import math

def format_time(microseconds):
seconds = microseconds / 1_000_000
minutes = int(seconds // 60)
seconds = int(seconds % 60)
return f"{minutes}:{seconds:02d}"

def parse_build_file(path, canonical_flag):
"""Parse build JSONL file."""
results = []
with open(path) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
d = json.loads(line)
except json.JSONDecodeError:
print(f"Skipping invalid JSON line in {path}", file=sys.stderr)
continue

num_kmers = int(d["num_kmers"])
index_bytes = int(d["index_size_in_bytes"])
build_time_us = int(d["total_build_time_in_microsec"])

bits_per_kmer = (index_bytes * 8) / num_kmers
gb = index_bytes / 1e9
build_time_fmt = format_time(build_time_us)

fname = os.path.basename(d["input_filename"])
collection = fname.split(".")[0].capitalize()
k = d["k"]

results.append({
"k": k,
"Collection": collection,
"m": d["m"],
"canonical": "yes" if canonical_flag else "no",
"bits_per_kmer": f"{bits_per_kmer:.2f}",
"total_GB": f"{gb:.2f}",
"build_time": build_time_fmt
})
return results

def parse_bench_file(path, canonical_flag):
"""Parse benchmark JSONL file and average per collection."""
lookup_data = {}
with open(path) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
d = json.loads(line)
except json.JSONDecodeError:
print(f"Skipping invalid JSON line in {path}", file=sys.stderr)
continue

fname = os.path.basename(d["index_filename"])
collection = fname.split(".")[0].capitalize()
m = d["m"]
k = d["k"]
canonical = "yes" if canonical_flag else "no"

key = (collection, m, canonical)
entry = lookup_data.setdefault(key, {
"k": k,
"pos": [], "neg": [], "access": [], "iter": []
})
entry["pos"].append(float(d["positive lookup (avg_nanosec_per_kmer)"]))
entry["neg"].append(float(d["negative lookup (avg_nanosec_per_kmer)"]))
entry["access"].append(float(d["access (avg_nanosec_per_kmer)"]))
entry["iter"].append(float(d["iterator (avg_nanosec_per_kmer)"]))

# average the results
for k, v in lookup_data.items():
try:
lookup_data[k] = {
"k": v["k"],
"pos": f"{mean(v['pos'])/1000:.2f}",
"neg": f"{mean(v['neg'])/1000:.2f}",
"access": f"{mean(v['access'])/1000:.2f}",
"iter": f"{mean(v['iter']):.2f}",
}
except StatisticsError:
lookup_data[k] = {"k": v["k"], "pos": "NA", "neg": "NA", "access": "NA", "iter": "NA"}
return lookup_data


def parse_streaming_file(path, canonical_flag):
"""Parse streaming queries JSON file."""
stream_data = {}
if not os.path.exists(path):
return stream_data

with open(path) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
d = json.loads(line)
except json.JSONDecodeError:
print(f"Skipping invalid JSON line in {path}", file=sys.stderr)
continue

fname = os.path.basename(d["index_filename"])
collection = fname.split(".")[0].capitalize()
canonical = "yes" if canonical_flag else "no"

key = (collection, canonical)
num_kmers = int(d["num_kmers"])
num_pos = int(d["num_positive_kmers"])
num_ext = int(d["num_extensions"])
elapsed_ms = int(d["elapsed_millisec"])

ns_per_kmer = int(math.ceil(elapsed_ms * 1e6 / num_kmers))
hit_rate = (num_pos / num_kmers) * 100 if num_kmers else 0
extension_rate = (num_ext / num_pos) * 100 if num_pos else 0

stream_data[key] = {
"ns_per_kmer": f"{ns_per_kmer}",
"hit_rate": f"{hit_rate:.2f}",
"extension_rate": f"{extension_rate:.2f}"
}
return stream_data


def main():
if len(sys.argv) != 2:
print("Usage: print_csv.py input_dir", file=sys.stderr)
sys.exit(1)

input_dir = sys.argv[1]
reg_build_path = input_dir + "/regular-build.json"
canon_build_path = input_dir + "/canon-build.json"
reg_bench_path = input_dir + "/regular-bench.json"
canon_bench_path = input_dir + "/canon-bench.json"
reg_stream_path = input_dir + "/regular-streaming-queries-high-hit.json"
canon_stream_path = input_dir + "/canon-streaming-queries-high-hit.json"

reg_build = parse_build_file(reg_build_path, False)
canon_build = parse_build_file(canon_build_path, True)
reg_bench = parse_bench_file(reg_bench_path, False)
canon_bench = parse_bench_file(canon_bench_path, True)
reg_stream = parse_streaming_file(reg_stream_path, False)
canon_stream = parse_streaming_file(canon_stream_path, True)

# merge everything
all_builds = reg_build + canon_build
lookup_all = {**reg_bench, **canon_bench}
stream_all = {**reg_stream, **canon_stream}

# CSV header
print("k,Collection,m,canonical,bits_per_kmer,total_GB,build_time,positive_lookup_ns,negative_lookup_ns,access_ns,iteration_ns,ns_per_kmer,hit_rate,extension_rate")

for r in sorted(all_builds, key=lambda x: (int(x["k"]), x["Collection"], x["canonical"])):
lookup = lookup_all.get(
(r["Collection"], r["m"], r["canonical"]), # key
{"pos": "NA", "neg": "NA", "access": "NA", "iter": "NA", "k": r["k"]})
stream = stream_all.get(
(r["Collection"], r["canonical"]), # key
{"ns_per_kmer": "NA", "hit_rate": "NA", "extension_rate": "NA"})

print(f"{r['k']},{r['Collection']},{r['m']},{r['canonical']},{r['bits_per_kmer']},{r['total_GB']},{r['build_time']},{lookup['pos']},{lookup['neg']},{lookup['access']},{lookup['iter']},{stream['ns_per_kmer']},{stream['hit_rate']},{stream['extension_rate']}")

if __name__ == "__main__":
main()
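A note on the `format_time` helper above: it truncates, rather than rounds, to whole minutes and seconds, so for instance 59,999,999 microseconds prints as `0:59`, not `1:00`. A standalone copy for illustration:

```python
def format_time(microseconds):
    # Mirror of the helper in print_csv.py: truncate to whole minutes/seconds.
    seconds = microseconds / 1_000_000
    minutes = int(seconds // 60)
    seconds = int(seconds % 60)
    return f"{minutes}:{seconds:02d}"

print(format_time(90_000_000))   # 1:30
print(format_time(59_999_999))   # 0:59
```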