-
Notifications
You must be signed in to change notification settings - Fork 7
Description
Hi,
I'm the developer of Seqwin, and it depends on btllib to calculate minimizers and create minimizer graphs. We really appreciate the effort of developing and maintaining such an amazing toolkit!
To access minimizers in python, nested for loops are needed (e.g, for record in indexlr and for mx in record.minimizers). Since Seqwin could take thousands of microbial genomes as input, it could be much faster if the minimizers are packed in c++ and returned to python as a single array. For example
with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
kmers, record_ids = assembly._minimizers_packed() # kmers: raw bytes; record_ids: python tuple
kmers = np.frombuffer(kmers, dtype=KMER_DTYPE) # a structured dtype including hashes and positionsTo do this, I tried to add a new private method of Indexlr in wrappers/python/extra.i, with some custom Seqwin variables as input (see code below). I then rebuilt btllib following build.sh in the bioconda recipe. However, the new method was about 3x slower than the original python loops (the outputs were the same). So I was a bit confused and would like to seek for your advice.
Best,
Michael
In wrappers/python/extra.i, I first added a struct definition in the %{ %} block
%{
using btllib::SpacedSeed;
// a packed struct to match Seqwin's KMER_DTYPE (17 bytes)
#pragma pack(push, 1)
struct PackedMinimizer {
uint64_t hash; // 8 bytes
uint32_t pos; // 4 bytes
uint16_t record_idx; // 2 bytes
uint16_t assembly_idx;// 2 bytes
bool is_target; // 1 byte
};
#pragma pack(pop)
%}Then I added the new method for Indexlr.
%extend btllib::Indexlr {
btllib::Indexlr* __enter__() {
return $self;
}
void __exit__(PyObject*, PyObject*, PyObject*) {
$self->close();
}
// --- NEW METHOD START ---
// assembly_idx and is_target are assembly-specific variables used in Seqwin
PyObject* _minimizers_packed(uint16_t assembly_idx, bool is_target) {
std::vector<PackedMinimizer> buffer;
std::vector<std::string> ids;
uint16_t record_idx = 0;
buffer.reserve(10000);
for (const auto& record : *$self) {
ids.push_back(record.id);
if (!record.minimizers.empty()) {
for (const auto& m : record.minimizers) {
PackedMinimizer pm;
pm.hash = m.out_hash;
pm.pos = m.pos;
pm.record_idx = record_idx;
pm.assembly_idx = assembly_idx;
pm.is_target = is_target;
buffer.push_back(pm);
}
}
record_idx++;
}
// create python bytes object
PyObject* py_bytes = PyBytes_FromStringAndSize(
reinterpret_cast<const char*>(buffer.data()),
buffer.size() * sizeof(PackedMinimizer)
);
// create python tuple for record IDs
PyObject* py_ids = PyTuple_New(ids.size());
for (size_t i = 0; i < ids.size(); ++i) {
PyTuple_SetItem(py_ids, i, PyUnicode_FromString(ids[i].c_str()));
}
PyObject* result = PyTuple_New(2);
PyTuple_SetItem(result, 0, py_bytes);
PyTuple_SetItem(result, 1, py_ids);
return result;
}
// --- NEW METHOD END ---
}It built and ran without error. Below is the python part, and outputs were the same as the original python API.
KMER_DTYPE = np.dtype([
('hash', np.uint64),
('pos', np.uint32),
('record_idx', np.uint16),
('assembly_idx', np.uint16),
('is_target', np.bool_),
# no padding to save memory
], align=False)
with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
kmers, ids = assembly._minimizers_packed(
assembly_idx, is_target # seqwin variables
)
kmers = np.frombuffer(kmers, dtype=KMER_DTYPE)