Bypass Python loop in Indexlr

Hi, 

I'm the developer of [Seqwin](https://www.biorxiv.org/content/10.1101/2025.11.07.687294v1), and it depends on btllib to calculate minimizers and create minimizer graphs. We really appreciate the effort of developing and maintaining such an amazing toolkit! 

To access minimizers in python, nested for loops are needed (e.g, `for record in indexlr` and `for mx in record.minimizers`). Since Seqwin could take thousands of microbial genomes as input, it could be much faster if the minimizers are packed in c++ and returned to python as a single array. For example
```python
with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
    kmers, record_ids = assembly._minimizers_packed() # kmers: raw bytes; record_ids: python tuple
kmers = np.frombuffer(kmers, dtype=KMER_DTYPE) # a structured dtype including hashes and positions
```

To do this, I tried to add a new private method of Indexlr in `wrappers/python/extra.i`, with some custom Seqwin variables as input (see code below). I then rebuilt btllib following `build.sh` in the bioconda recipe. However, the new method was about 3x slower than the original python loops (the outputs were the same). So I was a bit confused and would like to seek for your advice. 

Best, 
Michael

---

In `wrappers/python/extra.i`, I first added a struct definition in the `%{ %}` block
```c++
%{
  using btllib::SpacedSeed;
  
  // a packed struct to match Seqwin's KMER_DTYPE (17 bytes)
  #pragma pack(push, 1)
  struct PackedMinimizer {
      uint64_t hash;        // 8 bytes
      uint32_t pos;         // 4 bytes
      uint16_t record_idx;  // 2 bytes
      uint16_t assembly_idx;// 2 bytes
      bool is_target;       // 1 byte
  };
  #pragma pack(pop)
%}
```

Then I added the new method for Indexlr. 
```c++
%extend btllib::Indexlr {
  btllib::Indexlr* __enter__() {
    return $self;
  }

  void __exit__(PyObject*, PyObject*, PyObject*) {
    $self->close();
  }

  // --- NEW METHOD START ---
  // assembly_idx and is_target are assembly-specific variables used in Seqwin
  PyObject* _minimizers_packed(uint16_t assembly_idx, bool is_target) {
    std::vector<PackedMinimizer> buffer;
    std::vector<std::string> ids;

    uint16_t record_idx = 0;
    buffer.reserve(10000);

    for (const auto& record : *$self) {
        ids.push_back(record.id);

        if (!record.minimizers.empty()) {
            for (const auto& m : record.minimizers) {
                PackedMinimizer pm;
                pm.hash = m.out_hash;
                pm.pos = m.pos;
                pm.record_idx = record_idx;
                pm.assembly_idx = assembly_idx;
                pm.is_target = is_target;
                
                buffer.push_back(pm);
            }
        }
        record_idx++;
    }

    // create python bytes object
    PyObject* py_bytes = PyBytes_FromStringAndSize(
        reinterpret_cast<const char*>(buffer.data()), 
        buffer.size() * sizeof(PackedMinimizer)
    );

    // create python tuple for record IDs
    PyObject* py_ids = PyTuple_New(ids.size());
    for (size_t i = 0; i < ids.size(); ++i) {
        PyTuple_SetItem(py_ids, i, PyUnicode_FromString(ids[i].c_str()));
    }

    PyObject* result = PyTuple_New(2);
    PyTuple_SetItem(result, 0, py_bytes);
    PyTuple_SetItem(result, 1, py_ids);

    return result;
  }
  // --- NEW METHOD END ---
}
```

It built and ran without error. Below is the python part, and outputs were the same as the original python API. 
```python
KMER_DTYPE = np.dtype([
    ('hash', np.uint64), 
    ('pos', np.uint32), 
    ('record_idx', np.uint16), 
    ('assembly_idx', np.uint16), 
    ('is_target', np.bool_), 
# no padding to save memory
], align=False)

with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
    kmers, ids = assembly._minimizers_packed(
        assembly_idx, is_target # seqwin variables
    )

kmers = np.frombuffer(kmers, dtype=KMER_DTYPE)
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bypass Python loop in Indexlr #142

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bypass Python loop in Indexlr #142

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions