Skip to content

Bypass Python loop in Indexlr #142

@mxwang66

Description

@mxwang66

Hi,

I'm the developer of Seqwin, and it depends on btllib to calculate minimizers and create minimizer graphs. We really appreciate the effort of developing and maintaining such an amazing toolkit!

To access minimizers in python, nested for loops are needed (e.g, for record in indexlr and for mx in record.minimizers). Since Seqwin could take thousands of microbial genomes as input, it could be much faster if the minimizers are packed in c++ and returned to python as a single array. For example

with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
    kmers, record_ids = assembly._minimizers_packed() # kmers: raw bytes; record_ids: python tuple
kmers = np.frombuffer(kmers, dtype=KMER_DTYPE) # a structured dtype including hashes and positions

To do this, I tried to add a new private method of Indexlr in wrappers/python/extra.i, with some custom Seqwin variables as input (see code below). I then rebuilt btllib following build.sh in the bioconda recipe. However, the new method was about 3x slower than the original python loops (the outputs were the same). So I was a bit confused and would like to seek for your advice.

Best,
Michael


In wrappers/python/extra.i, I first added a struct definition in the %{ %} block

%{
  using btllib::SpacedSeed;
  
  // a packed struct to match Seqwin's KMER_DTYPE (17 bytes)
  #pragma pack(push, 1)
  struct PackedMinimizer {
      uint64_t hash;        // 8 bytes
      uint32_t pos;         // 4 bytes
      uint16_t record_idx;  // 2 bytes
      uint16_t assembly_idx;// 2 bytes
      bool is_target;       // 1 byte
  };
  #pragma pack(pop)
%}

Then I added the new method for Indexlr.

%extend btllib::Indexlr {
  btllib::Indexlr* __enter__() {
    return $self;
  }

  void __exit__(PyObject*, PyObject*, PyObject*) {
    $self->close();
  }

  // --- NEW METHOD START ---
  // assembly_idx and is_target are assembly-specific variables used in Seqwin
  PyObject* _minimizers_packed(uint16_t assembly_idx, bool is_target) {
    std::vector<PackedMinimizer> buffer;
    std::vector<std::string> ids;

    uint16_t record_idx = 0;
    buffer.reserve(10000);

    for (const auto& record : *$self) {
        ids.push_back(record.id);

        if (!record.minimizers.empty()) {
            for (const auto& m : record.minimizers) {
                PackedMinimizer pm;
                pm.hash = m.out_hash;
                pm.pos = m.pos;
                pm.record_idx = record_idx;
                pm.assembly_idx = assembly_idx;
                pm.is_target = is_target;
                
                buffer.push_back(pm);
            }
        }
        record_idx++;
    }

    // create python bytes object
    PyObject* py_bytes = PyBytes_FromStringAndSize(
        reinterpret_cast<const char*>(buffer.data()), 
        buffer.size() * sizeof(PackedMinimizer)
    );

    // create python tuple for record IDs
    PyObject* py_ids = PyTuple_New(ids.size());
    for (size_t i = 0; i < ids.size(); ++i) {
        PyTuple_SetItem(py_ids, i, PyUnicode_FromString(ids[i].c_str()));
    }

    PyObject* result = PyTuple_New(2);
    PyTuple_SetItem(result, 0, py_bytes);
    PyTuple_SetItem(result, 1, py_ids);

    return result;
  }
  // --- NEW METHOD END ---
}

It built and ran without error. Below is the python part, and outputs were the same as the original python API.

KMER_DTYPE = np.dtype([
    ('hash', np.uint64), 
    ('pos', np.uint32), 
    ('record_idx', np.uint16), 
    ('assembly_idx', np.uint16), 
    ('is_target', np.bool_), 
# no padding to save memory
], align=False)

with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
    kmers, ids = assembly._minimizers_packed(
        assembly_idx, is_target # seqwin variables
    )

kmers = np.frombuffer(kmers, dtype=KMER_DTYPE)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions