Skip to content

Aho corasick#102

Closed
pscheil wants to merge 10 commits intocbielow:developfrom
pscheil:AhoCorasick
Closed

Aho corasick#102
pscheil wants to merge 10 commits intocbielow:developfrom
pscheil:AhoCorasick

Conversation

@pscheil
Copy link
Copy Markdown

@pscheil pscheil commented Aug 27, 2020

Description

Please include a summary of the change and which issue is fixed.

Also please:

  • Make sure that you are listed in the AUTHORS file
  • Add relevant changes and new features to the CHANGELOG file

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes

Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
{
UInt32 base :22 ; // base value of DA
UInt32 lcheck :8 ; // arc label from parent node
UInt32 leaf_flag :1 ; // node is leaf
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one could play around with the order of the members here, and see what performs the best (since extraction of a single bit is not cheap) and extraction of lcheck could be cheaper if moved to the back...
(keep in mind for later maybe)

Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Copy link
Copy Markdown
Owner

@cbielow cbielow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not get to the end. will finish later :)

Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
{
bc_[node_pos].leaf_flag = 1;
bc_[node_pos].base = static_cast<UInt32>(tail_.size());
storeInTail_(sequences_[unique_idx_[begin]].substr(depth));
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substr() is expensive.
try a StringView

Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h Outdated
bool failure_();

/**
* @brief returns the supply link if this does not eliminate ambiguous aa. Don't do anything special with 0. Must be considered when calling
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is only useful for spawns, i.e. when dealing with ambAA's, right?

So should there be a sibling function
void followSupplyLink(Int32 from, Int32& to)?

Can you also make input=output, i.e. merge the first two parameters?

Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp Outdated
Comment thread src/openms/include/OpenMS/ANALYSIS/ID/AhoCorasickDA.h
{
Int32 node; ///< Node after the ambiguous aa transition
UInt32 prot_pos; ///< Protein position after the ambiguous aa
uint8_t counter; ///< Count of all read ambiguous aa
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, this is the number of aa's in the path from root to 'node'?

UInt32 aa_count_ = 0;

/// Indices of the sorted sequences, duplicate sequences are removed
std::vector<UInt32> unique_idx_{};
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unique_sorted_idx_ ??

void buildTrie_(Size begin, const Size end, const Size depth, const UInt32 node_pos);

/// returns base value
UInt32 xCheck_();
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on what information?

OPENMS_LOG_INFO << "Mapping " << pep_DB.size() << " peptides to " << (proteins.size() == PROTEIN_CACHE_SIZE ? "? (unknown number of)" : String(proteins.size())) << " proteins." << std::endl;

if (length(pep_DB) == 0)
if (pep_DB.size() == 0)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep_DB.empty() ?

AhoCorasickDA fuzzyAC(pep_DB);
auto s2 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double,std::milli> elapsed = s2 - s1;
std::cout << "Construction time " << elapsed.count() << " ms"<< std::endl;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the extra chrono?
StopWatch is a high-resolution clock
(just don't convert to int() as done below to retain its accuracy)

// CDA
//-------------------------------------------------------------------------------------------------------------------------------------

UInt32 AhoCorasickDA::stat()
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function can be const?

Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp
Comment thread src/openms/source/ANALYSIS/ID/AhoCorasickDA.cpp

#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>
//#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>
#include <OpenMS/ANALYSIS/ID/AhoCorasickDA.h>
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you get rid of the seqan includes?

namespace OpenMS
{

/**
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe briefly some of the special tweaks here?
e.g. duplicate sequences are handled externally; how internal end nodes are handled; how spawns are processed (at what time and in what order), .... so the user gets a rough idea of what to expect here

struct BCNode_
{
UInt32 base :22 ; ///< base value of the node
UInt32 lcheck :8 ; ///< arc label from parent node. Only code of the aa is stored
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be faster if you switch the order of base and lcheck because that should(!) save some bitshifting when accessing lcheck .. but it might not help....
don't redo the benchmarks!

@github-actions github-actions Bot added the Stale label Apr 2, 2026
@github-actions github-actions Bot closed this Apr 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants