SiWalk identifies effector siRNA positions within a given precursor sequence using a Random Forest classifier trained on structural and sequence features. It operates on sequence alone — no expression data required.
Please cite the paper for your work using this tool.
Title: Localization of Phased Effector siRNAs within Precursor Sequences Using Feature-Informed Machine Learning
Cite: Chao-Jung Wu, Abdoulaye Baniré Diallo
BibTex:
@article{... ...
}
This is the primary functionality of siWalk. Given a precursor RNA or DNA sequence, siWalk scans it to identify the most likely effector siRNA position.
Download siWalk from GitHub: github/bioinfoUQAM/siWalk.
Option A — Local machine:
Install ViennaRNA (provides RNAfold), then:
pip install numpy pandas scipy statsmodels scikit-learn matplotlib
chmod +x /path/siWalk/lib/miranda
Option B — HPC (Compute Canada):
cd /path/siWalk/workdir/
module load StdEnv/2020 viennarna/2.5.1 gcc/9.3.0
module load blast+/2.14.0 samtools/1.17 bowtie/1.3.0 scipy-stack
chmod +x ../lib/miranda
virtualenv --no-download env
source env/bin/activate
pip install --no-index --upgrade pip
pip install --no-index statsmodels sklearn
The script siWalk_predict_siRNA_location.py takes two arguments: a precursor sequence and DicerCall (the expected siRNA length, typically 21).
Example
cd /path/siWalk/src/
priseq=CTTGACCTTGTAAGGCCTTTTCTTGACCTTGTAAGACCCCATCTCTTTCTAAACGTTTTATTATTTTCTCGTTTTACAGATTCTATTCTATCTCTTCTCAATATAGAATAGATATCTATCT
DicerCall=21
python siWalk_predict_siRNA_location.py $priseq $DicerCall
Note: Both DNA and RNA sequences are accepted. For best performance, limit precursor sequences to ≤ 200 nucleotides.
The predicted position is printed to stdout:
yourPrecursor, predicted siRNA start: 75, end: 93, score: -154.733188
The stdout shows the top-ranked candidate. The full list of 6 candidates can be found in yourPrecursor.effector_localization_top6_recommendation.tsv.
Output files are written to /path/siWalk/output/<timestamp>/:
| File | Description |
|---|---|
yourPrecursor.effector_localization_indication.tsv |
Scores for all candidate positions |
yourPrecursor.effector_localization_top6_recommendation.tsv |
Top 6 ranked siRNA candidates |
yourPrecursor.effector_localization_indication.png |
Graphical summary of candidate scores |
The top-6 recommendation file has the following columns:
| Column | Description |
|---|---|
| Position p | Candidate start position on the precursor |
| Start S(p) | Start-position score (sum of weighted indications) |
| End E(p) | End-position score |
| Sum of Indications | Combined score (higher is better) |
| Best length | Predicted siRNA length (nt) |
| End position | Predicted end position |
| ML predicted | Whether this position was flagged by the ML classifier |
Languages: python/3.10.2 (scipy, statsmodels, sklearn, numpy)
Packages: spark/3.3.0, bowtie/1.3.0, samtools/1.17, viennarna/2.5.1, perl/5.30.1, blast+/2.14.0
Distributed with siWalk: miRanda, miRCheck
Environments: The localization mode (Quick Start above) runs on any local machine or login node. Modules A and X (precursor prediction and model training) require a Spark-enabled server (developed on Compute Canada Narval).
| File | Description |
|---|---|
dbs/demoOnly_background.tsv |
Small subset of the training set for quick testing |
dbs/background.tsv |
Full Arabidopsis training set (185,539 segment samples); download from Zenodo (see below) |
dbs/TAIR10_genomic.fa |
Arabidopsis genome; download from Zenodo (see below) |
dbs/mature.fa |
miRNAs from miRBase v22.1 |
dbs/GSM1087998_C0RC1.txt |
Example sRNA-seq library in readcount format (GSE44622) |
dbs/feature_definition.txt |
Definitions of all variables in the analysis |
dbs/complete_feature_list.tsv |
Full feature list for reference |
model/GBA100.pkl |
Gradient Boosting + AdaBoost model, top 100 all-type features |
model/GBAs100.pkl |
Gradient Boosting + AdaBoost model, top 100 structural features |
model/RFAs100.pkl |
Random Forest + AdaBoost model, top 100 structural features; download from Zenodo (see below) |
model/Arabidopsis_feature_importance_n_correlation.tsv |
Feature evaluation file paired with GBA100.pkl |
model/Arabidopsis_structure_feature_importance_n_correlation.tsv |
Feature evaluation file paired with GBAs100.pkl and RFAs100.pkl |
The following large files can be downloaded from Zenodo and placed as indicated:
curl -L -O "https://zenodo.org/records/17258908/files/TAIR10_genomic.fa?download=1" # → dbs/TAIR10/
curl -L -O "https://zenodo.org/records/17258908/files/background.tsv?download=1" # → dbs/
curl -L -O "https://zenodo.org/records/17258908/files/RFAs100.pkl?download=1" # → model/The following modules require a Spark-enabled HPC environment.
Alternatively to downloading, generate it locally (~30 minutes):
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv
python siWalk_pickle_localization.py $annotated_training_file
Rename the resulting file to RFAs100.pkl and move it to /path/siWalk/model/.
Align an sRNA-seq library to the genome using bowtie:
genome_file=/path/TAIR10_genomic.fa
index_prefix=/path/bowtie_index/TAIR10_genomic
bowtie-build -f --quiet $genome_file $index_prefix
samtools faidx $genome_file
path_siWalk=/path/siWalk
filename=GSM1087998_C0RC1
infile=$path_siWalk/dbs/$filename.txt # in readcount format
raw=$path_siWalk/input/$filename.raw
python $path_siWalk/src/rc2raw.py $infile > $raw
samfile=$path_siWalk/input/$filename.sam
bamfile=$path_siWalk/input/sorted_$filename.bam
bowtie -r $raw -x $index_prefix --threads 64 -v 0 --mm -a --sam --no-unal > $samfile
samtools sort $samfile > $bamfile
samtools index $bamfile
Edit submit_siWalk_precursor_features.sh to set:
INPUT_SAM_PATH=/path/siWalk/input
genome_file=/path/TAIR10_genomic.fa
Submit the job from the workdir folder:
cd /path/siWalk/workdir/
sbatch submit_siWalk_precursor_features.sh
Output files will be available in /path/siWalk/output/$jobID/$filename/.
The main output is the annotation file: $filename.contig_features.tsv.
cd /path/siWalk/src/
annotation_file=/path/siWalk/output/$jobID/$filename/$filename.contig_features.tsv
python siWalk_classify_precursors.py $annotation_file
The output prefix.prediction.tsv includes these key columns:
| Column | Description |
|---|---|
| CONTIG | 5250-nt genomic region (e.g., 3__5860000_5865250) |
| eff_strand | Strand of the most abundant effector siRNA |
| eff_seq | Sequence of the most abundant effector siRNA |
| segment | Predicted start and end of the siRNA-generating locus |
| Predicted_Class | True if predicted as siRNA precursor, False otherwise |
- Compute features for each precursor candidate using Module A.
- Add a
consistentcolumn:Truefor confirmed precursors,Falsefor non-precursors. - The resulting file is the
annotated_training_filefor the next step.
Top 100 all-type features (produces the GBA100-type model, for precursor prediction only):
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv
python siWalk_pickle_precursor.py $annotated_training_file
Top 100 structural features (produces the RFAs100-type model, suitable for both precursor and effector prediction):
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv
python siWalk_pickle_localization.py $annotated_training_file
Each run produces:
output/prefix_*_model.pkl— trained model with timestampoutput/prefix_*_feature_importance_n_correlation.tsv— ranked feature importances and correlations