---
title: "Scripts and Oneliners explained"
author: "Aimer G. Diaz"
output: github_document
---
<!--- Central Folder
Github extras
https://github.com/gayanvoice/github-profile-views-counter
GitHub-flavored markdown syntax https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
Syntax for R markdowns in general https://bookdown.org/yihui/rmarkdown-cookbook/raw-latex.html
https://bookdown.org/yihui/rmarkdown/language-engines.html
On licensing https://gist.github.com/lukas-h/2a5d00690736b4c3a7ba
For when I release the packages, both the fragment-detection one and the library-reduction script
https://vlado.ca/blog/perl-app.html
https://docs.github.com/en/enterprise-server@2.22/packages/quickstart
https://github.com/LorenzoTa/step-by-step-tutorial-on-perl-module-creation-with-tests-and-git/blob/master/tutorial/tutorial-english.md
https://fpm.readthedocs.io/en/v1.13.1/intro.html
https://github.com/jordansissel/fpm/pull/876
GitHub pulling, pushing and diffs https://github.blog/2011-10-21-github-secrets/
SQL too https://www.red-gate.com/hub/product-learning/sql-source-control/github-and-sql-source-control
Wikis on GitHub https://guides.github.com/features/wikis/
--->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#devtools::install_github("hadley/emo")
```
Welcome to my code repository. Most or all of it I've learned in a very inclusive sense of the word (including copying from internet forums), and following this beautiful logic of the public domain I've made these notes available. The idea is to have a server-independent repository, a repository on the cloud, or should I say, on the bottom of the ocean, with several copies in several sites of the world; for today and maybe, just maybe, even for a post-global-ecological-disaster era, available physically if not online.
I will try to describe here the structure of all the folders of this repository; each folder will also carry more information in its own R markdown. The idea of this top-level file is to describe meta-folder features, like links indicating where the same code is used in exactly the same way, or where I changed it for a specific project (all of them in bioinformatics next-generation data analysis, my main expertise).
As an example, a very basic but useful script I wrote in bash and use for every project is one I call `sra_download.sh`, which takes a given list (or, even better, the SRA Run Selector table from the Sequence Read Archive (SRA)) and downloads each accession, distinguishing, if the input table indicates it, whether the fastq file is single- or paired-end. Let's explore the latest data set I was working with when I wrote this document, from the [SRA site](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP190362&o=acc_s%3Aa):
SRA_Code | Sequence_Tech | Cell_location_stage | Strain | Total_Sequences | Length | Source | ServerName
--- | --- | --- | --- | --- | --- | --- | ---
SRR8859642 | PAIR END | Procyclic | EATRO 1125 | 85,827,714 | 125/125 | Cooper2019_gRNA | Cooper2019_gRNA_E1125PC
SRR8859640 | PAIR END | Bloodstream | EATRO 1125 | 79,132,567 | 125/125 | Cooper2019_gRNA | Cooper2019_gRNA_E1125BS
SRR8859641 | PAIR END | Bloodstream + Procyclic MIX | EATRO 1125 | 148,935,513 | 125/125 | Cooper2019_gRNA | Cooper2019_gRNA_E1125Mix
The first version of the file is [on the master thesis codes folder](Total_processing/Download_script.sh); however, this version did not have the option to download the SRA data according to whether it's single- or paired-end. The latest version does, and it's in the [Trypanosoma repository](https://github.com/AimerGDiaz/Trypanosomes_RNA_editing).
***
# Codes repository
To start, I will comment here on sections of the codes and the way I apply them to solve specific problems. In other words, what you will find here is a selection of commented and finished codes in bash, awk, perl, python, R and LaTeX, plus R markdown files explaining these codes using small toy data sets. Finally, in this document you can also find useful one-liners, also commented with toy examples.
Let's start with the one-liners. It's true they are not always the best solution; however, they are fast, they spare us from writing or re-adapting whole blocks of code for particular problems, they are modular, and somehow they are like culture: whether or not their use is recommended, some people love them, and I am one of those. I will introduce them first, because I always start with them to explore the original data, already-made analyses, new formats, supplementary material, etc.
## Awk
Awk codes are my favorite. The one-liners are extremely useful, fast, weird but incredible; they can also be adapted into bash scripts and even used in parallel computing with the amazing GNU parallel software with minor modifications, as I will show here. But let's start: the selected awk codes and toy data samples are [here](AWK/README.md).
### Generalities of my awk codes
All the languages I've mentioned will follow the same structure in this section: main applications of the main control structures (`for`, `while` and `if`), combinations of control structures, and my most-used codes so far. Before entering the control structures, there are awk-specific features important to highlight:
#### Printing header line -Begin command-:
Generating a header row, while modifying the content that follows, for instance, by substituting an element of a field:
```{bash eval=FALSE}
awk -v OFS="\t" -F'\t' 'NR==1 {print "ID","NGenes","Genes"} {split($2,a," "); gsub(" ",",", $2);print $1,length(a),$2}'
```
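A self-contained toy run of this one-liner (the single input line is a hypothetical cluster table with space-separated genes in the second field):

```{bash}
printf 'cluster1\tgeneA geneB geneC\n' \
  | awk -v OFS="\t" -F'\t' 'NR==1 {print "ID","NGenes","Genes"} {split($2,a," "); gsub(" ",",", $2);print $1,length(a),$2}'
# ID        NGenes  Genes
# cluster1  3       geneA,geneB,geneC
```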
In case there is already a header line, a new one can be generated as follows:
```{bash eval=FALSE}
awk -F',' -v n=1 -v OFS="\t" 'NR==1 {print "#chrom","pos","rsid","ref","alt","neg_log_pvalue","maf","score","alt_allele_freq" } NR>2{print "Chr"$1,$2,"rs"n,"A","T","0.01",$3,$4,"."} ' GWAS_CaMV.csv
```
Note that `NR` counts input lines, not printed ones: `NR>1` is enough to skip a single original header, while `NR>2` here also skips the second input line of this particular file.
#### Reducing multiple lines into a single line of analysis -getline command-:
Awk does not only work line by line; it can also treat as a single unit of analysis formats that dedicate a fixed set of lines to each element, for instance 2 lines describing a single sequence in fasta format or 4 lines per read in a fastq file. How do we integrate those lines into a single unit of analysis? The awk `getline` function is a fast way to do it, meaning very simple and useful awk one-liners can work on files like these.
First example, a fastq-to-fasta converter using the next one-liner:
```{bash eval=FALSE}
awk ' BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; sub(/^@/, "", header) ; print ">"header,seq}' $1 > $name.fa
```
A very similar task can be achieved with a derived code, such as changing the headers of a multi-fasta file to a shorter version, or simply adding the sequence length for blast output filters, which can be done with:
```{bash eval=FALSE}
awk ' BEGIN {OFS = "\n"} {header = $0 ; getline seq ; gsub(/;Ant.*/,"",header); header=header";length="length(seq); print header,seq}' $1 > $name.fa
```
This code is integrated into a suffix-changing function in the script called [fq2fa.sh](AWK/fq2fa.sh). The `getline` function must be called as many times as the number of lines we want to integrate into a single unit of analysis.
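A quick toy check of the fastq-to-fasta conversion on a hypothetical two-read file (the `sub` call strips the leading `@` so the fasta header starts with `>` only):

```{bash}
printf '@seq1\nACGT\n+\nIIII\n@seq2\nTTGG\n+\nIIII\n' > toy.fastq
awk ' BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; sub(/^@/, "", header) ; print ">"header,seq}' toy.fastq
rm -f toy.fastq
# >seq1
# ACGT
# >seq2
# TTGG
```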
#### Focus on user-defined lines: the `NR%` operator
Individual read length and total quantity of nucleotides:
```{bash, eval=FALSE}
echo -ne "@seq1\nACGAGGAGGACGTACGTAGCTAGCGAGCGATCTATCGATGCTAC\n+\nACGAGGAGGACGTACGTAGCTAGCGAGCGATCTATCGATGCTAC\n@seq2\nACGAGCGAGCATCGAGCTAGCTGACTAGCTTAGCTATCG\n+\nnCGAGCGAGCATCGAGCTAGCTGACTAGCTTAGCTATCG\n" > file.fastq
awk '{ if(NR%4==2){print length($0);} }' file.fastq
awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($0);}}END{print sum;}' file.fastq
```
***
#### Changing delimiters, regex in awk 101:
A related issue is the case when you want to change the field delimiter of certain lines, but not all of them. A typical bioinformatics example of this task is transforming a multi-fasta file, with sequences wrapped over multiple lines, into a single one-line sequence per feature; graphically:
From
```
>1
ACCCAGAGAGTGAG
ACCAACACACAGTT
```
To
```
>1
ACCCAGAGAGTGAGACCAACACACAGTT
```
To do this we can use a regex integrated with awk's pattern-action structure:
```{bash eval=FALSE}
awk '/^>/{printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' $1 | grep -v "^$" > $1.ol.fa
```
In this awk one-liner, we first select the lines that start with > (`/^>/`); everything in between is concatenated, separated only by one line break per `^>` line found. This code is called [oneliner.awk](AWK/oneliner.awk)
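A toy run on a hypothetical two-sequence fasta shows the wrapped lines collapsing into one line per sequence:

```{bash}
printf '>1\nACCCAG\nAGTGAG\n>2\nTTTT\n' \
  | awk '/^>/{printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' \
  | grep -v "^$"
# >1
# ACCCAGAGTGAG
# >2
# TTTT
```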
#### Reading comma-separated files: the -F option and gsub command
There is no better, faster, more time-saving field separator than awk's `-F` option. Let's start with a common issue when analyzing everyday bioinformatics files such as CSVs: haven't you faced those annoying CSV files with free text in a field that itself contains commas? Usually like this:
The NCBI Gene Expression Omnibus datasets have this problem, and awk has the solution:
```{bash, eval=FALSE}
awk -F',' 'OFS="\t"{gsub("\"","",$0);gsub(" |, ","_",$0);print $1,$2,$4,$8}' $1
```
```{bash, eval=TRUE, include=TRUE}
cat AWK/sentence.temp
cat AWK/separator.awk
#Run in terminal
# dos2unix AWK/separator.awk
bash AWK/separator.awk AWK/sentence.temp "gsub"
```
Another great solution uses the awk `-F` option itself, which allows us to split fields using a regex, in this case:
```{bash, eval=FALSE}
awk -F'(,"|",)' 'OFS="\t"{gsub(" ","_",$0); print $1,$2,$4,$8}' $1
```
A more complicated case comes from SraRunTable.txt formatting, which might mix comma-separated fields with fields lacking the \" demarcation. An annoying but workable solution is to homogenize the header and the offending fields; here the original file now includes \" in the header, so it can be run with the same syntax:
```{bash, eval=TRUE, include=TRUE}
head -n 1 AWK/SraRunTable_edited.txt
bash AWK/separator.awk AWK/SraRunTable_edited.txt "SRA"
```
For this situation, a double separation using both of the previous criteria does the job.
#### Text reformatting, or the search-and-save pattern //{}
The previously detailed `getline` command, applied to structured formats, can help reformat complex files into simpler single-line formats. However, if the text does not have a regular, constant number of lines per object, this approach is no longer suitable. As an alternative, awk can save values on a regex match, using conserved features of each entry as an organizing anchor. An example is the messy tasiRNA file from [Small RNAs Carrington](ftp://ftp.arabidopsis.org/Genes/SmallRNAsCarrington/Carrington_MIRNA_TAS_data.xls) :
```{bash, eval=TRUE, include=TRUE}
grep -n -B 1 -A 20 atTAS3a AWK/tasiRNA_generating_loci.txt
```
As the sequence length changes depending on each locus, an approach to extract each row of interest using awk is:
```{bash, eval=FALSE, include=TRUE}
awk '/^Discription:/{d=$2};/^Coordinate/{c=$2};/^TAS region:/{s=$3}/^TAR/ {print d,c,s}' AWK/tasiRNA_generating_loci.txt | grep atTAS3a
```
```{bash, eval=TRUE}
bash AWK/reformatting.sh AWK/tasiRNA_generating_loci.txt | grep atTAS3a
```
### Substrings
The `substr` function extracts part of a string by position; for instance, to keep only the last two amino acids of each sequence that ends in Y followed by any other residue:
```{bash, eval=FALSE}
awk ' BEGIN {OFS = "\n"} {header = $0 ; getline seq ; if (seq ~ /Y[A-Z]$/){ taa=substr(seq,length(seq)-1,length(seq)) ; print header"\n"taa}}' multi.fa > 2aa.fa
```
### If loops
#### Conditional evaluation
Using the same example as before, this time without modifying the header of the original SRA file, but using an `if` structure to process the first line differently from the rest of the text:
```{bash, eval=FALSE}
awk -F',' -v OFS="," '{if (NR == "1") {$10=$11=$12=""; print } else {gsub("ncRNA-Seq","smallRNA-Seq",$2);print }}'
```
```{bash, eval=TRUE, include=TRUE}
bash AWK/separator.awk AWK/SraRunTable.txt "IF"
```
This code has an additional unneeded feature, the deletion of three fields, but it serves to illustrate differential processing depending on an `if` condition.
#### IF in arrays as a classifier:
One of my favorite codes makes classification tasks extremely fast with just a handful of commands; sadly, the economy of the code makes it a little obscure, especially for awk beginners.
<details>
<summary>Let's see a toy example:</summary>
```{bash, include=FALSE}
rm -f AWK/td_Gene_duplication_per.txt
```
```{bash}
#We can create a toy data (td) set with a chromosome location of a given gene
#Chromosome #Duplicated Gene
#chr1 geneA
#chr2 geneB
#chr3 geneA
echo -ne "Chromosome\tDuplicated Gene\nchr1\tgeneA\nchr2\tgeneB\nchr3\tgeneA\n" > AWK/td_Gene_duplication_per.txt
```
</details>
Sadly, although an awk engine seems to be implemented for R markdown, it requires additional effort to make it work; here is the code for how to run awk using [knitr](https://github.com/yihui/knitr-examples/blob/master/024-engine-awk.Rmd), and here the [R markdown output](https://github.com/yihui/knitr-examples/blob/master/024-engine-awk.md) of that code. However, after several attempts I could not make awk work here. Anyway, awk is available as a command in bash, so we can write the awk command as an awk script and execute it with bash.
<details>
<summary>The code is</summary>
```{bash}
head -n 25 AWK/classificator.awk
```
</details>
<details>
<summary>The output of this script</summary>
```{bash}
bash AWK/classificator.awk AWK/td_Gene_duplication_per.txt
```
</details>
<details>
<summary> What happen if we include more genes ? </summary>
```{bash}
echo -ne "chr4\tgeneA\n" >> AWK/td_Gene_duplication_per.txt
bash AWK/classificator.awk AWK/td_Gene_duplication_per.txt
```
</details>
Now let's see this code applied to real-world problems, at least the way I have used it; one example is here
PRINT here how it work to make quick table (Mirnomics project) or for multi-fasta file duplication cleaning (gRNAs_total.fa)
DEVELOP HERE
***
#### IF and arrays as de-duplication control structures:
Sum a particular column
```{bash}
echo -ne "reads\n12\n123\n213\n12\n" > AWK/extra_column.txt
paste AWK/td_Gene_duplication_per.txt AWK/extra_column.txt > AWK/td_Gene_duplication_values.txt
rm AWK/extra_column.txt
head AWK/td_Gene_duplication_values.txt
awk '{sum+=$3} END {print sum}' AWK/td_Gene_duplication_values.txt
```
Sum entries belonging to the same factor
```{bash}
echo -ne "chr1\tgeneA\t15\n" >> AWK/td_Gene_duplication_values.txt
sort -k 1 AWK/td_Gene_duplication_values.txt
echo -ne "\n\n"
awk 'NR>1{sum[$1"\t"$2]+=$3} END {for (gene in sum) print gene, sum[gene] }' AWK/td_Gene_duplication_values.txt | sort -k 1
```
Sum only non-duplicated entries
```{bash}
awk 'NR>1 && !sum[$2]++' AWK/td_Gene_duplication_values.txt | awk 'NR>1{sum[$1"\t"$2]+=$3} END {for (gene in sum) print gene, sum[gene] }'
```
```{bash, include=FALSE}
rm -f AWK/td_Gene_duplication_per.txt
```
### For and While
Sum all columns
```{bash, eval=FALSE}
awk 'BEGIN{FS=OFS=","}
NR==1{print}
NR>1{for (i=1;i<=NF;i++) a[i]+=$i}
END{for (i=1;i<=NF;i++) printf a[i] OFS; printf "\n"}' file
```
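A toy run of the same idea, slightly tweaked to save `NF` into a variable (its value inside `END` is implementation-dependent) and to avoid the trailing separator:

```{bash}
printf 'a,b\n1,2\n3,4\n' | awk '
  BEGIN{FS=OFS=","}
  NR==1{print; next}
  {n=NF; for (i=1;i<=NF;i++) a[i]+=$i}
  END{out=a[1]; for (i=2;i<=n;i++) out=out OFS a[i]; print out}'
# a,b
# 4,6
```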
### Changing the file format
A useful way to change the file format while printing all columns of a file recursively: the next code takes a comma-separated file, filters by the padj-value field, and then prints the selected rows tab-separated
```{bash, eval=FALSE}
awk -F',' -v OFS="\t" '$8 < 0.05 {out=$1; for (i=2;i<=NF;i++) out= out "\t" $i; print out}'
```
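A toy check with two hypothetical rows, where only the row with padj below 0.05 survives and comes out tab-separated:

```{bash}
printf 'geneA,5.1,0.2,1,1,1,1,0.01\ngeneB,2.3,0.1,1,1,1,1,0.20\n' \
  | awk -F',' -v OFS="\t" '$8 < 0.05 {out=$1; for (i=2;i<=NF;i++) out= out "\t" $i; print out}'
# geneA   5.1   0.2   1   1   1   1   0.01
```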
#### From blast output to bed file
Awk is a powerful text processor, and there is hardly a more frequent task in bioinformatics than cleaning blast results. It's important to know which fields you want to use for text parsing; a practical way to get this information is:
```{bash, eval=FALSE}
head -n 1 <FILE> | tr '\t' '\n' | awk '{print NR"\t"$0}'
```
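For instance, on a hypothetical three-column header this prints each field name next to its number:

```{bash}
printf 'qseqid\tsseqid\tpident\n' | head -n 1 | tr '\t' '\n' | awk '{print NR"\t"$0}'
# 1   qseqid
# 2   sseqid
# 3   pident
```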
I usually work with a modified version of output format 6, declaring the names of all the fields I'm interested in and including `nident`, the number of identical matches:
```{bash, eval=FALSE}
blastn -query <QUERY> -db <SUBJECT> -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen nident" -perc_identity 10 -task blastn -word_size 10 -out
```
If there is no interest in filtering the blast result, and it will be reused as an annotation file, the next command is quite useful
```{bash, eval=TRUE}
awk -v i=1 'OFS="\t"{if ($9 < $10) {sense="+"; S=$9} else {sense="-";S=$10}; if ($9 < $10) E=$10; else E=$9; print $2,S,E,"viral2TAIR_hit_"i,sense, E-S , $1, "id:"$3";aln:"$4";nind:"$15 ; i++ }' collection2TAIR.out | sort -k1,1 -k2,2n > collection2TAIR.bed
```
<!--
Annotation of sequences hits
intersectBed -wo -a ../../Arabidopsis/TAIR10.1_NCBI.gff3 -b collection2TAIR.bed | awk '$3 !~ /^region$/' | cut -f 3 | sort | uniq -c
Genes of interest
intersectBed -wo -a ../../Arabidopsis/TAIR10.1_NCBI.gff3 -b collection2TAIR.bed | awk '$3 ~ /^CDS$/' | egrep -i "AT1G19120|AT3G14080|AT1G08370|AT5G13570|AT1G26110|AT5G45330|AT4G19360|AT3G13300|AT3G13290|AT1G54490|AT5G47010|AT5G23570|AT1G09700|AT3G62800|AT3G26932|AT5G41070|AT2G28380|AT1G01760|AT1G14790|AT4G11130|AT3G49500|AT2G19910|AT2G19930|AT2G19920|AT1G01040|AT3G03300|AT3G43920|AT5G20320|AT4G15417|AT3G20420|AT1G48410|AT5G43810|AT2G27880|AT2G32940|AT5G21150|AT1G69440|AT2G27040|AT5G21030|AT1G31280|AT1G31290|AT1G74700 |AT2G04530|AT1G52160|AT3G16260|AT2G02990|AT1G14220|AT1G26820 |AT2G39780 |AT1G14210 |AT5G17290|AT5G45900|AT2G45170|AT4G24690|AT1G28470|AT1G06670|AT1G05660|AT1G01800|AT1G78390|AT4G13000|AT4G13130|AT4G13180|AT5G49130|AT5G42325|AT1G17910|AT1G62670|AT3G42830|AT3G57157|AT4G17780|AT4G17760|AT5G59120|AT5G43840|AT1G58602|AT4G18350|AT3G14440|AT1G30100|AT1G23070|AT5G48650|AT1G13730|AT1G69250|AT2G03640|AT3G07250|AT3G25150|AT5G43960|AT5G60980|AT3G55540|AT1G34140|AT4G34110|AT1G22760|AT2G23350|AT1G71770|AT3G16380|AT2G36660|AT1G49760|AT5G54900|AT1G11650|AT4G27000|AT1G49600|AT3G19130|AT1G47490|AT1G47500|AT1G54080|AT1G17370|AT3G14100|AT1G54270|AT3G13920|AT1G72730|AT3G19760|AT3G61240|AT4G00660|AT2G45810|AT3G58570|AT1G16280|AT4G09730|AT4G15850|AT3G11400|AT5G06000|AT3G60240|AT4G18040|AT5G07350|AT5G61780|AT2G25900|AT2G19810|AT3G02830|AT2G47850|AT5G16540|AT3G06410|AT1G04990|AT5G18550|AT3G48440|AT2G32930|AT1G01510|AT1G64720|AT5G02500|AT5G20700|AT2G39730|AT1G59870|AT3G18780|AT2G30520|AT1G20620|AT5G13630|AT2G21330|AT3G54890|AT1G29930|AT2G34430|AT1G56070|AT1G10170|AT5G46470|AT5G19400|AT2G27400|AT1G50055|AT2G39675|AT2G39681" | grep -oP "TAIR:.*,"
-->
<!-- POSSIBLE awk commands forgotten
/mnt/g/My\ Drive/Bioinformatica/0_Tesis\ Maestria/Code/Analysis_miR/Create_and_filter_anotation_file.txt
/mnt/g/My\ Drive/Bioinformatica/0_Tesis\
./Code/Blockbuster_parsearch/Clust2Block/No_blocks_loci/blockbuster_test.txt
./EGlab_Code/Trimmomatic/trimmomatic_behaviour.txt
/mnt/g/My\ Drive/Bioinformatica/0_Tesis\ Maestria/Precise_counting/Readme.txt
-->
***
* While: being honest, I have never used a while loop in awk. But I do use a while structure in bash followed by an awk command, that is, awk as a line-by-line variable-mining tool, which uses the next code structure:
```{bash}
echo "This a toy example; with a semicolon separator structure; and I want to show you; hidden variable X
that using bash while loop; and awk -F separator command; I can extract pf each line the; hidden variable Y" > toy_structure.txt
head toy_structure.txt
```
```{bash}
counter=1 ### Bash counter - reference here for access from other sites
while read line
do
variable=`echo $line | awk -F'; hidden variable' '{print $2}'`
echo this is the value of the hidden variable: "$variable" on the line: "$counter"
let counter=counter+1
done < toy_structure.txt
rm -f toy_structure.txt
```
A real-world problem where I use awk inside a bash while loop is a fastqc quality-information mining script, extracted from the [NGS_statistics.sh script](Total_processing/NGS_statistics.sh):
<details>
<summary> Non-executable script </summary>
```
while read library
do
FILE="$1"/Fastq/Quality/$library/fastqc_data.txt
type_file=`echo $library | egrep -o "_fastqc|_reduced_cor_fastqc|.reduced_fastqc|_trim_fastqc"`
name=`echo $library | awk -v sus="$type_file" '{gsub(sus,"",$0);print $0}'`
if [ "$type_file" == "_fastqc" ]
then
lib_type="raw"
elif [ "$type_file" == "_reduced_cor_fastqc" ]
then
lib_type="ms_reduced"
elif [ "$type_file" == ".reduced_fastqc" ]
then
lib_type="reduced"
elif [ "$type_file" == "_trim_fastqc" ]
then
lib_type="tmm_reduced"
fi
awk '/^Total Sequences/{ts=$3} /^Sequence length/{print "'$name'" "\t" "'$lib_type'" "\t" $3 "\t" ts}' $FILE >> $1"Results/Statistics/Total_size.txt"
done < dir_list.txt
```
</details>
***
* For loops in awk are mainly array handlers:
DEVELOP HERE
***
## Bash
Here, as with awk, there are many commands that, used cleverly, can get you results with a single one-liner, or can post-process, clean, and adjust inputs/outputs quickly, or change formats to feed more complicated programs. Bash is, for me, as you saw previously, the main executor; for other people it could be another shell, or even no command line at all, running everything in graphical environments, but with VIM + Bash + awk + perl you could do virtually everything in structured programming. Before getting into the details of how I use recurrent blocks of code, I will introduce a unique aspect of bash:
### Generalities of my bash codes
Bash, like awk, is quite flexible; however, most of the time I use the three control structures `for`, `while` and `if/then/else` for data organization, classification, text processing and quality control. But before entering each control structure, let's talk about some other features of bash beyond the control statements themselves.
#### The sort command for bioinformatics formats
The unix sort command is a viable and quite frequent way to organize annotation files in bed or gff formats:
1. For bed files :
```
sort -k1,1 -k2,2n
```
2. For gff files
```
(grep ^"#" <gff file> ; grep -v ^"#" <gff file> | grep -v "^$" | grep "\t" | sort -k1,1 -k4,4n)
```
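A quick toy check of the bed sort (hypothetical intervals; chromosomes sort lexically, then start positions numerically):

```{bash}
printf 'chr2\t50\t60\nchr1\t200\t300\nchr1\t30\t40\n' | sort -k1,1 -k2,2n
# chr1  30   40
# chr1  200  300
# chr2  50   60
```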
#### Piping in bash: beyond "|"
Bash is awesome, especially when we talk about pipes and immediate input/output redirection. For beginners, however, the main piping command is usually "|", which limits piping to commands that tolerate input redirection in that way. There is software that cannot be piped with "|" but accepts "-" for input file declaration, especially programs that require a -i parameter to indicate the input file; the piping idiom for such programs is " | software -i - ".
DEVELOP HERE - example of - for piping
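As a minimal sketch: `paste` treats a file argument of `-` as stdin, and the commented line shows the same idiom with a hypothetical aligner piped into samtools (assuming samtools is installed; not run here):

```{bash}
# paste reads the "file" named "-" from stdin
seq 3 | paste -sd, -
# 1,2,3
# The same idiom for tools that want a file argument (hypothetical, assumes samtools):
# bwa mem ref.fa reads.fq | samtools view -b - > aln.bam
```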
A less usual piping construct is the combined operator "<( code here )", which can be read as "take the output of the command between the parentheses as an input file". This structure can be used to expand the capabilities of the grep command when grepping a list of terms (grep -f). Using the -f parameter in a grep search, we cannot add the single-pattern options also available as parameters, for instance searching for each listed pattern at the beginning of a line: `grep -f "^" `; however, using `<()` this task is possible. Let's explore a real-world problem.
In a file of small RNAs that guide genes through the mRNA editing process (U deletion or insertion), each line represents a guide RNA; however, some of them have a complex gene-target assignment, for instance:
"Unknown_RPS12"
Here the first category means the guide RNA got a gene target only after a secondary process, while a gRNA without a gene target even after applying the secondary process could still lack a gene assignment. Additional patterns like RPS12_Unknown_RPS12 make the use of auxiliary awk syntax even harder. Therefore, to quantify the exact number of gRNAs with purely Unknown gene targets, we can use the `<()` syntax as follows:
First, generate the pattern to search using single-line grep regular expressions such as "^" or "$", for instance starting with a file like this:
```{bash}
head -n 1 BASH/grep_lists_example.txt
```
Each line represents a single ID. As this pattern may be placed in more than one column of the file we want to search, an exact grep (-w) search would not be enough; the file to search looks like:
```{bash}
head -n 1 BASH/file2search.txt
```
The pattern we are interested in is the first field of this comma-separated file; as it's numerical, it might occur in many places, so what we need to search for is a list of IDs that start exactly with that number and end with a comma. We can create a list of IDs with these features:
```{bash}
bash AWK/printBeginningEnd.awk BASH/grep_lists_example.txt > BASH/tempids.sh
head -n 1 BASH/tempids.sh
```
With awk we have basically created a script that writes another script, which tells bash how to print the code; if we execute this code it will look as follows:
```{bash}
bash BASH/tempids.sh | head -n 2
```
Now with this script we can run the line `bash tempids.sh` inside the `<()` operator as an input for the `grep -f` command:
```{bash}
time (bash BASH/piping.sh BASH/file2search.txt)
head BASH/piping.sh
```
In this way, this code is equivalent to running grep in a for loop:
```{bash}
bash BASH/piping_alternative.sh BASH/file2search.txt
head BASH/piping_alternative.sh
```
Even faster than an awk search:
```{bash}
bash BASH/awk_regrex_insideFor.sh BASH/file2search.txt
tail -n 1 BASH/awk_regrex_insideFor.sh
```
However, the first code is much faster, as the time command shows us; the reason is that it is in reality a single grep search without control structures.
The second code shows the bash array structure and the way it can be fed, as an alternative to avoid temporary files; it also shows how to access arrays in a bash for loop, [more on bash arrays here](https://opensource.com/article/18/5/you-dont-know-bash-intro-bash-arrays).
#### Folder synchronization with rsync
An especially useful command for servers with file-deletion programs; this command will help ensure your data's safety:
```{bash, eval=FALSE}
rsync -av /mnt/ls15/scratch/users/gutie190/MaxiCircles/ /mnt/home/gutie190/Trypanosomes_RNA_editing/MaxiCircles/NewMaxiCircles/
```
This command will copy all files from the scratch directory to a safe local directory where there is no deletion policy, but there are disk-space limitations. [Source ](https://www.tutorialspoint.com/unix_commands/rsync.htm)
`-a` means the files are transferred in "archive" mode, which ensures that symbolic links, devices, attributes, permissions, ownerships, etc. are preserved in the transfer.
Each time the source folder is modified, you should run the command again; the rsync remote-update protocol will transfer only the files that differ from the already-synchronized ones.
By default `rsync` only adds files to the destination folder that have been added to the source folder, and updates files that have been changed in the source, but it does NOT delete any files that have been deleted from the source. To delete files in the target, add the --delete option to your command.
```{bash, eval=FALSE}
rsync -avh source/ dest/ --delete
```
Sometimes, prior to rsync, it's useful to delete empty folders, which may be done with find:
```{bash, eval=FALSE}
find . -type d -empty -delete
```
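A self-contained check (the demo directory names are hypothetical):

```{bash}
mkdir -p find_demo/empty find_demo/full && touch find_demo/full/keep.txt
find find_demo -type d -empty -delete
ls find_demo    # prints only "full": the empty directory was removed
rm -r find_demo
```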
#### Grep with perl regex
The grep command is one of the best-known search tools; it's so powerful that it can be used together with other languages such as perl. For instance, in a complex data set like the following:
```{bash}
head -n 8 BASH/complex_formatting.txt
```
For each entry there are 7 lines of information. To extract a specific, non-separated query matching only alphanumeric characters, including connector characters such as "_", we can use the grep -oP command and the perl ["\w+"](https://stackoverflow.com/questions/47361430/about-the-meaning-of-perl-w) expression to include the whole word, as follows:
```{bash}
grep -oP "Accession: \w+" BASH/complex_formatting.txt
```
The set of characters included in perl matching classes such as \w can be displayed using:
```{bash, eval=TRUE}
unichars -a -u '\w' | wc -l
```
To check the difference between perl matching classes, we can combine the unichars and diff commands; however, they are not available on all unix systems:
```{bash, eval=TRUE}
diff -u <( unichars -a -u '\w' ) <( unichars -a -u '\d' )
```
#### User-defined match delimiters
A more intuitive, versatile and controllable way to do the same task is the perl "?" and "(?=)" regex pair, which allows us to define the limits of character exploration. Let's use an absurd example, such as matching any characters before a defined word ("abc") is found, without including the defined word in the output.
```{bash}
echo addfadfaabc1234 | grep -oP "(.+?)(?=abc)"
#One liner perl version of this is
# perl -ne ' $_ =~ s/(.+?)(?=abc)/$1/; print $1;'
```
Using the previous complex formatting text:
```{bash}
cat BASH/complex_formatting.txt | grep -oP "Type:(.+?)(?=[\n|;])"
```
An equivalent expression to the `\w+` examples above:
```{bash, eval=TRUE}
grep -oP "(Accession:.+?)(?=ID)" BASH/complex_formatting.txt
```
#### Generate an egrep pattern
The previous syntax lets us build a generator of combined egrep patterns, a common need with several applications, such as quantifying the protein isoforms of a single gene. Given a file mapping TAIR gene IDs to Uniprot IDs in this format (`TAIRid2Uniprot.txt`):
TAIR | Uniprot
--- | ---
AT1G13730 | AEE29061.1
AT1G69250 | AEE34901.1
AT1G69250 | AEE34902.1
The next command generates the combined non-redundant search term `AEE29061.1|AEE34901.1|AEE34902.1`:
```{bash, eval=FALSE}
for g3bp in AT1G13730 AT1G69250
do
uniprot=`grep -w $g3bp TAIRid2Uniprot.txt | awk '{gsub(" ","",$2);printf $2"|"}' | grep -oP "(.+?)(?=\|)" |
perl -ne 'chomp $_; print $_ ' `
echo $uniprot
# Beyond the interest of print the modification, we can also print the sequence associated with both ids
# zcat TAIR10.1_protein.ol.fa.gz | egrep -A 1 --no-group-separator $uniprot | awk -v tair="$g3bp" ' BEGIN {OFS = "\n"} {header = $0 ; getline seq ; gsub(">",">"tair"_",header) ; print header,seq}'
done
```
#### Find and other bash commands
The find command is one of the most useful tools on Unix systems; it is the Google inside this beautiful operating system, not only because it lets us locate any file or folder using regex, but also because it can search inside files, including PDF files (when pdfgrep is installed).
1. The basics: search for all files with a common extension and list their sizes and modification times
```{bash, eval=FALSE}
find . -name '*.sh' -exec ls -lht {} \;
```
Keep in mind that just `'*'` is enough for a global search; in find patterns, `.` is a non-special character.
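As a toy illustration of the quoting point (the sandbox below is hypothetical), quoting the pattern ensures that find itself expands the glob instead of the shell:

```bash
# Hypothetical sandbox: a temporary folder with mixed extensions
tmp=$(mktemp -d)
touch "$tmp/a.sh" "$tmp/b.txt" "$tmp/c.sh"
# Quoted pattern: find receives the literal '*.sh' and does the matching
find "$tmp" -name '*.sh' | sort
rm -rf "$tmp"
```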
What if you want to copy those files to a new folder?
```{bash, eval=FALSE}
find BlastNT/ -iname "*sb" -print -exec cp {} ~/GS21/BLAST_DB/ \;
```
1. Grep inside files, printing the name of the matching file
```{bash, eval=FALSE}
find . -name '*.txt' -exec grep -H "perl \-ne" {} \;
```
### Bash parallelizing 101
#### Measuring execution time
[check execution time of process whose id is known](https://www.2daygeek.com/how-to-check-how-long-a-process-has-been-running-in-linux/)
```{bash, eval=FALSE}
ps -p 16337 -o etime
```
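As a self-contained sketch (the PID 16337 above comes from a real session; here we capture the PID of a process we start ourselves), `ps -o etime=` prints the elapsed time in `[[DD-]hh:]mm:ss` format:

```bash
# Start a toy background process and ask ps how long it has been running
sleep 5 &
pid=$!                   # PID of the background job
sleep 1
ps -p "$pid" -o etime=   # elapsed time since the process started
kill "$pid"
```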
#### Forks and CPUs work distribution
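This subsection is still a stub; as a minimal sketch, bash distributes work across CPUs by forking background jobs with `&` and collecting them with `wait` (the `sleep` workload here is a placeholder):

```bash
# Fork one background subshell per job; each runs on whatever CPU
# the scheduler assigns it
for f in 1 2 3; do
  ( sleep "$f"; echo "job $f done" ) &
done
wait   # block until every background child has finished
echo "all jobs finished"
```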
#### Parallel
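Also a stub; GNU parallel is the usual tool here, but since it is not installed on every system, a hedged sketch with the standard `xargs -P` gives a similar fan-out (the sample names are hypothetical):

```bash
# -P 4: run up to 4 processes at once; -I {}: substitute each input line
printf '%s\n' sample1 sample2 sample3 sample4 |
  xargs -P 4 -I {} sh -c 'echo processing {}'
# With GNU parallel, an equivalent would be:
# parallel echo processing {} ::: sample1 sample2 sample3 sample4
```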
### Soft and Hard links (ln)
A hard link acts as a second name (a mirrored directory entry) for the same file, while a soft or symbolic link acts as a pointer or reference to the file's path. However, keep in mind that deleting the original file leaves a symbolic link dangling, a situation that does not happen with hard links: the data survives as long as at least one hard link remains.
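A throw-away sketch of the difference, run in a temporary directory:

```bash
tmp=$(mktemp -d) && cd "$tmp"
echo "data" > original.txt
ln original.txt hard.txt      # hard link: a second name for the same inode
ln -s original.txt soft.txt   # soft link: a pointer to the path "original.txt"
rm original.txt
cat hard.txt                  # still prints "data": the data survives
cat soft.txt 2>/dev/null || echo "soft link is dangling"
cd / && rm -rf "$tmp"
```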
### Creating commands on Bash
* Save routine functions and aliases as bash commands, under whatever label you want to type for them in the terminal. To start, we need to edit the file `~/.bashrc` with your favorite text editor, in my case Vim ([guide for vim commands](VIM/README.md)). The main difference between aliases and functions is that functions read arguments, while aliases do not. My favorite list of aliases and functions in my .bashrc file:
<details>
<summary> * Aliases </summary>
```{bash}
# List the files in human readable format without writing -lh and
# organize them by size, from bigger to smaller
alias lso='ls -lh --sort=size'
# List the files in human readable format without writing -lh and
# organize them by time of modification, from most to least recent
alias lst='ls -lh --sort=time'
# Delete files, but ask first whether that's what you really want
alias rm='rm -i'
# rm -f would still delete without asking, but it buys you an extra
# moment of thought
# I usually work on servers; I do not like to write ssh hostname@server thousands of times
alias msuhpc=' ssh -X USER@SERVER'
# First time in the matrix? Sometimes it is hard to know whether you are in a screen session
alias iscreen=' echo $TERM '
# If you are on a server, always work in screens: if you lose the connection,
# the screens survive; but outside the matrix, the world is harder.
# Again, a useful alias for when you do not want to be kicked out of the server,
# but you are not mentally there
alias waste_time='for f in {1..200}; do echo ------ wasting time minutes $f; sleep 60; done'
```
</details>
<details>
<summary> * Functions </summary>
Aliases are boring, but save time. Functions are quite interesting: they are basically mini programs, and bash lets you create commands as varied as a BLAST of a sequence you want to test against BLAST databases.
```{bash eval=FALSE}
# LiSt the Absolute path Directory of a file
lsad() {
find $PWD -iname $1
}
# List the size of a folder
lsf(){
du -ch --max-depth=0 $1
}
# Search on history a past command
hisgrep() {
history | grep $1
}
# What if you use a file many times and the previous command shows a lot of noisy history?
# You can also check only when the file you are interested in was generated
hisgrep_gen() {
hisgrep "-P \>\\s+$1"
}
# That's all for today darling screen
delete_screen() {
screen -X -S $1 quit
}
# Adjust the title of a paper to my format of saving it
papers() {
final=`echo "$1" | tr '\n' '_' | tr -d '–' |tr ' ' '_' | tr -d ',' |tr -d ':' | tr -d '(' | tr -d ')' | tr '/' '-' | awk -F '_$' '{gsub("__","_",$1);print $1}' | awk -F '_$' '{print $1}' `
echo $2.$final
}
# How many nucleotides does a sequence have?
count_nt() {
seq=`echo -ne $1 | wc -c`
echo $seq nucleotides
}
# Get the reverse complementary sequence of a DNA or RNA sequence
revcom() {
if [ "$1" == "DNA" ]
then
echo $2 | rev | tr 'ACGTacgt' 'TGCAtgca'
elif [ "$1" == "RNA" ]
then
echo $2 | rev | tr 'ACGUacgu' 'UGCAugca'
fi
}
# Check how large is the load of a server
server_status() {
ssh server free -h
ssh server ps aux --sort=-pcpu | egrep -v "root" | head # add any annoying software always coming up from ps
ssh server ps aux --sort=-pcpu | awk 'NR>1{sum+=$3;sum2+=$4}END{print "Cores used "sum "% and RAM used "sum2" %"}'
}
# Search in a given directory a file by its pattern
reference(){
pattern=`echo $1`
find $2 -iname "*$pattern*" -print 2>/dev/null
}
# Search a word in one PDF, or in thousands of PDFs under a common directory
# Of course, this requires pdfgrep to be installed
intext(){
find "$1" -name '*pdf' -exec pdfgrep -i "$2" /dev/null {} \; 2>/dev/null
}
```
</details>
These newly defined commands can be used even inside regular scripts, but first they must be loaded:
```{bash}
source ~/.bashrc
```
To run them here, inside scripts, you must invoke bash with the `-i` parameter, which means:
`-i If the -i option is present, the shell is interactive.`
Let's see an example. Computer, what does the reverse complementary sequence of ACCCCGAGACTAGGTAGAGACA look like, and how many nucleotides does it have?
This is what the script looks like:
```{bash}
head BASH/running_bashrc.sh
```
... And here is the output:
```{bash, eval=FALSE}
bash BASH/running_bashrc.sh
```
***
## Perl
Perl engulfed awk in its evolutionary history as a language.
### Oneliners or -ne command
In previous sections I have already shown how versatile Perl's regex is, whether used with grep or through the patterns shared between Perl and awk. However, if we want to manipulate a file and not only detect matches, the most precise tool short of a full formatting script is the Perl one-liner. The next example shows a Perl one-liner with an economical, non-strict writing style, consisting of:
1. `chomp $_`, dispensable but used to remove the last trailing newline from the input string.
1. `@F = split /\t/, $_` , feed the environment array F with each part of the text separated by tab or any other character.
1. `if ($F[1] ne "hsaP-") {$F[3] =~ s/(hsaP)[a-z].*-([let|mir].*)/$1-$2/;` control structure, restricting the substitution to only those entries whose field differs from "hsaP-"
1. `$string=join ("\t", @F); print "$string\n";}`, re join split and modified fields and close control structure.
#### Search and modify in a specific column
```{bash}
echo -ne "chr1\t115215\t115320\ttesting-syntax_hsaPmmu-mir-23a\nchr1\t115215\t115320\ttesting-syntax_hsaP-let-23a\n-Previous to perl edition-\n"
tail -n 1 Perl/oneliner_search-replace.sh
```
```{bash, eval=TRUE}
bash Perl/oneliner_search-replace.sh
```
### Search and save multiple hits in a single line
NCBI headers:
```{bash, eval=FALSE}
zcat TAIR10.1_cds.ol.fa.gz | grep "protein_id=" | perl -ne ' $_ =~ s/TAIR:([a-zA-Z0-9]++)\]\s.*protein_id=([a-zA-Z0-9]++.[0-9])/$1/; print $1."\t".$2."\n";' | sort | uniq > TAIRid2Uniprot.txt
```
### Pop for cleaning multiple separators inside a single column
I quite often use "_" to differentiate, annotate or provide deeper detail for a given entry; however, this is usually added on top of previous uses of the same character, as in the next example:
```{bash}
echo -ne "MH899121.1_ALV1.2\nNC_001557.1_TMVsat\n"
grep -oP "perl.*" Perl/reduce_delimeters.sh
```
In this example there is an entry with a double "_" in its name, `NC_001557.1_TMVsat`; the first underscore has nothing to do with the second, which introduces complexity into the format. One way to remove it is the Perl pop function, which takes off the last element of an array.
```{bash, eval=TRUE}
bash Perl/reduce_delimeters.sh
```
Note that the [perl oneliner](Perl/reduce_delimeters.sh) deletes the first "_" while conserving the indicative function of the second "_".
<!-- Usar perl en R markdown, se puede https://stackoverflow.com/questions/45857934/executing-perl-6-code-in-rmarkdown
--->
### Search until
```{bash, eval=FALSE}
grep "*" true_adapter_r.fa.clstr | grep -Po ">.+?(?=\.)"
```
## R
### Using grepl to merge dataframes
1. Using collapsed search terms
```{r, eval=FALSE}
merge_criteria <- paste(gsub(pattern = " ", replacement = "", gene_main_interest$TAIR),collapse ="|")
CaMV_Mock_GSE36457_interest <- CaMV_Mock_GSE36457[grepl(ignore.case = T, merge_criteria, CaMV_Mock_GSE36457$AGI.CODE),]
```
The previous command differs from R's merge function in that it tolerates string variations:
```{r, eval=FALSE}
exact_match <- merge(CaMV_Mock_GSE36457,gene_main_interest, by.x = "AGI.CODE", by.y = "TAIR", all.y = T)
# Comparing two dataframes
CaMV_Mock_GSE36457_interest$AGI.CODE[! CaMV_Mock_GSE36457_interest$AGI.CODE %in% exact_match$AGI.CODE]
```
1. Using grepl in a for loop
```{r, eval=FALSE}
# First generate a null entry to keep track of the unmatched features
CaMV_Mock_GSE36457 [ nrow(CaMV_Mock_GSE36457) + 1 , ] <- NA
empty_raw <- CaMV_Mock_GSE36457[nrow(CaMV_Mock_GSE36457),]
# Create the output DF first, then feed it with both matched and unmatched queries
CaMV_Mock_GSE36457_interest <- NULL
for (f in gene_main_interest$TAIR) {
g <- CaMV_Mock_GSE36457[grepl(ignore.case = T, f, CaMV_Mock_GSE36457$AGI.CODE),]
h <- gene_main_interest[grepl(ignore.case = T, f, gene_main_interest$TAIR),]
if(nrow(g)>0){
i <- data.frame(h,g)
} else {
i <- data.frame(h, empty_raw)
}
CaMV_Mock_GSE36457_interest <- rbind(CaMV_Mock_GSE36457_interest, i)
}
```
### Multiple substitutions
```{r}
mgsub <- function(pattern, replacement, x, ...) {
if (length(pattern)!=length(replacement)) {
stop("pattern and replacement do not have the same length.")
}
result <- x
for (i in 1:length(pattern)) {
result <- gsub(pattern[i], replacement[i], result, ...)
}
result
}
```
## GitHub
### Reconnecting from a new repository
[Pushing to github with ssh-authentication](https://www.r-bloggers.com/2014/05/rstudio-pushing-to-github-with-ssh-authentication/)
The current project was synchronized by this method using the command:
```{bash, eval=FALSE}
git config remote.origin.url git@github.com:AimerGDiaz/Codes_Repository
```
### Reconnecting in windows
Install [git](http://git-scm.com/downloads) first and configure the github account using the github credentials.
```{bash, eval=FALSE}
git config --global user.name AimerGDiaz
git config --global user.email aiagutierrezdi@unal.edu.co
```
### New connections in MSU server
In case of need follow github guide:
* [Generating SSH keys](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent)
* [Adding the .pub pair to github](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github)
```{bash, eval=FALSE}
# Generate git.pub in the conventional .ssh folder, then copy its content into the github ssh keys
# BE aware not to overwrite a previous rsa key !!
ssh-keygen -t ed25519 -C aiagutierrezdi@unal.edu.co -f ~/.ssh/git
# Enter the passphrase for the key when prompted
# Go to github settings and add a new ssh key using the content of the .pub file previously created
# Test the key with
ssh -T git@github.com
module load GCCcore/12.2.0
# To initialize a new Git repository in the current directory, use:
git init
git remote add origin git@github.com:AimerGDiaz/UPSC_Riboseq
git config remote.origin.url git@github.com:USER/REPO
git config --global user.name USER
git config --global user.email MAIL
# At the same time, write the .ssh/config file with
Host github.com
HostName github.com
User git
IdentityFile ~/.ssh/git
IdentitiesOnly yes
# Download what is already on the repo
git fetch origin
git pull origin main
# Now switch branches, from master to main
git branch main
git checkout main
git fetch origin
git pull origin main
# Ready to fill
git add .
git commit -m "Welcome MSU"
git push -u origin main
```
## License
My codes are licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/).
[](https://creativecommons.org/licenses/by-nc/4.0/)
## Viewers
[](https://github.com/AimerGDiaz/Viewers/blob/master/readme/409164432/week.md)