Tutorial 4: Legacy Integration

chutter edited this page Apr 15, 2023 · 8 revisions

This tutorial is a guide to using the "integrateLegacy" function, which lets you combine existing Sanger or other types of data with your sequence capture dataset. For example, if you have only 30 species from a genus and data for the gene RAG1 are available on GenBank for 60 species, you can add these extra samples to your sequence capture alignment to increase taxon sampling. This can be useful for comparative methods studies, or for building a strongly supported backbone in the hope that it improves support across the overall tree.

What is needed to get started:

  1. Run the pipeline's workflow-1 and workflow-2 to obtain alignments for your sequence capture dataset. Use the "untrimmed_all-markers" folder, because trimming may have modified the alignments too much; you can trim again afterwards. Alternatively, if you have not processed your data with PhyloCap, a set of alignments from another pipeline will work.

  2. If you want to integrate mitochondrial data with the mitochondrial by-catch from your sequence capture raw reads, first run the MitoCap pipeline on your sequence capture samples. MitoCap is available here: https://github.com/chutter/MitoCap

  3. Obtain your GenBank data for the same loci/samples you have in your sequence capture dataset. Optionally, you can also include alignments for markers not in your sequence capture set of markers by setting the "include.uncaptured.legacy" option to TRUE.

  4. Make sure the names in the legacy alignments match the names in the sequence capture alignments if you want the samples to be combined; samples with different names are treated as different samples. It is worth including the same samples/species where available, in case the marker was not successfully captured for the sequence capture sample. If desired, the function will also combine sequences from the same sample (or species).

  5. Finally, place your legacy marker data in an informatively named directory inside your "alignments" folder, which should also contain the "untrimmed_all-markers" folder or whichever alignments you are using. The legacy files can be saved as either fasta or phylip, and they do not need to be aligned.
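
Because step 4 depends on exact name matches, it can help to compare sample names before running the function. The following is a minimal sketch in base R, not part of the pipeline: the helper names (read_fasta_names, compare_sample_names), the toy sample names, and the assumption that both files are in fasta format with one file per marker are all illustrative.

```r
# Hypothetical helper: read sample names from the ">" header lines of a fasta file.
read_fasta_names <- function(path) {
  lines <- readLines(path, warn = FALSE)
  headers <- lines[startsWith(lines, ">")]
  # Drop the leading ">" and anything after the first whitespace
  sub("\\s.*$", "", sub("^>", "", headers))
}

# Hypothetical helper: compare names between one sequence capture
# alignment and the legacy file for the same marker.
compare_sample_names <- function(capture.file, legacy.file) {
  capture.names <- read_fasta_names(capture.file)
  legacy.names <- read_fasta_names(legacy.file)
  list(
    shared = intersect(capture.names, legacy.names),
    legacy.only = setdiff(legacy.names, capture.names),
    capture.only = setdiff(capture.names, legacy.names)
  )
}

# Toy data written to temporary files (sample names are placeholders)
capture.file <- tempfile(fileext = ".fa")
legacy.file <- tempfile(fileext = ".fa")
writeLines(c(">Rana_temporaria", "ACGT", ">Rana_arvalis", "ACGA"), capture.file)
writeLines(c(">Rana_temporaria", "ACGTTT", ">Rana_dalmatina", "ACGC"), legacy.file)

name.check <- compare_sample_names(capture.file, legacy.file)
print(name.check)
```

Names listed under "shared" are the ones that can be combined; names under "legacy.only" would be treated as separate samples, so check them for spelling differences before running the function.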

Everything is now ready to run the function, which you can do as follows:

integrateLegacy(alignment.directory = NULL,
                alignment.format = "phylip",
                output.directory = NULL,
                output.format = "phylip",
                legacy.directory = NULL,
                legacy.format = "phylip",
                target.markers = NULL,
                combine.same.sample = TRUE,
                include.uncaptured.legacy = FALSE,
                include.all.together = FALSE,
                threads = 1,
                memory = 1,
                overwrite = FALSE,
                quiet = FALSE,
                mafft.path = NULL,
                blast.path = NULL) 

Where the parameters for the function are:

alignment.directory: directory of your sequence capture alignments 
alignment.format: "phylip" or "fasta"
output.directory: output directory name for your final alignments that include all sequence capture and legacy data
output.format: "phylip" or "fasta"
legacy.directory: directory of your legacy/GenBank alignments 
legacy.format: "phylip" or "fasta"
target.markers: path to the target marker file used for sequence capture. Not the probe file. 
combine.same.sample: TRUE to combine samples with the same name present in both sequence capture and legacy datasets
include.uncaptured.legacy: TRUE to include legacy alignments that have no sequence capture data (default: FALSE)
include.all.together: TRUE to combine the new legacy+sequence capture alignments with the sequence capture only alignments (default: FALSE)
threads: number of threads
memory: amount of memory in gigabytes
overwrite: TRUE to overwrite output directories
quiet: TRUE to suppress verbose output
mafft.path: path to mafft
blast.path: path to blast
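
As an illustration of how these parameters fit together, here is a sketch of a filled-in call. Every path and file name below is a hypothetical placeholder, and the code assumes the function is exported by an installed R package named PhyloCap; adjust both to your own setup. The call is guarded so that nothing runs unless the package and the alignment directory are present.

```r
# Hypothetical working directory and file names: replace with your own.
work.dir <- "alignments"

legacy.args <- list(
  alignment.directory = file.path(work.dir, "untrimmed_all-markers"),
  alignment.format = "phylip",
  output.directory = file.path(work.dir, "legacy-integrated"),
  output.format = "phylip",
  legacy.directory = file.path(work.dir, "genbank-legacy"),
  legacy.format = "phylip",
  target.markers = "target-markers.fa",  # the target marker file, not the probe file
  combine.same.sample = TRUE,
  include.uncaptured.legacy = FALSE,
  include.all.together = TRUE,
  threads = 4,
  memory = 8,
  overwrite = FALSE,
  quiet = FALSE
  # mafft.path and blast.path are left at their defaults here
)

# Assumption: the package is named "PhyloCap" and exports integrateLegacy.
if (requireNamespace("PhyloCap", quietly = TRUE) &&
    dir.exists(legacy.args$alignment.directory)) {
  do.call(PhyloCap::integrateLegacy, legacy.args)
}
```

Keeping the arguments in a named list makes it easy to record the settings used for a run alongside the output.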

The default output is a single directory named "output.directory"-only, that is, your chosen output directory name followed by "-only", which contains only the legacy markers. If you set "include.all.together" to TRUE, you will have a directory named "output.directory"-all, which contains all the legacy alignments together with the sequence capture alignments; the legacy-integrated alignments replace the old sequence capture alignments that lack the added taxa.
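
To check the results after a run, a short sketch such as the one below can list what was written. It assumes the "-only" and "-all" suffixes are appended directly to the output.directory value, as the description above suggests; the helper name expected_output_dirs and the example path are hypothetical.

```r
# Hypothetical helper: build the directory names described above, assuming
# the suffixes are appended directly to the output.directory value.
expected_output_dirs <- function(output.directory, include.all.together = FALSE) {
  dirs <- c(only = paste0(output.directory, "-only"))
  if (include.all.together) {
    dirs <- c(dirs, all = paste0(output.directory, "-all"))
  }
  dirs
}

# Example path is a placeholder: use the output.directory you passed to the function.
out.dirs <- expected_output_dirs("alignments/legacy-integrated",
                                 include.all.together = TRUE)

# Report how many files each directory holds, if it exists
for (d in out.dirs) {
  if (dir.exists(d)) {
    cat(d, ":", length(list.files(d)), "files\n")
  } else {
    cat(d, ": not found\n")
  }
}
```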

And with that, everything should be ready for downstream trimming and analyses!
