Documentation for LD matrix methods#3416
Conversation
271bcaa to
e2f50d6
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3416 +/- ##
=======================================
Coverage 91.92% 91.92%
=======================================
Files 37 37
Lines 32153 32153
Branches 5143 5143
=======================================
Hits 29556 29556
Misses 2264 2264
Partials 333 333
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
jeromekelleher
left a comment
There was a problem hiding this comment.
I've had a quick read through and generally looks great. I've spotted a few typos and left a few take-it-or-leave-it comments.
I haven't thought through the details at all through, and I think it needs a careful review from @petrelharp for that.
|
Thanks so much for the quick review, @jeromekelleher. I agree with your suggestions and will fix things up accordingly. |
e2f50d6 to
7e2b62a
Compare
|
I think this will also need a changelog entry, since adding documentation will announce that this API is now public. Does that happen here or in a later part of the release pipeline? |
|
That could happen here! |
|
We can add to the changelog here or not, whichever is easier. We do need a review from you though @petrelharp as I'm not on top of the stats details. |
|
I've created a milestone for version 1.1, which will basically be this new LD API plus the new metadata codec. Would be good to get it shipped in the next week or so if possible. |
7e2b62a to
bae4065
Compare
petrelharp
left a comment
There was a problem hiding this comment.
See suggested edits (and feel free to say I'm wrong on any).
|
|
||
| The `site` mode computes two-locus statistics summarized over alleles between | ||
| all pairs of specified sites. The default behavior, leaving `sites` | ||
| unspecified, this method will compute a matrix for all pairs of sites, with |
There was a problem hiding this comment.
| unspecified, this method will compute a matrix for all pairs of sites, with | |
| unspecified, will compute a matrix for all pairs of sites, with |
| The `site` mode computes two-locus statistics summarized over alleles between | ||
| all pairs of specified sites. The default behavior, leaving `sites` | ||
| unspecified, this method will compute a matrix for all pairs of sites, with | ||
| rows and columns representing each site in the tree sequence (i.e., an n×n |
There was a problem hiding this comment.
| rows and columns representing each site in the tree sequence (i.e., an n×n | |
| one row and column for each site in the tree sequence (i.e., an n x n |
| a method for combining the statistical results from each pair of alleles into | ||
| one summary for a pair of sites. There two methods for combining results from | ||
| multiple alleles, `hap_weighted` and `total_weighted`, which are | ||
| statistic-specific and not chosen by the user: |
There was a problem hiding this comment.
missing description of hap_weighted and total_weighted here?
| resulting statistic is a summary depending on marginal tree topologies and the | ||
| branch lengths of the trees at a pair of positions. To perform this | ||
| computation, we consider the counts of all possible two-locus haplotypes that | ||
| may be generated by possible mutations on each pair of branches between the two |
There was a problem hiding this comment.
| may be generated by possible mutations on each pair of branches between the two | |
| could be generated by mutations on each pair of branches between the two |
| The time complexity of this method is quadratic, due to the pairwise | ||
| comparisons of branches from a pair of trees. By default, this method computes |
There was a problem hiding this comment.
| The time complexity of this method is quadratic, due to the pairwise | |
| comparisons of branches from a pair of trees. By default, this method computes | |
| The time complexity of this method is quadratic in the number of samples, | |
| due to the pairwise comparisons of branches from each pair of trees. | |
| By default, this method computes |
| :param list sites: A list of sites over which to compute LD. Can be | ||
| specified as a list of lists to control the row and column sites. | ||
| Only applicable in site mode. Specify as | ||
| ``[[row_sites], [col_sites]]`` or ``[all_sites]``. |
There was a problem hiding this comment.
| ``[[row_sites], [col_sites]]`` or ``[all_sites]``. | |
| ``[[row_sites], [col_sites]]`` or ``[all_sites]``. | |
| Defaults to all sites. |
| :param list positions: A list of genomic positions where expected LD is | ||
| computed. Only applicable in branch mode. Can be specified as a list | ||
| of lists to control the row and column positions. Specify as | ||
| ``[[row_positions], [col_positions]]`` or ``[all_positions]``. |
There was a problem hiding this comment.
| ``[[row_positions], [col_positions]]`` or ``[all_positions]``. | |
| ``[[row_positions], [col_positions]]`` or ``[all_positions]``. | |
| Defaults to the leftmost coordinates of all trees. |
| Similarly, in the branch mode, the ``positions`` argument specifies | ||
| loci for which the expectation for the two-locus statistic is computed | ||
| over pairs of trees at those positions. LD stats are computed between | ||
| trees whose ``[start, end)`` contains the given position (such that | ||
| repeats of trees are possible). Similar to the site mode, a nested list |
There was a problem hiding this comment.
| Similarly, in the branch mode, the ``positions`` argument specifies | |
| loci for which the expectation for the two-locus statistic is computed | |
| over pairs of trees at those positions. LD stats are computed between | |
| trees whose ``[start, end)`` contains the given position (such that | |
| repeats of trees are possible). Similar to the site mode, a nested list | |
| Similarly, in the branch mode, the ``positions`` argument specifies | |
| genomic coordinates at which the expectation for the two-locus statistic | |
| is computed, given the local tree structure. This defaults to computing | |
| the LD for each pair of distinct trees, which is equivalent to passing in | |
| the leftmost coordinates of each tree's span (since intervals are closed on | |
| the left and open on the right). Similar to the site mode, a nested list |
There was a problem hiding this comment.
We'll need to adjust this given the "expectation" discussion above.
| Some LD statistics are defined for two sample sets instead of within a | ||
| single set of samples. If the ``indexes`` argument is specified, at | ||
| least two sample sets must also be specified. ``indexes`` specifies the | ||
| sample set indexes between which to compute LD. |
There was a problem hiding this comment.
| sample set indexes between which to compute LD. | |
| indexes of the sample sets in the ``sample_sets`` list | |
| between which to compute LD. |
| :math:`D` y n "D" | ||
| :math:`D'` y n "D_prime" | ||
| :math:`D_z` n n "Dz" | ||
| :math:`\pi2` n n "pi2" |
There was a problem hiding this comment.
| :math:`\pi2` n n "pi2" | |
| :math:`\pi_2` n n "pi2" |
|
This is great, thanks! Remaining things:
Background:
|
|
Okay, just checking the details of branch mode:
Okay, so: for a given pair of branches (on the same or different trees), the sample counts So, for computation, I think we should say very clearly somewhere that branch mode works by summing over all pairs of branches the value of f( ) multiplied by the lengths of the two branches. |
|
As for interpretation (ie in what sense is it the expected value): First off, I'm confused about a dumb silly thing. If I simulate a short tree sequence and compute The
|
|
For the multiallelic question, the computations happen here and here; based on where they're called from I'm assuming that the first function is just a quicker special-purpose version of the second one for both biallelic sites. (ps I'm assuming you know this all, but I forget everything and it's nice to go look at the details) So, it is summing over all pairs of alleles but then normalised. |
| : {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}` | ||
|
|
||
| This statistic is polarised, as the unpolarised result, which averages over | ||
| allele labelings, is zero. Uses the `total` normalisation method. |
There was a problem hiding this comment.
below this is called "total weighted" not "normalised"; just checking?
|
I think we just need a few more paragraphs, documenting exactly what's being calculated, as above. Can you take a stab at this @apragsdale? |
Hi @petrelharp - yes, of course! I'll tackle these today and this weekend. Thank you for such detailed comments. With some earlier rounds of revision on these docs, I had gone back and forth with including too much vs too little detail, and probably ended up falling on the too-little side of it. I also agree the nan calculations are confusing, and I remember @lkirk and I having discussions about that last year. I think we had a good explanation/work-around for it. I'll dig up those notes, because it will be important to document as well. I'm also returning to this from many months away from the code, so it will be good to remind myself of all of it as well, and make sure it's documented properly. |
See #3353.
This largely pulls material from an existing, open PR to complete minimal documentation for the LD matrix methods currently available in tskit. I've made edits for clarity from the original PR, and removed some material that is possibly confusing or more information than needed for a user.