Monday, November 14, 2016

Watterson estimator calculation (theta or $\theta_w$) under infinite-sites assumption

Continuing on from the previous post, I'm still working on making sense of population genetics metrics. Today's is the Watterson estimator, frequently depicted as $\theta_w$. Wikipedia defines this one as

$\hat{\theta}_w = \frac{K}{a_n}$

Where $K$ is the number of segregating sites (e.g., a SNP) and $a_n$ is the $(n-1)$th harmonic number.

To check my calculations this time around, I'm using PopGenome (whose calculations of $\pi$ also agree with the vcftools calculations from the previous post).

So, at a single SNP and a sample of 52 diploid individuals (104 chromosomes), the calculation should be

$\hat{\theta}_w = \frac{1}{a_{104}}$

To get the harmonic number, you can use a calculator or just implement it. Note that the number is calculated for $(n - 1)$. A quick example of calculating the harmonic number that I wrote in R:

harmonicNumber = 0
numChromosomes = 104
for (i in 1:(numChromosomes - 1)) {
  harmonicNumber = harmonicNumber + 1.0/i
}
print(harmonicNumber)

For 104 chromosomes, the harmonic number used for the denominator is 5.216791. So for three different windows where the first window contains 0 SNPs, the second window contains 1 SNP, and the third window contains 2 SNPs, PopGenome reports the following values for $\theta_w$

> slide@theta_Watterson
         pop 1
[1,]        NA
[2,] 0.1916887
[3,] 0.3833774

And we can calculate those same values from $\frac{1}{5.216791} = 0.1916887$ and $\frac{2}{5.216791} = 0.3833774$.

Dividing that by the size of the window (i.e. to get a per bp estimate) seems to be what folks do when calculating sliding window averages of $\theta_w$. Wikipedia indicates that $\theta_w$ is a measure of the "population mutation rate" which seems appropriate enough; when calculated on the basis of sliding windows, this appears to represent the density of genetic variants in a region scaled slightly for the number of individuals in the sample. This seems to indeed be corroborated by a response from the vcftools developer to a user looking for $\theta_w$, indicating that using the --SNPdensity option from vcftools provides all of the necessary information for calculating the Watterson estimator.

As a final note, this appears to be the calculation under the infinite-sites assumption. The finite-sites calculation is evidently different as indicated in the LDhat documentation, though I'm not sure what that difference is yet. It looks like McVean et al. (2002) has additional information regarding that difference.

Monday, November 7, 2016

Nucleotide diversity (pi or $\pi$) calculation (per site and per window)

Part of a project I'm working on is looking at how genetic variation accumulates in genomes. Population genetics has developed a number of metrics for evaluating diversity, and while I see them used quite often in papers, it's not always clear to me what they're actually measuring.

Today I had a look at a measurement of nucleotide diversity called pi ($\pi$). Trying to find a good definition of it, I repeatedly came across the same definition provided by Wikipedia: "the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population".

$\pi = \sum\limits_{ij}^{} x_i x_j \pi_{ij} = 2 \times \sum\limits_{i=2}^{n} \sum\limits_{j=1}^{i-1} x_i x_j \pi_{ij}$

where $x_i$ and $x_j$ are the respective frequencies of the $i$th and $j$th sequences, $\pi_{ij}$ is the number of nucleotide differences per nucleotide site between the $i$th and $j$th sequences, and $n$ is the number of sequences in the sample.

I sort of conceptually understand that in the context of looking at specific genes across a population, but I didn't really grok the actual values reported for $\pi$ values in a genome-wide scan. So, to figure out what those plots are actually showing, I had a look at the calculations made by vcftools. Specifically, I was looking for how the results of --site-pi and --window-pi are calculated.

For --site-pi, the function of interest is output_per_site_nucleotide_diversity() in variant_file_output.cpp, and the code of interest is

for(unsigned int allele = 0; allele < N_alleles; allele++)
{
   int other_alleles_count = (total_alleles - allele_counts[allele]);
   mismatches += (allele_counts[allele] * other_alleles_count);
}

// pairs is the number of pairwise combinations.
int pairs = (total_alleles * (total_alleles - 1));
double pi = (mismatches/static_cast<double>(pairs));

So, given biallelic loci, this seems to be proportional to allele frequency, and we can calculate $\pi$ for a single position simply from the VCF annotations AC (alternate allele count) and AN (allele number). This seems to use number of possible pairwise allelic combinations (which would be AN * (AN-1)) as the denominator and the number of comparisons that would not be equivalent (i.e. mismatches) when comparing all pairwise combinations (which would be AC * (AN - AC) + (AN - AC) * AC).

So, for this variant:
chromosome_1    186     .       G       T       546.80  PASS    AC=4;AF=0.040;AN=100;DP=750;FS=0.000;GQ_MEAN=39.08;GQ_STDDEV=36.12;InbreedingCoeff=0.6092;MQ=60.00;MQ0=0;NCC=4;QD=27.34;SOR=2.531;VQSLOD=12.99;culprit=FS

$\pi = ( 4 * (100 - 4) + (100 - 4) * 4 ) / (100 * (100 - 1)) = 0.0775758 $

As far as I can tell, the single-site $\pi$ shouldn't go beyond 0.5 for practical applications with meaningfully large population samples (though you'll note that one heterozygous individual will give a $\pi$ of 1).

Okay, so what about the --window-pi option? The function of interest in vcftools is output_windowed_nucleotide_diversity(), and it's basically the same thing as the single site calculation except it considers the monomorphic sites in the window as well:

N_pairs = bins[CHROM][s][N_variant_site_pairs] + (N_monomorphic_sites * N_comparisons);
pi = bins[CHROM][s][N_mismatches] / double(N_pairs);

So, suppose we have the window 201 - 300 and the variants:
chromosome_1    275     .       C       T       7509.57 PASS    AC=20;AF=0.196;AN=102;DP=968;FS=0.000;GQ_MEAN=50.00;GQ_STDDEV=44.84;InbreedingCoeff=0.9368;MQ=60.00;MQ0=0;NCC=3;QD=31.29;SOR=1.456;VQSLOD=12.40;culprit=FS
chromosome_1    291     .       T       A       1247.11 PASS    AC=4;AF=0.040;AN=100;DP=945;FS=0.000;GQ_MEAN=45.96;GQ_STDDEV=32.66;InbreedingCoeff=0.8539;MQ=60.00;MQ0=0;NCC=4;QD=29.00;SOR=1.455;VQSLOD=12.26;culprit=FS

This time we also consider the monomorphic sites in the calculation (though this calculation doesn't seem to gracefully consider an absence of read data), which increases the number of pairs that are considered. This time around, our number of pairwise comparisons is the number of pairwise comparisons for the variant sites (for each variant, sum (AN * (AN - 1))) plus the number of monomorphic sites (2 variants, so 98 sites were non-variant in the 100 bp window) times the number of pairwise chromosome comparisons (in this example, there were 52 samples, so it was 104 * 103). So we have:

$nPairs = ( 102 * (102 - 1) + 100 * (100 - 1) ) + (100 - 2) * (104 * (104 - 1)) = 1069978 $

And the number of mismatches is the same as before except summed for each variant (i.e. AC * (AN - AC) + (AN-AC) * AC):

$nMismatches = (20 * (102 - 20) + (102 -20) * 20) + (4 * (100 - 4) + (100 - 4) * 4) = 4048 $

So $\pi$ for this 100 bp window is:

$\pi = 4048 / 1069978 = 0.0037832554 $

Okay, so as far as I can tell, $\pi$ is proportional to how many variants are present (for the window-based calculation) and increases as minor allele frequency grows (for both). That seems like an appropriate metric to describe how much nucleotide diversity is present at a given locus in a given population.