### Nucleotide diversity (pi or $\pi$) calculation (per site and per window)

Part of a project I'm working on is looking at how genetic variation accumulates in genomes. Population genetics has developed a number of metrics for evaluating diversity, and while I see them used quite often in papers, it's not always clear to me what they're actually measuring.

Today I had a look at a measurement of nucleotide diversity called pi ($\pi$). Trying to find a good definition of it, I repeatedly came across the same definition provided by Wikipedia: "the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population".

$\pi = \sum\limits_{ij}^{} x_i x_j \pi_{ij} = 2 \times \sum\limits_{i=2}^{n} \sum\limits_{j=1}^{i-1} x_i x_j \pi_{ij}$

where $x_i$ and $x_j$ are the respective frequencies of the $i$th and $j$th sequences, $\pi_{ij}$ is the number of nucleotide differences per nucleotide site between the $i$th and $j$th sequences, and $n$ is the number of sequences in the sample.

I sort of conceptually understand that in the context of looking at specific genes across a population, but I didn't really grok the actual values reported for $\pi$ values in a genome-wide scan. So, to figure out what those plots are actually showing, I had a look at the calculations made by vcftools. Specifically, I was looking for how the results of --site-pi and --window-pi are calculated.

For --site-pi, the function of interest is output_per_site_nucleotide_diversity() in variant_file_output.cpp, and the code of interest is

for(unsigned int allele = 0; allele < N_alleles; allele++)
{
int other_alleles_count = (total_alleles - allele_counts[allele]);
mismatches += (allele_counts[allele] * other_alleles_count);
}

// pairs is the number of pairwise combinations.
int pairs = (total_alleles * (total_alleles - 1));
double pi = (mismatches/static_cast<double>(pairs));


So, given biallelic loci, this seems to be proportional to allele frequency, and we can calculate $\pi$ for a single position simply from the VCF annotations AC (alternate allele count) and AN (allele number). This seems to use number of possible pairwise allelic combinations (which would be AN * (AN-1)) as the denominator and the number of comparisons that would not be equivalent (i.e. mismatches) when comparing all pairwise combinations (which would be AC * (AN - AC) + (AN - AC) * AC).

So, for this variant:
chromosome_1    186     .       G       T       546.80  PASS    AC=4;AF=0.040;AN=100;DP=750;FS=0.000;GQ_MEAN=39.08;GQ_STDDEV=36.12;InbreedingCoeff=0.6092;MQ=60.00;MQ0=0;NCC=4;QD=27.34;SOR=2.531;VQSLOD=12.99;culprit=FS


$\pi = ( 4 * (100 - 4) + (100 - 4) * 4 ) / (100 * (100 - 1)) = 0.0775758$

As far as I can tell, the single-site $\pi$ shouldn't go beyond 0.5 for practical applications with meaningfully large population samples (though you'll note that one heterozygous individual will give a $\pi$ of 1).

Okay, so what about the --window-pi option? The function of interest in vcftools is output_windowed_nucleotide_diversity(), and it's basically the same thing as the single site calculation except it considers the monomorphic sites in the window as well:

N_pairs = bins[CHROM][s][N_variant_site_pairs] + (N_monomorphic_sites * N_comparisons);
pi = bins[CHROM][s][N_mismatches] / double(N_pairs);


So, suppose we have the window 201 - 300 and the variants:
chromosome_1    275     .       C       T       7509.57 PASS    AC=20;AF=0.196;AN=102;DP=968;FS=0.000;GQ_MEAN=50.00;GQ_STDDEV=44.84;InbreedingCoeff=0.9368;MQ=60.00;MQ0=0;NCC=3;QD=31.29;SOR=1.456;VQSLOD=12.40;culprit=FS
chromosome_1    291     .       T       A       1247.11 PASS    AC=4;AF=0.040;AN=100;DP=945;FS=0.000;GQ_MEAN=45.96;GQ_STDDEV=32.66;InbreedingCoeff=0.8539;MQ=60.00;MQ0=0;NCC=4;QD=29.00;SOR=1.455;VQSLOD=12.26;culprit=FS


This time we also consider the monomorphic sites in the calculation (though this calculation doesn't seem to gracefully consider an absence of read data), which increases the number of pairs that are considered. This time around, our number of pairwise comparisons is the number of pairwise comparisons for the variant sites (for each variant, sum (AN * (AN - 1))) plus the number of monomorphic sites (2 variants, so 98 sites were non-variant in the 100 bp window) times the number of pairwise chromosome comparisons (in this example, there were 52 samples, so it was 104 * 103). So we have:

$nPairs = ( 102 * (102 - 1) + 100 * (100 - 1) ) + (100 - 2) * (104 * (104 - 1)) = 1069978$

And the number of mismatches is the same as before except summed for each variant (i.e. AC * (AN - AC) + (AN-AC) * AC):

$nMismatches = (20 * (102 - 20) + (102 -20) * 20) + (4 * (100 - 4) + (100 - 4) * 4) = 4048$

So $\pi$ for this 100 bp window is:

$\pi = 4048 / 1069978 = 0.0037832554$

Okay, so as far as I can tell, $\pi$ is proportional to how many variants are present (for the window-based calculation) and increases as minor allele frequency grows (for both). That seems like an appropriate metric to describe how much nucleotide diversity is present at a given locus in a given population.