Coverage

Caution

This page is a work in progress and is subject to change at any moment.

Starting with some notation, let

  • $G$ = length of the genome,
  • $L$ = read length,
  • $N$ = number of reads.

We assume that $L$ is fixed. We first derive a relationship among these three quantities that results in successful assembly. Since $L$ and $G$ are fixed by our choice of experiment and technology, we need to choose $N$ (i.e., “How much sequencing do I need to do?”). Intuitively, the reads must cover the entire genome, so each base has to be covered by at least one read. Therefore $LN \geq G$, or $N \geq G/L$. Achieving this lower bound would require all $N$ reads to tile the genome perfectly without overlap, which is highly unlikely when reads are sampled at random.
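To get a feel for why $N = G/L$ randomly placed reads fall well short of covering the genome, here is a small Monte Carlo sketch. It is not part of the original derivation. It assumes that reads start at independent, uniformly random positions on a circular genome, and the toy sizes and function names are hypothetical.

```python
import random

def covered_fraction(G, L, N, rng):
    """Place N reads of length L uniformly at random on a circular genome of
    length G and return the fraction of bases covered by at least one read."""
    covered = [False] * G
    for _ in range(N):
        start = rng.randrange(G)  # uniform start position (assumption)
        for offset in range(L):
            covered[(start + offset) % G] = True  # wrap around the circular genome
    return sum(covered) / G

def mean_covered_fraction(G, L, N, trials=200, seed=0):
    """Average the covered fraction over several independent trials."""
    rng = random.Random(seed)
    return sum(covered_fraction(G, L, N, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    G, L = 10_000, 100   # toy sizes, chosen only to keep the simulation fast
    N = G // L           # exactly the naive lower bound on the number of reads
    print(mean_covered_fraction(G, L, N))
```

Under these assumptions the expected covered fraction is $1 - (1 - L/G)^{N}$, which for $N = G/L$ is close to $1 - e^{-1} \approx 0.63$, so roughly a third of the bases are missed.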

It turns out that if we let $\varepsilon$ denote the probability of not achieving full genome coverage, then it suffices to choose

$$ N \geq \frac{G}{L} \ln \left( \frac{G}{\varepsilon} \right). \tag{1} $$

If this condition is met, then we achieve full coverage with probability at least $1 - \varepsilon$. This requirement is more stringent than our previous bound because of the $\ln(G/\varepsilon)$ factor, which is much greater than 1 for any realistic $G$ and $\varepsilon$.
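One way to see where the logarithmic factor comes from is the following sketch. It assumes (this is not stated above) that read start positions are independent and uniform over a circular genome, so that each read covers any given base with probability $L/G$:

$$
\begin{aligned}
\Pr[\text{a given base is uncovered}] &= \left(1 - \frac{L}{G}\right)^{N} \leq e^{-NL/G}, \\
\Pr[\text{some base is uncovered}] &\leq G \, e^{-NL/G} \quad \text{(union bound over the $G$ bases)}, \\
G \, e^{-NL/G} \leq \varepsilon \;&\Longleftrightarrow\; N \geq \frac{G}{L} \ln \left( \frac{G}{\varepsilon} \right).
\end{aligned}
$$

The union bound ignores the dependence between neighboring bases, so this argument gives a sufficient number of reads rather than an exact threshold.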

In isolation, $N$ is not very informative. For a particular sequencing experiment, $N = 100$ million reads could be large or small depending on the size of the genome and the length of each read. Because the reads are random, some bases will be covered more often than others. Therefore, rather than using $N$, we are instead interested in the coverage depth, i.e., the average number of reads covering each base, which is given by

$$ c = \frac{NL}{G} \geq \ln \left( \frac{G}{\varepsilon} \right). \tag{2} $$
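As a rough illustration, the coverage requirement can be evaluated numerically. The sketch below is not from the original notes. It takes the requirement to be $c \geq \ln(G/\varepsilon)$, and the function names `required_coverage` and `required_reads`, as well as the read length of 100 in the usage lines, are hypothetical.

```python
import math

def required_coverage(G, eps):
    """Coverage depth c = ln(G / eps) that suffices for full coverage
    with probability at least 1 - eps."""
    return math.log(G / eps)

def required_reads(G, L, eps):
    """Number of reads N (rounded up) so that N * L / G >= ln(G / eps)."""
    return math.ceil(G / L * required_coverage(G, eps))

if __name__ == "__main__":
    G, eps = 10**9, 0.01
    L = 100  # hypothetical read length, chosen only for illustration
    print(f"coverage depth: {required_coverage(G, eps):.3f}")
    print(f"reads needed:   {required_reads(G, L, eps):,}")
```

Because the dependence on $G$ and $\varepsilon$ is only logarithmic, making $\varepsilon$ ten times smaller raises the required depth by just $\ln 10 \approx 2.3$.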

As an example, if the genome of interest is about one billion base pairs long and we tolerate a 1% chance of incomplete coverage, then we need a coverage depth of at least about 25x: $G = 10^{9}$, $\varepsilon = 0.01$ $\Rightarrow$ $c \geq \ln(10^{11}) \approx 25.328$. Note that $L/G$ is quite small, and therefore the number of reads covering a given base can be approximated by a Poisson distribution with mean

$$ c = \frac{NL}{G}. \tag{3} $$
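Under this Poisson approximation, the probability that a given base is covered by exactly $k$ reads is $e^{-c} c^{k} / k!$, so a base is left uncovered with probability $e^{-c}$. The following sketch (the helper names `poisson_pmf` and `expected_uncovered_bases` are hypothetical) evaluates these quantities and checks that at $c = \ln(G/\varepsilon)$ the expected number of uncovered bases, $G e^{-c}$, equals $\varepsilon$.

```python
import math

def poisson_pmf(k, c):
    """P(X = k) for X ~ Poisson(c): probability that a base is covered by exactly k reads."""
    return math.exp(-c) * c**k / math.factorial(k)

def expected_uncovered_bases(G, c):
    """Expected number of bases with zero coverage, G * e^{-c}."""
    return G * poisson_pmf(0, c)

if __name__ == "__main__":
    G, eps = 10**9, 0.01
    c = math.log(G / eps)                  # coverage depth from the bound on c
    print(expected_uncovered_bases(G, c))  # approximately eps
    for k in (10, 25, 40):
        print(k, poisson_pmf(k, c))        # chance of seeing depth exactly k
```

The per-base depth distribution is also what explains why some bases end up covered far more often than the average $c$ while others are covered far less.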