Exome Sequencing and Direct Searches for Causative Mutations

Genome-wide association studies (GWAS) currently dominate the research literature in human genetics, but they have some limitations, as noted in the chapter.  Foremost among these is that, while a GWAS identifies regions of the genome associated with a trait or a genetic disease, it does not find the specific causative gene.  That is done by follow-up analysis; for many associations, the actual causative gene or mutation has not been pinpointed. 

An alternative approach to finding a causative mutation for a genetic disease is to bypass linkage analysis and look directly for changes in the DNA sequences of affected and unaffected individuals. At first glance, direct sequence analysis seems a daunting way to identify human mutations that give rise to either single-gene or complex traits. Two particular problems immediately come to mind. First, the human diploid genome comprises 6 billion base pairs, so it is much larger than the genomes of model organisms such as worms and yeast. Second, unlike the situation in laboratory organisms, genomic variation between unrelated people can be expected at hundreds of thousands or even millions of sites in the genome, most of which have nothing to do with variation in the trait being analyzed. While we have a reference human genome (in fact, several individual genomes), none of them can be considered a “wild type” from which other genomes have been derived, as is done in model organisms. Even if the genomes of affected and unaffected individuals could be sequenced, identifying the one causative mutation among all that genetic variation is a formidable task.

Obtaining the sequences.  Although these are huge challenges to overcome, the analysis of genetic diseases based on direct genome sequencing is progressing quickly.  As the costs of sequencing have decreased dramatically, the feasibility of rapid genome-sequencing at relatively low cost has improved.  Furthermore, improved filtering methods have been developed to identify the best candidate mutations among the vast array of genetic variation when human genomes are compared. 

The initial approaches to directly compare the DNA sequences focused on sequencing the exons of affected people rather than the entire genome, a procedure known as exome sequencing.   Exome sequencing offers a direct connection between molecular changes in a gene and the disease syndrome.  Exons comprise less than 2% of the total genome sequence, so focusing on them greatly reduces the amount and the complexity of sequence information that is compiled and analyzed.  The trade-off for this reduced complexity is that any mutations affecting regulatory regions are missed.  For many single-gene genetic diseases with severe phenotypes, this is an acceptable compromise.  Mutations in exons are likely to be the mutations with the most profound effects on the function of the gene’s protein product since they can cause amino acid changes or polypeptide chain termination. 

The human exome consists of about 30 megabases in the haploid genome spread among approximately 180,000 exons, so the first stage in exome sequencing is to “capture” the exons from among the entire genome of 3 billion base pairs. Exon capture can be done by hybridizing fragmented genomic DNA to two or more different exon microarrays specifically constructed to represent all of the protein-coding regions in the genome. The hybridizing fragments are extracted, purified, and amplified by PCR for sequencing. With current high-throughput sequencing methods, as much as 30 gigabases (30 × 10^9 bases) can be generated from one machine in a ten-day sequencing run. The individual sequences produced by these methods are very short, typically less than 100 bp in length, and each sequence has some errors. Nonetheless, because so much sequence data is generated so rapidly, it is feasible to have each fragment in the exome sequenced 30 to 40 times. While any given exon might be sequenced fewer than 30 times, the multiple passes minimize the impact of errors in any single sequence fragment.
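
To illustrate why multiple passes reduce the impact of errors in any single read, the sketch below takes a simple majority vote at each position across reads that cover the same stretch of sequence. It is only a toy under stated assumptions: the reads are taken to be already aligned and of equal length, the function name `consensus` and the example reads are hypothetical, and real variant callers use base-quality scores and statistical models (and must allow for heterozygous sites) rather than a plain vote.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most frequent base at each position across aligned reads.

    Assumes every read covers the same positions (equal length, no gaps).
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    called = []
    for column in zip(*aligned_reads):
        # most_common(1) gives the base seen most often in this column
        base, _count = Counter(column).most_common(1)[0]
        called.append(base)
    return "".join(called)


# Hypothetical reads: two of the five carry an isolated error.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "TCGTAC"]
print(consensus(reads))
```

With five reads, an isolated error in one read is outvoted by the other four; at thirty- to forty-fold coverage the same logic applies with a much larger margin.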

Filtering the sequences. The first use of exome sequencing to identify the causative gene for a genetic disorder was for Miller Syndrome (Ng et al., 2010). This procedure will be described in some depth. Miller Syndrome is a rare genetic disease characterized by cranio-facial abnormalities. Only about 30 cases are known worldwide, so the exact mode of inheritance is not clear. The pattern of inheritance is consistent with an autosomal recessive trait, but an autosomal dominant trait with reduced penetrance could not be absolutely ruled out. Thus, the analysis had to include both possibilities. There are only three known families with two affected siblings, so these families formed the core group for sequence analysis.

The first step was to obtain the exome sequence of two individuals in one family; about 164,000 regions could be sequenced, representing 27.9 Mbases and about 96% of the exons.  This amounted to 5.1 gigabases of DNA sequence per individual, in lengths of about 76 base pairs, or an average forty-fold coverage of their exons. 

Many different types of polymorphisms are seen when the two siblings are compared.  Because this is a severe disease with profound phenotypic consequences, the authors made the reasonable assumption that the causative mutation in the gene must substantially alter or disrupt the function of the protein product.  Thus, mutations that resulted in amino acid substitutions (that is, non-synonymous variants), that altered a splice site, or that inserted or deleted part of the coding region (that is, indels), were compared between the siblings.  Each affected child in this study had about 4600 such variants in his genome. The flowchart for identifying the causative mutation from among these variants is shown in Figure A, which has been reconstructed from the original data.
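
The restriction to variants likely to alter the protein can be sketched as a simple filter on annotated consequence. This is a minimal illustration, not the pipeline used in the study: the consequence labels, the gene names, and the function `keep_candidate_classes` are all hypothetical, and real annotation tools use their own vocabularies.

```python
# Hypothetical consequence labels standing in for the three classes
# named in the text: amino acid substitutions, splice-site changes, indels.
CANDIDATE_CLASSES = {"nonsynonymous", "splice_site", "indel"}


def keep_candidate_classes(variants):
    """Keep variants whose annotated consequence could alter the protein."""
    return [v for v in variants if v["consequence"] in CANDIDATE_CLASSES]


# Made-up variant records for illustration only.
variants = [
    {"gene": "GENE1", "pos": 101, "consequence": "synonymous"},
    {"gene": "GENE1", "pos": 140, "consequence": "nonsynonymous"},
    {"gene": "GENE2", "pos": 55, "consequence": "splice_site"},
    {"gene": "GENE3", "pos": 9, "consequence": "intronic"},
    {"gene": "GENE4", "pos": 300, "consequence": "indel"},
]

kept = keep_candidate_classes(variants)
print([v["gene"] for v in kept])
```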

Figure A

Now the question of whether the disorder is inherited as a dominant or recessive trait becomes an important part of the analysis. If the disease is inherited as an autosomal dominant trait, then the affected siblings need to share only one mutation; in this study, 3940 of the 4680 sequence changes were shared between the two affected siblings. On the other hand, if the disease is inherited as an autosomal recessive trait, then an affected individual must have mutations in both alleles of the gene; furthermore, these variants would be shared between the affected siblings. Note that the two mutations need not be the same mutation in the gene; in fact, since the parents were unrelated, the mutations are expected to be different. (In standard genetic terminology, the affected individuals are said to be heteroallelic or compound heterozygotes for two different recessive mutations, but you can think of them as being homozygous.)

In this study, each child was homozygous for about 2860 variants, of which 2362 were shared between the two siblings, as summarized in Figure A. This large number of common variants reflects the fact that siblings have half of their genome in common on average. One of these shared variants lies in the Miller Syndrome disease gene, but there are many other shared variants as well.
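
The difference between the dominant and recessive models can be sketched as a count of variant alleles per gene followed by an intersection across affected individuals. This is a hedged illustration that works at the level of genes, which is a simplification of the variant counts quoted above; the data layout, gene names, and the functions `candidate_genes` and `shared_candidates` are assumptions made for the example. Phase is not checked, so two heterozygous variants in the same gene are treated as hitting both alleles.

```python
from collections import defaultdict


def candidate_genes(calls, model):
    """Return genes consistent with an inheritance model for one person.

    calls: list of (gene, variant_id, copies) tuples, copies being 1 or 2.
    model: "dominant" requires at least one variant allele in the gene;
           "recessive" requires at least two (one homozygous variant, or
           two different heterozygous variants).
    """
    required = {"dominant": 1, "recessive": 2}[model]
    alleles_per_gene = defaultdict(int)
    for gene, _variant_id, copies in calls:
        alleles_per_gene[gene] += copies
    return {g for g, n in alleles_per_gene.items() if n >= required}


def shared_candidates(individuals, model):
    """Intersect the candidate genes of all affected individuals."""
    gene_sets = [candidate_genes(calls, model) for calls in individuals]
    return set.intersection(*gene_sets) if gene_sets else set()


# Made-up calls for two affected siblings.
sib1 = [("GENE_A", "a1", 1), ("GENE_A", "a2", 1),
        ("GENE_B", "b1", 1), ("GENE_C", "c1", 2)]
sib2 = [("GENE_A", "a1", 1), ("GENE_A", "a2", 1),
        ("GENE_B", "b1", 1), ("GENE_D", "d1", 1)]

print(sorted(shared_candidates([sib1, sib2], "dominant")))
print(sorted(shared_candidates([sib1, sib2], "recessive")))
```

Because the comparison is by gene rather than by exact variant, the same intersection can be applied to unrelated affected individuals, who are expected to carry different mutations in the same gene.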

The next step was to apply some filtering steps to the data; the effectiveness of these filtering steps is diagrammed in Figure B(a) and (b). Panel (a) summarizes the effectiveness if the trait is inherited as a dominant, while Panel (b) summarizes the effectiveness if the trait is recessive. There are databases of known common polymorphisms for particular populations; any polymorphism found in these databases can be ruled out as a candidate mutation since Miller Syndrome is so rare. Of the 3940 single variants shared by the affected siblings (Panel (a)), only 228 variants were not found among the population at large; among the 2362 variants for which the siblings are homozygous (Panel (b)), only 9 were not found in the population at large. That is, about 94% of the single-gene variants shared by the affected siblings are also found in people who are not affected; more than 99% of the homozygous variants shared by the siblings are also found in people who are not affected. Thus, filtering out the common polymorphisms is an extremely effective means to reduce the number of candidate mutations. This supports the concept discussed in the chapter that most of the variation in a person’s genome is shared with people of the same ancestry.
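
Filtering against a polymorphism database amounts to a set difference, and the effectiveness of the step follows directly from the counts before and after. The sketch below shows both; the variant identifiers and the function names `remove_known_polymorphisms` and `percent_removed` are hypothetical, while the counts passed to `percent_removed` are the ones quoted in the text for the two affected siblings.

```python
def remove_known_polymorphisms(candidates, known_polymorphisms):
    """Drop candidate variants already catalogued as common polymorphisms."""
    return set(candidates) - set(known_polymorphisms)


def percent_removed(before, after):
    """Percentage of candidates eliminated by a filtering step."""
    return 100.0 * (before - after) / before


# Toy example with made-up variant identifiers.
shared = {"var1", "var2", "var3", "var4"}
database = {"var1", "var3", "var4", "var9"}
print(sorted(remove_known_polymorphisms(shared, database)))

# Counts from the text: shared variants before and after the filter.
print(round(percent_removed(3940, 228), 1))  # dominant model, about 94.2
print(round(percent_removed(2362, 9), 1))    # recessive model, about 99.6
```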

Figure B

In addition to these two affected siblings, the exomes of two unrelated children affected by Miller Syndrome were also sequenced, and the results are summarized in Figure B(a) and (b). If only a single mutation is required to cause the disease, that is, if the disease is a dominant trait (Panel (a)), there were 3099 variants shared between the two affected brothers and one of these unrelated children; if the syndrome is recessive and affected individuals have to be homozygous (Panel (b)), 1810 possible candidates were identified as being shared among the two affected brothers and the unrelated affected child. By filtering out the polymorphisms that occur commonly in human populations, nearly all of these candidates could be eliminated. Only 26 variants in single genes were shared between the unrelated affected individuals but not found in the population at large, and only one homozygous variant was shared by the two affected families but not found in the population at large. The gene carrying this variant is a very strong candidate, and if the syndrome is inherited as a recessive trait, then this gene is the only candidate.

In fact, when the exome from the other unrelated individual, the fourth person from the third family, was included with the others, only 8 dominant variants were in common and only this same one homozygous variant was shared among the three families. The mutations in this gene, called DHODH, made it the best candidate for being the cause of Miller Syndrome. The inheritance of unrelated mutations in both alleles of this gene in three unrelated families is very strong evidence that this is the correct gene.

In order to confirm that this gene is the cause of Miller Syndrome, this gene (rather than the entire exome) was sequenced from four more affected individuals, one of them an affected sibling of the affected child in family 2 above and three of them the only affected member of their family. All of these individuals were heteroallelic for mutations in the DHODH gene. In addition, the parents of the affected individuals were all found to be heterozygous for the mutations found in their children. In total, 11 different mutations in the DHODH gene were found in six different families, and none of them was in common except within the same family. None of these mutations in DHODH was found among 200 unaffected control individuals.

Other means to filter the data. This initial study showed that removing common polymorphisms from consideration was by far the most effective way of filtering out candidate variants. Subsequent exome sequencing studies do this routinely among their first steps. Other filters have also been found to be useful. For example, the study of Miller Syndrome also used a program to predict which sequence variants would have severely damaging effects on the protein, as summarized in Figure A. In this first study, the filter was misleading, since one of the mutations in DHODH was not predicted by the program to be damaging and was eliminated from consideration. In fact, the program pointed to another gene, in which the two brothers also shared mutations, as the best candidate, but this gene was not mutated in affected children from other families. In subsequent studies, more refined versions of protein structure prediction have been used and have proved to be helpful.

Even in this initial study, exome sequencing proved to be an exceptionally effective method to find the causative mutation for a genetic syndrome of unknown cause. The exome sequences of only four individuals in three families were necessary to identify the gene, with four more affected individuals tested for verification. In fact, if the investigators had been more confident about the mode of inheritance and simply tested the gene as an autosomal recessive trait, only one unrelated affected individual would have been needed to find the gene. At least 75 other studies using exome sequencing to find the gene responsible for a rare genetic syndrome have been reported in the past five years, with similar approaches.

Whole Genome Sequencing.  The rationale for sequencing exomes rather than entire genomes is that most of the severe genetic diseases will be due to mutations in the exons that produce deleterious changes in the amino acid sequence.  This is also its primary pitfall: some diseases will be due to changes in the regulatory regions and will be missed by this approach.  In order to identify all of the possible causative mutations, it would be better to examine the sequence of the entire genome, both the exons and the regulatory regions.

Some whole genome sequencing studies have been done, and it is certain that this approach will be increasingly common.  While sequencing exomes is much cheaper and faster than sequencing entire genomes, exome sequencing requires an additional step to capture the exons; sequencing technology has improved enough that whole genome sequencing, which does not require this variable step, is nearly as effective.  The primary limitation in using whole genome variation is that the databases of common polymorphisms focus primarily on exons; thus, it is more difficult to filter out common polymorphisms in other parts of the genome, at least until many more entire individual human genomes are sequenced.


Find out more: Ng et al. 2010. Nature Genetics 42: 30-36