sequence based identification of genetic variation...

36
ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2017 Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 1343 Sequence based identification of genetic variation associated with intellectual disability JIN ZHAO ISSN 1651-6206 ISBN 978-91-513-0007-8 urn:nbn:se:uu:diva-326283

Upload: others

Post on 31-Aug-2019

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2017

Digital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Medicine 1343

Sequence based identification ofgenetic variation associated withintellectual disability

JIN ZHAO

ISSN 1651-6206ISBN 978-91-513-0007-8urn:nbn:se:uu:diva-326283

Page 2: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

Dissertation presented at Uppsala University to be publicly examined in B7 111:a, BMC,Husargatan 3, Uppsala, Wednesday, 13 September 2017 at 09:00 for the degree of Doctor ofPhilosophy (Faculty of Medicine). The examination will be conducted in English. Facultyexaminer: Associate professor Anna Lindstrand (Department of Molecular Medicine andSurgery, Karolinska Institutet).

AbstractZhao, J. 2017. Sequence based identification of genetic variation associated with intellectualdisability. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty ofMedicine 1343. 35 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-0007-8.

Intellectual disability (ID) is a common neurodevelopmental condition, often caused by geneticdefects. De novo variation (DNV) is an important cause of ID, especially in severe or syndromicforms of the disorder. Next generation sequencing has been a successful application for findingpathogenic variation in ID patients. The main focus of this thesis is to use whole exomesequencing (WES) and whole genome sequencing (WGS) to identify pathogenic variants inundiagnosed ID patients. In Paper I, WES was used in family trios to identify pathogenic DNVsin patients diagnosed with ID in combination with epilepsy. This work led to the identificationof several DNVs in both new and known disease genes, including the first report of variation inthe HECW2 gene in association with neurodevelopmental disorder and epilepsy. Paper II is thefirst independent validation of PIGG as a disease-causing gene in patients with developmentaldisorder. We used WES to identify the homozygous variation in PIGG, and transcriptomeanalysis as well as flow-cytometry studies were used to validate the pathogenicity of the PIGGvariation. We discovered that PIGG variation give different effects in different cell types,contributing new insights into the disease mechanism. Paper III is also an application of WESin trio families with patients diagnosed with ID in order to identify causal variants, a strategysimilar to that of Paper I. Several pathogenic variants were identified in this study; in particular,the gene NAA15 is highlighted as a new disease gene, and was recently confirmed in independentstudies. This study also adds evidence to support that variation in the PUF60 gene is causing thesymptoms in patients with Verheij syndrome. In Paper IV, WGS was used to analyze familieswith consanguineous marriages. All families in this study had been previously analyzed withWES without finding a disease cause. A number of new disease-causing variants were identifiedin the study, including a first validation of FRMD4A as a disease-associated gene. This studyalso shows that WGS performs better than WES in finding variants, even for variants in codingparts of the genome.

Jin Zhao, Department of Immunology, Genetics and Pathology, Rudbecklaboratoriet, UppsalaUniversity, SE-751 85 Uppsala, Sweden.

© Jin Zhao 2017

ISSN 1651-6206ISBN 978-91-513-0007-8urn:nbn:se:uu:diva-326283 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-326283)

Page 3: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Halvardson J, Zhao JJ, Zaghlool A, Wentzel C, Georgii-Hemming P, Månsson E, Ederth Sävmarker H, Brandberg G, Soussi Zander C, Thuresson AC#, Feuk L#. Mutations in HECW2 are associated with intellectual disability and epilepsy. Journal of Medical Genetics 53, 697-704. 2016.

II Zhao JJ, Halvardson J, Knaus A, Georgii-Hemming P, Baeck P, Krawitz P, Thuresson A-C#, Feuk L#. Reduced cell surface levels of GPI-linked markers in a new case with PIGG loss of function. Hu-man Mutation, 2017 Jun 5. doi: 10.1002/humu.23268. [Epub ahead of print]

III Zhao JJ, Halvardson J, Soussi Zander C, Zaghlool A, Georgii-Hemming P, Månsson E, Brandberg G, Ederth Sävmarker H, Frykholm C, Kuchinskaya E, Thuresson A-C#, Lars Feuk#. Exome sequencing reveals NAA15 and PUF60 as candidate genes associated with intellectual disability. Accepted for publication in American Journal of Medical Genetics Part B: Neuropsychiatric Genetics.

IV Zhao JJ, Halvardson J, Soussi Zander C, Månsson E, Ederth Sävmarker H, Åberg M, Thuresson A-C#, Feuk L#. Whole genome sequencing of consanguineous families reveals novel pathogenic variants in intellectual disability. Manuscript.

Reprints were made with permission from the respective publishers.

Page 4: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation
Page 5: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

Contents

Introduction ..................................................................................................... 7Genetic sequence ........................................................................................ 7Human genome ........................................................................................... 7Human genes .............................................................................................. 8Variations in the human genome ................................................................ 9The effect of genetic variations ................................................................ 10Inheritance of pathogenic variants in human ............................................ 12De novo variants ....................................................................................... 13Intellectual disability ................................................................................ 13Genetic causes of intellectual disability ................................................... 13Next generation sequencing ...................................................................... 14Interpretation of identified variants in ID patients ................................... 15

Methods ......................................................................................................... 18Whole Exome Sequencing ........................................................................ 18Whole genome Sequencing ...................................................................... 18Transcriptome analysis ............................................................................. 19

Summary of included papers ......................................................................... 20Paper I: Mutations in HECW2 are associated with intellectual disability and epilepsy .............................................................................. 20Paper II: Reduced cell surface levels of GPI-linked markers in a new case with PIGG loss of function ..................................................... 21Paper III: Exome sequencing reveals NAA15 and PUF60 as candidate genes associated with intellectual disability ............................. 22Paper IV: Whole genome sequencing of consanguineous families reveals novel pathogenic variants in intellectual disability ...................... 24

Concluding remarks and future perspectives ................................................ 26

Acknowledgements ....................................................................................... 28

References ..................................................................................................... 30

Page 6: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

Abbreviations

A Adenine bp Base pair C Cytosine CADD Combined annotation dependent depletion CNV Copy number variant dbSNP Database of single nucleotide polymorphism DNA Deoxyribonucleic acid DNV De novo variant eQTL expression Quantitative Trait Locus ExAC Exome Aggregation Consortium database G Guanine GPI Glycosylphosphatidylinositol GPI-AP GPI-anchored proteins GWAS Genome Wide Association Study ID Intellectual Disability IGD Inherited GPI deficiencie Indel Insertion or deletion mRNA messenger RNA NGS Next generation sequencing RNA Ribonucleic acid SNV Single Nucleotide Variant T Thymine U Uracil VUS Variant of uncertain significant WES Whole exome sequencing WGS Whole genome sequencing

Page 7: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

7

Introduction

Genetic sequence All living organisms have genetic information, which are the essential in-structions for the development and maintenance of organisms, as well as producing offspring. Genetic information is stored in nucleic acids (DNA or RNA), which are long chains of linked nucleotides. Each DNA nucleotide comprises a phosphate, a sugar, and one of four bases, adenine (A), cytosine (C), guanine (G), or thymine (T) (Figure 1); in RNA, the bases are A, C, G as in DNA, but RNA contains uracil (U) instead of T. The sequence of bases along the DNA or RNA strand encodes genetic information. Most organ-isms, including human, use DNA as genetic material. Viruses also have ge-netic material, although they lack characteristics of living organisms, such as metabolism and reaction to stimuli. Both DNA and RNA have found to be genetic materials of viruses. This thesis will focus on the human genome.

Human genome The human genome comprises double-stranded nuclear DNA (Figure 1) and circular double stranded mitochondrial DNA. The vast majority of human genetic information is stored in the nuclear DNA, which normally exists as two chains wrapped round each other in a double helix form. The sequences of these two chains are complementary, as the two chains can only fit to-gether correctly when every A in one chain is paired with a T in the other chain, and every C is paired with a G. This crucial feature is termed base pairing, which enables the reparation and replication of DNA. In human cells, nuclear DNA is packed with various proteins into complex structures called chromosomes. A normal human somatic cell contains 46 chromo-somes, including one pair of each autosome, i.e. chromosome 1-22, and two sex chromosomes, i.e. the X and Y-chromosomes (XX in females and XY in males).

Page 8: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

8

Figure 1. DNA double helix and base-pairing of the four bases.

Human genes The expressed regions of DNA are called genes, which is an evolving con-cept. Early definition of a gene specified it to code for one particular en-zyme, which was a useful way of thinking about a gene’s function (Beadle and Tatum, 1941). It is now accepted that many genes encode non-enzymic proteins, and one gene could carry the instructions for making more than one protein. A gene is first used as a template to produce messenger RNA (mRNA) in a process termed transcription; mRNA is subsequently used to guide the production of polypeptides in a mechanism called translation (Fig-ure 2). Besides traditional protein-coding genes, many genes specify func-tional non-coding RNA. The amino acid coding sequence of most human genes are split into relative-ly short segments called exons, separated by relatively long segments of

Adenine

Thymine

Cytosine

Guanine

Sugar-phosphatebackbone

Page 9: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

9

non-coding sequences called introns. During transcription, all exons and introns of a gene are transcribed to make the primary transcript. The introns of primary transcripts are removed, and the exons joined together by a multi-molecular complex called the spliceosome in a process called splicing (Fig-ure 2). The introns are recognized because of special sequences called con-sensus sequences, which bind to the proteins or RNA molecules in the spliceosome. Most human introns start with GU (GT in DNA template) and end with AG. Sometimes certain exons are also removed, and the remaining exons are spliced together in a phenomenon called alternative splicing, which enables a single gene to encode multiple different protein isoforms. These alternative protein isoforms could have tissue specific expression pat-terns and distinct functions (Yang, et al., 2016).

Figure 2. DNA transcription, mRNA processing and translation.

Variations in the human genome Variations in the DNA sequence of our genomes are called genetic variation. Genetic variation provides the substrates for natural selection and is the foundation of evolution. While mechanisms such as gene flow and sexual reproduction could provide genetic variation to a population, the ultimate source of genetic variation is mutation, the permanent changes in DNA se-quence caused by DNA damage or errors during DNA replication (Malkova and Haber, 2012). Changes of the DNA sequence can also be caused by in-

Page 10: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

10

sertions of non-coding DNA sequence known as active retroelements (Kaer and Speek, 2013). Human genetic variation ranges from numerical chromosomal abnormalities, structural variation, to single nucleotide variation (SNV). In cases of numer-ical chromosomal abnormalities, cells have one or more chromosomes extra or missing. Structural variation is caused by translocations, deletions, dupli-cations and inversions. SNV is where one base in the DNA sequence is changed to another. In the context of evolution, variants could be classified as deleterious, neutral, or adaptive (Fay, et al., 2001). The focus of this thesis will be on the deleterious variations in human genome.

The effect of genetic variations Genetic variations could affect the process of transcription, translation, and the function of the protein product of genes. Duplication and deletion of whole genes could cause variation in the copy number of genes and poten-tially change the amount of gene products. The biological consequences of such changes depend on the dosage sensitivity of the genes. If a 50% in-crease or decrease of copy number of a gene is enough to cause a phenotypic change, this gene is dosage-sensitive (Rice and McLysaght, 2017). Deletions of genes are generally more likely to be pathogenic than duplications (Pinto, et al., 2010; Watson, et al., 2014). Structural variations could also disrupt a gene, in which case the gene could still be expressed from the 5’ end, and produce incomplete transcripts, which are unstable and unlikely to produce any protein product. Sometimes two different genes are disrupted, the 5’ part of one gene and 3’ part of the other gene are brought together, in which case a potential stable fusion transcript could be expressed and translated. The genomic regions containing sequence variants that affect the expression level of genes are called expression quantitative trait loci (eQTLs), examples of eQTLs could be found in regulatory DNA elements such as promoters upstream a gene and distantly located enhancers (Schadt, et al., 2003). The effect of changes in introns are challenging to predict, but could potentially affect expression as well as splicing. Changes of 5’ or 3’ untranslated region of mRNA could affect the translation or stability of mRNA, but the effect is also hard to predict. There are three stop codons in mRNA, UAA, UAG, and UGA, which signal to the ribosome to terminate protein synthesis. A single nucleotide change in the coding region of a gene could cause a premature stop codon in the tran-script, which is called a nonsense mutation. Although truncated proteins are sometimes produced, mRNAs with nonsense mutations are usually degraded

Page 11: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

11

by nonsense-mediated decay, a mechanism likely evolved to save energy and avoid formation of toxic truncated proteins. Frameshift mutation is caused by a coding region deletion or insertion of a number of nucleotides, which is not a multiple of three. A frameshift muta-tion changes all downstream codons, and usually leads to a premature stop codon, thereby causing nonsense mediated decay of mRNA. Even if a poly-peptide is produced, it is usually unable to fold correctly and will be degrad-ed by cells. The effect of frameshift mutation is typically similar to nonsense mutation. When a single nucleotide mutation changes a codon to that of another amino acid, it is called a missense mutation. The effect of a missense mutation is difficult to predict. It could potentially alter the properties of the protein product, e.g. by making a protein bind stronger or weaker to other mole-cules, by affect the stability or folding of the protein, or by changing the efficiency of an enzyme. A synonymous mutation is when a codon is re-placed by another codon, which encodes the same amino acid. Both mis-sense mutation and synonymous mutation could alter splicing. Recent stud-ies have shown that approximately 10% of exonic variations affect splicing by disrupting the assembly of spliceosome (Soemedi, et al., 2017). Splice site mutations could affect the splicing of transcripts in various ways, e.g. exons could be skipped, or introns could be retained in the mature mRNA. Changes of the GT and AG that mark the beginning and the end of an intron are likely to alter splicing. The effect of other changes in the con-sensus sequence or near the splice sites is more challenging to predict, be-cause these sequences are more loosely defined, and there could be nearby replacement splice sites. Novel splice sites could be created by mutations that change sequences to resemble splice sites, which is called cryptic splice site. Many times a transcript is already alternatively spliced, and splice site mutations could change the balance of different splice variants. Certain genetic variants could reduce or even abolish the function of a gene. These variants are called loss of function variants. Loss of function could be caused through various mechanisms, e.g. no synthesis or reduced synthesis of products, altered protein processing, reduced function of protein. Gain of function variation is less common compared to loss of function variation. A complete novel function is possible but unlikely; gain of function variation usually lead to increased gene-product dosage or hyper-activity of the mutat-ed protein, which could be caused by e.g. gene duplication, loss of transla-tional control, increased stability of a protein, or the activation of a protein when it should be shut off.

Page 12: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

12

If a phenotype is manifested when the relevant variant is present in a hetero-zygote, it is called a dominant phenotype. A recessive phenotype requires a variant to be present in both alleles. Gain of function variation usually pro-duces dominant phenotypes. Loss of function variation is likely to cause recessive phenotypes, but could also produce dominant phenotypes depend-ing on how a partial loss of gene function could affect the organism. A dom-inant phenotype could be caused by loss of function variation if half of nor-mal gene product is not sufficient, a phenomenon called haplo-insufficiency. A dominant phenotype could also be caused by loss of func-tion variation when it produces an abnormal protein that interferes with the function of the normal protein; such variants are called dominant negative variants. A pathogenic genetic variant is associated with a disease, but sometimes the condition is not manifested in all individuals carrying the variant. The pro-portion of people with a certain genetic variant who also show the relevant phenotype is called the penetrance. A reduced penetrance could be caused by other genetic factors, environmental factors, and life-style factors. Genet-ic disorders with high penetrance are called Mendelian disorders, while con-ditions with low penetrance are often complex diseases or multifactorial disorders.

Inheritance of pathogenic variants in human Pathogenic variants could be inherited and cause disease in a family pedigree with different patterns. Each child has a 50% chance of inheriting an auto-somal dominant disease from one affected parent, creating a vertical pedi-gree pattern. Autosomal recessive traits usually manifest a horizontal pedi-gree pattern, where the parents are normally unaffected, and the children of affected people are also usually unaffected; the siblings of an affected child has 25% risk of being affected. Autosomal recessive disorders are often seen in children of consanguineous parents. Both autosomal phenotypes normally affect males and females equally. X-linked recessive phenotypes affect mainly males, while the females could be carriers. X-linked dominant phe-notypes have an inheritance pattern similar to autosomal dominant pheno-types, but unique in that an affected father would have all daughters affected and no sons affected. Y-linked phenotypes affect only males and are inherit-ed from father to sons. Mitochondrial disease is only inherited from affect-ed mothers, and the phenotypes are usually very variable.

Page 13: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

13

De novo variants Do novo variants (DNV) are genetic alterations that are present for the first time in one family member. DNV could be caused by germ-line mutations, mutations during early embryogenesis, or somatic mutations. De novo vari-ants are on average more deleterious compared to inherited variations, which have been subjected to evolutionary selection. Although the human mutation rate per cell generation is relative low, factors such as paternal age could lead to an accumulated relatively high mutation rate per human generation (Besenbacher, et al., 2015). Recent family-based next generation sequencing (NGS) studies have shown that germ-line de novo variation is a major cause of neurodevelopmental disorders such as intellectual disability (Erickson, 2016; Helbig, et al., 2016).

Intellectual disability Intellectual disability (ID) is an early onset neurodevelopmental disorder starting before the age of 18. ID is characterized by significantly reduced ability in learning, reasoning, applying practical skills and social skills; pa-tients with ID have difficulties to cope independently in their daily lives (aaidd, 2010; WHO, 2010). ID is defined by an IQ score of less than 70; patients with 50-70 IQ score are diagnosed with mild ID, while severe ID patients have IQ scores lower than 50. The IQ test is designed as a normal distribution with a mean score of 100 points and a standard deviation of 15 points. An ID patient has a score more than two standard deviations below the mean, making the theoretical prevalence of ID to be around 2.3% (Ropers, 2010).

Genetic causes of intellectual disability While the etiologies of ID include a wide range of factors such as infections, injuries, malnutrition, and prenatal alcohol exposure, it is well established that a major cause of ID is genetic variation. Normal development of the human brain requires accurate timing and level of expression of a large number of developmental genes. Genetics is therefore a major part of ID etiology, especially for severe ID cases (Vissers, et al., 2016). Development of techniques to study chromosomes under the microscope led to the identification of the first ID syndrome with a genetic cause through the discovery of chromosome abnormalities in Down syndrome (Lejeune, et al., 1959). Later, Fragile X syndrome was found to be an important cause of ID in males, where expansion of CGG repeats in the FMR1 gene was shown to

Page 14: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

14

be causal (Pieretti, et al., 1991). Chromosome X became widely studied by methods such as linkage analysis in pedigrees with male patients, which led to the identification of a large number of X-linked ID genes (Lubs, et al., 2012; Michelson, et al., 2011). Structural mutations such as copy number variations (CNV) were among the first autosomal causes of ID to be studied. Many de novo CNVs were identified by microarray studies as autosomal-dominant causes of ID (Malcolm, 1996; Shaw-Smith, et al., 2004; Wagenstaller, et al., 2007). Homozygosity mapping using SNP microarray data also led to the identification of recessive mutations as causes of ID. However, identification of dominant single nucleotide mutations in ID pa-tient was challenging until the development of NGS technologies (Musante and Ropers, 2014; Najmabadi, et al., 2011). Variants in certain genes cause a variety of neurodevelopmental phenotypes. Therefore, ID is often found in patients with other cognitive disorders such as epilepsy, autism, and schizo-phrenia (van Bokhoven, 2011).

Next generation sequencing Sporadic point mutations could be a major cause of ID, both because the human genome has a higher per-generation mutation rate than other species, and also because a large number of genes contributes to the development of human brain (Lynch, 2010). With the development of whole exome sequenc-ing (WES), combining the capture of the coding region of human genome and NGS technologies (Figure 3), it became more tractable to search for DNVs in patients with unexplained ID (Gilissen, et al., 2014; Vissers, et al., 2010). WES studies are commonly performed on trio families with two healthy parents and a child with ID that is unexplained by previous tests such as cytogenetic and microarray analysis. By filtering out common variants and inherited variants, candidate DNVs can be identified and then verified by Sanger sequencing. Recent trio-based WES studies have confirmed DNV as a major cause of unexplained severe ID, with a diagnostic yield of 15-30% of case, depending on patient selection, sequencing depth, and other factors (de Ligt, et al., 2012; Hamdan, et al., 2014; Rauch, et al., 2012). WES is not suitable for identifying structural variants and variants in non-coding regions. Pathogenic variants in coding regions can also be missed by WES because of failure of capturing certain exons, or due to low coverage of certain regions of exons such as the edge of exons (Meienberg, et al., 2015).

Page 15: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

15

Figure 3. Whole exome sequencing.

Whole genome sequencing (WGS) aims to sequence all DNA in the genome. In clinical settings, WGS could provide a relatively comprehensive genetic diagnosis (Bowling, et al., 2017; Zahir, et al., 2017). Trio-based WGS was reported to successfully identify DNVs, which were missed by WES. For example, a trio-based WGS study successfully identified 84 coding region DNVs, which were missed by previous WES studies of the same 50 patients (Gilissen, et al., 2014). WGS is limited by a few factors, e.g. repetitive ge-nomic regions that are difficult to map, and the relatively high cost to achieve the sequencing depth required for identifying SNV and small indels (Ajay, et al., 2011; Treangen and Salzberg, 2011).

Interpretation of identified variants in ID patients The interpretation of the pathogenicity of genetic variants is a major chal-lenge, as novel variants are being rapidly identified. Certain types of variants could usually be assumed to cause no transcription or nonsense-mediated decay of transcripts, such as nonsense variants, frame-shift variants, initia-tion codon variants, deletion of exons and certain splice-site variants (Holbrook, et al., 2004). If these null variants are found in a gene known to be associated with ID, and the inheritance pattern is consistent with the es-

Page 16: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

16

tablished pattern for the specific ID syndrome, such variants could be classi-fied as pathogenic (Richards, et al., 2015). The effect of missense variants is much harder to predict. When there is a known pathogenic missense variant, a novel variant that causes the change into the same amino acid could be classified as pathogenic (Richards, et al., 2015). Missense variants in a protein domain that is vital for the protein’s normal function are more likely to be pathogenic, especially if other variants identified in the domain were shown to be pathogenic (David, et al., 2012; Wang, et al., 2012). A missense variant is also more likely to be pathogenic if it is a de novo variant and is identified in a gene known to be associated with a dominant form of ID. The frequency of a variant in control populations could help us to assess the pathogenicity of a novel variant. This information can be found in popula-tion databases such as the 1000 genome project database and the Exome Aggregation Consortium database (Genomes Project, et al., 2015; Lek, et al., 2016). Absence of a variant from these databases is a supporting evidence of the pathogenicity of the variant, while a relatively high allele frequency is evidence supporting that the variant is benign (Richards, et al., 2015). Computational methods could aid in predicting the effect of variations. The algorithms of these methods differs, they could be based on evolutionary conservation of a nucleotide or an amino acid, biochemical effect of an ami-no acid change, predicting the loss or creating of splice sites, mRNA stabil-ity etc. Examples of these methods are Polyphen2 and MutationTaster (Adzhubei, et al., 2010; Schwarz, et al., 2010). Recently, machine learning was developed to predict the pathogenicity of variants in coding and noncod-ing sequence. Combined Annotation Dependent Depletion (CADD), for instance, is a support vector machine trained by contrasting annotated vari-ants that survived natural selection with simulated variants in order to score deleteriousness of variants in human genome (Kircher, et al., 2014). Differ-ent algorithms could have different ways of prediction, if all these algo-rithms agree on the prediction of the effect of a variant; it could be a support-ing evidence for variant pathogenicity interpretation. Another way to prove a variant to be pathogenic is by functional studies. The effect of a variant can be estimated by knocking down the gene or introduc-ing the specific variant to cell lines; or by rescue of patient-derived cells. Based on Clustered regularly interspaced short palindromic repeats (CRISPR)-associated protein 9 nuclease (Cas9) from bacterium Streptococ-cus pyogenes, the CRISPR-Cas9 technology uses RNA as guides to target specific DNA site and is a highly efficient, rapidly developing tool for mod-eling disease-causing mutations (Wade, 2015). Although patient brain cells

Page 17: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

17

are difficult to access, by expressing specific sets of transcription factors, somatic cells can be reprogrammed to generate induced pluripotent stem cells (iPSCs). The iPSCs can then be differentiated into neurons, making it possible to study neuronal development with patient-derived cells (Takahashi and Yamanaka, 2006). Animal models such as mice, zebra fish and fruit flies are also often employed to study the disease biology of causa-tive human mutations. With large genomic similarity with human and high reproductive rate, mice is one of the most common mammalian model sys-tem used to study pathogenic mutations (Zeidler, et al., 2015). Zebra fish and fruit flies have even shorter generation times than mice, and often have enough genetic similarities with human that orthologous can be identified and knocked out or knocked down (Bellen, et al., 2010). Zebra fish have the advantage that a large variety of assays are available in order to study the effect of DNVs on the development of embryos (Davis, et al., 2014). An important indication of disease-causing variation is the observation of a variant in the same gene in other unrelated patients with phenotypic similari-ty. Therefore, it is important to perform WES or WGS on more trios both to identify novel candidate genes and to statistically establish known candidate ID genes as causal.

Page 18: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

18

Methods

Whole Exome Sequencing WES was the main method in Paper I and Paper III. In these two papers we aimed to find pathogenic DNVs in trio families where the parents are unaffected and with no family history of neurodevelopmental conditions, and the child has both ID and epilepsy (Paper I) or ID without epilepsy (Paper III). The families were previously tested with clinical microarray analysis without finding a molecular diagnosis. Genomic DNA was extract-ed from peripheral blood leukocytes of the family members. The coding regions of genomic DNA were enriched prior to sequencing by using SureSelect (Agilent, version 2-5), which is a solution-based human exome capture approach. The capture is achieved by hybridization of sheared ge-nomic DNA and biotinylated oligonucleotide probes, also known as baits, which are complementary to targeted human exonic regions. The targeted regions bound with baits are pull-down by magnetic streptavidin beads, and the captured regions are then amplified before sequencing. Three sequencing techniques were used in these studies, Illumina, Ion Proton, and SOLiD. To map the generated data, different programs were used for the three sequenc-ing techniques: BWA for Illumina, Torrent Suite for Ion Proton, and LifeScope for SOLiD. GATK toolkit was used to identify variants in the sequenced samples. GATK toolkit identifies mismatches and misalignments to the reference genome, estimate the quality of the data and possible biases. To filter out common variants, the identified variants were filtered using our in-house database, as well as the dbSNP database. To identify DNVs, the variants in the patients were filtered against the variants found in the their parents. The identified candidate DNVs were validated by Sanger sequenc-ing.

Whole genome Sequencing WGS was used in Paper IV in order to identify the genetic cause of ID in patients from six consanguineous families. These families had previously been tested using both clinical microarray analysis and WES without finding a casual variant. Genomic DNA of each family member was extracted from peripheral blood leukocytes. The DNA samples were sequenced on Illumina

Page 19: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

19

X10 or Illumina Hiseq platforms. The sequenced reads were mapped to the human reference genome using BWA. GATK toolkit was used to call vari-ants, which were filtered against our in-house database and dbSNP database. The identified inherited variants and DNVs were validated by Sanger se-quencing.

Transcriptome analysis In order to investigate the expression of PIGG gene in the patient family in Paper II, we performed transcriptome analysis using AmpliSeq, which is an amplification-based method that generates expression value for each gene in the genome. In the PIGG study, total RNA was extracted from the fibroblast cells of each family member. The RNA was reverse transcribed using the Ion AmpliSeq Transcriptome Human Gene Expression kit, which enables the measurement of the expression levels of more than 20,000 human genes. The cDNA was amplified using Ion AmpliSeq Transcriptome Human Gene Ex-pression core panel, and after partial digest of the primer sequences, the am-plicons were ligated with adaptors. The Adaptor-ligated amplicons were purified, eluted, and amplified. Quantification of amplicons was performed using Agilent Bioanalyzer with Agilent High Sensitivity DNA kit. After template preparation on Ion Chef system, the samples were sequenced in Ion Proton system. The expression levels of PIGG and other GPI synthesis gene in the patient were compared with that of their parents.

Page 20: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

20

Summary of included papers

Paper I: Mutations in HECW2 are associated with intellectual disability and epilepsy Aim: To identify causative DNVs in patients with both ID and epilepsy Methods: Whole exome sequencing Sanger sequencing ID is defined by an IQ score of lower than 70, and it is assessed that 20%-30% ID patients also have epilepsy (Lhatoo and Sander, 2001). The co-occurring of ID and epilepsy is even more frequent in severe cases of ID (McGrother, et al., 2006). In this study, we used WES to identify pathogenic DNV in trio families, where the patients have both ID and epilepsy, and the parents are health and with no family history of neurodevelopmental disor-ders. We used WES in 39 patient-parent trios, and identified 29 DNV and one pathogenic inherited SNV in the coding regions of 23 trios, which gave a 28% clinical yield. These results add new pathogenic variants in known ID genes, and also provided further description of phenotypes associated with these genes. To investigate the pathogenicity of the validated variants, we calculated CADD scores of each variant. This result showed that on average the identi-fied variants were more deleterious than the variants found in ExAC data-base. We also compiled a list of DNV identified in previously WES studies on neurodevelopmental disorders including ID, epilepsy and autism spec-trum disorder. In this study, we identified a DNV in HECW2 gene. Although HECW2 was previously not associated with ID, we found this gene carried DNVs in six patient in previous neurodevelopmental disorder studies, but zero in control cases. In addition, we constructed an interaction network with all genes where DNVs was identified in this study, as well as previously implicated ID genes. Network analysis showed that HECW2 gene is con-nected to a number of known ID genes. Based on these results combined with previous functional studies on HECW2, we highlight HECW2 gene as a novel candidate ID gene. This result has later been validated in an independ-ent study (Berko, et al., 2017).

Page 21: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

21

Paper II: Reduced cell surface levels of GPI-linked markers in a new case with PIGG loss of function Aim: To investigate how disruption of PIGG gene could cause developmen-tal disorders Methods: Transcriptome sequencing Sanger sequencing Flow Cytometry Glycosylphosphatidylinositol (GPI) is a glycolipid that tethers proteins to the cell surface. These proteins have important catalytic, regulatory and structur-al functions, and are called GPI-anchored proteins (GPI-APs) (Kinoshita, 2014). Disruptive variations in several GPI synthesis genes have been asso-ciated with a group of congenital disorders of glycosylation termed inherited GPI deficiencies (IGD), a congenital disorder with features such as ID, sei-zures and facial dysmorphism (Almeida, et al., 2006; Chiyonobu, et al., 2014; Johnston, et al., 2012; Krawitz, et al., 2012; Krawitz, et al., 2010; Kvarnung, et al., 2013; Maydan, et al., 2011). Using WES, we identified a homozygous nonsense variant in the gene Phos-phatidylinositol Glycan Anchor Biosynthesis Class G (PIGG) in two siblings with ID, early onset seizures, hypotonia, cerebellar hypoplasia, cerebellar ataxia, and minor facial dysmorphology. PIGG is a GPI synthesis gene; it encodes the catalytic component of GPI-EtNP transferase II, which adds an EtNP side chain on GPI (Shishioh, et al., 2005). Variants in PIGG has been reported to cause ID with seizures and hypotonia, however, granulocytes and lymphoblastoid cells of these patients showed normal levels of GPI-AP ex-pression, leaving the mechanism of pathogenesis of PIGG variants unclear (Makrythanasis, et al., 2016). We investigate the expression of PIGG in the family by AmpliSeq, which is a method based on amplification. AmpliSeq provides an expression value for targeted genes. This analysis showed that the expression of PIGG is around half in the patients in comparison to their parents, while all other GIP syn-thesis genes showed equal expression in patients and parents. We also per-formed Sanger sequencing of cDNA from the heterozygous parents, which showed a significantly lowered expression of the mutated allele. These two results indicate that transcripts containing the nonsense variant are likely degraded by nonsense-mediated decay prior to translation.

Page 22: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

22

To investigate the effect of the loss of function variant in PIGG, We used flow cytometry to measure the expression level of GPI-AP on the surface of granulocytes and primary fibroblasts from the two siblings and their parents, as well as independent controls. This analysis showed decreased levels of GPI anchors in patient fibroblasts. The expression levels of GPI-anchored markers such as CD55, CD59, CD73, and CD90 also showed various level of reduction in patient fibroblast. Analysis of patient granulocytes did not show an altered GPI-anchored marker expression. These results suggest the possibility that GPI-AP expressions are affected differently depending on the cell type. The EtNP added by PIGG is often removed in a later step (Shishioh, et al., 2005). However, the EtNP modification could play an important roll for sorting and transporting of GPI-APs from ER to cell surface (Fujita, et al., 2009). Different cell types are potentially differentially affected by PIGG loss of function depending on the distance from the ER to the cell surface, which might explain a GPI reduction in fibroblasts but not in granulocytes. There is a possibility that larger cells such as neurons, where the GPI must be transported a long distance, would be even more severely affected. Our results showed that loss of function of PIGG leads to a cell type specific reduction in GPI-APs, and that it is important to analyze the right type of cells when studying the effect of genetic variants on GPI-AP cell surface levels.

Paper III: Exome sequencing reveals NAA15 and PUF60 as candidate genes associated with intellectual disability Aim: To identify causal de novo variants in ID patient Methods: Whole exome sequencing Sanger sequencing ID is defined by an IQ score lower than 70; it is an early onset neurodevel-opmental disorder that affects 2-3 % of population globally. NGS has been a successful strategy for studies of genetic causes of ID, providing a growing list of validated ID genes and candidate ID genes. WES in trio families has been an especially efficient strategy to identify genetic causes of unex-plained ID cases, yielding clinically significant findings in 20%-30% of pa-

Page 23: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

23

tients previously screened by microarray analysis (de Ligt, et al., 2012; Miller, et al., 2010; Tammimies, et al., 2015). In this study, WES was performed on 28 ID patients in 27 patient-parent trios where the parents were healthy and had no family history of neurodevelopmental conditions. The aim of this study is to identify de novo variants (DNVs) in known and novel ID associated genes. A total of 25 DNVs were found in the exons of 16 patients, and validated using Sanger sequencing. Five variants in the genes ASXL3, FGFR2, PUF60, SETD5, and SMARCA4 are classified to be pathogenic or likely pathogenic according to the standards for interpretation of sequencing variants proposed by ACMG (Richards, et al., 2015), resulting in a 21% diagnostic rate. In order to assess the pathogenicity of the identified DNVs, the combined annotation-dependent depletion (CADD) score was tested for the variations. CADD scores are a relative measurement of pathogenicity of genetic vari-ants, with higher CADD scores indicating higher deleteriousness; a CADD score ≥ 20 indicates that the variant is among 1% most deleterious variants in the human genome (Kircher, et al., 2014). All Five pathogenic or likely pathogenic variants identified in this study had scores >20. Several nonsyn-onymous variants classified as variant of uncertain significance (VUS) also had CADD scores >20, e.g. a variant in gene NAA15 had a high CADD score of 24.2. NAA15 encodes the N(alpha)-acetyltransferase 15, NatA auxiliary subunit, which is responsible for the transmission of signals between neu-rons, a vital mechanism for regulating neuronal differentiation and migration (Fluge, et al., 2002; Sugiura, et al., 2001). We gathered data from 13 previous large WES studies in ID, autism and epilepsy. Three DNVs have been reported previously in patients with neuro-developmental disorders in NAA15, while zero DNV has been reported in healthy controls. We aslo used two metrics reported in the Exome Aggrega-tion Consortium (ExAC) browser, the probability of intolerance (pLI) and the z score (Lek, et al., 2016). NAA15 receive a positive z scores of 3.21 and a pLI score of 1, indicating that NAA15 contain less coding variation than expected, and that heterozygous LoF of NAA15 is not tolerated. Based on these results, we highlight NAA15 as the most interesting new can-didate ID gene. Among the pathogenic variants identified in this study, we highlight the de novo 2 base pair deletion identified in PUF60 gene, as this finding provides further evidence that the heterozygous loss-of-function of this gene causes ID, and that PUF60 is the major contributor to the 8q24.3 deletion phenotype (Dauber, et al., 2013).

Page 24: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

24

Paper IV: Whole genome sequencing of consanguineous families reveals novel pathogenic variants in intellectual disability Aim: To identify causal recessive variants in ID patients from consanguine-ous families Method: Whole genome sequencing Sanger sequencing Intellectual disability (ID) is a common disorder that affects 2-3% of the population (Ropers, 2010). Genomic studies of consanguineous family with ID patients have led to the discovery of many autosomal recessive ID genes. In certain areas with relatively high rate of consanguineous marriages, reces-sive forms of ID common. In this study, we performed WGS on six trio fam-ilies, where the patients had ID and the parents were consanguineous. All six families have been tested in of a previous WES study where no genetic diag-nosis had been obtained. The aims of this study were to find pathogenic var-iations in the patients, and to assess the benefit of using WGS on families that previously went through WES. We identified five autosomal variants and one X-linked variants, which are located in the coding region of PIGN, RTTN, COL27A1, FRMD4A1, FRRS1L, and SMC1A gene, and are validated by Sanger sequencing. Five of these variations were homozygous in the patients and heterozygous in their parents. In a female patient, a de novo two base pair (bp) deletion was identi-fied in SMC1A gene, which is located on the X chromosome. We calculated the combined annotation-dependent depletion score (CADD) in order to predict the deleteriousness of the identified variants. All varia-tions validated in this study scored over 23. Five variants have not been reported in the general population, and one variant has been reported only once in ExAC database and only as a heterozygous variant (allele frequency = 0.000008238). Following the standards for interpretation of variants recommended by the ACMG (Richards, et al., 2015), variants in the genes PIGN, SMC1A, COL27A1, and FRRS1L were determined to be pathogenic or likely patho-genic. The variants in gene FRMD4A1 and COL27A1 were found in a single patient. The patient has developmental delay and skeletal malformation, which are the symptoms reported in association with these two genes.

Page 25: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

25

This study shows that many pathogenic variants are missed using WES, and that WGS is an efficient application to identify genetic causes for ID pro-bands from consanguineous families. One of the advantages of WGS over WES is that the sequence capture step required in WES leads to a more une-ven coverage of exonic regions compared to WGS (Meienberg, et al., 2015).

Page 26: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

26

Concluding remarks and future perspectives

The advancement of NGS technology enables simultaneous and rapid analy-sis of virtually all variants in a genome. This unbiased approach has led to molecular diagnosis for an increasing percentage of ID patients, and contin-ues to further our understanding of the genetic causes of ID. In Paper I and Paper III, we used WES to identify causal variants in ID patient. WES method is based on the idea that the coding region of the ge-nome is medically important, and that limiting the sequencing region could lower cost and increase efficiency in genetic diagnosis. Our study in Paper I yielded a diagnostic rate of 28%, and in Paper III 21%. The findings in these two studies provided novel ID candidate genes, contributed to a better statistical association between causal genes and ID, as well as added novel variants and new phenotype information associated with known ID genes. While these results proves that WES is indeed a valuable method, the ma-jority of the patients in these studies remained undiagnosed; and in some patients, we did not even find a candidate ID gene using WES. There are many factors that could cause WES to miss causal variants, and some of the factors are associated with the exome capture step. By capturing the coding region, we miss most of potentially casual variants in the regulatory part of the genome. The exome capture does not cover all coding region of the ge-nome, and the capture leads to an uneven coverage of the coding regions, most noticeably, the coverage was higher in the center of exons and relative-ly low towards the edges of exons, which was not sufficient for accurate calling of variants. The oligonucleotide probes used as baits to capture exons are designed to match the human reference genome, while the patients could have variants in these sequences and potentially affect the efficiency of the capture. All these problems cause WES to miss variants in certain regions of the genome. However, despite of these flaws and the trend of moving to-wards using WGS, I believe there is still room for improvement in the cap-ture step, and efficiency and cost-effectiveness still have their value. WES and capture-based sequencing in general should continue to have a place in genetic diagnosis of ID. In paper IV, we used WGS to identify pathogenic variants in six consan-guineous family trios, which have been previously tested with WES without obtaining the genetic cause. The result of this study is the identification of potential pathogenic variants in the coding regions of patients from five of

Page 27: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

27

six families. The distinct selection requirement of consanguineous families and clear emphasis in analysis to search for recessive causal variants is one of the reasons that contribute to the high diagnostic rate. These results also reflect the advantage of WGS over WES; without the capture step, WGS provides much more comprehensive information of the genome, and more evenly distributed coverage. However, besides the higher cost, WGS as a genetic diagnostic tool still has a lot of limitations in its current state, and is far from the answer to all problems. These limitations come from many dif-ferent aspects, including the current NGS technologies, analytical algo-rithms, genotype and phenotype annotations, etc. One limiting factor is that the sequencing reads of WGS are relatively short compared to other se-quencing techniques such as Sanger sequencing. One of the advantages of WGS over WES is that theoretically, WGS should be more suitable for de-tecting structural variations, which is an important cause of ID. The short reads of current WGS prevents efficient and accurate identification of struc-tural variants, while efficiency and accuracy is key in clinical settings. WGS is also not accurate in analyzing repetitive regions of the genome; it is diffi-cult to find out the length of the whole repetitive region when this region is longer than the read length. Paralogous genes and duplicated segment of the genome in general are difficult regions to analyze for WGS data; it is chal-lenging to determine the origin of a read that maps to more than one place in the genome. All these problems could be solved with longer sequencing reads, however, current long read sequencing technologies are still expensive and with low throughput. The current method of alignment of reads to the reference genome is also a pitfall. The reference genome is a mixture of se-quence segments from a pool of individuals; it does not contain representa-tive long-range haplotypes, and contains potential risk variants. De novo assembly is the solution to this problem, however, for efficient diagnosis of ID patients, de novo assembly is too computationally intense, and again re-quires longer reads and deep sequence coverage. To achieve a comprehensive understanding of the functional consequences of genomic variations in case of ID patients, we need advances in both se-quencing technologies and analysis algorithms; especially efficient long-read sequencing technologies and improved algorithms for difficult genomic re-gions. Functional studies such as the study of PIGG gene in Paper III could provide deeper understanding of disease mechanisms of pathogenic variants. Population-level WGS studies with accurate genotypic and phenotypic data which are shared globally with consensus standards would eventually lead to improved diagnosis of ID and better care for ID patients.

Page 28: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

28

Acknowledgements

First and foremost, I would like to thank my supervisor Lars Feuk. Your knowledge and leadership has been a great guidance in our projects; your encouragement and patience has made me grow in the field of research. I would also like to thank my co-supervisor Lotta Thuresson. Your knowledge and enthusiasm has been a great help in my whole PhD study. Furthermore, I would like to give thanks to Ammar, you have been super-vising me from the beginning; you taught me not only how to work in the lab, but also how to approach scientific problems. Many thanks to Jonatan and Adnan for helping me with bioinformatics related questions. I am also very thankful for Eva and Mitra for always being helpful during my PhD study. I would like to thank my friends for the good times spend together in work-places, parties and trips. You guys are the reason that Uppsala became a second home to me: Tiger and Kai for all the good times, good food and well executed team fights; Wangshu and Mikke for showing how much you guys love each other in my face; Leikipedia and Hua for sharing infor-mation on cars and gossip, respectively, and sweet dog for the overwhelm-ing cuteness; Di Wu and Yanling for encouraging me to learn snowboard and all the exhilarating times on the slope, can’t wait to ride the mountain with little Bruce; Ding, Hui, little Sammy, and tiny Emmy for inviting me many times to your lovely home, not to mention the fantastic food; LuLu, Yani for the best barbecue ever and cute little Monkey king for playing with me; Cheng Xu a.k.a. Hua Juan for the treacherous board games; Zhirong for the masterfully made disserts, challenging spicy dishes and equally chal-lenging verbal dueling; Junhong Yan for being a good “sis”; Anqi for the wonderful times at Blodstensvägen; Gang Pan for all the helps in the lab and laughs in the lab corridor; Yanyu for pleasant times at lunch and fika in BMC, and Lei Zhang, Yuan Xie, Di Yu, and Chuan Jin for the good times in Rudbeck. I would like to thanks Xiao Lulu for being mine and for being the amazing person you are. You have decorated my house and my life, and made them more cozy and colorful. 最后, 我想感谢爸爸妈妈对我无限的支持和无

Page 29: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

29

微不至的关怀,你们让我无论行走在哪个远方,都能时刻感受到家的温暖。

Page 30: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

30

References

aaidd. 2010. American Association on Intellectual and Developmental Disabilities’ definition manual, Intellectual Disability: Definition, Classification, and Systems of Supports Eleventh edition ed.

Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods 7(4):248-9.

Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH. 2011. Accurate and comprehensive sequencing of personal genomes. Genome Res 21(9):1498-505.

Almeida AM, Murakami Y, Layton DM, Hillmen P, Sellick GS, Maeda Y, Richards S, Patterson S, Kotsianidis I, Mollica L, Crawford DH, Baker A and others. 2006. Hypomorphic promoter mutation in PIGM causes inherited glycosylphosphatidylinositol deficiency. Nat Med. p 846-51.

Beadle GW, Tatum EL. 1941. Genetic Control of Biochemical Reactions in Neurospora. Proc Natl Acad Sci U S A 27(11):499-506.

Bellen HJ, Tong C, Tsuda H. 2010. 100 years of Drosophila research and its impact on vertebrate neuroscience: a history lesson for the future. Nat Rev Neurosci 11(7):514-22.

Berko ER, Cho MT, Eng C, Shao Y, Sweetser DA, Waxler J, Robin NH, Brewer F, Donkervoort S, Mohassel P, Bonnemann CG, Bialer M and others. 2017. De novo missense variants in HECW2 are associated with neurodevelopmental delay and hypotonia. J Med Genet 54(2):84-86.

Besenbacher S, Liu S, Izarzugaza JM, Grove J, Belling K, Bork-Jensen J, Huang S, Als TD, Li S, Yadav R, Rubio-Garcia A, Lescai F and others. 2015. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios. Nat Commun 6:5969.

Bowling KM, Thompson ML, Amaral MD, Finnila CR, Hiatt SM, Engel KL, Cochran JN, Brothers KB, East KM, Gray DE, Kelley WV, Lamb NE and others. 2017. Genomic diagnosis for children with intellectual disability and/or developmental delay. Genome Med 9(1):43.

Chiyonobu T, Inoue N, Morimoto M, Kinoshita T, Murakami Y. 2014. Glycosylphosphatidylinositol (GPI) anchor deficiency caused by mutations in PIGW is associated with West syndrome and hyperphosphatasia with mental retardation syndrome. J Med Genet 51(3):203-7.

Dauber A, Golzio C, Guenot C, Jodelka FM, Kibaek M, Kjaergaard S, Leheup B, Martinet D, Nowaczyk MJM, Rosenfeld JA, Zeesman S, Zunich J and others. 2013. SCRIB and PUF60 Are Primary Drivers of the Multisystemic Phenotypes of the 8q24.3 Copy-Number Variant. American Journal of Human Genetics 93(5):798-811.

Page 31: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

31

David A, Razali R, Wass MN, Sternberg MJ. 2012. Protein-protein interaction sites are hot spots for disease-associated nonsynonymous SNPs. Hum Mutat 33(2):359-63.

Davis EE, Frangakis S, Katsanis N. 2014. Interpreting human genetic variation with in vivo zebrafish assays. Biochim Biophys Acta 1842(10):1960-1970.

de Ligt J, Willemsen MH, van Bon BW, Kleefstra T, Yntema HG, Kroes T, Vulto-van Silfhout AT, Koolen DA, de Vries P, Gilissen C, del Rosario M, Hoischen A and others. 2012. Diagnostic exome sequencing in persons with severe intellectual disability. N Engl J Med 367(20):1921-9.

Erickson RP. 2016. The importance of de novo mutations for pediatric neurological disease--It is not all in utero or birth trauma. Mutat Res Rev Mutat Res 767:42-58.

Fay JC, Wyckoff GJ, Wu CI. 2001. Positive and negative selection on the human genome. Genetics 158(3):1227-34.

Fluge O, Bruland O, Akslen LA, Varhaug JE, Lillehaug JR. 2002. NATH, a novel gene overexpressed in papillary thyroid carcinomas. Oncogene 21(33):5056-5068.

Fujita M, Maeda Y, Ra M, Yamaguchi Y, Taguchi R, Kinoshita T. 2009. GPI glycan remodeling by PGAP5 regulates transport of GPI-anchored proteins from the ER to the Golgi. Cell 139(2):352-65.

Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526(7571):68-74.

Gilissen C, Hehir-Kwa JY, Thung DT, van de Vorst M, van Bon BW, Willemsen MH, Kwint M, Janssen IM, Hoischen A, Schenck A, Leach R, Klein R and others. 2014. Genome sequencing identifies major causes of severe intellectual disability. Nature 511(7509):344-7.

Hamdan FF, Srour M, Capo-Chichi JM, Daoud H, Nassif C, Patry L, Massicotte C, Ambalavanan A, Spiegelman D, Diallo O, Henrion E, Dionne-Laporte A and others. 2014. De novo mutations in moderate or severe intellectual disability. PLoS Genet 10(10):e1004772.

Helbig KL, Farwell Hagman KD, Shinde DN, Mroske C, Powis Z, Li S, Tang S, Helbig I. 2016. Diagnostic exome sequencing provides a molecular diagnosis for a significant proportion of patients with epilepsy. Genet Med 18(9):898-905.

Holbrook JA, Neu-Yilik G, Hentze MW, Kulozik AE. 2004. Nonsense-mediated decay approaches the clinic. Nat Genet 36(8):801-8.

Johnston JJ, Gropman AL, Sapp JC, Teer JK, Martin JM, Liu CF, Yuan X, Ye Z, Cheng L, Brodsky RA, Biesecker LG. 2012. The phenotype of a germline mutation in PIGA: the gene somatically mutated in paroxysmal nocturnal hemoglobinuria. Am J Hum Genet 90(2):295-300.

Kaer K, Speek M. 2013. Retroelements in human disease. Gene 518(2):231-41. Kinoshita T. 2014. Biosynthesis and deficiencies of glycosylphosphatidylinositol.

Proc Jpn Acad Ser B Phys Biol Sci 90(4):130-43. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. 2014. A

general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46(3):310-5.

Page 32: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

32

Krawitz PM, Murakami Y, Hecht J, Kruger U, Holder SE, Mortier GR, Delle Chiaie B, De Baere E, Thompson MD, Roscioli T, Kielbasa S, Kinoshita T and others. 2012. Mutations in PIGO, a member of the GPI-anchor-synthesis pathway, cause hyperphosphatasia with mental retardation. Am J Hum Genet 91(1):146-51.

Krawitz PM, Schweiger MR, Rodelsperger C, Marcelis C, Kolsch U, Meisel C, Stephani F, Kinoshita T, Murakami Y, Bauer S, Isau M, Fischer A and others. 2010. Identity-by-descent filtering of exome sequence data identifies PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat Genet 42(10):827-9.

Kvarnung M, Nilsson D, Lindstrand A, Korenke GC, Chiang SC, Blennow E, Bergmann M, Stodberg T, Makitie O, Anderlid BM, Bryceson YT, Nordenskjold M and others. 2013. A novel intellectual disability syndrome caused by GPI anchor deficiency due to homozygous mutations in PIGT. J Med Genet 50(8):521-8.

Lejeune J, Turpin R, Gautier M. 1959. [Chromosomic diagnosis of mongolism]. Arch Fr Pediatr 16:962-3.

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP and others. 2016. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536(7616):285-91.

Lhatoo SD, Sander JW. 2001. The epidemiology of epilepsy and learning disability. Epilepsia 42 Suppl 1:6-9; discussion 19-20.

Lubs HA, Stevenson RE, Schwartz CE. 2012. Fragile X and X-linked intellectual disability: four decades of discovery. Am J Hum Genet 90(4):579-90.

Lynch M. 2010. Rate, molecular spectrum, and consequences of human mutation. Proc Natl Acad Sci U S A 107(3):961-8.

Makrythanasis P, Kato M, Zaki MS, Saitsu H, Nakamura K, Santoni FA, Miyatake S, Nakashima M, Issa MY, Guipponi M, Letourneau A, Logan CV and others. 2016. Pathogenic Variants in PIGG Cause Intellectual Disability with Seizures and Hypotonia. Am J Hum Genet 98(4):615-26.

Malcolm S. 1996. Microdeletion and microduplication syndromes. Prenat Diagn 16(13):1213-9.

Malkova A, Haber JE. 2012. Mutations arising during repair of chromosome breaks. Annu Rev Genet 46:455-73.

Maydan G, Noyman I, Har-Zahav A, Neriah ZB, Pasmanik-Chor M, Yeheskel A, Albin-Kaplanski A, Maya I, Magal N, Birk E, Simon AJ, Halevy A and others. 2011. Multiple congenital anomalies-hypotonia-seizures syndrome is caused by a mutation in PIGN. J Med Genet 48(6):383-9.

McGrother CW, Bhaumik S, Thorp CF, Hauck A, Branford D, Watson JM. 2006. Epilepsy in adults with intellectual disabilities: prevalence, associations and service implications. Seizure 15(6):376-86.

Meienberg J, Zerjavic K, Keller I, Okoniewski M, Patrignani A, Ludin K, Xu Z, Steinmann B, Carrel T, Rothlisberger B, Schlapbach R, Bruggmann R and others. 2015. New insights into the performance of human whole-exome capture platforms. Nucleic Acids Res 43(11):e76.

Page 33: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

33

Michelson DJ, Shevell MI, Sherr EH, Moeschler JB, Gropman AL, Ashwal S. 2011. Evidence report: Genetic and metabolic testing on children with global developmental delay: report of the Quality Standards Subcommittee of the American Academy of Neurology and the Practice Committee of the Child Neurology Society. Neurology 77(17):1629-35.

Miller DT, Adam MP, Aradhya S, Biesecker LG, Brothman AR, Carter NP, Church DM, Crolla JA, Eichler EE, Epstein CJ, Faucett WA, Feuk L and others. 2010. Consensus Statement: Chromosomal Microarray Is a First-Tier Clinical Diagnostic Test for Individuals with Developmental Disabilities or Congenital Anomalies. American Journal of Human Genetics 86(5):749-764.

Musante L, Ropers HH. 2014. Genetics of recessive cognitive disorders. Trends Genet 30(1):32-9.

Najmabadi H, Hu H, Garshasbi M, Zemojtel T, Abedini SS, Chen W, Hosseini M, Behjati F, Haas S, Jamali P, Zecha A, Mohseni M and others. 2011. Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature 478(7367):57-63.

Pieretti M, Zhang FP, Fu YH, Warren ST, Oostra BA, Caskey CT, Nelson DL. 1991. Absence of expression of the FMR-1 gene in fragile X syndrome. Cell 66(4):817-22.

Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, Regan R, Conroy J, Magalhaes TR, Correia C, Abrahams BS, Almeida J, Bacchelli E and others. 2010. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466(7304):368-72.

Rauch A, Wieczorek D, Graf E, Wieland T, Endele S, Schwarzmayr T, Albrecht B, Bartholdi D, Beygo J, Di Donato N, Dufke A, Cremer K and others. 2012. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 380(9854):1674-82.

Rice AM, McLysaght A. 2017. Dosage sensitivity is a major determinant of human copy number variant pathogenicity. Nat Commun 8:14366.

Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, Grody WW, Hegde M, Lyon E, Spector E, Voelkerding K, Rehm HL and others. 2015. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17(5):405-24.

Ropers HH. 2010. Genetics of early onset cognitive impairment. Annu Rev Genomics Hum Genet 11:161-87.

Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G, Linsley PS, Mao M and others. 2003. Genetics of gene expression surveyed in maize, mouse and man. Nature 422(6929):297-302.

Schwarz JM, Rodelsperger C, Schuelke M, Seelow D. 2010. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods 7(8):575-6.

Page 34: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

34

Shaw-Smith C, Redon R, Rickman L, Rio M, Willatt L, Fiegler H, Firth H, Sanlaville D, Winter R, Colleaux L, Bobrow M, Carter NP. 2004. Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features. J Med Genet 41(4):241-8.

Shishioh N, Hong Y, Ohishi K, Ashida H, Maeda Y, Kinoshita T. 2005. GPI7 is the second partner of PIG-F and involved in modification of glycosylphosphatidylinositol. J Biol Chem 280(10):9728-34.

Soemedi R, Cygan KJ, Rhine CL, Wang J, Bulacan C, Yang J, Bayrak-Toydemir P, McDonald J, Fairbrother WG. 2017. Pathogenic variants that alter protein code often disrupt splicing. Nat Genet 49(6):848-855.

Sugiura N, Patel RG, Corriveau RA. 2001. N-methyl-D-aspartate receptors regulate a group of transiently expressed genes in the developing brain. Journal of Biological Chemistry 276(17):14257-14263.

Takahashi K, Yamanaka S. 2006. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126(4):663-76.

Tammimies K, Marshall CR, Walker S, Kaur G, Thiruvahindrapuram B, Lionel AC, Yuen RKC, Uddin M, Roberts W, Weksberg R, Woodbury-Smith M, Zwaigenbaum L and others. 2015. Molecular Diagnostic Yield of Chromosomal Microarray Analysis and Whole-Exome Sequencing in Children With Autism Spectrum Disorder. Jama-Journal of the American Medical Association 314(9):895-903.

Treangen TJ, Salzberg SL. 2011. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13(1):36-46.

van Bokhoven H. 2011. Genetic and epigenetic networks in intellectual disabilities. Annu Rev Genet 45:81-104.

Vissers LE, de Ligt J, Gilissen C, Janssen I, Steehouwer M, de Vries P, van Lier B, Arts P, Wieskamp N, del Rosario M, van Bon BW, Hoischen A and others. 2010. A de novo paradigm for mental retardation. Nat Genet 42(12):1109-12.

Vissers LE, Gilissen C, Veltman JA. 2016. Genetic studies in intellectual disability and related disorders. Nat Rev Genet 17(1):9-18.

Wade M. 2015. High-Throughput Silencing Using the CRISPR-Cas9 System: A Review of the Benefits and Challenges. J Biomol Screen 20(8):1027-39.

Wagenstaller J, Spranger S, Lorenz-Depiereux B, Kazmierczak B, Nathrath M, Wahl D, Heye B, Glaser D, Liebscher V, Meitinger T, Strom TM. 2007. Copy-number variations measured by single-nucleotide-polymorphism oligonucleotide arrays in patients with mental retardation. Am J Hum Genet 81(4):768-79.

Wang X, Wei X, Thijssen B, Das J, Lipkin SM, Yu H. 2012. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat Biotechnol 30(2):159-64.

Watson CT, Marques-Bonet T, Sharp AJ, Mefford HC. 2014. The genetics of microdeletion and microduplication syndromes: an update. Annu Rev Genomics Hum Genet 15:215-244.

Page 35: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

35

WHO. 2010. Better health, better lives: children and young people with intellectual disability and their families.

Yang X, Coulombe-Huntington J, Kang S, Sheynkman GM, Hao T, Richardson A, Sun S, Yang F, Shen YA, Murray RR, Spirohn K, Begg BE and others. 2016. Widespread Expansion of Protein Interaction Capabilities by Alternative Splicing. Cell 164(4):805-17.

Zahir FR, Mwenifumbo JC, Chun HE, Lim EL, Van Karnebeek CDM, Couse M, Mungall KL, Lee L, Makela N, Armstrong L, Boerkoel CF, Langlois SL and others. 2017. Comprehensive whole genome sequence analyses yields novel genetic and structural insights for Intellectual Disability. BMC Genomics 18(1):403.

Zeidler S, Hukema RK, Willemsen R. 2015. The quest for targeted therapy in fragile X syndrome. Expert Opin Ther Targets 19(10):1277-81.

Page 36: Sequence based identification of genetic variation ...uu.diva-portal.org/smash/get/diva2:1119774/FULLTEXT01.pdf · Zhao, J. 2017. Sequence based identification of genetic variation

Acta Universitatis UpsaliensisDigital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Medicine 1343

Editor: The Dean of the Faculty of Medicine

A doctoral dissertation from the Faculty of Medicine, UppsalaUniversity, is usually a summary of a number of papers. A fewcopies of the complete dissertation are kept at major Swedishresearch libraries, while the summary alone is distributedinternationally through the series Digital ComprehensiveSummaries of Uppsala Dissertations from the Faculty ofMedicine. (Prior to January, 2005, the series was publishedunder the title “Comprehensive Summaries of UppsalaDissertations from the Faculty of Medicine”.)

Distribution: publications.uu.seurn:nbn:se:uu:diva-326283

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2017