
GENOMICS 101
An Introduction To The Genomic Workflow


INTRODUCTION

We have come a very long way since DNA was first isolated back in 1869. As Friedrich Miescher was investigating ‘nuclein’, I doubt he could have imagined the advances that would take place in the subsequent century and a half. The latter half of the 20th century in particular saw tremendous intellects drive forward what would eventually develop into the field of genomics. 1953 saw James Watson and Francis Crick describe the structure of the DNA double helix, and, as technology advanced, the 21st century began with the publication of the human genome in 2001.

Today, genomics represents not only the pinnacle of our understanding of human biology, but also an industry of extraordinary potential, set to impact many aspects of our lives.

The raw requirements to generate, manage, analyse and interpret genomic data have become far more accessible in recent years. This has led to a phenomenal boom not just in data creation, but in our understanding and leveraging of that data. Simply put, there has never been a better time to adopt genomic technology. And that is exactly what so many of you are already doing.

This is where this handbook comes in. Genomics is moving at such a rapid pace that easy-to-understand information explaining how it all works is hard to come by. Here at Front Line Genomics, we want to do what we can to help lower the barrier to adoption of genomic technology. With the help of some of the leading technology companies (our Strategic Partners, Agilent Technologies, Seven Bridges Genomics, and Twist Bioscience, and partners Affymetrix, DNAnexus, New England Biolabs and WuXi NextCODE), we’ve put together the Genomics 101 as a guided tour through the world of human genomics.

The 101 is not intended to offer detailed protocols to take into the lab. Our intention is to help you understand the kinds of questions you can use genomic approaches to ask, the kinds of platforms available to you, and how they work. We’ll help you explore DNA microarrays and Next Generation Sequencing. We’ll familiarise you with the basic chemistries involved in producing sequence information. We’ll then explain how all that data gets turned into something you can use to help improve patients’ lives.

There is a lot that we could have included in this edition, but we tried to focus on the core technology areas that are at the heart of genomics today and shaping the future of the field. We will endeavour to keep these chapters up to date and to add new content as technology and applications progress. For now, we hope you find the handbook an interesting read, and above all useful.



CONTENTS

1 INTRODUCTION

3 GLOSSARY

4 CHAPTER 1: DESIGNING GENOMICS EXPERIMENTS
What is the right method to use for an experiment? In this chapter we guide you through the available tools and the wide range of uses for genomic data.

10 AFFYMETRIX: PIONEERING WHOLE-GENOME ANALYSIS

14 CHAPTER 2: TURNING DNA INTO DATA
Sequencing a strand of DNA involves a series of precise steps that begin with sample preparation. Here we take a look at how different sequencing techniques actually work, and the critical steps involved in preparing a sample for sequencing.

16 AGILENT TECHNOLOGIES: DEEP CLONAL PROFILING OF PURIFIED TUMOR CELL POPULATIONS FROM FFPE SAMPLES

20 NEW ENGLAND BIOLABS: MEETING THE CHANGING DEMANDS OF NGS SAMPLE PREPARATION WITH NEBNEXT

24 CHAPTER 3: ANALYSING DATA
Generating image data from sequencing is just the beginning. From producing raw read data to reconstructing an entire sequence, this chapter focuses on how we build a DNA sequence for study and research.

26 SEVEN BRIDGES GENOMICS: DISCOVERY IN MILLIONS OF GENOMES

30 DNANEXUS: DNANEXUS MADE RIDICULOUSLY SIMPLE

34 CHAPTER 4: NGS INTERPRETATION AND DISCOVERY
So you have reconstructed a DNA sequence; what comes next? Here we explore the process of interpreting sequence data, searching for disease-linked variants, and the computing challenges facing researchers and clinicians.

38 WUXI NEXTCODE: THE OS OF THE GENOME

42 CHAPTER 5: NGS IN THE CLINIC
Perhaps the greatest challenge in clinical genomics is reporting the outcomes of genetic testing to patients. In this chapter we explore how the rise and rapid evolution of NGS tests have affected the nature of clinical reporting, and what the future holds for genomic medicine.

46 AGILENT TECHNOLOGIES: ACMG RECOMMENDATIONS ON SEQUENCE VARIANT INTERPRETATION: IMPLEMENTATION ON THE BENCH NGS PLATFORM

50 CHAPTER 6: GENOME EDITING
Editing the human genome has enormous potential to treat disease, but no topic has attracted more debate and ethical discussion. In this chapter we examine the history of gene editing, the rise of CRISPR, and look to the future of the technology.

54 TWIST BIOSCIENCE: REIMAGINE GENOME SCALE RESEARCH


GLOSSARY

ADAPTORS
Short nucleotide molecules that bind to each end of a DNA fragment prior to sequencing.

ALLELE
One of two or more forms of a gene, or other portion of DNA, located at the same place on a chromosome.

COMPLEMENTARY DNA (cDNA)
A double-stranded DNA molecule synthesised from mRNA, often used during gene cloning.

COMPLEX TRAIT/DISEASE
A trait or disease that is determined by more than one gene, and/or environmental factors. Traits or diseases determined by just one gene are called single-gene traits/diseases.

COPY NUMBER VARIATION
Variation between individuals in the number of copies of a particular region of genomic DNA.

CRISPR/CAS9
A genome-editing technique that uses Cas genes found in bacteria to cut, edit and regulate the genome of an organism.

DNA MICROARRAY
Known DNA sequence fragments attached to a slide or membrane, allowing for the detection of specific sequences in an unknown DNA sample.

EXOME
The regions of an organism’s genome that code for proteins. Coding regions of the genome are called exons, while non-coding regions are called introns.

GENE EXPRESSION
The process by which the information from a gene is used to create a functional protein product.

GENE PANEL
A selection of genes relevant to a particular condition that can be sequenced in order to make a clinical diagnosis.

GENOME
The full genetic sequence of an organism, including both coding and non-coding regions.

GENOME-WIDE ASSOCIATION STUDY (GWAS)
A study that evaluates the genomes of a large number of participants, looking for correlations between genetic variation and particular traits or diseases.

GENOTYPE
The complete genetic make-up of an organism.

GERMLINE CELLS
Cells that will go on to become sperm and ova. Mutations in the germline can be transmitted from parent to offspring.

MESSENGER RNA (mRNA)
A single-stranded template for creating specific proteins, created during transcription.

MUTATION
A DNA sequence variation that differs from the reference sequence. This can be a SNP, an insertion, or a deletion of base pairs in the sequence.

NEXT GENERATION SEQUENCING (NGS)
A range of DNA sequencing technologies that can sequence millions of DNA fragments at once, creating larger datasets and more efficient results.

PHENOTYPE
The physical and behavioural manifestations of an organism’s genotype.

POLYMERASE CHAIN REACTION (PCR)
A technique used to replicate a particular stretch of DNA rapidly and selectively.

PRECISION/PERSONALISED MEDICINE
Customised or tailored healthcare solutions based upon a patient’s genome, environment, lifestyle, etc.

READS
In sequencing, a “read” refers to a data string of A, T, C and G bases from the sample DNA. Different sequencing techniques generate different length reads.

REFERENCE GENOME
A fully sequenced and assembled genome that acts as a template for reconstructing new sequences.

SINGLE NUCLEOTIDE POLYMORPHISM (SNP)
‘Snips’ are single base-pair mutations at specific locations in the genome, and are one of the most common forms of genetic variation.

SOMATIC CELLS
Cells that are not destined to become reproductive cells. Mutations in somatic cells are not passed on from parent to offspring.

TRANSCRIPTION
The process of creating a messenger RNA (mRNA) from a DNA sequence.

TRANSLATION
The process of creating a protein chain, composed of amino acids, from a strand of mRNA.

WHOLE EXOME SEQUENCING (WES)
The process of sequencing the entire coding portion of an individual genome.

WHOLE GENOME SEQUENCING (WGS)
The process of sequencing all of an individual’s DNA.


CHAPTER 1: DESIGNING GENOMICS EXPERIMENTS


INTRODUCTION
In this first chapter of the Genomics 101, we take a look at the broad range of options available to anyone looking to generate, or make use of, genomic data. Genomic data can range from the whole genome, to just the exome, to a subset of genes, down to a single gene. The data can take the form of DNA sequence, single nucleotide polymorphisms, copy number variation, or structural variation. Beyond reading the genome, we can now also generate profiles of the genes expressed, gaining another layer of information that can help us understand disease at a cellular level by exposing novel transcripts, splice variants, and non-coding RNAs, which can become valuable biomarkers for diagnostic tests. Before we look into the different methods and platforms available to you, it is important to take a step back and consider what it is that you are trying to achieve.

Since the Human Genome Project was completed more than a decade ago, different whole genome analysis technologies have become available. Whole genome analysis using microarrays has been the traditional workhorse for gene expression profiling as well as genotyping applications. Microarrays are the perfect tool for scientists in and out of the clinic, due to their affordability, consistency, and quick, easy, and standardised data analysis.

Relatively new next generation sequencing (NGS) methods have improved dramatically over the past few years. You may have seen graphs showing how sequence output has risen, and the cost of sequencing has dropped, faster than the rate described by Moore’s Law (the long-term trend in the computer industry in which compute power doubles roughly every two years). The dramatic increases in sequence output over the past 10 years have now made it possible to consider sequencing projects that were impossible, or at least completely unaffordable, prior to these advances. January 2014 brought the news that Illumina had delivered the first commercially available $1,000 genome with their HiSeq X Ten sequencer. Although that figure includes the cost of reagents and sample preparation, there is still an argument that, to truly break the $1,000 barrier, the cost must also include interpretation of the data produced, as well as storage of the resulting data.
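To put “faster than Moore’s Law” into rough numbers, the short sketch below compares a hypothetical Moore’s-Law cost projection with an approximate observed figure; the start and end costs are illustrative round numbers, not precise market prices.

```python
# Illustrative only: compare a Moore's-Law halving of cost every two years
# with the (approximate) observed fall in the cost of sequencing a genome.
start_year, start_cost = 2007, 10_000_000   # roughly $10M per genome (approx.)
end_year, observed_cost = 2015, 1_500       # roughly $1,500 per genome (approx.)

years = end_year - start_year
moores_law_cost = start_cost / 2 ** (years / 2)   # halve every two years

print(f"Moore's Law projection for {end_year}: ${moores_law_cost:,.0f}")
print(f"Approximate observed cost:             ${observed_cost:,}")
# The observed cost is several hundred times lower than the projection implies.
```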

The improvements in sequencing technology have led to a flood of genomic information. This has greatly increased our understanding of the genome and the roles of specific genes. As well as advancing research, this is also leading to the development of more powerful DNA microarrays leveraging the growing wealth of identified gene variants.

Before you get too excited about NGS, consider whether it is really the best option for what you want to do. Although much cheaper than it used to be, NGS is still relatively expensive, and requires considerable IT capabilities (as we’ll see in the Analysis chapter). If you are undertaking discovery, or hypothesis-free, research, have sufficient funds, and have the necessary infrastructure to perform the sequencing and analyse the data, NGS may well be the right option for you. However, if you are undertaking a hypothesis-driven study or a large-sample-size epidemiology study, or are working with difficult samples such as FFPE or limited amounts of material such as a fine needle biopsy, you can leverage genomic data much more cost-effectively by using well-designed microarrays.

Page 8: Genomics · Genomics 101 / 3 GENOMICS GLOSSARY ADAPTORS A short nucleotide molecule that binds to each end of a DNA fragment prior to sequencing. ALLELE One of two forms of a gene,

6 / Genomics 101

DESIGNING GENOMICS EXPERIMENTS

As we guide you through your options, try to keep your end result in mind. What kind of data do you need to answer your question most efficiently? You can then begin to make a judgement on platforms based on your operating restrictions:

• What is the question that you are asking?
• What is your budget?
• What are your overall accuracy and reproducibility requirements?
• How many samples do you have access to and need to analyse?
• What are your turnaround time requirements?
• Is a large, active user-community and support network important?
• How user-friendly do you need your platform to be?
• Do you have the bioinformatics support for translating raw data to answers for your biological questions?

NEXT GENERATION SEQUENCING (NGS)
Next Generation Sequencing (NGS) is a catch-all term used to describe the current generation of sequencing technologies. Depending on how much of the genome you need, there are three different approaches, each with its own pros and cons.

WHOLE GENOME SEQUENCING (WGS)
As the name suggests, this lets you look at the full genome. This is the primary approach being used by the Precision Medicine Initiative (USA), the 100,000 Genomes Project (UK), and several other national sequencing projects. A considerable amount of work is being carried out to better understand how to integrate WGS into a healthcare system at a national scale. For now, WGS is still mainly a powerful research tool, with the notable exception of the pioneering work of Steven Kingsmore and colleagues, who are using it routinely in NICU (neonatal intensive care unit) settings. Whole genome studies are the basis of most, if not all, genomic applications. WGS can provide you with a very full picture of one individual, or can help you identify disease-causing variants within a population. For hypothesis-free research and discovery, WGS will give you a lot to work with, and ensure you don’t miss anything that may help you better understand your disease or trait of interest.

WHOLE EXOME SEQUENCING (WES)
While the whole genome is undoubtedly very useful, much of the information it contains may be irrelevant to your application. There are several instances in which looking at the exome (the protein-coding DNA) is much more practical. As the exome accounts for less than 2% of the genome, it is considerably cheaper to read and generates much more manageable volumes of data.
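As a back-of-the-envelope illustration of that difference, the figures below (genome and exome sizes, and the 30x/100x coverage depths) are typical round numbers rather than fixed specifications:

```python
# Rough raw-data estimate: bases of sequence needed = target size x mean coverage.
GENOME_BP = 3_200_000_000   # ~3.2 Gb human genome
EXOME_BP  =    50_000_000   # ~50 Mb of targeted coding sequence (illustrative)

def gigabases(target_bp, coverage):
    return target_bp * coverage / 1e9

wgs = gigabases(GENOME_BP, 30)     # a common whole-genome depth
wes = gigabases(EXOME_BP, 100)     # exomes are usually sequenced deeper
print(f"WGS: ~{wgs:.0f} Gb of raw sequence; WES: ~{wes:.0f} Gb "
      f"(~{wgs / wes:.0f}x less raw data)")
```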

WES is particularly useful when trying to map rare variants in complex disorders. Disease-causing variants with large effects will typically be found within the exome. Complex disorders are governed by multiple genes, so you will typically need a very large sample size to discover variants of interest; in this instance, WGS is not a practical option.

By contrast, Mendelian disorders typically have far fewer causative variants behind the condition. Selection pressures are likely to make these variants extremely rare, so they may be missed by standard genotyping assays, but they still do not require WGS to detect.

WES can also be used as a diagnostic tool. Ambry Genetics was the first CLIA-certified laboratory to offer exome sequencing for clinical diagnostic purposes. WES is also now being used clinically at centres such as Washington University in St. Louis.

TARGETED GENE PANELS
In a clinical setting, WES still presents a few challenges. The cost can still be relatively high, particularly if you are sequencing a child and both parents. Perhaps the biggest drawback is a long turnaround time. As we will see in later chapters, sequencing is only the start of the journey; returning full results can take months. WES is likely to uncover multiple variants, so identifying the causal variant of interest can be a difficult task.

One of the contentious issues surrounding WGS and WES today is ‘incidental findings’. This is when sequencing uncovers potentially medically relevant or actionable results not related to the indication you were originally testing for. Is it the patient’s right to receive all information, or to decline results? Or is it the clinician’s duty to inform the patient regardless? We will cover this in more detail in the Genomics In The Clinic chapter, later on.

One way to avoid these issues is to use gene panels. Rather than sequencing the whole exome, you can choose to sequence a ‘panel’, or selection, of genes relevant to a particular phenotype. This will certainly bring operating costs down and simplify analysis, but it does present us with a new problem: panel design.

WGS and WES can present us with too much information; gene panels can present us with too little. One testing laboratory’s design may differ from another’s, because the design itself is based on published results and how we choose to interpret them. What is deemed relevant to an indication may vary from person to person, which can lead to a lack of uniformity in gene panel design. There are several commercially available panels for the previously mentioned sequencing platforms, and you can of course design your own to answer your specific questions.

This raises an interesting question. Is it better to have thousands of different panels optimised to answer specific questions, or to have a single CLIA-certified exome that can be used to answer all of those same questions? Mayo Clinic in the USA has pursued the gene panel route to simplify reimbursement issues, while Washington University has taken the bolder WES strategy.



NGS PLATFORMS
These platforms all have their advantages and disadvantages. How heavily those sway your decision should come down to your own set of parameters. So do investigate them all, and try to find first-hand testimonials from existing users.

The following are the most common NGS platforms available today:

454 Life Sciences: This Roche company produced high-throughput sequencing machines based on its pyrosequencing technology. While the cost per run is relatively expensive, the machines are quite fast and produce longer reads than most, at around 700 base pairs. However, Roche is no longer supporting this platform, as other technologies have superseded it.

Illumina: Sequencing by synthesis is by far the most popular way to sequence today. Illumina have a dominant market share due to the cost-effectiveness of sequencing on their platforms, and the potential for particularly high yields. The company produces a wide range of NGS machines, from the smallest, the MiniSeq (capable of 7.5 Gb of sequence per run), all the way up to the HiSeq X Ten (the platform on which the cost of WGS has dropped below $1,000 if you sequence at high enough volume). The main drawbacks here are the initial cost of the equipment itself, the potentially short lifespan of the instrumentation, and the uneven coverage associated with short reads.

Ion Torrent Sequencing: The ion semiconductor method of sequencing proved very popular when it first hit the market. The sequencer itself tends to be very competitively priced and is exceptionally fast. This helped it find a home in several diagnostic laboratories, where quick turnaround times are crucial and absolute base-pair accuracy less so.

This platform is not viable for WGS, as its output is significantly below what is needed for complex genomes. However, it is ideally suited to the small gene panels and exome-based analyses we discussed in the previous section.

Pacific Biosciences: Using single-molecule real-time (SMRT) sequencing, this platform is known for producing long reads (up to 60,000 base pairs with their latest machine). This gives you a considerable advantage if you want to identify structural variations, and increases your coverage in difficult-to-amplify areas of the genome. However, the Pacific Biosciences equipment does come in at a higher cost than most, and doesn’t quite have the same throughput as some of the other platforms available.

Another advantage of this platform is its potential ability to sequence modified bases (such as 5-methylcytosine). Currently this is not a viable platform for whole genome sequencing simply because of its limited output. However, if cost is no object, it is a good option for examining even the most complex genomes. Where it particularly shines is in scaffolding genome assemblies built from other, short-read technologies.

Looking Ahead
Developing sequencing technologies could offer alternatives to the existing NGS platforms and potentially, down the road, even replace some of them. The most popular among these is nanopore sequencing.



Oxford Nanopore Technologies are recognised as the leading company in this field. Their technology operates on the principle of reading the DNA molecule directly by passing it through a nanopore and measuring the effects on ions and electrical current flowing through the pore.

This effectively allows you to read DNA in real time, and would be considerably cheaper and faster than current methods. Oxford Nanopore’s MinION is roughly the size of a USB stick, and plugs directly into a computer. Unfortunately, sequence accuracy on all current single-molecule sequencers is quite poor, so to obtain high sequence accuracy you need considerably more coverage of each base to reach a decent consensus sequence. If sequencing accuracy improves, and the output of this platform can be increased by at least several orders of magnitude, it could help bring NGS into the clinic to quickly identify a person’s disease risk or pharmacogenomic profile.
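To see why extra coverage compensates for a high raw error rate, here is a minimal sketch that treats each base call as an independent coin flip and asks how often a simple majority vote across reads would be wrong; the 10% error rate and the depths are illustrative, and real consensus callers are far more sophisticated than this.

```python
from math import ceil, comb

def majority_vote_error(per_read_error, depth):
    """Probability that at least half of `depth` independent reads are wrong
    at a position - a crude stand-in for a wrong consensus call."""
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error)**(depth - k)
        for k in range(ceil(depth / 2), depth + 1)
    )

# Illustrative: a 10% raw error rate improves rapidly with read depth.
for depth in (1, 5, 15, 31, 61):
    print(f"{depth:>3}x coverage -> consensus error ~ {majority_vote_error(0.10, depth):.1e}")
```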

DNA MICROARRAYS
A DNA microarray is a technology in which known DNA sequences are either deposited or synthesised onto a surface. This allows us to detect the presence, and concentration, of sequences of interest.

The turn of the century saw a dramatic increase in our understanding of the human genome. New production methods, and fluorescent detection, were adapted to build modern microarrays. While DNA arrays have been around in early forms since the 1970s, it was only in the 1990s that microarrays started to become the invaluable tool we know them as today.

Microarrays can be used for whole-genome as well as targeted analysis. Technological advances over the years have enabled affordable, fast turnaround of very large epidemiology studies run by big biobanks, provided robust assays for challenging samples such as FFPE and single cells, and secured regulatory clearance for diagnostic use in the clinic. Key applications for microarrays include gene expression profiling of mRNA and miRNA, genotyping and copy number variation analysis, and high-resolution chromosomal abnormality detection.

Gene Expression: RNA is isolated and enriched from the sample. It can then be amplified and labelled, ready to be hybridised to a microarray. After a wash to remove unbound material, the microarray is scanned to measure fluorescence at each spot. Depending on how many genes you are measuring, and across how many samples, you will likely find yourself with a multi-coloured grid: a heat map. The image requires processing to convert the colour and intensity of fluorescence into numbers. This then allows you to see which genes are expressed in which samples, and at what levels.
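As a sketch of that image-to-numbers step, suppose the scanner software has already given you foreground and background intensities for each spot (the tiny 4-gene, 3-sample matrix below is invented for the example); the values are typically background-corrected, log-transformed, and normalised so the samples are comparable:

```python
import numpy as np

# Hypothetical per-spot scanner intensities: rows = probes/genes, columns = samples.
foreground = np.array([[1200.,  950., 15000.],
                       [ 300.,  320.,   310.],
                       [5000., 4800.,  5200.],
                       [ 800., 2100.,   790.]])
background = np.full_like(foreground, 100.0)

signal = np.clip(foreground - background, 1.0, None)   # simple background correction
log_expr = np.log2(signal)                              # log2 stabilises the variance
# Naive median normalisation: give every sample (column) the same median signal.
log_expr -= np.median(log_expr, axis=0) - np.median(log_expr)

print(np.round(log_expr, 2))   # the numeric matrix behind the heat map
```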

If you are assaying multiple samples at the same time, you can begin to tease apart meaningful information, such as differential gene expression levels between different sample types. Calculating similarities in gene expression across samples allows you to put them into hierarchical clusters. Clustering genes and samples can help build up an interesting picture of the genetics and biology of your indication of interest. With advances in microarray technology, such as the whole-transcriptome arrays from Affymetrix, it is now possible to measure not only gene-level differences but also exon-level differences and alternative splice variants. With the standardisation of microarray gene expression data analysis, one can easily derive meaningful biological information in weeks rather than months.
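A minimal sketch of that clustering step, assuming a log-expression matrix like the one above with genes in rows and samples in columns; correlation distance and average linkage are common choices, but by no means the only ones:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
pattern_a, pattern_b = rng.normal(size=(2, 20))
# Toy data: six samples, three noisy copies of each of two expression patterns.
expr = np.column_stack(
    [pattern_a + 0.3 * rng.normal(size=20) for _ in range(3)]
    + [pattern_b + 0.3 * rng.normal(size=20) for _ in range(3)]
)

# Cluster samples by how similar their expression profiles are.
sample_dist = pdist(expr.T, metric="correlation")    # 1 - Pearson correlation
sample_tree = linkage(sample_dist, method="average")
print(fcluster(sample_tree, t=2, criterion="maxclust"))  # two clusters of three samples

# The same steps applied to the rows cluster the genes, giving a two-way heat map.
```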

Genotyping: As NGS costs continue to fall, this is still one area in which microarrays remain the dominant, and much more cost-effective, technology. The most common methods of detecting single nucleotide polymorphisms (SNPs) are allele discrimination by hybridisation, allele-specific extension and ligation to a ‘bar-code’, and extending arrayed DNA across the SNP in a single-nucleotide extension reaction. Affymetrix and Illumina both produce highly effective SNP genotyping arrays that have been used extensively around the world. As well as being able to detect over 1 million different human SNPs with high degrees of accuracy and reproducibility, the arrays can also be used to detect copy number variations. While SNPs are crucial biomarkers, copy number variations (structural variations that leave a cell with fewer or more copies of a certain section of DNA than usual) have also been associated with susceptibility and resistance to some diseases. Large biobank studies, such as UK Biobank and the Million Veteran Program in the US, are using microarrays to generate genotyping data to understand the relationships between genes, lifestyle, environment, and medical history across 500,000 and 1,000,000 volunteers, respectively. The resulting databases are being used by researchers to understand diseases, with the goal of better diagnosis and treatment. Genotyping arrays are also what direct-to-consumer companies such as 23andMe use to generate customer profiles.
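To illustrate the idea behind a genotype call (this is a deliberately crude toy, not any vendor’s algorithm), a SNP can be called from the two allele-specific intensities by asking whether the signal is mostly allele A, mostly allele B, or a mixture; all intensities and thresholds below are invented for the example:

```python
import math

def call_genotype(intensity_a, intensity_b, het_band=(0.35, 0.65)):
    """Crude genotype call from allele A and allele B signal intensities.
    theta near 0 -> AA, near 1 -> BB, in between -> AB (heterozygous).
    Real array pipelines instead fit cluster models across many samples."""
    theta = (2 / math.pi) * math.atan2(intensity_b, intensity_a)
    if theta < het_band[0]:
        return "AA"
    if theta > het_band[1]:
        return "BB"
    return "AB"

# Hypothetical probe intensities for three samples at one SNP.
for a, b in [(5200, 300), (2600, 2400), (250, 4900)]:
    print(call_genotype(a, b))   # AA, AB, BB
```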


PIONEERING MICROARRAY ANALYSIS
Realizing the vision of precision medicine

LATE 1980s TO EARLY 1990s
Stephen P.A. Fodor and a team of scientists integrated semiconductor manufacturing techniques with combinatorial chemistry on a small silicon chip to enable the collection of vast amounts of genetic data that would help scientists learn the biology of disease at the molecular level. Their findings, published in Science in 1991, launched the microarray industry and forever changed how genomics studies are done.

1990s AND BEYOND
The first commercial GeneChip® array contained ~18,000 oligonucleotides. Today a GeneChip array can contain up to 6.9 million oligonucleotides, enabling the analysis of the entire human genome on one array. Affymetrix’s GeneChip array measures only 1.5 x 3 inches; the silicon chip contains 6,903,680 probes.

ADVANCING CLINICAL RESEARCH
1996: The first GeneChip array is commercialized for the identification of novel mutations associated with drug resistance in HIV patients.1
2002: An early-stage GeneChip expression array is used in a study identifying 95 genes whose expression could be used to predict sensitivity of leukemic cells to STI571, a promising agent for treatment of advanced Philadelphia-chromosome-positive (Ph+) acute lymphoblastic leukemia.2
2004: A small-scale genotyping study using a GeneChip array pinpoints a genetic mutation associated with sudden infant death with dysgenesis of the testes in males among the Old Order Amish.3
2008: Affymetrix’s DMET™ array is used in a study finding a DNA variant in CYP4F2, a gene for drug-metabolizing enzymes, that affects warfarin dose requirements.4
Today: More than 70,000 publications cite Affymetrix’s technology, many in clinical studies contributing to improved diagnosis and treatment.

EMPOWERING BIOBANKS TO DISCOVER THE INTERPLAY OF GENES, ENVIRONMENT, AND LIFESTYLE
2005: Collaboration between the Wellcome Trust and Affymetrix to identify genetic associations in 7 common diseases. The Wellcome Trust Case Control Consortium (WTCCC) genotypes 17,000 samples to identify genetic variants associated with type 1 and 2 diabetes, coronary heart disease, hypertension, bipolar disorder, rheumatoid arthritis, and Crohn’s disease.
2009: Collaboration between UCSF, Kaiser Permanente, and Affymetrix to enable genomic studies of common diseases. 100,000 samples are genotyped in 15 months and the database is made available to qualified scientists. Among recent studies published are the identification of potential biomarkers to improve prostate cancer screening5 and the finding of a genetic susceptibility to Staphylococcus aureus that could pave the way to new treatment and prevention of antibiotic-resistant infections like MRSA.6
2013: Collaboration between UK Biobank and Affymetrix to genotype 500,000 volunteers for a prospective study. The resulting biomedical database is made available to scientists worldwide by UK Biobank. Many scientists have already published their discoveries on lung function, smoking behavior, neurobiological disorders, and many other conditions.
2015: The Million Veteran Program builds a huge database of genotyping data using Affymetrix’s platform. The US Department of Veterans Affairs Office of Research and Development funds the Million Veteran Program, resulting in one of the world’s largest medical databases that includes genotyping data. The data collected from one million veteran volunteers furthers scientists’ understanding of how genes affect health, especially military-related illnesses.

ENABLING THE TRANSLATION OF SCIENTIFIC DISCOVERIES TO THE CLINIC (2004 TO TODAY)
Roche and Affymetrix launch the first FDA-cleared microarray-based test. The AmpliChip® CYP450 Test detects cytochrome P450 genetic variations that impact drug metabolism. The test results can help physicians individualize patient treatment, a first for precision medicine.
A new test reports the probable tissue of origin for 15 common types of tumors. Pathwork Diagnostics introduces the Tissue of Origin Test (now offered by Cancer Genetics, Inc.), the first FDA-cleared test based on a customized gene-expression array from Affymetrix.
Translating biomarkers from lab to clinic. Affymetrix collaborates with diagnostic companies, such as Almac, GenomeDx, Lineagen, PathGEN Dx, SkylineDx, and Veracyte, turning their biomarker signatures into microarray-based tests for improved diagnosis and treatment.
Ariosa, now part of Roche, selects Affymetrix’s platform for noninvasive prenatal testing development. Affymetrix supplies a custom array for the development of a noninvasive prenatal test that is faster and more accurate than next-gen sequencing.
Affymetrix launches the first test to help diagnose postnatal developmental delay. The CytoScan® Dx Assay is the first and only FDA-cleared, whole-genome, microarray-based genetic test to aid the increase in diagnostic yield for postnatal developmental delay and intellectual disability.

What were once visions, such as precision medicine, are becoming realities as Affymetrix continues to bring new products to laboratories and clinics worldwide.

References
1. Kozal M. J., et al. Nat Med 2(7):753-59 (1996).
2. Hofmann W. K., et al. Lancet 359(9305):481–86 (2002).
3. Puffenberger E. G., et al. Proc Natl Acad Sci USA 101(32):11689–694 (2004).
4. Caldwell M. D., et al. Blood 111(8):4106–12 (2008).
5. Hoffmann T. J., et al. Cancer Discov 5(8):878–91 (2015).
6. DeLorenze G. N., et al. J Infect Dis 213(5):816–23 (2016).

© 2016 Affymetrix, Inc. All rights reserved. Unless otherwise noted, Affymetrix products are For Research Use Only. Not for use in diagnostic procedures. P/N COR06694-1



Chromosomal Microarray Analysis (CMA): CMA is increasingly used to detect chromosomal abnormalities, including submicroscopic ones that are too small to be detected by conventional karyotyping. CMA is now recommended as a first-tier test in the genetic evaluation of infants and children with unexplained developmental delay and intellectual disability. High-resolution chromosomal microarrays, containing both SNP and copy number probes, can elucidate allelic imbalances and identify LOH/AOH that can be associated with uniparental disomy or consanguinity, both of which increase the risk of recessive disorders. In 2014, Affymetrix introduced the CytoScan® Dx Assay, the first-of-its-kind FDA-cleared and CE-marked whole-genome postnatal blood test to aid in the diagnosis of developmental delay, intellectual disabilities, congenital anomalies, or dysmorphic features in children.
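To make the LOH/AOH idea concrete, here is a minimal sketch (not the CytoScan or any other vendor’s algorithm) that flags long runs of consecutive homozygous SNP calls on one chromosome; the genotype encoding and the length thresholds are illustrative assumptions, and real pipelines also use B-allele frequencies and copy number signal.

```python
def find_roh(snps, min_snps=50, min_length_bp=2_000_000):
    """snps: list of (position_bp, genotype) sorted by position, where genotype
    is "AA"/"BB" (homozygous) or "AB" (heterozygous).
    Returns (start_bp, end_bp, n_snps) for each sufficiently long run of
    consecutive homozygous calls - a crude proxy for an LOH/AOH region."""
    runs, run = [], []
    for pos, genotype in list(snps) + [(None, "AB")]:   # sentinel flushes the last run
        if genotype in ("AA", "BB"):
            run.append(pos)
            continue
        if len(run) >= min_snps and run[-1] - run[0] >= min_length_bp:
            runs.append((run[0], run[-1], len(run)))
        run = []
    return runs

# Toy chromosome: heterozygous calls flanking a ~3 Mb homozygous stretch.
calls = ([(i * 10_000, "AB") for i in range(100)]
         + [(1_000_000 + i * 10_000, "AA") for i in range(300)]
         + [(4_000_000 + i * 10_000, "AB") for i in range(100)])
print(find_roh(calls))   # [(1000000, 3990000, 300)]
```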

NOT ALL MICROARRAYS ARE CREATED EQUAL
Here are the different ways a microarray can be manufactured, and the pros and cons of each.

In-situ Synthesised Arrays: These arrays were pioneered by the founder of Affymetrix (Fodor et al.) in the 1990s. They rely on photolithography, a method that uses UV masking and light-directed combinatorial chemical synthesis on a solid support to selectively synthesise probes directly on the surface of the array, one nucleotide at a time per spot. The technology saw early use in detecting mutations in the reverse transcriptase and protease genes of the HIV-1 genome, and in measuring variation in the human mitochondrial genome. Since then, these kinds of arrays have been developed for a wide range of applications in gene expression analysis, genotyping, copy number analysis, and chromosomal abnormality detection.

In 1996, Blanchard et al. published a method adapting inkjet printer heads to microarray manufacture. Picoliter volumes of nucleotides are printed onto the array surface in repeated rounds of base-by-base printing that extend the length of specific probes. The method was commercialised by Rosetta Inpharmatics and eventually licensed to Agilent Technologies.

The shorter probes and the manufacturing method used by Affymetrix produce much higher-density microarrays, which are much better suited to whole-transcriptome analysis (including splice variants), whole-genome genotyping, and SNP+CNV analysis. The Agilent designs allow the synthesis of longer probes at lower density, limiting the applications to expression profiling and CGH.

Self-Assembled/High-Density Bead Arrays: Originally developed by David Walt’s group at Tufts University in the late 1990s and early 2000s, this technology was licensed to Illumina. DNA is synthesised onto small silica beads, which are deposited onto the ends of fibre-optic bundles or onto silicon slides pitted with microwells for the beads. Different types of DNA are synthesised on different beads, and the beads are randomly assembled into an array. Hybridising and detecting short, fluorescently labelled oligonucleotides in a sequential series of steps allows the beads to be decoded.

The simplicity of the assay and the lower bioinformatics burden make microarrays not only a very powerful way to leverage today’s genomic knowledge, but also much cheaper and faster to turn around than NGS methods. However, to design a microarray you have to know the sequence of the genome, whereas with NGS you can sequence and characterise both known and unknown genomes.

SUMMARY
This chapter is not intended to explain how sequencing or microarrays work. It is intended to show you that you have a range of options available to you. At the start of the chapter we asked you to keep a few questions in mind. Principally, what kind of data do you need to be able to answer your question most efficiently? If you are looking for novel or rare variants in an individual, whole genome sequencing may well be the way to go. If you want to explore known regions of interest, then just take a look at the exome, or get a bit more specific with a targeted approach. Maybe you want to genotype a large population to identify associations? A genotyping array is going to be much cheaper and easier to manage than NGS. Pick the technology that works for you and your operating restrictions, and which will produce the data you need.

In the next chapter we take a look at the chemistry involved in turning your DNA into data. Now that you’ve decided what you want to do with your sample, you’ll need to know how to prepare it for the right platform.


CHAPTER 2: TURNING DNA INTO DATA


INTRODUCTION
Over the last twenty years, fundamental advances in sequencing technology have brought us a long way from the 10 years and 10 billion dollars spent on the Human Genome Project. Next Generation Sequencing (NGS) and microarray techniques have dramatically reduced the time and cost associated with large-scale genome exploration. Today, whether you are analysing a panel of genes, exploring an exome, or shooting for an entire genome, there is an extraordinary wealth of different techniques available.

We will explore the basic science behind these different techniques, taking a look at how genetic sequencing actually works and how we generate high-quality sequence data for research and clinical application. Successful analysis is critically dependent on accurate sample preparation, so we’ll take a look at the basics of the sample preparation process for a range of sequencing techniques, including NGS and microarrays.

There are numerous kits and methods available for NGS sample preparation, but several of the basic steps needed to prepare DNA for sequencing are conserved across different sequencing techniques. For example, preparing DNA for Illumina sequencing, Ion Torrent sequencing, or a DNA microarray all requires DNA fragmentation. Given the widespread use of the Illumina platform, in this chapter we will largely focus on the crucial steps in preparing DNA for Illumina sequencing.

As well as understanding genomic sequences, there are also sequencing methods that enable us to explore gene expression: which genes or gene regions are active at a particular point in time. During this chapter we will look at how these methods work, when they are used, and how sample preparation differs for these protocols.

FIRST, WE NEED SOME DNA...
The first step for DNA sequencing is collecting a tissue sample. In a clinical setting, DNA for sequencing is extracted from patient samples, such as peripheral blood, bone marrow, fresh tissue, or formalin-fixed paraffin-embedded (FFPE) tissue.

Extracting DNA from a tissue sample involves three steps. First, the tissue cells are broken open to expose the DNA in the nucleus, commonly referred to as cell disruption or cell lysis. This can be done using physical methods such as blending, chemical methods, or sonication, in which high-frequency ultrasound is applied to the sample to disrupt the cell membranes. Second, the remaining membrane lipids are removed by adding detergents.

Third, after extraction, the DNA has to be purified to remove the detergents, proteins, salts and reagents used during cell lysis. Finally, the DNA sample is amplified by polymerase chain reaction (PCR) to enrich the sample ready for library preparation.


AGILENT TECHNOLOGIES: DEEP CLONAL PROFILING OF PURIFIED TUMOR CELL POPULATIONS FROM FFPE SAMPLES
SurePrint G3 Human CGH microarrays and SureSelect exome sequencing facilitated high definition genomic profiling of purified tumor cell populations from FFPE samples

Formalin fixed paraffin embedded (FFPE) tissues are a vast resource of clinically annotated samples. High definition genomics of these informative materials could improve patient management and provide a molecular basis for the selection of personalized therapeutics. A collaboration between several research groups around the world, including the laboratory of Dr. Michael T. Barrett at the Translational Genomics Research Institute (TGen), has developed a method based on flow cytometry cell sorting to isolate individual tumor cell populations from FFPE tissues, as well as methods for their accurate genomic analysis.1

Recent studies have described various methods to interrogate FFPE samples with array and sequencing technologies. These methods typically select for samples that exceed a threshold for tumor cell content using histological methods. Solid tumors, however, exhibit high degrees of tissue heterogeneity. Current approaches for enriching tumor samples prior to analysis of cancer genomes in FFPE tissue, such as laser capture microdissection (LCM), are limited in their ability to sufficiently distinguish and isolate different cell types in a timely manner, making them less suitable for clinical research applications of highly sensitive single molecule-resolution approaches such as NGS.

RELIABLE CGH ARRAY DATA FROM CLONAL PROFILING OF FFPE SAMPLES
Recent advances in cytometry-based cell sorting technology facilitate the detection of relatively rare events in dilute admixed samples, enabling DNA content-based flow cytometry assays for high definition analyses of human cancer biopsies. These assays provide intact nuclei for DNA extraction, eliminate the need and bias to preselect samples based on tumor content and non-quantitative morphology-based measures, and greatly increase the number of samples available for analysis. To evaluate the use of sorted solid-tissue FFPE samples, fresh frozen (FF) pancreatic ductal adenocarcinoma (PDA) tissue samples were compared to matching FFPE samples. Genomic intervals for ADM2 were used to measure the reproducibility of aCGH data in the matching FFPE and FF samples. The top 20 ranked amplicons in the FFPE sample were used for this analysis. In 19 of these, the overlap was >90% with the same ADM2-defined interval in the sorted fresh frozen sample. The global utility of the CGH assay was determined with different tissues, including triple negative breast cancer, bladder carcinoma, glioblastoma and small cell carcinoma of the ovary. The CGH assay was able to discriminate homozygous and partial deletions, and map breakpoints and amplicon boundaries to the single gene level in these sorted samples, regardless of tumor cell content.

ACCURATE NEXT GENERATION SEQUENCING DATA ON FFPE CLONAL POPULATIONS
NGS analysis of even highly tumor cell-enriched bulk cancer samples, including those prepared by LCM, cannot accurately distinguish whether aberrations in the tumor DNA are present in a single cancer genome or are distributed across multiple clonal populations in each biopsy. In contrast, NGS analysis of these highly defined clonal populations can provide accurate sequence information on specific tumor cell types. An analysis was performed on sorted FFPE samples prepared from a PDA cell line whose exome has been extensively studied. The PDA cell line, the primary FF tissue from which the cell line was derived, and the corresponding FFPE blocks were used to validate the sorting-based NGS analyses. A comparison of the paired-end read alignments against the reference genome in each of the 3 samples showed that almost 80% of the target areas had at least 20X coverage in all three samples. The overlap of unique reads and the detection of known mutations across the three independent sample preparations demonstrated that sorted FFPE samples can generate accurate NGS data using the SureSelect Human All Exon Kit.

CONCLUSIONS
These highly sensitive and quantitative sorting assays provide pure and objectively defined populations of neoplastic cells prior to analysis. The deep and unbiased clonal profiling of sorted FFPE samples by aCGH and NGS provides a valuable methodology with broad application for cancer research which can advance the development of personalized patient therapies.

Agilent offers a wide range of resources on CGH microarrays and NGS that include application notes, featured articles, how-to videos and much more. These resources can be accessed from the links below:

• NGS Cancer Research Resource Center: www.agilent.com/genomics/NGSCancer

• NGS Constitutional Research Resource Center: www.agilent.com/genomics/NGSConstitutional

• CGH Resource Center: www.agilent.com/genomics/CGHResource


GenetiSure CGH + SNP Arrays have a lot to say about exon-level coverage
Two catalog arrays for postnatal and cancer research, designed for exon-level coverage of disease-associated regions recommended by the ClinGen/ISCA or COSMIC and Cancer Genetics Consortium databases.
Enhanced loss of heterozygosity detection, with a resolution validated to 2.5 Mb.
Easily customize your microarray at no additional cost.
www.agilent.com/genomics/GenetiSureCGH+SNP
For Research Use Only. Not for use in diagnostic procedures.



Notes
1. T. Holley et al., “Deep Clonal Profiling of Formalin Fixed Paraffin Embedded Clinical Samples.” PLoS ONE 7(11): e50586. doi:10.1371/journal.pone.0050586.

This article was adapted from Agilent Publication 5991-3333EN. For Research Use Only. Not for use in diagnostic procedures.


Page 19: Genomics · Genomics 101 / 3 GENOMICS GLOSSARY ADAPTORS A short nucleotide molecule that binds to each end of a DNA fragment prior to sequencing. ALLELE One of two forms of a gene,

Genomics 101 / 17

TURNING DNA INTO DATA

16 / Genomics 101

Formalin fixed paraffin embedded (FFPE) tissues are a vast resource of clinically annotated samples. High definition genomics of these informative materials could improve patient management and provide a molecular basis for the selection of personalized therapeutics. A collaboration between several

research groups around the world, including the laboratory of Dr. Michael T. Barrett at Translational Genomics Research Institute (TGen), has developed a method based on flow cytometry cell-sorting to isolate individual tumor cell populations from FFPE tissues, as well as methods for their accurate genomic analysis1.

Recent studies have described various methods to interrogate FFPE samples with array and sequencing technologies. These methods typically select for samples that exceed a threshold for tumor cell content using histological methods. Solid tumors, however, exhibit high degrees of tissue heterogeneity. Current approaches for enriching tumor samples prior to analysis of cancer genomes in FFPE tissue, such as laser capture microdissection (LCM), are limited in their ability to sufficiently distinguish and isolate different cell types in a timely manner, making them less suitable for clinical research applications of highly sensitive single molecule-resolution approaches such as NGS.

RELIABLE CGH ARRAY DATA FROM CLONAL PROFILING OF FFPE SAMPLES

Recent advances in cytometry-based cell sorting technology facilitate the detection of relatively rare events in dilute admixed samples, enabling DNA content-based flow cytometry assays for high definition analyses of human cancer biopsies. These assays provide intact nuclei for DNA extraction, eliminate the need (and associated bias) to preselect samples based on tumor content and non-quantitative, morphology-based measures, and greatly increase the number of samples available for analysis. To evaluate the use of sorted solid tissue FFPE samples, fresh frozen (FF) pancreatic ductal adenocarcinoma (PDA) tissue samples were compared to matching FFPE samples. Genomic intervals defined by ADM2 were used to measure the reproducibility of aCGH data in the matching FFPE and FF samples. The top 20 ranked amplicons in the FFPE sample were used for this analysis; in 19 of these, the overlap with the same ADM2-defined interval in the sorted fresh frozen sample was >90%. The global utility of the CGH assay was determined with different tissues, including triple negative breast cancer, bladder carcinoma, glioblastoma and small cell carcinoma of the ovary. The CGH assay was able to discriminate homozygous and partial deletions, and to map breakpoints and amplicon boundaries to the single-gene level in these sorted samples, regardless of tumor cell content.

ACCURATE NEXT GENERATION SEQUENCING DATA ON FFPE CLONAL POPULATIONS

NGS analysis of even highly tumor cell-enriched bulk cancer samples, including those prepared by LCM, cannot accurately distinguish whether aberrations in the tumor DNA are present in a single cancer genome or distributed across multiple clonal populations in each biopsy. In contrast, NGS analysis of these highly defined clonal populations can provide accurate sequence information on specific tumor cell types. An analysis was performed on sorted FFPE samples prepared from a PDA cell line whose exome has been extensively studied. The PDA cell line, primary FF tissue from which the cell line was derived, and the corresponding FFPE blocks were used to validate the sorting-based NGS analyses. A comparison of the paired-end read alignments against the reference genome in each of the three samples showed that almost 80% of the target areas had at least 20X coverage in all three samples. The overlap of unique reads and the detection of known mutations across the three independent sample preparations demonstrated that sorted FFPE samples can generate accurate NGS data using the SureSelect Human All Exon Kit.

CONCLUSIONS

These highly sensitive and quantitative sorting assays provide pure and objectively defined populations of neoplastic cells prior to analysis. The deep and unbiased clonal profiling of sorted FFPE samples by aCGH and NGS provides a valuable methodology with broad application in cancer research, which can advance the development of personalized patient therapies.

Agilent offers a wide range of resources on CGH microarrays and NGS that include application notes, featured articles, how-to videos and much more. These resources can be accessed from the links below:

• NGS Cancer Research Resource Center: www.agilent.com/genomics/NGSCancer

• NGS Constitutional Research Resource Center: www.agilent.com/genomics/NGSConstitutional

• CGH Resource Center: www.agilent.com/genomics/CGHResource


Notes
1. T. Holley et al., “Deep Clonal Profiling of Formalin Fixed Paraffin Embedded Clinical Samples.” PLoS ONE 7(11): e50586. doi:10.1371/journal.pone.0050586.

This article was adapted from Agilent Publication 5991-3333EN. For Research Use Only. Not for use in diagnostic procedures.


The same process is applied when extracting RNA for gene expression studies, such as microarray tests. DNA microarrays designed to measure gene expression are compatible with a large spectrum of different sample types. RNA extracted from cells, blood, saliva, fecal samples, FFPE, and fresh/frozen tissue can all be processed on gene expression arrays.

HOW DOES SEQUENCING WORK?

In order to understand the crucial steps in library preparation, we’ll need to do a run-through of what happens during sequencing itself. All sequencing methods involve a precise set of chemical reactions, and the DNA needs to be properly prepared.

The basic principle of DNA sequencing is converting the bases on a DNA strand into detectable physical events, such as fluorescence (Illumina, Sanger, SOLiD) or a change in pH (Ion Torrent). A sequencing machine detects these physical events and translates them into a read-out of bases.

SANGER SEQUENCING

Used during the Human Genome Project, Sanger sequencing involves the production of a large number of different length DNA fragments from the same base sequence, all ending in a fluorescently labelled base.

The DNA fragments are sorted from smallest to largest, and a laser is used to excite the fluorescent label on each, allowing the sequencer to record which base is present and to reconstruct the original order of the bases.

SOLiD SEQUENCING

As with the 454 protocol, SOLiD (Sequencing by Oligonucleotide Ligation and Detection) attaches identical DNA fragments to agarose beads. However, during SOLiD the fluorescence used to record the DNA sequence is generated by the action of a DNA ligase enzyme as short sections of DNA, rather than individual bases, are joined together.

454 LIFE SCIENCES

Also known as ‘pyrosequencing’, the 454 method detects the activity of the DNA polymerase enzyme as bases are added to a DNA strand: each incorporated base triggers an enzymatic reaction that releases a flash of light.

Fragmented DNA strands are attached to an agarose bead 1µm in diameter, and the light-producing reaction takes place inside a picolitre-scale well as nucleotides are washed over the beads.

SINGLE-MOLECULE REAL-TIME SEQUENCING (SMRT)

Allowing for longer DNA reads than other NGS protocols, SMRT uses fluorescently-labelled nucleotides to sequence DNA strands in real time.

A single molecule of DNA is immobilised at the bottom of a zero-mode waveguide – a tube whose dimensions are smaller than the wavelength of light, approximately 70nm in diameter – along with a single DNA polymerase enzyme. A detector records which bases are incorporated during DNA synthesis by measuring fluorescence.

ION TORRENT

When a nucleotide is incorporated into a DNA strand by an enzyme, an H+ ion is released. Instead of detecting fluorescence, Ion Torrent sequencing detects pH changes caused by H+ ion release.

During sequencing, each well of the sequencer is flooded with one nucleotide at a time until a pH change is recorded, indicating a base match.

ILLUMINA

Illumina sequencing all takes place on a specialised flow cell coated in a ‘lawn’ of primers. Fragments of DNA are ‘hybridised’ (attached) to this two-dimensional surface, forming localised clusters of about 2,000 identical DNA fragments. This step is called ‘cluster generation’.

During sequencing these clusters are bathed in fluorescently labelled nucleotides, along with a DNA polymerase enzyme that incorporates each fluorescent nucleotide opposite its complementary base on the template strand, so a fluorescent A pairs with a T, and so on.

As with Sanger sequencing and other fluorescent methods, the surface of the flow cell is then imaged using laser excitation and the resulting colours used to record the DNA sequence of each cluster.


SEQUENCING FOR GENE EXPRESSION

As well as determining the presence and absence of genes, sequencing techniques can also be used to capture a snapshot of gene activity under particular conditions, also known as the “transcriptome”. As with the majority of DNA sequencing, fluorescent labelling is used to convert gene activity into a measurable, physical effect: the stronger the fluorescent signal, the greater the level of expression in the sample.

There are two main methods for studying gene expression: microarrays and RNA-Seq. The crucial difference between the two techniques is that microarrays detect known transcripts that correspond to known genomic sequences, whereas RNA-Seq can detect both known transcripts as well as potential novel transcripts without the need to have prior knowledge of the genomic sequences.

Broadly speaking, microarrays are particularly useful in a clinical environment because they enable rapid, accurate assessment and diagnosis of established clinical variants. RNA-Seq is often favoured for research into unknown areas of the transcriptome.

LONG AND THE SHORT OF IT

One of the limitations of the most popular NGS methods is that they produce short reads of DNA sequence. Short reads are great for rapid, high throughput sequencing, but can make accurate genome reconstruction challenging.

Short reads also provide limited coverage of certain parts of the genome, for example areas that are rich in GC nucleotide bases, which have a higher denaturing point than AT nucleotides. As a result, during PCR, GC-rich regions are less well amplified than AT-rich regions.
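As a small illustration of why GC content matters for coverage, the sketch below computes the GC fraction of a few made-up reads in Python; the 0.35 and 0.65 thresholds are arbitrary examples, not established cut-offs.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Made-up example reads; thresholds are purely illustrative.
reads = ["ATGCGCGCGGCC", "ATATATTAAATT", "GGCCAATTAACC"]
for read in reads:
    gc = gc_fraction(read)
    label = "GC-rich" if gc > 0.65 else ("AT-rich" if gc < 0.35 else "balanced")
    print(f"{read}\tGC={gc:.2f}\t{label}")
```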

In recent years sequencing techniques that produce longer reads have emerged to address this problem. Long reads make genome reconstruction simpler by making the puzzle pieces larger, and increase the coverage of genome areas that are harder to sequence.

WHAT IS PCR?

Developed in 1983, the polymerase chain reaction or PCR has become an essential part of any genetics toolkit.

PCR is a molecular photocopier, which ‘amplifies’, or makes multiple copies of small segments of DNA. Genetic sequencing requires large amounts of sample DNA, making the process almost impossible without PCR.

The DNA sample is heated, so that the two DNA strands denature and pull apart into two separate strands.

Next, an enzyme called Taq polymerase builds two new strands of DNA using the original strands as a template. This process creates two identical versions of the original strand, which can then be used to create two new copies, and so on.
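Under ideal conditions each cycle doubles the number of copies, so the theoretical yield is easy to work out. The snippet below is a back-of-the-envelope illustration only; real reactions plateau well before thirty cycles of perfect doubling.

```python
def pcr_copies(starting_molecules: int, cycles: int) -> int:
    """Theoretical copy number after a given number of cycles of perfect doubling."""
    return starting_molecules * (2 ** cycles)

for cycles in (10, 20, 30):
    print(f"{cycles} cycles: {pcr_copies(1, cycles):,} copies per starting molecule")
```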

RNA-SEQ

The newer of the two technologies, RNA sequencing or ‘RNA-Seq’ is exactly that: reading the sequence of a strand of RNA. RNA-Seq provides a ‘snapshot’ of the presence and quantity of RNA transcripts at a given moment in time, allowing for experiments that interrogate gene expression under particular conditions.

During Illumina RNA-Seq, mRNA molecules are fragmented and converted to double-stranded cDNA. The cDNA is then subjected to similar library preparation steps as for DNA sequencing: end repair, adaptor ligation, and PCR amplification. The sequencing reaction for RNA is the same as for DNA, beginning with the all-important cluster formation.

MICROARRAYS

A microarray is typically a glass slide or chip coated in DNA probes to which the sample DNA fragments can attach.

A DNA microarray protocol begins in much the same way as RNA-Seq: by isolating mRNA from biological samples, and converting those to cDNA. During the creation of cDNA, a fluorescent label is added to the cDNA fragment.

These labelled fragments hybridise to the microarray, and the fluorescence is activated by laser to generate a signal. These signals are used to create a digital image of the array, which is used for analysis.


MEETING THE CHANGING DEMANDS OF NGS SAMPLE PREPARATION WITH NEBNEXT®

LIBRARY PREPARATION IS A CRITICAL PART OF THE NEXT GENERATION SEQUENCING WORKFLOW; SUCCESSFUL SEQUENCING REQUIRES HIGH QUALITY LIBRARIES OF SUFFICIENT YIELD AND QUALITY.

NEBNext Ultra II DNA Library Prep Kit for NGS: even more from less. Visit NEBNextUltraII.com to request a sample.

Figure 1. NEBNext Ultra II produces the highest yield libraries from a broad range of input amounts. Libraries were prepared from Human NA19240 genomic DNA using the input amounts and numbers of PCR cycles shown (100 ng/5 cycles, 10 ng/8 cycles, 1 ng/11 cycles, 500 pg/14 cycles). Manufacturers’ recommendations were followed, with the exception that size selection was omitted. Library yield (nM) is compared for Ultra II, Kapa Hyper and TruSeq Nano.

Figure 2. NEBNext Ultra II produces the highest rates of conversion to adaptor-ligated molecules from a broad range of input amounts. Libraries were prepared from Human NA19240 genomic DNA using the input amounts and library prep kits shown, without an amplification step, following manufacturers’ recommendations. qPCR was used to quantitate adaptor-ligated molecules, and quantitation values were then normalized to the conversion rate for Ultra II. The Ultra II kit produces the highest rate of conversion to adaptor-ligated molecules for a broad range of input amounts.

Figure 3. Number of PCR cycles required to generate ≥ 1 µg amplified library for target enrichment. Ultra II libraries were prepared from Human NA19240 genomic DNA using NEBNext Ultra II and the input amounts shown (1 µg, 100 ng and 10 ng). Yields were measured after each PCR cycle and the number of cycles required to generate at least 1 µg of amplified library determined. Cycle numbers for Kapa Hyper were obtained from the Kapa Biosystems website and plotted alongside the cycle numbers obtained experimentally for Ultra II.

Figure 4. NEBNext Ultra II provides uniform GC coverage for microbial genomic DNA over a broad range of GC composition and input amounts. Libraries were made using 500 pg, 1 ng and 100 ng of the genomic DNAs shown and the Ultra II DNA Library Prep Kit (A), or using 100 ng of the genomic DNAs and the library prep kits shown (B), and sequenced on an Illumina MiSeq®. Reads were mapped using Bowtie 2.2.4 and GC coverage information was calculated using Picard’s CollectGCBiasMetrics (v1.117). Expected normalized coverage of 1.0 is indicated by the horizontal grey line, the number of 100 bp regions at each GC% is indicated by the vertical grey bars, and the colored lines represent the normalized coverage for each library.

As sequencing technologies improve and capacities expand, boundaries are also being pushed on library construction. High performance is required from ever-decreasing input quantities and from samples of lower quality or those with extreme GC content. At the same time, the need is increasing for faster, automatable protocols that perform reliably and do not compromise the quality of the libraries produced.

At New England Biolabs®, we understand the challenges you face, and are uniquely positioned to help you meet them. For over 40 years, NEB® has been a leading supplier of molecular biology enzymes for the life science community. We have been at the forefront of developing recombinant enzymes, as well as stringent quality controls that ensure product purity and performance, and since 2009 we have been applying this expertise to improve sample preparation products for NGS.

To meet the increasing applications and new challenges of NGS sample preparation, we continue to expand our NEBNext product portfolio. Available for the Illumina® and Ion Torrent™ platforms, NEBNext reagents are designed to streamline workflows, minimize inputs and improve library yields and quality. NEBNext sample preparation kits are available for genomic DNA, ChIP DNA, FFPE DNA, microbiome DNA, RNA and small RNA samples. In addition to the extensive QCs on individual kit components, all NEBNext kits are functionally validated by library preparation, followed by sequencing on the appropriate platform.

SUBSTANTIALLY IMPROVED LIBRARY PREPARATION WITH THE NEBNEXT ULTRA™ II DNA LIBRARY PREP KIT FOR ILLUMINA

The NEBNext Ultra II DNA Library Prep Kit pushes the limits of library preparation. Each component of the kit has been carefully formulated, resulting in a several-fold increase in library yield with as little as 500 pg of human DNA. These advances deliver unprecedented performance, while enabling lower inputs and fewer PCR cycles, all with a fast, streamlined workflow.

IMPROVEMENTS IN LIBRARY YIELD AND CONVERSION RATE

An important measure of the success of library preparation is the yield of the final library. The reformulation of each step in the library prep workflow enables substantially higher yields from the NEBNext Ultra II Kit compared to other commercially available kits (Figure 1). Even when using very low input amounts (e.g. 500 pg of human DNA), high yields of high quality libraries can be obtained, using fewer PCR cycles.

The efficiency of the end repair, dA-tailing and adaptor ligation steps during library construction can be measured separately from the PCR step by qPCR quantitation of adaptor-ligated fragments prior to library amplification. This enables determination of the rate of conversion of input DNA to adaptor-ligated fragments, i.e. sequenceable molecules. Therefore, measuring conversion rates is another way to assess the efficiency of library construction, and also provides information on the diversity of the library. Again, NEBNext Ultra II enables substantially higher rates of conversion as compared to other commercially available kits (Figure 2).


MINIMIZATION OF PCR CYCLES

In general, it is preferable to use as few PCR cycles as possible to amplify libraries. In addition to reducing workflow time, this also limits the risk of introducing bias during PCR. A consequence of increased efficiency of end repair, dA-tailing and adaptor ligation is that fewer PCR cycles are required to achieve the library yields necessary for sequencing or other intermediate downstream workflows (Figure 3).

IMPROVEMENTS IN LIBRARY QUALITY

While sufficient yield of a library is required for successful sequencing, quantity alone is not enough. The quality of a library is also critical, regardless of the input amount or GC content of the sample DNA. A high quality library will have uniform representation of the original sample, as well as even coverage across the GC spectrum.

UNIFORM GC COVERAGE

Libraries from varying input amounts of three microbial genomic DNAs with low, medium and high GC content (H. influenzae, E. coli and H. palustris) were prepared using the NEBNext Ultra II Kit. In all cases, uniform coverage was obtained, regardless of GC content and input amount (Figure 4A). GC coverage of libraries prepared using other commercially available kits was also analyzed using the same trio of genomic DNAs. Again, NEBNext Ultra II provided good GC coverage (Figure 4B).

When amplification is required to obtain sufficient library yields, it is important to ensure that no bias is introduced, and that representation of GC-rich and AT-rich regions is not skewed in the final library. Comparison with libraries produced without amplification (“PCR-free”) is also a useful measure (1). Coverage of libraries prepared from human genomic DNA using NEBNext Ultra II, as well as other commercially available kits, was compared to a PCR-free library. Results demonstrated that the Ultra II library coverage is most similar to the PCR-free library, and also covers the range of GC content (data not shown).

For more performance data using NEBNext Ultra II, visit NEBNextUltraII.com and download the full technical note.

REFERENCE
(1) Kozarewa, I. et al. (2009). Amplification-free Illumina sequencing – library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods 6:291–295.

NEW ENGLAND BIOLABS®, NEB® and NEBNEXT® are registered trademarks of New England Biolabs, Inc. ULTRA™ is a trademark of New England Biolabs, Inc. ILLUMINA®, MISEQ® and TRUSEQ® are registered trademarks of Illumina, Inc. ION TORRENT™ is a trademark owned by Life Technologies, Inc. KAPA™ is a trademark of Kapa Biosystems.


BUILDING A DNA LIBRARY

Just as the speed of genetic sequencing has increased phenomenally over the past decade, so has the speed of library preparation.

Early protocols, such as Sanger sequencing, used molecular cloning to create libraries of identical DNA fragments. This process involves using host bacteria to store and replicate the target DNA. The downside of this approach, aside from the speed, was that the resulting DNA sequence might contain parts of the bacterial cloning vector. In the past few years, Illumina NGS library preparation times have fallen from 1-2 days to around 90 minutes.

There can be a lot of variation in library preparation methods, some of which are specific to particular sequencing methods or products, but broadly speaking there are five main stages involved.

1. DNA FRAGMENTATION

The first step in library preparation is to fragment the sample DNA. Fragmentation is a crucial step in the library preparation process, as it ensures that all the DNA strands are roughly the same size before sequencing.

This is because different length strands can behave in different ways during library preparation, particularly during the later PCR stages where smaller DNA fragments may be over amplified compared to larger ones.

There are numerous methods for fragmenting DNA, but three of the most common are:

• Acoustic shearing: using high-frequency acoustic energy waves to break DNA strands

• Nebulisation: forcing the DNA through a small hole in a nebuliser unit

• Enzyme restriction: using enzymes to break the DNA strands into smaller pieces

For NGS the fragments produced are typically less than 800 base pairs in length, as these approaches produce short read DNA data. However, for the newly emerging long read sequencing methods the DNA strands are fragmented into 10kb lengths.

Microarrays developed for SNP genotyping or genome-wide DNA copy number detection can be used with DNA extracted from a wide assortment of samples. These assays, unlike NGS-assays, do not require shearing of the DNA. The DNA is simply used ‘as is’, or can be restriction-enzyme digested.

2. END REPAIR

DNA fragmentation leaves a range of 3’ and 5’ ends (recessed, overhang, blunt) on DNA fragments. As with fragments of different length, fragments with different ends may react differently later on in the process, and so need to be repaired.

To repair the DNA strands a series of different treatments are applied that remove overhangs and fill in recesses, creating a sample of entirely blunt-ended fragments.

dA-tailing

For Illumina sequencing there is an additional library preparation step called dA-tailing of the 3’ end of the repaired fragment. An A nucleotide overhang is added to the 3’ end of each DNA strand, which enables the right adaptors to ‘ligate’, or attach, to the DNA strand in the next step.

3. ADAPTER LIGATION

Quite simply, this involves attaching known sequences (‘adaptors’) to the ends of the prepared DNA fragments whose sequence is unknown. Adaptors are needed further downstream in the sequencing process, and are essential for sequencing to work properly.

For example, during Illumina sequencing the adaptors are needed to hybridise the DNA strands to the flow cell. The flow cell itself is covered in a dense ‘lawn’ of primers to which the DNA fragments attach. Adaptors can also contain an ‘index’ sequence, allowing for multiple different samples to be studied in a single flow cell.

During SOLiD or 454 sequencing protocols, the adaptors are required to bind the DNA fragments to the agarose beads on which the sequencing reaction takes place.

4. AMPLIFY

Finally, a PCR amplification is performed to create a robust library of DNA fragments that is suitable for sequencing. This step increases the amount of library, and ensures that only molecules with an adaptor at each end are selected for sequencing.

5. CLEAN-UP AND QUANTIFY

For Illumina sequencing, a final round of gel electrophoresis is often used to purify the final product, and conclude the library preparation process.

Before sequencing it is important to determine that the library contains a suitable number of molecules that are ready to be sequenced: that the right number of DNA fragments, with attached adaptors, is present in the sample.

Another reason to quantitate a library is when more than one library is due to be sequenced at the same time, as is possible with Illumina sequencing.

There are several different methods for library quantitation, but they all broadly work in the same way, detecting the presence of the right sized fragments. For example:

• Spectrophotometry: this method detects the absorption of UV light by macromolecules in the sample. The larger the DNA molecule, the greater the UV absorption.

• Fluorimetry: this method involves binding a fluorescent dye to the DNA molecules and measuring the fluorescence. Larger molecules fluoresce more brightly than small.
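As an illustration of the arithmetic behind this kind of quantitation, the sketch below converts a measured mass concentration and mean fragment length into a molar concentration. It assumes double-stranded DNA at roughly 660 g/mol per base pair; the input values are made-up examples.

```python
AVG_BP_MASS = 660.0  # approximate molar mass of one double-stranded base pair (g/mol)

def library_molarity_nM(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration (ng/µL) to nanomolar,
    given the mean fragment length in base pairs."""
    return (conc_ng_per_ul * 1e6) / (AVG_BP_MASS * mean_fragment_bp)

# Example: a 10 ng/µL library with a mean fragment size of 400 bp
print(f"{library_molarity_nM(10.0, 400):.1f} nM")  # ≈ 37.9 nM
```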


JUST THE BEGINNING

Generating a series of DNA sequence fragments, whether using NGS techniques or DNA microarrays, is the first step in generating high quality genomic data that can form the basis of clinical testing or research. In the following chapters we will explore what happens next, including reconstructing entire genomes from those DNA sequence fragments, data analysis for microarrays, how to be confident in data quality, and how that sequence data can be used to inform both research and clinical diagnostics.


[Figure: Making a library. The main steps: DNA fragmentation, end repair, adapter ligation and PCR enrichment.]


CHAPTER 3: ANALYSING DATA


INTRODUCTION

In the previous two chapters we have considered the kinds of genomic data we can generate, and the chemistry that makes it possible. At the heart of genomics is data analysis. Once you have digitised your DNA, you can start to explore it, understand it, and query it. In this chapter we will look at how to analyse microarray and NGS data, and how to turn it into useful information.

MICROARRAY DATA ANALYSIS

INTRODUCTION

Over the last several years the analysis of microarray data has standardised to relatively simple workflows. There are standard methodologies and analysis pipelines that can be used to generate genotypes, identify copy number aberrations, identify regions of absence of heterozygosity (AOH) or loss of heterozygosity (LOH), identify differential gene expression and identify splice variants from microarray data. The results from an analysis of microarray data are available on the order of minutes or hours, as opposed to days. In addition, analysis of microarray data usually does not require any specific hardware and is commonly performed using a standard laptop computer. The exception is large genotyping studies with hundreds of thousands of samples, which are more commonly analysed using Linux clusters or cloud computing resources. In general, microarray analysis can be done by bench scientists using standard computers and programs provided by the microarray providers.

QUALITY CONTROL

One critical component of any analysis pipeline, including those for microarrays, is quality control checks. Each microarray manufacturer has a set of quality control processes and guidelines recommended for each of their microarray products. For example, there are spike-in controls like the ERCC RNA controls developed by NIST, recommendations to process standard sample controls concurrently with experimental samples to evaluate the performance of the reagents independently of sample quality, and specific algorithm metrics. Generally, each of the application-specific algorithms referred to below contains metrics indicating the quality or confidence attributed to its output. In the overall analysis workflow, these quality control checks exist at multiple places, providing insight into everything from the quality of the sample to the performance of the reagents and of the algorithm. In discussions of analysis pipelines, quality control processes and guidelines are often omitted. Yet they are probably the single most important part of any analysis.

DISCOVERY IN MILLIONS OF GENOMES

STUDIES THAT ANALYZE MILLIONS OF GENOMES AT ONCE WON'T JUST BE TECHNICAL FEATS; THEY WILL LEAD US TO TARGETED TREATMENTS FOR SUFFERERS OF MANY DISEASES, INCLUDING CANCER.

Julia Fan Li, Senior Vice President, Seven Bridges

Discovering pathogenic variants, or those that might lead to drug targets, is a numbers game. The more patients we sequence, the higher our statistical power to work out the relationships between genotype and phenotype, gene expression or disease.

Getting to the numbers is more complex than anticipated. We need to examine an individual patient in the context of millions of others — and the data required to do so is massive.

Working with millions of genomes in a single study is about to become a reality.

There are projects around the world sequencing and analyzing more than 100,000 genomes alongside other biomedical data. These range from work at Human Longevity, to the International Cancer Genome Consortium, the Million Veteran Program, and SequenceBio.

We’ve also already seen the value that higher statistical power can bring. Lawrence et al., in a Nature paper (1) published out of the Getz lab at the Broad Institute, showed that for cancers with low signal-to-noise ratios (i.e. mutations that occur in less than two percent of tumors of a given type), we only begin to identify mutations when sampling more than 5,000 patients. The paper posited that we will need significantly more data to make the kinds of advances we all hope will result in ever more precise medicine.

The question, then, is how this is accomplished. How do we work with petabytes of genomic and other data more efficiently, and how do we put it to use?

In our work with large-scale and national genomics projects, Seven Bridges has identified three key trends that will shape how we deliver on the promise of precision medicine.

COMPUTATION CENTERS

With the rise of elastically available cloud computation resources, the first is, in hindsight, blindingly obvious: computation centers will replace data repositories.

Historically, any researcher wanting to work on large-scale genomic data needed large quantities of both money and time. Grants would contain funding for local high-performance computing (HPC) clusters and would expect researchers to spend months making copies of datasets to these facilities.

Only then could one start work, and by the time research began, a dataset could already be out of date. Further, there is the ongoing expense of maintaining the computing infrastructure.

Through projects like the National Cancer Institute’s Cancer Genomics Cloud (CGC) <www.cancergenomicscloud.org>, we’ve discovered a better way.

We bring the biological questions to the data, not the other way around.

By co-locating large datasets with computing resources, researchers can upload their tools, pipelines, and metadata — all of which are orders of magnitude smaller than the original datasets — and pay only for the computation they need in a given experiment.

This offers four tangible benefits. First, when combined with the right software these centers enable simple and secure collaboration. We know that the best science is done in teams, and giving everyone the exact access they need — and not more — is transformational. A PI can review all the data, and inspect every pipeline with every parameter. She can manage multiple funding sources, and distribute work among her lab. Meanwhile, a lab assistant can help construct a pipeline and select relevant cases to explore, without the ability to execute large computational tasks (or, in other words, spend money without approval).

Second, it reduces costs because funding bodies now only need to pay to keep one copy of a given dataset. We estimate that storing one copy of The Cancer Genome Atlas (TCGA) on the CGC will cost approximately $2 million per year.

Third, it increases access to the data because any researcher, not just those with available HPCs, can run their analysis. Instead of making a large capital expenditure, researchers can turn computation into an operating expense that scales directly with how much they actually use it. Better still, with the huge competition in the cloud-computing market, these prices are consistently falling (2).

Fourth, it greatly accelerates research. Not only are the months-long waits to make copies of data eliminated, but the time to access computational resources is also reduced to just minutes. Further, in cases where time is of the absolute essence, researchers can optimise to take advantage of vastly more cores than they normally would have in an HPC to further accelerate a task. For example, we’ve worked with customers to reduce whole genome pipelines to just about five hours.



PORTABLE, REPRODUCIBLE WORKFLOWS

The second trend we’ve identified with our partners is the need for completely portable and thus reproducible workflows and pipelines.

The more large-scale data analysis enters the everyday practice of science and medicine, the clearer it is that “algorithms and the software used to implement them have become an integral and important part of research methods” (3). But the complexity of tracking — let alone sharing — software methods increases in lockstep with the complexity of the tools themselves. For example, a typical TCGA marker paper uses more than 50 bioinformatic tools, each of which comes in multiple versions and with many different parameters. Standardising the way we document tools, parameters, and their dependencies is crucial to making the process of repeating methods easier.

These issues were the impetus for the bioinformatics community to develop the Common Workflow Language (CWL) <www.commonwl.org>. CWL is a specification, much like HTML, that uses plain text to store every piece of a complex computational workflow. Better still, it was defined with Docker <www.docker.com> in mind, meaning CWL-compliant software can also perfectly reproduce a given workflow in the future by re-downloading the exact version of any given application. And, because CWL is an open specification, it prevents lock-in: researchers can use any analysis tool they prefer, or even write their own.

We’ve become big believers in CWL, and have built it into both the CGC and the Seven Bridges Platform. Other industry partners are doing the same — including the Institute for Systems Biology, the Sanger Institute, the Galaxy Project, and the Broad Institute.

Making reproducibility a copy-and-paste affair is not just good for science — it accelerates the pace at which we can build off the discoveries of the entire community.

ADVANCED DATA STRUCTURES

The final trend we see across all our projects — from large pharmaceuticals to national projects like Genomics England — is a need to bring genomic data structures into the 21st century.

The linear data formats of traditional genomics tools can’t scale to the number of samples we need to analyze simultaneously.

Today, when we want to understand an individual patient we align their reads to a static reference, and store the results in static, flat files. And we repeat this process for each new patient. Worse still, the static reference is updated only once every three years, and represents only a small collection of individuals, leading to inherent bias.

Instead, we need a reference that can be updated immediately with new evidence. We need a reference that learns. We need a reference that contains knowledge of an entire population.

We do this through a new technology we call the Graph Genome, which advances genetic analysis in two key ways. First, it helps us create an ever-more accurate view of both an individual’s genetic makeup and that of the population as a whole. Second, it is a more efficient method to store and analyze vast quantities of genetic data.

ACCURATE INDIVIDUALS AND POPULATIONS

The Graph Genome gets increasingly precise by keeping and learning from data that other genetic analysis tools would throw away.

A sequencer does not read a person’s entire genome from beginning to end. Instead, it breaks the genome up into smaller chunks and reads them at random. Then, large numbers of computers piece the small segments back together like a person completes a jigsaw puzzle, comparing the piece in her hand to the completed picture on the box top.

Until the Graph Genome, that box top, called the reference genome, was a composite of just a few people. The majority comes from one person.

But the Graph Genome does something different. Instead of comparing pieces to the box top and stopping, we update the overall image with the unique pieces of each individual. Over time, as we complete more puzzles (sequence more people), the box top starts to look like the entire population. Importantly, we don’t store identifiable individuals, but instead update the frequency of their unique variations with each new sequence.
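The following toy Python sketch is purely illustrative of this idea and is not Seven Bridges' implementation; the class name ToyGraphReference and all positions, alleles and counts are invented for the example. It shows how per-position allele counts can be folded into a shared reference while each individual genome is discarded after use.

```python
from collections import defaultdict

class ToyGraphReference:
    """Conceptual sketch: store only population allele frequencies, not individuals."""

    def __init__(self):
        # position -> {allele: number of genomes carrying it}
        self.allele_counts = defaultdict(lambda: defaultdict(int))
        self.n_genomes = 0

    def add_genome(self, variants: dict, reference: dict) -> None:
        """Fold one genome's alleles into the population counts, then discard it."""
        self.n_genomes += 1
        for pos, ref_base in reference.items():
            allele = variants.get(pos, ref_base)
            self.allele_counts[pos][allele] += 1

    def frequency(self, pos: int, allele: str) -> float:
        return self.allele_counts[pos][allele] / self.n_genomes

ref = {100: "A", 101: "G", 102: "C"}
graph = ToyGraphReference()
graph.add_genome({101: "T"}, ref)   # one individual with a G->T variant at position 101
graph.add_genome({}, ref)           # one individual matching the reference
print(graph.frequency(101, "T"))    # 0.5
```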

FASTER AND MORE EFFICIENT

Traditionally, the data for each genome is stored separately. This makes genomic datasets so large that they are impractical to store and move. Even with the fastest connections in the world, copying a dataset could take weeks. We allow researchers to work with the data in the cloud and download only the results of their work. By collapsing many individuals into only the differences between them, we not only need less storage space, but can also more efficiently read and work with that data.

There’s also less of something else in graph genomes: patient-identifying information. Because graph-based references can be updated with just the new frequencies of variation, without the need to store an individual’s path through the graph, they also represent a way to truly anonymise patient data while simultaneously capturing the full benefit from it.

In this instance, less is actually more. This is what we mean when we say graph genomes are self-improving: it gets better the more you use it.

Julia is a Senior Vice President at Seven Bridges, the biomedical data analysis company accelerating breakthroughs in genomics research for cancer, drug development and precision medicine. She leads the Seven Bridges UK office, which focuses on advancing the state of the art in graph genome research, as well as serving national governments on their largest, million-genome-scale projects.

Notes
1. “Discovery and saturation analysis of cancer genes across 21 tumour types.” Nature. 2014 Jan 23;505(7484):495-501. doi: 10.1038/nature12912. Epub 2014 Jan 5.
2. http://www.economist.com/news/business/21648685-cloud-computing-prices-keep-falling-whole-it-business-will-change-cheap-convenient
3. “Software with Impact.” Nature Methods 11, 211 (2014). doi:10.1038/nmeth.2880.

[Figure: a graph genome reference. Alternative bases at each position are stored as branches in the graph, annotated with their frequencies in the population.]


When evaluating different platforms and analysis pipelines, make sure to understand the recommendations and limitations of each platform’s quality control systems to ensure the generation of accurate and robust data. The famous computer science adage ‘Garbage in, garbage out’ applies equally well to any data analysis pipeline, and microarrays and NGS are no exceptions.

ANALYSIS

The fundamental steps in analysing microarray data, after quality control, are the same regardless of the application of interest (i.e. copy number, genotyping or expression) or the microarray platform. Each technology platform has its own recommended standard pipeline for data processing for its arrays, but the high level steps are similar. Intensities are captured off the microarray using a scanner and algorithms provided by the technology providers. These intensities are then subjected to signal processing algorithms including, but not limited to, normalisation, background correction and outlier removal. Following this signal processing, the intensities are summarised together, providing a signal for each probe/probe set/feature on the array. At this point the data processing diverges for the different application spaces, utilising application-specific algorithms. Gene expression arrays are now ready to be subjected to standard statistical tests like t-tests and ANOVA to identify differentially expressed genes. More sophisticated next generation expression arrays can also be further analysed for splice variants using specific algorithms provided in software from the manufacturer.
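As a sketch of the kind of processing described above, and not any vendor's pipeline, the following Python code quantile-normalises a toy probe-by-sample intensity matrix and applies a per-probe t-test between two groups of samples; all numbers are simulated.

```python
import numpy as np
from scipy import stats

# Simulated probes x samples matrix of summarised, log2 intensities
rng = np.random.default_rng(0)
data = rng.normal(8, 1, size=(1000, 6))
data[0, 3:] += 2  # spike a "differentially expressed" probe into the last 3 samples

def quantile_normalise(x: np.ndarray) -> np.ndarray:
    """Force every sample (column) to share the same intensity distribution."""
    order = np.argsort(x, axis=0)              # rank of each value within its column
    ranked_means = np.sort(x, axis=0).mean(1)  # mean intensity at each rank
    out = np.empty_like(x)
    for j in range(x.shape[1]):
        out[order[:, j], j] = ranked_means
    return out

norm = quantile_normalise(data)
group_a, group_b = norm[:, :3], norm[:, 3:]
t, p = stats.ttest_ind(group_a, group_b, axis=1)
print("probe 0 p-value:", p[0])  # expected to be small for the spiked probe
```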

For genotyping microarrays, the data is transformed into a space with properties more suitable for evaluating genotypes, usually utilising a clustering algorithm such as BRLMM-P or GenCall.

Analysing microarray data for copy number variation is a multi-step process involving creating log2 ratios against a reference and then calculating the corresponding copy number for each of the probe sets based on the ratio. These individual copy number calls are then processed by a segmentation algorithm to identify stretches of identical copy number calls, corresponding to stretches of the chromosome (commonly referred to as segments) with that copy number state. Some oligonucleotide microarrays also contain SNP probes that can be processed to identify stretches of AOH. In general, these application-specific analysis pipelines are packaged in easy to use desktop computer applications provided by either the microarray suppliers themselves or third party software providers.
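A minimal sketch of that copy-number logic, using invented probe intensities: compute log2 ratios of sample versus reference, convert each probe to a copy number call (assuming a diploid reference), then collapse runs of identical calls into segments. Production pipelines use statistical segmentation algorithms (such as circular binary segmentation) rather than the naive run-collapsing shown here.

```python
import numpy as np

# Invented probe intensities along one chromosome
sample_intensity = np.array([520, 495, 1030, 1010, 980, 260, 250, 505])
reference_intensity = np.array([500, 500, 500, 500, 500, 500, 500, 500])

log2_ratio = np.log2(sample_intensity / reference_intensity)
copy_number = np.rint(2 * 2 ** log2_ratio).astype(int)  # assumes a diploid reference

# Collapse consecutive probes with the same copy number call into segments
segments = []
start = 0
for i in range(1, len(copy_number) + 1):
    if i == len(copy_number) or copy_number[i] != copy_number[start]:
        segments.append((start, i - 1, int(copy_number[start])))
        start = i

for first, last, cn in segments:
    print(f"probes {first}-{last}: copy number {cn}")
```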

In our opening chapter, we learned that DNA Microarrays often offer a simpler and more cost effective way to generate certain types of data. This is also true of the analysis of microarray data. While most bench scientists will be capable of carrying out the analysis outlined above, NGS analysis is a little more complicated.

NGS ANALYSIS

INTRODUCTION

In the last decade the genomics industry has seen the cost of next generation sequencing (NGS) drop faster than the slope of Moore’s law, from about US$10 million to approximately $1,000 per genome. The drop in cost and the increase in speed of sequencing technologies have led to widespread adoption of NGS by the research community and increasing use in the clinic for diagnosis and treatment of disease.

With NGS data being generated at an ever-increasing rate, the need for standardised genome informatics and data management practices has become critical. Gaining knowledge from genomic data (the stream of A’s, C’s, T’s and G’s that makes up the output of the DNA sequencer) involves three broad steps:

• Primary analysis – conversion of raw instrument signals into reads of ACTG code, with a quality score assigned to each base

• Secondary analysis – QA filtering, alignment and assembly of reads, and variant calling

• Tertiary analysis – Annotation and filtering of variants for study specific investigations

PRIMARY ANALYSIS

Primary analysis involves steps required to convert data from the sequencer into base pairs and compute quality scores for each base. As NGS typically works by breaking the genome into fragments which are then read, the sequencer generates unordered and unaligned raw reads, along with a quality score for each of the bases—known as the Phred score (denoted by Q).

A Phred quality score is a measure of the quality of the identification of the nucleotide bases generated by automated DNA sequencing. It was originally developed for Phred base calling to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. They have become widely accepted to characterise the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.
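The Phred scale relates the quality score Q to the estimated probability P that the base call is wrong by Q = -10 * log10(P), so P = 10 ** (-Q / 10). The short snippet below shows the conversion in both directions.

```python
import math

def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p: float) -> float:
    """Convert an error probability into a Phred quality score."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
# Q10 -> 0.1000, Q20 -> 0.0100, Q30 -> 0.0010, Q40 -> 0.0001
```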

Typically, the instrument manufacturers provide the software for primary analysis, and the task is handled within the sequencing instrument itself.

The output from the sequencer is typically a FASTQ file: ASCII text that contains, for each read, a sequence identifier, the called nucleotides (A, C, G or T) and the corresponding Phred scores. Once the FASTQ file has been generated, it is ready for processing in a secondary analysis pipeline.
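For illustration, a single (entirely made-up) FASTQ record looks like this: four lines per read, consisting of an identifier, the called bases, a separator line, and one Phred-encoded quality character per base.

@read_001 illustrative example
GATTACAGATTACA
+
IIIIHHHHGGGGFF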

SECONDARY ANALYSIS

Dominant NGS technologies are based on the shotgun approach (see the Creating Data and Reading DNA chapters). The sequence data that comes off the sequencer is composed of small nucleotide sequences of ACTG code (reads) that need to be put back together. Secondary analysis involves the alignment and reassembly of these reads to reconstruct the original sequence, a process known as genome assembly. Before reassembly, the reads are assessed and filtered by length and by the quality reported by the sequencer in order to produce the best results.


It is known that certain SNPs are associated with particular traits, which makes them a widely used biomarker for drug development and genetic studies. For SNP mining, reads generated from multiple individuals are assembled against a reference genome to identify sites with single nucleotide variation.

Variant calling depends on several compounding factors, which include:

• Cloning process artifacts
• The error rate associated with the sequence reads and mapping
• The reliability of the reference genome

The final output of secondary analysis is a variant call format (VCF) file, in which every variant present in the sequenced sample is recorded. Tertiary analysis aims to make biological sense of the data generated by secondary analysis and is covered in more detail in the following chapter.

COMPUTATIONAL AND STORAGE REQUIREMENTS FOR PRIMARY AND SECONDARY ANALYSIS

With sequencing time and cost no longer a bottleneck, scientists are now faced with computational analysis and storage requirement challenges.

Primary analysis software is largely provided by the sequencing instrument vendors and is designed to keep pace with the throughput of the sequencer. In contrast, secondary analysis is far more computationally intensive.

Secondary analysis applies a given set of algorithms and bioinformatics tools on a per-sample basis. This repeatable process can be placed in an analytic pipeline that is entirely automated. You can fine-tune the pipeline by exploring and optimising your parameter settings to make sure that the steps have been segmented in the most sensible way.

Often, pipelines are monitored through programs that have error reporting, quality control, and performance metric logging to keep improving them.

Over the past several years, the methods and algorithms designed for primary and secondary analysis have matured and there are many industry-recognised open source tools that are freely available to use and modify.

For sequence alignment to a reference genome, the Burrows-Wheeler Aligner (BWA) is an industry favourite for its speed and accuracy. Bowtie is another popular application commonly used for sequence alignment, promoted as an ultrafast, memory-efficient aligner. Single nucleotide variant (SNV) detection has gone through a few generations of algorithmic improvements; GATK, an application for identifying SNPs and indels in DNA and RNA-seq data, has become a commonly used tool. Developed by the Broad Institute, GATK is free for academic use, although there is a licensing fee for commercial use. FreeBayes, as its name implies, is a free, open-source alternative to GATK, used to find SNPs, indels, MNPs (multi-nucleotide polymorphisms) and complex events.

GENOME ASSEMBLY CAN BE DONE IN THREE WAYS:

Reference genome mapping: Reads are aligned against a reference genome, a representative example of a species' genome, and variant calling is performed to highlight the differences between the two.

De novo sequence assembly: The sequence is assembled without the aid of a reference genome. Typically, this is a complex computational process that uses techniques such as constructing de Bruijn graphs from k-mers (for short reads) or an overlap-layout-consensus approach (a toy sketch of the k-mer idea follows this list). Long reads allow you to span certain repetitive and complex elements of the genome that short reads are unable to resolve.

Graph-based reference genomes: A third way of assembling reads involves an advanced form of reference genome based on a "graph", which contains not just a single representative example but data from many tens or hundreds of thousands of individuals in a population. This method holds the potential to be more accurate, less computationally expensive, and also to anonymise individuals while still allowing them to be studied. It also helps to achieve many of the goals of de novo assembly, including identifying structural variation. This is covered in more detail earlier in the handbook.
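As a toy illustration of the k-mer idea mentioned under de novo assembly above, the sketch below builds de Bruijn graph edges from a few invented short reads. A real assembler would then find a path through this graph (and deal with sequencing errors, repeats and coverage), which is far beyond this example.

from collections import defaultdict

def de_bruijn_edges(reads, k):
    # Each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Invented reads covering the short sequence "ACGTGCA"
reads = ["ACGTG", "CGTGC", "GTGCA"]
for prefix, suffixes in de_bruijn_edges(reads, k=4).items():
    print(prefix, "->", suffixes)
# Walking an Eulerian path through these edges reconstructs ACGTGCA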

In terms of speed and resource intensiveness, reference genome mapping is the simpler approach. De novo methods, on the other hand, produce sequences that are free of the errors associated with alignment tools and can detect variation, such as structural variation, that could be overlooked when aligning the sequence against a reference genome.

In either case, scientists require a defined average depth of coverage of the sequence reads over the entire genome or the targeted regions of interest. Depth is measured by how many reads are stacked over a given locus of the genome. For de novo assembly, a higher average depth is typically required; this aids the development of large contigs (sets of overlapping DNA segments that represent a consensus region of DNA), which can form a physical map of the genome and be used to guide the assembly of the draft genome. Higher average depth, in the case of sequence alignment, means more confidence in the consensus sequence of the sample and more accuracy in detecting variants relative to the reference.
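A rough back-of-the-envelope estimate of average depth is simply the total number of sequenced bases divided by the size of the genome (or target region); the figures below are illustrative only.

# Expected average depth: (number of reads x read length) / genome size
reads = 600_000_000             # illustrative: 600 million reads
read_length = 150               # bases per read
genome_size = 3_200_000_000     # approximate human genome size in bases

average_depth = reads * read_length / genome_size
print(f"~{average_depth:.0f}x average depth")   # ~28x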

After genome assembly, variant calling is carried out to identify the variants, or differences, in the assembled genome in comparison with the reference genome. These differences can include single nucleotide variants (SNVs); small insertions or deletions (indels); or larger structural variants, such as translocations, inversions, and copy number variants (CNVs).

SNV identification is used to identify germline mutations, and it is therefore an integral part of genetics-based research. An individual inherits these mutations from his or her parents, and such mutations can occur with some (though limited) frequency across a population, in which case they are termed single nucleotide polymorphisms (SNPs).


WHAT IS DNAnexus?
DNAnexus is a professional-grade platform that makes it easier for users to do three things, each in a secure and compliant fashion:

1. Analyze large amounts of raw genetic data

2. Collaborate around large amounts of data (including but not limited to genetics)

3. Integrate genetic data with other types of data, such as data from electronic medical records, to advance science and improve clinical care

(1) Analysis Of Raw Sequencing Data
The basic idea here is that the machines that are used to read DNA sequence are incredibly powerful, but don't generate a book of information that starts at the beginning of the first chromosome and concludes at the end of the last one. Rather, most sequencing machines spit out phrases of about 100 letters, phrases randomly located anywhere in the 3 billion letter book that is the human genome. A computer must figure out where each individual phrase fits in the book, and must also determine whether there are any typos. This can be a computationally intensive task, but DNAnexus provides a way to do this efficiently, by dividing the task into multiple parallel streams, each of which can be tackled by a powerful computer.

The "computers" DNAnexus uses are run by Amazon Web Services or other providers, and our use of them is an example of what's known as "cloud computing" because the computers operate from a massive, dedicated central facility, rather than from a user's own institution. One advantage of using cloud computing is it's very much "on demand" – i.e. you have essentially unlimited access to as many computers as you need, and you only pay for the computers that you actually use, and only when you are actually using them.

One example of the DNAnexus Platform's scalability was the CHARGE Project, our collaboration with the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine. As part of its participation in the CHARGE consortium, the HGSC utilized the DNAnexus Platform to analyze the genomes of over 14,000 participants, encompassing 3,751 whole genomes and 10,940 exomes. Over the course of a four-week period approximately 3.3 million core-hours of computational time were used, generating 430 TB of results. This data was made available for worldwide integration and collaboration to over 300 researchers.

(2) Distributed Collaboration
Progress in both science and medicine can be accelerated when data can be easily shared. When there are large volumes of data, as is increasingly the case in research and clinical realms, this can be a real problem. Remarkably, the most common method of large-scale data sharing today is FedEx'ing hard drives between institutions. What DNAnexus enables for a distributed team of researchers or clinicians is access to the same data, tools and pipelines at the same time. By bringing together the data, the experts, and the tools for analysis, DNAnexus facilitates collaboration and accelerates understanding.

DNAnexus is ideally suited to power many types of data sharing, involving:

• NIH investigators (as in the case with our work with CHARGE in the area of cardiovascular disease);

• Federal agencies (our work with the FDA on the precisionFDA platform, building a community to advance regulatory science in the area of NGS);

• Diagnostic companies (our work with Natera and CareDx);

• Translational research partnerships (our work with Regeneron and Geisinger Health System);

• Public/private partnership of cancer researchers (our work with ITOMIC led by University of Washington’s Tony Blau).

David Shaywitz, MD, PhD
Chief Medical Officer, DNAnexus

@DNAnexus | info@dnanexus.com | www.dnanexus.com

Our ability to support distributed innovation also enables DNAnexus to provide global support for commercial consortia which have been created by companies like Natera. DNAnexus provides a key component of Natera's Constellation™ bioinformatics platform which, combined with assay kits and protocols that Natera distributes, allows global sequencing labs to access the same analysis pipelines and algorithms that Natera employs in their central laboratories for applications such as NIPT and cell-free DNA analysis in oncology.

(3) Integration With Other Data Types
The insights that may be available in genetic data are often revealed only when the information is considered and analyzed in the context of other data types, such as data from electronic health records (EHR). Integrating genetic and EHR data is fundamental to the drug discovery work of Regeneron, for example. In the same way our partners can easily access and efficiently utilize the fundamental tools of genetic analysis on our platform, so too can they access and utilize the tools required for integrating genetic data with other data types. DNAnexus is adding tools constantly, based on the needs expressed by our partners.

LOOKING AHEAD
Guided by the visionary partners with whom we are privileged to work, DNAnexus continues to enhance our abilities within each of these three areas: DNA analysis, distributed collaboration, and integration with other data types. We are constantly seeking opportunities to leverage the technology we've developed through collaborations with innovative leaders looking to use the power of our platform to approach compelling scientific and clinical challenges.

DNAnexus Made Ridiculously Simple

[Figure: END-TO-END WORKFLOW. LIMS and other upstream systems and sequencers feed into 2° analysis and collaboration (with tools such as GATK and graph-based methods), then into 3° analysis and applications, interpretation/annotation databases, integrated partner solutions and reporting, serving clinical, pharma, research and government users.]

In response to questions I receive from friends and colleagues who ask "What does DNAnexus do?", I thought I might offer a high-level perspective.

A flexible enterprise-grade platform for organizations pursuing genomic-based approaches to health. Laboratory Information Management Systems (LIMS) and sequencing instruments easily integrate with DNAnexus, as well as downstream tertiary analysis and reporting solutions.


Although the analysis tools listed above are packaged for a single purpose, they can be linked together to form a secondary analysis pipeline. This can be done manually on your local cluster with a considerable amount of IT customisation, through an open-source bioinformatics platform like Galaxy, or through commercial bioinformatics platforms like Seven Bridges and DNAnexus. If you have a very good grasp of the IT required, and the scope to carry it out, an open-source platform might be a good fit. However, it will require a more continuous IT effort than a commercial solution would.

Example of a typical secondary analysis pipeline:

1. FASTQ input from primary analysis typically conducted on the sequencer

2. BWA or Bowtie for mapping to the reference genome, which generates a BAM file.

3. GATK or FreeBayes takes the BAM file and identifies variants in the donor relative to the reference genome.

4. The output is a VCF file, which lists all of the donor's variants relative to the reference.
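A minimal sketch of what such a pipeline can look like in practice is shown below, written as a Python wrapper around the command-line tools named above. The file names are placeholders, the reference is assumed to have been indexed beforehand (bwa index, samtools faidx and a GATK sequence dictionary), and real pipelines add steps such as duplicate marking and base quality recalibration.

import subprocess

# Placeholder file names; substitute your own reference and reads
ref, r1, r2 = "ref.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# Map reads with BWA-MEM, then coordinate-sort and index the BAM with samtools
subprocess.run(
    f"bwa mem {ref} {r1} {r2} | samtools sort -o sample.sorted.bam -",
    shell=True, check=True)
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)

# Call variants with GATK HaplotypeCaller, producing a VCF
subprocess.run(["gatk", "HaplotypeCaller",
                "-R", ref, "-I", "sample.sorted.bam", "-O", "sample.vcf.gz"],
               check=True)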

Today, many organisations are participating in global large-scale sequencing projects to study thousands or even millions of genomes, making the challenge of storing and managing NGS data more critical.

The recently published PLoS Biology paper "Big Data: Astronomical or Genomical?" estimates that between 100 million and 2 billion human genomes will have been sequenced by 2025. The storage capacity required for this alone is pegged at ~2–40 exabytes (1 exabyte = 10^18 bytes), which exceeds the projected data storage requirements of three other major big data generators: YouTube (a projected 1–2 exabytes), Twitter (estimated to require 1–17 petabytes per year; 1 petabyte = 10^15 bytes) and the Square Kilometre Array (SKA), which might create demand for around 1 exabyte of storage capacity.

On average, the storage space required for analysing a whole genome via an Illumina HiSeq is ~200 GB. Given the variation between human genomes, the storage requirements for a large-scale genome sequencing project are huge. For example, the 1000 Genomes Project comprises more than 200 terabytes of data for its 1,700 participants. The analysis costs associated with such a large project can sometimes exceed the reagent costs, given how significantly the cost of sequencing itself has fallen.

CLOUD COMPUTING AS A SOLUTION

Converting DNA into meaningful genetic information involves extensive computational resources dedicated to the bioinformatics of secondary analysis, as well as considerable data storage capacity. With research projects involving the sequencing and analysis of tens of thousands to millions of genomes becoming the norm, many organisations are finding that their local clusters cannot keep pace with the sequencing volume. Cloud computing is, at present, the technology best able to keep pace with data at this scale. Accordingly, the genomics industry is turning to cloud approaches to meet its need for scalable computation and storage.

Cloud service providers like Amazon Web Services or Google Cloud offer scientists access to powerful computational resources without the investment in costly on-premise infrastructure. Users are able to access the resources on an as-needed basis to run big genomic analysis pipelines, store petabytes of data, and share results with collaborators around the world.

While some organisations are capable of building and housing the data storage and computational resources needed to analyse large genomic datasets, this may not be the most cost-efficient approach. One of the major advantages of cloud-based approaches is the flexibility they offer. In many cases, the demands on your computational resources will have peaks and troughs. Depending on your usage demands and patterns, a 'pay as you use' model may be more cost-efficient than building everything in-house.

This complexity is why many organisations choose to use a commercial bioinformatics platform built on top of the cloud: they get the scalability benefits, without the need to work on software updates, optimisation, authentication, authorisation, collaboration, security, or compliance issues.

Many of the popular bioinformatics applications for genomics research are parallelisable, which makes them well suited to running in a cloud environment. While larger-scale users tend to have clusters in-house, many of their workloads are erratic, and they need integrated ways to push analysis to the cloud when they lack sufficient compute resources on site. Some commercial providers even help to build such 'hybrid' clouds. For example, Seven Bridges is participating in collaborative research with the Precision Medicine Initiative in the United States by helping the Million Veteran Program more easily create a hybrid cloud. The cloud actually makes companies' existing infrastructure investments more valuable: they do not need to worry about overprovisioning, and when they need more compute resources they can burst into the cloud rather than increase hardware investment.

While cloud service providers such as Amazon Web Services and Google Cloud Platform do support data management, storage, compute, and security and compliance tools, there are still gaps. Users who build their own solutions on the cloud still need to implement a formal Information Security Management System to ensure the highest level of compliance with clinical regulations.

As big data moves to the cloud, new standards will need to emerge for discovering and querying datasets, as well as for authenticating requests, encrypting sensitive information, and controlling access. The Global Alliance for Genomics and Health and others are working together to develop approaches that facilitate interoperability.

SUMMARY
In this chapter we have looked at one of the most important parts of genomics: turning raw data into something you can use. For microarrays, there is a wealth of standardised, easy-to-use analysis options. For NGS, things are a little more complicated and potentially require considerably more resources.

In the next chapter we take a closer look at what you can do with your NGS data to add biological context to it.


CHAPTER 4:

NGS INTERPRETATION AND DISCOVERY

SPONSORED BY


Genomic Big Data Interpretation & Discovery


INTRODUCTION
Interpretation and discovery is where the genome meets medicine.

Up to this point we have explored the history of sequencing, and the steps required to generate high-quality sequence data. But reading the genome sequence is just the first step. Turning this information into meaningful insights into disease involves correlating variation in the sequence with phenotypes, namely diseases or other traits. This can be done for a single patient, as clinical interpretation, or at large-scale for research and discovery. In this chapter we will explore the complex challenges associated with mining vast genomic datasets for actionable information.

Developments in sequencing technology, explored in previous chapters, have created a tidal wave of genomic data available to clinicians and scientists. Capturing the full medical benefit of this information requires the ability to go beyond scanning for what is already known (regions of the genome that are known to be linked to disease and are well-documented in reference databases and panels) and begin to efficiently interrogate whole exomes and genomes.

For example, research into rare diseases has revealed that the majority of disease-causing variants have either never been seen before, and so are not in the literature or existing gene panels, or are found in the patient but not inherited from either parent (de novo variants).

Effective interpretation and discovery requires managing and using data on an unprecedented scale, and connecting vast data collections around the world. Limitations and challenges in interpretation have become a critical bottleneck in the progression of personalised medicine. For precision medicine to become a standard part of global healthcare, clinical diagnoses need to be made and confirmed at the same speed with which we perform other complex tasks that rely on accessing large-scale data over the internet.

There are three main challenges associated with the actual process of genomic interpretation and discovery:

• Scale: The vast size and complexity of raw genomic data.
• Power: Limited diagnostic and discovery yield when we seek to fully exploit all of the available data.
• Reach: The increasing need to connect data sets and link interpretive tools worldwide.

Finally, we shall take a look at the regulatory issues surrounding unknown variants. A single NGS test has the potential to identify thousands of variants that could be used in a diagnosis, but in order to form the basis of a diagnosis the test must meet regulatory standards. We shall explore the regulatory challenges surrounding interpretation and discovery.

[Figure: From data generation (clinical, population, discovery and wellness sequencing), through scalable, seamless, normalised, global, cloud-based infrastructure, to precision medicine. The sheer volume of data generated by NGS is creating a bottleneck, slowing the development of precision medicine.]


SCALE
Only a few years back, genomics was relatively data poor. SNP genotyping enabled broad coverage of the genome but was useful principally for identifying common variants and assigning risk for common diseases. Finding rare variants was slow, painstaking and expensive, requiring sequencing of individual genes and the steady but very slow compilation of disease-linked variant panels.

The good news was that genotyping in this way did not generate very large quantities of data. What data was produced could be stored and analysed using standard database technology.

The emergence of NGS changed all of that. Following the advent of advanced sequencing techniques, the raw data from a single exome, the coding region of a genome, can now weigh in at more than 10 gigabytes (depending on read depth). An entire human genome at 30X depth generates a file of approximately 90 gigabytes. In context, a computer with a 1 terabyte hard drive can store only around ten individual genomes, hardly enough for a large-scale research project. Data on this scale, particularly querying thousands of genomes simultaneously, is overwhelming for standard databases and all but the largest IT systems. Even when these data can be stored, mining sequences intensively is extremely slow, because the time it takes to complete a computation is limited by the input/output channel. This issue is further compounded as the number of analysed samples increases.

The answer to this problem is to develop new and more computationally efficient data architecture. The most widely used solution for large-scale research and diagnostics is the genomically ordered relational database (GORdb), developed a decade ago in Iceland for the world's largest population genomics research effort. Offering a different approach to data storage and retrieval, this system is now being further refined and deployed around the world.

Traditional relational database systems were designed principally for finance and banking, to perform lots of small operations on relatively simple datasets of reasonable size. This makes them ill-suited to the task of dealing with high volumes of sequence reads and variation data, because they are built for lots of small transactions and have legacy command and data structures made to perform those original tasks.

By contrast, GOR was designed from the ground up specifically to address the need to store and interrogate genomic data. GOR databases resolve issues with input/output lag time and computer crashes by storing sequence data according to its inherent structure – its position on human chromosomes – with underlying data structures and commands created specifically for genomically-ordered tables. This enables genomic data and annotation data – including reference datasets – to be stored and updated separately and queried rapidly. Applications built on GOR databases can retrieve individual reads quickly, correlating only the relevant bits of data before moving on. Massive amounts of sequence data can be interrogated in minutes rather than days or weeks.
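The underlying idea can be illustrated without any special technology: once records are kept sorted by chromosome and position, a range query becomes a binary search rather than a scan of the whole table. The Python sketch below is not GOR itself, merely a minimal demonstration of that principle with invented records.

import bisect

# Variant records kept sorted by (chromosome, position); values are invented
records = [
    ("chr1",    14370, "G", "A"),
    ("chr1",  1014543, "T", "C"),
    ("chr2",   230845, "C", "T"),
    ("chr2", 47641559, "A", "G"),
]
keys = [(chrom, pos) for chrom, pos, _, _ in records]

def query(chrom, start, end):
    # Return all records in chrom:start-end using binary search on the sorted keys
    lo = bisect.bisect_left(keys, (chrom, start))
    hi = bisect.bisect_right(keys, (chrom, end))
    return records[lo:hi]

print(query("chr2", 1, 1_000_000))   # [('chr2', 230845, 'C', 'T')]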

In a clinical context this allows for the rapid filtering of patient genomes against gene lists, public and private gene and annotation databases, all of which can be harmonised to the GOR format. In research, case-control analyses involving tens of thousands of genomes can take minutes rather than hours or days.

POWER
The ability to mine sequence datasets is only useful if potentially pathogenic variants can be efficiently identified. The diagnostic yield of a system is directly related to its ability to compare samples to databases of existing genomic data, access to extensive reference libraries, and an ability to predict deleterious variants, even if they are novel and not previously annotated. Creating a seamless link between this information and the clinic is a critical step in advancing both precision medicine and genomic research.

SNIP SNIP
SNPs, single nucleotide polymorphisms, or 'snips', are one of the most common forms of genetic variation. A SNP is a single base-pair mutation at a specific location in the genome. In humans, SNPs can be associated with disease susceptibility. Conditions such as sickle-cell anaemia and cystic fibrosis have been linked to specific SNPs.


A crucial component of the effective use of this form of genome exploration is the ability to instantly visualise raw genomic data reads, showing what is happening at each base in a sample. This enables visual confirmation of statistical or summary findings and a check on the quality of the sequencing. If data is normalised, any variants can be viewed in the context of standard and customised or proprietary reference sets, such as the Human Genome Project, all in the same format.

Once exploration of the genome has revealed an interesting variant, its effect on human phenotype can be indicated or confirmed in another genome with the same disease, usually through a search of the published medical literature. There currently exist over 60 million articles and guidelines, and new information is constantly becoming available, once again presenting a significant problem for data access and handling. Initiatives are underway to centralise this information, such as the government-funded ClinGen and ClinVar systems, which aim to create open-access resources defining the clinical relevance of genes and variants respectively.

Another approach for replicating or confirming variants that may not yet be in the literature or databases is the development of data-sharing 'beacons'. One such beacon has been developed by the NIH, and another by the Global Alliance for Genomics and Health (GA4GH). Beacons are web servers housing genome data, submitted by contributing institutions, which can answer specific questions about the presence or absence of particular alleles at a particular genomic location. A researcher can ask a beacon "Do you have any genomes with an 'A' at position X on chromosome 6?" and the beacon responds simply "Yes" or "No".
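In programmatic terms, a beacon lookup is just a small web request. The Python sketch below shows the general shape of such a query; the endpoint URL is invented and the parameter names are illustrative (they follow the general style of the GA4GH Beacon API, but any real beacon's own documentation should be consulted).

import requests

BEACON_URL = "https://example-beacon.org/query"   # invented endpoint

params = {
    "referenceName": "6",       # chromosome
    "start": 123456,            # position
    "referenceBases": "G",
    "alternateBases": "A",
    "assemblyId": "GRCh38",
}

response = requests.get(BEACON_URL, params=params, timeout=30)
print(response.json().get("exists"))   # True/False: is the allele present in this beacon?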

As well as interrogating the sequence data, there are databases available for exploring the biology of genetic diseases. Archives such as the Genetic Association Database (GAD) and Online Mendelian Inheritance in Man (OMIM) contain detailed information on genetic diseases, their biology, their relationship to relevant genes and the complexity of the gene(s) associated with the disease.

Sequence information

Looking for disease-causing variation within the genome typically begins with checking the sequence against lists of known disease-associated genes and reference sources. In clinical diagnostics, for example, this could involve using a gene panel test that examines specific regions of the genome looking for known alterations that are linked to disease. For example, the TaGSCAN (Targeted Gene Sequencing and Custom Analysis) screening panel examines 514 genetic regions that have been associated with childhood diseases. There are gene panels available from a wide array of companies that can be used for the identification of carrier status, assessment of disease risk, and diagnoses.

While this approach is a valuable start in interrogating a genome, detailed knowledge of individual variants (the “Known-Knowns”) is not always extensive enough to obtain an answer. In fact, this approach will only solve 20-25% of rare disease cases.

If comparative methods fail to generate a result, the next step is typically a systematic search of the genome that filters the information for a range of different genetic features (a minimal filtering sketch follows this list). These include:

• Population Allele Frequency: Tools like the Exome Aggregation Consortium (ExAC) allow researchers to identify rare, potentially disease-linked variants within a cohort of over 60,000 individuals.

• Variant Impact: Tools like the Variant Effect Predictor (VEP) can predict the impact that identified variants will have on genes, transcripts and proteins. This analysis is based upon the location of a variant within a gene and the expected effect that a mutation will have on its product (most often a protein).

• Inheritance Model: Such as autosomal dominant or recessive. This also includes de novo mutations, where a variant has arisen spontaneously and has not been inherited.

• Paralogs: These are usually “silent” second copies or versions of genes that have been kept in the genome over the course of evolution, but can still be or become functional.
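A minimal sketch of this kind of filtering is shown below: it scans a VCF file for variants that are both rare and predicted to be damaging. It assumes the file's INFO column has been annotated with an allele frequency (AF) and an impact (IMPACT) field, and that each record has a single alternate allele; the field names depend entirely on how the file was annotated, so adjust them to your own data.

def parse_info(info_field):
    # Turn "AF=0.0004;IMPACT=HIGH;..." into a dictionary (flags without '=' are skipped)
    return dict(item.split("=", 1) for item in info_field.split(";") if "=" in item)

def rare_high_impact(vcf_path, max_af=0.001):
    hits = []
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):          # skip header lines
                continue
            chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
            fields = parse_info(info)
            af = float(fields.get("AF", 1.0))
            if af <= max_af and fields.get("IMPACT") in {"HIGH", "MODERATE"}:
                hits.append((chrom, pos, ref, alt, af, fields.get("IMPACT")))
    return hits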

Visualisation of raw genomic data reads shows what is happening at each base in the sequence, and how that compares to reference sequences.


EXPLORING A QUADRILLION BASES TO FIND THE ONE THAT MATTERS

THE OS OF THE GENOME
NO ONE PUTS THE FULL POWER OF THE GENOME AT YOUR FINGERTIPS LIKE WUXI NEXTCODE. INTERPRET CLINICAL CASES WITH UNRIVALLED POWER; MINE POPULATION WGS IN MINUTES; JOIN COHORTS ON MANY CONTINENTS. ALL YOUR DATA AND RESULTS BACKED BY ALL KEY GLOBAL REFERENCE SETS AND COLLECTIONS – IN ONE SYSTEM, ONE FORMAT, AT RAW RESOLUTION AND IN REAL TIME.

FIND OUT WHY IN HEAD-TO-HEAD COMPARISONS AGAINST EVERY PLATFORM ON THE PLANET, WE TAKE OUR PARTNERS FARTHER AND FASTER.

GENOMIC BIG DATA SOLVED
Speed. Our Genomically Ordered Relational Data (GOR) model does what others can't: it enables on-the-fly queries and joins of massive sequence data, wherever it resides. By relying on the structure of the genome itself, it can perform complex analyses in minutes, not weeks, and provides always-on raw sequence visualization at a click.
Scale. GOR technology manages, mines, interprets and connects more genomes than any other. It is the only system built and proven at population scale, underpinning the power of our tests and scans, as well as the largest and leading precision medicine efforts on three continents.
The global standard. This scalability, coupled with our pioneering Deep Learning capabilities, creates an insight engine to benefit patients everywhere. Our unparalleled dynamic knowledge base includes curated data from all key public datasets as well as the world's largest proprietary genomics datasets.

SEAMLESS DIAGNOSTICS AND DISCOVERY
Broad. Our expertise and tests span rare disease, cancer, public health and wellness.
Powerful. We don't rely solely on the known annotations that panel tests do. By systematically scanning the entire genome, our system has solved years-long diagnostic odysseys in hours and points straight into the biology to find actionable results.

Efficient. We let you instantly filter variants by frequency, impact and mode of inheritance, scan for de novos, and validate immediately by viewing raw reads - shaving days from once tedious workflows.
Integrated. Since so many causative variants are novel, we use GOR to bring diagnostics and discovery together. Toggle between clinical and research datasets, increasing diagnostic yields, uncovering new targets, and following them up in fast and customized case-control studies.

A GLOBAL ECOSYSTEM FOR PRECISION MEDICINE
Soup to nuts, worldwide. Our menu of products and services includes CLIA-certified sequencing, secondary analysis, data storage, diagnostics, discovery and product development. Choose a la carte or get a complete turnkey workflow.
Let the cloud do the heavy lifting. As well as onsite installation, you can also run our entire system in the cloud. You'll get elastic scalability, compliance and the best-in-class security that our partnership with DNAnexus brings.
The internet of DNA. Powered by GOR, the largest global network of genomes makes it possible to collaborate instantly with colleagues and institutions around the world, using full-resolution sequence data without moving big files. This capability accelerates diagnostics and discovery, to the benefit of patients and populations everywhere.

[email protected] | Cambridge | Reykjavik


On the basis of these analyses, gene variants are typically classified as pathogenic, benign, or variants of unknown significance (VUS). This information can be used to support a clinical genetic report, which we will explore in more detail in the following chapter.

Case-control studies

Comparing disease and control cases across large groups of patients presents a very different interpretive challenge. Here the genomes of hundreds or even thousands of patients who have a particular disease may be compared to the genomes of many times more control subjects. Genome-wide association studies (GWAS) have been used to compare many common genetic variants to identify loci that may be linked to disease (typically using genotyping arrays). With the falling cost of whole genome sequencing (WGS), however, many researchers are now turning to WGS analysis to achieve base-level granularity and detect disease associations even with low-frequency alleles.

The growing number of national genome projects, such as the UK's 100,000 Genomes Project and the Qatar Genome Project, are conducting this research as part of the wider integration of genomic medicine into their national health programmes. This process requires enormous quantities of data, and is an area where improved database architecture has enormously improved the speed of genome interrogation.

In rare disease cases, particular variants are rare but many may cluster in a particular gene. To increase the power of a case-control analysis, variant aggregation tools can be used to collapse all variants in a gene associated with a disease into one pseudo-variant, which can substantially increase the power of association studies.

One such use is the identification of rare variants that point to potential drug pathways in common diseases: that is, using rare disease genetics not only to diagnose individuals but to develop drugs that treat common diseases or phenotypes. An example of this is PCSK9, which was identified as a potent modulator of LDL cholesterol levels through gene discovery in families with rare variants. Subsequent discovery work on the pathway established that compounds inhibiting PCSK9 could lower LDL cholesterol, an important public health impact especially for those resistant to or intolerant of statins, the most common cholesterol-lowering drugs. Two such inhibitors were approved by the FDA in 2015 and are now on the market.

REACH

Storing, accessing and mining data in situ is the first part of the challenge surrounding interpretation and discovery. The second part, which is set to be a crucial game-changer, is the ability to work with these datasets online, from anywhere in the world. The beacons being developed by the Global Alliance are a simple example of how this can work, but in future it will become critical for researchers and clinicians to go beyond asking basic questions.

Given the scale of genomic data, the standard approach is to hold the genomic information in one central database and allow researchers and clinicians remote access, sometimes accompanied by a suite of cloud-based analytical tools. Genomics England and the 100,000 Genomes Project aim to be a good example of this. De-identified patient data is placed in a central data centre, accessible by approved researchers, doctors, nurses and other healthcare professionals. Research users have restricted, remote access to datasets that contain only the information needed for their specific study. Genomics England is able to provide IT support, computing infrastructure, genome analysis tools and other technical services through partnerships with a range of companies and organisations.

One of the critical future challenges will be to enable researchers to interrogate multiple large data stores at once, rather than one by one. Given the amount of sequencing going on around the world – and the number of large-scale projects aimed at discovery for improving diagnostics and therapies linked to them – this is an area of great potential.

In the spring of 2016 the Simons Simplex Collection in autism, comprising some 10,000 whole exomes, was made available in GOR format on the WuXi NextCODE Exchange to the global autism research and clinical community. This represents the first online use of large-scale genome data at full resolution, and points to the likely rapid development of this capability in the near future.

REGULATORY CHALLENGES

As we have already discussed, NGS produces enormous quantities of data, and has the potential to identify thousands of variants that may be disease-linked. This creates a significant challenge for regulators. For a diagnostic genetic test to gain regulatory approval, and so be clinically useful, the U.S. Food and Drug Administration typically requires that the variant identified by the test is reported, and is known to be associated with a disease. Test developers must show clinical significance before approval can be given.

However, to date a relatively small number of disease-linked variants have been identified, and often NGS tests are used precisely because they can routinely detect rare variants that may not be identified by established tests.

At present, discussions around how to effectively regulate NGS tests are ongoing, and one focus is to evaluate the methodology of NGS interpretation, as well as previously observed links, combined with an ongoing assessment of clinical outcomes.

ON TO THE CLINIC

As we have outlined in this chapter, the challenges involved in interpreting the tidal wave of NGS data are considerable, but not insurmountable. There are numerous initiatives underway to ease the bottleneck and speed the development of precision medicine systems.

In the next chapter we will look at genomics in the clinic, where the results of interpretation and discovery are turned into actionable results for clinicians and patients.


CHAPTER 5:

NGS IN THE CLINIC

SPONSORED BY


INTRODUCTION
In the preceding chapters we have explored the process of collecting and analysing genomic sequence data, in a manner that could be applied both to research and to clinical diagnostics. For this chapter we will be focussing entirely on genomics in the clinic, and how the outcomes of genomic tests are communicated to clinicians and patients by the clinical laboratories conducting them.

Creating an accessible, useful report based on NGS information and analysis for a physician is one of the most challenging areas of clinical genomics. High quality patient care is dependent on a written report that is easy to understand and easy for physicians and genetic counsellors to act upon.

Traditionally, laboratory tests have looked for specific genetic variants with known disease outcomes. But with the tidal wave of information generated by NGS tests, clinical laboratories are faced with the thorny issue of how to present information on thousands of genetic variants, many of which will have inconclusive clinical outcomes.

Deciding what to include in a molecular genetics report is essentially a Goldilocks problem: a report should provide just enough information, but not too much; the right level of detail, without becoming too complicated; and be succinct, while still including all the relevant information.

We have previously examined the different steps involved in evaluating the clinical relevance of different variants. Here we will take a look at what happens in the clinic, starting with selecting the right test for a patient, and how to present the test outcomes in an effective clinical report. We will look at the approaches and challenges associated with clinical reporting, and explore what current best practice looks like.

CHOOSING THE RIGHT TEST
There are four main types of NGS tests available to patients, which provide varying degrees of genome coverage and varying levels of diagnostic detail. The type of test that a clinician will choose is generally determined by a patient's symptoms and medical history.

Targeted gene panel testing

A targeted gene panel test is typically used when a patient's symptoms and medical or family history strongly indicate a particular genetic condition associated with a small number of specific genes. Targeted panels explore the exons (coding regions) of anywhere from 20 to over 100 genes known to be disease-linked, and while the test examines several different genes, the analysis is targeted at a specific genetic condition. Diseases for which targeted gene panels have been developed include epilepsy and hearing impairment.

Medical exome sequencing

Whereas a targeted gene panel looks for known or specific variants in a few disease-linked genes, a medical exome test takes a broader diagnostic approach, exploring the exons in up to 4,600 disease-linked genes.

Whole exome sequencing

Exploring the entire genome sequence is still relatively expensive, making whole exome sequencing a more cost-effective alternative for patients and healthcare providers. A clinician is only likely to order a whole exome or a whole genome test when a panel of genes is not available for a particular condition, or when the diagnosis is very unclear.

The human exome covers the coding regions of approximately 20,000 genes, so in the first instance the whole exome analysis will focus on genes believed to be linked to the patient's condition. However, the analysis can be extended to cover more of the exome if the first results are inconclusive.

Typically, whole exome sequencing is conducted as a "trio analysis", meaning that the patient and both parents are sequenced and analysed. This way, genetic variants found in the patient can be compared with those in their parents.

Whole genome sequencing

This approach sequences and analyses the patient’s entire genome, and is typically only used when a patient is very ill and previous tests have proved inconclusive. At present, sequencing and interpreting an entire genome is extremely costly, and the clinical usefulness of this test can be impaired by the sheer quantity of data generated.

[Figure: structure of a gene, showing alternating exons (coding) and introns (non-coding) between the start and end of the gene.]


REPORTING THE RESULTS

As yet there is no agreed industry standard for reporting genomic test results, but several professional organisations have laid out guidelines for the sector. In the US, the American College of Medical Genetics and Genomics produced a document entitled "Standards and Guidelines for Clinical Genetics Laboratories", while in the UK the Association for Clinical Genetic Science has produced a series of guidelines for genomic reporting.

The style and content of the report vary depending on the type of test ordered, but broadly speaking the majority of clinical genomics reports contain:
• The patient's results and the interpretation of the results;
• The location of the identified variant(s);
• The type of test used, the methodology, and its limitations;
• Any secondary findings, if requested by the patient.

The results of a genetic test show one of two things: either there is a deviation in the usual sequence of a gene found in the sample provided by the patient, or there is not.

During the interpretation and discovery process, a patient's genes (or entire exome/genome) are compared to those from other people with a similar condition, to identify possible disease-linked variants. As a result, the test may identify a variant that is definitely known to be the cause of the patient's condition, or to contribute to it in some way. This is a "pathogenic" or "disease-causing" variant.

Our knowledge of genetic variants and their associated diseases is still relatively small, and as a result a genetic test may well uncover a variant with an unknown or uncertain link to the patient’s condition. This is a “variant of unknown significance”.

No two humans have identical genomes; genetic variation is a natural part of the genome. These variations may well be picked up by a genetic test, but because they are not linked to a disease condition they are called “benign variants”.

Finally, there is a chance that any genetic test may pick up a known disease-linked variant that is not associated with the patient’s condition. This is called a “secondary finding”, and there is an ongoing and extensive ethical debate within the genomics community about how best to communicate these findings to patients.

TO KNOW OR NOT TO KNOW?
During a whole exome test to diagnose a patient's rare condition, the clinical laboratory conducting the test makes a secondary finding, namely a mutation in the patient's BRCA1 gene that massively increases their lifetime likelihood of developing breast cancer. However, the patient has specifically asked not to be notified of secondary findings. What is the right course of action?

This debate is ongoing in the clinical genomics community. On the one hand it seems morally wrong not to inform the patient about a potentially lethal gene mutation that they are otherwise unaware of, even if they have not consented to receive that information. However, patients do and should have the right to autonomy over their medical information and how it is used.

One solution, recommended by ACMG, is to have a minimum list of known conditions, such as BRCA1 mutations, which are routinely evaluated and reported on as part of a genetic test. These results would be reported without seeking preferences from the patient.

What do you think?



THE ETHICS OF...
NGS tests are a relatively new clinical tool, and as a result there is no one established industry standard for handling the array of issues that can arise from NGS data. Here we consider three of the major issues facing clinical laboratories and healthcare practitioners: secondary findings, data re-analysis, and patient access to raw data. While guidelines exist for how to approach all three topics, the final decision is at the discretion of the individual clinical laboratory.

Secondary findings

The more comprehensive a genetic test, the more genes sequenced and explored, the higher the likelihood that the analysis will throw up unexpected results or “secondary findings”. How to report secondary findings, whether to report them, and issues around patient consent are an ongoing area for discussion within the research and clinical communities.

Secondary findings are genetic variants not linked to the condition being tested for, which may be linked to other disease conditions that could either affect the patient in later life, or could affect the health of future offspring. During genetic diagnostics many laboratories may conduct a separate secondary finding analysis that specifically looks for alterations in the 56 genes recommended by the ACMG. Patients can choose to opt out of receiving this information if they wish. Conditions on the ACMG list include familial cancer predisposition, such as mutations in the BRCA1 or BRCA2 genes.

The likelihood of generating secondary findings is relatively low, particularly in targeted gene panels that only look at a narrow subset of specific genes. Even in whole exome sequencing the chance is low if the analysis is restricted to a specific condition. During a targeted analysis of a whole exome sequence it has been estimated that an unexpected variant will occur once in every 100 tests. If the whole exome is searched, that likelihood increases to 3 in every 100 tests.

Data re-analysis

Over time, as our knowledge of the genome increases, patients whose tests were previously unsuccessful may find themselves in a position to obtain a diagnosis. A crucial part of developing the clinical reporting system will be futureproofing: ensuring that in the future patients can come back for a diagnosis as the science advances. Again, how to handle data re-analysis currently comes down to the discretion and capabilities of the individual clinical laboratory.

For example, how much patient data should a clinical laboratory store in order to support a re-analysis? As with the formatting and content of a clinical report, there is no hard and fast industry standard. One option is to store only the list of variants discovered by the test, in VCF or "variant call format", rather than the complete exome or genome sequence. This solution places less strain on a laboratory's data infrastructure, but there is the risk that the VCF file may not contain the relevant variant. The solution to that problem is to store the entire sequence data collected during the test, but as discussed in the previous chapter this can place significant strain on a facility's data storage capacity.
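To make the storage trade-off concrete, the short sketch below shows the kind of record a VCF holds for each variant: one tab-separated line per call, listing the chromosome, position, reference and alternate alleles plus basic annotations. The example line and field handling are illustrative only, not a complete VCF parser.

# A minimal sketch of what a VCF (variant call format) record contains.
# One tab-separated line per variant call; the line below is a made-up example.

vcf_line = "1\t123456\t.\tG\tA\t99\tPASS\tDP=52"

fields = vcf_line.split("\t")
record = {
    "chrom": fields[0],      # chromosome
    "pos": int(fields[1]),   # 1-based position
    "id": fields[2],         # variant identifier, if any
    "ref": fields[3],        # reference allele
    "alt": fields[4],        # alternate allele observed in the patient
    "qual": fields[5],       # call quality
    "filter": fields[6],     # PASS or a filter name
    "info": fields[7],       # annotations, e.g. read depth
}
print(record["chrom"], record["pos"], record["ref"], ">", record["alt"])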

Patient data access

Patients may well request access to the raw data from their NGS tests, often because they are interested in looking for variants that may be relevant to their condition that were not included in the laboratory report. This enables patients to take control of their information, allowing them to follow up on unreported findings with their clinicians or to monitor the scientific literature for new information about the function or disease association of these variants. And as with secondary findings and data re-analysis, the decision to provide this data to the patient, and the format of the data (raw sequence data or VCF) is at the discretion of the individual laboratory.

The key concerns around providing raw data to patients focus on expertise, and the limitations of the tests themselves. A certain degree of training is required to evaluate and interpret raw data files, and the risk is that patients or clinicians may over-interpret the findings, ascribing significance to certain variants whose effects may be relatively benign or harmless.

NGS tests are also not completely accurate; there is a risk of false positives – identifying variants that are in fact not present – and false negatives – variants reported as absent that are actually present. To address this problem, in most cases clinical laboratories will use a second sequencing method, such as Sanger sequencing (see chapter 2), to confirm the validity of any findings. A patient looking at their raw data would not be able to confirm whether a change is real. p.49


ACMG RECOMMENDATIONS ON SEQUENCE VARIANT INTERPRETATION: IMPLEMENTATION ON THE BENCH NGS PLATFORM

Berivan Baskin1, Ph.D, FACMG, FCCMG; Steven Van Vooren2, Ph.D
1. Molecular Genetics Laboratory, Uppsala University Hospital, Sweden
2. Cartagenia Inc (a part of Agilent Technologies), 485 Massachusetts Avenue, Suite 300, Cambridge, MA 02139, USA

INTRODUCTION
In a recent paper, the American College of Medical Genetics and Genomics published standards and guidelines for the interpretation of sequence variants1. The College made these available as an educational resource for clinical laboratory geneticists to help them provide quality clinical laboratory services. Although adherence to these standards and guidelines is voluntary and cannot replace the clinical laboratory geneticist's professional judgment, the recommendations represent a broad consensus of the clinical genetics community. With increasing volumes and the use of large gene panels (clinical, full exomes and even full genomes) in routine clinical genetics practice, labs need strong informatics tools that support them in the automation and standardization of variant assessment and reporting, in order to benefit from community standards and to keep up with the best standard of care. In this case study, we showcase how Cartagenia Bench Lab NGS enables labs to implement their take on the ACMG recommendations. The Molecular Genetics department at Uppsala University Hospital illustrates how it has implemented the recommendations in its specific routine diagnostic setting, using a flexible, drag-and-drop interface to build and store the lab's variant triage protocol.

KEY REQUIREMENTS
The standards and guidelines describe an evidence-based approach for the assessment of variants of clinically validated genes. The recommendations use literature and database-based criteria to classify variants in five different categories: benign, likely benign, uncertain significance, likely pathogenic and pathogenic. Evidence levels are weighted (e.g. "Strong", "Moderate"). To allow labs to automate their implementation of this evidence-based approach, a number of specific tools are required.

ANNOTATION SOURCES, SUCH AS POPULATION, DISEASE-SPECIFIC, AND SEQUENCE DATABASES
The guidelines recommend the use of a wide range of criteria. Examples include: population databases such as the Exome Aggregation Consortium (ExAC, http://exac.broadinstitute.org/); disease databases such as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar); and sequence databases such as RefSeq (http://www.ncbi.nlm.nih.gov/refseq/rsg). With Cartagenia's Bench NGS platform, labs can integrate and use a wide range of community-accepted resources, including the tools and data sources recommended in the ACMG guidelines. Moreover, with Bench NGS, labs can benefit from full version control and traceability on these resources.

Figure 1. Partial view of the Uppsala University Hospital decision tree representing their filtration strategy, investigating public and in-house variant databases, modes of inheritance, population frequency statistics databases, and variant coding effect. Top: decision tree. Middle: currently selected ACMG category PP5. Bottom: variants matching selected criteria. (Courtesy of Dr. Berivan Baskin)

RULES FOR COMBINING CRITERIA TO CLASSIFY SEQUENCE VARIANTS: DECISION TREES AND SCORING
The guidelines recommend a broad set of informative criteria for assessing the clinical impact of a sequence variant. With each criterion, they also provide a level of evidence strength. For example, for a de novo variant in a patient with the disease, no family history and with both maternity and paternity confirmed, the evidence to classify the variant as 'Pathogenic' is suggested to be 'Strong'. Other levels are 'Very Strong', 'Moderate' and 'Supporting'. The guidelines also propose a scheme of rules by which labs can combine different criteria for classifying variants with different levels of evidence. In order to automate such a scheme, a tool set is required to represent rules in a workflow, and associate scores to variants accordingly. Bench NGS elegantly provides such a system by means of classification trees.

The user can choose from a library of filter components that each represent filter criteria such as 'population frequency', and can drag-and-drop these into a decision tree.

Then, classifications (e.g. ‘likely benign’) can be assigned to the outcome of a particular branch, and labels can be used to annotate statuses, review actions or levels of evidence (e.g. ‘PVS1’, with which the guidelines represent very strong evidence of pathogenicity, or ‘review’, which prompts the lab to investigate a variant further).
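As a purely illustrative sketch of how such a tree can be automated, the toy example below walks a handful of filter criteria, attaches evidence labels, and combines them into a classification. The thresholds, labels and combining logic are simplified inventions for illustration; they are not the ACMG combining rules or Cartagenia's implementation.

# A toy variant-triage tree in the spirit of the drag-and-drop decision trees
# described above. The thresholds, labels and combining logic are simplified
# illustrations, not the ACMG rules or any vendor's implementation.

def collect_evidence(variant):
    """Walk a tiny 'decision tree' and return a list of evidence labels."""
    evidence = []
    if variant["population_frequency"] > 0.05:
        evidence.append("common_in_population")    # points towards benign
    if variant["in_disease_database"]:
        evidence.append("reported_pathogenic")     # points towards pathogenic
    if variant["de_novo_confirmed"]:
        evidence.append("confirmed_de_novo")       # points towards pathogenic
    return evidence

def classify(evidence):
    """Combine evidence labels into a classification (toy rules)."""
    pathogenic_hits = {"reported_pathogenic", "confirmed_de_novo"} & set(evidence)
    if "common_in_population" in evidence and not pathogenic_hits:
        return "likely benign"
    if len(pathogenic_hits) == 2:
        return "likely pathogenic"
    return "uncertain significance"

variant = {"population_frequency": 0.0001,
           "in_disease_database": True,
           "de_novo_confirmed": True}
print(classify(collect_evidence(variant)))   # -> "likely pathogenic"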

IMPLEMENTATION
The molecular genetics laboratory at the Uppsala University Hospital has implemented the ACMG guidelines on the Cartagenia Bench NGS platform and validated their approach on a set of clinical cases. The lab has implemented different criteria as well as levels of evidence in a decision tree, partially shown in Figure 1. In this view, a validated pipeline is run on a Connective Tissue Panel sample, showcasing a variant in the COL1A2 gene that is reported as clinically relevant. The protocol represented by the tree has checked all variants in the assay, and highlighted the p.Gly949Ser variant for review. The clinical geneticist then verifies relevant sources – in this case: ESP, 1000 Genomes, ExAC, HGMD, in silico score annotations from the ACMG-recommended SIFT, Mutation Taster and PolyPhen, and a confirmed spectrum of missense mutations in the gene at hand. Parental samples tested negative for this variant.

CONCLUSION
With this case study, the lab has illustrated how various features of the Cartagenia Bench Lab NGS platform were used to implement an automated Standard Operating Procedure that reflects how the lab performs variant filtration. This case illustrates strong advantages in lab efficiency: whereas a manual process of variant filtration is time consuming and error prone, the lab benefits from automation of these manual protocols, freeing up time for genetic specialists to focus on variant interpretation and reporting.

Notes
1. Richards et al., Genetics in Medicine, advance online publication 5 March 2015. doi:10.1038/gim.2015.30

This article is adapted from Agilent Publication 5991-6387EN.

Cartagenia Bench Lab™ is marketed in the USA as exempt Class I Medical Device and in Europe and Canada as a Class I Medical Device.




ELECTRONIC MEDICAL RECORDS
Many health services across the world have migrated, or are in the process of migrating, their patient health information into digital storage in the form of electronic medical records or EMRs. Across healthcare in general this is seen as a vital step in creating a more integrated system, and for genomic information in particular being able to append sequence information to a patient's EMR is a vital step in the development of precision medicine.

In an ideal scenario, following a genetic test a patient's report and all the associated raw data would be uploaded to their EMR. This information could then be accessed, with the patient's permission, by any clinician who may in the future need to re-analyse the data or make a fresh diagnosis based on new research. Patients may also be able to access their data remotely and securely. And researchers and clinical laboratories will be able to access de-identified data from thousands of patients as part of genomic interpretation and discovery (see chapter 5), enabling the uptake of genomic diagnostics into routine clinical care.

In reality, while a 1-3 page test report can be easily incorporated into an electronic medical record, there are significant difficulties associated with adding complete genomic data to a patient's record. Nevertheless there are several large-scale projects underway aimed at creating a successful electronic database for genomic data that will benefit both patients and research.

One such project is the Electronic Medical Records and Genomics (eMERGE) Network, a National Institutes of Health-organised and funded consortium of US medical research institutions that is developing research processes combining information from DNA biorepositories with clinical information stored in EMRs in order to conduct large-scale genomic research. Now entering phase III as of September 2015, the project participants are exploring the best avenues for incorporating genetic variants into EMRs for use in clinical care and diagnostics.

CONCLUSION
The use of NGS tests for clinical diagnostics looks set to become part of routine healthcare practice, and with the development of EMR systems the long-term benefits for medical research could be significant.

However, there are many challenges that have yet to be solved, and many tools and processes that need to be developed in order to fully realise the benefit. Clinical reporting is set to evolve rapidly in the future, as the cost of sequencing decreases and our knowledge of the genome increases. Consequently best practices for analysis, interpreting variants and clinical reporting will also continue to evolve.


CHAPTER 6:

EDITING THE GENOME

SPONSORED BY


INTRODUCTION
It is impossible to go to a conference in the genomics, molecular biology, or synthetic biology space without hearing the terms 'genome editing' or 'CRISPR'. Precise, specific, and controlled genome editing is arguably the trendiest application in these spaces at the moment, with active momentum building due to its bold promises. It is hard to ignore a field that could change the face of personalised medicine, with the potential to treat thousands of currently untreatable diseases.

Around five papers are now published every day on CRISPR alone – an astounding number for a technology barely three years old! But before we get to that we want to go back to humble beginnings, to where the genome-engineering journey began.

Human influenced genomic modification of organisms is as old as selective breeding, which has existed – whether intentionally selecting for beneficial traits, or unintentionally by domestication – for millennia. The direct modification of organisms using targeted methods has existed for around four decades.

HOMOLOGOUS RECOMBINATION MEDIATED GENE TARGETING
During the early microinjection studies of the late 70s it was discovered that the success rates of exogenous DNA expression in mammalian cells could be greatly increased if the DNA being introduced also contained a viral DNA sequence on either end. In the early 80s scientists discovered these same viral sequences within cell line genomes, with the exogenous DNA inserted at these sites. Using this paradigm, foreign DNA could be inserted anywhere into the genome so long as there existed regions of homology. By the mid to late 80s researchers were designing constructs to integrate foreign DNA into the genomes of a number of organisms, aiming to disrupt and discover gene or pathway form and function.

Over 7,000 genes and regulatory elements have had their function inferred since the late 80s thanks to this method of gene targeting. Knockout mutants, or point mutation variants, were created and, depending on phenotype, function could be inferred. The importance of this technology led to Drs Capecchi, Evans and Smithies co-winning the 2007 Nobel Prize in Physiology or Medicine.

Homologous recombination mediated gene targeting occurs via a process called strand invasion – part of the homology directed repair of double stranded DNA breaks. A successful insertion event therefore relies on a randomly occurring double stranded break existing at a desired position. This makes experiment efficiency low.

Successful modifications occur at best in 0.1% of cells. Site-specific recombinase enzymes were developed as an adjunct to traditional homologous recombination, increasing the integration efficiency, though at least one recombinase-free recombination event was still required for success. Genome editing protocols usually required months of in vitro fertilisation and crossbreeding to find a double mutant for a desired allele. Add that to the challenge of uncontrollable integration of plasmid DNA throughout the genome through the non-homologous end joining repair machinery, and the stage was set for a more efficient technology to be developed.

Painting from ancient Egypt depicting early domestication


ZINC FINGER NUCLEASES (ZFNs)
Targeted, site-specific enzymes that facilitate directed cuts at any point in the genome arrived on the scene with the introduction of zinc finger nucleases, commonly known as ZFNs, around 20 years ago, promising to dramatically improve genome-editing efficacy.

ZFNs are synthetic, modular proteins. They consist of DNA-binding domains sourced from transcription factor proteins, each approximately 30 amino acids in length and stabilised by a zinc ion. Each ZF is engineered to bind a specific triplet of nucleotides, and ZFs can be engineered to bind almost any triplet sequence.

Modularly adding triplet-binders to an endonuclease allows sequence-directed cuts to be induced anywhere in the genome. In order to maximise double stranded break efficiency, and to make the combined recognition site rare enough that off-target ZF binding is expected far less than once in a human-sized genome, a duplex approach is adopted.

This requires two zinc finger complexes, each binding to opposite strands of the DNA and each fused with one half of the bipartite FokI endonuclease. In order to produce a site-specific cut, both zinc finger complexes are engineered to bind the perfect distance from each other, bringing both FokI subunits into close enough proximity to induce a cut.
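A rough back-of-the-envelope calculation, assuming random and uniform base composition, shows why the paired design gives such specificity: the combined recognition site of two multi-finger arms quickly becomes long enough to be expected far less than once in a human-sized genome.

# Back-of-the-envelope specificity estimate for a paired ZFN design.
# Assumes random, uniform base composition - a simplification for
# illustration, not a real off-target prediction.

human_genome_bp = 3.2e9          # approximate haploid human genome size

for fingers_per_arm in (3, 4, 6):
    site_length = 2 * fingers_per_arm * 3          # bp recognised by the pair
    expected_hits = human_genome_bp / (4 ** site_length)
    print(f"{fingers_per_arm} fingers per arm -> {site_length} bp site, "
          f"~{expected_hits:.2e} chance occurrences expected")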

Fine control over the cut site position gave researchers tweezer-like control over homologous recombination events. Additionally, by making more than one specific cut, sections of the genome could now be completely deleted, taking advantage of the cell's non-homologous end joining DNA repair machinery. Although a major breakthrough, the processes used to make each DNA-binding unit highly specific were arduous and ultimately expensive.

Each zinc finger sub-unit has a finely nuanced binding affinity, further complicating the use of this technology. To use the analogy of a hand, a zinc finger in the 'pinky' position of the entire ZFN structure could have extremely high efficiency; however, in the 'ring' position the same zinc finger will display an entirely different, often worse, binding efficiency. Therefore binding efficiency had to be engineered in the context of the entire protein.

TECHNOLOGY SHOWCASE: DEVELOPING MOUSE MODELS FOR CYSTIC FIBROSIS
Cystic fibrosis patients carry a mutation in a chloride ion channel, causing mucosal tissues to function incorrectly, leading to impaired mucosal secretion and damage to the intestinal tract. Patients also often suffer severe and chronic infection from over 50 pathogenic or opportunistic species.

Much of the pathophysiology of this life-shortening disease was learned through early homologous recombination generated mouse models. One important area has been the study of highly complex bacterial biofilms within the CF-model mouse lung.

Importantly, co-infection with multiple pathogens, such as Pseudomonas aeruginosa and Burkholderia cenocepacia, led to both increased inflammatory responses and the establishment of chronic infection in mice. In the last five years a number of potential anti-biofilm drugs have been tested on the CF model mice in the hope that they could extend patients' lives.

Microinjection of a cell to introduce new genetic material


Transcription Activator-Like Effector Nucleases (TALENs)

TALENs are topologically similar to ZFNs; however, instead of zinc fingers, it is transcription activator-like effector (TALE) proteins that are modularised to facilitate specific DNA binding. Native to Xanthomonas plant pathogens, TALEs are important virulence factors that bind to promoter sequences in the host genome to increase the expression of plant proteins that facilitate cell colonisation by the pathogen.

TECHNOLOGY SHOWCASE: 12 PEOPLE TREATED FOR HIV
In 1995, individuals were found to be naturally resistant to HIV. Each carried a 32 base pair deletion in an immune receptor called CCR5, ablating its function.

In 2008, using zinc finger nucleases, CCR5-null cells were produced in the lab. Between 2011 and 2013, 12 HIV+ patients had their immune cells cultivated, modified with zinc finger nucleases to carry the CCR5-null phenotype, and re-introduced into the patients. Patients saw a reduction in viral load, and the persistence of an HIV-resistant population of T-cells. It is important to note that the patients are not cured, but their disease outlook is improved.

Current research aims to combine the same procedure with stem cell therapy to provide a 'one shot' HIV cure. Sangamo are currently in FDA-approved phase 2 clinical trials for the T cell technology and phase 1 for the stem cell technology.

In 2009 researchers deciphered their astoundingly simple DNA binding mechanism, and by 2010 TALENs were being engineered to direct double stranded breaks in DNA. An individual TAL repeat is a small 33-35 amino acid module, with two adjacent amino acids (at positions 12 and 13) controlling DNA binding. Therefore only four repeat variants (one for each A, T, C, G nucleotide) have to be organised to provide sequence-specific DNA binding.

When designed properly, cleavage efficiency between ZFNs and TALENs is actually somewhat similar. What differentiates TALENs is their absolute ease of engineering. ZFNs required a deep understanding of ZF binding modalities, as well as an E. coli based screening system that used libraries of sequences. TALENs required the shuffling of the four module variants.
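That shuffling is simple enough to express in a few lines. In the sketch below each target base is mapped to a repeat module identified by its two variable residues (the commonly cited code, e.g. NI for A); the target sequence is arbitrary and the snippet is for illustration only.

# Sketch of assembling a TALE repeat array for a target sequence.
# Each repeat is identified by its two variable residues (positions 12 and 13);
# the base-to-repeat mapping below is the commonly cited code, shown here
# purely for illustration.

REPEAT_FOR_BASE = {"A": "NI", "C": "HD", "T": "NG", "G": "NN"}

def tale_array(target_sequence):
    """Return the ordered list of repeat modules for one TALEN arm."""
    return [REPEAT_FOR_BASE[base] for base in target_sequence.upper()]

print(tale_array("TCCAGT"))   # -> ['NG', 'HD', 'HD', 'NI', 'NN', 'NG']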

The ease of engineering led to a number of novel applications that can be brought about by an easily engineered double stranded break, including very large scale (1.5 million base) deletion and inversion event models.

CRISPR-CAS9
In 2012, the entire field of genome editing was shaken up again by the adaptation of the CRISPR-Cas9 system for use as an editing tool. In nature, the CRISPR-Cas9 system is found in over 40% of all sequenced bacteria, and almost every archaeon. It affords immunity to invading DNA elements (from viruses or other pathogens) by site-specific, RNA-guided cleavage. Due to its simplicity of use, its accuracy, and the ease with which Cas9 can be further modified to shuttle other DNA-acting enzymes to specific genomic regions, it has become the current gold standard in genome editing machinery. p.56

A cartoon depicting how zinc finger nucleases (ZFNs) bind to 3 specific nucleic acid bases

A cartoon depicting how TAL effector nucleases (TALENs) bind to individual nucleic acid bases


Reimagine Genome Scale Research: Massively Parallel DNA Synthesis rewrites CRISPR workflows

WORKFLOW FOR PATHWAY COMPONENT DISCOVERY
1. DESIGN an accurate, off-target activity free gRNA library.
2. MANUFACTURE the whole gRNA template library using massively parallel DNA synthesis.
3. CLONE the gRNA template library into the vector(s) of choice.
4. PACKAGE the library into a lentiviral delivery system.
5. TRANSFORM the lentiviral library into a cell line to produce a cell line library.
6. SCREEN the library with multiple rounds of a selection that acts on a particular pathway.
7. ANALYZE the results to identify all genes involved in the selection pathway.

SOFTWARE TOOLS ALLOW OFF-TARGET FREE gRNA TO BE DESIGNED
Desktop Genetics offer a platform for the easy and efficient design of gRNA libraries that are accurate for a genome of interest and have minimal off-target effects. a) Snapshot of the Desktop Genetics interface showing BRCA1 introns and exons at its position in the human genome. b) Several scoring algorithms are used to define whether a specific gRNA will possess off-target activity throughout the genome. [Figure: sgRNA guides tiled along MUC4 protein-coding exons, plotted by activity score and distance from the AUG, and flagged as having off-target activity in MUC4 only, moderate off-target activity, or low off-target activity.]

AT 1:500, TWIST BIOSCIENCE'S OLIGO SYNTHESIS ERROR RATE IS INDUSTRY LEADING
29,040 unique 80mer oligo sequences were designed to contain a central 40bp variable region (25% representation of each base) with identical 20mer flanking regions. These designs were synthesized simultaneously on the Twist Bioscience silicon DNA writing platform in 240 clusters (121 oligos per cluster), and sequenced with an Illumina MiSeq. Alignments with the design sequences showed that around 1 in 500 nucleotides per cluster was erroneous - an industry leading synthesis accuracy.

OLIGO POOLS ARE UNIFORM IN THEIR REPRESENTATION
The same 29,040 oligonucleotides had their NGS data assessed for abundance and oligonucleotide representation. 100% of the designed oligonucleotides were present in the NGS analysis. Additionally, 90% of all sequences were synthesized at a density within 4x the mean density. This data confirms that what you design is exactly what will be synthesized on Twist Bioscience's platform.

gRNA FROM TWIST BIOSCIENCE CUTS WITH HIGH SPECIFICITY
Two 70mer oligonucleotides were synthesized on Twist Bioscience's silicon DNA writing platform. These oligonucleotides were assembled to make the full 120mer gRNA template (peak denoted by blue arrow in i) that was complementary to a sequence of interest. In vitro transcription was used to convert the template into gRNA (peak in ii). This gRNA was used to guide Cas9 to the DNA sequence of interest. The 760bp sequence (blue arrow in iii) was cleaved successfully into two pieces (blue arrows in iv), 321 and 439 bp in length. No remaining full length target or non-specific events were detectable.

TWIST BIOSCIENCE OFFERS gRNA AS EITHER UNAMPLIFIED OR AMPLIFIED POOLS
[Diagram: oligo pools are supplied unamplified or amplified, either as (1) pooled gRNA libraries, converted by in vitro transcription into sgRNA ready for transfection, or as (2) cloning-ready gRNA pools.]

Tell us what Twist Bioscience can do for you: @TwistBioscience, www.twistbioscience.com



BRIEF HISTORY OF CRISPR-CAS9

PROKARYOTIC GENOMES CONTAIN WELL-ORGANISED REPEATS

Jansen et al., 2002

Scientists had previously noted that many bacterial and single-celled (archaeal) genomes contained distinct repeated motifs of <50bp that were clearly, neatly and consistently ordered.

Order implies function, but their function was baffling – they were non-coding for one, and their pattern kept showing up in different species, each with its own unique, repetitive sequence that was often highly diverged from other species with the same pattern.

A first step towards explaining their function came in the paper above, which kickstarted the field by naming the repeat regions 'Clustered Regularly Interspaced Short Palindromic Repeats' – or CRISPR for short – and documenting the existence of a number of CRISPR associated genes adjacent to these repeats – named the Cas family. The CRISPR-Cas paradigm was born.

SEQUENCES IN BETWEEN THESE REPEATS ARE SURPRISINGLY FOREIGN

Mojica et al., 2005

CRISPRs were always interspersed with what seemed like totally arbitrary sequence, also <50bp in length. A CRISPR locus looks as follows:

CRISPR-random-CRISPR-random-CRISPR-random-CRISPR-random…

In 2005 Mojica et al. sequenced 4,500 CRISPR sequences from 67 strains representing both bacteria and archaea, and compared these sequences against repositories within GenBank.

Their astounding finding revealed that the sequences matched a mixture of bacteriophage (viruses that infect bacteria) sequences, invasive plasmid (weapons used by bacteria to destroy other bacteria) sequences and the cell's own genomic sequences that had been sequestered into the CRISPR interspacing regions.

EUKARYOTES ARE NOT THE ONLY ORGANISMS TO HAVE AN ADAPTIVE IMMUNE SYSTEM

Barrangou et al., 2007

The authors of the previous study noticed that a single-celled organism that grows primarily in hot springs, Sulfolobus solfataricus, was naturally immune to a virus called SIRV. It also had SIRV DNA in its CRISPR spacers, noteworthy in that viruses use DNA as their weapon to infect hosts. It was hypothesised that CRISPR was a form of prokaryotic adaptive immunity against viral attack.

In 2007 Barrangou et al. showed that subjecting bacteria to viral attack until they became resistant caused that virus' DNA to be introduced into the CRISPR interspacing regions. To rule out false positives, they then removed the spacers containing viral DNA from the resistant strain and subjected it to viral attack once again. Resistance was instantly lost.

CAS USES CRISPR TO BECOME A GUIDED MISSILE

Garneau et al., 2010 and Deltcheva et al., 2011

Given that CRISPR and spacer DNA provide immunity to viruses, researchers dug for the precise mechanism that leads to one genetic element destroying another. Between 2010 and 2011, CRISPR/Cas was shown to use the information in CRISPR spacers as coordinates for cutting up invading DNA sequences.

Viral DNA challenged with CRISPR/Cas of a resistant strain was always cleaved within the sequence that matched the spacer. This cleavage always occurred at a specific distance from a recognised CRISPR sequence motif that was always consistent between spacers of any particular species (the Protospacer Adjacent Motif, or PAM).

Virus-resistant bacteria produce an abundance of RNA from two distinct regions in the CRISPR/Cas system. One is the CRISPR spacer itself (crRNA); the other exists just outside of the CRISPR repeats, near where the Cas genes are found (tracrRNA). Both RNA fragments together, alongside the endonuclease protein Cas9, were necessary to cleave viral DNA.

TECHNOLOGY SHOWCASE: TALENS TO MODIFY TALES
While Xanthomonas have been a useful source of TALE proteins for genome engineering, they also use the same tool to cause crop-destroying rice blight. In true 'fighting fire with fire' fashion, researchers designed TALENs to modify the natural TALE binding regions of the rice crop by inducing either deletions or mutations in an 'effector binding region' of the plant genome.

Using a modified DNA-injecting plant pathogen, Agrobacterium tumefaciens, plasmids encoding TALENs were injected into rice embryonic cells, which were then screened for double knockout mutants. These mutants were found to show no impairment in growth or development, alongside resistance to the 32 rice-infecting strains that target the now modified (unrecognisable) target site. Due to the simplicity of this experiment it can easily be used in any plant to confer resistance to many 'blighting' pathogens that use a TALE (or similar) infection system.


TIMELINE

LATE 80's: HOMOLOGOUS RECOMBINATION. The homologous recombination machinery that allows genetic recombination during meiosis was hijacked to afford the directed homologous recombination of exogenous DNA into precise positions in mouse genomes, with an efficiency of one recombination in every 10^6 cells.

LATE 90's: SYNTHETIC ZINC-FINGER NUCLEASE PROTEINS (ZFNs). Synthetic zinc finger domains, each engineered to bind a specific nucleotide triplet, were fused to the FokI endonuclease to introduce site-directed double stranded breaks, greatly improving targeting efficiency.

2002: TERM CLUSTERED REGULARLY INTERSPACED PALINDROMIC REPEATS (CRISPR) IS COINED. Identification of genes that are associated with DNA repeats in prokaryotes. Jansen et al. Molecular Microbiology

2005: NON-REPEATING SPACERS IN CRISPR FOUND TO CONTAIN VIRAL DNA. Intervening sequences of regularly spaced prokaryotic elements derive from foreign genetic elements. Mojica et al. Journal of Molecular Evolution

2007: CRISPR IS A PROKARYOTE'S ADAPTIVE IMMUNE SYSTEM. CRISPR provides acquired resistance against viruses in prokaryotes. Barrangou et al. Science

LATE 00's: SYNTHETIC TAL EFFECTOR NUCLEASE PROTEINS (TALENs). Plant pathogenic Xanthomonas were found to use DNA-binding TAL effectors as virulence factors. TAL effectors bind specific nucleotides and, added together along with a FokI nuclease, can introduce site-directed double stranded breaks. When properly designed, efficiency can be above 1 in every 2 cells.

2010: CRISPR AFFORDS IMMUNITY BY CUTTING UP FOREIGN DNA. The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Garneau et al. Nature

2011: NUCLEASE PROTEIN CSN1 (NOW NAMED CAS9) IS GUIDED BY 2 CRISPR-ENCODED RNA STRUCTURES (CRRNA AND TRACRRNA) TO A SPECIFIC DNA SEQUENCE. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Deltcheva et al. Nature

2012: LINKAGE OF CRRNA AND TRACRRNA INTO A SINGLE GRNA COULD ENABLE GUIDED GENE EDITING. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Jinek et al. Science

CRISPR/CAS9 NUCLEASES. Some bacteria have adaptive immune systems that protect them from viral DNA. A Cas9 protein is guided by RNA which has been transcribed from 'learned' information about the virus. Guide RNA is complementary to a viral strand, and can be engineered to be complementary to any strand of choice, allowing Cas9 to introduce double stranded breaks in up to 9 in every 10 cells.

2015: CRISPR/CPF1 NUCLEASES. A new, relatively unvalidated technology. Proteins similar to Cas9 were screened, and from this Cpf1 was discovered. It was found to require smaller guide RNAs, and to induce an overhanging break instead of a blunt double stranded break.

TODAY: CRISPR IS ONE OF THE FASTEST EVOLVING FIELDS IN BIOLOGY. According to PubMed, in 2015 alone there were 1,266 CRISPR publications. In January 2016 alone there were 207 papers, fitting the trend of exponential growth.


IF CAS9 IS CRISPR RNA GUIDED, AND WE CAN ENGINEER DNA SEQUENCES...

Jinek et al., 2012

The 2012 study by Jinek et al. is a good candidate for the greatest biological advance of the last five years. It was a one-two punch knockout: an incredible leap that revolutionised not just synthetic biology but also genetic engineering, personalised medicine, agricultural science, genetics and cell biology, to name a few fields.

First, the authors demonstrated that because crRNA and tracrRNA are complementary sequences, they bind into a double strand. This double strand then guides the Cas9 to the complementary sequence in the invasive DNA, starting at the PAM. The Cas9, which contains two different DNA-cutting domains, unwinds the DNA into two strands, and then creates a blunt break in the invading DNA.

Next, Jinek et al. showed that by linking the crRNA and the tracrRNA into a new molecule they named guide RNA (gRNA), they could simplify the system. Any gRNA sequence could be synthesised to facilitate Cas9-mediated blunt-ended cleavage of any DNA strand. This gives researchers precise control to cut anywhere in an organism's genome, allowing genes to be engineered in or out of selected organisms with relative ease.

A CLOSER LOOK AT CRISPR/CAS9 FOR GENOME EDITING
Very few parts have to come together to facilitate CRISPR mediated genome editing. A Cas9 protein is expressed with a correctly designed gRNA complementary to a DNA sequence of interest, which targets the Cas9 to the genome if the sequence complementary to the gRNA is adjacent to a PAM.

Where homologous recombination-mediating recombinases were locked into an unchangeable 30+ base recognition sequence, Cas9 is almost unrestricted in sequence recognition. As the 20bp guide RNA sequence can be any string of nucleotides that is complementary to a genome sequence, the only constraining factor to Cas9 binding is the existence of a PAM. S. pyogenes Cas9 effectively recognises a two base pair PAM (NGG, where N can be anything), so any sequence preceding two guanines can be cut.
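A short sketch makes this concrete: to nominate candidate target sites on one strand, scan for an NGG PAM and take the 20 bases immediately upstream as the protospacer. A real design tool would also scan the reverse strand and score off-target matches genome-wide; the example sequence here is arbitrary.

# Minimal sketch of nominating S. pyogenes Cas9 target sites on one strand:
# find every NGG PAM and report the 20 bp protospacer immediately 5' of it.
# A real gRNA design tool would also check the reverse strand and score
# off-target matches genome-wide; the input sequence here is arbitrary.

def find_cas9_sites(sequence, protospacer_len=20):
    sequence = sequence.upper()
    sites = []
    for i in range(protospacer_len, len(sequence) - 2):
        pam = sequence[i:i + 3]
        if pam[1:] == "GG":                      # NGG: any base then two guanines
            protospacer = sequence[i - protospacer_len:i]
            sites.append((i - protospacer_len, protospacer, pam))
    return sites

example = "ATGCTGACCTTGGACTAAGGCATTCAGGATCCTGAAGGTTACGG"
for start, protospacer, pam in find_cas9_sites(example):
    print(f"pos {start}: {protospacer} | PAM {pam}")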

Such simplicity is extremely powerful, as it allows almost any genomic position to be modified in any organism that can express exogenous DNA. In order to fully take advantage of this system, it has been vital to understand exactly how these parts interplay at a molecular level – eventually leading to further improvements in this system.

IMPORTANT STRUCTURAL ELEMENTS OF CAS9
Cas9 mediated DNA cleavage can be considered as a three-step process: destabilisation, invasion, and cleavage. Cas9 is a considerably large protein, with S. pyogenes Cas9 weighing in at 1,368 amino acids. Smaller Cas9 proteins do exist; however, due to its simple PAM site, S. pyogenes Cas9 sees the most use. Within Cas9 are two 'lobes' with four active sites, split into the recognition lobe, which directs the recognition between gRNA and DNA, and the nuclease lobe, which cleaves each strand of the DNA independently and recognises the PAM site.

DNA cleavage relies on all four active sites, but is performed by the two endonuclease-like domains called RuvC and HNH.

Destabilisation: First, gRNA is integrated into the recognition lobe of Cas9, forming a stable structure that has all of the active sites required for DNA cleavage. Once this complex is formed, the cleavage lobe facilitates a DNA scan for PAM sites. If one is found, the PAM recognition site in the cleavage lobe of the protein is thought to locally destabilise the chemical interactions that hold the two complementary strands of the PAM site DNA together. Further protein-DNA interactions then stabilise the now unwound DNA immediately upstream of the PAM site.

Invasion: At this point, if there is no match between the gRNA and the unwinding DNA, the binding energy within the entire system is too low to maintain the overall structure, and the Cas9-gRNA complex moves on to another PAM site. If there is a match, each gRNA residue will subsequently displace its homologous DNA residue, binding complementarily to the target sequence. A number of complex sequential interactions between the growing target DNA-gRNA dimer and the recognition lobe facilitate and stabilise the entire process.

Cleavage: Once the whole Cas9-gRNA-DNA complex is formed, the DNA is held within the protein in a formation accessible to the HNH and RuvC cleavage domains. HNH is flexible in its movement; once in a favourable conformation with the target DNA-gRNA dimer it will cleave the DNA strand. Simultaneously, the non-complementary strand is positioned for RuvC to cleave. The entire protein complex then dissociates, the cleaved DNA strands re-wind, and a targeted double stranded break is left in the DNA.

A cartoon depicting how Cas9, gRNA and the PAM come together for DNA cleavage to facilitate genome engineering


IMPROVING CAS9
Cas9, when expressed or transfected in cells alongside a gRNA, allows for the targeted introduction or deletion of genetic information. This approach was used to produce knockout mutant mice carrying a mutation in both alleles in a process that took only around 4 weeks from start to finish. It has been deemed a fantastic success, and is often referred to as one of the greatest breakthrough technologies in recent years.

CRISPR has already shown incredible promise in the development of personalised gene therapies for rare diseases in human-cell lines, and in mouse models. Pre-clinical trial, proof of concept treatments already exist for β-thalassemia, rheumatoid arthritis, Duchenne muscular dystrophy, cystic fibrosis and tyrosinemia.

Once the full structure and mechanism of CRISPR-Cas9 mediated DNA cleavage was identified, researchers could set about improving the technology even further, increasing its efficacy and expanding the potential applications. While many mutants and fusion proteins have been produced, three stick out: Cas9n, dCas9 and hfCas9.

Cas9n: Cas9n, or 'nicking Cas9', has either its RuvC or its HNH cleavage domain modified to be inactive. This inactivation leaves Cas9 able to produce only a single stranded break in the DNA (a nick), not a double stranded break. This is significant for two applications.

First, there is concern over the effects that off-target Cas9 cutting events will have on any cell that is engineered with this system. Research has shown that off-target effects are often few and far between, but their impact cannot be ignored. For this reason, two Cas9n enzymes, one for each strand, can be used to produce the double stranded break. As they would have to recognise both the upstream and downstream regions of the cut site, off-target effects are almost always ablated.

There are a number of fates a DNA sequence may undergo following a double stranded break. The most common is the homologous recombination-directed repair of the sequence. Alternatively, the non-homologous end joining machinery can rejoin the two strands back together, potentially integrating an exogenous sequence in any orientation in the gap. The non-homologous end joining machinery often causes the loss of a few nucleotides, so in each case it can lead to the introduction of a faulty version of the target gene.

Recently, it was discovered that homologous recombination can occur following an individual nick event, in place of the canonical single stranded repair mechanisms, although the efficiency of this process is reduced 24-fold. Regardless, this allows researchers to introduce a homologous recombination event without worrying about inducing a double-stranded break based error.

dCas9: Short for ‘dead’ Cas9, it has had both its RuvC and its HNH nuclease domains inactivated. This turns Cas9 into a shuttle for other enzymes that can act upon the DNA. dCas9 has been used as a fusion product with transcription factors in order to tightly control the activation or repression of particular proteins outside of their usual activity. It has also been fused to FokI, and used as a dual strand cleavage system that belongs to the same paradigm as ZFNs and TALENs.

hfCas9: Instead of using dual Cas9n proteins to generate the off-target effect-free Cas9 cut, researchers took to modifying the Cas9 enzyme itself to reduce off target effects, and keep the Cas9 system as simple as possible.

As mentioned earlier, there are a number of complex interactions that maintain the DNA in its unwound position to facilitate gRNA base pairing. It was thought that the net effect of all of these interactions is that the binding energy sits above that which is required for the reaction to be successfully carried out. This allows a relaxed specificity, meaning that the gRNA and target DNA do not have to be a perfect match.

This is great for the bacterium, as it can endure viral DNA that has undergone one or more mutations since it was last encountered; however, it can be detrimental to genome editing due to the off-target effects caused.

Therefore, by mutating four of these DNA-interacting domains, the DNA binding energy of the whole system was reduced to a point at which the gRNA had to be an exact match in order to induce a cut in the DNA. When the target organism's genome was sequenced and analysed for off-target effects, not a single one was found. So long as the gRNA is designed with off-target effects in mind (with a tool like the genome search algorithms offered by Desktop Genetics), hfCas9 allows genome editing that is specific only to the site of interest.

Where next? Only briefly touched on in this chapter, the CRISPR associated protein Cpf1 could extend the reach of the CRISPR/Cas system. Maybe there is something even more powerful looming on the horizon. The future of genetic engineering continues to evolve.

For a fully referenced version see the digital version at frontlinegenomics.com

A crystal structure of the Cas9 protein (blue and cyan) using gRNA (green) to interact with unwound DNA (magenta)


ACTIVATOR: A transcription factor protein that controls the transcription of DNA by binding to specific regions of the genome. Activators specifically encourage gene expression.

ADAPTIVE IMMUNE RESPONSE: An organism's defense against pathogens, in which the end goal is destruction of the pathogen through a process that uses learned information about the pathogen. This learned information can be retained for the organism to call upon when it next encounters the pathogen. Mammals have antibodies and lymphocytes. Bacteria have the CRISPR/Cas system.

BLUNT-CUT: An enzyme-induced break in the DNA in which both DNA strands are cut at the same position (cut sites marked v and ^):
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgt ^ acgtacgtacg-5'

Cas: Short for CRISPR-associated protein. Cas is a protein family whose members are encoded adjacent to a CRISPR motif and are required for the bacterial adaptive immune response.

Cas9: (also Csn1) A Cas enzyme that forms a complex with both crRNA and tracrRNA, or with a synthetically engineered gRNA, in order to perform sequence-guided blunt-ended cuts in double-stranded DNA.

Cas9n: Cas9n, or ‘nicking Cas9’, has either its RuvC or its HNH cleavage domain rendered inactive. This leaves Cas9 able to produce only a single-stranded break in the DNA (a nick), not a double-stranded break.

dCas9: An engineered mutant of Cas9 named ‘dead’ Cas9, which has had its endonuclease domains rendered catalytically inactive. It is often fused to other DNA-acting enzymes to afford precisely targeted genetic control.

hfCas9: An engineered mutant of Cas9 named ‘high fidelity’ Cas9, which has had its DNA-interacting domains modified. This makes Cas9 bind DNA more weakly, so the gRNA has to be an exact match to its target DNA in order for a cut to occur.

Cpf1: An enzyme discovered through its relatedness to Cas9. Like Cas9, it is an RNA-guided nuclease. Unlike Cas9, it forms a complex with only a crRNA strand, making it simpler to engineer. It also produces an overhanging-cut.

Csn1: see Cas9.

CRISPR: An acronym for Clustered Regularly Interspaced Short Palindromic Repeats. CRISPRs are repetitive motifs in prokaryotic genomes that are interspersed with foreign sequences learned from invasive nucleic acids. These spacer sequences become the crRNA that, together with tracrRNA, guides Cas9 toward invading nucleic acids; with the Cas enzymes they form the prokaryotic adaptive immune system. The system has been repurposed to facilitate directed genome editing.

DELETION: The removal of one or more nucleotides from a genome sequence. Can cause frame shift mutations.

DOUBLE STRANDED BREAK: A break across both strands of a DNA molecule, leaving two free fragments. This can occur naturally, for example through DNA-damaging radiation, or enzymatically through the action of nucleases.

DUAL NICK: A reaction that breaks the DNA using two nickase enzymes, one cutting each strand at nearby positions so that the paired nicks together produce a staggered double-stranded break.

ENDONUCLEASE: An enzyme that cleaves within a DNA sequence, producing either a blunt-cut or an overhanging-cut in both strands.

FokI: An endonuclease commonly used in genome editing. A dimer of FokI enzymes is required to cut both strands of the DNA, so it can provide precisely targeted cutting when used as a fusion with TAL effector proteins, zinc finger proteins or dCas9.

EXOGENOUS DNA: A length of DNA that is either taken up by the cell from its surroundings or synthetically introduced into the cell. Both cases can lead to new genetic information being inserted into the cell’s genome. Exogenous DNA of a specific, desired sequence is used in genome editing to introduce new genes, or mutant versions of genes, into cells in order to study their effects.

FRAME SHIFT: A type of mutation that changes the amino acid composition of a protein. This occurs because the information in RNA is translated into amino acids in triplets of nucleotides, so a single deletion or insertion shifts the ‘reading frame’, often rendering the protein inactive:
RNA:   AUG UCU UGU UCU GGU A…
Amino: Met Ser Cys Ser Gly …
A deletion at the second guanine shifts the frame, changing the amino acid composition throughout the remainder of the protein (a small code sketch of this example follows):
RNA:   AUG UCU UUU CUG GUA A…
Amino: Met Ser Phe Leu Val …
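
For readers who prefer a worked example, here is an illustrative Python sketch of the frame shift described above. The tiny codon table covers only the codons used in this example and is not a complete genetic code; the function names are invented for illustration.

# Illustrative sketch of the frame shift example above.
CODON_TABLE = {
    "AUG": "Met", "UCU": "Ser", "UGU": "Cys", "GGU": "Gly",
    "UUU": "Phe", "CUG": "Leu", "GUA": "Val",
}

def translate(rna):
    """Translate an RNA string in triplets, stopping at the first
    incomplete or unknown codon."""
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon not in CODON_TABLE:
            break
        peptide.append(CODON_TABLE[codon])
    return peptide

rna = "AUGUCUUGUUCUGGUA"
print(translate(rna))          # ['Met', 'Ser', 'Cys', 'Ser', 'Gly']

# Deleting the second guanine (index 7) shifts the reading frame.
shifted = rna[:7] + rna[8:]    # "AUGUCUUUUCUGGUA"
print(translate(shifted))      # ['Met', 'Ser', 'Phe', 'Leu', 'Val']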

HOMOLOGOUS: A nucleic acid sequence with significant similarity to an existing nucleic acid sequence. In molecular biology this applies to a sequence with sufficient similarity to facilitate homologous recombination.

HOMOLOGOUS RECOMBINATION: A process in which nucleotides are copied from one DNA sequence into an identical or near-identical sequence that has suffered a double-stranded break, accurately stitching the two broken ends back together. It can also occur without damage, producing sequence variation during meiosis, and is used by bacteria and viruses to mediate sequence invasion during horizontal gene transfer. It is utilised in genome editing to accurately introduce foreign sequences into specific genome positions.

NON-HOMOLOGOUS RECOMBINATION: DNA that has undergone a double-stranded break can be repaired by non-homologous end joining, in which the two free ends are enzymatically stitched back together regardless of their sequence.

INSERTION: The introduction of new genetic material into a nucleic acid sequence. Insertions can cause protein silencing and frame shift mutations if introduced into a protein-encoding DNA sequence.

KNOCK-IN: The introduction of new genetic material at a specific point in an organism’s genome.

KNOCK-OUT: The removal of genetic information from a specific point in an organism’s genome.

NICKASE: A type of nuclease enzyme that cuts only one of the two DNA strands (nick site marked v):
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgt   acgtacgtacg-5'

NON-HOMOLOGOUS REPAIR: see non-homologous recombination.

OFF-TARGET EFFECTS: An unintentional, unforeseen genomic modification caused by a targeted modification tool also being able to bind to sequences elsewhere in the genome.

OVERHANGING-CUT: An enzyme-induced break in DNA in which the two strands are cut at offset positions, leaving a single-stranded overhang on each product. Overhang length depends on the enzyme (cut sites marked v and ^; a toy comparison with a blunt-cut is sketched below):
5'-atgcatgca v tgcatgcatgc-3'
3'-tacgtacgta ^ cgtacgtacg-5'
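
The following Python sketch contrasts blunt and overhanging cuts on the duplex used in these glossary examples. The cut positions, strand strings and function name are arbitrary choices made for illustration.

# Toy illustration of blunt versus overhanging (staggered) cuts.
top    = "atgcatgcatgcatgcatgc"   # 5' -> 3'
bottom = "tacgtacgtacgtacgtacg"   # 3' -> 5', written left to right

def cut(top_pos, bottom_pos):
    """Return the two fragments produced by cutting the top strand after
    top_pos and the bottom strand after bottom_pos."""
    left  = (top[:top_pos], bottom[:bottom_pos])
    right = (top[top_pos:], bottom[bottom_pos:])
    return left, right

print(cut(9, 9))    # blunt cut: both strands severed at the same position
print(cut(9, 13))   # overhanging cut: staggered positions leave single-stranded ends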

PAM: An acronym for ‘protospacer adjacent motif’. This motif is the recognition site for Cas9 and is specific to each Cas9-containing species; the Streptococcus pyogenes PAM is NGG. Cas9 will not successfully bind and cleave the target DNA if there is no PAM immediately following the region of homology (a simple PAM search is sketched below).
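
As a minimal sketch of what “immediately following the region of homology” means in practice, the Python snippet below locates NGG PAM sites on one strand and reports the 20-nt protospacer upstream of each. It assumes the S. pyogenes PAM and ignores the opposite strand; the sequence and function name are invented for illustration.

# Minimal sketch: find NGG PAMs on one strand and the 20-nt protospacer upstream of each.
import re

def candidate_targets(sequence, protospacer_len=20):
    """Yield (position, protospacer, PAM) for every NGG PAM preceded by a
    full-length protospacer on this strand."""
    for match in re.finditer(r"(?=([ACGT]GG))", sequence):
        pam_start = match.start()
        if pam_start >= protospacer_len:
            yield (pam_start - protospacer_len,
                   sequence[pam_start - protospacer_len:pam_start],
                   match.group(1))

seq = "ACGTACGTACGTACGTACGTACGTTGGACGT"
for pos, proto, pam in candidate_targets(seq):
    print(pos, proto, pam)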

REPRESSOR: A transcription factor protein that controls the transcription of DNA by binding to specific regions of the genome. Repressors specifically inhibit gene expression.

RESIDUE: Another term for a single nucleotide, usually used when referring to its position within a whole DNA or RNA molecule.

RNA: Double-stranded DNA in the genome is transcribed into single-stranded RNA. RNA either provides the code to make proteins, is independently catalytically active (a ribozyme), or forms a complex with other RNAs or proteins to provide a catalytic function.

crRNA: The ‘cr’ stands for ‘CRISPR’. crRNA is RNA encoded within the CRISPR motif, and is one of two RNA sequences required for Cas9 to target a specific DNA sequence.

gRNA: The ‘g’ stands for ‘guide’. gRNA is a synthetic linkage of crRNA and tracrRNA, joined by a linker sequence, which makes the CRISPR system easier to engineer for use in genome editing.

tracrRNA: The ‘tracr’ stands for ‘trans-activating CRISPR’. tracrRNA is RNA that is encoded outside of the CRISPR locus, but is complementary to the repeat regions of crRNA. It is one of the two RNA sequences required for Cas9 to target a specific DNA sequence.

TALEN: An acronym for ‘TAL effector nuclease’. TAL effectors are proteins secreted by Xanthomonas species as part of their infection process, built from repeats that each bind a single nucleotide. Multiple repeats can be combined to target a specific DNA sequence, and the assembled protein is often fused to FokI to produce targeted double-stranded cuts in DNA during genome editing.

ZFN: An acronym for ‘zinc finger nuclease’. Zinc fingers are motifs that allow transcription factors to bind to specific nucleotide triplets. Multiple zinc fingers can be combined to target a specific DNA sequence, and are often fused to FokI to produce targeted double-stranded cuts in DNA during genome editing.

