topics in (nano) biotechnology human genome project lecture 9

TOPICS IN (NANO) BIOTECHNOLOGY

Human Genome Project

Lecture 9

15th April, 2004

PhD Course

• Human Genome organisation
– Human genome contains ~40,000 genes
– Nuclear genome 3000 Mb
– 30,000 to 40,000 structural genes
– 24 different types of DNA duplex
– 22 autosomes, 2 sex chromosomes

Remember what the genome is?

Human Genome

Nuclear

Mitochondrial

• DEFINITION: The entire genetic makeup of the human cell nucleus.

Includes non-coding sequences located between genes, which make up the vast majority of the DNA in the genome (~95%)

Let’s define it.

• DEFINITION: The Human Genome Project is a multi-year effort to find all of the genes on every chromosome in the human body and to determine their biochemical nature.

• SPECIFIC GOALS:
– Identify all the genes in human DNA
– Determine the sequences of the 3 billion bps
– Save the information in databases
– Improve tools for data analysis
– Transfer related technologies to the private sector
– Address the ethical, legal and social issues that may arise from the project

What is the Human Genome Project?

Sequencing the Human Genome

Why are genome projects important?
– The key to continued development of molecular biology, genetics and molecular life sciences
– A catalogue containing a description of the sequence of every gene in a genome is seen as immensely valuable, even if the function is not known
– Aid in isolation and utilisation of new genes
– Stretch technology to its limits

What is the potential impact?
– Improved diagnosis/therapy of disease
– Prokaryotic genomes: vaccine design, exploration of new microbial energy sources
– Plant and animal genomes: enhance agriculture

Importance and Impact

• The Whitehead Institute for Biomedical Research (Eric Lander, Massachusetts, USA)

• The Sanger Centre (Cambridge, UK)

• Baylor College of Medicine (Richard Gibbs, Houston, USA)

• Washington University (Robert Waterston, St. Louis, USA)

• DOE's Joint Genome Institute, JGI (Trevor Hawkins, Walnut Creek, California, USA)

• …and other genome centres worldwide...

The primary HGP sequencing sites

The Human Genome Project: Timeline

[Timeline figure, starting 1986-1987. Milestones shown: conference on HGP feasibility; Congress recommends 15-year HGP project; HGP officially begins; low-resolution linkage map of the human genome published; high-resolution maps of specific chromosomes announced; E. coli genome completed; S. cerevisiae genome completed; Celera Genomics formed; C. elegans genome completed; first human chromosome sequenced; fly genome completed; President announces genome working draft completed; human genome published: Nature (Feb. 15, 2001) - HGP, Science (Feb. 16, 2001) - Celera]

• 1983 Los Alamos Labs and Lawrence Livermore National Labs, both under the DOE, begin production of DNA cosmid libraries for single chromosomes

• 1986 DOE announces HUMAN GENOME PROJECT

• 1987 DOE advisory committee recommends a 15-year multi-disciplinary undertaking to map and sequence the human genome. NIH begins funding of genome projects

• 1988 Recognition of need for concerted effort. HUGO (Human Genome Organisation) founded to coordinate international efforts. DOE and NIH sign the Memorandum of Understanding outlining plans for co-operation

History of Human Genome Project

• 1990 DOE and NIH present joint 5-year Human Genome Project to Congress. The 15 year project formally begins

• 1991 Genome Database (GDB) established

• 1992 Low-resolution genetic linkage map of the entire human genome published. High-resolution maps of chromosomes Y and 21 published

• 1993 DOE and NIH revise 5-year goals
– IMAGE consortium (Integrated Molecular Analysis of Genomes and their Expression) established to co-ordinate efficient mapping and sequencing of gene-representing cDNAs

• 1994 Genetic-mapping 5-year goal achieved 1 year ahead of schedule
– Genetic Privacy Act proposed to regulate collection, analysis, storage and use of DNA samples (endorsed by ELSI)
– LLNL chromosome paints commercialised

• 1994-98 Many further developments continue to advance the project

• 1998 Celera Genomics formed
– New 5-year plan by DOE and NIH

• 1999 First chromosome completely sequenced (Chromosome 22)

• 2000 June 26, HGP and Celera announce that they have completed ~97% of the human genome.

• James Watson Original Head of HGP

• Francis Collins

• Craig Venter

People of Human Genome Project

• The Sanger dideoxy termination method (remember?)
– Nucleotide analogs (ddNTPs) are incorporated into DNA during its synthesis together with normal nucleotides (dNTPs); when a ddNTP is inserted, the reaction stops (chain termination)

• Radioactively labeled ddNTPs
– Four different reactions are performed, each containing one of ddA, ddG, ddC or ddT
– Autoradiography enables analysis of the different fragment lengths, which correspond to different termination points

• Fluorescently labeled ddNTPs
– One reaction is carried out; all four ddNTPs are incorporated, but each ddNTP is labelled with a different fluorescent dye
– Automated DNA sequencers interfaced with computers determine the order of the dyes and hence the DNA sequence

DNA sequencing
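The chain-termination read-out can be illustrated with a toy simulation. This is a minimal sketch rather than a model of a real sequencer: it assumes exactly one terminated product per position of the newly synthesised strand, and the function names `sanger_fragments` and `read_gel` are made up for this example.

```python
import random


def sanger_fragments(synthesised: str) -> list[tuple[int, str]]:
    """Return (fragment length, terminating ddNTP base) for every position.

    Assumption: termination happens once at each position of the newly
    synthesised strand, so fragment i ends in the labelled base i.
    """
    return [(i + 1, base) for i, base in enumerate(synthesised)]


def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Order fragments by length, as electrophoresis does, and read the labels."""
    return "".join(base for _, base in sorted(fragments))


if __name__ == "__main__":
    fragments = sanger_fragments("GATTACAGGC")
    random.shuffle(fragments)   # products are mixed together in the reaction
    print(read_gel(fragments))  # GATTACAGGC
```

Sorting by fragment length is the whole trick: the shortest fragment reveals the first base, the next shortest the second, and so on.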

• The Gene Linkage Map

• Identifies position of genes by locating marker base sequences associated with RFLPs

• Based on how close together two genes are
– The closer together two genes are, the less likely they are to separate during meiotic recombination in germ cells
– The frequency of recombination between two genes can help to decipher the distance between them on a gene linkage map
– Genes separated by more than 50 cM (roughly 50 million bp) are not considered linked

• Studies of families affected by genetic disease have proven useful for genetic linkage analysis

Mapping the Human Genome: Low Resolution Mapping
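As a worked illustration of how recombination frequency translates into map distance, here is a minimal sketch. It assumes the usual convention that 1 cM corresponds to 1% recombinant offspring (a reasonable approximation only for short distances), and the offspring counts are invented for the example.

```python
def map_distance_cm(recombinants: int, total_offspring: int) -> float:
    """Approximate map distance in centimorgans from offspring counts.

    Assumption: 1 cM = 1% recombination, which ignores double crossovers
    and therefore underestimates large distances.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100 * recombinants / total_offspring


def is_linked(recombinants: int, total_offspring: int) -> bool:
    """Genes at 50 cM or more behave as if unlinked (independent assortment)."""
    return map_distance_cm(recombinants, total_offspring) < 50


if __name__ == "__main__":
    # Hypothetical family data: 12 recombinant offspring out of 200.
    print(map_distance_cm(12, 200))  # 6.0
    print(is_linked(12, 200))        # True
```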

• The Physical Map

• Provides the actual distances in bps between genes on a given chromosome

• Prepared by aligning the sequences of adjacent DNA fragments from small overlapping clones to form a contiguous map (a contig map)

• Sequence-tagged sites (STSs) mark sites on chromosomes and help to locate adjacent segments of DNA
– If two DNA fragments share an STS they overlap and are contiguous

Mapping the Human Genome: High Resolution Mapping
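The rule that two fragments sharing an STS must overlap is enough to group clones into contigs. The sketch below shows that grouping only (it does not order the clones within a contig); the clone and STS names are hypothetical.

```python
def build_contigs(clones: dict[str, set[str]]) -> list[set[str]]:
    """Group clones into contigs: clones that share an STS are joined.

    `clones` maps a clone name to the set of STS markers detected in it.
    """
    unvisited = set(clones)
    contigs = []
    while unvisited:
        start = min(unvisited)  # deterministic starting clone
        unvisited.remove(start)
        contig, stack = {start}, [start]
        while stack:
            current = stack.pop()
            # Any unvisited clone sharing at least one STS overlaps `current`.
            overlapping = {c for c in unvisited if clones[c] & clones[current]}
            unvisited -= overlapping
            contig |= overlapping
            stack.extend(overlapping)
        contigs.append(contig)
    return contigs


if __name__ == "__main__":
    library = {
        "cloneA": {"sts1", "sts2"},
        "cloneB": {"sts2", "sts3"},
        "cloneC": {"sts3", "sts4"},
        "cloneD": {"sts9"},
    }
    print(build_contigs(library))
```

Here cloneA, cloneB and cloneC chain together through sts2 and sts3, while cloneD stays in a contig of its own.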

• The aim, obviously, is to determine the entire genome sequence

• A sequence has to be constructed from a series of shorter fragments

• Shotgun technique
– Break molecule into smaller fragments
– Determine sequence of each one
– Use a computer to search for overlaps and build a master sequence

Determining genome sequences
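The "search for overlaps and build a master sequence" step can be sketched with a greedy merge. This is a toy illustration, not the assembler used by any genome centre: it assumes error-free reads from one strand and no repeats, and `min_len` is an arbitrary cutoff chosen for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"]))
    # ['ATGCGTACGTTAGC']
```

Real genomes break the no-repeats assumption badly, which is one reason the mapping steps described earlier are used to anchor the assembly.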

• Analysis of DNA sequences of chromosomes by extending the sequenced region a little bit further each time until the tips of the chromosome are reached

• The next round of sequencing is based on the results of the previous round by synthesising appropriate DNA primers to extend further

Chromosome walking
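The round-by-round logic of chromosome walking can be sketched as a loop in which each new primer is taken from the end of the sequence known so far. This is a simplified illustration: the read length, the primer length and the idea that the read includes the primer itself are assumptions made for the example.

```python
def chromosome_walk(template: str, read_len: int = 8, primer_len: int = 4) -> tuple[str, int]:
    """Walk along `template`, returning the sequence recovered and the rounds used."""
    if not 0 < primer_len < read_len:
        raise ValueError("read_len must exceed primer_len so each round adds bases")
    known = template[:read_len]  # first read starts from a known site
    rounds = 1
    while len(known) < len(template):
        primer = known[-primer_len:]      # primer designed from the last read
        start = len(known) - primer_len   # where that primer anneals
        read = template[start:start + read_len]
        assert read.startswith(primer)
        known += read[primer_len:]        # only the bases past the primer are new
        rounds += 1
    return known, rounds


if __name__ == "__main__":
    sequence, rounds = chromosome_walk("ACGTTGCAAGGCTTAACCGG")
    print(sequence, rounds)  # ACGTTGCAAGGCTTAACCGG 4
```

Because every round depends on the previous one, the rounds cannot be run in parallel, which is why walking is slow compared with the shotgun approach.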

• The International Human Genome Sequencing Consortium published their results in Nature, 409(6822):860-921, 2001
– Initial Sequencing and Analysis of the Human Genome

• Celera Genomics published their results in Science, 291(5507):1304-1351, 2001
– The Sequence of the Human Genome

Results of Human Genome Project

• The Human genome contains 3146.7 million bases

• The average gene size is 3,000 bases

• Total number of genes is between 30,000 and 40,000

• The order of 99.9% of the nucleotides is the same in all people

• Of the discovered genes, the function for more than half is unknown

• > 30 genes have already been associated with human disease (e.g. Cancer, blindness)

• About 2% of the genome encodes instructions for the synthesis of proteins

• Repeated sequences make up 50% of the genome

• There are "urban centres" that are gene rich: stretches of repeated C and G bases (CpG islands) occur adjacent to gene-rich areas

• Chromosome 1 has 2,968 genes; the Y has 231

• Humans:
– Have only twice the number of genes of the fly
– Have 3 times as many proteins as fly or worm
– Share the same gene families as fly or worm

• Microbial genomes
– Haemophilus influenzae
– Escherichia coli
– Bacillus subtilis
– Helicobacter pylori
– Streptococcus pneumoniae
– Saccharomyces cerevisiae
– Archaeoglobus fulgidus
– Methanobacterium thermoautotrophicum
– Methanococcus jannaschii
– Mycobacterium tuberculosis
– Staphylococcus aureus

• and more…

• Plant and animal genomes
– Arabidopsis thaliana
– Drosophila melanogaster
– Mus musculus

Completed genomes

Organism                              Genome size (bases)   Estimated genes
Human (Homo sapiens)                  3 billion             30,000
Laboratory mouse (M. musculus)        2.6 billion           30,000
Mustard weed (A. thaliana)            100 million           25,000
Roundworm (C. elegans)                97 million            19,000
Fruit fly (D. melanogaster)           137 million           13,000
Yeast (S. cerevisiae)                 12.1 million          6,000
Bacterium (E. coli)                   4.6 million           3,200
Human immunodeficiency virus (HIV)    9700                  9
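The table makes more sense once genome size and gene count are combined into a gene density. The short sketch below does that arithmetic using only the figures from the table; the function name `genes_per_mb` is made up for this example.

```python
# (genome size in bases, estimated genes), taken from the table above.
GENOMES = {
    "Human": (3_000_000_000, 30_000),
    "Mouse": (2_600_000_000, 30_000),
    "A. thaliana": (100_000_000, 25_000),
    "C. elegans": (97_000_000, 19_000),
    "D. melanogaster": (137_000_000, 13_000),
    "S. cerevisiae": (12_100_000, 6_000),
    "E. coli": (4_600_000, 3_200),
    "HIV": (9_700, 9),
}


def genes_per_mb(size_bases: int, genes: int) -> float:
    """Gene density expressed as genes per million bases."""
    return genes / (size_bases / 1_000_000)


if __name__ == "__main__":
    ranked = sorted(GENOMES.items(), key=lambda item: genes_per_mb(*item[1]))
    for organism, (size, genes) in ranked:
        print(f"{organism:16s} {genes_per_mb(size, genes):8.1f} genes/Mb")
```

On these figures the human genome carries about 10 genes per Mb, far fewer than the compact microbial genomes, which is consistent with most human DNA being non-coding.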

• The DOE and the NIH devote 3-5% of their annual HGP budgets to studying the ELSI associated with the availability of genetic information

• This makes it the world's largest bioethics program, and it has become a worldwide model

• Examples of ELSI are:
– Privacy legislation
– Gene testing
– Patenting
– Forensics
– Behavioural genetics
– Genetics in the courtroom

Ethical, legal and societal issues

• Who should have access to this information?
– Employers
– Insurers
– Schools
– Courts
– Adoption agencies
– Military

• Philosophical Implications
– Human responsibility
– Free will versus genetic determinism

• Who owns and controls genetic information?
– How are privacy and confidentiality managed?

• Psychological impact and stigmatisation
– Effects on the individual
– Effects on society's perceptions and expectations of the individual

Societal Concerns

• Clinical Issues
– Growing demand to educate health care workers
– Public needs to gain scientific literacy and understand the capabilities, limitations and risks
– Standards need to be established, including quality controls to ensure accuracy and reliability
– Regulations?

• Genetic Counselling
– Informed consent for complex procedures
– Counselling about risks, limitations and reliability of genetic screening techniques
– Reproductive decision making based on genetic information
– Reproductive rights

• Multifactorial diseases and environmental factors
– Genetic predispositions do not mandate disease development
– Caution must be exercised when correlating genetic tests with predictions

Clinical Issues

• Who owns genes and DNA sequences?
– The person (or company) who discovered it, or the person whose body it came from?
– Should genetic information be the property of humanity?
– Is it ethical to charge someone for access to a database of genetic information?

• Is it time to raise the bar concerning patents?
– Will patent protection slow the advance of research and be detrimental to society as a whole in the long run?

Commercialisation and patents

[Diagram: areas that benefit from the HGP. Labels: Medicine; Bioinformatics; Biotechnology; DNA chip technology; Gene therapy applications; Diagnostic & therapeutic applications; Medicine & pharmaceutical industries; Agriculture & bioremediation industries; Microarray technology; Proteomics; Pharmacogenomics; Preventative measures; Developmental biology; Evolutionary & comparative biologists]

Benefits of Human Genome Project

• These occur when a single nucleotide in the genome sequence is altered (1 bp difference)

• 66% of SNPs involve a C to T change, and they occur every 100-300 bases in either coding or non-coding regions

• Evolutionarily stable; there are between 2 and 3 million SNPs in the human genome

• Many SNPs have no effect on cell function, but:
– Some SNPs could be responsible for variations in how humans respond to disease, environmental factors, drugs and other therapies
– SNPs may help identify multiple genes involved in complex diseases

Single nucleotide polymorphisms
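Since a SNP is a 1 bp difference, spotting candidates between two already-aligned sequences is a position-by-position comparison. The sketch below assumes equal-length, gap-free sequences; the sequences and the name `find_snps` are invented for the example.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumption: the two sequences are already aligned, with no insertions
    or deletions, so equal positions correspond to each other.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    print(find_snps("ATGCCGTA", "ATGTCGTA"))  # [(4, 'C', 'T')]
```

A single mismatch between two individuals is only a candidate; calling it a polymorphism requires seeing the variant in a meaningful fraction of a population.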

• SNPs are NOT the same things as alleles (or so we believe so far)

• Researchers have found that most SNPs are not responsible for a disease state

– They serve as markers for pinpointing a disease on the human genome map, being located near a gene found to be associated with a certain disease

– Occasionally, SNPs may actually cause a disease and can be used to search for and isolate the disease-causing gene

– SNPs travel together - i.e. Variations in DNA are linked

• To date, Celera & Orchid Biosciences have the largest databases

• Goals:
– Develop large scale technologies
– Identify common variants in the coding regions
– Create a SNP map of at least 100,000 markers
– Develop the intellectual foundation for studies of sequence variation
– Create public resources of DNA samples and cell lines

• SNP Consortium:
– Ten large pharmaceutical companies and the UK Wellcome Trust
– Headed by Arthur Holden
– Find and map 300,000 common SNPs
– Generate a widely accepted, high-quality, publicly available map

• High quality genome sequencing and annotation (2003)

• Complete the sequencing of the genomes of other model organisms (e.g. mouse)

• The next step: Functional Genomics

• Determine what our genes do through systematic studies of function on a large scale
– Transcriptomics - Comparative analysis of mRNA expression/splicing
– Proteomics - Comparative analysis of protein expression and post-translational modifications
– Structural genomics - Determine 3-D structures of key family members
– Intervention studies - Effects of inhibiting gene expression
– Comparative genomics - Analysis of DNA sequence patterns of humans and well-studied model organisms

What next?

• Is it ethical for the government to invest such a large fraction of its research budget in the Human Genome Project when the result is denial of funding for other worthy projects?

• Do such possibilities as finding the cause of many genetic diseases and identifying criminals outweigh such concerns as the possibility of using the genetic information to renew the types of eugenics programs practiced before and during World War II or to deny health insurance coverage?

• Given the huge investment of public funds in the Human Genome Project, is the government responsible for ensuring that the benefits will be equally available to people of all socioeconomic levels and ethnic or racial backgrounds?

• Should genetic testing be made available to people who have not received the genetics counseling they need in order to fully understand and respond to the results?

• Whole genome
– Once the whole genome is truly known and the whole genome sequences become available for an organism, the challenge turns from identifying parts to understanding function

• Functional genomics
– The post-genomic era is defined as functional genomics
– Assignment of function to identified genes
– Organisation and control of genetic pathways that come together to make up the physiology of an organism

Functional Genomics

• 42% of the genes found in the human genome are of unknown function

• Assigning function to these genes requires systematic high-throughput methods

Functional Genomics

The Periodic Table: Functional grouping of Chemical Elements

Biologist's Periodic Table: a system for classifying an organism's genes

• Will not be two-dimensional

• Will reflect similarities at diverse levels– Primary DNA sequence in coding and regulatory regions

– Polymorphic variation within a species or subgroup

– Time and place of expression of RNAs during development, physiological response and disease

– Subcellular localisation and intermolecular interaction of protein products

• Array of hope? Arrays offer hope for global views of biological processes
– Systematic way to study DNA and RNA variation
– Standard tool for molecular biology research & clinical diagnostics
– Labelled nucleic acid molecules can be used to interrogate nucleic acid molecules attached to a solid support (remember Southern blotting?)

(Refer to January 1999, Nature Genetics Supplement, Volume 21)

Gene Expression analysis

• DNA chips: also known as gene chips, biochips or microarrays. Basically DNA-covered pieces of glass (or plastic) capable of analysing thousands of genes simultaneously; they can be high density arrays of oligonucleotides or cDNA

• Chips allow the monitoring of mRNA expression on a large scale (i.e. many genes at the same time)

Before 1995, Northern blots were used to look at gene expression
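One common way to summarise a two-sample chip experiment is a log2 ratio of signal intensities per gene. The sketch below assumes that convention, which is not stated on the slide; the gene names, intensities, pseudocount and two-fold threshold are all illustrative choices.

```python
import math


def log2_ratios(test: dict[str, float], control: dict[str, float],
                pseudocount: float = 1.0) -> dict[str, float]:
    """log2(test/control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((test[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in test
        if gene in control
    }


def changed_genes(ratios: dict[str, float], threshold: float = 1.0) -> dict[str, float]:
    """Keep genes whose expression changed at least 2**threshold-fold either way."""
    return {gene: r for gene, r in ratios.items() if abs(r) >= threshold}


if __name__ == "__main__":
    # Hypothetical spot intensities for three genes.
    control = {"geneA": 100.0, "geneB": 250.0, "geneC": 800.0}
    diseased = {"geneA": 400.0, "geneB": 260.0, "geneC": 90.0}
    ratios = log2_ratios(diseased, control)
    for gene, ratio in changed_genes(ratios).items():
        print(f"{gene}: log2 ratio {ratio:+.2f}")
```

Working in log2 makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric, which is easier to compare across thousands of spots.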

Incyte

Affymetrix

Determining gene function

[Diagram: lines of evidence converging on gene function: sequence homology; sequence motif; tissue distribution; chromosome localisation; expression in disease; biochemical assays; proteomics; expression in models]

Protein synthesis

RNA synthesis and processing

Alternatively spliced mRNA

• DEFINITION: The mRNA content present at any given moment in a cell or a tissue, and its behaviour over time and cell states (Adam Sartel, COMPUGEN).

The complete collection of mRNAs and their alternative splice forms is sometimes referred to as the transcriptome. The transcriptome is the set of instructions for creating all of the different proteins found in an organism.

(From Genome to Transcriptome, Incyte)

The transcriptome

Genome, proteome and transcriptome

The Proteome

The Genome

- Index to a range of possible proteins
- Useful as a map and for inter-organism analysis

- Describes what actually happens in the cell
- Complex tools, partial results

• Discovery of new proteins:
– that are present in specific tissues
– that have specific cell locations
– that respond to specific cell states

• Discovery of new variants:
– of important genes
– that work to increase/decrease the activity of the 'native' protein

• The transcriptome reflects tissue source (cell type, organ) and also tissue activity and state, such as the stage of development, growth and death, cell cycle, diseased or healthy, response to therapy or stress.

Use of transcriptome analysis

• Proteomics…where the genome hits the road
– Proteomics refers to the simultaneous, large scale analysis of all (or many) of the proteins made in a cell at one time, to get a global picture of what proteins are made in cells and when
– Hopefully we can then determine the 'whys' and what we can do about it; very important for drug development
– The proteome is the protein complement encoded by a genome; the term was first proposed by an Australian post-doc, Marc Wilkins, in 1994

Beyond genomics…proteomics

Beyond the genome: Proteomics

• Genomics involves study of mRNA expression; the full set of genetic information in an organism contains the recipes for making proteins

• Proteins constitute the "bricks and mortar" of cells and do most of the work

• Proteins distinguish the various types of cells: since all cells have essentially the same genome, their differences are dictated by which genes are active and the corresponding proteins that are made

• Similarly, diseased cells may produce dissimilar proteins to healthy cells

• However, the task of studying proteins is often more difficult than that of studying genes (e.g. post-translational modifications can dramatically alter protein function)

• Identification of all the proteins made in a given cell, tissue or organism

• Identification of the intracellular networks associated with these proteins

• Identification of the precise 3D-structure of relevant proteins to enable researchers to identify potential drug targets to turn protein “on or off”

• Proteomics very much requires a coordinated focus involving physicists, chemists, biologists and computer scientists

Beyond the genome: Proteomics

• Major challenge: how do we go from the treasure chest of information yielded by genomics to an understanding of cellular function?

• Genomics-based approaches initially use computer-based similarity searches against proteins of known function

• Results may allow some broad inferences to be made about possible function

• However, a significant percentage (>30%) of the sequences thus far ascertained seem to code for proteins that are unrelated at this level to proteins of known function

• Beyond the genetic make-up of an individual or organism, many other factors determine gene and ultimately protein expression and therefore affect proteins directly

• These include environmental factors such as pH, hypoxia, drug treatment to name a few

• Examination of the genome alone can not take into account complex multigenic processes such as ageing, stress, disease or the fact that the cellular phenotype is influenced by the networks created by interaction between pathways that are regulated in a coordinated way or that overlap

• Genomic analysis has certainly provided us with much insight into the possible role of particular genes in disease

• However proteins are the functional output of the cell and their dynamic nature in specific biological contexts is critical

• The expression or function of proteins is modulated at many diverse points from transcription to post-translation and very little of this can be predicted from a simple analysis of nucleic acids alone

• There is generally poor correlation between the abundance of mRNA transcribed from the DNA and the respective proteins translated from that mRNA

• Furthermore, transcript splicing can yield different protein forms

• Proteins can undergo extensive modifications such as glycosylation, acetylation, and phosphorylation, which can lead to multiple protein products from the same gene

Proteomics Tools

• The core methodologies for displaying the proteome are a combination of advanced separation techniques, principally involving two-dimensional gel electrophoresis (2D-GE) and mass spectrometry

2D-GE: basic methodology

• Sample (tissue, serum, cell extract) is solubilized and the proteins are denatured into polypeptide components

• This mixture is separated by isoelectric focusing (IEF): on the application of a current, the charged polypeptide subunits migrate in a polyacrylamide gel strip that contains an immobilized pH gradient (IPG) until they reach the pH at which their overall charge is neutral (isoelectric point or pI), producing a gel strip with distinct protein bands along its length

• This strip is applied to the edge of a rectangular slab of polyacrylamide gel containing SDS. The focused polypeptides migrate in an electric current into the second gel and undergo separation on the basis of their molecular size

• The resultant gel is stained (Coomassie, silver, fluorescent stains) and spots are visualized by eye or an imager. Typically 1000-3000 spots can be visualized with silver. Complementary techniques, e.g. immunoblotting, allow greater sensitivity for specific molecules.

• Multiple forms of individual proteins can be visualized, and the particular subset of proteins examined from the proteome is determined by factors such as initial solubilization conditions, pH range of the IPG and gel gradient

2D-GE: basic methodology
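The first dimension relies on each polypeptide having a pH at which its net charge is zero. The sketch below estimates that pI by bisection on a Henderson-Hasselbalch charge model. It is illustrative only: the pKa values are approximate textbook-style numbers chosen for the example (not taken from the lecture), and the model ignores modifications and neighbouring-residue effects.

```python
# Approximate side-chain and terminal pKa values (assumed for illustration).
PKA_POSITIVE = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a polypeptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - PKA_POSITIVE["Nterm"]))
    charge -= 1 / (1 + 10 ** (PKA_NEGATIVE["Cterm"] - ph))
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1 / (1 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1 / (1 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 1e-4) -> float:
    """Find the pH where net charge crosses zero, by bisection over pH 0-14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2


if __name__ == "__main__":
    for peptide in ("KKKK", "DDDD", "ACDEFGHIK"):
        print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection works here because net charge only ever decreases as pH rises, so there is a single crossing point for the strip to focus on.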

General schematic of 2D-PAGE for protein identification in Toxicology

Sample growth Sample solubilization

Isoelectric focusing (IPG)

2D-PAGE

Image analysis

Immunoblot (Western)

Isolation of spots of interest

Trypsin digestion of proteins

MS analysis of tryptic fragments

Identification of proteins

General strategy for proteomic analysis

Nature of IPG determines spot location on 2D-PAGE

Limitations of 2D-GE

• In large scale proteomic analysis, 2D-GE has been the major workhorse over the last 20 years; it is unique in being able to distinguish post-translational modifications and is analytically quantitative

• However, despite significant improvements to the technique (e.g. immobilized pH gradients) and its coupling with MS analysis, it is still difficult to automate

• Although at first glance the resolution of 2D-GE seems very impressive, it still lags behind the enormous diversity of proteins, and thus comigrating protein spots are not uncommon

• This is especially of concern when trying to distinguish between highly abundant proteins, e.g. actin (10^8 molecules/cell), and low-abundance proteins like transcription factors (100-1000 molecules/cell); this is beyond the dynamic range of 2D-GE

• Enrichment or prefractionation can often overcome such discrepancies

• Chemical heterogeneity of proteins also presents a major limitation

• Thus the full range of pIs and MWs of proteins exceeds what can routinely be analyzed on 2D-GE. However, improvements to IPGs are expected to overcome some of these constraints and greatly improve the coverage of the entire proteome of the cell

• Problems linked with extraction and solubilization of proteins prior to 2D-GE present an even greater challenge, especially for extremely hydrophobic proteins such as membrane and nuclear proteins. Again, recent advances in buffer composition have diminished the scale of this problem

Limitations of 2D-GE

Protein identification and characterization

• Specialized imaging software allows for a more detailed analysis of spot identification and comparison between gels, and treatments

• By a process of subtraction, differences (e.g. presence, absence, or intensity of proteins or different forms) between healthy and diseased samples can be revealed

• Cross-references to protein databases allow assignment by known pIs and apparent molecular size. Ultimate protein identification requires spot digestion (enzymatic) and analysis of charge and mass by mass spectrometry (MS)

• Spot cutter tools can be coupled to image analysis tools, and in-gel tryptic digestion techniques in 96- or 384-well format can greatly reduce the bottleneck in sample identification by MS

Protein analysis by MS

• Compared to sequencing, MS is more sensitive (femtomole to attomole amounts) and has higher throughput

• Digestion of an excised spot with trypsin results in a mixture of peptides. These are ionized by electrospray ionization from the liquid state or by matrix-assisted laser desorption ionization from the solid state (MALDI-TOF), and the mass of the ions is measured by various coupled analyzers (e.g. time of flight, which measures the time for ions to travel from the source to the detector), resulting in a peptide fingerprint

• The resultant signature is compared with the peptide masses predicted from theoretical digestion of protein sequences found in databases, which identifies the protein

• Tandem MS allows one to obtain actual protein sequence information: discrete peptide ions can be selected and further fragmented, and complex algorithms are employed to correlate experimental data with database-derived peptide sequences
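The comparison of a measured peptide fingerprint with theoretical digests can be sketched in a few lines. This is a minimal illustration under stated assumptions: trypsin is modelled as cutting after K or R unless the next residue is P, no missed cleavages or modifications are considered, masses are neutral monoisotopic values, and the two "database" proteins and the 0.01 Da tolerance are invented for the example.

```python
import re

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def trypsin_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fingerprint_score(observed: list[float], protein: str, tol: float = 0.01) -> int:
    """Count observed masses that match a theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in trypsin_digest(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def identify(observed: list[float], database: dict[str, str]) -> str:
    """Return the database entry with the most matching peptide masses."""
    return max(database, key=lambda name: fingerprint_score(observed, database[name]))


if __name__ == "__main__":
    database = {"protA": "MKTAYIAKQR", "protB": "GASPVTCLNDK"}  # hypothetical
    spot = [peptide_mass(p) for p in trypsin_digest(database["protA"])]
    print(trypsin_digest("AKRPGK"))  # ['AK', 'RPGK']
    print(identify(spot, database))  # protA
```

Counting matches is the simplest possible score; it also shows why leucine and isoleucine, which share a mass, cannot be told apart by a fingerprint alone.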
