genes & genomes - unifi - disia - sito...
TRANSCRIPT
From Genetics to Genomics
Mendel 1866 – Inheritance of traits
Sutton 1903 – Genes are on chromosomes
Griffith 1928 – The transforming factor
Avery, MacLeod, McCarty 1944 – DNA is the genetic material
Watson, Crick et al. 1953 – DNA is a double helix
Sanger et al. 1951 – Protein sequencing
Crick 1957 – discovery of tRNA
Various 1960-65 – Codon mapping
Leder et al. 1977 – Introns
Sanger et al.; Maxam, Gilbert et al. 1977 – DNA sequencing
Mullis 1990 – PCR
Altschul et al. 1990 – BLAST
1990s – Various "gene knockout" and RNAi methodologies
Pat Brown, Ron Davis 1995 – Microarrays
Consortium 1996 – sequencing of the S. cerevisiae genome
Consortium 1998 – first animal genome: C. elegans
Consortium 2000 – first plant genome: Arabidopsis thaliana
Consortia/Celera 2001 – draft sequence of the human genome
What is a gene?
If you ask...
A geneticist
A genetic engineer
A biochemist
A molecular biologist
A sociobiologist
A philosopher
A poet
The self-replicating unit that is transmitted according to Mendel's laws
The information a cell needs to produce a protein
The fraction of a chromosome that can be transcribed
The idea that helps us understand the mystery of life and its development
A structure that has existed long enough and is sufficiently complex to serve as a basis for evolution
A molecule capable of replicating and recombining that can be transferred into a recipient cell
She will answer...
A gene is a gene, is a gene, is a gene...
Evolution of the gene concept
1866 – Mendelian trait
1902 – Gene (Sutton & Boveri)
1940 – One gene one enzyme, one gene one protein (Beadle & Tatum)
1953 – DNA structure (Watson & Crick)
1977 – Introns-exons (Leder et al.)
1990-2000 – Genome Projects
Genes come in pairs, like chromosomes, and the 2 members of each pair segregate in a balanced fashion into the gametes
Message
A gene is a DNA region that can be transcribed into a functional RNA at a defined and precise moment, based on the specific needs of the cell, to be later translated into a protein
What is Protein Function? The Post-Genomic View
• The biochemical reaction in which it participates?
• The biological process in which it is involved?
• The genes and proteins it interacts with?
• The genes and proteins with which it is co-regulated?
Genome Definition
Every cell of an organism contains one or more sets of chromosomes: one genome.
The genome consists of one or more long molecules of DNA that are organized in chromosomes.
The prokaryotic cell
The genome consists of a circular chromosome in single copy (haploid).
The eukaryotic cell
Possesses a nucleus, a nuclear genome and organelles.
The nuclear genome consists of several linear chromosomes:
•In single copy, haploid (S. cerevisiae and germ cells)
•In double copy, diploid (animals)
•In multiple copies, polyploid (plants)
Bacteria are smaller than eukarya
Eukaryotic cell diameter: from 2 to 200 µm
S. cerevisiae: 4 µm haploid, 6 µm diploid
Epulopiscium fishelsoni, bacterium, 100 µm x 0.5 mm
Thiomargarita namibiensis
DNA: on average 4 x 10^6 bp, from 0.65 to 10 megabases
DNA supercoiled in a circular molecule
3-4 plasmids
Prokaryotic genome
1. Does not have a nuclear membrane.
2. Has a single circular genome made of a DNA double helix.
3. E. coli = 4 x 10^6 bp, 99% of which is coding sequence; a compact genome in which only 1% is made of repeated non-coding regions.
4. Genes are organized in operons: functional units in which related genes (e.g. the enzymes of a metabolic pathway) are located next to one another on the chromosome, transcribed into a single polycistronic mRNA, and regulated as a whole.
Prokaryotic gene structure
Regulatory region, responsible for the onset of transcription: promoter, operator
Coding region
Terminator
What is a promoter?
• It is a DNA region, usually a palindromic sequence, recognized specifically by a given transcription factor.
• In many cases more than one transcription factor is needed to activate transcription, so promoters have a complex structure.
TRP operon of E. coli
Proteins E + D = first enzyme of the trp pathway
Protein C = intermediate step
Proteins A + B = tryptophan synthase
Negative and positive transcriptional control of the lac operon by the repressor protein and cAMP-CAP, respectively
Microbial genome projects
E.g. Streptococcus pneumoniae (Gram+): pneumonia, bacteremia, meningitis and otitis
• 2,160,837 bp
• 2236 coding regions
• 1440 (64%) with known function
• 5% insertion sequences (IS, ex-transposons), which might cause rearrangements.
The Institute for Genomic Research (TIGR)
Comprehensive Microbial Resource (CMR)
55 microbial genomes completely sequenced in 2004. 45 species
Genome analysis and the tree of life
Archaea: extremophiles
Bacteria: bacteria
Woese, 1998
Eukaryotic sequences
1. Single-copy DNA: genes coding for proteins
2. DNA in multiple copies: LINEs, SINEs, junk DNA
3. Spacer DNA
And centromeres, regions in the middle of the chromosome, rich in repeated sequences. They are important for the attachment of chromosomes to the spindle and for chromosome segregation.
Cytological maps, chromosome banding:
Feulgen staining
Euchromatin: less intensely stained, less packed, active genes.
Heterochromatin: more densely stained; its borders often contain genes that can be turned on or off; it is present near centromeres.
Heterochromatin is transcriptionally inactive
3H-uridine labelling of RNA synthesis sites (silver stain, black). White regions are heterochromatin near the nuclear membrane.
Active RNA synthesis
Are the boundaries between euchromatin and heterochromatin variable?
Are they tissue specific?
Can genes at the borders be switched off, leading to loss of function?
Rearrangements (translocation or inversion) can place a gene near heterochromatin or non-transcribed regions and inactivate the gene
Giemsa band staining
G-light bands: GC rich, contain housekeeping genes, active in every cell type.
G-dark bands: AT rich, late replicating, contain tissue-specific genes.
Genes of higher eukaryotes are discontinuous
They contain non-translated regions, introns, that interrupt the coding regions, exons, expanding gene size by as much as 20 times.
Regulatory region, ATG
Coding Region
Terminator
Introns
Exons
Introns are transcribed together with exons but are removed during mRNA maturation, before export from the nucleus, through a mechanism called RNA splicing.
Some genes, such as those for interferons and histones, are an exception, as they do not have introns.
RNA processing
β-globin gene: 3 exons interrupted by 2 introns
The “untranslated regions” (UTRs) are transcribed regions that will not be translated.
At the 5’ end a 7-methylguanylate cap is added (m7Gppp; green)
At the 3’ end poly(A) residues are added (poly(A))
RNA specific for a protein of 147 aa
Splicing: intron removal
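A minimal sketch of intron removal, assuming a made-up primary transcript with 3 exons and 2 introns. The sequence and the exon coordinates are hypothetical and are not the real β-globin gene; the function name `splice` is introduced only for illustration.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a primary transcript, dropping the introns.

    exons: list of (start, end) pairs, 0-based, end exclusive, in 5'->3' order.
    All coordinates used below are hypothetical.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUGGA" + "guaugcccacag" + "CCCUAA"
exons = [(0, 6), (18, 24), (36, 42)]

mature = splice(pre_mrna, exons)
print(mature)  # AUGGCCUUUGGACCCUAA
```

Passing a different subset of exons to the same function gives a simple picture of alternative splicing: the same transcript yields different mature messages.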
The Walter Gilbert hypothesis on the role of introns
•Exons encode different protein domains; a domain is a functional region of a protein.
•Exon shuffling can cause rapid evolution of a protein by juxtaposing different domains in different splicing variants.
•It is possible to generate great variability by mixing a relatively small number of sequences.
•Alternative splicing processes are responsible for the tissue-specific variability of different proteins.
Evolution of the gene for triosephosphate isomerase
Ancient origin of introns with equal position
Introns
Exons
S. cerevisiae nuclear genes do not have introns
S. cerevisiae Mitochondrial genes do have introns
S. cerevisiae genes have lost introns during evolution: S. cerevisiae is a unicellular organism and does not need the tissue specificity generated through splicing, and splicing can be cumbersome for a very efficient model of development.
Or were introns generated for the first time in mitochondrial DNA, and has such a system been positively selected during evolution?
Saccharomyces cerevisiae: the link between prokaryotes and eukaryotes, or a very specialized eukaryotic organism?
In S. cerevisiae the 5 genes of the TRP biosynthetic pathway are located on 4 different chromosomes; regulation occurs through transcription factors, with a much more finely tuned interplay
Stress signals
Post-translational modifications
ATM, ATR, CHK1, CHK2, DNA-PK, JNK, CKI, PKC, CDK2, SUMO, p38, CAK, HIPK2, CK2/hSPT16/SSRP1, PCAF, p300/CBP, MDM2, HDAC, hSIR2, SIN3
The p53 tumor suppressor protein is an inducible sequence-specific transcription factor...
Stabilization and activation
p53 tetramer (N- and C-termini labelled)
…that binds to a family of different response elements...
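Consensus response element: RRRCWWGYYY(N)0-13RRRCWWGYYY (two decamers separated by a spacer of 0 to 13 nucleotides; R = purine, W = A or T, Y = pyrimidine)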
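The consensus above can be turned into a pattern search. The following is only a sketch: it scans a single strand, accepts only perfect matches to both decamers, and the function names and the test sequence are invented for illustration.

```python
import re

# IUPAC nucleotide codes that appear in the consensus
IUPAC = {"R": "[AG]", "Y": "[CT]", "W": "[AT]",
         "A": "A", "C": "C", "G": "G", "T": "T"}


def half_site_pattern(consensus="RRRCWWGYYY"):
    """Translate the decamer consensus into a regular expression."""
    return "".join(IUPAC[base] for base in consensus)


def find_response_elements(dna, max_spacer=13):
    """Return (start, site) for every perfect two-decamer site on this strand."""
    half = half_site_pattern()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile("(?=(%s[ACGT]{0,%d}%s))" % (half, max_spacer, half))
    return [(m.start(), m.group(1)) for m in pattern.finditer(dna.upper())]


# Hypothetical sequence: two perfect decamers separated by a 3-nt spacer
dna = "TT" + "AGACATGCCT" + "ACG" + "GGGCTTGTTC" + "AA"
print(find_response_elements(dna))  # [(2, 'AGACATGCCTACGGGGCTTGTTC')]
```

Real response elements often deviate from the consensus at one or more positions, so a practical search would also score imperfect matches and scan the reverse strand.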
…and can modulate a wide array of target genes...
Co-activators/adaptors/co-repressors: ADA3, ASPP, p53-BP1, p53-BP2, p33ING1, WRN, BRCA1, TFIID, TFIIH, SIR2, CBP/p300, MDM2, MDMX
…that can regulate:
Target genes
Cell cycle arrest: p21, cyclin G, 14-3-3σ, CDC25-C, PC3, PA26
DNA repair: Gadd45, PCNA, p53-R2, p48, BTG2, XPC?
Cell death: BAX, PIG3, IGF-BP3, NOXA, PUMA, AIP1, Scotin, PIDD, PERP, Apaf-1, FAS, Killer/DR5, TRAIL
p53 stability
MDM2
p53: a key player in cell cycle control
p53 co-activators: CBP/p300, ref1, p33ING1, WRN, BRCA1, ADA3
Stress signals (e.g. DNA damage, nucleotide depletion, hypoxia, activated oncogenes, viral infection)
Upstream activators: ATM, ATR, CHK1, CHK2, DNA-PK, JNK, CKI, PKC, CDC2, SUMO-1
Post-translational modifications: phosphorylation, acetylation, sumoylation
p14ARF
MDM2
Transcriptional activation
Apoptosis: bax, PIG3, IGF-BP3, Killer/DR5, Trail, NOX-A, AIP-1, Scotin, PIDD
G1/G2 arrest: p21, PC3, PA26, 14-3-3σ, cyclin G, CDC25-C
DNA repair: Gadd45, p48-XP, PCNA, p53-R2, BTG2
Saccharomyces cerevisiae as a model organism
The yeast Saccharomyces cerevisiae is one of the most common models of the eukaryotic cell
Rapid cell proliferation
Easily cultivated in Petri dishes
Possibility to isolate mutants
Well-defined genetic system
Highly versatile in gene manipulation techniques
Non-pathogenic
Available in large amounts
Saccharomyces cerevisiae genome (1996)
•Genome: 13.4 Mb
•16 chromosomes
•tRNA 275
•rRNA 140 repeats
•Almost 1 Mb of repetitive sequences (junk DNA??)
•Proteins 6300-5570
•70% of the genome is coding
•One gene every 2 kb
•4% of the genes have introns
•60% of the proteins have known function
•20% of the proteins have a function assigned in silico
•Duplicated gene families are in subtelomeric regions
The human genome project
1994 – Genetic map (Gyapay et al., 1994): 23 linkage groups (one per chromosome) with 1,200 markers each spaced 1 cM
1995 – Physical map (Hudson et al., 1995): 52,000 STSs (Sequence Tagged Sites) at intervals of 60 kb
1995 – Database of 30,000 ESTs (Adams et al., 1995)
1998 – Collection of 3,000 SNPs (Wang et al., 1998)
2000 – First sequence of chromosome 21 (Hattori et al., 2000)
Human Genome Project
(National Institutes of Health & Department of Energy)
Goals:
•Genetic maps.
•Sequence the 3 billion letters of human DNA with an accuracy greater than 99.99% by 2005.
•Identify every human gene (ORFs and ESTs, functional and comparative data).
•Compile a database of polymorphisms (SNPs)
February 2001
•Venter et al., The sequence of the human genome. Science 2001
•International Human Genome Sequencing Consortium (IHGSC), Initial sequencing and analysis of the human genome. Nature 2001
27,000 – 34,000 genes
DNA source
3/4; 2/3
Only males in the first draft; no ethnic ID
2 males, 3 females; ethnic base equally distributed
DNA is in large excess with respect to what is needed for making proteins
Only 3% of the sequence is coding
97% is non-coding
1 error every 10,000 bases
Frequency of SNPs: 10 every 10,000 bases
1 error every 10 SNPs
Extended centromeric heterochromatin will never be sequenced (20% of the genome).
Venter et al., The sequence of the human genome. Science 2001
Size of the genome (including gaps): 2.91 Gbp
Size of the genome (excluding gaps): 2.66 Gbp
Percent of A+T in the genome: 54
Percent of G+C in the genome: 38
Percent of undetermined bases in the genome: 9
Most GC-rich 50 kb: Chr. 2 (66%)
Least GC-rich 50 kb: Chr. X (25%)
Percent of genome classified as repeats: 35
Number of annotated genes: 26,383
Percent of annotated genes with unknown function: 42
Number of genes (hypothetical and annotated): 39,114
Percent of hypothetical and annotated genes with unknown function: 59
Gene with the most exons: Titin (234 exons)
Average gene size: 27 kbp
Most gene-rich chromosome: Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes: Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes): 605 Mbp
Chromosome with highest proportion of DNA in annotated exons: Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons: Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes): Chr. 13 (3,038,416 bp)
Distribution of the functions of 26,383 human genes
Venter et al., Science 2001
Internet Resources
1988: National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov
Online Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov)
Cancer Genome Anatomy Project (CGAP)
http://cgap.nci.nih.gov/
Genes in populations
New alleles appear in populations through mutations in the cells of the germ line (mutations in the somatic line are lost with the death of the individual).
Every allele has a given allelic frequency in the population, depending on its fitness.
Mutations alter a status quo acquired through selection, therefore most mutations are deleterious.
Allelic frequencies change with time through interaction with the environment, as a result of natural selection and genetic drift.
Natural selection changes the frequency of an allele by acting on its fitness, based on environmental or sociological considerations. As a result, favourable mutations are preserved (they increase in frequency), while deleterious ones are lost or kept unexpressed (recessive alleles present in heterozygosity). Their frequency tends to decrease, but they can be preserved in the population, as they might become useful later on with a change in the environment.
An allele is fixed when it reaches a frequency of 100%.
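A minimal sketch of how selection and drift together move an allele frequency towards loss or fixation. It assumes a simple haploid model with non-overlapping generations (Wright-Fisher style), which the slides do not specify; the function names, fitness values and population size are illustrative assumptions.

```python
import random


def next_generation(p, n_copies, w_a=1.0, w_b=1.0, rng=random):
    """Frequency of allele A after one generation (assumed haploid model).

    p: current frequency of A; n_copies: number of gene copies in the population;
    w_a, w_b: relative fitness of allele A and of the alternative allele.
    """
    # Selection: reweight allele A by its relative fitness.
    mean_fitness = p * w_a + (1.0 - p) * w_b
    p_selected = p * w_a / mean_fitness
    # Drift: draw n_copies gene copies at random from the selected pool.
    count = sum(1 for _ in range(n_copies) if rng.random() < p_selected)
    return count / n_copies


def run_until_fixed_or_lost(p, n_copies, w_a=1.0, w_b=1.0, seed=0,
                            max_generations=100_000):
    """Iterate until the allele is fixed (1.0) or lost (0.0)."""
    rng = random.Random(seed)
    for generation in range(max_generations):
        if p in (0.0, 1.0):
            return generation, p
        p = next_generation(p, n_copies, w_a, w_b, rng)
    return max_generations, p


generations, final_p = run_until_fixed_or_lost(p=0.5, n_copies=20)
print(generations, final_p)
```

With equal fitness values only drift acts, and in a small population the allele is quickly lost or fixed by chance; giving allele A a higher `w_a` makes fixation more likely but never certain.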
A single genotype may produce many different phenotypes, depending on the environment
Genotype → Phenotype 1, Phenotype 2, Phenotype 3
A single phenotype may be produced by many different genotypes, depending on the environment
Genotype 1, Genotype 2, Genotype 3 → Phenotype
SNPs
• Single Nucleotide Polymorphism: a site in the genome where a single nucleotide can be present in more than one form in a collection of individuals of the same species
• SNPs are usually substitutions, but deletions or insertions of a single nucleotide are also often observed
• Frequency in the human genome: 1 every 1 kb.
Types of SNPs
•Transitions (most frequent):
Purine → Purine (A → G; G → A)
Pyrimidine → Pyrimidine (C → T; T → C)
•Transversions:
Purine → Pyrimidine (G → C; G → T; A → C; A → T)
Pyrimidine → Purine (C → G; C → A; T → A; T → G)
•Insertions and/or deletions
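The substitution classes listed above can be written as a small classifier. This is a sketch; the function name `classify_substitution` is introduced here for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases among A, C, G, T")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) = transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("G", "T"))  # transversion
```

Of the 12 possible substitutions, 4 are transitions and 8 are transversions, so the fact that transitions are observed more often is not a simple effect of counting.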
Examples of mutations
(A) A replication error can lead to a mismatch in one of the daughter double helices, generating one mutated molecule and one with the correct sequence.
(B) Effect of a mutagen that alters an A in the lower strand of the parental molecule. In this case too, a mismatch results.
Mechanisms that ensure the accuracy of DNA replication
(A) DNA polymerase actively selects the correct nucleotide to insert at each position.
(B) Errors that do occur can be corrected by a proofreading activity if the polymerase has a 3’-5’ exonuclease activity.
If the last nucleotide inserted is paired with the complementary base of the template, the polymerase activity prevails.
If it is not paired, the exonuclease activity is favoured.
Non-coding SNPs
Are located in 5’ or 3’ non-transcribed regions (NTR), in 5’ or 3’ untranslated regions (UTR), in introns or in intergenic regions.
Coding SNPs
Replacement polymorphism: changes the amino acid.
Synonymous polymorphism: changes the codon but not the amino acid.
Nonreplacement polymorphism
Comprises the synonymous polymorphisms and the non-coding SNPs.
They have an indirect effect on gene function, altering the regulation of transcription, translation, splicing and RNA stability.
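To make the distinction between replacement and synonymous polymorphisms concrete, the sketch below translates a codon before and after a single-base change using the standard genetic code. The function name and the example codons are chosen for illustration only.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base change at `position` (0, 1 or 2) of a DNA codon."""
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "replacement"


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, Glu -> Glu: synonymous
print(classify_coding_snp("GAG", 1, "T"))  # GAG -> GTG, Glu -> Val: replacement
```

Changes at the third codon position are often synonymous, while changes at the first or second position usually change the amino acid.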
Polymorphism
This term refers to a locus that is represented by a number of different alleles or haplotypes in the population
Model systems
The importance of the models
Just as in biology nothing makes sense unless interpreted through evolution, in genomics nothing makes sense unless analyzed in a comparative fashion.
Comparative Genomics
Synteny: conservation of gene order on the chromosomes of evolutionarily related organisms.
Homologous genes: derived from a common ancestral locus
• Orthologous genes: genes present in the genomes of different organisms that derive from a common ancestral locus.
• Paralogous genes: similar genes present in the same genome that derive from a process of gene or genome duplication.
Synteny: man and mouse
A) Synteny blocks between mouse chromosome 11 and parts of 5 human chromosomes.
B) Zoom in on 5q31 (1 Mb), with perfect synteny of 23 genes (4 interleukins).
C) Alignment of a 50 kbp region.
Only 1% is rodent specific, while 14% is common to all mammals.
Mouse Genome Sequencing Consortium, Nature 2002
Taxonomy of mouse proteins
Mammalian evolution
Sequenced genomes
genomes being sequenced
genomes to be sequenced
Nature Reviews Genetics 3; 33-42 (2002); RAT GENETICS: ATTACHING PHYSIOLOGY AND PHARMACOLOGY TO THE GENOME
Data integration for understanding human pathologies
Other invertebrates as models for biomedical research
Caenorhabditis elegans: December 1998
Drosophila melanogaster: March 2000
Complete Genome Sequencing (timeline 1995-2001)
Bacteria: 1.6 Mb, 1600 genes
Eukaryote: 13 Mb, ~6000 genes
Animal: 100 Mb, ~20,000 genes
Human: 3 Gb, ~30,000 genes?
http://www.ncbi.nlm.nih.gov/
Bioinformatics is Born
Growth in the number of residues in GenBank compared with the demand for people with competence in bioinformatics (as estimated from the number of positions advertised in Nature in March and September of each year)
[Plot axes: residues and positions versus year]
Bioinformatics and genomics
128 Pentium processors in parallel, 250 terabytes of memory
Bioinformatics core, Rosetta Resolver.
Bioinformatics allows making sense of the sequence …
DNA genomic sequence
Gene finding
Regulatory site analysis
Variation: SNPs
Exon/intron identification
The molecular “parts-list”: the transcriptome
Type (~10,000 types/cell)
Splice variant (~90,000)
Quantity (copy number)
Expression profiles
Transcriptome snapshots: expression profiling
Camera (microarray): cDNA bonded on a glass surface
Label RNA from the cell and hybridize to the array
Snapshot (expression profile): scanned, hybridized array
Reference and Treatment: prepare RNA, make fluorescently labeled cDNA, mix, and hybridize
The microarray procedure. The experimental objective in this example is to compare the transcriptional profile of cells in one growth phase (Treatment) to that of mixed-phase cells (Reference) (Figure courtesy of D. Botstein, Stanford University).
Genes (rows) x Experiments (columns)
Cy3     Cy5     Cy5/Cy3   log2(Cy5/Cy3)
200     10000   50.00     5.64
4800    4800    1.00      0.00
9000    300     0.03      -4.91
Extracting data. Slides are scanned at the appropriate excitation/emission spectra and intensities recorded in dye-specific channels. Log2-ratio intensities reveal fold-differences between Reference (green) and Treatment (red). These are color-coded and presented in a GENES * EXPERIMENT matrix (Figure courtesy of D. Botstein, Stanford University).
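The ratio and log2-ratio columns of the example can be recomputed from the two channel intensities. A minimal sketch, assuming raw intensities with no background subtraction or normalization (steps a real analysis would include); the function name is illustrative.

```python
import math


def log2_ratio(cy3, cy5):
    """log2 of the Cy5/Cy3 (Treatment/Reference) intensity ratio."""
    return math.log2(cy5 / cy3)


# (Cy3, Cy5) intensities of the three example spots
spots = [(200, 10000), (4800, 4800), (9000, 300)]

for cy3, cy5 in spots:
    print(f"{cy3:>5} {cy5:>6} {cy5 / cy3:>6.2f} {log2_ratio(cy3, cy5):>6.2f}")
```

The log2 scale is symmetric around zero: a 2-fold increase gives +1 and a 2-fold decrease gives -1, whereas the raw ratios would be 2 and 0.5.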
Figure labels: New scan; ScanAlyze; GenePix; Database; Data selection; Complete Data Table (cdt); hierarchical clustering; K-means; SVD; self-organizing maps; download
Data flow in microarray studies. Following laser scanning, data are entered into a Complete Database Table (cdt). Multiple software packages are available as freeware for post-scan analyses (Figure courtesy of D. Botstein, Stanford University).
The molecular “parts-list”: the proteome
Translation
Degradation
Localization
Modification (binding, cleavage, covalent modification)
0.5-1 x 10^6 variants
Multiple Faces of the Proteome: Expression
Peptide sequence identity
Protein expression (modification)
Protein identity (by peptide sequence)
Relative protein expression (by peak ratio)
Bioinformatics and prediction of protein structure
Analyzing existing structures
Identification and classification of folds
Structure alignment and scoring
Association with sequence and function
… Let There Be Structure
Structure prediction (ab initio, threading, fold recognition, homology modeling)
Domain B-2, aspartate transcarbamoylase
Propeptide of subtilisin
Threading model based on domain B-2 core
Multiple Faces of the Proteome: Protein-protein Interaction
High throughput 2-hybrid analysis of protein-protein interaction in yeast
Extracting what we already know
A protein-protein interaction pathway map automatically constructed from a user query of "effective human cyclin inhibitor".
The more you know, the harder it is to take decisive action ???
The more you know, the greater the need for tools that make it possible to handle complexity
Genomes of parasites
Plasmodium falciparum
• 23 Mb
• 14 chromosomes
• 5,300 genes
• The richest in (A + T) among bacteria
• 90% introns and coding regions.
• Genes involved in antigenic variation are in subtelomeric regions.
• Most of its genes are transporters or genes involved in evading the host immune system