Genomic Analysis
Flowchart
• Get genome sequence – genome assembly
• Find genes
• Translate genes
• All against all, self-comparison
• All against all, interproteome
• Functional classification
• Synteny analysis
• Microarrays
Contigs
• Sequences are obtained by cloning pieces of DNA into plasmids
• One sequencing reaction can resolve a maximum of only about 800 base pairs
• Overlapping fragments allow deduction of complete sequences
Fragment Assembly package in GCG
• This package of programs allows you to input fragment sequences, make the contigs, and then edit the final contigs.
Contigs: the algorithm
• First, find regions of overlap that contain a minimum number of identities (sliding window with an identity matrix)
• Second, save those overlaps whose identities/overlap ratio meets a threshold criterion (80% in GelMerge)
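The two filtering steps above can be sketched roughly as follows. This is a minimal illustration, not GelMerge's implementation: it assumes ungapped overlaps in which a suffix of one fragment lies over a prefix of another, and the function names and the `min_identities` value are hypothetical. Only the 80% ratio comes from the slide.

```python
def overlap_identity(a, b, length):
    """Count identities when the last `length` bases of a lie over the first `length` bases of b."""
    suffix, prefix = a[-length:], b[:length]
    return sum(x == y for x, y in zip(suffix, prefix))


def candidate_overlaps(a, b, min_identities=8, min_ratio=0.80):
    """Return (overlap_length, identities, ratio) for each suffix/prefix overlap passing both filters.

    min_identities is an assumed value; min_ratio mirrors the 80% quoted for GelMerge.
    """
    hits = []
    for length in range(min_identities, min(len(a), len(b)) + 1):
        ident = overlap_identity(a, b, length)
        ratio = ident / length
        # Step 1: enough identities; step 2: identities/overlap ratio meets the threshold.
        if ident >= min_identities and ratio >= min_ratio:
            hits.append((length, ident, ratio))
    return hits
```

For example, `candidate_overlaps("GGGGACGTACGTAA", "ACGTACGTAACCCC")` keeps only the 10-base overlap, since every other offset has far too few identities.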
Identity/overlap ratio
• To decide which overlaps meet the threshold, the fragments must first be aligned
• This is a global alignment that does not penalize overhanging ends
• So F(i,0) = 0 and F(0,j) = 0 (the top row and leftmost column are all 0, so the alignment can start anywhere along the top or left border)
• Start the traceback from the maximum value on the right or bottom border: F(max) is located at (i,m) for some row i, or at (n,j) for some column j
      H  E  A  G  A  W  H
   0  0  0  0  0  0  0  0
P  0
A  0
W  0                    n,m
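A minimal sketch of this ends-free (overlap) alignment, filling F with zeros along the top row and left column and then taking the maximum along the right or bottom border. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; the slide does not specify the scoring scheme.

```python
def overlap_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Fill F for an ends-free alignment of x (rows) against y (columns).

    Returns (best_score, (i, j)), where (i, j) is the border cell
    from which the traceback would start.
    """
    n, m = len(x), len(y)
    # F(i,0) = 0 and F(0,j) = 0: leading overhangs cost nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,    # x[i-1] against a gap
                          F[i][j - 1] + gap)    # y[j-1] against a gap
    # Trailing overhangs are free too: search the right column F(i,m)
    # and the bottom row F(n,j) for the maximum.
    border = [(F[i][m], (i, m)) for i in range(n + 1)]
    border += [(F[n][j], (n, j)) for j in range(m + 1)]
    return max(border)
```

With the sequences from the matrix above, `overlap_alignment_score("PAW", "HEAGAWH")` starts the traceback on the bottom border at cell (3, 6), where AW has been matched against AW.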
• GelMerge then aligns the two pieces (contigs) with the longest overlap and assembles a single piece of DNA from that; this process is repeated until there are no remaining overlaps in the fragment database being used
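The merge-until-no-overlaps loop can be sketched as a greedy procedure. For brevity this sketch assumes exact suffix/prefix overlaps rather than the aligned, threshold-filtered overlaps GelMerge uses; `min_overlap` and both function names are hypothetical.

```python
def longest_overlap(a, b, min_overlap=3):
    """Length of the longest exact suffix of a that equals a prefix of b (0 if none)."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair with the longest overlap until no overlaps remain."""
    contigs = list(fragments)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = longest_overlap(a, b, min_overlap)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

Greedy merging is simple but can be misled by repeats, which is one reason real assemblers also track quality scores and allow manual editing of contigs.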
In-class exercise
• Open the file called fragments; this contains truncated regions of the file named geneseq.
• In the editor, select all the sequences.
• Select Functions --> Fragment Assembly --> GelStart; enter a project name and select Begin a new project; select Run
In-class exercise, cont
• Go back to Fragment Assembly, select GelEnter; the green GelEnter box should say selected sequences from Editor (make sure all sequences in Editor are still selected); select Enter the selected sequences from main window; Run
In-class exercise, cont
• Go back to Fragment Assembly again; select GelMerge; Run.
• Go back to Fragment Assembly again; select GelView; Run
• Now go back and look at options, especially in the GelMerge program; try changing them and seeing what happens.
Genome project programs
• PHRED: analyses raw sequence to produce a 'base call' with an associated 'quality score' for each sequence position
  – Phred scores are reported as q = -10*log10(p), where p is the probability of the base call being wrong
  – The error probability at q = 20 is 10x the error probability at q = 30
• PHRAP: assembles raw sequence into sequence contigs and assigns an associated 'quality score' to each position in the sequence, based on the Phred scores of the raw sequence reads (same scale as Phred).
• GigAssembler: merges the information from individual sequenced clones into a draft genome sequence.
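As a quick numerical check of the Phred scale, here is a small sketch using the standard definition q = -10*log10(p). The function names are illustrative and are not part of the PHRED program.

```python
import math


def phred_quality(p_error):
    """Phred quality score for a base-call error probability: q = -10 * log10(p)."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Inverse mapping: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)
```

So q = 20 corresponds to a 1-in-100 chance of a wrong call and q = 30 to 1-in-1000, which is why a base at q = 20 is ten times more likely to be wrong than one at q = 30.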
Chromosomal Map from Mycobacterium tuberculosis (TIGR)
Gene and regulatory region finding
• Sequencing a million base pairs is relatively easy
• Identifying (eukaryotic) open reading frames in that million base pairs is quite difficult (because of intervening sequences, introns, etc.)
• Identifying regulatory sequences is very difficult: such sequences are short, and can be separated from the ORF by 50,000 base pairs
Gene finding by similarity
• Screen genomic sequence against known cDNA sequences in database; if you find a significant match, that’s probably an ORF! (usual first step with genomic sequence)
• This will miss lots of genes ...
Genomic DNA BLAST results
• Input: genomic DNA fragment from E. coli
• BLASTX against the nr protein database at NCBI
• Output follows
• This is a pretty trivial example, but you can see how this works for actual unknown genome sequences
Major methods of gene finding
• Pattern discrimination
  – Find metrics that correlate with usage in coding regions
  – Generate a way to separate coding/noncoding regions according to that metric
• Others (HMM, neural net, genetic algorithm, …)
ORF patterns
• 7 major metrics:
  – Frame bias: find the frame that matches the codon bias of that organism
  – Fickett algorithm: amalgam of several tests involving the 3-periodicity of the query DNA vs. the known 3-periodicity of known coding DNA, and also overall base composition
  – Fractal dimension: common codons clustered with common codons, or uncommon with uncommon, gives a low fractal dimension, which is typical of exons
  – Coding 6-tuple word preferences: compare occurrence to known coding vs. noncoding regions in database
  – Coding 6-tuple in-frame preferences: compare occurrence to known in-frame vs. out-of-frame preferences
  – Word commonality: exons use rare, introns use common 6-tuples
  – Repetitive 6-tuple preferences
• Each of these metrics by itself is not very good at predicting ORFs; integrating all this information is much more likely to be successful
• Such integration is species specific, and also somewhat regionally specific within species; nonetheless it is very useful
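To make one of these metrics concrete, the following sketch illustrates the frame-bias idea: estimate codon frequencies from known in-frame coding sequence, then pick the reading frame of a query whose codons best match that usage. The function names, the pseudocount, and the averaging step are assumptions for illustration and are not taken from any particular gene finder.

```python
import math
from collections import Counter


def codon_log_frequencies(coding_examples, pseudocount=1.0):
    """Estimate log codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_examples:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    # Pseudocounts over all 64 codons keep unseen codons from scoring -infinity.
    total = sum(counts.values()) + pseudocount * 64
    return lambda codon: math.log((counts[codon] + pseudocount) / total)


def best_frame(seq, log_freq):
    """Return (frame, score): the frame 0..2 whose codons best match the usage table."""
    scores = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        # Average per codon so a frame with one more codon is not penalized.
        scores.append(sum(log_freq(c) for c in codons) / max(len(codons), 1))
    frame = max(range(3), key=lambda f: scores[f])
    return frame, scores[frame]
```

On its own this is a weak predictor, consistent with the point above that the metrics work best when combined.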
Gene prediction in prokaryotes (and yeast)
• Little intergenic DNA, lack of introns, and highly conserved regulatory region patterns make gene prediction easier in prokaryotes
• Markov models (GeneMark) and hidden Markov models (GeneMark.hmm) work because predictable patterns give reasonable estimates of probabilities for transitions between coding and non-coding regions
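The Markov-model idea can be sketched as a log-odds comparison between a model trained on coding sequence and one trained on noncoding sequence. This is a deliberately simplified, homogeneous first-order chain with assumed names and pseudocounts; it is not how GeneMark is implemented, only an illustration of why predictable patterns yield usable probabilities.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=1, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def log_prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount  # 4 possible bases
        return math.log((row.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_odds(seq, coding_lp, noncoding_lp, order=1):
    """Sum of log P_coding - log P_noncoding over seq; a positive value favours coding."""
    return sum(coding_lp(seq[i - order:i], seq[i]) - noncoding_lp(seq[i - order:i], seq[i])
               for i in range(order, len(seq)))
```

Sliding such a score along a genome and thresholding it gives a crude coding/noncoding segmentation; a hidden Markov model additionally models the transitions between the two states.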
In class exercise: GeneMark and GeneMark.hmm
• Go to GeneMark website http://opal.biology.gatech.edu/GeneMark/
• Use text editor to open ecoli_lac_operon.txt file (Troy: local guest directory; Hartford: my directory); this contains genomic sequence from E. coli
• Use GeneMark webserver to get predicted ORFs using both GeneMark and GeneMark.hmm
• Compare outputs; how would you find out if these ORFs correspond to your results from exercise I?
Glimmer
• Higher-order HMMs
  – Instead of looking at just the previous state, use information from the previous n states (e.g., 5th order)
• Interpolated Markov models = IMMs
  – Incorporate the highest-order information possible that preserves statistical discrimination
• Glimmer is TIGR’s main gene finding tool
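A rough sketch of the interpolation idea: predictions from longer contexts are trusted in proportion to how often that context was seen in training, and otherwise fall back toward shorter contexts. The weighting rule `n / (n + trust)`, the class name, and the parameter values are assumptions made for this illustration; Glimmer's actual weighting scheme is more elaborate.

```python
from collections import defaultdict


class InterpolatedMarkovModel:
    def __init__(self, max_order=5, trust=10.0):
        self.max_order = max_order
        self.trust = trust  # assumed constant: larger means more data needed to trust a context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seqs):
        """Count each base after every context of length 0..max_order."""
        for s in seqs:
            for i in range(len(s)):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[s[i - k:i]][s[i]] += 1

    def prob(self, context, base):
        """P(base | context), interpolated from order 0 up to the longest usable context."""
        p = 0.25  # uniform starting point over A, C, G, T
        context = context[-self.max_order:] if self.max_order else ""
        for k in range(len(context) + 1):
            ctx = context[len(context) - k:]
            row = self.counts.get(ctx)
            n = sum(row.values()) if row else 0
            lam = n / (n + self.trust)               # more observations, more weight
            ml = row.get(base, 0) / n if n else 0.0  # maximum-likelihood estimate at order k
            p = lam * ml + (1.0 - lam) * p
        return p
```

Contexts never seen in training get weight zero, so the estimate simply stays at whatever the shorter contexts supported.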
Gene finding in eukaryotes
• Significant intergenic DNA, less conserved patterns for regulatory regions, significant numbers of introns, more complicated chromosome structure
• Gene finding in eukaryotes is significantly more difficult than in prokaryotes
Neural Net
• Attempts to mimic neural patterns of learning
• Set up network of inputs that give outputs only if threshold is reached (like neurons); thresholds can be reached in a variety of different ways
• The network is a set of “hidden layers” that provide the information for the final output
Simple neural net
[Figure: two sensors feed a single node, which produces the node output]
• Node might only give output if both sensors +; or only if both -; or only if one +, one -
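The three behaviours of the node can be sketched with simple threshold units. This assumes a sensor reads 1 for + and 0 for -, and the weights and thresholds are hand-picked for illustration rather than learned. Note that the "one +, one -" case cannot be produced by a single threshold node, so the sketch uses a small hidden layer for it.

```python
def node(inputs, weights, threshold):
    """Fire (return 1) only if the weighted sum of the inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def both_on(a, b):
    """Output only if both sensors are +."""
    return node((a, b), (1, 1), 2)


def both_off(a, b):
    """Output only if both sensors are -."""
    return node((a, b), (-1, -1), 0)


def exactly_one(a, b):
    """Output only if one sensor is + and the other is -, via two hidden nodes."""
    either = node((a, b), (1, 1), 1)   # hidden node: at least one sensor is +
    both = both_on(a, b)               # hidden node: both sensors are +
    return node((either, both), (1, -1), 1)
```

Training a network amounts to finding such weights and thresholds automatically from examples instead of choosing them by hand.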
More complex neural net
[Figure: inputs pass through hidden net layers to the output]
• Construct network of nodes and connections
• Train on sequences with known properties; adjust weights for connections to optimize for desired outcome on training set
• GRAIL works by using 7 algorithms in a neural net trained on a large set of human sequences with known coding and noncoding regions
• GRAIL won’t work for every human sequence; won’t necessarily work for non-human sequences; nonetheless, works quite well
Exercise
• Human genomic DNA
• Use GRAIL EXP to find exons
• Compare to GeneMark.hmm
• Bayesian methods: use comparison of sequences from fairly close species (mouse and human) -- look for regions that align, ignore the rest
• Based on the idea that those regions that are conserved are likely to be coding or regulatory regions; those that are not conserved are likely not to be
Regulatory region finding
• Again use comparison, but this time look in regions outside the open reading frames
• This has been done successfully using Bayesian methods
All-against-all self-comparison of proteome
• Translate all identified ORFs
• BLAST each translated ORF against all other translated ORFs within that proteome
• Identify paralogs = separate genes that arose by duplication
• Identify gene families
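The all-against-all step and the grouping into families can be sketched as below. Running BLAST is outside the scope of a short example, so a crude k-mer overlap score stands in for it; that score, the cutoff, and the function names are all assumptions for illustration. The families are single-linkage clusters of proteins that pass the cutoff.

```python
from itertools import combinations


def kmer_similarity(p, q, k=3):
    """Jaccard similarity of k-mer sets; a crude stand-in for a BLAST score."""
    a = {p[i:i + k] for i in range(len(p) - k + 1)}
    b = {q[i:i + k] for i in range(len(q) - k + 1)}
    return len(a & b) / len(a | b) if a | b else 0.0


def gene_families(proteome, min_similarity=0.5):
    """Group proteins (name -> sequence) into single-linkage families."""
    parent = {name: name for name in proteome}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(proteome, 2):  # all against all
        if kmer_similarity(proteome[x], proteome[y]) >= min_similarity:
            parent[find(x)] = find(y)

    families = {}
    for name in proteome:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(f) for f in families.values())
```

Families with more than one member are candidate sets of paralogs, subject to checking the actual alignments.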
All-against-all interproteome comparison
• Like self-comparison, only between organisms
• Identify orthologs = genes in different species that descend from a common ancestral gene, usually with the same function conserved between species
• Identify gene families
• Identify conserved domains
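One common heuristic for picking putative orthologs out of an interproteome comparison is the reciprocal best hit, which the slide does not name but which fits this step. The sketch below assumes the comparison scores have already been computed and are supplied as dictionaries; the function names and data layout are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): score}, e.g. from an all-against-all comparison.
    """
    best = {}
    for (query, subject), s in scores.items():
        if query not in best or s > best[query][1]:
            best[query] = (subject, s)
    return {q: subj for q, (subj, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in proteome B and a is b's best hit in proteome A."""
    a_best, b_best = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)
```

Genes whose best hits are not reciprocated often belong to families with duplications, where deciding orthology needs more than pairwise scores.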
Functional classification
• Useful as a precursor to data mining for finding genes related by function, etc.
Synteny analysis
• Arrangement of genes (ORFs) on a chromosome is preserved to a greater or lesser extent depending on the relatedness of the organisms
• Computational analysis of synteny is very similar to sequence alignment methods
• Isochores = “long regions of homogeneous base composition”
  – 1M base pairs
  – GC content uniform throughout (the GC content of any sliding window differs from the overall GC content of the isochore by no more than 1%)
  – H = high density: rich in genes
  – L = low density: poor in genes
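The homogeneity criterion for isochores can be sketched with a sliding-window GC calculation. This assumes the 1% is an absolute difference in GC fraction (0.01); the window and step sizes are left to the caller, and the function names are illustrative.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def windowed_gc(seq, window, step):
    """GC fraction of each full window, as (start, gc) pairs."""
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


def is_homogeneous(seq, window, step, tolerance=0.01):
    """True if every window's GC fraction is within `tolerance` of the overall GC fraction."""
    overall = gc_fraction(seq)
    return all(abs(gc - overall) <= tolerance
               for _, gc in windowed_gc(seq, window, step))
```

On real data the window would be large (many kilobases), since short windows fluctuate far more than 1% by chance alone.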
Global gene regulation
• Microarray analysis
• Beyond scope of this class
• See discussion in text
Other molecular biology applications
• PCR primer finding
  – How do you think this algorithm works?
• Restriction enzyme mapping
  – How do you think this algorithm works?
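As one possible answer to the restriction mapping question, the core is plain string searching: find every occurrence of the recognition site, then cut. The sketch below assumes a linear molecule, searches one strand only (sufficient for palindromic sites), and uses hypothetical function names. EcoRI, which recognizes GAATTC and cuts after the first G, serves as the example.

```python
def find_sites(seq, site):
    """0-based start positions of every (possibly overlapping) occurrence of `site`."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, site, cut_offset):
    """Fragment lengths after cutting a linear molecule `cut_offset` bases into each site."""
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]
```

For instance, `digest("AAAGAATTCCCCGAATTCTT", "GAATTC", 1)` finds two EcoRI sites and returns three fragment lengths that sum to the length of the input.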