Part I – Sequence analysis (DNA): Bioinformatics Software
Chen Xin
National University of Singapore
Bioinformatics software
Its role in research:
- Hypothesis-driven research cycle in biology (from Kitano H. Systems biology: a brief overview. Science 2002, 295:1662-4)
Bioinformatics software
- Cyclical refinement of predictive computer models used to define further biological experiments, including the optimization step.
- From Brusic et al. 2001, Efficient discovery of immune response targets by cyclical refinement of QSAR models of peptide binding. J. Mol. Graph. Model. 19:405-11, 467
Bioinformatics software
- By combining computational methods with experimental biology, major discoveries can be made faster and more efficiently.
- Today, every large molecular or systems biology project has a bioinformatics component.
- Biological software lets biologists extend their skills to analyze data and plan experiments more efficiently and effectively.
Genetic information
- Genetic information carrier
  - DNA or RNA
- Genetic information carried
  - Sequence
- Hence:
  Life = f(Sequence)
New drug discovery
A drug = Target identification -> Lead discovery -> Lead optimization -> Animal trial -> Clinical trial
Target:
- Key to disease development
- Specific to disease development
- Sequence, sample protein, 3D structure …
DNA sequence analysis
Types of analysis:
- GC content
- Pattern analysis
- Translation (open reading frame detection)
- Gene finding
- Mutation
- Primer design
- Restriction map
- ……
When you have a sequence
- Is it likely to be a gene?
- What is the possible expression level?
- What is the possible protein product?
- Can we get the protein product?
- Can we figure out the key residue in the protein product?
- ……
GC content
- Stability
  - GC: 3 hydrogen bonds
  - AT: 2 hydrogen bonds
- Codon preference
  - GC-rich fragment -> gene
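A minimal sketch of how GC content can be computed, both for a whole sequence and in sliding windows (useful for spotting GC-rich fragments). The function names and the window and step sizes are illustrative assumptions, not part of the tools named in these slides.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int = 10, step: int = 5):
    """Return (start, GC fraction) pairs for sliding windows along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    example = "AATGGCAATCCGCGTAGACTAGGCA"  # 25-base example sequence from this tutorial
    print(f"GC content: {gc_content(example):.2f}")  # 13 of 25 bases are G or C -> 0.52
    for start, frac in gc_windows(example):
        print(start, f"{frac:.2f}")
```

Real sequences may contain ambiguity codes such as N; this sketch simply counts them as non-GC.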
GC Content
- CpG island
  - Resistance to methylation
  - Associated with genes that are frequently switched on
  - Estimate: ½ of mammalian genes have a CpG island
  - Most mammalian housekeeping genes have a CpG island at the 5’ end
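One common way to flag a candidate CpG island is the observed/expected CpG ratio combined with length and GC thresholds. The sketch below assumes the widely quoted Gardiner-Garden and Frommer style criteria (length of at least 200 bp, GC content above 50%, observed/expected above 0.6); these thresholds are not stated in the slides, and the tools listed in this tutorial may use different defaults.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed thresholds to a candidate region."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc > min_gc and cpg_obs_exp(seq) > min_ratio


if __name__ == "__main__":
    region = "CG" * 100  # artificial 200-bp region made only of CpG dinucleotides
    print(cpg_obs_exp(region), looks_like_cpg_island(region))
```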
GC content
- GC content:
  - Emboss -> CompSeq
  - Emboss -> GEECEE
  - Bioedit
- CpG island:
  - http://l25.itba.mi.cnr.it/genebin/wwwcpg.pl (Italy)
  - Emboss -> CpGReport
Pattern analysis
- Patterns in the sequence
- Associated with certain biological functions
  - Transcription factor binding
  - Transcription start
  - Transcription end
  - Splicing
  - ……
Gene finding
- A kind of pattern search
- Gene structure
  - Promoter, exon, intron
  - Promoter: TATA box (TATAAT)
  - Exon: open reading frame (ORF)
  - Intron: only in eukaryotes; has splicing signals
  - Other motifs
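Since gene finding is described here as a kind of pattern search, a minimal sketch of an exact motif scan on both strands may help. It uses the TATAAT motif quoted on the slide; the function names are illustrative, and real promoter predictors use weighted or degenerate patterns rather than exact matches.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq: str, motif: str):
    """Return (strand, start) hits; start is 0-based on the given (forward) strand.

    A '-' hit means the motif occurs on the complementary strand, so its
    reverse complement is what appears in the given sequence.
    A palindromic motif is reported once per strand.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        # The lookahead makes overlapping occurrences visible as well.
        for match in re.finditer(f"(?={re.escape(pattern)})", seq):
            hits.append((strand, match.start()))
    return hits


if __name__ == "__main__":
    print(find_motif("GGTATAATCCATTATAGG", "TATAAT"))  # [('+', 2), ('-', 10)]
```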
Gene
Picture from the LSM2104 Practical, V.B. LIT
Gene finding
- Most programs focus on open reading frames
  - Emboss -> GetORF
  - Emboss -> ShowORF
- Other important elements:
  - Matrix binding site: Emboss -> MarScan
  - Promoter region: PromoterInspector
  - Splicing sites: GeneSplicer
Gene finding
- Prokaryotes
  - No introns
  - Long open reading frames
  - High gene density
  - Easy to detect
- Eukaryotes
  - Have introns
  - Combination of short open reading frames
  - Low gene density
  - Hard to detect
Problem 1:
- Is it a gene?
  - Not sure, but we have some confidence
- What is the expression level if it is a gene?
  - Determined by the promoter and other upstream elements
Translation
- Six reading frames
- Open reading frame (ORF)
  - Start codon
  - Stop codon
  - Certain length
- Tools: ShowORF
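The three ORF criteria (start codon, stop codon, minimum length) can be sketched as a simple scan over all six frames. This is an illustrative sketch, not how ShowORF or GetORF work internally: it assumes ATG as the only start codon, the standard stop codons, and a hypothetical `min_codons` threshold that counts the stop codon. Coordinates for "-" frames refer to the reverse-complemented strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (frame, start, end, orf) tuples for ATG...stop stretches."""
    seq = seq.upper()
    orfs = []
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((f"{sign}{offset + 1}", start, i + 3,
                                     strand[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # 25-base example sequence from this tutorial: one ORF, in frame +2.
    print(find_orfs("AATGGCAATCCGCGTAGACTAGGCA"))
```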
Conceptual translation
5’ AATGGCAATCCGCGTAGACTAGGCA 3’
3’ TTACCGTTAGGCGCATCTGATCCGT 5’

Forward frames (top strand, read 5’ to 3’), split into codons:
+1    AAT GGC AAT CCG CGT AGA CTA GGC A
+2  A ATG GCA ATC CGC GTA GAC TAG GCA
+3 AA TGG CAA TCC GCG TAG ACT AGG CA
Reverse frames (-1, -2, -3): the same three codon groupings on the complementary strand, read 5’ to 3’.
Six reading frames
+1    AAT GGC AAT CCG CGT AGA CTA GGC A     N G N P R R L G
+2  A ATG GCA ATC CGC GTA GAC TAG GCA       M A I R V D * A
+3 AA TGG CAA TCC GCG TAG ACT AGG CA        W Q S A * T R
-1, -2, -3: frames on the complementary strand (translations not shown)
* marks a stop codon
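The conceptual translation on this slide can be reproduced with a short script. This is a sketch using the standard genetic code; the numbering of the negative frames (frame -1 starts at the first base of the reverse complement) is an assumption, since tools differ on that convention.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict:
    """Return the conceptual translation of all six reading frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(seq[k:])
        frames[f"-{k + 1}"] = translate(rc[k:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frames("AATGGCAATCCGCGTAGACTAGGCA").items()):
        print(name, protein)
```

The three forward frames give NGNPRRLG, MAIRVD*A and WQSA*TR, matching the translations shown on the slide.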
Problem 2
- What is the possible product of this gene?
  - It is likely to be ….
  - This conceptual translation is in open reading frame ……
- Can we get the gene product?
  - If the expression level is high: separate it directly
  - If the expression level is low: clone it
Recombinant DNA
Primer design
- Design primers only from accurate sequence data
- Restrict your search to regions that best reflect your goals
- Locate candidate primers
- Verify your choice
Primer design
      (primer 1) CTAGTACGAT
ATGCCGTAGATC……TCCGATCATGCTA
TACGGCATCTAG……AGGCTAGTACGAT
ATGCCGTAG (primer 2)
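The primer pair in this diagram follows a simple rule: one primer copies the start of the top strand, the other is the reverse complement of its end. The sketch below assumes the top strand is written 5’ to 3’ (the slide does not label orientation) and fills the elided middle of the template with a made-up placeholder; the function name and primer lengths are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_pair(template: str, fwd_len: int, rev_len: int):
    """Return (forward, reverse) primers, both written 5' to 3'.

    forward: identical to the first fwd_len bases of the template.
    reverse: reverse complement of the last rev_len bases of the template.
    """
    template = template.upper()
    return template[:fwd_len], reverse_complement(template[-rev_len:])


if __name__ == "__main__":
    # Ends taken from the slide; the middle "GGGTTTAAACCC" is a hypothetical filler.
    template = "ATGCCGTAGATC" + "GGGTTTAAACCC" + "TCCGATCATGCTA"
    forward, reverse = primer_pair(template, fwd_len=9, rev_len=10)
    print(forward)        # ATGCCGTAG  (primer 2 on the slide)
    print(reverse)        # TAGCATGATC (primer 1, written 5' to 3')
    print(reverse[::-1])  # CTAGTACGAT (primer 1 as drawn on the slide)
```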
Primer design
- Avoid mispriming areas
- Primer length: usually 18-30 nucleotides
- Annealing temperature: 55-75 °C
- GC content: usually 35%-65%
- Avoid regions of secondary structure
- 100% complementarity is not necessary
- Avoid self-complementarity
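Several of these rules are easy to check automatically. The sketch below tests length, GC content and perfect self-complementarity, and adds a rough melting-temperature estimate from the Wallace rule, Tm = 2(A+T) + 4(G+C). The Wallace rule is an assumption brought in for illustration (it is a coarse estimate intended for short oligos, and Tm is related to but not the same as the annealing temperature); dedicated tools such as Primer3 use more careful thermodynamic models. Function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the primer."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def check_primer(seq: str, min_len: int = 18, max_len: int = 30,
                 min_gc: float = 0.35, max_gc: float = 0.65) -> dict:
    """Check a candidate primer against the rule-of-thumb ranges above."""
    seq = seq.upper()
    return {
        "length_ok": min_len <= len(seq) <= max_len,
        "gc_ok": min_gc <= gc_fraction(seq) <= max_gc,
        # Only detects a perfect palindrome, the simplest self-complementary case.
        "self_complementary": seq == reverse_complement(seq),
        "tm_estimate": wallace_tm(seq),
    }


if __name__ == "__main__":
    print(check_primer("ATGCCGTAGATCGGCTAAGC"))  # hypothetical 20-mer
```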
Primer Design
Online tools:
- http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-primer.html
- http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi
- http://www.cybergene.se/primer.html
Software tools:
- Omiga
- Vector NTI
Restriction map
- Restriction enzyme
  - Recognizes a pattern
  - Recognition site vs. cutting site
- Select restriction enzymes to get a fragment of the sequence
- Rebuild the sequence to create or invalidate a restriction site
- Tools: Omiga, remap, bioedit
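The difference between recognition site and cutting site can be made concrete with a small digest sketch. The three enzymes and their cut positions (EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT) are added here as well-known examples and are not taken from the slides; only the top-strand cut is modeled, and the test sequence is made up.

```python
# Enzyme name -> (recognition site, offset of the top-strand cut within the site).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def find_sites(seq: str, site: str):
    """Return 0-based start positions of every occurrence of the recognition site."""
    seq, site = seq.upper(), site.upper()
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # step by 1 so overlaps are not missed
    return positions


def digest(seq: str, enzyme_names):
    """Cut a linear sequence with the given enzymes and return the fragments."""
    seq = seq.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        cuts.update(pos + offset for pos in find_sites(seq, site))
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    seq = "AAAGAATTCTTTTGGATCCAAA"  # hypothetical test sequence
    print(find_sites(seq, "GAATTC"))        # [3]
    print(digest(seq, ["EcoRI", "BamHI"]))  # ['AAAG', 'AATTCTTTTG', 'GATCCAAA']
```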
Mutation
- Can be generated by PCR
  - Primers that do not match perfectly
- Frameshift mutation
  - Insertion
  - Deletion
- Substitution
  - Normal
  - Silent
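Whether a substitution is silent depends on the codon it falls in, and whether an insertion or deletion shifts the frame depends on its length. A minimal sketch using the standard genetic code follows; the labels "missense" and "nonsense" are standard terms for amino-acid-changing and stop-creating substitutions and are not used on the slide, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order; '*' is a stop codon.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Classify a codon change as 'silent', 'missense' or 'nonsense'."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    if before == after:
        return "silent"    # same amino acid, protein unchanged
    if after == "*":
        return "nonsense"  # new stop codon truncates the protein
    return "missense"      # different amino acid


def is_frameshift(indel_length: int) -> bool:
    """An insertion or deletion keeps the reading frame only if its length is a multiple of 3."""
    return indel_length % 3 != 0


if __name__ == "__main__":
    print(classify_substitution("GCT", "GCC"))  # silent (Ala -> Ala)
    print(classify_substitution("GAA", "GTA"))  # missense (Glu -> Val)
    print(classify_substitution("TGG", "TGA"))  # nonsense (Trp -> stop)
    print(is_frameshift(1), is_frameshift(3))   # True False
```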
Mutation
- Test the importance
  - Mutate the suspected important site
- Create a pattern
  - Often a silent mutation
- Invalidate a pattern
  - Often a silent mutation
- Keep the reading frame
Problem 3
- Can we get the protein product?
  - Clone it and use bacteria to express it
- Can we figure out the key residue in the protein product?
  - Guess the important residue
  - Mutate the residue to see whether the activity is lost
Summary
- Life is determined by nucleotide sequences
- Sequence analysis reveals patterns that have biological significance
- Sequence analysis helps the design of wet-lab experiments
- The next part will be on protein sequence analysis