Bioinformatics Tigor Nauli (tigor@lipi.go.id / tigor@nauli.net) Research Center for Informatics - LIPI


Upload: audrey-cochran

Post on 27-Mar-2015


Page 1: Bioinformatics Tigor Nauli (tigor@lipi.go.id / tigor@nauli.net) Research Center for Informatics - LIPI

Bioinformatics

Tigor Nauli (tigor@lipi.go.id / tigor@nauli.net)

Research Center for Informatics - LIPI


Topics

• Definition
• Biological database
• Sequence alignment
• Gene prediction
• Phylogenetic analysis
• Protein structure prediction
• Other studies
• Conclusion


Definition

• Bioinformatics is
– the application of computational tools and techniques to the management and analysis of biological data.
– information technology (IT) in molecular biology.
– a subset of the larger field of computational biology.

• The term bioinformatics is used in a number of ways, depending on who is using it.


Definition

• The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition.

• Bioinformatics is in silico research, that is, research performed on a computer.


Biological database

• Database
– an archive of information.
– a logical organization of information.
– tools to gain access to it.

• Biological data
– cover nucleic acid and protein sequences, macromolecular structures, and function.
– are generated by efficient large-scale sequencing machines.
– are submitted by molecular biologists around the world.


Biological database

• Archival databases of biological information
– nucleic acid and protein sequences
– protein expression patterns
– sequence motifs (‘signature patterns’)
– mutations and variants in sequences
– classification or relationships of protein sequence families or protein folding patterns
– bibliographic


Biological database

• Databanks for nucleotide sequences
– GenBank is maintained by the National Center for Biotechnology Information (NCBI)
  • http://www.ncbi.nlm.nih.gov
– EMBL (European Molecular Biology Laboratory)
  • http://www.ebi.ac.uk/embl/


Biological database

• Databank for annotated protein sequences
– SWISS-PROT is maintained by the Swiss Institute of Bioinformatics (SIB) together with the European Bioinformatics Institute (EBI)
  • http://us.expasy.org/sprot/

• Databank for sequence profiles, patterns, and motifs
– PROSITE
  • http://us.expasy.org/prosite/

• Databank for protein structures
– Protein Data Bank
  • http://www.rcsb.org/pdb/


Biological database

• Database queries
– given a sequence, or fragment of a sequence, find sequences in the database that are similar to it
– given a protein structure, or fragment, find protein structures in the database that are similar to it

Such searches are carried out thousands of times a day.


Biological database

• Database queries
– given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
– given a protein structure, find sequences in the databank that correspond to similar structures

These two kinds of query are active fields of research.


Biological database

• GenBank file
– may contain identifying, descriptive, and genetic information in ASCII format
– for example:


LOCUS       AF134350                1734 bp    mRNA    linear   INV 03-JAN-2000
DEFINITION  Drosophila melanogaster transcription factor Toy (toy) mRNA,
            complete cds.
ACCESSION   AF134350
VERSION     AF134350.1  GI:4883931
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     twin of eyeless, a second Pax-6 gene of Drosophila, acts upstream
            of eyeless in the control of eye development
  JOURNAL   Mol. Cell 3 (3), 297-307 (1999)
  MEDLINE   99214845
   PUBMED   10198632
REFERENCE   2  (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-MAR-1999) Research Institute of Molecular Pathology,
            Dr. Bohr-Gasse 7, Vienna A-1030, Austria


FEATURES             Location/Qualifiers
     source          1..1734
                     /organism="Drosophila melanogaster"
                     /mol_type="mRNA"
                     /db_xref="taxon:7227"
                     /chromosome="IV"
                     /map="102E1"
                     /dev_stage="embryo"
     gene            1..1734
                     /gene="toy"
                     /note="twin of eyeless; second Pax-6"
     CDS             10..1641
                     /gene="toy"
                     /codon_start=1
                     /product="transcription factor Toy"
                     /protein_id="AAD31712.1"
                     /db_xref="GI:4883932"
                     /translation="MMLTTEHIMHGHPHSSVGQSTLFGCSTAGHSGINQLGGVYVNGR
                     PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIKPRAIGGSKPR
                     VATTPVVQKIADYKRECPSIFAWEIRDRLLSEQVCNSDNIPSVSSINRVLRNLASQKE
                     QQAQQQNESVYEKLRMFNGQTGGWAWYPSNTTTAHLTLPPAASVVTSPANLSGQADRD
                     DVQKRELQFSVEVSHTNSHDSTSDGNSEHNSSGDEDSQMRLRLKRKLQRNRTSFSNEQ
                     IDSLEKEFERTHYPDVFARERLADKIGLPEARIQVWFSNRRAKWRREEKMRTQRRSAD
                     TVDGSGRTSTANNPSGTTASSSVATSNNSTPGIVNSAINVAERTSSALISNSLPEASN
                     GPTVLGGEANTTHTSSESPPLQPSAPRLPLNSGFNTMYSSIPQPIATMAENYNSSLGS
                     MTPSCLQQRDAYPYMFHDPLSLGSPYVSAHHRNTACNPSAAHQQPPQHGVYTNSSPMP
                     SSNTGVISAGVSVPVQISTQNVSDLTGSNYWPRLQ"
     misc_difference 1605
                     /gene="toy"
                     /note="compared to genomic sequence; aspartic acid to
                     glutamic acid change"
                     /replace="a"


ORIGIN
        1 taattaatta tgatgctaac aactgaacac ataatgcatg ggcatcccca ctcgtcagtc
       61 gggcagagta ctctatttgg gtgctccacg gcgggccata gcggaataaa tcagctgggc
      121 ggcgtatatg ttaatggccg gccactgccc gattcaacgc gtcaaaaaat tgtcgaattg
      181 gctcattccg gcgcacgtcc ttgtgatatt tcaagaatac tacaagtgtc caacggttgc
      241 gtaagcaaaa ttttgggcag atattatgaa actggatcga taaaacctcg agctataggt
      301 ggttcaaagc cacgagtagc tacaaccccg gttgtgcaaa aaattgcaga ttacaaacgg
      361 gaatgtccca gcatatttgc gtgggaaata cgagatcgac tgctatcgga acaagtttgc
      421 aatagtgata acattccaag tgtttcatct attaatcgag tcttacgtaa cctggcctca
      481 caaaaggagc agcaagctca gcaacaaaac gaatccgttt atgaaaagct tcgcatgttt
      541 aatggccaaa cgggcggatg ggcatggtat ccaagcaata caacgacggc acatttgacg
      601 ctaccaccag cagcttccgt tgtgacatct cctgcaaatt tatcaggaca ggccgatcgg
      661 gatgatgttc aaaaaagaga attacaattt tcagtagaag tttcgcatac aaactctcac
      721 gatagtacat cggatggaaa ctctgaacat aattcatccg gggacgaaga ctctcaaatg
      781 cggttgcgcc taaaaaggaa gttacagcgc aatcggacat cattttctaa tgagcaaatt
      841 gacagtcttg aaaaagaatt tgaaagaaca cattatcccg atgtttttgc gcgagaaagg
      901 cttgctgata aaattggttt gccagaggca cgtattcagg tttggttttc aaaccgacga
      961 gctaaatggc gccgagaaga aaaaatgcga actcagagac gatcggccga taccgtggac
     1021 ggcagtggtc gaaccagcac ggcaaataat ccttcaggaa cgactgcatc ttcctccgtc
     1081 gcaacgtcaa acaactcaac tccagggatt gtgaactcag caatcaacgt tgcggaacga
     1141 acatcatctg cattaattag taatagcctt cccgaggctt caaatggacc aactgttttg
     1201 ggtggtgaag ctaatactac acacaccagc tctgaaagcc caccccttca gccatcggca
     1261 ccgcggctac ccttaaattc tggattcaac accatgtact catctattcc acaaccgatt
     1321 gcaacgatgg ctgaaaatta caactcctca ttaggatcaa tgaccccgtc atgcttacaa
     1381 caacgcgatg cctatcctta catgtttcac gatccgttat cactaggatc tccctatgtg
     1441 tcagcccacc atcgaaacac agcttgcaac ccctcagctg cgcaccaaca gccccctcag
     1501 catggcgttt ataccaatag ttctccaatg ccatcatcaa acacaggtgt catttctgcg
     1561 ggcgtttcgg tgcctgtcca gatttcaacg caaaatgtat ctgacctaac gggaagcaat
     1621 tactggccac gtcttcagtg atcgtcaatc tttggctcac cattagatca tttgtcaaag
     1681 gcgactgccg ctgcaatcat tgccgcacaa gcagctgaga aaagccataa acac
//
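As a rough illustration of how the header of such a GenBank flat file could be read by a program, here is a minimal Python sketch. The function name `parse_genbank_header` is hypothetical, and the sketch assumes only that top-level keywords start in the first column and that continuation lines are indented. It is not a full GenBank parser.

```python
def parse_genbank_header(text):
    """Collect top-level GenBank keywords (LOCUS, DEFINITION, ...) into a dict.

    Assumption: top-level keywords start in column 1 and continuation lines
    (including sub-keywords such as ORGANISM) are indented. Parsing stops at
    FEATURES, so the feature table and the sequence are not read.
    """
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("FEATURES"):
            break
        if not line[0].isspace():
            key, _, value = line.partition(" ")
            if key in fields:
                # Keywords such as REFERENCE can repeat; keep only the first.
                key = None
                continue
            fields[key] = value.strip()
        elif key is not None:
            fields[key] += " " + line.strip()
    return fields


# The first lines of the record shown above.
record = """\
LOCUS       AF134350                1734 bp    mRNA    linear   INV 03-JAN-2000
DEFINITION  Drosophila melanogaster transcription factor Toy (toy) mRNA,
            complete cds.
ACCESSION   AF134350
VERSION     AF134350.1  GI:4883931
FEATURES             Location/Qualifiers
"""

header = parse_genbank_header(record)
locus = header["LOCUS"].split()
name, length, mol_type = locus[0], int(locus[1]), locus[3]
print(name, length, mol_type)
print(header["DEFINITION"])
```

For real work an established parser would be a safer choice than hand-written string handling; the sketch only shows that the format is plain, keyword-structured text.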


Sequence alignment

• The basic sequence analysis task is to ask if two sequences are related
– sequence similarity/homology

• When we compare sequences, we assume that they have diverged by a process of mutation.

• The mutational processes are substitutions, which change residues in a sequence, and insertions and deletions, which add or remove residues.

• An alignment can be extended in three ways: match, mismatch, and gap.


Sequence alignment

• We use a dynamic programming matrix to find the optimal alignment.

• To align CATGT with ACGCTG, first we fill the matrix with scores: +2 for a match, -1 for a mismatch, and -1 for a gap

 j \ i    0    1    2    3    4    5
               C    A    T    G    T
 0        0   -1   -2   -3   -4   -5
 1  A    -1   -1    1    0   -1   -2
 2  C    -2    1    0    0   -1   -2
 3  G    -3    0    0   -1    2    1
 4  C    -4   -1   -1   -1    1    1
 5  T    -5   -2   -2    1    0    3
 6  G    -6   -3   -3    0    3    2


Sequence alignment

• The maximum score will be:

 j \ i    0    1    2    3    4    5
               C    A    T    G    T
 0        0   -1    .    .    .    .
 1  A     .    .    1    .    .    .
 2  C     .    .    .    0    .    .
 3  G     .    .    .    .    2    .
 4  C     .    .    .    .    1    .
 5  T     .    .    .    .    .    3
 6  G     .    .    .    .    .    2

• The best alignment is:

C A T G - T -
  |   |   |
- A C G C T G
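The matrix fill and traceback can be sketched in Python as follows. This is a minimal global alignment (Needleman-Wunsch style) under the slide's scores of +2, -1, and -1; the function name and the tie-breaking order in the traceback (diagonal first, then a gap in the second sequence, then a gap in the first) are assumptions of this sketch. Because several alignments share the best score, the alignment it prints can differ from the one shown above while having the same score.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.

    Returns (F, aligned_x, aligned_y). F[j][i] is the best score for aligning
    the first i residues of x with the first j residues of y.
    """
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i] = i * gap
    for j in range(1, m + 1):
        F[j][0] = j * gap
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[j][i] = max(F[j - 1][i - 1] + s,  # match or mismatch
                          F[j][i - 1] + gap,    # x[i-1] against a gap
                          F[j - 1][i] + gap)    # y[j-1] against a gap

    # Trace back from the bottom-right corner to recover one best alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[j][i] == F[j - 1][i - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[j][i] == F[j][i - 1] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("CATGT", "ACGCTG")
for row in F:
    print(" ".join(f"{value:3d}" for value in row))
print("score:", F[-1][-1])
print(aligned_x)
print(aligned_y)
```

The bottom-right cell of the matrix holds the score of the best global alignment, which is 2 for this pair of sequences.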

• Example output from the BLAST program:


PAX-6 = Homo sapiens paired box gene 6 (aniridia, keratitis) (PAX6), mRNA
eyeless = twin of eyeless, a second Pax-6 gene of Drosophila, acts upstream of eyeless in the control of eye development

Identities = 323/467 (69%), Gaps = 11/467 (2%)

PAX-6  : 430 cacagcggagtgaatcagctcggtggtgtctttgtcaacgggcggccactgccggactcc 489
             || |||||| | |||||||| || || || | ||| || || ||||||||||| || ||
eyeless: 97  catagcggaataaatcagctgggcggcgtatatgttaatggccggccactgcccgattca 156

PAX-6  : 490 acccggcagaagattgtagagctagctcacagcggggcccggccgtgcgacatttcccga 549
             || || || || ||||| ||  | |||||   ||| || || || || || |||||  ||
eyeless: 157 acgcgtcaaaaaattgtcgaattggctcattccggcgcacgtccttgtgatatttcaaga 216

PAX-6  : 550 attctgcaggtgtccaacggatgtgtgagtaaaattctgggcaggtattacgagactggc 609
             || || || ||||||||||| || || || |||||| ||||||| ||||| || |||||
eyeless: 217 atactacaagtgtccaacggttgcgtaagcaaaattttgggcagatattatgaaactgga 276

PAX-6  : 610 tccatcagacccagggcaatcggtggtagtaaaccgagagtagcgactccagaagttgta 669
             || || | |||  | || || ||||||   || ||  ||||||| ||  |    |||||
eyeless: 277 tcgataaaacctcgagctataggtggttcaaagccacgagtagctacaaccccggttgtg 336

. . .

PAX-6  : 790 agagttctt-cgcaacctgg-ctagcgaaa--agcaac-agatgggc-gc-agacg---g 839
              ||| |||| || ||||||| ||  | |||  |||| | || |  ||  | | |||
eyeless: 457 cgag-tcttacgtaacctggcctcacaaaaggagcagcaagctcagcaacaaaacgaatc 515

PAX-6  : 840 catgtatgataaactaaggatgttgaacgggcagaccggaagctggg 886
             | | ||||| || ||  | ||||| || || || || ||  | ||||
eyeless: 516 cgtttatgaaaagcttcgcatgtttaatggccaaacgggcggatggg 562


Gene prediction

• Predicting gene locations
– identify all the open reading frames (ORFs) in unannotated DNA
– compare a query sequence against an entire annotated DNA database to find similar sequences
– use Bayesian statistics to find the subsequence most likely to follow a known subsequence, for example:
  P(CCGAT) = P(CC) * P(G|CC) * P(A|CG) * P(T|GA)
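The factorization P(CCGAT) = P(CC) * P(G|CC) * P(A|CG) * P(T|GA) conditions each base on the two bases before it. A minimal Python sketch of that calculation is given below. The function names, the toy training string, and the choice to estimate each probability from simple k-mer counts are assumptions made for illustration; a real gene finder would train on annotated sequence and usually smooth the counts.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sequence_probability(query, training, order=2):
    """Probability of query under a Markov chain of the given order.

    P(first `order` bases) is the relative frequency of that k-mer in the
    training sequence; each later base is conditioned on the `order` bases
    before it, estimated as count(context + base) / count(context + any base).
    """
    context_counts = kmer_counts(training, order)
    extended_counts = kmer_counts(training, order + 1)
    total = sum(context_counts.values())
    if total == 0 or len(query) < order:
        return 0.0
    p = context_counts[query[:order]] / total           # e.g. P(CC)
    for i in range(order, len(query)):
        context = query[i - order:i]
        followers = sum(extended_counts[context + base] for base in "ACGT")
        if followers == 0:
            return 0.0                                   # context never seen
        p *= extended_counts[context + query[i]] / followers  # e.g. P(G|CC)
    return p


training = "CCGATCCGAT"  # toy training data, not from the slides
p = sequence_probability("CCGAT", training)
print(p)
```

With this toy training string every conditional term is 1, so the result is just P(CC), the fraction of dinucleotides in the training string that are CC.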


Gene prediction

– implementing a hidden Markov model (HMM)

[Figure: gene-finding HMM with four labelled parts: intergenic model (A|C|G|T); start codon (ATG, TTG); 61 triplet model (AAA, AAC, AAG, ..., TTT); stop codon (TAG, TGA, TAA)]
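The start and stop codons in the model above are also what a plain open reading frame scan looks for. The following Python sketch is a much simpler, deterministic stand-in for the HMM, not an implementation of it. The function name, the default of ATG as the only start codon, the `min_codons` threshold, and the restriction to the forward strand are all assumptions of the sketch.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, start_codons=("ATG",), min_codons=2):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the 0-based index of the start codon and end is the index just
    past the stop codon, so dna[start:end] is the ORF including its stop.
    min_codons counts the codons before the stop, start codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon in start_codons:
                start = pos                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                     # close it at the stop codon
    return orfs


toy = "CCATGAAATTTTAGGG"  # toy sequence, not from the slides
orfs = find_orfs(toy)
print(orfs)
for start, end, frame in orfs:
    print(toy[start:end])
```

Passing `start_codons=("ATG", "TTG")` would accept both start codons listed in the figure. An HMM improves on this kind of scan by scoring how gene-like the codons between start and stop are, rather than accepting every frame that happens to be open.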


Phylogenetic analysis

• Phylogenetic analysis
– the process of developing hypotheses about the evolutionary relatedness of organisms based on their observable characteristics

• Phylogenetic tree
– built from a multiple sequence alignment

β-hemoglobin gene (223 bp)

Homo sapiens  GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC
Chimpanzee    GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC
Cow           GCTGCACTGT GATAAGCTGC ACGTGGATCC TGAGAACTTC
Goat          GCTGCACTGT GATAAGCTGC ACGTGGATCC TGAGAACTTC
Chicken       ACTGCATTGT GACAAGCTGC ATGTGGACCC CGAGAACTTC
Frog          GAAGCACGCT GAGGAACTCC ACGTGGACCC TGAAAACTTC


Phylogenetic analysis

– implements the parsimony method, UPGMA, cladistics, neighbor joining, the least squares method, maximum likelihood, or clustering to determine the differences in the sequences

– find the relatedness by clustering

Number of different nucleotides

             Frog  Chicken  Goat  Cow  Chimpanzee  H. sapiens
Frog           -     13      10    9       9           9
Chicken       13      -       6    6       5           5
Goat          10      6       -    0       1           1
Cow            9      6       0    -       1           1
Chimpanzee     9      5       1    1       -           0
H. sapiens     9      5       1    1       0           -
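A table of pairwise differences like this one can be computed directly from the aligned sequences. The Python sketch below does so for the 40-base excerpt printed on the earlier slide; the function names are hypothetical and the only input is that excerpt. Note that for Frog it yields 10 differences against each of the four mammals, whereas the table lists 9 for three of them, so the table may have been derived from a slightly different region.

```python
from itertools import combinations

# 40-base excerpt of the alignment as printed on the slide
# (the spaces are layout only and are removed before comparing).
ALIGNMENT = {
    "H. sapiens": "GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC",
    "Chimpanzee": "GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC",
    "Cow":        "GCTGCACTGT GATAAGCTGC ACGTGGATCC TGAGAACTTC",
    "Goat":       "GCTGCACTGT GATAAGCTGC ACGTGGATCC TGAGAACTTC",
    "Chicken":    "ACTGCATTGT GACAAGCTGC ATGTGGACCC CGAGAACTTC",
    "Frog":       "GAAGCACGCT GAGGAACTCC ACGTGGACCC TGAAAACTTC",
}


def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    a, b = a.replace(" ", ""), b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(alignment):
    """Symmetric dict-of-dicts of pairwise difference counts."""
    names = list(alignment)
    dist = {name: {} for name in names}
    for a, b in combinations(names, 2):
        d = count_differences(alignment[a], alignment[b])
        dist[a][b] = dist[b][a] = d
    return dist


dist = distance_matrix(ALIGNMENT)
for name in ALIGNMENT:
    nearest = min(dist[name], key=dist[name].get)
    print(f"{name:11s} nearest: {nearest} ({dist[name][nearest]} differences)")
```

Clustering methods such as UPGMA start from exactly this kind of matrix and repeatedly merge the pair with the smallest distance.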


Phylogenetic analysis

– the percentage of identity:

Identical to H. sapiens

Chimpanzee  223/223 = 100 %
Cow         189/223 = 85 %
Goat        189/223 = 85 %
Chicken     170/223 = 76 %
Frog        137/223 = 61 %

– phylogenetic tree: [figure]


Protein structure prediction

• Two approaches in computational modeling of protein structure
– knowledge-based modeling
  • employs parameters extracted from the database of existing structures to evaluate and optimize structures
– predicting structure from sequence


Protein structure prediction

• Predict from sequence:
– select the protein sequence of the target
– determine the secondary structure by calculating its hydrophobicity values

  1 MRNNLLFSLNAIAGAVAHPSFPIHKRQSDLNAFIEAQTPIAKQGVLNNIGADGKLVEGAA
 61 AGIVVASPSKSNPDYFYTWTRDAGLTMEEVIEQFIGGDATLESTIQNYVDSQANEQAVSN
121 PSGGLSDGSGLAEPKFYVNISQFTDSWGRPQRDGPALRASALIAYGNSLISSDKQSVVKA
181 NIWPIVQNDLSYVGQYWNQTGFDLWEEVQGSSFFTVAVQHKALVEGDAFAKALGEECQAC
241 SVAPQILCHLQDFWNGSAVLSNLPTNGRSGLDTNSLLGSIHTFDPAAACDDTTFQPCSSR
301 ALSNHKLVVDSFRSVYGINNGRGAGKAAAVGPYAEDTYQGGNPWYLTTLVAAELLYDALY
361 QWDKQGQVNVTETSLPFFKDLSSNVTTGSYAKSSSAYESLTSAVKTYADGFISVVQEYTP
421 DGGALAEQYSRDQGTPVSASDLTWSYAAFLSAVGRRNGTVPASWGSSTANAVPSQCSGGT
481 VSGSYTTPTVGSW

  1 -----HHEHHHHHHH-----------------HHHH---H------E-------HHHHHH
 61 HEEEE----------EEE--------HHHHHHHE------EE---EEE------------
121 ---------------EEEE------------------HHHHHHHH----E------EE--
181 ----EE-----EH--EE------HHHHH----EEEEHHHH-HHHHHHHHHHHH-HHH---
241 ----HHHHH-H--------E------------------EE--------------------
301 -----HEEEH----EEE---------HH----------------EEEHHHHHHHHHHHHH
361 HH-----EEE------HH-----------EE-----HHHHHHHHHHE----EEEEE----
421 ----HHHH--------------EHHHHHHHHH---------------------------E
481 EE-----------
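The slides do not say which hydrophobicity scale or window was used. As one possible illustration, the Python sketch below computes a sliding-window hydropathy profile with the Kyte-Doolittle scale, which is an assumption of this sketch, as are the function name and the window length of 9. A hydropathy profile is only one input to secondary structure prediction and does not by itself produce the H/E/- string shown above.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slides do not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy in a sliding window centred on each residue.

    Returns one value per residue that has a full window around it, so the
    result is window - 1 values shorter than the sequence.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a centre residue")
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    half = window // 2
    profile = []
    for i in range(half, len(values) - half):
        segment = values[i - half:i + half + 1]
        profile.append(sum(segment) / window)
    return profile


# First 20 residues of the target sequence shown above.
fragment = "MRNNLLFSLNAIAGAVAHPS"
profile = hydropathy_profile(fragment, window=9)
for offset, value in enumerate(profile):
    centre = offset + 9 // 2
    print(f"{centre + 1:3d} {fragment[centre]} {value:6.2f}")
```

Stretches with a high mean value are hydrophobic and tend to be buried or membrane-spanning; how such a profile is turned into helix and strand calls depends on the prediction method.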


Protein structure prediction

– align the structure with similar sequences in the databank

– find the list of angles

– draw the structure


Other studies

• Protein structure property analysis
• Biochemical simulation
• Whole genome analysis
• Primer design
• DNA microarray analysis
• Proteomics analysis


Conclusion

• Bioinformatics can provide anything from the abstraction of the properties of a biological system into a mathematical or physical model to the implementation of new algorithms for data analysis.