bioinformatics and biocomputing - seoul national university · expression patterns during the cell...

Post on 25-Jun-2020


Bioinformatics and Biocomputing

Byoung-Tak Zhang

Center for Bioinformation Technology (CBIT) &

Biointelligence Laboratory

School of Computer Science and Engineering

Seoul National University

[email protected]

http://bi.snu.ac.kr/ or http://cbit.snu.ac.kr/


Outline

! Bioinformation Technology (BIT)

! DNA Chip Data Mining: IT for BT

! DNA Computing: BT for IT

! DNA Computing with DNA Chips

! Outlook


Human Genome Project

Genome Health Implications

• A New Disease Encyclopedia
• New Genetic Fingerprints
• New Diagnostics
• New Treatments

Goals

• Identify the approximately 40,000 genes in human DNA
• Determine the sequences of the 3 billion bases that make up human DNA
• Store this information in databases
• Develop tools for data analysis
• Address the ethical, legal and social issues that arise from genome research

Bioinformation Technology: Bioinformatics vs. Biocomputing

[Fig.] BT and IT: Bioinformatics (IT for BT) and Biocomputing (BT for IT).

Bioinformatics

What is Bioinformatics?

! Bioinformatics vs. Computational Biology

! Bioinformatik (in German): biology-based computer science as well as bioinformatics (in the English sense)

Informatics – computer science

Bio – molecular biology

Bioinformatics – solving problems arising from biology using methodology from computer science.

Molecular Biology: Flow of Information

DNA → RNA → Protein → Function

DNA (Gene) → RNA → Protein

Nucleotide and Protein Sequence

DNA (Nucleotide) Sequence

CG2B_MARGL  Length: 388  April 2, 1997 14:55  Type: P  Check: 9613  ..

...
ARNNLQAGAK KELVKAKRGM TKSKATSSLQ SVMGLNVEPM
EKAKPQSPEP MDMSEINSAL EAFSQNLLEG VEDIDKNDFD
NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR
SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN
KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI
RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM
AKYLMELTLP EYAFVPYDPS EIAAAALCLS SKILEPDMEW
GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK
YSSAKFMNVS TISALTSSTV MDLADQMC

Protein (Amino Acid) Sequence

Some Facts

! 10^14 cells in the human body.

! 3 × 10^9 letters in the DNA code in every cell in your body.

! DNA differs between humans by 0.2% (1 in 500 bases).

! Human DNA is 98% identical to that of chimpanzees.

! 97% of DNA in the human genome has no known function.

Topics in Bioinformatics

! Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling

! Pathway analysis
  - Metabolic pathway
  - Regulatory networks

! Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding

! Expression analysis
  - Gene expression analysis
  - Gene clustering

Extension of Bioinformatics Concept

! Genomics
  - Functional genomics
  - Structural genomics

! Proteomics: large-scale analysis of the proteins of an organism

! Pharmacogenomics: developing new drugs that will target a particular disease

! Microarray: DNA chip, protein chip

Applications of Bioinformatics

! Drug design

! Identification of genetic risk factors

! Gene therapy

! Genetic modification of food crops and animals

! Biological warfare, crime etc.

! Personal Medicine?

! E-Doctor?

Bioinformatics as Information Technology

Background of Bioinformatics

! Biological information infrastructure
  - Biological information management systems
  - Analysis software tools
  - Communication networks for biological research

! Massive biological databases
  - DNA/RNA sequences
  - Protein sequences
  - Genetic map linkage data
  - Biochemical reactions and pathways

! Need to integrate these resources to model biological reality and exploit the biological knowledge that is being gathered.

Areas and Workflow of Bioinformatics

[Fig.] Areas: Structural Genomics, Functional Genomics, Proteomics, Pharmacogenomics; Microarray (Biochip); Infrastructure of Bioinformatics.

DNA Chip Data Mining: IT for BT

cDNA Microarray

[Fig.] cDNA microarray workflow: cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray → hybridize mRNA target to microarray → scanning (excitation by Laser 1 and Laser 2, emission) → overlay images and normalize → analysis.

The Complete Microarray Bioinformatics Solution

[Fig.] Components: Data Management, Databases, Statistical Analysis, Image Processing, Automation, Data Mining, Cluster Analysis.

DNA Chip Applications

! Gene discovery: gene/mutated gene
  - Growth, behavior, homeostasis …

! Disease diagnosis
  - Cancer classification

! Drug discovery: Pharmacogenomics

! Toxicological research: Toxicogenomics

Disease Diagnosis: Cancer Classification with DNA Microarray

- cDNA microarray data of 6567 gene expression levels [Khan ’01].

- Filter genes that are correlated with the cancer classification using PCA and ANN learning.

- Hierarchical clustering of the DNA chip samples based on the 96 filtered genes.

- Disease diagnosis based on the DNA chip.

[Fig.] Flowchart of the experimental procedure.

Disease Diagnosis: Hierarchical Clustering Based on Gene Expression Levels

- Hierarchical clustering of cancer samples by the expression levels of 96 genes.

- The relation between gene expression and cancer category.

- Four cancer diagnostic categories.

[Fig.] The dendrogram of four cancer clusters and gene expression levels (row: genes, column: samples).
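As a rough illustration of how samples can be grouped by expression profile, here is a minimal sketch of agglomerative clustering with average linkage. The distance measure (Euclidean), the linkage rule, the function name `average_linkage` and the toy data are all assumptions made for this example; the slides do not state which variant was used.

```python
from itertools import combinations
from math import dist

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(profiles))]  # every sample starts alone
    merges = []  # merge history, i.e. the dendrogram read from the leaves up

    def cluster_distance(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
        merges.append((a, b))
    return sorted(clusters), merges

# Toy expression profiles (rows: samples, columns: genes); values are made up.
samples = [
    [2.0, 2.1, 0.1],
    [1.9, 2.0, 0.0],
    [0.1, 0.2, 2.2],
    [0.0, 0.1, 2.0],
]
clusters, merges = average_linkage(samples, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

Stopping at a fixed number of clusters is only one way to cut the dendrogram; the merge history in `merges` can be used to cut at any other level.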

AI Methods for DNA Chip Data Analysis

! Classification and prediction
  - ANNs, support vector machines, etc.
  - Disease diagnosis

! Cluster analysis
  - Hierarchical clustering, probabilistic clustering, etc.
  - Functional genomics

! Genetic network analysis
  - Differential models, relevance networks, Bayesian networks, etc.
  - Functional genomics, drug design, etc.

Cluster Analysis

[DNA microarray dataset]

[Gene Cluster 1]

[Gene Cluster 2]

[Gene Cluster 3]

[Gene Cluster 4]


Methods for Cluster Analysis

! Hierarchical clustering [Eisen ’98]

! Self-organizing maps [Tamayo ’99]

! Bayesian clustering [Barash ’01]

! Probabilistic clustering using latent variables [Shin ’00]

! Non-negative matrix factorization [Shin ’00]

! Generative topographic mapping [Shin ’00]

Clustering of Cell Cycle-regulated Genes in S. cerevisiae (the Yeast)

! Identify cell cycle-regulated genes by cluster analysis.
  - 104 genes are already known to be cell cycle-regulated.
  - Known genes are clustered into 6 clusters.

! Cluster the 104 known genes and other genes together.

! The same cluster → similar functional categories.

[Fig.] Expression levels of the 104 known genes over the cell cycle (row: time step, column: gene).

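Probabilistic Clustering Using Latent Variables

g_i: ith gene
z_k: kth cluster
t_j: jth time step
p(g_i|z_k): generating probability of ith gene given kth cluster
v_k = p(t|z_k): prototype of kth cluster

$$p(\mathbf{g}_i \in z_k) = p(z_k \mid \mathbf{g}_i) = \frac{p(\mathbf{g}_i \mid z_k)\,p(z_k)}{p(\mathbf{g}_i)}$$

$$f(t, z, \mathbf{g}) = \sum_i \sum_j \sum_k \log\big(p(z_k)\,p(\mathbf{g}_i \mid z_k)\,p(t_j \mid z_k)\big)$$

(*) objective function (maximized by EM)

$$\mathrm{similarity}(\mathbf{x}_i, \mathbf{v}_k) = \sum_j x_{ij}\,v_{kj}$$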
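The cluster-membership probability and the similarity measure above can be sketched in a few lines. This is only an illustration: the function names, the two-cluster prior and likelihood values, and the toy prototypes are hypothetical, and the EM fitting of the model parameters is not shown.

```python
def posterior(prior, likelihood):
    """p(z_k | g_i) for one gene via Bayes' rule.

    prior[k] = p(z_k), likelihood[k] = p(g_i | z_k).
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # p(g_i) = sum_k p(g_i | z_k) p(z_k)
    return [j / evidence for j in joint]

def similarity(x, v):
    """Inner product sum_j x_ij * v_kj between a profile and a prototype."""
    return sum(xj * vj for xj, vj in zip(x, v))

# Hypothetical numbers for a single gene and two clusters.
prior = [0.5, 0.5]
likelihood = [0.08, 0.02]
post = posterior(prior, likelihood)

# Hypothetical prototypes over three time steps and one expression profile.
prototypes = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
profile = [0.9, 0.1, -0.8]
best = max(range(len(prototypes)),
           key=lambda k: similarity(profile, prototypes[k]))
print(post, best)
```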

Experimental Result: Identify Cell Cycle-Regulated Genes

! Clustering result

[Table] Clustering result with α-factor arrest data. Genes with a high probability of being cell cycle-regulated were found in 4 clusters.

Experimental Result: Prototype Expression Levels of Found Clusters

[Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).

• The genes in the same cluster show similar expression patterns during the cell cycle.
• The genes with similar expression patterns are likely to have correlated functions.

Clustering Using Non-negative Matrix Factorization (NMF)

! NMF (non-negative matrix factorization)

$$G \approx WH, \qquad G_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}, \qquad G_{i\mu},\, W_{ia},\, H_{a\mu} \ge 0$$

G: gene expression data matrix
W: basis matrix (prototypes)
H: encoding matrix (in low dimension)

! NMF as a latent variable model

[Fig.] Latent variables h1, h2, …, hr generate the observed g1, g2, …, gn through W: <g> = Wh.
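One common way to compute such a factorization is the multiplicative update rule of Lee and Seung for the squared error. The sketch below uses that rule in plain Python; whether the slides used this particular algorithm is an assumption, and the helper names, the random initialization and the toy matrix are hypothetical.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(G, r, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix G (n x m) into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(G), len(G[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T G) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, G), matmul(matmul(Wt, W), H)
        H = [[H[a][u] * num[a][u] / (den[a][u] + eps) for u in range(m)]
             for a in range(r)]
        # W <- W * (G H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(G, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
    return W, H

def squared_error(G, W, H):
    R = matmul(W, H)
    return sum((g - x) ** 2 for g_row, r_row in zip(G, R)
               for g, x in zip(g_row, r_row))

# Toy non-negative "expression" matrix with two obvious groups of rows.
G = [[1, 2, 0, 0], [2, 4, 0, 0], [0, 0, 3, 1], [0, 0, 6, 2]]
W, H = nmf(G, r=2)
print(squared_error(G, W, H))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative as long as they start that way, which is why no explicit projection step appears.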

Experimental Result: Five Clusters Found by NMF

! 5 prototype expression levels during the cell cycle.

[Fig.] Prototype expression levels (x-axis: time step in cell cycle, y-axis: expression level).

Clustering Using Generative Topographic Mapping (GTM)

• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.

[Fig.] Mapping y(x;W) from a grid in the latent space (x1, x2) to the data space (t1, t2, t3). Generation maps the latent space to the data space; visualization goes from the data space back to the latent space.

Experimental Result: Clusters Found by GTM

! Three cell cycle-regulated clusters found by GTM

Phase | Clusters | Cluster center | No. of train data / no. in cluster | Correct no. / test data | Overall mean expression levels (Cln/b) of known genes
G1 | c1, c2 | (-0.111 0.333), (-0.111 0.111) | 35 / 18 / 7 | 10 / 16 (62%), 0 / 16 | (.894 .907 -.766 -.479)
G2/M | c1, c2 | (0.111 0.333), (0.111 0.111) | 10 / 5 / 3 | 0 / 5, 3 / 5 (80%) | (-.616 -1.01 1.832 1.596)
M/G1 | c1, c2, c3 | (0.111 0.333), (-0.111 -0.111), (0.323 0.1) | 13 / 7 / 2 / 2 | 1 / 6, 0 / 6, 0 / 6 | (-.171 -.573 .091 .311)
S | | (0.111 -0.333) | 5 / 5 | 5 / 5 (100%) | (1.075 1.482 -.233 -.375)
S/G2 | | | 5 / | 1 / 2 | (.148 .184 -.367 -.044)

Experimental Result: Comparison with Other Methods

! Comparison of prototype expression levels

Phase | Clusters | No. of selected genes | Mean expression levels by GTM | No. of selected genes by Spellman | Mean expression levels by Spellman
G1 | c1, c2 | 122, 74 | (.92 .74 -.62 -.33), (.79 .82 -.48 -.34) | 300 | (.66 .49 -.55 -.33)
G2/M | c1, c2 | 33, 60 | (-.59 -.96 1.34 1.29), (.08 -.30 .51 .57) | 195 | (-.32 -.62 .49 .54)
M/G1 | c1, c2, c3 | 120, 34, 10 | (.82 .65 -.65 -.38), (-.04 -.37 -.01 -.11), (.32 .29 -.3 .05) | 113 | (-.21 -.61 -.04 .07)
S | | 25 | (.84 .81 -.42 -.33) | 71 | (.46 .47 -.43 -.18)
S/G2 | | 92 | (.13 -.06 -.1 .01) | 121 | (.13 .05 -.16 .03)
Total | | 570 | | 800 |

Genetic Network Analysis

- Discover the complex regulatory interactions among genes.

- Disease diagnosis, pharmacogenomics and toxicogenomics

- Boolean networks

- Differential equations

- Relevance networks [Butte ’97]

- Bayesian networks [Friedman ’00] [Hwang ’00]

[Fig.] Basin of attraction of a 12-gene Boolean genetic network model [Somogyi ’96].

Bayesian Networks

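! Represent the joint probability distribution among random variables efficiently using the concept of conditional independence.

[Fig.] Example Bayes net with edges A → C, B → C, B → D and C → E.

$$\begin{aligned}
P(A,B,C,D,E) &= P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D) && \text{(by the chain rule)} \\
&= P(A)\,P(B)\,P(C \mid A,B)\,P(D \mid B)\,P(E \mid C) && \text{(by the example Bayes net)}
\end{aligned}$$

• D is independent of A and C given B.

• A and B are independent a priori; observing C introduces a dependency between A and B.

• E is independent of A and B given C.

An edge denotes a possible causal relationship between nodes.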
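To make the factorization concrete, the sketch below builds the joint distribution of the five-node example from its local conditional tables and checks one of the conditional independences numerically. All probability values and helper names are hypothetical; only the graph structure comes from the slide.

```python
from itertools import product

# Hypothetical tables: each entry is the probability that the variable is True.
P_A = 0.3
P_B = 0.6
P_C = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.5, (False, False): 0.1}  # P(C | A, B)
P_D = {True: 0.8, False: 0.2}                     # P(D | B)
P_E = {True: 0.7, False: 0.1}                     # P(E | C)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C)."""
    return (bern(P_A, a) * bern(P_B, b) * bern(P_C[(a, b)], c)
            * bern(P_D[b], d) * bern(P_E[c], e))

def prob(**fixed):
    """Probability of a partial assignment, by summing the joint."""
    total = 0.0
    for values in product([True, False], repeat=5):
        world = dict(zip("abcde", values))
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(**world)
    return total

# D is independent of A given B: conditioning on A as well changes nothing.
p_d_given_ab = prob(a=True, b=True, d=True) / prob(a=True, b=True)
p_d_given_b = prob(b=True, d=True) / prob(b=True)
print(p_d_given_ab, p_d_given_b)
```

The factored form needs 1 + 1 + 4 + 2 + 2 = 10 numbers for binary variables, against 31 for an unrestricted joint table over five binary variables.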

Bayesian Networks Learning

! Dependence analysis [Margaritis ’00]
  - Mutual information and χ² test

! Score-based search
  • D: data, S: Bayesian network structure

$$p(D, S) = p(S)\,p(D \mid S) = p(S) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

  - NP-hard problem
  - Greedy search
  - Heuristics to find good massive network structures quickly (local to global search algorithm)
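A score of this Bayesian Dirichlet form can be evaluated in log space with the log-gamma function. The sketch below assumes the usual reading of the symbols, which the slide does not spell out: N_ijk is the number of records in which variable i takes its kth state while its parents are in their jth configuration, N_ij sums N_ijk over k, and all α_ijk are set to the same constant. The structure prior p(S) is left out, and the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import product
from math import lgamma, log

def log_bd_score(data, parents, arity, alpha=1.0):
    """log p(D | S) for discrete data.

    data: list of dicts mapping variable name to state (0 .. arity-1)
    parents: dict mapping each variable to a tuple of its parent variables
    arity: dict mapping each variable to its number of states
    """
    score = 0.0
    for var, pa in parents.items():
        counts = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        for config in product(*(range(arity[p]) for p in pa)):
            n_ijk = [counts[(config, k)] for k in range(arity[var])]
            a_ij = alpha * arity[var]  # alpha_ij = sum_k alpha_ijk
            score += lgamma(a_ij) - lgamma(a_ij + sum(n_ijk))
            score += sum(lgamma(alpha + n) - lgamma(alpha) for n in n_ijk)
    return score

# Toy data in which y always copies x.
data = [{"x": 0, "y": 0}] * 4 + [{"x": 1, "y": 1}] * 4
arity = {"x": 2, "y": 2}
with_edge = log_bd_score(data, {"x": (), "y": ("x",)}, arity)
without_edge = log_bd_score(data, {"x": (), "y": ()}, arity)
print(with_edge, without_edge)
```

A greedy search would call such a function repeatedly, keeping an edge change only when the score improves; on the toy data the structure with the edge x → y scores higher, as expected.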

The Small Bayesian Network for Classification of Cancer

[Fig.] Bayesian network over Zyxin, Leukemia class, MB-1, C-myb and LTC4S.

• The Bayesian network was learned by full search using the BD (Bayesian Dirichlet) score with an uninformative prior [Heckerman ’95] from the DNA microarray data for cancer classification (http://waldo.wi.mit.edu/MPR/).

Method | Training error | Test error
Bayes nets | 0/38 | 2/34
Neural trees | 0/38 | 1/34
RBF networks | 0/38 | 1.3/34

[Table] Comparison of the classification performance with other methods [Hwang ’00].

Large-Scale Bayesian Network with 1171 Genes

- Genetic networks for understanding the regulatory interactions among genes and their derivatives

- Pharmacogenomics and toxicogenomics

[Fig.] The Bayesian network structure constructed from DNA microarray data for cancer classification (partial view).

DNA Computing: BT for IT

DNA Computing: BioMolecules as Computer

011001101010001 ATGCTCGAAGCT

Why DNA Computing?

! 6.022 × 10^23 molecules / mole

! Immense, brute-force search of all possibilities
  - Desktop: 10^9 operations / sec
  - Supercomputer: 10^12 operations / sec
  - 1 µmol of DNA: 10^26 reactions

! Favorable energetics: Gibbs free energy, ΔG = −8 kcal mol^-1

! 1 J for 2 × 10^19 operations

! Storage capacity: 1 bit per cubic nanometer
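A quick back-of-the-envelope calculation puts these figures side by side. The sketch below only combines the numbers quoted on the slide; the variable names and the choice of a 365.25-day year are assumptions.

```python
AVOGADRO = 6.022e23                # molecules per mole
molecules_per_micromole = AVOGADRO * 1e-6

desktop_ops_per_sec = 1e9
supercomputer_ops_per_sec = 1e12
dna_reactions = 1e26               # figure quoted for 1 µmol of DNA

seconds_per_year = 365.25 * 24 * 3600
desktop_years = dna_reactions / desktop_ops_per_sec / seconds_per_year
supercomputer_years = dna_reactions / supercomputer_ops_per_sec / seconds_per_year

joules_per_operation = 1 / 2e19    # from "1 J for 2 × 10^19 operations"

print(f"{molecules_per_micromole:.3e} molecules in 1 µmol")
print(f"{desktop_years:.2e} years on a desktop")
print(f"{supercomputer_years:.2e} years on a supercomputer")
print(f"{joules_per_operation:.1e} J per operation")
```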

Flow of DNA Computing

HPP (Hamiltonian Path Problem)

Node 0: ACG   Node 1: CGA   Node 2: GCA   Node 3: TAA   Node 4: ATG   Node 5: TGC   Node 6: CGT

[Fig.] Flow of DNA computing for the HPP on a 7-node graph (nodes 0 to 6): Encoding → Ligation → PCR (Polymerase Chain Reaction) → Gel Electrophoresis → Affinity Column → Decoding. Ligation produces many candidate strands (e.g. ATGTGCTAACGAACG, TAAACGGCAACG, ACGCGAGCATAAATGTGCACGCGT); the strand that survives all filtering steps is the solution.

Solution: ACGCGAGCATAAATGTGCCGT (0 → 1 → 2 → 3 → 4 → 5 → 6)
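The generate-and-filter flow can be imitated in software. The sketch below uses the seven node codes from the slide, but the edge set is hypothetical (the graph itself is not recoverable from the text), and each laboratory step is replaced by a simple string filter, so it illustrates the logic rather than the chemistry.

```python
CODE = {0: "ACG", 1: "CGA", 2: "GCA", 3: "TAA", 4: "ATG", 5: "TGC", 6: "CGT"}
DECODE = {seq: node for node, seq in CODE.items()}

# Hypothetical directed edges, chosen for illustration only.
EDGES = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
         (0, 3), (4, 1), (2, 6), (1, 0)}

def ligate(max_nodes):
    """Encoding + ligation: every walk along the edges becomes one strand."""
    strands = set()
    frontier = [(n,) for n in CODE]
    while frontier:
        walk = frontier.pop()
        strands.add("".join(CODE[n] for n in walk))
        if len(walk) < max_nodes:
            frontier.extend(walk + (b,) for a, b in EDGES if a == walk[-1])
    return strands

def decode(strand):
    """Read a strand back as a list of node numbers, three bases at a time."""
    return [DECODE[strand[i:i + 3]] for i in range(0, len(strand), 3)]

def solve(n_nodes=7, start=0, end=6):
    tube = ligate(n_nodes)
    # PCR: keep strands that begin at the start node and end at the end node.
    tube = {s for s in tube if s.startswith(CODE[start]) and s.endswith(CODE[end])}
    # Gel electrophoresis: keep strands of the right length.
    tube = {s for s in tube if len(s) == 3 * n_nodes}
    # Affinity column: keep strands that contain every node.
    for node in CODE:
        tube = {s for s in tube if node in decode(s)}
    return tube

solutions = solve()
print(solutions)  # {'ACGCGAGCATAAATGTGCCGT'}
print([decode(s) for s in solutions])  # [[0, 1, 2, 3, 4, 5, 6]]
```

With the assumed edges the only strand left in the tube is the one listed as the solution on the slide.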

Biointelligence on a Chip?

[Fig.] Information Technology, Biotechnology, Bioinformation Technology, Biological Computer, Molecular Electronics, Biointelligence Chip.

Computing Models: The limit of conventional computing models

Computing Devices: The limit of silicon semiconductor technology

Intelligent Biomolecular Information Processing

[Fig.] Controller and Reaction Chamber (Calculating), with S, GFP and Cytochrome c.

Evolvable Biomolecular Hardware

! Sequence programmable and evolvable molecular systems have been constructed as cell-free chemical systems using biomolecules such as DNA and proteins.

DNA Computers vs. Conventional Computers

DNA-based computers | Microchip-based computers
slow at individual operations | fast at individual operations
can do billions of operations simultaneously | can do substantially fewer operations simultaneously
can provide huge memory in small space | smaller memory
setting up a problem may involve considerable preparations | setting up only requires keyboard input
DNA is sensitive to chemical deterioration | electronic data are vulnerable but can be backed up easily

Molecular Operators for DNA Computing

• Hybridization: complementary pairing of two single-stranded polynucleotides

• Ligation: attaching sticky ends to a blunt-ended molecule
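Hybridization can be modeled in software as a test for Watson-Crick complementarity. The sketch below treats two strands, both written 5'→3', as hybridizing only when one is the exact reverse complement of the other; real hybridization tolerates mismatches and depends on temperature, so this is a deliberate simplification, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Complement every base and reverse, so the result also reads 5'→3'."""
    return strand.translate(COMPLEMENT)[::-1]

def hybridizes(a, b):
    """True if b is the exact reverse complement of a."""
    return b == reverse_complement(a)

probe = "ATGCTCGAAGCT"
target = reverse_complement(probe)
print(target)                      # AGCTTCGAGCAT
print(hybridizes(probe, target))   # True
```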

Research Groups

! MIT, Caltech, Princeton University, Bell Labs

! EMCC (European Molecular Computing Consortium) is composed of national groups from 11 European countries

! BioMIP Institute (BioMolecular Information Processing) at the German National Research Center for Information Technology (GMD)

! Molecular Computer Project (MCP) in Japan

! Leiden Center for Natural Computation (LCNC)

Applications of Biomolecular Computing

! Massively parallel problem solving
! Combinatorial optimization
! Molecular nano-memory with fast associative search
! AI problem solving
! Medical diagnosis
! Cryptography
! Drug discovery
! Further impact in biology and medicine:
  - Wet biological databases
  - Processing of DNA labeled with digital data
  - Sequence comparison
  - Fingerprinting

NACST (Nucleotide Acid Computing Simulation Toolkit)

GUI

DNA Sequence Generator

Genetic Algorithm

Ligation Unit

PCR Unit

Electrophoresis Unit

Affinity Column Unit

Enzyme Unit

NACST Engine Controller

DNA Sequence Optimizer

[Fig.] Inputs → NACST → Outputs

Combinatorial Problem Solver

TSP (Traveling Salesman Problem)

Representations

City | PiA | PiB
1 | AGCT | TAGG
2 | ATGG | CATG
3 | CGAT | CGAA

Edge W1→3: ATCC GCCT GCTA (flanks labeled P1B, P3A)
Edge W1→2: ATCC ATCA TACC (flanks labeled P1B, P2A)

[Fig.] Weighted 7-node graph (nodes 0 to 6; edge weights 3, 5, 7, 9 and 11) with the tour 0 → 1 → 2 → 3 → 4 → 5 → 6 → 0.

Combinatorial Problem Solver

! Weight representation methods

1. Molecules with high G-C content tend to hybridize easily.

2. Molecules with high G-C content tend to be denatured at higher temperature.

3. Molecules with a larger population in the tube have a higher probability of hybridizing.

[Fig.] Steps: Hybridization/Ligation; PCR/Gel electrophoresis; Affinity chromatography; PCR/Gel electrophoresis; Temperature Gradient Gel Electrophoresis; Graduate PCR
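The first two methods rest on the G-C content of a sequence, which is easy to compute. The sketch below also estimates a melting temperature with the Wallace rule of thumb for short oligomers (2 °C per A/T, 4 °C per G/C); applying that rule here is an assumption made for illustration, as are the function names. The two example sequences are the middle segments of the W1→3 and W1→2 edge strands from the TSP representation.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def wallace_tm(seq):
    """Rough melting temperature in °C for a short oligomer (Wallace rule)."""
    gc = sum(base in "GC" for base in seq)
    at = len(seq) - gc
    return 2 * at + 4 * gc

for weight_seq in ("GCCT", "ATCA"):
    print(weight_seq, gc_content(weight_seq), wallace_tm(weight_seq))
# GCCT 0.75 14
# ATCA 0.25 10
```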

Experimental Results for 4-TSP

Hybridization (37 °C)
Ligation (16 °C, 15 hr)
PCR (36 cycles)
Gel electrophoresis (10% polyacrylamide gel)

[Fig.] Gel lanes: 50 bp marker, oligomer mixture, ligation result, final PCR result (140 bp).

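Molecular Theorem Prover

! Resolution refutation method

! Problem under consideration:
  P ∧ Q → R, S ∧ T → Q, S, T, P; R = true?

! Turn A → B into ¬A ∨ B and add ¬R:
  ¬P ∨ ¬Q ∨ R, ¬S ∨ ¬T ∨ Q, S, T, P, ¬R

[Fig.] Resolution tree:
  ¬P ∨ ¬Q ∨ R with P gives ¬Q ∨ R
  ¬S ∨ ¬T ∨ Q with S gives ¬T ∨ Q
  ¬T ∨ Q with T gives Q
  ¬Q ∨ R with Q gives R
  R with ¬R gives nil, so R is true!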
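The refutation can be reproduced with a small propositional resolution prover. This is a minimal sketch in software, not the molecular implementation: clauses are sets of literals, "~" marks negation, and the function names are hypothetical.

```python
from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two clauses (frozensets of literals)."""
    return [(c1 - {lit}) | (c2 - {negate(lit)})
            for lit in c1 if negate(lit) in c2]

def refutes(clauses):
    """True if the empty clause (nil) can be derived by resolution."""
    known = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(known, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:
                    return True        # derived nil
                new.add(frozenset(resolvent))
        if new <= known:
            return False               # nothing new: no refutation exists
        known |= new

# P ∧ Q → R and S ∧ T → Q in clause form, plus the facts S, T, P.
KB = [frozenset(c) for c in (
    {"~P", "~Q", "R"}, {"~S", "~T", "Q"}, {"S"}, {"T"}, {"P"})]

print(refutes(KB + [frozenset({"~R"})]))  # True, so R follows from the premises
```

Adding the negated goal and deriving nil is what makes this a refutation: the premises together with ¬R are contradictory, hence R must hold.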

Molecular Theorem Prover (Abstract Implementation)

[Fig.] Method 1 and Method 2: arrangements of the clause strands ¬Q ¬P R, ¬S ¬T Q, S, T, P and ¬R.

Molecular Theorem Prover (Experiments for Method 1)

I. Mixture reaction: 100 pmol each → total 20 µl

II. Denaturation (95 °C, 10 min)

III. Annealing: 95 °C 1 min → 15 °C, 1 °C down/min

IV. Polyacrylamide gel electrophoresis (PAGE), 20%

V. Detection of solution: 75 bp ds DNA

[Fig.] Gel image, lanes 1 to 6, with 20 bp DNA marker (Takara); 20 bp and 200 bp bands marked.

Solving Logic Problems by Molecular Computing

! Satisfiability Problem
  - Find Boolean values for variables that make the given formula true

  (x1 ∨ x3 ∨ x4) ∧ (x4) ∧ (x2 ∨ x3)

! 3-SAT Problem
  - Every NP problem can be seen as the search for a solution that simultaneously satisfies a number of logical clauses, each composed of three variables.

  (x1 or x2 or x3) AND (x1 or x2 or x3)
  (x1 or x2 or x3) AND (x4 or x5 or x6)

DNA Computing with DNA Chips


DNA Chips for DNA Computing

I. Make: oligomer synthesis

II. Attach (immobilize): 5’-HS-C6-T15-CCTTvvvvvvvvTTCG-3’

III. Mark: hybridization

IV. Destroy: enzyme reaction (e.g. EcoRI)

V. Unmark

VI. Readout: N cycles, PCR

Variable Sequences and the Encoding Scheme

Three-dimensional Plot and Histogram of the Fluorescence

! S3: w=0, x=0, y=1, z=1

! S7: w=0, x=1, y=1, z=1

! S8: w=1, x=0, y=0, z=0

! S9: w=1, x=0, y=0, z=1

! y=1: (w ∨ x ∨ y)

! z=1: (w ∨ ¬y ∨ z)

! x=0 or y=1: (¬x ∨ y)

! w=0: (¬w ∨ ¬y)

! Four spots with high fluorescence intensity correspond to the four expected solutions.

! DNA sequences are identified in the readout step via addressed array hybridization.
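The four expected solutions can be checked by exhaustive search over the 16 assignments, which is what the chip does in parallel. The sketch below assumes the formula (w ∨ x ∨ y) ∧ (w ∨ ¬y ∨ z) ∧ (¬x ∨ y) ∧ (¬w ∨ ¬y); the negation marks are not legible in the extracted slide text, so they are an assumption chosen so that exactly the four listed solutions satisfy the formula. The data layout and function name are hypothetical.

```python
from itertools import product

# Each literal is (variable, required value); ("x", False) stands for ¬x.
CLAUSES = [
    [("w", True), ("x", True), ("y", True)],
    [("w", True), ("y", False), ("z", True)],
    [("x", False), ("y", True)],
    [("w", False), ("y", False)],
]

def satisfying_assignments(clauses, variables="wxyz"):
    """Try every assignment and keep those that satisfy all clauses."""
    solutions = []
    for values in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == int(want) for v, want in clause)
               for clause in clauses):
            solutions.append("".join(map(str, values)))
    return solutions

solutions = satisfying_assignments(CLAUSES)
print(solutions)                          # ['0011', '0111', '1000', '1001']
print([int(s, 2) for s in solutions])     # [3, 7, 8, 9]
```

Read as binary numbers in the order w, x, y, z, the surviving assignments are 3, 7, 8 and 9, matching the labels S3, S7, S8 and S9.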

Outlook

! IT is gaining importance in the advancement of BT.
  - Bioinformatics
  - DNA Microarray Data Mining

! IT can benefit much from BT.
  - Biocomputing and Biochips
  - DNA Computing (with DNA Chips)

! Bioinformation technology (BIT) is essential as a next-generation information technology.
  - In Silico Biology vs. In Vivo Computing

References

! [Barash ’01] Barash, Y. and Friedman, N., Context-specific Bayesian clustering for gene expression data, Proc. of RECOMB’01, 2001.

! [Butte ’97] Butte, A.J. et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proc. Natl Acad. Sci. USA, 94, 1997.

! [Eisen ’98] Eisen, M.B. et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl Acad. Sci. USA, 95, 1998.

! [Friedman ’00] Friedman, N. et al., Using Bayesian networks to analyze expression data, Proc. of RECOMB’00, 2000.

! [Heckerman ’95] Heckerman, D. et al., Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning, 20(3), 1995.

! [Hwang ’00] Hwang, K.-B. et al., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, CAMDA’00, 2000.

! [Khan ’01] Khan, J. et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, 7(6), 2001.

! [Margaritis ’00] Margaritis, D. and Thrun, S., Bayesian network induction via local neighborhoods, Proc. of NIPS’00, 2000.

! [Shin ’00] Shin, H.-J. et al., Probabilistic models for clustering cell cycle-regulated genes in the yeast, CAMDA’00, 2000.

! [Somogyi ’96] Somogyi, R. and Sniegoski, C.A., Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation, Complexity, 1(6), 1996.

! [Tamayo ’99] Tamayo, P. et al., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl Acad. Sci. USA, 96, 1999.

Web Resources: Bioinformatics

! ANGIS - The Australian National Genomic Information Service: http://morgan.angis.su.oz.au/
! Australian National University (ANU) Bioinformatics: http://life.anu.edu.au/
! BioMolecular Engineering Research Center (BMERC): http://bmerc-www.bu.edu/
! Brutlag bioinformatics group: http://motif.stanford.edu/
! Columbia University Bioinformatics Center (CUBIC): http://cubic.bioc.columbia.edu/
! European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/
! European Molecular Biology Laboratory (EMBL): http://www.embl-heidelberg.de/
! Genetic Information Research Institute: http://www.girinst.org/
! GMD-SCAI: http://www.gmd.de/SCAI/scai_home.html
! Harvard Biological Laboratories: http://golgi.harvard.edu/
! Laurence H. Baker Center for Bioinformatics and Biological Statistics: http://www.bioinformatics.iastate.edu/
! NASA Center for Bioinformatics: http://biocomp.arc.nasa.gov/
! NCSA Computational Biology: http://www.ncsa.uiuc.edu/Apps/CB/
! Stockholm Bioinformatics Center: http://www.sbc.su.se/
! USC Computational Biology: http://www-hto.usc.edu/
! W. M. Keck Center for Computational Biology: http://www-bioc.rice.edu/

Web Resources: Biocomputing

! European Molecular Computing Consortium (EMCC): http://www.csc.liv.ac.uk/~emcc/

! BioMolecular Information Processing (BioMip): http://www.gmd.de/BIOMIP

! Leiden Center for Natural Computation (LCNC): http://www.wi.leidenuniv.nl/~lcnc/

! Biomolecular Computation (BMC): http://bmc.cs.duke.edu/

! DNA Computing and Informatics at Surfaces: http://www.corninfo.chem.wisc.edu/writings/DNAcomputing.html

! SNU Molecular Evolutionary Computing (MEC) Project: http://scai.snu.ac.kr/Research/

Web Resources: Biochips

! DNA Microarray (Genome Chip): http://www.gene-chips.com/

! Large-Scale Gene Expression and Microarray Link and Resources: http://industry.ebi.ac.uk/~alan/MicroArray/

! The Microarray Centre at The Ontario Cancer Institute: http://www.oci.utoronto.ca/services/microarray/

! Lab-on-a-Chip resources: http://www.lab-on-a-chip.com/

! Mailing List: [email protected]

Books: Bioinformatics

! Cynthia Gibas and Per Jambeck, Developing Bioinformatics Computer Skills, O’Reilly, 2001.

! Peter Clote and Rolf Backofen, Computational Molecular Biology: An Introduction, John Wiley & Sons, Inc., 2000.

! Arun Jagota, Data Analysis and Classification for Bioinformatics, 2000.

! Hooman H. Rashidi and Lukas K. Buehler, Bioinformatics Basics: Applications in Biological Science and Medicine, 1999.

! Pierre Baldi and Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.

! Andreas Baxevanis and B. F. Francis Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, John Wiley & Sons, Inc., 1998.

Books: Biocomputing

! Cristian S. Calude and Gheorghe Păun, Computing with Cells and Atoms: An Introduction to Quantum, DNA and Membrane Computing, Taylor & Francis, 2001.

! Păun, G., Ed., Computing With Bio-Molecules: Theory and Experiments, Springer, 1999.

! Gheorghe Păun, Grzegorz Rozenberg and Arto Salomaa, DNA Computing: New Computing Paradigms, Springer, 1998.

! C. S. Calude, J. Casti and M. J. Dinneen, Unconventional Models of Computation, Springer, 1998.

! Tono Gramss, Stefan Bornholdt, Michael Gross, Melanie Mitchell and Thomas Pellizzari, Non-Standard Computation: Molecular Computation, Cellular Automata, Evolutionary Algorithms, Quantum Computers, Wiley-VCH, 1997.

More information at http://cbit.snu.ac.kr/ and http://bi.snu.ac.kr/