Language Theory and Bioinformatics

Bailin Hao
T-Life Research Center, Fudan University
Institute of Theoretical Physics, Academia Sinica
http://www.itp.ac.cn/~hao/


Post on 20-Jan-2016


Page 1: Language Theory and Bioinformatics

Language Theory and Bioinformatics

Bailin Hao

T-Life Research Center, Fudan University

Institute of Theoretical Physics, Academia Sinica

http://www.itp.ac.cn/~hao/

Page 2: Language Theory and Bioinformatics

Statistical Analysis of DNA Sequences

A first and necessary step in any analysis:
• Frequency of appearance of strings
• Correlations of letters and strings
• 1D and 2D DNA walks vs. random walk

Summary in two lines according to Luo Liao-fu:

1. DNA sequences are not random.

2. Their characteristics are close to randomness.
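The first step named on this slide, counting how often strings appear, and the 1D DNA walk can be sketched in a few lines of Python. This is only an illustration, not the analysis used in the talk: the function names, the toy sequence, and the walk rule (step +1 for a purine a or g, step -1 for a pyrimidine c or t) are assumptions made for the example.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping string of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def dna_walk(seq):
    """1D DNA walk: +1 for a purine (a, g), -1 for a pyrimidine (c, t)."""
    position, path = 0, []
    for base in seq:
        position += 1 if base in "ag" else -1
        path.append(position)
    return path

if __name__ == "__main__":
    seq = "atgcgatatag"  # hypothetical toy sequence
    print(kmer_counts(seq, 2).most_common(3))
    print(dna_walk(seq))
```

Comparing such counts and walks with those of a shuffled sequence is one simple way to see how close a DNA sequence is to a random one.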

Page 3: Language Theory and Bioinformatics

Hint: Statistical methods alone are not powerful enough to amplify the difference between DNA and random sequences, or the differences among DNA sequences themselves.

Need for new “deterministic” approaches.

Page 4: Language Theory and Bioinformatics

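Beyond Probabilistic and Statistical Methods

Probability and statistics are the basic skills:
• Frequencies and correlations; Markov chains and hidden Markov chains
• Neural network models
• Bayesian statistics, "prior" distributions

Is a random sequence a good frame of reference?
• Sufficiently long symbolic sequences have unavoidable "regularities"
• Are genome sequences long enough?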

Page 5: Language Theory and Bioinformatics

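Random motion with deterministic outcomes
• Causality and teleology
• Stochastic differential equations determined by the final-value distribution
• Beyond Langevin: other formulations of stochastic differential equations
• Molecular motors, motion along the cytoskeleton

Linguistic methods: syntax and semantics

Semantic problems, genetic "dictionaries"
• Gnomics: A DNA Dictionary (1986)
• At present: >5000 transcription factor binding sites; >300 restriction endonuclease recognition sites; various repeat sequences, satellites, microsatellites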

Page 6: Language Theory and Bioinformatics

Language Metaphor in Biology

Transcription

Translation

Editing

Modification

Page 7: Language Theory and Bioinformatics

Words

As landmarks, e.g., recognition sites for:
• Restriction endonucleases (REBASE)
• Methylases (REBASE)
• Transcription factors (TRANSFAC)

As components of "sentences":
• Promoters (EPD), enhancers
• Silencers, insulators, terminators
• Splicing sites

Page 8: Language Theory and Bioinformatics

Sentences

enhancer - silencer - enhancer - …

promoter - (exon - intron)^k - exon - terminator

Essays/Articles: genes, "junk", …

Encyclopedia: complete genome of a species

Reference Library: Kingdom Monera, …, Kingdom Animalia

Page 9: Language Theory and Bioinformatics

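Natural Language and Genetic Language

Similarities:
• Ambiguity (multiple meanings)
• Redundancy
• Error tolerance and error correction
• Long-range correlations
• Both are based on discrete combinatorial systems
• Both have some grammar, which cannot generate everything
• Dialects, individual differences
• Evolution, mutation, extinction
• Historical "junk", archaisms, "fossils"
• Loanwords, lateral exchange

Differences:
• Punctuation and spacing differ
• Interaction between the two languages
• Two- and three-dimensional interactions
• Number and role of repeated sequences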

Page 10: Language Theory and Bioinformatics

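Linguistic Methods (in the sense of language, not philology)

Statistical linguistics
• Frequencies and correlations of "words"
• Zipf's law

Algebraic linguistics: generative grammar and grammatical complexity
• Sequential generation: the Chomsky hierarchy
• Parallel generation: Lindenmayer systems (originating in developmental biology)
• Factorizable languages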

Page 11: Language Theory and Bioinformatics

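Fuzzy linguistics
• A formal generalization is not difficult: Z. G. Yu (2001)

How to use biological knowledge quantitatively
• Consensus sequences and weight matrices

Stochastic grammars
• Hidden Markov chains = stochastic regular grammars
• Higher-order stochastic grammars?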

Page 12: Language Theory and Bioinformatics

Consensus Sequences

• TATAAT (Pribnow or -10 box): T80 A95 T45 A60 A50 T96
• TTGACA (-35 box): T82 T84 G78 A65 C54 A45
• CAAT (CAAT or -75 box): GGYCAATCT
• TATA (TATA or Goldberg-Hogness box): TATAWAW
• CATG (transcription start point)

However, in Aful: ATG 76%, GTG 22%, TTG 2%
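A consensus with percentages, such as T80 A95 T45 …, can be read off a set of aligned sites by counting letters column by column. The sketch below shows one way to do this; the function names are assumptions, and any input sites are expected to be pre-aligned strings of equal length.

```python
from collections import Counter

def position_frequencies(sites):
    """Return one Counter of letters per column of the aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]

def consensus_with_percent(sites):
    """Return [(letter, percent)]: the most frequent letter in each column."""
    result = []
    for counts in position_frequencies(sites):
        letter, n = counts.most_common(1)[0]
        result.append((letter, round(100 * n / len(sites))))
    return result

if __name__ == "__main__":
    # Hypothetical aligned -10 box candidates, for illustration only.
    sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
    print(consensus_with_percent(sites))
```

Keeping the full per-column counts rather than only the top letter gives the raw material for a weight matrix.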

Page 13: Language Theory and Bioinformatics

An Observation

u d c s b t

charge, mass, flavor, charm, …

p n e

charge, mass, spin, magnetic moment, …

H C N O P …

atomic number, ion radius, valence, affinity, …

H2O NO CO2 …

molecular weight, polarity, …

a c g t

A D E F G H … W Y V

BRCA1 PDGF

Page 14: Language Theory and Bioinformatics

A PROGRAMME:

Coarse-Grained Description of Nature

Use of Symbols and Symbolic Strings

Language

Grammar and Complexity (Chomsky, Lindenmayer, etc.)

So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics.

There have been preliminary attempts in analyzing biological sequences.

Page 15: Language Theory and Bioinformatics

It may not be a coincidence that the two systems in the universe that most impress us with their open-ended complex design — life and mind — are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place.

S. Pinker, The Language Instinct (1995)

Page 16: Language Theory and Bioinformatics

Simple Examples

At the level of words:

DOG GOD

At sentence level:

Dog bites Man

Man bites Dog

Page 17: Language Theory and Bioinformatics

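Domain combinations of several serine proteases (each drawn from N terminus to C terminus):
• EGF (epidermal growth factor)
• Chymotrypsin
• Urokinase (UK)
• Factor IX (coagulation factor IX, Christmas antihemophilic factor)
• Plasminogen

Legend: Ca-binding protein domain; domain containing 3 -S-S- bonds

Source: B. Alberts et al., Molecular Biology of the Cell, 3rd ed., 1994, p. 123.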

Page 18: Language Theory and Bioinformatics

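GC: Grammatical Complexity

Alphabet Σ
Example 1. Σ = {a, c, g, t}
Example 2. Σ = {A, C, D, …, W, Y}
Example 3. Σ = {a, …, z, A, …, Z, +, –, …}

Σ*: the set of all strings formed from letters of the alphabet Σ (including the empty string)

Any subset of Σ* is a language over Σ

Grammar = {alphabet, initial letter, production rules}

The language based on this grammar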

Page 19: Language Theory and Bioinformatics

Classification of Formal Languages

Chomsky Hierarchy

Sequential production rules

Lindenmayer Systems

Parallel production rules

Page 20: Language Theory and Bioinformatics

Generative Grammar

Non-Terminal and Terminal Symbols

S: Sentence
NP: Noun Phrase
VP: Verb Phrase
Adj: Adjective
Art: Article

N → boy | girl | scientist | …
V → sees | believes | loves | eats | …
Adj → young | good | beautiful | …
Art → a | one | the

S → NP VP
VP → V NP
NP → (Art) Adj* N

S → if S then S
S → either S or S
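The core rules S → NP VP, VP → V NP and NP → (Art) Adj* N can be turned into a small random sentence generator. This is a sketch under stated assumptions: the lexicon keeps only the words listed on the slide, the article is included with probability 1/2, and Adj* is capped at `max_adjectives` repetitions so that output stays short. The recursive rules "if S then S" and "either S or S" are left out.

```python
import random

LEXICON = {
    "N": ["boy", "girl", "scientist"],
    "V": ["sees", "believes", "loves", "eats"],
    "Adj": ["young", "good", "beautiful"],
    "Art": ["a", "one", "the"],
}

def noun_phrase(rng, max_adjectives=2):
    """NP -> (Art) Adj* N: optional article, some adjectives, then a noun."""
    words = []
    if rng.random() < 0.5:
        words.append(rng.choice(LEXICON["Art"]))
    for _ in range(rng.randint(0, max_adjectives)):
        words.append(rng.choice(LEXICON["Adj"]))
    words.append(rng.choice(LEXICON["N"]))
    return words

def verb_phrase(rng):
    """VP -> V NP"""
    return [rng.choice(LEXICON["V"])] + noun_phrase(rng)

def sentence(rng):
    """S -> NP VP"""
    return noun_phrase(rng) + verb_phrase(rng)

if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(" ".join(sentence(rng)))
```

Every output is grammatical by construction, yet most are semantically odd, which is the distinction between syntax and semantics that the talk returns to later.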

Page 21: Language Theory and Bioinformatics

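The Chomsky Grammar Hierarchy

N: set of non-terminal letters (working symbols)
T: set of terminal letters
S ∈ N: start letter
P = {set of production rules x → y}

x, y are strings of letters; different constraints on x and y lead to different grammars.

Grammar G = (N, T, P, S)

Type 0 grammar:
x ∈ (N ∪ T)* N (N ∪ T)* (x contains at least one non-terminal letter)
y ∈ (N ∪ T)*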

Page 22: Language Theory and Bioinformatics

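Type 1 grammar: context-sensitive grammar
x = t1 a t2, with t1, t2 ∈ T* and a ∈ N

Type 2 grammar: context-free grammar
x = a ∈ N

Type 3 grammar: regular grammar
x = a, y = b or bc, with a, c ∈ N and b empty or b ∈ T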

Page 23: Language Theory and Bioinformatics

A, B, … Non-terminals (NT)

α, β, … Terminals (T)

Regular Grammar: A → αA, A → α

One symbol on the LHS;

At most one NT, at the right end of the RHS.

Page 24: Language Theory and Bioinformatics

Context-Free Grammar: A A B B |

One symbol on the LHS;

NT anywhere on the RHS.

Context-Sensitive Grammar: A AB A

A A

One or more symbols on the LHS, but length ≤ that of the RHS;

One or more NT on the RHS.

Recursively Enumerable Grammar: No restriction on production rules.

Page 25: Language Theory and Bioinformatics

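The Chomsky Hierarchy of Formal Languages

Level: language; automaton; memory requirement

0: recursively enumerable (REL); Turing machine (universal computer); unbounded

1: context-sensitive (CSL); linear bounded automaton; proportional to the input length

2: context-free (CFL); pushdown automaton; pushdown store (stack)

3: regular (RGL); finite automaton; none required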

Page 26: Language Theory and Bioinformatics

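A Finite State Automaton (FSA)

[Figure: two diagrams, (i) and (ii), of an automaton with states a and b reading the input string R L R R R L R R]

Transition table (rows: states; columns: input letters R, L):
a: b, c
b: …, …
c: …, …
d: …, …

A transfer function: δ(a, R) = b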
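A finite state automaton is fully described by a transition table, so running one amounts to a loop over the input with a table lookup at each step. The slide's table is mostly elided, so the table below is hypothetical: only the row for state a (R leads to b, L leads to c) follows the slide, and the remaining rows and the accepting states are assumptions chosen so that the automaton accepts exactly the strings over {R, L} that never contain two consecutive L's.

```python
# Hypothetical transition table: state -> {input letter -> next state}.
# Only row "a" is taken from the slide; rows b, c, d are assumptions.
TRANSITIONS = {
    "a": {"R": "b", "L": "c"},
    "b": {"R": "b", "L": "c"},  # last letter read was R
    "c": {"R": "b", "L": "d"},  # last letter read was L
    "d": {"R": "d", "L": "d"},  # trap state: "LL" has occurred
}
START = "a"
ACCEPTING = {"a", "b", "c"}

def run_fsa(word):
    """Return the state reached after reading word from the start state."""
    state = START
    for letter in word:
        state = TRANSITIONS[state][letter]
    return state

def accepts(word):
    """True if the automaton ends in an accepting state."""
    return run_fsa(word) in ACCEPTING

if __name__ == "__main__":
    print(accepts("RLRRRLRR"), accepts("RLLR"))
```

The automaton keeps no memory beyond its current state, which matches the "none required" entry for regular languages in the hierarchy table.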

Page 27: Language Theory and Bioinformatics

A Pushdown Automaton

Pushdown list

Stack

First In Last Out (FILO)
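What the stack adds over a finite automaton can be seen on the textbook context-free language {a^n b^n | n > 0}, which requires counting. The sketch below is a hand-written recognizer with an explicit stack, not a general pushdown-automaton simulator; the function name and the choice of language are assumptions made for illustration.

```python
def accepts_anbn(word):
    """Recognize {a^n b^n | n > 0} using an explicit pushdown stack."""
    stack = []
    seen_b = False
    for letter in word:
        if letter == "a":
            if seen_b:          # an 'a' after a 'b' breaks the pattern
                return False
            stack.append("a")   # push one marker per 'a'
        elif letter == "b":
            seen_b = True
            if not stack:       # more b's than a's
                return False
            stack.pop()         # pop one marker per 'b'
        else:
            return False        # letter outside the alphabet
    return seen_b and not stack

if __name__ == "__main__":
    for word in ["ab", "aabb", "aab", "abab"]:
        print(word, accepts_anbn(word))
```

The last marker pushed is the first one popped, which is the FILO discipline named on the slide.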

Page 28: Language Theory and Bioinformatics

A Turing Machine

Alan M. Turing (1912-1954)

FSA + R/W tape

Church-Turing Thesis (1936):

Any effective (mechanical) computation can be carried out by a Turing machine.

Page 29: Language Theory and Bioinformatics

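The Chomsky Hierarchy of Formal Languages

Level: language; automaton; memory requirement

0: recursively enumerable (REL); Turing machine (universal computer); unbounded

1: context-sensitive (CSL); linear bounded automaton; proportional to the input length

2: context-free (CFL); pushdown automaton; pushdown store (stack)

3: regular (RGL); finite automaton; none required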

Page 30: Language Theory and Bioinformatics

Example: {a^i b^i c^i | i > 0} ∈ CSL

Terminals = {a, b, c}

Non-terminals = {A, B}

Sequential rules:
B → aBAc | abc
bA → bb
cA → Ac

Derivations:

B → abc

B → aBAc → aabcAc → aabAcc → aabbcc

B → aBAc → aaBAcAc → aaBAAcc → aaabcAAcc → aaabAcAcc → aaabAAccc → aaabbAccc → aaabbbccc
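The derivations on this slide can be checked mechanically by applying the rules B → aBAc | abc, bA → bb and cA → Ac in every possible way. The sketch below does a breadth-first search over sentential forms; the function names are assumptions. Pruning by length is safe here because no rule shortens a string.

```python
from collections import deque

# Sequential rewriting rules from the slide, as (left side, right side).
RULES = [("B", "aBAc"), ("B", "abc"), ("bA", "bb"), ("cA", "Ac")]

def successors(form):
    """Yield every form obtained by applying one rule at one position."""
    for lhs, rhs in RULES:
        start = form.find(lhs)
        while start != -1:
            yield form[:start] + rhs + form[start + len(lhs):]
            start = form.find(lhs, start + 1)

def generated_words(max_length):
    """Terminal words of length <= max_length derivable from B."""
    seen, queue, words = {"B"}, deque(["B"]), set()
    while queue:
        form = queue.popleft()
        if form.islower():      # only terminals a, b, c remain
            words.add(form)
            continue
        for nxt in successors(form):
            if len(nxt) <= max_length and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words

if __name__ == "__main__":
    print(sorted(generated_words(9), key=len))
```

Up to length 9 the search finds only abc, aabbcc and aaabbbccc, in agreement with the language stated on the slide.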

Page 31: Language Theory and Bioinformatics

Rules to Generate Gene-Like Sequences (by David Searls)

gene → upstream transcript downstream

transcript → 5'-untranslated-region start-codon coding-region 3'-untranslated-region

coding-region → codon coding-region | stop-codon | splice coding-region

codon → lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr | trp | phe | leu | ile | ser | arg | gln | val | cys

start-codon → met

stop-codon → taa | tag | tga

Page 32: Language Theory and Bioinformatics

leu → tt purine | ct base (6)
ser → ag pyrimidine | tc base (6)
arg → ag purine | cg base (6)
val → gt base (4)
pro → cc base (4)
ala → gc base (4)
gly → gg base (4)
thr → ac base (4)
ile → at pyrimidine | ata (3)
lys → aa purine (2)
asn → aa pyrimidine (2)
gln → ca purine (2)
his → ca pyrimidine (2)
glu → ga purine (2)
cys → tg pyrimidine (2)
phe → tt pyrimidine (2)
tyr → ta pyrimidine (2)
asp → ga pyrimidine (2)
met → atg
trp → tgg

base → a | c | g | t
purine → a | g
pyrimidine → c | t
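The codon rules on this slide are compact enough to expand by machine, which also confirms the counts in parentheses. The sketch below encodes each rule as a list of alternatives and expands the classes base, purine and pyrimidine; the data layout and function names are assumptions made for the example.

```python
from itertools import product

CLASSES = {"base": "acgt", "purine": "ag", "pyrimidine": "ct"}

# amino acid -> alternatives; each alternative is a sequence of symbols
# that are either literal letters or class names from CLASSES.
RULES = {
    "leu": [["tt", "purine"], ["ct", "base"]],
    "ser": [["ag", "pyrimidine"], ["tc", "base"]],
    "arg": [["ag", "purine"], ["cg", "base"]],
    "val": [["gt", "base"]], "pro": [["cc", "base"]],
    "ala": [["gc", "base"]], "gly": [["gg", "base"]],
    "thr": [["ac", "base"]],
    "ile": [["at", "pyrimidine"], ["ata"]],
    "lys": [["aa", "purine"]], "asn": [["aa", "pyrimidine"]],
    "gln": [["ca", "purine"]], "his": [["ca", "pyrimidine"]],
    "glu": [["ga", "purine"]], "asp": [["ga", "pyrimidine"]],
    "cys": [["tg", "pyrimidine"]], "phe": [["tt", "pyrimidine"]],
    "tyr": [["ta", "pyrimidine"]],
    "met": [["atg"]], "trp": [["tgg"]],
}

def expand(alternative):
    """All terminal strings produced by one alternative of a rule."""
    pools = [CLASSES.get(symbol, [symbol]) for symbol in alternative]
    return {"".join(parts) for parts in product(*pools)}

def codons(amino_acid):
    """Set of codons generated for one amino acid."""
    return set().union(*(expand(alt) for alt in RULES[amino_acid]))

def codon_table():
    """Map each generated codon to its amino acid."""
    return {codon: aa for aa in RULES for codon in codons(aa)}

if __name__ == "__main__":
    print(len(codon_table()), sorted(codons("leu")))
```

The expansion yields 61 distinct sense codons; together with the three stop codons taa, tag and tga this covers all 64 triplets.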

Page 33: Language Theory and Bioinformatics

splice → intron

intron → gt intron-body ag

splice a → a intron
splice c → c intron
splice t → t intron
splice g → g intron

a splice → intron a
c splice → intron c
t splice → intron t
g splice → intron g

upstream → enhancer promoter enhancer

enhancer → …

promoter → …

silencer → …

insulator → …

Page 34: Language Theory and Bioinformatics

These rules can generate an unlimited set of gene-like sequences, most of which are biological nonsense.

They may be used to recognize gene-like segments in long DNA sequences.

Syntax versus semantics: texts vs. grammar.

Physics behind this coarse-grained description: stereochemistry, interactions between proteins and DNA chains, metallic ions, etc.

Page 35: Language Theory and Bioinformatics

Development of Anabaena catenula (a filamentous cyanobacterium of the genus Anabaena)

Alphabet: Σ = {ar, al, br, bl}

Production rules P:
ar → al br
al → bl ar
br → ar
bl → al

Initial symbol (axiom): ω = ar

Grammar: G = (Σ, P, ω)

Language: L(G) ⊂ Σ*
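The Anabaena catenula rules (ar → al br, al → bl ar, br → ar, bl → al, starting from ar) differ from Chomsky-style rewriting in that every cell is rewritten at the same time. A minimal simulation of this parallel rewriting could look like the sketch below; the function names are assumptions.

```python
# Production rules of the D0L system: each cell type maps to its offspring.
PRODUCTIONS = {
    "ar": ["al", "br"],
    "al": ["bl", "ar"],
    "br": ["ar"],
    "bl": ["al"],
}
AXIOM = ["ar"]

def step(cells):
    """Rewrite every cell simultaneously (parallel production)."""
    return [new for cell in cells for new in PRODUCTIONS[cell]]

def develop(generations):
    """Return the filament at generations 0, 1, ..., generations."""
    cells = list(AXIOM)
    history = [cells]
    for _ in range(generations):
        cells = step(cells)
        history.append(cells)
    return history

if __name__ == "__main__":
    for generation, cells in enumerate(develop(4)):
        print(generation, " ".join(cells))
```

The filament lengths grow as 1, 2, 3, 5, 8, because each a-type cell divides into two cells while each b-type cell only matures into an a-type cell.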

Page 36: Language Theory and Bioinformatics

Lindenmayer Systems

Parallel production rules. Finer classification

D0L – Deterministic, no interaction, i.e., context-free

0L – non-deterministic, no interaction

IL – non-deterministic, with Interaction, i.e., context-sensitive

T0L – with Table of production rules

TIL –

E0L – Extended to non-terminal symbols

ET0L –

EIL = REL of Chomsky

Page 37: Language Theory and Bioinformatics

RGL: Regular; CFL: Context-Free; CSL: Context-Sensitive; REL: Recursively Enumerable

[Figure: nested regions RGL, CFL, CSL, REL, with FIN and D0L also marked]

Page 38: Language Theory and Bioinformatics

[Figure: diagram relating the Chomsky classes (0: REL, 1: CSL, 2: CFL, 3: RGL), the Indexed languages (IND), and the Lindenmayer classes (EIL, ET0L, E0L, IL, T0L, 0L, D0L)]

Page 39: Language Theory and Bioinformatics

Example à la Lindenmayer (T0L)

L = {a^i b^i c^i | i > 0} ∈ CSL

G = (Σ, T, ω)

ω = abc

Σ = {a, b, c}

T = {T1, T2}

T1 = {a → aa, b → bb, c → cc}

T2 = {a , b , c }

Page 40: Language Theory and Bioinformatics

Gene-Finding

Gene-structure model

Page 41: Language Theory and Bioinformatics

Genomic DNA (5' … 3'; 5'-UTR, start, stop, 3'-UTR)

↓ transcribe (RNA Pol II + …)

Pre-mRNA

↓ splice (spliceosome: U1, U2, U4, U5, U6 RNP)

mRNA

↓ translate (ribosome; init. + elong. factors; term.)

AA seq (protein primary seq)

↓ fold (chaperonin)

Protein fold

Page 42: Language Theory and Bioinformatics

GT-AG Rule for Introns

5' splicing donor site: exon … A64 G73 G100 T100 A62 A68 G84 T63 …

3' splicing acceptor site: … 12 Py N C65 A100 G100 N … exon

Page 43: Language Theory and Bioinformatics

{()【( . )( . )( . )】()}

• 【( First exon

• )( Internal exon

• )】 Last exon

• {( Non-coding 5’ exon

• )【 Non-coding 5’ exon

• ( . ) Intron

• 】( Non-coding 3’ exon (rare)

• )} Non-coding 3’ exon (rare)

• }{ Intergenic region

Delimiters: { transcription start; 【 translation start; 】 translation end; } transcription end
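In this notation each stretch of sequence is identified by the pair of delimiters that bound it, so a gene structure written as a bracket string can be labeled by looking at consecutive delimiter pairs. The sketch below applies the table from this slide; the function name and the "unknown" fallback for pairs not listed on the slide are assumptions.

```python
# (left delimiter, right delimiter) -> segment type, as listed on the slide.
SEGMENT_TYPES = {
    ("【", "("): "first exon",
    (")", "("): "internal exon",
    (")", "】"): "last exon",
    ("{", "("): "non-coding 5' exon",
    (")", "【"): "non-coding 5' exon",
    ("(", ")"): "intron",
    ("】", "("): "non-coding 3' exon",
    (")", "}"): "non-coding 3' exon",
    ("}", "{"): "intergenic region",
}
DELIMITERS = set("{}()【】")

def label_segments(structure):
    """Label each segment between two consecutive delimiters."""
    marks = [ch for ch in structure if ch in DELIMITERS]  # drop dots, spaces
    return [
        SEGMENT_TYPES.get((left, right), "unknown")
        for left, right in zip(marks, marks[1:])
    ]

if __name__ == "__main__":
    for label in label_segments("{()【( . )( . )( . )】()}"):
        print(label)
```

Because every label depends only on two adjacent symbols, this labeling needs no stack, which fits the conjecture that the gene-structure language may be regular.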

Page 44: Language Theory and Bioinformatics

Dyck language: A language of nested parentheses

• Many types of parentheses

• Finite depth of nesting

• Context-free language

Our case:

• Only 3 types of parentheses

• Shallow nesting

• Conjecture (Xie): may be regular language

Page 45: Language Theory and Bioinformatics

Huimin Xie, Grammatical Complexity and 1D Dynamical Systems, Vol. 6 in Directions in Chaos, WSPC, 1996.

Huimin Xie, Complexity and Dynamical Systems (in Chinese), Shanghai Scientific and Technological Education Publishing House, 1994.

J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.