Language Theory and Bioinformatics
Bailin Hao
T-Life Research Center, Fudan University
Institute of Theoretical Physics, Academia Sinica
http://www.itp.ac.cn/~hao/
Statistical Analysis of DNA Sequences
A necessary first step in any analysis:
• Frequency of appearance of strings
• Correlations of letters and strings
• 1D and 2D DNA walks vs. random walk
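A minimal sketch of the first and third items, counting string frequencies and building a 1D DNA walk. The function names are illustrative, and the step rule (+1 for a purine, -1 for a pyrimidine) is one common convention assumed here, not necessarily the one used in the talk.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping string of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dna_walk(seq):
    """1D DNA walk: step +1 for a purine (a, g), -1 for a pyrimidine (c, t).

    The purine/pyrimidine rule is an assumed convention for illustration.
    """
    position, path = 0, [0]
    for base in seq.lower():
        position += 1 if base in "ag" else -1
        path.append(position)
    return path


if __name__ == "__main__":
    print(kmer_counts("acgtacg", 2).most_common())
    print(dna_walk("aacg"))
```

Comparing the walk of a real sequence with the walk of a shuffled copy is the usual way to see how far it departs from a random walk.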
Summary in two lines according to Luo Liao-fu:
1. DNA sequences are not random.
2. Characteristics close to randomness.
Hint: Statistical methods alone are not powerful enough to amplify the difference between DNA and random sequences, or the differences among DNA sequences themselves.
New “deterministic” approaches are needed.
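Beyond Probabilistic and Statistical Methods
Probability and statistics are the basic skills:
• Frequencies and correlations, Markov chains and hidden Markov chains
• Neural network models
• Bayesian statistics, “prior” distributions
Is a random sequence a good frame of reference?
• Sufficiently long symbolic sequences have unavoidable “regularities”
• Are genome sequences long enough?
Random motion with deterministic outcomes:
• Causality and teleology
• Stochastic differential equations determined by the final-value distribution
• Beyond Langevin: other formulations of stochastic differential equations
• Molecular motors, motion along the cytoskeleton
Linguistic methods: syntax and semantics
Semantics and the genetic “dictionary”: Gnomics: A DNA Dictionary (1986)
At present: >5000 transcription factor binding sites, >300 restriction endonuclease recognition sites, various repeat sequences, satellites, microsatellites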
Language Metaphor in Biology
Transcription
Translation
Editing
Modification
Words
As landmarks, e.g., recognition sites for:
• Restriction endonucleases (REBASE)
• Methylases (REBASE)
• Transcription factors (TRANSFAC)
As components of “sentences”:
• Promoters (EPD), enhancers
• Silencers, insulators, terminators
• Splicing sites
Sentences
enhancer - silencer - enhancer - …
promoter - (exon - intron)^k - exon - terminator
Essays/Articles: genes, “junk”, …
Encyclopedia: complete genome of a species
Reference Library: kingdom Monera, …, kingdom Animalia
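Natural Language and Genetic Language
Similarities:
• Polysemy
• Redundancy
• Fault tolerance and error correction
• Long-range correlations
• Both are based on discrete combinatorial systems
• Some grammar exists, but it cannot generate the whole language
• Dialects and individual variation
• Evolution, mutation, extinction
• Historical “junk”, archaic words, “fossils”
• Loanwords, horizontal exchange
Differences:
• Punctuation and spacing differ
• Interaction between the two languages
• Two- and three-dimensional interactions
• Number and role of repeated sequences
Linguistic Methods (linguistics in the sense of “language”, not philology)
Statistical linguistics:
• Frequencies and correlations of “words”
• Zipf's law
Algebraic linguistics: generative grammar and grammatical complexity
• Sequential generation: Chomsky hierarchy
• Parallel generation: Lindenmayer systems (originating in developmental biology)
• Factorizable languages
Fuzzy linguistics:
• The formal generalization is not difficult: Z. G. Yu (2001)
How to bring in biological knowledge quantitatively:
• Consensus sequences and weight matrices
Stochastic grammars:
• Hidden Markov chain = stochastic regular grammar
• Higher-order stochastic grammars?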
Consensus Sequences
(subscript numbers give the percentage occurrence of each base)
• TATAAT (Pribnow or -10 box): T80 A95 T45 A60 A50 T96
• TTGACA (-35 box): T82 T84 G78 A65 C54 A45
• CAAT (CAAT or -75 box): GGYCAATCT
• TATA (TATA or Goldberg-Hogness box): TATAWAW
• CATG (transcription startpoint)
However, in Aful: ATG 76%, GTG 22%, TTG 2%
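A minimal sketch of how a consensus such as GGYCAATCT or TATAWAW can be used as a landmark: each IUPAC letter is expanded into a character class and the sequence is scanned for every (possibly overlapping) hit. The helper names are illustrative, and only a few IUPAC codes are included (Y = C or T, R = A or G, W = A or T, N = any base).

```python
import re

# Subset of the IUPAC nucleotide codes, enough for the consensus strings above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]", "R": "[AG]", "W": "[AT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Turn a consensus string into a zero-width lookahead pattern."""
    body = "".join(IUPAC[letter] for letter in consensus.upper())
    # The lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=" + body + ")")


def find_sites(consensus, seq):
    """Return the start index of every match of the consensus in seq."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_sites("TATAWAW", "ggtataaatcc"))
    print(find_sites("GGYCAATCT", "aaggccaatctgg"))
```

An exact consensus match ignores the per-position percentages; a weight matrix would score each position instead of accepting or rejecting it.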
An Observation u d c s b t
charge, mass, flavor, charm, …
p n e
charge, mass, spin, magnetic moment, …
H C N O P …
atomic number, ionic radius, valence, affinity, …
H2O NO CO2 …
molecular weight, polarity, …
a c g t
A D E F G H … W Y V
BRCA1 PDGF
A PROGRAMME:
Coarse-Grained Description of Nature
Use of Symbols and Symbolic Strings
Language
Grammar and Complexity (Chomsky, Lindenmayer, etc.)
So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics.
There have been preliminary attempts in analyzing biological sequences.
It may not be a coincidence that the two systems in the universe that most impress us with their open-ended complex design — life and mind — are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place.
S. Pinker, The Language Instinct (1995)
Simple Examples
At the level of words:
DOG GOD
At sentence level:
Dog bites Man
Man bites Dog
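[Figure: domain combinations of several serine proteases, drawn from N-terminus to C-terminus: EGF (epidermal growth factor), chymotrypsin, urokinase (UK), factor IX (coagulation factor IX, Christmas antihemophilic factor), plasminogen. Legend: Ca-binding domain; domain containing 3 -S-S- bonds. After B. Alberts et al., Molecular Biology of the Cell, 3rd ed., 1994, p. 123.]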
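GC: Grammatical Complexity
Alphabet Σ
Example 1. Σ = {a, c, g, t}
Example 2. Σ = {A, C, D, …, W, Y}
Example 3. Σ = {a, …, z, A, …, Z, +, –, …}
Σ*: the set of all strings (including the empty string) formed from the letters of the alphabet Σ
Any subset of Σ* is a language over Σ
Grammar = {alphabet, initial letter, production rules}
The language based on that grammar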
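A small illustration of the set of all strings over an alphabet, using the DNA alphabet {a, c, g, t}. Since that set is infinite, the sketch enumerates it only up to a chosen length; the function name is illustrative.

```python
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len, shortest first."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


if __name__ == "__main__":
    words = list(strings_up_to("acgt", 2))
    print(len(words), words[:6])
```

A language is then any rule that keeps some of these strings and discards the rest, for example `[w for w in words if w.startswith("a")]`.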
Classification of Formal Languages
Chomsky Hierarchy
Sequential production rules
Lindenmayer Systems
Parallel production rules
Generative Grammar
S    Sentence
NP   Noun Phrase
VP   Verb Phrase
Adj  Adjective
Art  Article
S → if S then S
S → either S or S
Non-Terminal and Terminal Symbols
N → boy | girl | scientist | …
V → sees | believes | loves | eats | …
Adj → young | good | beautiful | …
Art → a | one | the
S → NP VP
VP → V NP
NP → (Art) Adj* N
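The rules above can be run as a tiny sentence generator. This is a sketch under stated assumptions: only the words listed on the slide are used, the recursive "if ... then" and "either ... or" rules are left out, and Adj* is capped at two adjectives to keep the output short.

```python
import random

# Terminal words exactly as listed on the slide (the "…" entries are omitted).
LEXICON = {
    "N": ["boy", "girl", "scientist"],
    "V": ["sees", "believes", "loves", "eats"],
    "Adj": ["young", "good", "beautiful"],
    "Art": ["a", "one", "the"],
}


def noun_phrase(rng):
    """NP -> (Art) Adj* N, with at most two adjectives (an arbitrary cap)."""
    words = []
    if rng.random() < 0.5:  # the article is optional
        words.append(rng.choice(LEXICON["Art"]))
    for _ in range(rng.randint(0, 2)):
        words.append(rng.choice(LEXICON["Adj"]))
    words.append(rng.choice(LEXICON["N"]))
    return words


def verb_phrase(rng):
    """VP -> V NP"""
    return [rng.choice(LEXICON["V"])] + noun_phrase(rng)


def sentence(rng):
    """S -> NP VP"""
    return noun_phrase(rng) + verb_phrase(rng)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(sentence(rng)))
```

Every output is grammatical by construction, yet many are odd in meaning, which is the syntax versus semantics distinction in miniature.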
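Chomsky Grammar Hierarchy
N: set of non-terminal letters (working symbols)
T: set of terminal letters
S ∈ N: start letter
P = {set of production rules (x → y)}
x, y are strings of letters; different constraints on x and y lead to different grammars
Grammar G = (N, T, P, S)
Type 0 grammar: x ∈ (N∪T)* N (N∪T)*, y ∈ (N∪T)*; x contains at least one non-terminal letter
Type 1 grammar (context-sensitive): x = t1 a t2, with t1, t2 ∈ T* and a ∈ N
Type 2 grammar (context-free): x = a ∈ N
Type 3 grammar (regular): x = a, y = b or bc, with a, c ∈ N and b empty or b ∈ T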
A, B, … Non-terminals (NT)
α, β, … Terminals (T)
Regular Grammar: A → αA, A → α
One symbol on the LHS;
at most one NT, at the right end of the RHS.
Context-Free Grammar: A A B B |
One symbol on the LHS;
NT anywhere on the RHS.
Context-Sensitive Grammar: A AB A
A A
One or more symbols on the LHS, but length ≤ that of the RHS;
one or more NT on the RHS.
Recursively Enumerable Grammar: no restriction on production rules.
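A rough sketch of the four-way classification, applied rule by rule. It assumes a simplified encoding that is not in the slides: upper-case letters are non-terminals, lower-case letters are terminals, and a grammar is a list of (LHS, RHS) string pairs. It checks only the shape conditions listed above and does not verify that the LHS contains a non-terminal.

```python
def rule_type(lhs, rhs):
    """Most restrictive Chomsky type (3, 2, 1 or 0) that one rule satisfies."""
    if len(lhs) == 1 and lhs.isupper():
        # Strip one optional non-terminal at the right end of the RHS.
        body = rhs[:-1] if rhs and rhs[-1].isupper() else rhs
        if body == "" or body.islower():
            return 3  # regular: terminals, then at most one trailing NT
        return 2      # context-free: single NT on the left, anything on the right
    if len(lhs) <= len(rhs):
        return 1      # context-sensitive: the rule does not shorten the string
    return 0          # unrestricted


def grammar_type(rules):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(lhs, rhs) for lhs, rhs in rules)


if __name__ == "__main__":
    rules = [("B", "aBAc"), ("B", "abc"), ("bA", "bb"), ("cA", "Ac")]
    print(grammar_type(rules))
```

Note that this classifies a grammar, not a language: a language may have a simpler grammar than the one at hand.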
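Chomsky Hierarchy of Formal Languages
Level  Language                      Automaton                            Memory requirement
0      Recursively enumerable (REL)  Turing machine (universal computer)  Unbounded
1      Context-sensitive (CSL)       Linear bounded automaton             Proportional to input length
2      Context-free (CFL)            Pushdown automaton                   Pushdown store (stack)
3      Regular (RGL)                 Finite automaton                     None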
A Finite State Automaton (FSA)
[Figure: panels (i) and (ii), with the symbol string R L R R R L R R and states a, b]
A transfer function: (a, R) ↦ b
      R   L
a     b   c
b     …   …
c     …   …
d     …   …
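A minimal sketch of a finite state automaton driven by a transfer-function table over the input symbols R and L. Only the row for state a (R gives b, L gives c) comes from the slide; every other entry, the start state, and the accepting set are hypothetical values chosen so the example runs.

```python
# Transfer function: (current state, input symbol) -> next state.
DELTA = {
    ("a", "R"): "b", ("a", "L"): "c",  # row given on the slide
    ("b", "R"): "d", ("b", "L"): "a",  # hypothetical
    ("c", "R"): "a", ("c", "L"): "d",  # hypothetical
    ("d", "R"): "d", ("d", "L"): "d",  # hypothetical trap state
}


def run(word, start="a"):
    """Feed the word to the automaton and return the final state."""
    state = start
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state


def accepts(word, accepting=frozenset({"a"})):
    """Accept if the run ends in an accepting state (assumed to be {a})."""
    return run(word) in accepting


if __name__ == "__main__":
    for word in ["", "RL", "RLRRRLRR"]:
        print(repr(word), run(word), accepts(word))
```

The automaton keeps no memory beyond its current state, which is why the table lists its storage requirement as none.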
A Pushdown Automaton
Pushdown list
Stack
First In Last Out (FILO)
A Turing Machine
Alan M. Turing (1912-1954)
FSA + R/W tape
Church-Turing Thesis (1936):
Any effective (mechanical) computation can
be carried out by a Turing machine
Example: {a^i b^i c^i | i > 0} ∈ CSL
Terminals = {a, b, c}
Non-terminals = {A, B}
Sequential rules:
B → aBAc | abc
bA → bb
cA → Ac
Derivations:
B → abc
B → aBAc → aabcAc → aabAcc → aabbcc
B → aBAc → aaBAcAc → aaBAAcc → aaabcAAcc → aaabAcAcc → aaabAAccc → aaabbAccc → aaabbbccc
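A sketch that applies the sequential rules of this example exhaustively and collects every all-terminal string up to a length bound. The breadth-first search and the names are illustrative choices; the bound is safe because no rule shortens a string.

```python
from collections import deque

# Rules of the example: upper case = non-terminal, lower case = terminal.
RULES = [("B", "aBAc"), ("B", "abc"), ("bA", "bb"), ("cA", "Ac")]


def derive(start="B", max_len=9):
    """Return all terminal strings of length <= max_len derivable from start."""
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        current = queue.popleft()
        if current.islower():  # only terminals left: a word of the language
            words.add(current)
            continue
        for lhs, rhs in RULES:
            index = current.find(lhs)
            while index != -1:  # rewrite one occurrence at a time
                rewritten = current[:index] + rhs + current[index + len(lhs):]
                if len(rewritten) <= max_len and rewritten not in seen:
                    seen.add(rewritten)
                    queue.append(rewritten)
                index = current.find(lhs, index + 1)
    return sorted(words, key=len)


if __name__ == "__main__":
    print(derive())
```

With the default bound of nine letters the search finds the strings for i = 1, 2, 3 and nothing else.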
Rules to Generate Gene-Like Sequences (by David Searls)
gene → upstream transcript downstream
transcript → 5’-untranslated-region start-codon coding-region 3’-untranslated-region
coding-region → codon coding-region | stop-codon | splice | coding-region
codon → lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr | trp | phe | leu | ile | ser | arg | gln | val | cys
start-codon → met
stop-codon → taa | tag | tga
leu → tt purine | ct base (6)
ser → ag pyrimidine | tc base (6)
arg → ag purine | cg base (6)
val → gt base (4)
pro → cc base (4)
ala → gc base (4)
gly → gg base (4)
thr → ac base (4)
ile → at pyrimidine | ata (3)
lys → aa purine (2)
asn → aa pyrimidine (2)
gln → ca purine (2)
his → ca pyrimidine (2)
glu → ga purine (2)
cys → tg pyrimidine (2)
phe → tt pyrimidine (2)
tyr → ta pyrimidine (2)
asp → ga pyrimidine (2)
met → atg
trp → tgg
base → a | c | g | t
purine → a | g
pyrimidine → c | t
splice → intron
intron → gt intron-body ag
splice a → a intron    splice c → c intron
splice t → t intron    splice g → g intron
a splice → intron a    c splice → intron c
t splice → intron t    g splice → intron g
upstream → enhancer promoter enhancer
enhancer → …
promoter → …
silencer → …
insulator → …
These rules can generate an unlimited set of gene-like sequences, most of them biological nonsense.
They may be used to recognize gene-like segments in long DNA sequences.
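A sketch that expands the amino-acid rules into explicit codons and uses them as a lookup table, which is the recognition direction mentioned above. The rule strings are copied from the slide; the dictionary layout and function names are illustrative assumptions.

```python
# Letter classes used on the right-hand sides of the codon rules.
CLASSES = {"base": "acgt", "purine": "ag", "pyrimidine": "ct"}

# Each alternative is either a full codon or a two-letter prefix plus a class.
RULES = {
    "leu": ["tt purine", "ct base"],
    "ser": ["ag pyrimidine", "tc base"],
    "arg": ["ag purine", "cg base"],
    "val": ["gt base"], "pro": ["cc base"], "ala": ["gc base"],
    "gly": ["gg base"], "thr": ["ac base"],
    "ile": ["at pyrimidine", "ata"],
    "lys": ["aa purine"], "asn": ["aa pyrimidine"],
    "gln": ["ca purine"], "his": ["ca pyrimidine"],
    "glu": ["ga purine"], "cys": ["tg pyrimidine"],
    "phe": ["tt pyrimidine"], "tyr": ["ta pyrimidine"],
    "asp": ["ga pyrimidine"], "met": ["atg"], "trp": ["tgg"],
    "stop-codon": ["taa", "tag", "tga"],
}


def expand(alternative):
    """Expand one alternative, e.g. 'tt purine' -> ['tta', 'ttg']."""
    parts = alternative.split()
    if len(parts) == 1:
        return [parts[0]]
    return [parts[0] + letter for letter in CLASSES[parts[1]]]


CODONS = {name: [c for alt in alts for c in expand(alt)]
          for name, alts in RULES.items()}
CODON_TABLE = {codon: name for name, codons in CODONS.items() for codon in codons}


def translate(dna):
    """Read dna in frame, three letters at a time, and name each codon."""
    dna = dna.lower()
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]


if __name__ == "__main__":
    print(len(CODON_TABLE), CODONS["leu"], translate("atggcctaa"))
```

The expansion reproduces the codon counts given in parentheses on the slide and covers all 64 triplets exactly once.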
Syntax versus Semantics: texts vs. grammar.
Physics behind this coarse-grained description: stereochemistry, interactions between proteins and DNA chains, metal ions, etc.
Development of Anabaena catenula
[Figure: cell types a_r, a_l, b_r, b_l and the divisions a_r → a_l b_r, a_l → b_l a_r]
Alphabet: Σ = {a_r, a_l, b_r, b_l}
Production rules: P = { a_r → a_l b_r, a_l → b_l a_r, b_r → a_r, b_l → a_l }
Initial symbol (axiom): ω = a_r
Grammar: G = (Σ, P, ω)
Language: L(G) ⊂ Σ*
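A minimal sketch of the parallel rewriting in this example: at every step each cell of the filament is replaced at once according to its rule. The cell names are written as plain strings (ar, al, br, bl), and the function names are illustrative.

```python
# Production rules of the Anabaena example, one rule per cell type.
RULES = {
    "ar": ["al", "br"],
    "al": ["bl", "ar"],
    "br": ["ar"],
    "bl": ["al"],
}


def step(cells):
    """Rewrite every cell simultaneously (parallel production)."""
    return [new for cell in cells for new in RULES[cell]]


def develop(axiom=("ar",), generations=5):
    """Return the filament at generation 0, 1, ..., generations."""
    history = [list(axiom)]
    for _ in range(generations):
        history.append(step(history[-1]))
    return history


if __name__ == "__main__":
    for generation in develop():
        print(len(generation), " ".join(generation))
```

The filament lengths 1, 2, 3, 5, 8, 13 follow the Fibonacci sequence, because a-type cells divide in two while b-type cells only mature into a-type cells.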
Lindenmayer Systems
Parallel production rules. Finer classification:
D0L – Deterministic, no interaction, i.e., context-free
0L – non-deterministic, no interaction
IL – non-deterministic, with Interaction, i.e., context-sensitive
T0L – with Table of production rules
TIL – with Table and Interaction
E0L – Extended to non-terminal symbols
ET0L – Extended, with Table
EIL = REL of Chomsky
Chomsky classes: RGL Regular, CFL Context-Free, CSL Context-Sensitive, REL Recursively Enumerable
[Figure: inclusion diagram relating the Chomsky classes (0: REL, 1: CSL, 2: CFL, 3: RGL, together with FIN and the indexed languages IND) to the Lindenmayer classes (D0L, 0L, T0L, E0L, ET0L, IL, EIL)]
Example à la Lindenmayer (T0L)
L = {a^i b^i c^i | i > 0} ∈ CSL
G = (Σ, T, ω)
ω = abc
Σ = {a, b, c}
T = {T1, T2}
T1 = {a → aa, b → bb, c → cc}
T2 = {a → …, b → …, c → …}
Gene-Finding
Gene-structure model
Genomic DNA (5’ to 3’: 5’-UTR, start, stop, 3’-UTR)
→ transcribe (RNA Pol II + …) → Pre-mRNA
→ splice (spliceosome: U1, U2, U4, U5, U6 snRNP) → mRNA
→ translate (ribosome; initiation, elongation, and termination factors) → AA seq (protein primary seq)
→ fold (chaperonin) → Protein fold
GT-AG Rule for Introns
5’ splicing donor site: exon … A64 G73 G100 T100 A62 A68 G84 T63 …
3’ splicing acceptor site: … 12 Py N C65 A100 G100 N … exon
{()【( . )( . )( . )】()}
• 【( First exon
• )( Internal exon
• )】 Last exon
• {( Non-coding 5’ exon
• )【 Non-coding 5’ exon
• ( . ) Intron
• 】( Non-coding 3’ exon (rare)
• )} Non-coding 3’ exon (rare)
• }{ Intergenic region
{ Transcription start, 【 Translation start, 】 Translation end, } Transcription end
Dyck language: A language of nested parentheses
• Many types of parentheses
• Finite depth of nesting
• Context-free language
Our case:
• Only 3 types of parentheses
• Shallow nesting
• Conjecture (Xie): may be a regular language
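A sketch of a stack-based check for the three bracket types of the gene-structure notation, which also reports the deepest nesting reached. The function name is illustrative, and characters that are not brackets (the dots and spaces) are simply skipped.

```python
# Closing bracket -> matching opening bracket, for the three types used here.
PAIRS = {")": "(", "】": "【", "}": "{"}
OPENERS = set(PAIRS.values())


def max_depth(text):
    """Return the maximum nesting depth if text is balanced, else None."""
    stack = []
    deepest = 0
    for char in text:
        if char in OPENERS:
            stack.append(char)
            deepest = max(deepest, len(stack))
        elif char in PAIRS:
            if not stack or stack.pop() != PAIRS[char]:
                return None  # wrong or missing opening bracket
        # any other character is ignored
    return deepest if not stack else None


if __name__ == "__main__":
    print(max_depth("{()【( . )( . )( . )】()}"))
```

Because the depth never exceeds a small fixed bound here, the stack can hold only finitely many configurations, so a finite automaton with one state per configuration would suffice; this is consistent with the conjecture above but does not prove it for real genomes.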
Huimin Xie (谢惠民), Grammatical Complexity and 1D Dynamical Systems, Vol. 6 in Directions in Chaos, WSPC, 1996.
Huimin Xie (谢惠民), 《复杂性与动力系统》 (Complexity and Dynamical Systems), Shanghai Scientific and Technological Education Publishing House, 1994.
J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.