TRANSCRIPT
From Genomics to Geology: Hidden Markov Models for Seismic Data Analysis
Samuel Brown
February 5, 2009
Objective
1
• Create a pattern recognition tool that can locate,
isolate, and characterize events in seismic data.
• Adapt hidden Markov model (HMM) search
algorithms from genomics applications and extend
them to work on multi-dimensional seismic data.
Outline
2
• HMM Applications
• HMM Theory
• Generating Sequences
• Simple Markov Models
• Hidden Markov Models
• From Sequence Generation to Pattern Recognition
• Application to Seismic Data
Application Areas
3
HMMs can be used to build powerful, fast pattern
recognition tools for noisy, incomplete data.
• Medical Informatics/Genomics
• hmmer – Sean Eddy, Howard Hughes Medical
Institute
• Speech Recognition
• Intelligence
Prerequisites
4
• We will describe our data as a sequence of symbols
from a predefined alphabet
• For DNA, the alphabet consists of nucleic acids:
{A, C, G, T}
• A sample DNA sequence: CGATATGCG
• We will reference the symbols in a sequence by
their position, starting with 1; i.e., symbol 1 in the
previous sequence is ‘C’
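This 1-based convention is easy to mirror in code; a trivial Python sketch (the helper name `symbol` is ours, not from the talk):

```python
# A DNA sequence is just a string over the alphabet {A, C, G, T}.
seq = "CGATATGCG"

# The slides use 1-based positions; Python is 0-based, so symbol i is seq[i - 1].
def symbol(seq, i):
    """Return the i-th symbol (1-based) of a sequence."""
    return seq[i - 1]

print(symbol(seq, 1))  # -> C
```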
Sequence Generation
5
• Goal: build a probabilistic state machine (model)
that can generate any DNA sequence and
characterize quantitatively the probability that the
model generates any given sequence.
• Enter the Markov model (chain)
Markov Models
6
• Primary Characteristic: The probability of any
symbol in a sequence depends solely on the identity
of the previous symbol.
• tAC = P(xi = C | xi-1 = A)
Markov Model State Machine
7
• Each state with straight edges is an emitting state.
• B and E are special non-emitting states for
beginning and ending sequences.
Markov Model State Machine
8
• A transition probability is assigned to each arrow:
• tAC = P(xi = C | xi-1 = A)
• The probability of sequence x of length L is:
• P(x) = P(E | xL)P(xL | xL-1)…P(x1 | B)
Markov Model State Machine
9
• Given the sequence CGAGTC and a table of
transition probabilities, we can trace a path through
the state machine to get the probability of the
sequence.
Markov Model Example
10
• Assume all transitions are equiprobable, i.e.,
.25 = tAC = tAG = tAT = tCA = …
P(CGAGTC) = (tBC)(tCG)(tGA)(tAG)(tGT)(tTC)(tCE)
With the begin and end transitions taken as 1, this is .25^5 ≈ .00098
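The arithmetic above can be checked with a few lines of Python; an illustrative sketch, with the uniform transition table built explicitly and the begin/end transitions taken as 1:

```python
# Probability of a DNA sequence under a first-order Markov chain with a
# uniform transition table, matching the slide's P(CGAGTC) = .25^5.
ALPHABET = "ACGT"

# t[(a, b)] = P(x_i = b | x_{i-1} = a); all equiprobable here.
t = {(a, b): 0.25 for a in ALPHABET for b in ALPHABET}

def chain_probability(seq, t):
    """Product of transition probabilities over consecutive symbol pairs.
    Begin and end transitions are taken as 1, as in the slide's arithmetic."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= t[(prev, cur)]
    return p

print(chain_probability("CGAGTC", t))  # 0.25**5 = 0.0009765625
```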
CpG Islands
11
• The dinucleotide subsequence CG is relatively rare
in the human genome and is usually associated with
the beginning of coding DNA regions.
• CpG islands are subsequences in which the CG
pattern is common, and there are more C and G
nucleotides in general.
CpG Island Model
12
• We can define a new Markov model for CpG
islands, in which the transition probabilities are
adjusted to reflect the higher frequency of C and G
nucleotides.
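One common way to use such a pair of models (not shown in the slides, but standard practice) is to score a window by the log-odds ratio of the island chain against the background chain. The transition tables below are invented for illustration:

```python
import math

# Score a window by log P(window | CpG chain) - log P(window | background).
# A positive score suggests the window looks more like a CpG island.
def chain_log_prob(seq, t):
    """Log-probability of seq under a first-order chain with table t."""
    return sum(math.log(t[(a, b)]) for a, b in zip(seq, seq[1:]))

# Toy tables: the "+" model boosts transitions into C and G; the "-"
# (background) model is uniform. Real tables would be estimated from data.
plus = {(a, b): (0.35 if b in "CG" else 0.15) for a in "ACGT" for b in "ACGT"}
minus = {(a, b): 0.25 for a in "ACGT" for b in "ACGT"}

def log_odds(seq):
    return chain_log_prob(seq, plus) - chain_log_prob(seq, minus)

print(log_odds("CGCGCG") > 0)  # C/G-rich window scores positive
print(log_odds("ATATAT") < 0)  # A/T-rich window scores negative
```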
Combined Model
13
• What we really want is a model that can emit
normal DNA sequences and CpG islands, with
low-probability transitions between the two regions.
Hidden Markov Model
14
• We call this type of Markov model ‘hidden’
because one cannot immediately determine which
state emitted a given symbol.
Outline
15
• HMM Applications
• HMM Theory
• Generating Sequences
• Simple Markov Models
• Hidden Markov Models
• From Sequence Generation to Pattern Recognition
• Application to Seismic Data
HMM Search
16
• We can calculate the path through the model that
generates a given sequence with the highest
probability using the Viterbi algorithm.
• This allows us to identify portions of our sequence
that have a high probability of being CpG islands.
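A compact log-space Viterbi sketch for a combined model like the one above; states are (symbol, region) pairs, and every probability here is invented for illustration, since the slides give no transition table:

```python
import math

# States of the combined model: (symbol, region), where region "+" is the
# CpG-island submodel and "-" the background.
STATES = [(s, r) for s in "ACGT" for r in "+-"]

def p_sym(k):
    """Next-symbol probability inside a region (illustrative values)."""
    sym, region = k
    if region == "+":
        return 0.45 if sym in "CG" else 0.05   # "+" strongly favours C and G
    return 0.25                                # background is uniform

# Transition log-probability: pick a region (stay/switch 50/50 here),
# then a symbol according to that region's bias.
LOG_T = {(j, k): math.log(0.5 * p_sym(k)) for j in STATES for k in STATES}
LOG_START = {k: math.log(1.0 / len(STATES)) for k in STATES}

def viterbi(seq):
    """Most probable (symbol, region) path for seq."""
    # v[k] = best log-probability of any path ending in state k
    v = {k: LOG_START[k] if k[0] == seq[0] else -math.inf for k in STATES}
    back = []
    for x in seq[1:]:
        nv, bp = {}, {}
        for k in STATES:
            if k[0] != x:                 # state k cannot emit symbol x
                nv[k], bp[k] = -math.inf, None
                continue
            prev = max(STATES, key=lambda j: v[j] + LOG_T[(j, k)])
            nv[k], bp[k] = v[prev] + LOG_T[(prev, k)], prev
        back.append(bp)
        v = nv
    k = max(v, key=v.get)                 # best final state
    path = [k]
    for bp in reversed(back):             # follow the back-pointers
        k = bp[k]
        path.append(k)
    return path[::-1]

print(viterbi("CCAA"))
```

With these numbers the C-rich start is decoded as the "+" region and the A-rich tail as "-".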
HMM Generalizations
17
• Remove the direct link between states and symbols
and allow any state to emit any symbol with a
defined probability distribution.
• ek(b) = P(xi = b | πi = k)
• The sequence probability is now a joint probability
of transition and emission probabilities:
• P(x, π) = Π eπi(xi) tπi,πi+1
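A direct transcription of the joint probability into Python, with a made-up two-state model (‘+’ for island, ‘-’ for background); the tables are illustrative, not from the talk:

```python
import math

# Joint probability P(x, pi) = prod_i e_{pi_i}(x_i) * t_{pi_i, pi_{i+1}},
# evaluated in log space. Begin/end transitions (B, E) are included, as
# in the earlier chain formula.
def joint_log_prob(x, pi, log_e, log_t, begin="B", end="E"):
    """log P(x, pi) for an HMM with emission table log_e and
    transition table log_t."""
    lp = log_t[(begin, pi[0])]                       # enter from B
    for i, (sym, state) in enumerate(zip(x, pi)):
        lp += log_e[(state, sym)]                    # e_k(b)
        nxt = pi[i + 1] if i + 1 < len(pi) else end
        lp += log_t[(state, nxt)]                    # t_{k,l}
    return lp

# Toy two-state model: "+" (CpG island) favours C and G; "-" is uniform.
log_e = {("+", s): math.log(0.4 if s in "CG" else 0.1) for s in "ACGT"}
log_e.update({("-", s): math.log(0.25) for s in "ACGT"})
log_t = {(j, k): math.log(p) for (j, k), p in {
    ("B", "+"): 0.5, ("B", "-"): 0.5,
    ("+", "+"): 0.6, ("+", "-"): 0.2, ("+", "E"): 0.2,
    ("-", "-"): 0.6, ("-", "+"): 0.2, ("-", "E"): 0.2,
}.items()}

# P("CG" along path ++) = 0.5 * 0.4 * 0.6 * 0.4 * 0.2 = 0.0096
print(math.exp(joint_log_prob("CG", ["+", "+"], log_e, log_t)))
```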
HMM Generalizations
18
• Create an HMM based on specific sequences, with
(M)atch, (I)nsert, and (D)elete states.
• Add a B self-transition to be able to skip symbols.
• Allow feedback from E to B to link recognized
sequence portions together.
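The match/insert/delete topology described above can be enumerated explicitly; a sketch for a length-3 model, with state names of our own choosing (profile HMMs in tools like hmmer use this general layout):

```python
def profile_states(length):
    """States of a profile HMM of the given length: B, E, and one
    (M)atch, (I)nsert, (D)elete triple per model position."""
    states = ["B"]
    for i in range(1, length + 1):
        states += [f"M{i}", f"I{i}", f"D{i}"]
    return states + ["E"]

def allowed_transitions(length):
    """Connectivity sketched on the slide: M/I/D at position i feed
    position i + 1 (or E at the end), I states self-loop, B self-loops
    so symbols can be skipped, and E feeds back to B so recognized
    portions can be chained together."""
    edges = {("B", "B"), ("B", "M1"), ("B", "D1"), ("E", "B")}
    for i in range(1, length + 1):
        nm = f"M{i + 1}" if i < length else "E"
        for s in (f"M{i}", f"I{i}", f"D{i}"):
            edges.add((s, nm))               # advance to next match / end
            edges.add((s, f"I{i}"))          # open or extend an insertion
            if i < length:
                edges.add((s, f"D{i + 1}"))  # skip a position via delete
    return edges
```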
Outline
19
• HMM Applications
• HMM Theory
• Generating Sequences
• Simple Markov Models
• Hidden Markov Models
• From Sequence Generation to Pattern Recognition
• Application to Seismic Data
Searching a Trace
20
• When searching a trace, what is our alphabet?
• Trace amplitudes + noise, which is assumed to be
normally distributed.
• What are our models?
• Library of scaled wavelets – one set of (M)atch,
(I)nsert, and (D)elete states for each sample.
Searching a Trace
21
• Emitting states emit a trace sample with a high
probability when the trace sample amplitude is
within one standard deviation of the scaled model
sample amplitude.
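The emission rule above amounts to a Gaussian density centred on the scaled model amplitude; a sketch, where the noise standard deviation and the sample values are illustrative assumptions:

```python
import math

# A match state for wavelet sample m emits observed trace amplitude a
# with a probability taken from a normal density centred on the scaled
# model amplitude. sigma is the assumed noise standard deviation.
def log_emission(a, model_amp, scale, sigma=0.1):
    """Gaussian log-density of trace amplitude a given the scaled model."""
    mu = scale * model_amp
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Within one standard deviation of the scaled model amplitude the density
# (and hence the emission probability) stays high; beyond it, it falls fast.
close = log_emission(0.52, model_amp=1.0, scale=0.5)
far = log_emission(0.90, model_amp=1.0, scale=0.5)
print(close > far)  # True
```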
• The HMM search program will return a list of
wavelet types, central times, and amplitudes.
Searching a Trace
22
• HMM search correctly identified all wavelet components in the
left trace, allowing us to synthesize the spiked trace.

Central Time   Frequency   Coefficient
.23s           18 Hz       -1
.25s           20 Hz       .5
.48s           20 Hz       -.4
.498s          18 Hz       1.5
Searching a Trace
23
What about noise?
Searching Across Traces
24
• Taking the output wavelets from a 1D HMM, what
is the alphabet when we search across traces?
• Moveouts.
• What are our models?
• Library of target trajectories.
First Arrival 1D HMM
25
First Arrival 2D HMM
26
First Arrival 2D HMM
27
To Be Continued…
28