
From Genomics to Geology: Hidden Markov Models for Seismic Data Analysis

Samuel Brown

February 5, 2009

Objective

• Create a pattern recognition tool that can locate, isolate, and characterize events in seismic data.
• Adapt hidden Markov model (HMM) search algorithms from genomics applications and extend them to work on multi-dimensional seismic data.

Outline


• HMM Applications

• HMM Theory

• Generating Sequences

• Simple Markov Models

• Hidden Markov Models

• From Sequence Generation to Pattern Recognition

• Application to Seismic Data

Application Areas

HMMs can be used to build powerful, fast pattern recognition tools for noisy, incomplete data.

• Medical Informatics/Genomics
  • hmmer – Sean Eddy, Howard Hughes Medical Institute
• Speech Recognition
• Intelligence

Prerequisites

• We will describe our data as a sequence of symbols from a predefined alphabet.
• For DNA, the alphabet consists of the nucleotide bases {A, C, G, T}.
• A sample DNA sequence: CGATATGCG
• We will reference the symbols in a sequence by their position, starting with 1; i.e., symbol 1 in the sequence above is 'C'. (A minimal code sketch of this convention follows below.)
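As a minimal illustration in Python (the helper name is ours, purely for illustration):

```python
# A minimal sketch: store a DNA sequence as a string and address
# symbols with the 1-based convention used in these slides.
sequence = "CGATATGCG"

def symbol_at(seq: str, position: int) -> str:
    """Return the symbol at a 1-based position."""
    return seq[position - 1]  # Python indexing is 0-based internally

print(symbol_at(sequence, 1))  # -> 'C'
```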

Sequence Generation

• Goal: build a probabilistic state machine (model) that can generate any DNA sequence and quantitatively characterize the probability that the model generates any given sequence.
• Enter the Markov model (chain).

Markov Models

• Primary characteristic: the probability of any symbol in a sequence depends solely on the identity of the previous symbol.
• $t_{AC} = P(x_i = A \mid x_{i-1} = C)$

Markov Model State Machine

• Each state with straight edges is an emitting state.
• B and E are special non-emitting states for beginning and ending sequences.

Markov Model State Machine

• A transition probability is assigned to each arrow: $t_{AC} = P(x_i = A \mid x_{i-1} = C)$
• The probability of a sequence $x$ of length $L$ is $P(x) = P(E \mid x_L)\, P(x_L \mid x_{L-1}) \cdots P(x_1 \mid B)$

Markov Model State Machine

• Given the sequence CGAGTC and a table of transition probabilities, we can trace a path through the state machine to get the probability of the sequence.

Markov Model Example

• Assume all transitions between nucleotides are equiprobable, i.e., $0.25 = t_{AC} = t_{AG} = t_{AT} = t_{CA} = \cdots$, and take the begin and end transitions to have probability 1 for simplicity. Then (see the sketch below):
$P(\text{CGAGTC}) = t_{EC}\, t_{CT}\, t_{TG}\, t_{GA}\, t_{AG}\, t_{GC}\, t_{CB} = 0.25^5 \approx 0.00098$
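A minimal Python sketch of this computation (the table layout and helper name are ours):

```python
# Transition table maps (previous symbol, next symbol) -> probability;
# 'B' and 'E' are the non-emitting begin/end states, given probability 1
# here to match the simplification above.
NUCLEOTIDES = "ACGT"

trans = {(prev, nxt): 0.25 for prev in NUCLEOTIDES for nxt in NUCLEOTIDES}
for n in NUCLEOTIDES:
    trans[("B", n)] = 1.0  # simplified begin transition
    trans[(n, "E")] = 1.0  # simplified end transition

def sequence_probability(seq: str) -> float:
    """P(x) as a product of transition probabilities along B -> x -> E."""
    path = ["B", *seq, "E"]
    p = 1.0
    for prev, nxt in zip(path, path[1:]):
        p *= trans[(prev, nxt)]
    return p

print(sequence_probability("CGAGTC"))  # 0.25**5 ~= 0.00098
```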

CpG Islands

• The dinucleotide subsequence CG is relatively rare in the human genome and is usually associated with the beginning of coding DNA regions.
• CpG islands are subsequences in which the CG pattern is common, and there are more C and G nucleotides in general.

CpG Island Model

• We can define a new Markov model for CpG islands, in which the transition probabilities are adjusted to reflect the higher frequency of C and G nucleotides (illustrated in the sketch below).
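A small sketch of the two tables side by side; the numbers are purely illustrative assumptions, not estimates from real genomic data:

```python
# Background vs. CpG-island transition tables. Each inner dict maps the
# next symbol to its probability given the previous symbol (each row
# sums to 1). The values are illustrative assumptions only.
background = {prev: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
              for prev in "ACGT"}
cpg_island = {prev: {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}
              for prev in "ACGT"}
```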

Combined Model

• What we really want is a model that can emit normal DNA sequences and CpG islands, with low-probability transitions between the two regions.

Hidden Markov Model

• We call this type of Markov model 'hidden' because one cannot immediately determine which state emitted a given symbol.

Outline


• HMM Applications

• HMM Theory

• Generating Sequences

• Simple Markov Models

• Hidden Markov Models

• From Sequence Generation to Pattern Recognition

• Application to Seismic Data

HMM Search

• We can calculate the path through the model that generates a given sequence with the highest probability using the Viterbi algorithm; a sketch of the recursion follows below.
• This allows us to identify portions of our sequence that have a high probability of being CpG islands.
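A compact log-space Viterbi sketch in Python to make the recursion concrete; this is a generic textbook implementation with hypothetical toy parameters, not the speaker's code:

```python
import math

# Generic log-space Viterbi. States, transition, and emission tables
# below are hypothetical toy values for a two-state CpG model.
states = ["background", "island"]
trans = {
    "background": {"background": 0.95, "island": 0.05},
    "island":     {"background": 0.10, "island": 0.90},
}
emit = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "island":     {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
start = {"background": 0.5, "island": 0.5}

def viterbi(seq: str) -> list[str]:
    """Return the most probable state path for seq."""
    # v[k] = best log-probability of any path ending in state k
    v = {k: math.log(start[k]) + math.log(emit[k][seq[0]]) for k in states}
    back = []  # back[i][k] = best predecessor of state k at step i
    for sym in seq[1:]:
        ptr, nxt = {}, {}
        for k in states:
            prev = max(states, key=lambda j: v[j] + math.log(trans[j][k]))
            ptr[k] = prev
            nxt[k] = v[prev] + math.log(trans[prev][k]) + math.log(emit[k][sym])
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    path = [max(states, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi("CGATACGCGCGCGATAT"))
```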

HMM Generalizations

• Remove the direct link between states and symbols and allow any state to emit any symbol with a defined probability distribution: $e_k(b) = P(x_i = b \mid \pi_i = k)$
• The sequence probability is now a joint probability of transition and emission probabilities (computed in the sketch below): $P(x, \pi) = \prod_i e_{\pi_i}(x_i)\, t_{\pi_i, \pi_{i+1}}$
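A small sketch of that joint probability, reusing tables shaped like the toy ones in the Viterbi sketch above (the start distribution is our assumption):

```python
# Sketch of P(x, pi) for a given state path pi over sequence x, with
# trans/emit tables shaped like the toy Viterbi example above and a
# start distribution over states (an assumption of this sketch).
def joint_probability(seq, path, start, trans, emit):
    p = start[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p
```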

HMM Generalizations

• Create an HMM based on specific sequences, with (M)atch, (I)nsert, and (D)elete states (state layout sketched below).
• Add a B self-transition to be able to skip symbols.
• Allow for feedback from E to B to link recognized sequence portions together.
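A sketch of how such a state set might be enumerated; the names and layout follow the standard profile-HMM match/insert/delete topology and are an assumption, not the speaker's exact design:

```python
# Enumerate states for a profile-style HMM of model length L: one
# (M)atch, (I)nsert, and (D)elete state per model position, plus
# non-emitting B and E states.
def profile_states(L: int) -> list[str]:
    states = ["B", "I0"]  # I0 allows insertions before the first match
    for i in range(1, L + 1):
        states += [f"M{i}", f"I{i}", f"D{i}"]
    states.append("E")
    return states

print(profile_states(3))
# ['B', 'I0', 'M1', 'I1', 'D1', 'M2', 'I2', 'D2', 'M3', 'I3', 'D3', 'E']
```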

Outline


• HMM Applications

• HMM Theory

• Generating Sequences

• Simple Markov Models

• Hidden Markov Models

• From Sequence Generation to Pattern Recognition

• Application to Seismic Data

Searching a Trace

• When searching a trace, what is our alphabet?
  • Trace amplitudes + noise, which is assumed to be normally distributed.
• What are our models?
  • A library of scaled wavelets – one set of (M)atch, (I)nsert, and (D)elete states for each sample.

Searching a Trace

• Emitting states emit a trace sample with high probability when the trace sample amplitude is within one standard deviation of the scaled model sample amplitude (see the Gaussian emission sketch below).
• The HMM search program will return a list of wavelet types, central times, and amplitudes.
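One plausible Gaussian emission model consistent with that description; the density form and parameter names (model_amp, scale, sigma) are illustrative assumptions:

```python
import math

# Emission probability density for a trace sample, assuming Gaussian
# noise around the scaled model sample amplitude.
def emission_density(sample_amp: float, model_amp: float,
                     scale: float, sigma: float) -> float:
    """N(sample_amp; scale * model_amp, sigma^2)."""
    mu = scale * model_amp
    z = (sample_amp - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Samples within one standard deviation of the scaled model amplitude
# score at least exp(-0.5) of the peak density.
print(emission_density(1.1, 1.0, 1.0, 0.2))
```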

Searching a Trace

• The HMM search correctly identified all wavelet components in the left trace, allowing us to synthesize the spiked trace.

Central Time   Frequency   Coefficient
0.23 s         18 Hz       -1.0
0.25 s         20 Hz        0.5
0.48 s         20 Hz       -0.4
0.498 s        18 Hz        1.5

Searching a Trace


What about noise?

Searching Across Traces

• Taking the output wavelets from a 1D HMM, what is the alphabet when we search across traces?
  • Moveouts.
• What are our models?
  • A library of target trajectories (scored as in the sketch below).
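One way to picture this stage, as a heavily simplified sketch; the scoring rule and data layout are assumptions for illustration, not the speaker's implementation:

```python
# Given wavelet central times picked per trace by the 1D HMM, score
# each candidate moveout trajectory by how closely it tracks the picks.
def trajectory_score(picks: list[float], trajectory: list[float],
                     sigma: float = 0.01) -> float:
    """Sum of squared, sigma-normalized residuals (lower is better)."""
    return sum(((p - t) / sigma) ** 2 for p, t in zip(picks, trajectory))

picks = [0.230, 0.232, 0.236, 0.241]          # seconds, one pick per trace
library = {
    "flat":       [0.230, 0.230, 0.230, 0.230],
    "hyperbolic": [0.230, 0.232, 0.237, 0.243],
}
best = min(library, key=lambda name: trajectory_score(picks, library[name]))
print(best)  # -> 'hyperbolic'
```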

First Arrival 1D HMM


First Arrival 2D HMM


First Arrival 2D HMM


To Be Continued…

