From Genomics to Geology: Hidden Markov Models for Seismic Data Analysis

Samuel Brown

February 5, 2009


Objective

• Create a pattern recognition tool that can locate, isolate, and characterize events in seismic data.

• Adapt hidden Markov model (HMM) search algorithms from genomics applications and extend them to work on multi-dimensional seismic data.

Outline

• HMM Applications

• HMM Theory

• Generating Sequences

• Simple Markov Models

• Hidden Markov Models

• From Sequence Generation to Pattern Recognition

• Application to Seismic Data

Application Areas

HMMs can be used to build powerful, fast pattern recognition tools for noisy, incomplete data.

• Medical Informatics/Genomics

    • HMMER – Sean Eddy, Howard Hughes Medical Institute

• Speech Recognition

• Intelligence

Prerequisites

• We will describe our data as a sequence of symbols from a predefined alphabet.

• For DNA, the alphabet consists of the four nucleotides: {A, C, G, T}

• A sample DNA sequence: CGATATGCG

• We will reference the symbols in a sequence by their position, starting with 1; i.e., symbol 1 in the previous sequence is ‘C’.

Sequence Generation

• Goal: build a probabilistic state machine (model) that can generate any DNA sequence and quantify the probability that the model generates any given sequence.

• Enter the Markov model (chain).

Markov Models

• Primary characteristic: the probability of any symbol in a sequence depends solely on the previous symbol.

• $t_{AC} = P(x_i = A \mid x_{i-1} = C)$

Markov Model State Machine

• Each state with straight edges is an emitting state.

• B and E are special non-emitting states for beginning and ending sequences.

• A transition probability is assigned to each arrow:

• $t_{AC} = P(x_i = A \mid x_{i-1} = C)$

• The probability of a sequence $x$ of length $L$ is:

• $P(x) = P(E \mid x_L)\,P(x_L \mid x_{L-1}) \cdots P(x_1 \mid B)$

• Given the sequence CGAGTC and a table of transition probabilities, we can trace a path through the state machine to get the probability of the sequence.

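Markov Model Example

• Assume all transitions are equiprobable, i.e., $t_{AC} = t_{AG} = t_{AT} = t_{CA} = \dots = 0.25$

$P(\mathrm{CGAGTC}) = t_{EC}\,t_{CT}\,t_{TG}\,t_{GA}\,t_{AG}\,t_{GC}\,t_{CB}$ (first subscript is the current symbol, second the previous one, as in the definition of $t_{AC}$)

$= 0.25^5 \approx 0.00098$, counting the five nucleotide-to-nucleotide transitions and taking the begin and end transitions $t_{CB}$ and $t_{EC}$ as 1 (with all seven factors at 0.25 the product would be $0.25^7 \approx 0.00006$).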
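The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the table layout keyed by (previous, next) symbol and the names `uniform_transitions`, `chain_probability`, `p_begin`, and `p_end` are assumptions made here. With the begin and end factors left at 1, the product reduces to the five inner transitions.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def uniform_transitions(p=0.25):
    """Transition table keyed (previous, next); every nucleotide pair gets p."""
    return {(a, b): p for a, b in product(NUCLEOTIDES, repeat=2)}

def chain_probability(seq, trans, p_begin=1.0, p_end=1.0):
    """P(seq) for a first-order Markov chain.

    p_begin and p_end stand in for the B -> x_1 and x_L -> E transitions
    (hypothetical parameters; the defaults leave them out of the product).
    """
    prob = p_begin
    for prev, nxt in zip(seq, seq[1:]):
        prob *= trans[(prev, nxt)]
    return prob * p_end

trans = uniform_transitions()
p_inner = chain_probability("CGAGTC", trans)              # five inner transitions
p_full = chain_probability("CGAGTC", trans, 0.25, 0.25)   # begin and end at 0.25 too
print(p_inner, p_full)
```

Passing a non-uniform table to `chain_probability` gives the probability of the same path under any other set of transition probabilities.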

CpG Islands

• The dinucleotide subsequence CG is relatively rare in the human genome and is usually associated with the beginning of coding DNA regions.

• CpG islands are subsequences in which the CG pattern is common and C and G nucleotides are more frequent in general.

CpG Island Model

• We can define a new Markov model for CpG islands, in which the transition probabilities are adjusted to reflect the higher frequency of C and G nucleotides.
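One way to see what the adjusted transition probabilities buy us is to score the same sequence under both models and compare. The sketch below does this with a log ratio. Every number in it is a hypothetical value chosen for illustration (not an estimate from real genomes), and the names `make_table`, `log_prob`, and `log_odds` are assumptions introduced here.

```python
import math

NUCLEOTIDES = "ACGT"

# Hypothetical next-symbol probabilities, for illustration only; each row sums to 1.
ISLAND_ROW = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}
BACKGROUND_ROW = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
BACKGROUND_AFTER_C = {"A": 0.35, "C": 0.25, "G": 0.05, "T": 0.35}  # CG made rare

def make_table(default_row, overrides=None):
    """Build a table keyed (previous, next) from per-row next-symbol probabilities."""
    overrides = overrides or {}
    return {(p, n): overrides.get(p, default_row)[n]
            for p in NUCLEOTIDES for n in NUCLEOTIDES}

ISLAND = make_table(ISLAND_ROW)
BACKGROUND = make_table(BACKGROUND_ROW, {"C": BACKGROUND_AFTER_C})

def log_prob(seq, table):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(table[pair]) for pair in zip(seq, seq[1:]))

def log_odds(seq):
    """Positive when the island model explains seq better than the background."""
    return log_prob(seq, ISLAND) - log_prob(seq, BACKGROUND)

print(log_odds("CGCGCG"), log_odds("ATATAT"))
```

Under these made-up tables a CG-rich string scores positive and an AT-rich string scores negative, which is the behavior the adjusted model is meant to capture.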

Combined Model

• What we really want is a model that can emit both normal DNA sequences and CpG islands, with low-probability transitions between the two regions.

Hidden Markov Model

• We call this type of Markov model ‘hidden’ because one cannot immediately determine which state emitted a given symbol.


HMM Search

• Using the Viterbi algorithm, we can calculate the path through the model that generates a given sequence with the highest probability.

• This allows us to identify portions of our sequence that have a high probability of being CpG islands.

HMM Generalizations

• Remove the direct link between states and symbols and allow any state to emit any symbol with a defined probability distribution.

• $e_k(b) = P(x_i = b \mid \pi_i = k)$

• Sequence probability is now a joint probability of transition and emission probabilities:

• $P(x, \pi) = t_{\pi_1, B} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, t_{\pi_{i+1}, \pi_i}$, where $\pi_i$ is the state at position $i$, $\pi_{L+1} = E$, and the first subscript of $t$ is the destination state, as in $t_{AC}$.
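The joint probability and the Viterbi search over it can be sketched together. The block below is a minimal two-state illustration (island vs. background), not the author's implementation. All parameter values are hypothetical, the `TRANS` table is keyed (from_state, to_state), the end state E is left out for brevity, and the computation runs in log space to avoid underflow on long sequences.

```python
import math

STATES = ("island", "background")

# Hypothetical parameters, for illustration only.
START = {"island": 0.5, "background": 0.5}          # B -> first state
TRANS = {                                           # keyed (from_state, to_state)
    ("island", "island"): 0.9, ("island", "background"): 0.1,
    ("background", "island"): 0.1, ("background", "background"): 0.9,
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def joint_log_prob(seq, path):
    """log P(x, pi): start term plus emission and transition terms along the path."""
    logp = math.log(START[path[0]])
    for i, (sym, state) in enumerate(zip(seq, path)):
        logp += math.log(EMIT[state][sym])
        if i + 1 < len(path):
            logp += math.log(TRANS[(state, path[i + 1])])
    return logp

def viterbi(seq):
    """Return (best_path, log_prob) maximizing P(x, pi) over all state paths."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for sym in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[(r, s)]))
            new_score[s] = (score[prev] + math.log(TRANS[(prev, s)])
                            + math.log(EMIT[s][sym]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    last = max(STATES, key=score.get)
    path = [last]
    for pointers in reversed(back):   # trace the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

path, logp = viterbi("ATATCGCGCGCGATAT")
print(path, logp)
```

The positions that the returned path labels `island` are the candidate CpG regions. Changing the switching probabilities in `TRANS` controls how long a CG-rich run must be before it is reported.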

• Create an HMM based on specific sequences, with (M)atch, (I)nsert, and (D)elete states.

• Add a B self-transition to be able to skip symbols.

• Allow for feedback from E to B to link recognized sequence portions together.


Searching a Trace

• When searching a trace, what is our alphabet?

    • Trace amplitudes + noise, which is assumed to be normally distributed.

• What are our models?

    • Library of scaled wavelets, with one set of (M)atch, (I)nsert, and (D)elete states for each sample.

• Emitting states emit a trace sample with a high probability when the trace sample amplitude is within one standard deviation of the scaled model sample amplitude.

• The HMM search program will return a list of wavelet types, central times, and amplitudes.
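The emitting-state rule above is what a normal noise model gives naturally. As a sketch under that assumption, the emission score for a trace sample can be taken as a Gaussian density centered on the scaled model sample. The function name `emission_density` and its parameters are hypothetical and not taken from the search program itself.

```python
import math

def emission_density(trace_amp, model_amp, scale, sigma):
    """Normal density of a trace sample about the scaled model sample.

    trace_amp: observed trace amplitude
    model_amp: wavelet (model) sample amplitude
    scale:     scale factor applied to the model wavelet
    sigma:     assumed standard deviation of the noise
    """
    z = (trace_amp - scale * model_amp) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# A sample within one sigma of the scaled model scores far higher
# than one three sigma away.
near = emission_density(1.05, 0.5, 2.0, 0.1)   # 0.5 sigma from the model
far = emission_density(1.30, 0.5, 2.0, 0.1)    # 3 sigma from the model
print(near, far)
```

In a log-space search the logarithm of this density would be added to the path score in place of a discrete emission probability.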

• The HMM search correctly identified all wavelet components in the left trace, allowing us to synthesize the spiked trace.

Central Time    Frequency    Coefficient
0.23 s          18 Hz        -1
0.25 s          20 Hz        0.5
0.48 s          20 Hz        -0.4
0.498 s         18 Hz        1.5
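To make the table concrete, the sketch below rebuilds both a wavelet trace and a spiked trace from the listed components. The slides do not state the wavelet type, so a Ricker wavelet is assumed here. The 2 ms sample interval, the 400-sample length, and the names `ricker`, `synthesize`, and `spike_trace` are also assumptions for illustration.

```python
import math

def ricker(t, f):
    """Ricker wavelet of peak frequency f (Hz) at time t (s); an assumed wavelet type."""
    a = (math.pi * f * t) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)

# (central time in s, frequency in Hz, coefficient) from the table above
COMPONENTS = [(0.23, 18.0, -1.0), (0.25, 20.0, 0.5),
              (0.48, 20.0, -0.4), (0.498, 18.0, 1.5)]

def synthesize(components, dt=0.002, n=400):
    """Sum the scaled, time-shifted wavelets on a regular time axis."""
    return [sum(c * ricker(i * dt - t0, f) for t0, f, c in components)
            for i in range(n)]

def spike_trace(components, dt=0.002, n=400):
    """Place each coefficient as a spike at the sample nearest its central time."""
    trace = [0.0] * n
    for t0, _, c in components:
        trace[round(t0 / dt)] += c
    return trace

trace = synthesize(COMPONENTS)
spikes = spike_trace(COMPONENTS)
print([(i, v) for i, v in enumerate(spikes) if v != 0.0])
```

The spiked trace keeps only arrival times and amplitudes, so closely spaced events such as those at 0.48 s and 0.498 s stay separate even though their wavelets overlap in the synthesized trace.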


What about noise?

Searching Across Traces

• Taking the output wavelets from a 1D HMM, what is the alphabet when we search across traces?

    • Moveouts.

• What are our models?

    • Library of target trajectories.

First Arrival 1D HMM

First Arrival 2D HMM

To Be Continued…