String Matching with Finite Automata Aho-Corasick String Matching By Waqas Shehzad Fast NU Pakistan


Post on 11-May-2015


Page 1: String Matching with Finite Automata,Aho corasick,

String Matching with Finite Automata

Aho-Corasick String Matching

By Waqas Shehzad, Fast NU Pakistan


String Matching

Whenever you use a search engine, or a “find” function like grep, you are utilizing a string matching program. Many of these programs create finite automata in order to effectively search for your string.

 


Finite state machines

A finite state machine (FSM, also known as a deterministic finite automaton or DFA) is a way of representing a language, that is, a set of strings.

We represent the language as the set of those strings accepted by some program. So, once we have found the right machine, we can test whether a given string matches just by running it.


How it works

We'll draw pictures with circles and arrows. A circle represents a state, and an arrow with a label means that we go to the state it points to if we see that character.

A finite automaton accepts strings in a specific language. It begins in state q0 and reads characters one at a time from the input string. It makes transitions ($\delta$) based on these characters, and if it is in one of the accept states when it reaches the end of the input, the string is accepted by the automaton and so belongs to the language.
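The run described above can be sketched in a few lines of Python. This is only an illustration: the function name `accepts` and the two-state "even number of 1s" machine are assumptions chosen for the example, not something taken from the slides.

```python
def accepts(delta, start, accepting, text):
    """Run a DFA over text and report whether it ends in an accept state."""
    state = start
    for ch in text:
        # Exactly one transition per input character.
        state = delta[(state, ch)]
    return state in accepting


# Hypothetical machine: binary strings containing an even number of 1s.
delta = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

print(accepts(delta, "even", {"even"}, "1001"))  # True: two 1s
print(accepts(delta, "even", {"even"}, "1011"))  # False: three 1s
```

Storing the transitions in a dictionary keyed by (state, character) keeps the sketch short; a character that has no entry raises a `KeyError`, which is acceptable for an illustration but would need handling in real code.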


Example

An example is a machine that could be used by the C preprocessor (a part of most C compilers) to tell which characters are part of comments and can be removed from the input.

Finite automata can be viewed as a special kind of graph, and we can use any of the normal graph representations to store them.


Example (cont.)

One particularly useful representation is a transition table: we make a table with rows indexed by states and columns indexed by possible input characters; each entry gives the next state.
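As a rough sketch of such a table, here is one possible machine for the comment example. The slide's own diagram is not part of this transcript, so the four states, their names, and the simplification that string literals and `//` comments are ignored are all assumptions made for illustration.

```python
# Hypothetical states: ordinary code, just saw '/', inside a comment,
# inside a comment and just saw '*'.
CODE, SLASH, COMMENT, STAR = range(4)


def column(ch):
    """Map a character to its column in the table: '/', '*', or anything else."""
    return 0 if ch == "/" else 1 if ch == "*" else 2


# Rows are indexed by state, columns by character class.
TABLE = [
    # '/'      '*'      other
    [SLASH,   CODE,    CODE],     # CODE
    [SLASH,   COMMENT, CODE],     # SLASH
    [COMMENT, STAR,    COMMENT],  # COMMENT
    [CODE,    STAR,    COMMENT],  # STAR
]


def strip_comments(src):
    """Remove /* ... */ comments (string literals are NOT handled)."""
    out = []
    state = CODE
    for ch in src:
        nxt = TABLE[state][column(ch)]
        if state == SLASH and nxt != COMMENT:
            out.append("/")        # the pending '/' was ordinary code after all
        if nxt == CODE and state in (CODE, SLASH):
            out.append(ch)         # ordinary character outside any comment
        state = nxt
    if state == SLASH:
        out.append("/")            # input ended right after a lone '/'
    return "".join(out)


print(strip_comments("int x; /* counter */ int y;"))
```

The table itself only decides the next state; which characters to emit is a separate decision layered on top of the state changes.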


Finite Automata

A finite automaton is a quintuple $(Q, \Sigma, \delta, s, F)$:

$Q$: the finite set of states

$\Sigma$: the finite input alphabet

$\delta$: the “transition function” from $Q \times \Sigma$ to $Q$

$s \in Q$: the start state

$F \subseteq Q$: the set of final (accepting) states


Example: nano

State diagram for finding the word “nano” with the grep utility.

Simulating this on the string "banananona", we get the sequence of states empty, empty, empty, "n", "na", "nan", "na", "nan", "nano", "nano", "nano".


transition table


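Running Time of Compute-Transition-Function

It takes something like O(m^3 + n) time, where m is the length of the pattern and n is the length of the input:

O(m^3) to build the state table described above (treating the alphabet size as a constant),

O(n) to simulate it on the input file.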
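A minimal sketch of the straightforward table construction follows, assuming each state is identified with the length of the pattern prefix matched so far. The names `compute_transition_function` and `trace` are illustrative. The accepting state is made absorbing here so that the trace agrees with the "nano" example above; textbook versions usually keep matching after a hit instead.

```python
def compute_transition_function(pattern, alphabet):
    """Naive construction: about m^3 work per alphabet symbol."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            if q == m:
                delta[q][a] = m      # assumption: stay in the accept state
                continue
            # Largest k such that pattern[:k] is a suffix of pattern[:q] + a.
            k = min(m, q + 1)
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1               # stops at k == 0 (empty prefix) at the latest
            delta[q][a] = k
    return delta


def trace(pattern, text):
    """Simulate the automaton and return the matched prefix after each step."""
    alphabet = set(pattern) | set(text)
    delta = compute_transition_function(pattern, alphabet)
    state = 0
    states = [state]
    for ch in text:                  # O(n): one table lookup per character
        state = delta[state][ch]
        states.append(state)
    return [pattern[:q] for q in states]


print(trace("nano", "banananona"))
# ['', '', '', 'n', 'na', 'nan', 'na', 'nan', 'nano', 'nano', 'nano']
```

The three nested costs (states, candidate values of k, and the suffix comparison) are where the cubic term comes from; the simulation loop is linear in the input length.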


Aho-Corasick String Matching

An Efficient String Matching Algorithm


Introduction

Problem: locate all occurrences of any of a finite number of keywords in a string of text.

The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass.


Pattern Matching Machine (1)

Let $K = \{y_1, y_2, \ldots, y_k\}$ be a finite set of strings which we shall call keywords and let x be an arbitrary string which we shall call the text string.

The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.


Pattern Matching Machine (2)

Goto function g: maps a pair consisting of a state and an input symbol into a state or the message fail.

Failure function f: maps a state into a state, and is consulted whenever the goto function reports fail.

Output function output: associates a set of keywords (possibly empty) with every state.


Start state is state 0. Let s be the current state and a the current symbol of the input string x.

Operating cycle:

If $g(s, a) = s'$, the machine makes a goto transition: it enters state $s'$, and the next symbol of x becomes the current input symbol.

If $g(s, a) = \mathit{fail}$, the machine makes a failure transition: if $f(s) = s'$, the machine repeats the cycle with $s'$ as the current state and a as the current input symbol.


Example

Text: u s h e r s

State: 0 0 3 4 5 8 9 (state 2 is entered briefly, via a failure transition, between states 5 and 8)

In state 4, since $g(4, e) = 5$, the machine enters state 5, finds the keywords “she” and “he” ending at position four of the text string, and emits $output(5)$.


Example Cont’d

In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.

Since $g(5, r) = \mathit{fail}$, M enters state $2 = f(5)$. Then since $g(2, r) = 8$, M enters state 8 and advances to the next input symbol.

No output is generated in this operating cycle.
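The operating cycle and the "ushers" walk-through can be reproduced with a short sketch. The tables below are written out by hand and assume the keyword set {he, she, his, hers} from Aho and Corasick's original example, which the state numbers on these slides appear to follow; the names `GOTO`, `FAILURE`, `OUTPUT`, and `match` are illustrative.

```python
FAIL = None

# Goto function g, assuming the keywords he, she, his, hers.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8, (3, "h"): 4,
    (4, "e"): 5, (6, "s"): 7,
    (8, "s"): 9,
}
# Failure function f and output function for the same keyword set.
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, ch):
    """Goto lookup; state 0 loops to itself instead of reporting fail."""
    if (state, ch) in GOTO:
        return GOTO[(state, ch)]
    return 0 if state == 0 else FAIL


def match(text):
    """Return every state entered and each (end position, keyword) found."""
    state = 0
    visited = [0]
    found = []
    for i, ch in enumerate(text, start=1):
        while g(state, ch) is FAIL:      # failure transitions, same symbol
            state = FAILURE[state]
            visited.append(state)
        state = g(state, ch)             # goto transition, consume symbol
        visited.append(state)
        for kw in sorted(OUTPUT.get(state, ())):
            found.append((i, kw))
    return visited, found


visited, found = match("ushers")
print(visited)  # [0, 0, 3, 4, 5, 2, 8, 9]
print(found)    # [(4, 'he'), (4, 'she'), (6, 'hers')]
```

The extra 2 in the visited list is the failure transition out of state 5 on r; no output is emitted there because output is only checked after a goto transition.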


Constructing the functions

There are two parts to the construction:

First: determine the states and the goto function.

Second: compute the failure function.

Computation of the output function starts in the first part and is completed in the second.


Construction of Goto function

Construct a goto graph from the keywords.

Each keyword adds new vertices and edges to the graph, starting at the start state.

Add new edges only when necessary.

Add a loop from state 0 to state 0 on all input symbols that do not begin a keyword.
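The goto construction amounts to inserting each keyword into a trie. The sketch below assumes the keywords he, she, his, hers (the set used in Aho and Corasick's original example) and represents the loop on state 0 implicitly in the lookup function; `build_goto` and `g` are illustrative names.

```python
def build_goto(keywords):
    """Build the goto graph and the partial output function."""
    goto = [{}]        # goto[s] maps an input symbol to the next state
    output = [set()]   # output[s] is the set of keywords ending at state s
    for kw in keywords:
        state = 0                        # every keyword starts at the start state
        for ch in kw:
            if ch not in goto[state]:    # add a new edge only when necessary
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(kw)
    return goto, output


def g(goto, state, ch):
    """Goto lookup: None means fail; state 0 loops on symbols with no edge."""
    nxt = goto[state].get(ch)
    if nxt is None and state == 0:
        return 0
    return nxt


goto, output = build_goto(["he", "she", "his", "hers"])
print(goto[0])     # {'h': 1, 's': 3}
print(goto[1])     # {'e': 2, 'i': 6}
print(output[5])   # {'she'}  (still partial: "he" is merged in later)
```

At this stage the output function is incomplete; the remaining keywords are added when the failure function is computed.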


About construction

When we determine $f(s) = s'$, we merge the outputs of state s with the outputs of state $s'$.

Some failure transitions are unnecessary. In fact, if the keyword “his” were not present, then the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1.

To avoid such unnecessary failure transitions, we can use a deterministic finite automaton, which is discussed later.
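The failure function and the merging of outputs can be sketched as a breadth-first pass over the goto graph. This is one common way to write it, assuming the keyword set {he, she, his, hers}; the name `build_machine` is illustrative, and the goto construction is repeated so that the block runs on its own.

```python
from collections import deque


def build_machine(keywords):
    """Return (goto, failure, output) for the given keywords."""
    goto, output = [{}], [set()]
    for kw in keywords:                      # part 1: goto graph
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(kw)

    failure = [0] * len(goto)                # part 2: failure function
    queue = deque(goto[0].values())          # depth-1 states fail to state 0
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            # Follow failure links until some state has an edge on ch;
            # state 0 always "succeeds" because of its implicit loop.
            while state != 0 and ch not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(ch, 0)
            output[s] |= output[failure[s]]  # merge outputs of s and f(s)
    return goto, failure, output


goto, failure, output = build_machine(["he", "she", "his", "hers"])
print(failure)             # [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]
print(sorted(output[5]))   # ['he', 'she']
```

Processing states in breadth-first order guarantees that f(s) refers to a shallower state whose own failure link and output set are already final.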


Time Complexity of Algorithms 1, 2, and 3

Algorithm 1 (the pattern matching machine) makes fewer than 2n state transitions in processing a text string of length n.

Algorithm 2 (construction of the goto function) requires time linearly proportional to the sum of the lengths of the keywords.

Algorithm 3 (construction of the failure function) can be implemented to run in time proportional to the sum of the lengths of the keywords.


Eliminating Failure Transitions

In Algorithm 1, replace the goto function with a next move function $\delta$ such that $\delta(s, a)$ is a state for each state s and input symbol a.

By using the next move function $\delta$, we can dispense with all failure transitions and make exactly one state transition per input character.
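A sketch of how such a next move function might be derived from the goto and failure functions is given below. The tables are hand-written for the keyword set {he, she, his, hers}, and the small alphabet "ehirsu" is an assumption made to keep the example short; `build_next_move` and `run` are illustrative names.

```python
from collections import deque

# Goto and failure functions, assuming the keywords he, she, his, hers.
GOTO = [
    {"h": 1, "s": 3}, {"e": 2, "i": 6}, {"r": 8}, {"h": 4}, {"e": 5},
    {}, {"s": 7}, {}, {"s": 9}, {},
]
FAILURE = [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]


def build_next_move(goto, failure, alphabet):
    """Fill delta[s][a] for every state s and symbol a (no fail entries)."""
    delta = [dict() for _ in goto]
    queue = deque()
    for a in alphabet:
        delta[0][a] = goto[0].get(a, 0)      # state 0 loops on other symbols
        if delta[0][a] != 0:
            queue.append(delta[0][a])
    while queue:
        r = queue.popleft()
        for a in alphabet:
            if a in goto[r]:
                delta[r][a] = goto[r][a]
                queue.append(goto[r][a])
            else:
                # f(r) is shallower than r, so its row is already complete.
                delta[r][a] = delta[failure[r]][a]
    return delta


def run(delta, text):
    """One state transition per input character."""
    state, states = 0, [0]
    for ch in text:
        state = delta[state][ch]
        states.append(state)
    return states


delta = build_next_move(GOTO, FAILURE, "ehirsu")
print(delta[5]["r"])          # 8: the detour through state 2 is precomputed
print(run(delta, "ushers"))   # [0, 0, 3, 4, 5, 8, 9]
```

The table has one entry per state and symbol, which is where the extra memory goes compared with storing only the goto edges and one failure link per state.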


Conclusion

The algorithm is attractive for large numbers of keywords, since all keywords can be simultaneously matched in one pass.

Using the next move function can reduce state transitions by up to 50%, but requires more memory.

In practice the machine spends most of its time in state 0, from which there are no failure transitions.
