aho-corasick algorithm parallelization
Post on 14-Aug-2015
42 Views
Preview:
TRANSCRIPT
<Name Surname>
2String-matching: AC algorithm
String-matching algorithms are a class of algorithms that aim to find occurrences of words (patterns) within a larger string (text)
Aho-Corasick algorithm (AC) is a classic solution to exact set matching.
Given
pattern set 𝑃 = { 𝑃1, . . . , 𝑃𝑘 }
text 𝑇[1…𝑚]
total length of patterns n = 𝑖=1𝑘 |𝑃𝑖|
the AC algorithms complexity is 𝑂(𝑛 +𝑚 + 𝑧), where 𝑧is the number of pattern occurrences in 𝑇
<Name Surname>
3AC algorithm: finite-state machine
The AC algorithm builds a finite-state machine to efficiently memorize the pattern set
The FSA is memorized along with three functions
the goto function 𝑔(𝑞, 𝑎) gives the state entered
from current state 𝑞 by matching target char 𝑎
the failure function 𝑓 𝑞 , 𝑞 ≠ 0 gives the state
entered at a mismatch
the output function out 𝑞 gives the set of patterns
recognized when entering state q
<Name Surname>
5AC algorithm: matching phase
The AC algorithm uses the FSA to match the text against the keywords
𝐴𝐶_𝑚𝑎𝑡𝑐ℎ𝑖𝑛𝑔 𝑇 1…𝑚
𝑞 ≔ 0; // initial state (root)
𝒇𝒐𝒓 𝑖 ≔ 1 𝒕𝒐 𝑚 𝒅𝒐
𝒘𝒉𝒊𝒍𝒆 𝑔 𝑞, 𝑇 𝑖 = 0 𝒅𝒐
𝑞 ≔ 𝑓 𝑞 ; // follow a fail
𝑞 ≔ 𝑔 𝑞, 𝑇 𝑖 ; // follow a goto
𝒊𝒇 𝑜𝑢𝑡 𝑞 ≠ 0 𝒕𝒉𝒆𝒏 𝒑𝒓𝒊𝒏𝒕 𝑖, 𝑜𝑢𝑡 𝑞 ;
𝒆𝒏𝒅𝒇𝒐𝒓
The number of steps of the loop is equal to the length of the text
<Name Surname>
6Parallelization step
Idea: parallelize the matching phase of the AC algorithm (the FSA can be built once for each pattern data set)
The 𝑚 steps of the loop can be split in 𝑘 chunks, each one of length 𝑙 = 𝑚 𝑘 and then each chunk can be processed by a thread
Feasible because a chunk can be independently analyzed
𝑚 = 19 𝑘 = 3 𝑙 = 7
<Name Surname>
7Parallelization: problems
The splitting phase as performed before can lead to missing occurrences
Let assume 𝑃 = 𝑎𝑑𝑣, 𝑜𝑟𝑖𝑡, 𝑒𝑑
Each thread would run AC on its related chunk
Thread 1: 𝑇 = 𝑎𝑑𝑣𝑎𝑛𝑐𝑒
Thread 2: 𝑇 = 𝑑 𝑎𝑙𝑔𝑜𝑟
Thread 3: 𝑇 = 𝑖𝑡ℎ𝑚𝑠
None of them would find the occurrences of the second and third keyword
Needed a redundancy for text overlapping twochunks
<Name Surname>
8Parallelization: solutions
The maximum needed overlap o is the lenght of the longest word in the pattern data set – 1
Each chunk will contain the last o characters of the previous one
However: orit correctly found by thread 3 but ed incorrectly matched twice (threads 1 and 2)
Correction: start counting matches only after o characters read
<Name Surname>
9Implementation
AC has been implemented in C using openMP; the matching-phase has been split among threads using the pragma for structure
Input: text, keywords, number of threads
Output: number of occurences
The chunk size 𝑙 is computed with the following formula
𝑙 = 𝑚 + 𝑜𝑣 ( 𝑘 − 1)
The output variable is aggregated after the end of the loop ( reduction statement )
<Name Surname>
10Implementation
Each read character is converted in its ASCII code
Therefore, the FSA
allows 256 different
transitions
It allows to use the AC
algorithm even with
non-textual files
Binary files must be
read bytewise
<Name Surname>
11Test
Very large input files have been used in order to test the algorithm’s performance
a text file containing the English version of the bible
a dictionary including the 10000 most common English words
A single test consists of an aggregation measure of 10 different runs of the algorithm on the inputs using the same number of threads
<Name Surname>
12Test
1 2 3 4 5 6 7 8 9 10
0
50
100
150
200
250
300
350
number of threads
execution t
ime (
sec)
i7 4700MQ - 4 cores/8 threads
Mean
Minimum
<Name Surname>
13Test
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0
15
30
45
60
75
90
105
number of threads
execution t
ime (
sec)
12 cores/24 threads machine
mean
min
<Name Surname>
14Conclusion
In this work it has been showed a parallelization procedure for a serial-designed algorithm
The more threads are used the faster the algorithm runs until a certain point after which we do not get any improvements
Parallelization improves performance but requires modifications not always clear from the beginning that often lead to overheads
top related