Post on 12-Feb-2022
Kubilay Atasu – IBM Research Zurich
© 2013 IBM Corporation
Hardware-accelerated regular expression matching for high-throughput text analytics
Kubilay Atasu, Raphael Polig, Christoph Hagleitner, Frederick R. Reiss
IBM Research – Zurich & IBM Research – Almaden
Outline
Text analytics systems
Advanced regex features
Network of state machines
Implementation & experiments
Conclusions & future work
K. Atasu et al.
SystemT: an algebraic approach to declarative information extraction
distill structured data from unstructured and semi-structured text
exploit the extracted data in your applications
For years, Microsoft
Corporation CEO Bill Gates
was against open source. But
today he appears to have
changed his mind. "We can be
open source. We love the
concept of shared source,"
said Bill Veghte, a Microsoft
VP. "That's a super-important
shift for us in terms of code
access."
Richard Stallman, founder of
the Free Software Foundation,
countered saying…
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
(from Cohen’s IE tutorial, 2003)
Annotations
A typical SystemT information extraction query
Find the names (regex) that are at most 20 chars after a title (dict.)
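Founder..............Bill Gates
[Figure: the dict. match ("Founder") and the regex match ("Bill Gates"), each marked with a start offset and an end offset and separated by at most 20 chars; the result is reported with its own start offset and end offset]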
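The query above can be approximated in a few lines of Python. This is only a sketch of the follows-within-20-chars join, not SystemT or its query language: the dictionary entries in `TITLES`, the name pattern `NAME_RE`, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical title dictionary and name regex, chosen only for this example.
TITLES = ("CEO", "VP", "Founder")
TITLE_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, TITLES)) + r")\b")
NAME_RE = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")
MAX_GAP = 20  # "at most 20 chars after a title"


def names_after_titles(text):
    """Return (title, name, name_span) for names starting within MAX_GAP chars of a title's end."""
    title_spans = [m.span() for m in TITLE_RE.finditer(text)]
    results = []
    for name in NAME_RE.finditer(text):
        for t_start, t_end in title_spans:
            gap = name.start() - t_end  # needs the name's start offset and the title's end offset
            if 0 <= gap <= MAX_GAP:
                results.append((text[t_start:t_end], name.group(), name.span()))
    return results


sample = "Founder" + "." * 14 + "Bill Gates"
print(names_after_titles(sample))
```

Note that the join condition depends on the start offset of the regex match, which is why start offset reporting matters for this kind of query.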
Regex matching: background
Consider the regex .*a*b[^a]*ca*b
Can be transformed into NFA/DFA
Hybrid solutions are also possible
[Figure: the NFA, the DFA, and an NFA with a single nondeterministic state for this regex]
Start Offset Reporting
Consider the regex .*a*b[^a]*ca*b
Consider the input string abcabcab
There are four distinct regex matches
Only 2 and 4 are leftmost matches
Each one has a different start offset
[Figure: the input a b c a b c a b with the four matches (match 1 to match 4) marked]
NFA needs to remember multiple start offsets!
NFA must know which one(s) to report, when!
No existing HW architecture addresses this problem:
Reconfigurable NFAs (Baker FCCM 2001, Bispo FPT 2006, Yang ANCS 2008, …)
Programmable DFAs (Smith SIGCOMM 2008, Van Lunteren INFOCOM 2012, …)
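The four matches on abcabcab can be checked with a brute-force enumeration. The sketch below uses Python's `re` module purely as a reference oracle; it is not how the hardware works, and the helper names are made up for this example. End offsets are exclusive, as in Python slicing.

```python
import re

# The regex from the slide without the leading .* (every start offset is tried explicitly).
PATTERN = re.compile(r"a*b[^a]*ca*b")


def all_matches(text):
    """Enumerate every (start, end) pair whose substring matches the whole pattern."""
    return [
        (start, end)
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        if PATTERN.fullmatch(text, start, end)
    ]


def leftmost_per_end(matches):
    """For each end offset keep only the smallest start offset."""
    best = {}
    for start, end in matches:
        if end not in best or start < best[end]:
            best[end] = start
    return sorted((start, end) for end, start in best.items())


matches = all_matches("abcabcab")
print(matches)                    # four (start, end) pairs
print(leftmost_per_end(matches))  # two of them are leftmost
```

Two pairs of matches share an end offset but differ in their start offset, which is exactly the information a plain NFA or DFA does not keep.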
Capturing groups
Consider the regex .*a*(?b[^a]*c)a*b
A capturing group is marked by (? )
Assume regexs of the type .*R1(?R2)R3
Build the DFAs of R1, R2, and R3
– interconnected by epsilon transitions
Epsilon removal creates nondeterminism
– but, only at the terminal states
– all other states are deterministic
– at most two possible next states
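As a software reference for what a capturing group should report, the example regex can be run through Python's `re` module. This is a sketch only: the slide's (? ) marker is written here as an ordinary Python group, and the backtracking engine behind `re` is unrelated to the DFA-based construction described above.

```python
import re

# R1 = a*, R2 = b[^a]*c (the captured part), R3 = a*b
PATTERN = re.compile(r"a*(b[^a]*c)a*b")


def group_offsets(text):
    """Return ((match_start, match_end), (group_start, group_end)) for the first match, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.span(), m.span(1)


print(group_offsets("abcabcab"))
```

For the input abcabcab the first match covers offsets 0 to 5 (end exclusive) and the group covers the substring "bc" at offsets 1 to 3.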
Our solution: network of state machines
Use a network of state machines and enable each state machine to remember its start offset
Network of state machines
Shutdown Logic:
– deactivate configurations storing the same state value
– only the one with the smallest start offset remains active
Routing Logic:
– route branch configurations to inactive state machines
– based on active flags (af) and branch flags (bf)
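The interplay of configurations and shutdown logic can be imitated in software. The sketch below is an assumption-laden model, not the hardware design: it hand-codes a four-state NFA for a*b[^a]*ca*b, treats each active (state, start offset) pair as one state machine, and models the shutdown logic as "keep the smallest start offset per state". Routing is implicit because a Python dict has no fixed number of slots.

```python
START, ACCEPT = 0, 3


def step(state, ch):
    """Hand-written NFA for a*b[^a]*ca*b (states: 0 = a*, 1 = [^a]*, 2 = a*, 3 = accept)."""
    if state == 0:
        return {0} if ch == "a" else {1} if ch == "b" else set()
    if state == 1:
        nxt = set()
        if ch != "a":
            nxt.add(1)  # stay inside [^a]*
        if ch == "c":
            nxt.add(2)  # branch: consume the literal c
        return nxt
    if state == 2:
        return {2} if ch == "a" else {3} if ch == "b" else set()
    return set()  # the accepting state has no outgoing transitions


def leftmost_matches(text):
    """Return (start, end) pairs, end exclusive, keeping the smallest start per state."""
    active = {}  # state -> start offset of the configuration holding that state
    results = []
    for pos, ch in enumerate(text):
        candidates = dict(active)
        candidates.setdefault(START, pos)  # unanchored: a match may begin at any position
        nxt = {}
        for state, start in candidates.items():
            for target in step(state, ch):
                # Shutdown logic (modelled): same state, smallest start offset survives.
                if target not in nxt or start < nxt[target]:
                    nxt[target] = start
        active = nxt
        if ACCEPT in active:
            results.append((active[ACCEPT], pos + 1))
    return results


print(leftmost_matches("abcabcab"))
```

On abcabcab this reports one match per end offset, each with the smallest possible start offset, without ever revisiting earlier input.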
Computing the size of the network
The size of the network can be statically computed:
– the largest number of NFA states mapped to a single DFA state
– typically much smaller than the number of NFA states
While supporting leftmost match semantics
– if multiple state machines reach the same state, only one will remain alive
– shutdown logic will pick the configuration with the smallest start offset
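The static size computation can be sketched with a subset construction: enumerate the reachable DFA states and take the largest number of NFA states in any of them. The NFA below is a hand-written assumption for .*a*b[^a]*ca*b, the alphabet is reduced to representative symbols, and counting the always-active start state as one machine is a modelling choice of this example.

```python
from collections import deque

ALPHABET = "abcx"  # 'x' stands in for any character other than a, b, c
START = 0


def step(state, ch):
    """Hand-written NFA for a*b[^a]*ca*b (state 3 is accepting)."""
    if state == 0:
        return {0} if ch == "a" else {1} if ch == "b" else set()
    if state == 1:
        nxt = set()
        if ch != "a":
            nxt.add(1)
        if ch == "c":
            nxt.add(2)
        return nxt
    if state == 2:
        return {2} if ch == "a" else {3} if ch == "b" else set()
    return set()


def reachable_subsets():
    """Subset construction; START is re-added each step to model the leading .*"""
    start = frozenset({START})
    seen = {start}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for ch in ALPHABET:
            nxt = {START}
            for state in subset:
                nxt |= step(state, ch)
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def network_size():
    """Largest number of NFA states mapped to a single DFA state."""
    return max(len(subset) for subset in reachable_subsets())


print(len(reachable_subsets()), network_size())
```

For this small example the construction yields five DFA states and at most three simultaneously active NFA states, so three state machines would suffice under these assumptions.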
[Figure: an example NFA and the corresponding DFA]
What kind of network? (architecture 1)
Only state 0 can assert its branch flag
Only a single branch configuration is routed
A priority encoder is sufficient for arbitration
NFA with a single nondeterministic state
Partially expanded NFA can be minimized
DFA compression techniques can be used
Also decreases the size of the network
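When only one branch configuration has to be routed per step, arbitration reduces to finding one inactive state machine. The sketch below models that with a priority encoder over the active flags. The lowest-index-first priority, the tuple layout of a configuration, and the function names are assumptions of this example.

```python
def priority_encode(active_flags):
    """Return the index of the first inactive state machine, or None if all are active."""
    for index, active in enumerate(active_flags):
        if not active:
            return index
    return None


def route_branch(machines, branch):
    """Place the single branch configuration into the first inactive machine.

    machines: list whose entries are None (inactive) or a (state, start_offset) tuple.
    """
    active_flags = [machine is not None for machine in machines]
    target = priority_encode(active_flags)
    if target is None:
        raise RuntimeError("no inactive state machine available")
    routed = list(machines)
    routed[target] = branch
    return routed


print(route_branch([(0, 0), None, (2, 1), None], (1, 5)))
```

If the network is sized as described earlier, the error branch should never be taken, because the number of simultaneously active configurations is bounded statically.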
What kind of network? (architectures 2 & 3)
Pack and unpack operations (left)
– log(N) delay, N log(N) area each
– 2log(N) delay and 2N log(N) total area
A single wide pack operation (below)
– log(2N) = log(N)+1 delay (lower)
– 2Nlog(2N) = 2Nlog(N)+2N area (higher)
– simpler shutdown logic
– this is our Architecture 3
NFA with a single nondeterministic state
– single branch flag
– N+1→N pack instead of a 2N→N pack
– this is our Architecture 2
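The delay and area figures quoted above can be compared for a concrete N. The sketch below simply evaluates the slide's expressions with base-2 logarithms; constant factors and technology-specific costs are ignored, and the function names are made up for this example.

```python
from math import log2


def pack_unpack_cost(n):
    """Separate pack and unpack: 2 log(N) delay, 2 N log(N) total area."""
    return 2 * log2(n), 2 * n * log2(n)


def wide_pack_cost(n):
    """Single 2N-wide pack: log(2N) delay, 2N log(2N) area."""
    return log2(2 * n), 2 * n * log2(2 * n)


for n in (4, 8, 16):
    print(n, pack_unpack_cost(n), wide_pack_cost(n))
```

For N = 8 the pack and unpack variant has delay 6 and area 48, while the single wide pack has delay 4 and area 64, matching the "lower delay, higher area" trade-off stated above.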
Implementation & experiments
Design: two-stage, dual-threaded pipeline
– cycle 1: next configuration computation
– cycle 2: interconnection network
Hardwired next state computation
– DFA compression techniques
Benchmark       # regexs  Comb. LUTs  Registers  Frequency
Text Analytics  25        4.4%        4.3%       ~250 MHz
L7 Filter       101       29%         20%        ~150 MHz
Device: Altera Stratix IV GX530KH40C2
– using Altera Quartus II, V 11
– target frequency: 250 MHz
Single pipeline throughput rate: 2 Gb/s
– single stream throughput rate: 1 Gb/s
Up to 16 Gb/s aggregate throughput rate for Text Analytics regexs
– using 8 hardware pipelines (16 document streams)
320-fold faster than software with the same functionality
– using 16 software threads running on an Intel Xeon E5530
All L7 filter regexs have been made unanchored (more complex)
– it's trivial to compute the start offset for anchored regexs
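The throughput figures follow from the clock frequency and the pipeline organization. The arithmetic below assumes one 8-bit character is consumed per clock cycle, which the slides do not state explicitly but which is consistent with 2 Gb/s at 250 MHz; the constant names are chosen for this example.

```python
CLOCK_HZ = 250_000_000      # target frequency from the slide
BITS_PER_CYCLE = 8          # assumption: one 8-bit character per cycle
THREADS_PER_PIPELINE = 2    # dual-threaded pipeline
PIPELINES = 8               # hardware pipelines used for Text Analytics

pipeline_gbps = CLOCK_HZ * BITS_PER_CYCLE / 1e9
stream_gbps = pipeline_gbps / THREADS_PER_PIPELINE
aggregate_gbps = pipeline_gbps * PIPELINES
document_streams = PIPELINES * THREADS_PER_PIPELINE

print(pipeline_gbps, stream_gbps, aggregate_gbps, document_streams)
```

Under that assumption the numbers reproduce the slide: 2 Gb/s per pipeline, 1 Gb/s per stream, and 16 Gb/s across 16 document streams.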
Text analytics regular expressions: resource usage
A histogram of the resource usage (ALUTs) of 25 regexs
[Figure: y-axis from 0 to 8000 ALUTs, x-axis regex index 1 to 25; series: Architecture 1, Architecture 2, Architecture 3, Capturing Gr.]
Text analytics regular expressions: clock frequency
A histogram of the clock frequency (MHz) of 25 regexs
[Figure: y-axis from 0 to 700 MHz, x-axis regex index 1 to 25; series: Architecture 1, Architecture 2, Architecture 3, Capturing Gr.]
Conclusions & future work
Novel regex matching architecture that supports advanced features
– start offset reporting, capturing groups, leftmost match semantics
– based on a network of state machines and an optimized network
– strictly forward processing, without introducing any back-pressure
Its reconfigurable hardware implementation and experiments that show
– up to 16 Gb/s aggregate throughput rate using 8 hardware pipelines
– up to 320x speed-up over 16 software threads running on a server
– including an evaluation of various implementation choices
The current and future work includes
– reducing the resource consumption to improve the scalability
– making the next-configuration-computation logic programmable
– supporting additional regex features, such as back-references