regular expression
DESCRIPTION
Provides fundamental knowledge of regular expressions
TRANSCRIPT
![Page 1: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/1.jpg)
Regular Expression
Minh Hoang TO
Portal Team
![Page 2: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/2.jpg)
Agenda
» Finite State Machine
» Pattern Parser
» Java Regex
» Parsers in GateIn
» Advanced Theory
![Page 3: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/3.jpg)
Finite State Machine
![Page 4: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/4.jpg)
State Diagram
![Page 5: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/5.jpg)
JIRA Issue Lifecycle
![Page 6: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/6.jpg)
Java Thread Lifecycle
![Page 7: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/7.jpg)
Java Compilation Flow
![Page 8: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/8.jpg)
Finite State Machine - FSM
» Behavioral model describing the workflow of a system
![Page 9: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/9.jpg)
Finite State Machine - FSM
» Directed graph with labeled edges
![Page 10: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/10.jpg)
Pattern Parser
![Page 11: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/11.jpg)
Classic Problem
» A – a finite set of characters
Ex:
A = {a, b, c, d,..., z} or A = { a, b, c,..., z, public, class, extends, implements, while, if,...}
» Pattern P and input sequence INPUT are made of elements of A
Ex:
P = “a.*b” or P = “class.*extends.*”
INPUT = “aaabbbcc” or INPUT = a Java source file
→ The parser reads INPUT character by character and recognizes all subsequences matching pattern P
![Page 12: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/12.jpg)
Classic Problem - Samples
» Split a sequence of characters into an array of subsequences
String path = "/portal/en/classic/home";
String[] segments = path.split("/");
» Handle comment block encountered in a file
» Override readLine() in BufferedReader
» Extract data from REST response
» Write an XML parser from scratch
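The first sample above can be tried directly. The sketch below wraps the `path.split("/")` call from the slide in a helper; the class and method names are assumptions made for illustration only.

```java
class PathSplitDemo {
    // Splits a path on '/'. String.split interprets its argument as a regular expression.
    static String[] segments(String path) {
        return path.split("/");
    }
}
```

Calling `PathSplitDemo.segments("/portal/en/classic/home")` yields five elements: an empty string (produced by the leading `/`), then `portal`, `en`, `classic` and `home`. The empty first element is easy to overlook when indexing into the result.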
![Page 13: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/13.jpg)
Finite State Machine & Classic Problem
» Acceptor FSM?
» How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution?
Finding pattern occurrences ↔ Traversing a directed graph with labeled edges
![Page 14: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/14.jpg)
FSM – Word Accepting
» Consider a word W, a sequence of characters from character set A
W = “abcd...xyz”
An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes
START = S1 → S2 → … → Sn = END
1. Intermediate nodes may appear more than once on the path
2. The transition S_i → S_(i+1) is determined by the i-th character of W
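The acceptance rule above can be sketched as a small table-driven machine. This is a minimal sketch assuming a deterministic FSM (at most one edge per state and character); the class name `WordAcceptor`, the string node names, and the `edge` builder method are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class WordAcceptor {
    // edges.get(from).get(label) is the node reached from 'from' on character 'label'.
    private final Map<String, Map<Character, String>> edges = new HashMap<>();
    private final String start;
    private final Set<String> ends;

    WordAcceptor(String start, Set<String> ends) {
        this.start = start;
        this.ends = ends;
    }

    // Adds a labeled edge: from --label--> to.
    WordAcceptor edge(String from, char label, String to) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
        return this;
    }

    // Follows one edge per character of the word; accepts if the path stops on an END node.
    boolean accepts(String word) {
        String state = start;
        for (char c : word.toCharArray()) {
            Map<Character, String> out = edges.get(state);
            if (out == null || !out.containsKey(c)) {
                return false; // no edge labeled c leaves the current node
            }
            state = out.get(c);
        }
        return ends.contains(state);
    }
}
```

With the edges START –a→ S2, S2 –b→ S2 and S2 –c→ END, the machine accepts "ac" and "abbc" (visiting S2 repeatedly) and rejects "ab", because the path for "ab" stops on S2 rather than on an END node.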
![Page 15: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/15.jpg)
FSM – Word Accepting
![Page 16: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/16.jpg)
Acceptor FSM
» Given a pattern P, an FSM is called an Acceptor FSM of P if it accepts exactly the words matching P.
Ex:
The Acceptor FSM of “a[0-9]b” accepts exactly the elements of the word set
{ “a0b”, “a1b”, “a2b”, “a3b”, “a4b”, “a5b”, “a6b”, “a7b”, “a8b”, “a9b”}
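One way to check this word set is to let `java.util.regex` play the role of the Acceptor FSM. The sketch below assumes whole-word matching via `Matcher.matches()`; the class name is hypothetical.

```java
import java.util.regex.Pattern;

class DigitAcceptorDemo {
    // Compiled once; the Pattern stands in for the Acceptor FSM of a[0-9]b.
    private static final Pattern P = Pattern.compile("a[0-9]b");

    // True only when the whole word matches the pattern.
    static boolean accepts(String word) {
        return P.matcher(word).matches();
    }
}
```

`accepts("a0b")` through `accepts("a9b")` return true, while words outside the set such as "ab", "a10b" or "axb" are rejected.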
![Page 17: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/17.jpg)
How Does a Pattern Parser Work?
Traverse the directed graph associated with the Acceptor FSM
1. Start from the root node
2. Read the next character from INPUT, then move according to the transition rules
3. Repeat the second step until a leaf node is visited or INPUT is exhausted
4. Return OK if the leaf node represents a successful match.
![Page 18: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/18.jpg)
Example One
» Recognize pattern
eXo.*er
in:
AAAeXo123erBBBeXoerCCCeXoeXoerDDD
![Page 19: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/19.jpg)
Example One
» Acceptor FSM with 8 states:
START – Start reading input sequence
e – encounter e
eX – encounter eX
eXo – encounter eXo
eXo.* – encounter eXo.*
eXo.*e – encounter eXo.*e
END – subsequence matching eXo.*er found
FAILURE
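The states listed above can be hand-coded as a scanner. This is a sketch under stated assumptions, not the diagram from the slides: it reports the shortest, non-overlapping matches, it models END as "record the match and restart", and it models FAILURE as an empty result. The enum constants `ANY` and `ANY_E` are hypothetical names for the states eXo.* and eXo.*e.

```java
import java.util.ArrayList;
import java.util.List;

class ExoScanner {
    enum State { START, E, EX, EXO, ANY, ANY_E }

    // Returns every shortest, non-overlapping subsequence matching eXo.*er.
    static List<String> scan(String input) {
        List<String> matches = new ArrayList<>();
        State state = State.START;
        int begin = -1; // index of the 'e' that opened the current candidate
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                case E:
                case EX:
                    if (c == 'e') { state = State.E; begin = i; }              // (re)start a candidate
                    else if (state == State.E && c == 'X') state = State.EX;
                    else if (state == State.EX && c == 'o') state = State.EXO;
                    else state = State.START;
                    break;
                case EXO:
                case ANY:
                    // Inside ".*": any character is allowed, but 'e' may begin the closing "er".
                    state = (c == 'e') ? State.ANY_E : State.ANY;
                    break;
                case ANY_E:
                    if (c == 'r') {
                        matches.add(input.substring(begin, i + 1)); // END reached
                        state = State.START;
                    } else {
                        state = (c == 'e') ? State.ANY_E : State.ANY;
                    }
                    break;
            }
        }
        return matches; // an empty list corresponds to FAILURE
    }
}
```

On the input `AAAeXo123erBBBeXoerCCCeXoeXoerDDD` this scanner returns `eXo123er`, `eXoer` and `eXoeXoer`.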
![Page 20: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/20.jpg)
![Page 21: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/21.jpg)
Example Two
» Recognize comment block
/* */
in:
/* Don't ask */ final int innerClassVariable;
![Page 22: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/22.jpg)
Example Two
» Acceptor FSM with 5 states:
START – start reading input sequence
OUT – stay away from comment blocks
ENTERING – at the beginning of comment block
IN – stay inside a comment block
LEAVING – at the end of comment block
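The five states above map naturally onto an enum-driven loop. The sketch below is one possible reading of the state list, not the diagram from the slides: it removes `/* ... */` blocks and keeps everything else. It deliberately ignores string literals and `//` line comments, and the class and method names are hypothetical.

```java
class CommentStripper {
    enum State { START, OUT, ENTERING, IN, LEAVING }

    // Returns the source text with every /* ... */ block removed.
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.START;
        for (char c : source.toCharArray()) {
            switch (state) {
                case START:
                case OUT:
                    if (c == '/') state = State.ENTERING;       // possible start of a comment
                    else { out.append(c); state = State.OUT; }
                    break;
                case ENTERING:
                    if (c == '*') state = State.IN;              // "/*" seen: inside a comment
                    else if (c == '/') out.append('/');          // previous '/' was plain text
                    else { out.append('/').append(c); state = State.OUT; }
                    break;
                case IN:
                    if (c == '*') state = State.LEAVING;         // possible end of the comment
                    break;
                case LEAVING:
                    if (c == '/') state = State.OUT;             // "*/" seen: comment closed
                    else if (c != '*') state = State.IN;         // false alarm, still inside
                    break;
            }
        }
        if (state == State.ENTERING) out.append('/');            // trailing lone '/'
        return out.toString();
    }
}
```

For the input `/* Don't ask */ final int innerClassVariable;` the result is ` final int innerClassVariable;` (with the leading space preserved).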
![Page 23: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/23.jpg)
![Page 24: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/24.jpg)
Finite State Machine With Stack
» Example Two is slightly harder than Example One because the transition decision depends on past information → we must keep something in memory
» FSM with Stack = Ordinary FSM + Stack structure storing past info
A contextual transition is determined by the pair
(next input character, stack state)
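To make the pair (next input character, stack state) concrete, here is a minimal sketch of an FSM with a stack. The task, checking that brackets are properly nested, is a hypothetical example chosen for illustration and does not come from the slides.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketAcceptor {
    // Accepts words whose (), [] and {} are properly nested; other characters are ignored.
    static boolean accepts(String word) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : word.toCharArray()) {
            switch (c) {
                case '(': stack.push(')'); break;   // remember which closer is expected
                case '[': stack.push(']'); break;
                case '{': stack.push('}'); break;
                case ')':
                case ']':
                case '}':
                    // The move depends on both the input character and the top of the stack.
                    if (stack.isEmpty() || stack.pop() != c) return false;
                    break;
                default:
                    break;
            }
        }
        return stack.isEmpty(); // every opener must have been closed
    }
}
```

`accepts("{[()]}")` returns true, while `accepts("([)]")` and `accepts("((")` return false. No ordinary FSM can do this for arbitrary nesting depth, because the depth is unbounded while the number of states is finite.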
![Page 25: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/25.jpg)
Java Regex
![Page 26: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/26.jpg)
Model
» Pattern: Acceptor Finite State Machine
» Matcher: Parser
![Page 27: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/27.jpg)
java.util.regex.Pattern
» Construct FSM accepting pattern
Pattern p = Pattern.compile("a.*b");
FSM states are instances of java.util.regex.Pattern$Node
» Generate parser working on input sequence
Matcher matcher = p.matcher("aaabbbb");
![Page 28: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/28.jpg)
java.util.regex.Matcher
» Find next subsequence matching pattern
find()
» Get capturing groups from latest match
group()
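The usual way to combine `find()` and `group()` is a loop that collects one match per iteration. The helper below is a sketch; its name and signature are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Collects group() (the whole matched subsequence) for every successful find().
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

On the Example One input `AAAeXo123erBBBeXoerCCCeXoeXoerDDD`, the greedy pattern `eXo.*er` produces a single match, `eXo123erBBBeXoerCCCeXoeXoer`, because `.*` extends to the last `er`. The reluctant variant `eXo.*?er` produces `eXo123er`, `eXoer` and `eXoeXoer`.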
![Page 29: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/29.jpg)
Capturing Group
Two Pattern objects
Pattern p = Pattern.compile("abcd.*efgh");
Pattern q = Pattern.compile("abcd(.*)efgh");
String text = "abcd12345efgh";
Matcher pM = p.matcher(text);
Matcher qM = q.matcher(text);
» pM.find() == qM.find(); both return true
» qM.group(1) returns "12345", whereas pM.group(1) throws IndexOutOfBoundsException because p has no capturing group
![Page 30: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/30.jpg)
Capturing Group
» Hold additional information on each match
while (matcher.find()) {
    matcher.group(index);
}
» Pattern P = (A)(B(C))
matcher.group(0) = the whole sequence ABC
matcher.group(1) = A
matcher.group(2) = BC
matcher.group(3) = C
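Group numbering can be checked by dumping every group of the first match. Groups are numbered by the position of their opening parenthesis, left to right. The helper below is a sketch with a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group(0) .. group(groupCount()) of the first match, or an empty array if none.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }
}
```

`groups("(A)(B(C))", "ABC")` returns `ABC`, `A`, `BC`, `C`, and `groups("abcd(.*)efgh", "abcd12345efgh")` returns `abcd12345efgh`, `12345`.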
![Page 31: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/31.jpg)
Capturing Group
» Pattern.compile("abc(defgh");
Pattern.compile("abcdef)gh");
→ PatternSyntaxException
» Pattern.compile("abc\\(defgh");
Pattern.compile("abcdef\\)gh");
→ Success thanks to escape character '\'
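The difference between the two cases can be observed by catching the exception. The sketch below uses a hypothetical helper name; `Pattern.quote` is mentioned in the note as an alternative to manual escaping.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    // True if the regex compiles, false if Pattern.compile raises PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }
}
```

`compiles("abc(defgh")` and `compiles("abcdef)gh")` return false, while `compiles("abc\\(defgh")` returns true and the compiled pattern matches the literal text `abc(defgh`. When a whole string should be treated literally, `Pattern.quote("abc(defgh")` avoids escaping each character by hand.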
![Page 32: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/32.jpg)
Operators
» Union
[a-zA-Z-0-9]
» Negation
[^abc]
[^X]
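A short sketch of both operators in use. The class uses `[a-zA-Z0-9]`, a union of three ranges, and `[^abc]`, the negation of the set {a, b, c}; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Union of three ranges: lowercase letters, uppercase letters, digits.
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");
    // Negation: any character except a, b or c.
    private static final Pattern NOT_ABC = Pattern.compile("[^abc]+");

    static boolean isAlnum(String s) {
        return ALNUM.matcher(s).matches();
    }

    static boolean avoidsAbc(String s) {
        return NOT_ABC.matcher(s).matches();
    }
}
```

`isAlnum("GateIn35")` is true and `isAlnum("a_b")` is false; `avoidsAbc("xyz")` is true and `avoidsAbc("xay")` is false.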
![Page 33: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/33.jpg)
Contextual Match
» X(?=Y)
Match X only if it is followed by Y (positive lookahead)
» X(?!Y)
Match X only if it is not followed by Y (negative lookahead)
» (?<=Y)X
Match X only if it is preceded by Y (positive lookbehind)
» (?<!Y)X
Match X only if it is not preceded by Y (negative lookbehind)
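The four contextual forms can be compared by looking at where the first match starts. In Java the negative forms are written `(?!Y)` and `(?<!Y)`, and a lookbehind is normally placed before X. The helper name and the sample words below are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Start index of the first match of regex in input, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }
}
```

- `firstMatchStart("foo(?=bar)", "foobaz foobar")` returns 7: only the second `foo` is followed by `bar`.
- `firstMatchStart("foo(?!bar)", "foobar foobaz")` returns 7: the first `foo` is rejected because `bar` follows.
- `firstMatchStart("(?<=un)happy", "happy unhappy")` returns 8: only the second `happy` is preceded by `un`.
- `firstMatchStart("(?<!un)happy", "unhappy happy")` returns 8: the first `happy` is rejected because `un` precedes it.

In every case the context Y is only tested, not consumed, so it is not part of `group()`.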
![Page 34: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/34.jpg)
Tips
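» Pattern is immutable and thread-safe → maximize reuse (a Matcher, by contrast, holds match state)
We often see:
static final Pattern p = Pattern.compile("a*b");
» Be careful with String.split
String.split treats its argument as a regular expression; on hot paths, compare it against a plain Java loop + String.charAt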
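Both tips can be sketched in one class: a `Pattern` compiled once and shared, and a hand-written splitter that avoids the regex machinery. The names are hypothetical, and whether the loop is actually faster in a given application is something to measure rather than assume.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PatternReuse {
    // Compiled once and shared; Pattern instances are immutable and safe to use from several threads.
    private static final Pattern AB = Pattern.compile("a*b");

    // A fresh Matcher per call: Matcher keeps match state and must not be shared between threads.
    static boolean isAb(String s) {
        return AB.matcher(s).matches();
    }

    // Splits on '/' with a plain loop and String.charAt instead of String.split.
    static List<String> splitOnSlash(String path) {
        List<String> parts = new ArrayList<>();
        int from = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                parts.add(path.substring(from, i));
                from = i + 1;
            }
        }
        parts.add(path.substring(from)); // last segment (possibly empty)
        return parts;
    }
}
```

`splitOnSlash("/portal/en/classic/home")` returns an empty string followed by `portal`, `en`, `classic`, `home`. Unlike `String.split`, this loop keeps trailing empty segments, so `splitOnSlash("a/")` returns `a` and an empty string.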
![Page 35: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/35.jpg)
Parsers in GateIn
![Page 36: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/36.jpg)
Parsers in GateIn
» JavaScript Compressor
» CSS Compressor
» Groovy Template Optimizer
» Navigation Controller
Extracting URL param = Regex matching + Backtracking algorithm
» StaxNavigator (Nice XML parser based on StAX)
![Page 37: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/37.jpg)
Advanced Theory
![Page 38: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/38.jpg)
Grammar & Language
» Any word matching the pattern eXo.*er is derived by a finite sequence of transforms, starting from S
S → eXoQer
Q → RQ
Q → ''
R → {a,b,c,d,...}
» Language of a grammar = the set of words generated by finite sequences of transforms, starting from S
Ex: Any valid Java source file is generated by a finite number of transforms listed in the Java grammar (JLS)
![Page 39: Regular Expression](https://reader033.vdocuments.us/reader033/viewer/2022061609/5563a5bad8b42a2b6a8b52ba/html5/thumbnails/39.jpg)
Finite State Machine & Language
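» A language accepted by an FSM with Stack (pushdown automaton) is generated by a context-free grammar
Explicit steps to build such a context-free grammar are given by the standard automaton-to-grammar construction
» Conversely, the language of a context-free grammar is accepted by a (nondeterministic) FSM with Stack
Explicit steps to build such a Finite State Machine are given by the standard grammar-to-automaton construction
» Kleene's theorem is the analogous result for ordinary FSMs: they accept exactly the languages described by regular expressions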