Suffix Arrays: A new method for on-line string searches
DESCRIPTION
Suffix Arrays: A new method for on-line string searches. Udi Manber, Gene Myers. May 1989. Presented by: Oren Weimann. Introduction and problem definition: “Is W a substring of A?”, where |A| = N and |W| = P, A = a_0 a_1 … a_{N-1}, and A_i is the suffix beginning at index i.
TRANSCRIPT
![Page 1: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/1.jpg)
1
Suffix Arrays: A new method for on-line string searches
Udi Manber, Gene Myers
May 1989
Presented by: Oren Weimann
![Page 2: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/2.jpg)
2
Introduction - Problem definition
“Is W a substring of A?”
$|A| = N$ and $|W| = P$, where $A = a_0 a_1 \dots a_{N-1}$
$A_i$ = suffix beginning at index $i$ = $a_i a_{i+1} \dots a_{N-1}$
A = abccbbadgfbbcahgjf
W = badgfbb
Here W occurs in A starting at index 5: abccb[badgfbb]cahgjf
![Page 3: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/3.jpg)
3
Introduction – what is a suffix array? Example:
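A = assassin (indices 0 1 2 3 4 5 6 7)

| k | Pos[k] | $A_{Pos[k]}$ |
|---|---|---|
| 0 | 0 | assassin |
| 1 | 3 | assin |
| 2 | 6 | in |
| 3 | 7 | n |
| 4 | 2 | sassin |
| 5 | 5 | sin |
| 6 | 1 | ssassin |
| 7 | 4 | ssin |

Pos = 0 3 6 7 2 5 1 4
Pos[2] = 6 ($A_6$ = in)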
![Page 4: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/4.jpg)
4
Introduction – what is a suffix array?
A lexicographically sorted array, Pos[N], of all the suffixes of A:
Pos[k] = i $\iff$ $A_i$ is the k-th smallest suffix in the set $\{A_0, A_1, A_2, \dots, A_{N-1}\}$
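The definition can be checked directly with a naive construction. This is only a sketch for small inputs: it sorts the suffixes by comparing them as whole strings, which is far slower than the O(N log N) method presented later, and the function name `build_pos_naive` is made up for this illustration.

```python
def build_pos_naive(a):
    """Return Pos: the start indices of the suffixes of `a`, in lexicographic order.

    Naive illustration only: every comparison may look at O(N) symbols.
    """
    return sorted(range(len(a)), key=lambda i: a[i:])


if __name__ == "__main__":
    text = "assassin"
    for k, i in enumerate(build_pos_naive(text)):
        # k is the rank (counted from 0), i = Pos[k], text[i:] is the suffix A_i
        print(k, i, text[i:])
```

For A = assassin this reproduces Pos = 0 3 6 7 2 5 1 4 from the example slide.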
![Page 5: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/5.jpg)
5
Introduction – what is a suffix tree? Example:
A trie that contains all suffixes of A:
[Figure: suffix tree for A = assassin (indices 0 1 2 3 4 5 6 7); edges carry substrings of A and each leaf carries the starting index of the suffix it spells]
![Page 6: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/6.jpg)
6
The Article Overview
1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known).
3. An algorithm for computing the lcp information in O(N log N).
4. Algorithms for expected-time improvement.
![Page 7: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/7.jpg)
7
The Search algorithm - Definitions
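For any string $u$, $u^p = u_1 u_2 u_3 \dots u_p$ (or $u$ if $|u| < p$)
Let $\le$ denote lexicographical order. We say $u \le_p v \iff u^p \le v^p$ (and similarly for $<_p$, $=_p$, $\ge_p$, $>_p$)
Note that for any choice of $p$: $A_{Pos[0]} \le_p A_{Pos[1]} \le_p A_{Pos[2]} \le_p \dots \le_p A_{Pos[N-1]}$
Note that W is a substring of A $\iff$ there is an $i$ such that $W =_P A_{Pos[i]}$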
![Page 8: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/8.jpg)
8
The Search algorithm – how does the array help us know if W is a substring of A?
We define a search interval:
$L_W = \min\{k \mid W \le_P A_{Pos[k]} \text{ or } k = N\}$
$R_W = \max\{k \mid W \ge_P A_{Pos[k]} \text{ or } k = -1\}$
W matches $a_i a_{i+1} \dots a_{i+P-1}$ $\iff$ $i = Pos[k]$ for some $k \in [L_W, R_W]$
![Page 9: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/9.jpg)
9
Example:
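A = assassin (indices 0 1 2 3 4 5 6 7)

| k | $A_{Pos[k]}$ |
|---|---|
| 0 | assassin |
| 1 | assin |
| 2 | in |
| 3 | n |
| 4 | sassin |
| 5 | sin |
| 6 | ssassin |
| 7 | ssin |

| W | $L_W$ | $R_W$ | # matches |
|---|---|---|---|
| s | 4 | 7 | 4 |
| as | 0 | 1 | 2 |
| assa | 0 | 0 | 1 |
| ast | 2 | 1 | 0 |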
![Page 10: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/10.jpg)
10
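Why finding $L_W$, $R_W$ is the same as finding the matches:
If $L_W > R_W$, then W is not a substring of A.
Else: there are $(R_W - L_W + 1)$ matches: $A_{Pos[L_W]}, \dots, A_{Pos[R_W]}$
[Figure: the Pos array split into three regions: $W >_P A_{Pos[k]}$ to the left of $L_W$, the matches in $[L_W, R_W]$, and $W <_P A_{Pos[k]}$ to the right of $R_W$]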
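The interval definition can be evaluated literally, which is a handy reference when testing faster searches. The sketch below is a brute-force O(NP) scan rather than the algorithm of the talk; the function name is hypothetical, and Python slicing `a[i:i+p]` plays the role of taking the first P symbols of a suffix.

```python
def search_interval_bruteforce(a, pos, w):
    """Return (L_W, R_W) by scanning every entry of Pos.

    Python string comparison is lexicographic, and a[i:i+p] is the first
    P symbols of suffix A_i (or the whole suffix if it is shorter).
    """
    n, p = len(a), len(w)
    prefixes = [a[i:i + p] for i in pos]
    lw = min((k for k in range(n) if w <= prefixes[k]), default=n)
    rw = max((k for k in range(n) if prefixes[k] <= w), default=-1)
    return lw, rw


if __name__ == "__main__":
    a = "assassin"
    pos = [0, 3, 6, 7, 2, 5, 1, 4]
    for w in ["s", "as", "assa", "ast"]:
        lw, rw = search_interval_bruteforce(a, pos, w)
        print(w, lw, rw, max(0, rw - lw + 1))
```

The four patterns give the same $L_W$, $R_W$ and match counts as the example table for A = assassin.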
![Page 11: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/11.jpg)
11
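The Search algorithm – The easy way – O(P log N)
Example with W = “abcx”:
[Figure: Pos with pointers L, M, R at the suffixes abcde..., abcdf..., abd...]
log N iterations; each iteration sets new L, R bounds (initially L = 0, R = N-1) according to a comparison of W with $A_{Pos[M]}$, where M = (L+R)/2. Each comparison costs O(P), hence O(P log N) in total.
In the end, $L_W = R$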
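A minimal sketch of this plain binary search is given below. The boundary checks before the loop, the symmetric routine for $R_W$, and all function names are assumptions of this sketch; the slide itself only describes the loop for $L_W$.

```python
A = "assassin"
POS = [0, 3, 6, 7, 2, 5, 1, 4]  # suffix array of A, taken from the example slides


def find_lw(a, pos, w):
    """L_W = min{k : W <=_P A_Pos[k]}, or N if there is no such k."""
    n, p = len(pos), len(w)
    pre = lambda k: a[pos[k]:pos[k] + p]   # first P symbols of A_Pos[k]
    if n == 0 or w <= pre(0):
        return 0
    if w > pre(n - 1):
        return n
    lo, hi = 0, n - 1                      # invariant: pre(lo) < w <= pre(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if w <= pre(mid):
            hi = mid
        else:
            lo = mid
    return hi


def find_rw(a, pos, w):
    """R_W = max{k : A_Pos[k] <=_P W}, or -1 if there is no such k."""
    n, p = len(pos), len(w)
    pre = lambda k: a[pos[k]:pos[k] + p]
    if n == 0 or w < pre(0):
        return -1
    if w >= pre(n - 1):
        return n - 1
    lo, hi = 0, n - 1                      # invariant: pre(lo) <= w < pre(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if pre(mid) <= w:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for w in ["s", "as", "assa", "ast"]:
        print(w, find_lw(A, POS, w), find_rw(A, POS, w))
```

Each loop iteration compares up to P symbols, which is where the O(P log N) bound comes from.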
![Page 12: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/12.jpg)
12
The Search algorithm using lcp values in O(P+logN) – Definitions:
Speedup using precomputed lcp values; for now we assume lcp is known.
In each iteration we define:
– l = lcp($A_{Pos[L]}$, W)
– r = lcp(W, $A_{Pos[R]}$)
– Llcp[M] = lcp($A_{Pos[L]}$, $A_{Pos[M]}$)
– Rlcp[M] = lcp($A_{Pos[M]}$, $A_{Pos[R]}$)
![Page 13: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/13.jpg)
13
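The Search algorithm using lcp values in O(P+logN). Example: W = “abcx”
[Figure: Pos with pointers L, M, R at the suffixes abcde..., abcdf..., abd...]
l = 3, r = 2
Llcp[M] = 4, Rlcp[M] = 2
Note that Llcp[M] and Rlcp[M] are well defined because every midpoint M arises from exactly one binary-search interval ($L_M$, $R_M$)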
![Page 14: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/14.jpg)
14
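So how do we use l, r, Llcp[M]? Example: W = abcx
[Figure: Pos holds abcde... (L), abc..., abc..., abcdf… (M), abd… (R); l = 3, r = 2, Llcp[M] = 4]
Case 1: Llcp[M] > l (Llcp[M] = 4 and l = 3)
$W >_P A_{Pos[L]}$, and $A_{Pos[M]}$ agrees with $A_{Pos[L]}$ on more than l symbols $\Rightarrow$ $W >_P A_{Pos[M]}$
Go right; l is unchanged = 3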
![Page 15: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/15.jpg)
15
Example: W=abcx (cont.)
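Case 2: Llcp[M] < l (Llcp[M] = 2 and l = 3)
$A_{Pos[L]} < A_{Pos[M]}$, and they first differ at position Llcp[M], where W still agrees with $A_{Pos[L]}$ $\Rightarrow$ $W <_P A_{Pos[M]}$
Go left; r = Llcp[M] = 2
[Figure: Pos holds abcde... (L), abdf… (M), abd… (R); l = 3, r = 2, Llcp[M] = 2]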
![Page 16: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/16.jpg)
16
Example: W=abcx (cont.)
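[Figure: Pos holds abcde... (L), abc..., abc..., abcp… (M), abd… (R); l = 3, r = 2, Llcp[M] = 3]
Case 3: Llcp[M] = l (Llcp[M] = 3 and l = 3)
Compare $W_l$ with $(A_{Pos[M]})_l$, then the following symbols, until $W_{l+j} \ne (A_{Pos[M]})_{l+j}$
Go right or left according to $W_{l+j}$, $(A_{Pos[M]})_{l+j}$
New l or r = l + j. Number of comparisons = j + 1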
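The three cases can be put together in one routine for $L_W$. The sketch below makes several assumptions that are not on the slides: the branch for r > l mirrors the three cases using Rlcp[M], the Llcp/Rlcp arrays are precomputed naively by direct comparison (the talk derives them from the lcp information built during sorting), and all function names are made up for the illustration.

```python
def lcp_from(w, a, i, start):
    """Length of the common prefix of w and the suffix a[i:], scanning from offset `start`."""
    m = start
    while m < len(w) and i + m < len(a) and w[m] == a[i + m]:
        m += 1
    return m


def build_lr_lcp(a, pos):
    """Naive Llcp/Rlcp: one entry per midpoint M of the binary search over (0, N-1)."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n
    stack = [(0, n - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo > 1:
            mid = (lo + hi) // 2
            llcp[mid] = lcp_from(a[pos[lo]:], a, pos[mid], 0)
            rlcp[mid] = lcp_from(a[pos[hi]:], a, pos[mid], 0)
            stack += [(lo, mid), (mid, hi)]
    return llcp, rlcp


def find_lw_lcp(a, pos, llcp, rlcp, w):
    """L_W using l, r, Llcp and Rlcp to skip symbols already known to match."""
    n, p = len(pos), len(w)
    if n == 0:
        return 0

    def w_le(k, m):
        # Given m = lcp(W, A_Pos[k]), decide W <=_P A_Pos[k] from symbol m alone.
        i = pos[k] + m
        return m == p or (i < len(a) and w[m] <= a[i])

    l = lcp_from(w, a, pos[0], 0)
    r = lcp_from(w, a, pos[n - 1], 0)
    if w_le(0, l):
        return 0
    if not w_le(n - 1, r):
        return n
    lo, hi = 0, n - 1                      # invariant: A_Pos[lo] <_P W <=_P A_Pos[hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            # Cases 1 and 3 compare from symbol l; case 2 needs no new match length.
            m = lcp_from(w, a, pos[mid], l) if llcp[mid] >= l else llcp[mid]
        else:
            # Mirror image, using the right boundary.
            m = lcp_from(w, a, pos[mid], r) if rlcp[mid] >= r else rlcp[mid]
        if w_le(mid, m):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi


if __name__ == "__main__":
    a, pos = "assassin", [0, 3, 6, 7, 2, 5, 1, 4]
    llcp, rlcp = build_lr_lcp(a, pos)
    print([find_lw_lcp(a, pos, llcp, rlcp, w) for w in ["s", "as", "ast", "sin"]])
```

Only the symbols at offset max(l, r) and beyond are ever compared against W, which is the source of the O(P + log N) bound.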
![Page 17: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/17.jpg)
17
The Search algorithm using lcp values – complexity
In each iteration there are at most j+1 comparisons, where in total $\sum_{1}^{\#\text{iterations}} j \le P$
Total comparisons $\le$ (P + #iterations) $\Rightarrow$ O(P + log N) running time
Requires only 3 arrays of size N (Pos, Llcp, Rlcp)
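The bound can be spelled out as a short derivation. The symbol $j_t$ (the value of j in iteration t) is notation introduced here, and the argument is a sketch resting on the observation that max(l, r) never decreases, never exceeds P, and grows by $j_t$ in any iteration that compares symbols:

```latex
\begin{aligned}
\sum_{t=1}^{\#\text{iter}} j_t &\le \max(l, r)\,\big|_{\text{end}} \le P, \\
\#\text{comparisons} &\le \sum_{t=1}^{\#\text{iter}} (j_t + 1)
  = \sum_{t=1}^{\#\text{iter}} j_t + \#\text{iter}
  \le P + O(\log N).
\end{aligned}
```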
![Page 18: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/18.jpg)
18
The Article Overview
1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known).
3. An algorithm for computing the lcp information in O(N log N).
4. Algorithms for expected-time improvement.
![Page 19: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/19.jpg)
19
Construction of suffix array in O(NlogN)
Sorting the suffixes with a special radix sort: we will have O(log N) stages (numbered 1, 2, 4, 8, 16, …)
In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters (the next stage is 2H). Thus, in stage H the suffixes are sorted by $\le_H$
![Page 20: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/20.jpg)
20
Construction of suffix array – The general idea
If $A_i$, $A_j$ are in the same H-bucket, we sort them by the next H symbols, but their next H symbols = the first H symbols of $A_{i+H}$ and $A_{j+H}$, which are already sorted in phase H.
[Figure, H=2: first bucket: abef… ($A_i$), abcd… ($A_j$), ab…; second bucket: bb..., bb…; third bucket: cd…, cd… (contains $A_{j+H}$); fourth bucket: ef… ($A_{i+H}$)]
![Page 21: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/21.jpg)
21
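Construction of suffix array – The general idea (cont.)
Let $A_i$ be in the first H-bucket after stage H
$\Rightarrow$ $A_i$ starts with the smallest H-symbol string
$\Rightarrow$ $A_{i-H}$ should be first in its H-bucket
[Figure, H=2: buckets {abef…, abcd…, ab…}, {bb..., bb…}, {cdef…, cdab…}, {ef…}; $A_i$ is in the first bucket and $A_{i-H}$ = cdab… moves to the front of its bucket]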
![Page 22: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/22.jpg)
22
Construction of suffix array – The algorithm
Go over the suffix array in its current ($\le_H$) order. For each $A_i$:
– Move $A_{i-H}$ to the next available place in its H-bucket
The suffixes are now sorted according to $\le_{2H}$-order
Go over the array again and decide which suffix opens a new 2H-bucket, using lcp knowledge (described later)
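The doubling idea can be tried out with a simplified sketch. It is not the bucket scan described above: it replaces the linear-time pass per stage with a general-purpose sort on the pair (bucket of $A_i$, bucket of $A_{i+H}$), so each stage costs O(N log N) instead of O(N). The function name and the use of -1 for "no suffix at i+H" are choices made for this illustration.

```python
def build_pos_doubling(a):
    """Suffix array by prefix doubling; each stage sorts by the first 2H symbols."""
    n = len(a)
    if n == 0:
        return []
    pos = sorted(range(n), key=lambda i: a[i])        # stage H = 1: sort by first symbol
    bucket = [0] * n                                  # bucket[i] = index of the H-bucket of A_i
    for k in range(1, n):
        bucket[pos[k]] = bucket[pos[k - 1]] + (a[pos[k]] != a[pos[k - 1]])
    h = 1
    while bucket[pos[-1]] < n - 1:                    # stop once every bucket is a singleton
        def key(i):
            # The first 2H symbols of A_i are described by the H-buckets of A_i and A_{i+H};
            # -1 marks a suffix too short to have an A_{i+H}, which sorts first.
            return (bucket[i], bucket[i + h] if i + h < n else -1)
        pos.sort(key=key)
        new_bucket = [0] * n
        for k in range(1, n):
            new_bucket[pos[k]] = new_bucket[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        bucket = new_bucket
        h *= 2
    return pos


if __name__ == "__main__":
    print(build_pos_doubling("assassin"))
```

The number of stages is still O(log N), because H doubles each time and all suffixes are distinguished once H reaches N.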
![Page 23: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/23.jpg)
23
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_3$ sets $A_2$
Pos order: assin, assassin, in, n, sin, ssin, sassin, ssassin
![Page 24: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/24.jpg)
24
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_0$ is scanned; there is no $A_{-1}$ to set
Pos order: assin, assassin, in, n, sassin, ssin, sin, ssassin
![Page 25: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/25.jpg)
25
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_6$ sets $A_5$
Pos order: assin, assassin, in, n, sassin, ssin, sin, ssassin
![Page 26: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/26.jpg)
26
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_7$ sets $A_6$
Pos order: assin, assassin, in, n, sassin, sin, ssin, ssassin
![Page 27: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/27.jpg)
27
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_2$ sets $A_1$
Pos order: assin, assassin, in, n, sassin, sin, ssin, ssassin
![Page 28: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/28.jpg)
28
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_5$ sets $A_4$
Pos order: assin, assassin, in, n, sassin, sin, ssassin, ssin
![Page 29: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/29.jpg)
29
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_1$ sets $A_0$
Pos order: assin, assassin, in, n, sassin, sin, ssassin, ssin
![Page 30: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/30.jpg)
30
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1 ($A_i$ sets $A_{i-1}$): $A_4$ sets $A_3$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 31: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/31.jpg)
31
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=1: go over the array to get the new 2-buckets
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket
![Page 32: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/32.jpg)
32
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_0$ is scanned; there is no $A_{-2}$ to set
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 33: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/33.jpg)
33
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_3$ sets $A_1$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 34: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/34.jpg)
34
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_6$ sets $A_4$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 35: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/35.jpg)
35
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_7$ sets $A_5$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 36: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/36.jpg)
36
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_2$ sets $A_0$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 37: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/37.jpg)
37
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_5$ sets $A_3$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 38: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/38.jpg)
38
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_1$ is scanned; there is no $A_{-1}$ to set
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 39: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/39.jpg)
39
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2 ($A_i$ sets $A_{i-2}$): $A_4$ sets $A_2$
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 40: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/40.jpg)
40
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=2: go over the array to get the new 4-buckets
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
![Page 41: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/41.jpg)
41
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
H=4
Pos order: assassin, assin, in, n, sassin, sin, ssassin, ssin
That’s it, we are sorted!
![Page 42: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/42.jpg)
42
Construction of suffix array – Complexity summary
Sorting by first char – O(N)
O(log N) stages of O(N) operations = O(N log N)
Total:
– time: O(N log N)
– space: 2 integer arrays of size N
![Page 43: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/43.jpg)
43
The Article Overview
1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(N log N) time and O(N) space.
3. An algorithm for computing the lcp information in O(N log N).
4. Algorithms for expected-time improvement.
![Page 44: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/44.jpg)
44
How to find Longest Common Prefixes – the general idea
We don’t care what the lcp is between suffixes in the same H-bucket.
For $A_p$, $A_q$ in the same H-bucket but different 2H-buckets:
– $H \le lcp(A_p, A_q) < 2H$
– $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$
– $lcp(A_{p+H}, A_{q+H}) < H$, which is why $A_{p+H}$, $A_{q+H}$ are in different H-buckets, but which ones?
![Page 45: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/45.jpg)
45
How to find Longest Common Prefixes – the general idea
If $A_{p+H}$ and $A_{q+H}$ are in adjacent H-buckets, then the lcp is known. How?
If not, then for i < j: $lcp(A_{Pos[i]}, A_{Pos[j]}) = \min_{k \in [i, j-1]} \{ lcp(A_{Pos[k]}, A_{Pos[k+1]}) \}$
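The minimum identity is easy to check on a small input. The sketch below builds Pos and the adjacent lcp values naively, purely for illustration; the names `lcp`, `adj` and `lcp_via_adjacent` are invented here.

```python
def lcp(x, y):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def lcp_via_adjacent(adj, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, where adj[k] = lcp(A_Pos[k], A_Pos[k+1])."""
    return min(adj[i:j])


a = "assassin"
pos = sorted(range(len(a)), key=lambda i: a[i:])               # naive suffix array
adj = [lcp(a[pos[k]:], a[pos[k + 1]:]) for k in range(len(a) - 1)]

if __name__ == "__main__":
    # sassin is at position 4 and ssin at position 7 of Pos
    print(adj, lcp_via_adjacent(adj, 4, 7))
```

For positions 4 and 7 the minimum is taken over the adjacent values 1, 1, 2, which is the min{1, 1, 2} = 1 case worked on the next slide.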
![Page 46: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/46.jpg)
46
How to find Longest Common Prefixes – the general idea
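[Figure, H=2: Pos order assassin, assin, in, n, sassin, sin, ssassin, ssin; $A_{q+H}$ and $A_{p+H}$ point at sassin and ssin, and the adjacent lcp values between them are 1, 1, 2]
$lcp(A_{p+H}, A_{q+H}) = \min\{1, 1, 2\} = 1$
Notice that if 2 neighbors are in the same H-bucket we can consider their lcp to be H, since $lcp(A_{p+H}, A_{q+H}) < H$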
![Page 47: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/47.jpg)
47
How to find lcp – algorithm and data structures – Hgt[]
During the construction stage, we build an array called Hgt[N]: Hgt[i] = lcp($A_{Pos[i-1]}$, $A_{Pos[i]}$), initialized so that Hgt[i] = N+1 for every i.
In stage H=1: Hgt[i] = 0 for every $A_{Pos[i]}$ that is first in its bucket.
In stage 2H: we update Hgt[i] for every $A_{Pos[i]}$ that is first in a newly created 2H-bucket
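How Hgt[] fills in stage by stage can be simulated with a short sketch. Two simplifications are assumed: the final sorted order is used at every stage (the real algorithm only has the $\le_H$ order at stage H), and each lcp is computed by direct comparison instead of via H plus a range minimum. The function names are hypothetical.

```python
def lcp(x, y):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def hgt_by_stage(a):
    """Return {H: snapshot of Hgt after stage H}; N+1 marks 'not set yet'."""
    n = len(a)
    pos = sorted(range(n), key=lambda i: a[i:])   # final order, assumed for every stage
    unset = n + 1
    hgt = [unset] * n                             # hgt[0] has no left neighbour and stays unset
    stages = {}
    h = 1
    while True:
        for i in range(1, n):
            x, y = a[pos[i - 1]:], a[pos[i]:]
            # A_Pos[i] opens a new H-bucket when its first H symbols differ from its neighbour's
            if hgt[i] == unset and x[:h] != y[:h]:
                hgt[i] = lcp(x, y)
        stages[h] = hgt.copy()
        if all(v != unset for v in hgt[1:]):
            return stages
        h *= 2


if __name__ == "__main__":
    for h, snapshot in hgt_by_stage("assassin").items():
        print(h, snapshot[1:])
```

For A = assassin the entries are set in three rounds (H = 1, 2, 4), ending with the adjacent lcp values 3, 0, 0, 0, 1, 1, 2.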
![Page 48: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/48.jpg)
48
How to find lcp – Hgt[] example:
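H=1 (9 = N+1 means “not set yet”):

| i | Pos[i] | $A_{Pos[i]}$ | Hgt[i] |
|---|---|---|---|
| 0 | 3 | assin | - |
| 1 | 0 | assassin | 9 |
| 2 | 6 | in | 0 |
| 3 | 7 | n | 0 |
| 4 | 5 | sin | 0 |
| 5 | 4 | ssin | 9 |
| 6 | 2 | sassin | 9 |
| 7 | 1 | ssassin | 9 |

H=2:

| i | Pos[i] | $A_{Pos[i]}$ | Hgt[i] |
|---|---|---|---|
| 0 | 3 | assin | - |
| 1 | 0 | assassin | 9 |
| 2 | 6 | in | 0 |
| 3 | 7 | n | 0 |
| 4 | 2 | sassin | 0 |
| 5 | 5 | sin | 1 |
| 6 | 4 | ssin | 1 |
| 7 | 1 | ssassin | 9 |

lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(n, sin)} = 1 + 0 = 1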
![Page 49: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/49.jpg)
49
How to find lcp – Hgt[] example (cont.)
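H=4:

| i | Pos[i] | $A_{Pos[i]}$ | Hgt[i] |
|---|---|---|---|
| 0 | 0 | assassin | - |
| 1 | 3 | assin | 3 |
| 2 | 6 | in | 0 |
| 3 | 7 | n | 0 |
| 4 | 2 | sassin | 0 |
| 5 | 5 | sin | 1 |
| 6 | 1 | ssassin | 1 |
| 7 | 4 | ssin | 2 |

lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3
lcp(ssassin, ssin) = 2 + lcp(assin, in) = 2 + 0 = 2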
![Page 50: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/50.jpg)
50
How to find lcp –data structures
We need a data structure that will contain lcp($A_{Pos[j]}$, $A_{Pos[i]}$) for any i and j (not just i and i+1, which Hgt[] supplies)
Hgt[] will become the leaves of a balanced binary tree called the interval tree.
![Page 51: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/51.jpg)
51
How to find lcp –example of Interval tree
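[Figure: interval tree whose leaves are the adjacent pairs (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), each holding the Hgt value of its pair; the node values shown range from the initial 9 (= N+1) to the values 0, 1, 2, 3 set during the stages]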
![Page 52: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/52.jpg)
52
How to find lcp –Complexity
Each time a leaf opens a new bucket we change Hgt[i] for that leaf.
That change requires O(log N) changes in the interval tree
There are O(N) leaves, and each opens a new bucket once
In total we get O(N log N) to get all lcp values
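The cost argument can be made concrete with a generic array-backed minimum tree. This is a stand-in, not the interval tree layout of the paper: leaves hold the Hgt value of each adjacent pair, every internal node holds the minimum of its two children, and the class and method names are invented for the sketch.

```python
class MinIntervalTree:
    """Leaves hold values; each internal node holds the min of its children."""

    def __init__(self, size, initial):
        self.size = size
        self.tree = [initial] * (2 * size)   # leaves live at indices size .. 2*size-1

    def update(self, i, value):
        """Set leaf i and refresh the O(log N) nodes on the path to the root."""
        i += self.size
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Minimum over leaves lo .. hi-1, visiting O(log N) nodes."""
        result = float("inf")
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result


if __name__ == "__main__":
    n = 8                                    # N for A = assassin
    tree = MinIntervalTree(n - 1, n + 1)     # leaf k stands for the pair (k, k+1)
    for k, value in enumerate([3, 0, 0, 0, 1, 1, 2]):
        tree.update(k, value)                # final Hgt values from the example
    print(tree.query(4, 7))                  # lcp between Pos positions 4 and 7
```

Each `update` touches one root-to-leaf path, which matches the O(log N) changes per leaf counted above.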
![Page 53: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/53.jpg)
53
The Article Overview
1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(N log N) time and O(N) space.
3. An algorithm for computing the lcp information in O(N log N).
4. Algorithms for expected-time improvement.
![Page 54: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/54.jpg)
54
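Expected-case time improvement of the construction of Pos[]
Assumptions:
– All N-symbol strings are equally likely.
– Under this assumption: expected length of the longest repeated substring = $O(\log_{|\Sigma|} N)$
This immediately implies that the expected construction time of Pos[] is reduced to O(N log log N). Why? Once H exceeds the length of the longest repeated substring, every bucket holds a single suffix, so only O(log log N) doubling stages are needed.
Next is a way to reduce it to O(N).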
![Page 55: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/55.jpg)
55
Expected-case time improvement of the construction of Pos[]
Let $T = \lfloor \log_{|\Sigma|} N \rfloor$
We encode each possible T-length string as an integer with the isomorphism $Int_T(u)$
Map each $A_p$ to $Int_T(A_p) \in [0, |\Sigma|^T - 1]$:
– $Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$
![Page 56: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/56.jpg)
56
Example of the mapping
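$Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$
$|\Sigma| = 4$, a=0, i=1, n=2, s=3
N = 8, $T = \lfloor \log_{|\Sigma|} N \rfloor = 1$

| p | $A_p$ | computation | $Int_T(A_p)$ |
|---|---|---|---|
| 0 | assassin | 0*4^0 + 0 | 0 |
| 1 | ssassin | 3*4^0 + 0 | 3 |
| 2 | sassin | 3*4^0 + 0 | 3 |
| 3 | assin | 0*4^0 + 0 | 0 |
| 4 | ssin | 3*4^0 + 0 | 3 |
| 5 | sin | 3*4^0 + 0 | 3 |
| 6 | in | 1*4^0 + 0 | 1 |
| 7 | n | 2*4^0 + 0 | 2 |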
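The recurrence can be evaluated right to left in a single pass. The sketch below assumes the recurrence $Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$ with $T = \lfloor \log_{|\Sigma|} N \rfloor$; treating the empty suffix as 0 (so short suffixes behave as if padded with the smallest symbol), forcing T to be at least 1, and the function name are choices of this illustration.

```python
def int_t_all(a, alphabet):
    """Return (T, [Int_T(A_p) for every p]) using one right-to-left pass.

    Assumes len(alphabet) >= 2 and that every symbol of `a` is in `alphabet`.
    """
    n, sigma = len(a), len(alphabet)
    code = {c: v for v, c in enumerate(sorted(alphabet))}   # a=0, i=1, n=2, s=3, ...
    t = 1
    while sigma ** (t + 1) <= n:                            # largest T with sigma^T <= N (min 1)
        t += 1
    top = sigma ** (t - 1)
    vals = [0] * (n + 1)                                    # vals[n] = 0 for the empty suffix
    for p in range(n - 1, -1, -1):
        # dropping the last digit of Int_T(A_{p+1}) and prepending a_p as the top digit
        vals[p] = code[a[p]] * top + vals[p + 1] // sigma
    return t, vals[:n]


if __name__ == "__main__":
    print(int_t_all("assassin", "ains"))
```

Each value is obtained from its right neighbour with O(1) arithmetic, which is why all N values take O(N) time.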
![Page 57: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/57.jpg)
57
Expected-case time improvement of the construction of Pos[]
By the definition of $Int_T(A_p)$, it takes O(N) time to compute the $Int_T(A_p)$ values of all suffixes.
So now, instead of starting with H = 1, we start with $H = T = \lfloor \log_{|\Sigma|} N \rfloor$
But since the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$, we will have O(1) stages of the radix sort.
Thus, the total expected time for constructing Pos[] = O(N)
![Page 58: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/58.jpg)
58
So is a suffix array better than a suffix tree?

| | Suffix array | Suffix tree |
|---|---|---|
| Construction time | O(N log N); O(N) for small $\lvert\Sigma\rvert$ (needs additional space) | O(N) |
| Search time | O(P + log N), good for large alphabets | O(P log $\lvert\Sigma\rvert$) |
| Space complexity | requires 2N integers (this is the main advantage) | O(N) |
| Dependent on $\lvert\Sigma\rvert$? | No | Yes |
![Page 59: Suffix Arrays: A new method for on-line string searches](https://reader030.vdocuments.us/reader030/viewer/2022012905/56813b70550346895da47b63/html5/thumbnails/59.jpg)
59