Post on 08-Mar-2019


Tries and Succinct Data Structures

Auto-completion as our target application

Rossano Venturini

rossano@di.unipi.it

Dataset? All the past queries

Searches? Prefix search

Data structure? Trie

How to find top-k efficiently?

Project: implement an autocompletion system

Problem

Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structure that supports the following operations:

• Lookup(P): return 1 if P belongs to D, 0 otherwise;

• WeakPrefix(P): return the consecutive range of strings in D that are prefixed by P, or -1 if there is no such range;

• StrongPrefix(P): return the consecutive range of strings in D sharing the longest common prefix with P;

• Insert(P): insert P into D;

• Delete(P): delete P from D.

D = { ab, bab, bca, cab, cac, cbac, cbba }

n = |D|, m = total length of strings in D, σ = alphabet size
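As a baseline for the operations above, here is a minimal sketch that answers Lookup, WeakPrefix and StrongPrefix on a plain sorted list with binary search. It is not the trie-based solution the slides build toward; the function names (`lookup`, `weak_prefix`, `strong_prefix`) and the 0-based ranges are assumptions made for illustration.

```python
from bisect import bisect_left

# The example dictionary from the slides, kept in sorted order.
D = sorted(["ab", "bab", "bca", "cab", "cac", "cbac", "cbba"])

def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def lookup(P):
    """Return 1 if P belongs to D, 0 otherwise."""
    i = bisect_left(D, P)
    return 1 if i < len(D) and D[i] == P else 0

def weak_prefix(P):
    """Return the (first, last) range of strings prefixed by P, or -1."""
    lo = bisect_left(D, P)
    hi = lo
    while hi < len(D) and D[hi].startswith(P):
        hi += 1  # linear scan for simplicity; a second binary search also works
    return (lo, hi - 1) if hi > lo else -1

def strong_prefix(P):
    """Return the range of strings sharing the longest common prefix with P."""
    i = bisect_left(D, P)
    # In sorted order, a string with maximum LCP is adjacent to P's position.
    neighbours = [D[j] for j in (i - 1, i) if 0 <= j < len(D)]
    best = max(lcp(P, s) for s in neighbours)
    return weak_prefix(P[:best])
```

On the example dictionary, `weak_prefix("c")` gives `(3, 6)`, `weak_prefix("d")` gives `-1`, and `strong_prefix("cbx")` gives `(5, 6)`, the range of strings starting with `cb`.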

Trie

[Figure: trie built on D; the example query is Prefix(c).]

O(m) nodes

O(m log m + m log σ) bits of space

Find all the strings prefixed by any pattern P in O(|P| log σ) time

Insert? Delete?

Save space?
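The uncompacted trie can be sketched in a few lines. This is an illustrative version under stated assumptions, not the deck's implementation: children are kept in a Python dict (expected constant time per symbol) rather than in a sorted array searched in O(log σ), and the names `TrieNode`, `insert`, `prefix` and `count_nodes` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # symbol -> TrieNode
        self.is_end = False   # True if a dictionary string ends here

def insert(root, s):
    """Insert string s, creating one node per new symbol."""
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def prefix(root, P):
    """Return, in sorted order, all dictionary strings prefixed by P."""
    node = root
    for ch in P:                       # one step per symbol of P
        node = node.children.get(ch)
        if node is None:
            return []
    out = []
    def collect(n, acc):
        if n.is_end:
            out.append(P + acc)
        for ch in sorted(n.children):
            collect(n.children[ch], acc + ch)
    collect(node, "")
    return out

def count_nodes(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(c) for c in node.children.values())

root = TrieNode()
for s in ["ab", "bab", "bca", "cab", "cac", "cbac", "cbba"]:
    insert(root, s)
```

Here `prefix(root, "c")` returns `["cab", "cac", "cbac", "cbba"]`, and `count_nodes(root)` is 17, within the O(m) bound for m = 22.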

(Compact) Trie

[Figure: compact trie of D; leaves are numbered 0 to 6 and edges carry variable-length labels.]

O(n) nodes

O(n log m + m log σ) bits of space

Find all the strings prefixed by any pattern P in O(|P| log σ) time

Variable length labels on the edges 😱

Edge labels are replaced by triples <label, len, string id>:

<a,1,0>

<b,0,1>

<a,1,1>

<c,1,2> <a,0,3>

<b,0,3> <c,0,4> <a,1,5> <b,1,6>

<c,1,3>

<b,1,5>

Search: every time we traverse an edge we have to access a dictionary string!

Probably a cache miss at each edge 😱. Can we do better?
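The sketch below builds a compact trie whose edges store a triple instead of an explicit label, loosely modeled on the triples shown on the slide. The slides do not spell out the exact meaning of `len`, so two assumptions are made here: `len` counts the symbols of the label after the first one, and D is prefix-free (no string is a prefix of another). The names `build` and `weak_prefix` are hypothetical.

```python
D = sorted(["ab", "bab", "bca", "cab", "cac", "cbac", "cbba"])

def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build(lo, hi, depth):
    """Node for D[lo:hi], whose strings all share their first `depth` symbols."""
    node = {"range": (lo, hi - 1), "edges": {}}
    if hi - lo == 1 and len(D[lo]) == depth:
        return node                                   # leaf
    i = lo
    while i < hi:
        sym = D[i][depth]
        j = i
        while j < hi and D[j][depth] == sym:          # group by branching symbol
            j += 1
        end = len(D[i]) if j - i == 1 else lcp(D[i], D[j - 1])
        # edge triple: (symbols after the first, string id, child node)
        node["edges"][sym] = (end - depth - 1, i, build(i, j, end))
        i = j
    return node

def weak_prefix(root, P):
    """Return the 0-based range of strings prefixed by P, or -1."""
    node, depth = root, 0
    while depth < len(P):
        edge = node["edges"].get(P[depth])
        if edge is None:
            return -1
        extra, sid, child = edge
        label = D[sid][depth:depth + extra + 1]       # access to a dictionary string
        if not label.startswith(P[depth:depth + len(label)]):
            return -1
        depth += len(label)
        node = child
    return node["range"]

root = build(0, len(D), 0)
```

Every edge traversal in `weak_prefix` reads `D[sid]`, which is exactly the per-edge access to a dictionary string that the slide warns about.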

Patricia Trie (and blind search)

[Figure 1: a running example for the set of strings S = {000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}. Panels: (a) S, with the strings numbered S1 to S9; (b) Patricia trie of S; (c) Front Coding of S; (d) Tuples induced by string abcc.]

P = 000 0 110

Blind search: every time we traverse an edge, we match only its first symbol and skip “label len” symbols in the pattern.

What’s the property of the node we identify?

Any string in its subtree has the same longest common prefix with P as the “correct node”

=> the correct node is an ancestor of the identified node.

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1

(b) Patricia trie of S

1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i

(d) Tuples inducedby string abcc

Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.

5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.

6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.

7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.

8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.

15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1

(b) Patricia trie of S

0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9

(c) Front Coding of S

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i

(d) Tuples inducedby string abcc

Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.

16

Patricia Trie (and blind search)

P = 000 0 110

Blind search: every time we traverse an edge, we match only the single symbol stored on it and skip “label len” symbols in the pattern.

What’s the property of the node we identify?

Any string in its subtree has the same longest common prefix with P as the “correct node”

=> the correct node is an ancestor of the identified node.
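The blind search can be sketched in Python as follows. This is only an illustration under some assumptions: the strings are sorted and prefix-free (as in the set S of Figure 1), the trie is built naively, and the names `Node`, `build`, and `blind_search` are hypothetical. The sketch covers the prefix-range case only: it descends by looking at branching symbols, then performs one full comparison with a string below the reached node to validate the answer.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    depth: int   # length of the string spelled from the root to this node
    lo: int      # first string id (in sorted order) below this node
    hi: int      # last string id below this node
    children: dict = field(default_factory=dict)  # branching symbol -> Node

def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def build(strings, lo, hi):
    """Compacted trie over the sorted, prefix-free strings[lo..hi]."""
    if lo == hi:
        return Node(len(strings[lo]), lo, hi)
    depth = lcp(strings[lo], strings[hi])  # lcp of the whole sorted range
    node = Node(depth, lo, hi)
    start = lo
    for i in range(lo + 1, hi + 2):
        # close a group when the branching symbol changes or the range ends
        if i > hi or strings[i][depth] != strings[start][depth]:
            node.children[strings[start][depth]] = build(strings, start, i - 1)
            start = i
    return node

def blind_search(root, strings, p):
    """Range (lo, hi) of the strings prefixed by p, or None if there is none."""
    node = root
    # Phase 1: follow branching symbols only, skipping the rest of each label.
    while node.depth < len(p) and p[node.depth] in node.children:
        node = node.children[p[node.depth]]
    # Phase 2: a single full comparison with any string below the reached node.
    if lcp(p, strings[node.lo]) < len(p):
        return None
    return (node.lo, node.hi)

S = ["000000000", "000000001", "000001110", "000001111", "000010100",
     "000010101", "00001011", "0001", "1"]
root = build(S, 0, len(S) - 1)
print(blind_search(root, S, "00001"))    # range of the strings prefixed by 00001
print(blind_search(root, S, "0000110"))  # the pattern P of the slide
```

The first phase never reads the skipped symbols, which is why the final comparison is needed: for P = 0000110 the descent reaches a node whose strings start with 0000101, and only the comparison reveals the mismatch.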

Insert? Delete?

Implement a Trie

D = { ab, bab, bca, cab, cac, cbac, cbba }

n = |D|, m = total length of strings in D, σ = alphabet size

[Figure: trie of D, with leaves labeled by the string ids 0 to 6]

A node is a struct with a variable-length vector of (sorted) symbols and pointers to the children.

Binary search on symbols to decide which is the edge to follow.

- Symbols and pointers stored far from the node 😱
- Search time is O(|P| log σ)

Can you solve both these issues?

A node is a struct with a binary vector saying which symbols are represented and a variable-length vector of pointers to the children,

but the space grows to at least σ bits per node!

Obtaining the best of both with an easy-to-implement data structure!
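As an illustration of the first node layout (sorted symbols plus child pointers, searched by binary search), here is a minimal Python sketch. The names `TrieNode`, `insert`, and `lookup` are hypothetical, and Python lists stand in for the variable-length vectors.

```python
from bisect import bisect_left

class TrieNode:
    def __init__(self):
        self.symbols = []    # sorted symbols labelling the outgoing edges
        self.children = []   # children[i] is reached through symbols[i]
        self.is_end = False  # True if a string of D ends at this node

def insert(root, s):
    node = root
    for c in s:
        i = bisect_left(node.symbols, c)       # binary search on the symbols
        if i == len(node.symbols) or node.symbols[i] != c:
            node.symbols.insert(i, c)          # keep the vector sorted
            node.children.insert(i, TrieNode())
        node = node.children[i]
    node.is_end = True

def lookup(root, p):
    """Lookup(P): True if P belongs to D, False otherwise."""
    node = root
    for c in p:
        i = bisect_left(node.symbols, c)       # O(log sigma) per symbol of P
        if i == len(node.symbols) or node.symbols[i] != c:
            return False
        node = node.children[i]
    return node.is_end

root = TrieNode()
for s in ["ab", "bab", "bca", "cab", "cac", "cbac", "cbba"]:
    insert(root, s)
print(lookup(root, "cab"), lookup(root, "ca"))
```

Each step of `lookup` costs one binary search, which gives the O(|P| log σ) bound. The two lists live outside the node object, which is the locality issue the slide points out.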

Ternary Search Tree

Every node has (at most) three children.

The root stores a symbol c and has three subtrees:

- <c: strings that start with a symbol <c. Build the TST recursively.
- =c: strings that start with c. Build the TST recursively after removing the c.
- >c: strings that start with a symbol >c. Build the TST recursively.
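The recursive definition translates almost directly into code. The Python sketch below is an illustration only: the names `TSTNode`, `build_tst`, and `contains` are hypothetical, the splitting symbol c is simply taken from the first string, and an `is_end` flag (an assumption, not shown on the slide) marks nodes where a string ends.

```python
class TSTNode:
    def __init__(self, c):
        self.c = c                         # splitting symbol
        self.lo = self.eq = self.hi = None # the <c, =c and >c subtrees
        self.is_end = False                # True if a string ends with this symbol

def build_tst(strings):
    """Build a TST for a list of distinct, non-empty strings."""
    if not strings:
        return None
    c = strings[0][0]                      # arbitrary choice of c
    node = TSTNode(c)
    node.lo = build_tst([s for s in strings if s[0] < c])
    node.hi = build_tst([s for s in strings if s[0] > c])
    rest = [s[1:] for s in strings if s[0] == c]   # remove the leading c
    node.is_end = "" in rest
    node.eq = build_tst([s for s in rest if s])
    return node

def contains(node, p):
    """True if the non-empty string p was given to build_tst."""
    if not p:
        return False
    i = 0
    while node is not None:
        if p[i] < node.c:
            node = node.lo
        elif p[i] > node.c:
            node = node.hi
        else:
            i += 1                         # matched one symbol of p
            if i == len(p):
                return node.is_end
            node = node.eq
    return False

D = ["ab", "bab", "bca", "cab", "cac", "cbac", "cbba"]
root = build_tst(D)
print(all(contains(root, s) for s in D), contains(root, "ca"))
```

Every node holds one symbol, three pointers, and one flag, so its size does not depend on σ.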

Ternary Search Tree

D = { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad caw cue fee tap ago tar jam dug and }

Nodes have fixed size, independent of σ.

All information is on the node.

What about query time?

Search time is O(|P| log σ) if c is always the symbol in the “middle”…

How can we choose c to have O(|P| + log n) query time?

Query time is O(#middle + #left + #right), the numbers of middle, left, and right edges traversed:

- #middle is O(|P|).
- Select c to pay O(log n) for #left + #right.

~ (3-way) Quicksort and pivot: choose c so that fewer than n/2 strings start with a symbol <c and fewer than n/2 strings start with a symbol >c.

This way every time we go left or right we (at least) halve the number of remaining strings.

Insert? Delete?

Ohhhh it is simple even in C

Project: Is it possible to increase nodes’ fanout to reduce cache misses?
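One way to realize the pivot choice, sketched in Python under stated assumptions: the strings are sorted and c is taken from the median string of the current group, so at most half of the group falls on either side. The names `build_balanced` and `search` are hypothetical, and the `is_end` flag is an addition for the sketch.

```python
class TSTNode:
    def __init__(self, c):
        self.c = c
        self.lo = self.eq = self.hi = None
        self.is_end = False

def build_balanced(strings, depth=0):
    """strings: sorted, distinct, all longer than depth, equal on their first depth symbols."""
    if not strings:
        return None
    c = strings[len(strings) // 2][depth]   # symbol of the median string as pivot
    node = TSTNode(c)
    node.lo = build_balanced([s for s in strings if s[depth] < c], depth)
    node.hi = build_balanced([s for s in strings if s[depth] > c], depth)
    same = [s for s in strings if s[depth] == c]
    node.is_end = any(len(s) == depth + 1 for s in same)
    node.eq = build_balanced([s for s in same if len(s) > depth + 1], depth + 1)
    return node

def search(node, p):
    """Return (found, number of left/right edges traversed) for a non-empty p."""
    i = side = 0
    while node is not None and i < len(p):
        if p[i] < node.c:
            node, side = node.lo, side + 1
        elif p[i] > node.c:
            node, side = node.hi, side + 1
        else:
            i += 1
            if i == len(p):
                return node.is_end, side
            node = node.eq
    return False, side

words = ("now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay "
         "owl joy rap gig wee was cab wad caw cue fee tap ago tar jam dug and").split()
root = build_balanced(sorted(words))
bound = len(words).bit_length() - 1         # floor(log2 n)
worst = max(search(root, w)[1] for w in words)
print(worst, "left/right steps at most; bound is", bound)
```

Since every left or right step at least halves the group, a successful search takes at most floor(log2 n) such steps, on top of the |P| middle steps.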

Dataset? All the past queries

Searches? Prefix search

Data structure? Trie

How to find top-k efficiently?

Finding Top-1

D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }

n = |D|, m = total length of strings in D, σ = alphabet size

[Figure: trie of D, with the score of each string at its leaf]

P = c

How to find Top-1?

Scan to find the maximum!

O(n) query time :-(

Better ideas?

Finding Top-1

D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }

n = |D|, m = total length of strings in D, σ = alphabet size

[Figure: trie of D, with the score of each string at its leaf and each internal node annotated with (max score, string id): (7,0) at the root, (2,1) at b, (6,5) at c, (4,3) at ca, (6,5) at cb]

P = c

How to find Top-1?

Augment each node with the max score (and the id of the string achieving it) within its subtree!

Preprocessing time: O(n)

Extra space: O(n log n) bits

Query time: O(1) (once the node of P is reached)

Solving Top-k?

- Extra space: O(k*n*log n) bits :-(
- You must know k at building time! :-(
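The augmentation can be sketched in Python as follows. This is an illustration, not the slides' implementation: it uses an uncompacted trie with dictionaries for the children, updates the maxima during insertion (so the build costs O(m) here rather than O(n)), and the names `Node`, `build`, and `top1` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.best = None     # (score, string id) of the best string below this node

def build(scored):
    """scored: list of (string, score); the string id is the position in the list."""
    root = Node()
    for sid, (s, score) in enumerate(scored):
        node = root
        path = [root]
        for ch in s:
            node = node.children.setdefault(ch, Node())
            path.append(node)
        for v in path:       # every node on the path has this string in its subtree
            if v.best is None or score > v.best[0]:
                v.best = (score, sid)
    return root

def top1(root, p):
    """(score, id) of the highest-scored string prefixed by p, or None."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.best         # O(1) once the node of p is reached

D = [("ab", 7), ("bab", 2), ("bca", 1), ("cab", 4),
     ("cac", 1), ("cbac", 6), ("cbba", 2)]
root = build(D)
print(top1(root, "c"))       # best completion of P = c
```

For P = c the answer is read directly from the node of c, without visiting the four strings below it.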

Finding Top-1

D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }

n = |D|, m = total length of strings in D, σ = alphabet size

[Figure: trie of D, with the score of each string at its leaf]

P = c

How to find Top-1?

S = 7 2 1 4 1 6 2 (the scores of the strings, in lexicographic order)

RMQ(i,j) = position of the maximum in the range S[i,j]

Assume you have a data structure on top of S answering RMQ in O(1) time by using O(n) bits.

RMQ(3,6) = 5

Can you solve Top-2?
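A small Python sketch of the reduction from Top-1 to RMQ. Two stand-ins are assumptions of the sketch: the prefix range is found on the sorted list instead of the trie, and `rmq` is a plain linear scan in place of the O(1)-time, O(n)-bit structure assumed on the slide. The function names are hypothetical.

```python
from bisect import bisect_left

D = ["ab", "bab", "bca", "cab", "cac", "cbac", "cbba"]   # sorted strings
S = [7, 2, 1, 4, 1, 6, 2]                                # S[i] = score of D[i]

def prefix_range(p):
    """Range (i, j) of the strings prefixed by p, or None (stand-in for the trie)."""
    i = bisect_left(D, p)
    j = i
    while j < len(D) and D[j].startswith(p):
        j += 1
    return (i, j - 1) if j > i else None

def rmq(i, j):
    """Position of the maximum in S[i..j] (linear-scan stand-in)."""
    return max(range(i, j + 1), key=lambda pos: S[pos])

def top1(p):
    r = prefix_range(p)
    if r is None:
        return None
    return D[rmq(*r)]

print(prefix_range("c"), rmq(3, 6), top1("c"))
```

The trie is only needed to map P to a range of string ids; the scores are then handled entirely on the array S.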

Finding Top-k

S = … 3 5 1 7 1 6 10 9 8 7 1 4 2 …

Cartesian Tree

[Figure: Cartesian tree of S, with root 10 and children 7 and 9]

It can be built top-down with RMQ.
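The top-down construction can be sketched in a few lines of Python. As assumptions of the sketch, the 13 visible values are treated as the whole array, `rmq` is a linear scan standing in for a real RMQ structure, and the tree is represented as nested tuples (position, left subtree, right subtree).

```python
S = [3, 5, 1, 7, 1, 6, 10, 9, 8, 7, 1, 4, 2]

def rmq(S, i, j):
    """Position of the (leftmost) maximum in S[i..j]; linear-scan stand-in."""
    return max(range(i, j + 1), key=lambda pos: S[pos])

def cartesian_tree(S, i, j):
    """Cartesian tree of S[i..j] as (position, left subtree, right subtree)."""
    if i > j:
        return None
    m = rmq(S, i, j)                       # the maximum of the range is the root
    return (m, cartesian_tree(S, i, m - 1), cartesian_tree(S, m + 1, j))

tree = cartesian_tree(S, 0, len(S) - 1)
root, left, right = tree
print(S[root], S[left[0]], S[right[0]])    # root value and its two children
```

The root is the maximum of the whole range, and each child is the maximum of the subrange on its side of the root.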

Finding Top-k

S = … 3 5 1 7 1 6 10 9 8 7 1 4 2 …

[Figure: Cartesian tree of S, rooted at the maximum 10, next to a max-heap and the list of results for k = 4]

How to find Top-k?

Visit the nodes starting from the root, and insert each visited node into a max-heap storing at most k elements.

Repeatedly extract (and report) the maximum from the heap, then visit its children.

Results for k = 4: 10, 9, 8, 7

Claim: we “touch” at most 2k nodes ⇒ Query time O(k log k)

Important: the Cartesian tree is not built! The root of a range is the position of its maximum, and its two children are the maxima of the sub-ranges to the left and to the right of that position.

Assume a data structure on top of S that answers RMQ in O(1) time using O(n) bits.

RMQ(i,j) = position of the maximum in the range S[i,j]
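
The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `rmq` and `top_k` are hypothetical, and `rmq` is a linear-time stand-in for the O(1)-time structure the slide assumes. Python's `heapq` is a min-heap, so values are negated to simulate a max-heap.

```python
import heapq

def rmq(S, i, j):
    """Position of the maximum in S[i..j] (inclusive), leftmost on ties.
    Linear-time stand-in for the O(1) structure assumed on the slide."""
    return max(range(i, j + 1), key=lambda p: (S[p], -p))

def top_k(S, lo, hi, k):
    """Return up to k pairs (position, value) from S[lo..hi], largest value first.
    The Cartesian tree is never built: a node is a range plus its RMQ."""
    results = []
    if lo > hi or k <= 0:
        return results
    p = rmq(S, lo, hi)
    heap = [(-S[p], p, lo, hi)]              # negated value: heapq is a min-heap
    while heap and len(results) < k:
        neg, p, l, r = heapq.heappop(heap)   # extract and report the maximum
        results.append((p, -neg))
        if l < p:                            # left child: maximum of S[l..p-1]
            q = rmq(S, l, p - 1)
            heapq.heappush(heap, (-S[q], q, l, p - 1))
        if p < r:                            # right child: maximum of S[p+1..r]
            q = rmq(S, p + 1, r)
            heapq.heappush(heap, (-S[q], q, p + 1, r))
    return results
```

On the slide's array, `top_k([3, 5, 1, 7, 1, 6, 10, 9, 8, 7, 1, 4, 2], 0, 12, 4)` reports the values 10, 9, 8, 7. Each extraction pushes at most two new nodes, which is where the bound of 2k touched nodes comes from.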

Range Maximum Query (1)

position: 0 1 2 3 4 5  6 7 8 9 10 11
S:        3 5 1 7 1 6 10 9 8 7  1  4

Space: $O(n^2 \log n)$ bits   Query time: $O(1)$

Precompute the answer to any possible query.

There are $O(n^2)$ distinct queries!

M[i,j] = RMQ(i,j)

[Figure: table M with rows and columns indexed 0..11; one cell is filled in with the value 3]

Range Maximum Query (2)

position: 0 1 2 3 4 5  6 7 8 9 10 11
S:        3 5 1 7 1 6 10 9 8 7  1  4

Space: $O(n \log^2 n)$ bits   Query time: $O(1)$

The maximum of an interval is the maximum among the maxima of any subintervals that cover it.

Precompute the answer to every interval whose size is a power of 2.

There are $O(\log n)$ such intervals starting at any position $i$.

$M[i,j] = \mathrm{RMQ}(i, i+2^j)$

[Figure: table M with rows i = 0..11 and columns j = 0..4. Example cell: $M[1,3] = \mathrm{RMQ}(1,9) = 6$, since $9 = 1 + 2^3$]

$\mathrm{RMQ}(1,7) = \operatorname{argmax}(S[M[1,2]], S[M[7-2^2,2]]) = 6$, where $M[1,2] = 3$ and $M[3,2] = 6$.

In general, $\mathrm{RMQ}(i,j) = \operatorname{argmax}(S[M[i,len]], S[M[j-2^{len},len]])$, where $len = \lfloor \log_2 (j-i) \rfloor$ for $i < j$, so that both subintervals stay inside $S[i,j]$.
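
A minimal sketch of this power-of-two table in Python is given below. It is an illustration, not the slides' own code. The function names are hypothetical, and it assumes the common convention in which a level-`j` entry covers exactly $2^j$ positions starting at `i`; with that convention the query uses $\lfloor \log_2 (j-i+1) \rfloor$. The table is stored as `M[level][i]`, and ties go to the leftmost position.

```python
def build_sparse_table(S):
    """M[level][i] = position of the maximum of S[i .. i + 2**level - 1]."""
    n = len(S)
    M = [list(range(n))]                 # level 0: windows of a single position
    level = 1
    while (1 << level) <= n:
        half = 1 << (level - 1)
        prev = M[level - 1]
        row = []
        for i in range(n - (1 << level) + 1):
            a, b = prev[i], prev[i + half]        # maxima of the two halves
            row.append(a if S[a] >= S[b] else b)
        M.append(row)
        level += 1
    return M

def rmq(S, M, i, j):
    """Position of the maximum in S[i..j] (inclusive), using two table lookups."""
    level = (j - i + 1).bit_length() - 1          # floor(log2(j - i + 1))
    a = M[level][i]                               # window starting at i
    b = M[level][j - (1 << level) + 1]            # window ending at j
    return a if S[a] >= S[b] else b
```

The two windows may overlap, which is harmless for a maximum. On the slide's array, `rmq(S, M, 1, 7)` returns 6, the position of the value 10.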

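Range Maximum Query (3)

Space: $O(n \log n)$ bits   Query time: $O(\log n)$

position: 0 1 2 3 4 5  6 7 8 9 10 11
S:        3 5 1 7 1 6 10 9 8 7  1  4

[Figure: S split into blocks of log n elements; R = 5 7 10 7 stores the maximum of each block]

Use the previous solution on R!   Space: $O(n \log n)$ bits   Query time: $O(1)$

RMQ(1,10) = ?

• blocks that lie entirely inside the range: O(1) time, through R
• the two partial blocks at the ends of the range: O(log n) time each

With O(1) time for each partial block as well:   Space: $O(n \log n)$ bits   Query time: $O(1)$

Puzzle: Use RMQ to compute LCA(u,v) queries on a tree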

Summary

D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }

n = |D|, m = total length of the strings in D

[Figure: compacted trie of D with the scores 7, 2, 1, 4, 1, 6, 2 at the leaves]

• Find the node “prefixed” by P: O(|P| + log n) time, O(m log σ + n log m) bits
• Compute the top-k strings: O(k log k) time, O(n) bits

3-month query log at Yahoo!: ≈600 million distinct (and clean) queries

Trie requires ≈50 Gbytes!

We will see how to reduce to ≈5 Gbytes!

Patricia trie

D = { ab, bab, bca, cab, cac, cbac, cbba }

n = |D|, m = total length of the strings in D

[Figure: compacted trie of D with leaves numbered 0..6 in lexicographic order. In the Patricia trie each edge keeps only the first symbol and the length of its label: (a,2) and (b,1) and (c,1) leave the root; below b: (a,2), (c,2); below c: (a,1), (b,1); below ca: (b,1), (c,1); below cb: (a,2), (b,2)]

First symbol and length of the label

P = cba

Blind search. We can skip symbols and check at the end.

O(n log m + m log σ) bits   O(|P| + log m) time with TST structure
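
The blind search can be sketched in Python as below. This is an illustrative sketch only: the dictionary-based node layout and the names `build_patricia` and `blind_prefix_search` are assumptions, and the code assumes that D is sorted and prefix-free (no string is a prefix of another), as in the slide's example. Each edge stores just the length of its label, keyed by its first symbol; the skipped symbols are verified once at the end against one string of the reached subtree.

```python
def build_patricia(D, lo=0, hi=None, depth=0):
    """Node for the sorted, prefix-free strings D[lo..hi], which share
    their first `depth` symbols. children: first symbol -> (length, node)."""
    if hi is None:
        hi = len(D) - 1
    node = {"range": (lo, hi), "children": {}}
    if lo == hi and depth == len(D[lo]):
        return node                               # leaf
    i = lo
    while i <= hi:
        first = D[i][depth]
        j = i
        while j < hi and D[j + 1][depth] == first:
            j += 1                                # D[i..j] start with `first`
        if i == j:
            end = len(D[i])                       # single string: edge runs to its end
        else:
            end = depth + 1                       # extend to the LCP of D[i] and D[j]
            while D[i][end] == D[j][end]:
                end += 1
        node["children"][first] = (end - depth, build_patricia(D, i, j, end))
        i = j + 1
    return node

def blind_prefix_search(D, root, P):
    """Range (lo, hi) of the strings in D prefixed by P, or None."""
    node, pos = root, 0
    while pos < len(P):
        edge = node["children"].get(P[pos])
        if edge is None:
            return None
        length, node = edge
        pos += length                             # skip the rest of the label unread
    lo, hi = node["range"]
    return (lo, hi) if D[lo].startswith(P) else None   # check at the end
```

For P = cba on the slide's dictionary the search follows (c,1), (b,1), (a,2), reaches leaf 5, and the final comparison against cbac confirms the match.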

Rank/Select queries

position: 0 1 2 3 4 5 6 7 8 9 10 11
B:        0 0 1 1 1 0 1 0 1 1  0  1

Rank0(j) = # of 0s in B[0,j]

Rank1(j) = # of 1s in B[0,j]

Rank0(7) = 4

Rank1(7) = 8 - Rank0(7) = 4

Select0(j) = position of the j-th 0 in B

Select1(j) = position of the j-th 1 in B

Select1(4) = 6

Space: n + O(n log log n/log n) bits   Query time: O(1)
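
As a reference point for these definitions, here is a naive linear-time sketch in Python. The function names are hypothetical, and the code only spells out the semantics: positions start at 0, Rank is inclusive of position j, and Select counts occurrences from 1. It does not achieve the O(1) query time stated above.

```python
def rank(B, bit, j):
    """Number of occurrences of `bit` in B[0..j] (inclusive)."""
    return sum(1 for x in B[: j + 1] if x == bit)

def select(B, bit, j):
    """Position of the j-th occurrence of `bit` in B (j counted from 1), or -1."""
    seen = 0
    for pos, x in enumerate(B):
        if x == bit:
            seen += 1
            if seen == j:
                return pos
    return -1
```

On the slide's bit vector, `rank(B, 0, 7)` is 4 and `select(B, 1, 4)` is 6, matching the examples above.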

Rank/Select queries

position: 0 1 2 3 4 5 6 7 8 9 10 11
B:        0 0 1 1 1 0 1 0 1 1  0  1

Rank0(j) = # of 0s in B[0,j]

Rank1(j) = # of 1s in B[0,j]

Space: n + O(n log log n/log n) bits   Query time: O(1)

Split B into blocks of 1/2 log n bits (3 bits in the example): 001 110 101 101

B’ = 0 2 3 4 stores the # of 0s before each block.

Rank0(7) = 3 + (# of 0s in the first 2 bits of the block 101) = 3 + 1 = 4

1/2 log n bits fit in a word. O(1) with popcount op!

Alternatively, a table M stores, for every possible block and every prefix length, the # of 0s in that prefix:

M     1 2 3
000   1 2 3
001   1 2 2
010   1 1 2
011   1 1 1
100   0 1 2
101   0 1 1
110   0 0 1
111   0 0 0

How much space for M?

$O(2^{\frac{1}{2} \log n} \log n) = O(\sqrt{n} \log n)$ cells, each uses O(log log n) bits

How much space for B’?

O(n/log n) entries, each uses O(log n) bits ⇒ O(n) bits :-(

Space: O(n) + o(n) bits   Query time: O(1)

Rank/Select queries

position: 0 1 2 3 4 5 6 7 8 9 10 11
B:        0 0 1 1 1 0 1 0 1 1  0  1

Rank0(j) = # of 0s in B[0,j]

Rank1(j) = # of 1s in B[0,j]

Space: n + O(n log log n/log n) bits   Query time: O(1)

Group the blocks of 1/2 log n bits into superblocks!

B’’ = 0 4 stores the # of 0s before each superblock (in the example, 3 blocks per superblock).

B’ = 0 2 3 0 now stores, for each block, the # of 0s from the beginning of its superblock up to the block.

M is the same table as before: the # of 0s in every prefix of every possible block.

Rank0(j) is split into 3 parts:

- # of 0s up to the superblock of j (from B’’)
- # of 0s from the beginning of that superblock up to the block of j (from B’)
- # of 0s within the block of j (from M)

How much space for B’’?

$O(n/\log^2 n)$ entries, each uses O(log n) bits ⇒ O(n/log n) bits :-)

How much space for B’?

O(n/log n) entries, each uses O(log log n) bits ⇒ O(n log log n/log n) bits :-)

How to support Select in O(log n) time?

Select can be solved in O(1) time with a more difficult approach.
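
The two-level layout can be sketched in Python as follows. This is a hedged illustration, not the slides' implementation: the class name, the fixed block sizes, and the linear scan inside a block (standing in for the table M or a popcount instruction) are all assumptions. `select0` shows one possible answer to the O(log n) question, namely a binary search over `rank0`; the slides do not spell that answer out.

```python
class RankBitVector:
    """Rank over a bit list with superblock counts (B'') and block counts (B')."""

    def __init__(self, bits, block=3, blocks_per_super=3):
        self.bits = list(bits)
        self.block = block
        self.super = block * blocks_per_super
        self.super_counts = []    # B'': # of 0s before each superblock
        self.block_counts = []    # B' : # of 0s from superblock start to block start
        zeros_total = 0
        zeros_in_super = 0
        for start in range(0, len(self.bits), block):
            if start % self.super == 0:           # a new superblock begins here
                self.super_counts.append(zeros_total)
                zeros_in_super = 0
            self.block_counts.append(zeros_in_super)
            z = self.bits[start:start + block].count(0)
            zeros_total += z
            zeros_in_super += z

    def rank0(self, j):
        """# of 0s in bits[0..j]: superblock part + block part + in-block part."""
        block_start = (j // self.block) * self.block
        in_block = self.bits[block_start:j + 1].count(0)   # stand-in for M / popcount
        return (self.super_counts[j // self.super]
                + self.block_counts[j // self.block]
                + in_block)

    def rank1(self, j):
        return (j + 1) - self.rank0(j)

    def select0(self, k):
        """Position of the k-th 0 (k from 1) by binary search on rank0, or -1."""
        lo, hi = 0, len(self.bits) - 1
        if k < 1 or not self.bits or self.rank0(hi) < k:
            return -1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank0(mid) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo
```

With the default sizes, the slide's bit vector yields `super_counts == [0, 4]` and `block_counts == [0, 2, 3, 0]`, and `rank0(7)` evaluates to 0 + 3 + 1 = 4.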

Elias-Fano representation

Given a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits   Access(i) in O(1)

i:    1 2 3 4  5  6  7  8
S[i]: 2 3 5 7 11 13 14 24

Trivial representation requires O(n log m) bits :-( Can we do better?

Write each element in binary using log m = log 24 bits (5 bits in the example), and split it into its log n high bits and its log (m/n) low bits:

S[i]   high (log n)   low (log (m/n))
2      000            10
3      000            11
5      001            01
7      001            11
11     010            11
13     011            01
14     011            10
24     110            00

The high bits identify a bucket. Max number of buckets: $2^{\log n} = n$

H: write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013

1 014 15

Max number of buckets 2log n = n

Write buckets’ cardinalities in unary

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013

1 014 15

Max number of buckets 2log n = n

A 1 for each value, at most a 0 for each bucket. <= 2n bits!

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15

Max number of buckets 2log n = n

A 1 for each value, at most a 0 for each bucket. <= 2n bits!

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits Access(i) in O(1)

2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15

Max number of buckets 2log n = n

n log (m/n) bits

A 1 for each value, at most a 0 for each bucket. <= 2n bits!

Elias-Fano representation

Given a sequence S of n (positive) increasing integers up to m

Space: n log (m/n) + O(n) bits    Access(i) in O(1)

S = 2 3 5 7 11 13 14 24

H = 1 1 0 1 1 0 1 0 1 1 0 0 0 1 0 (positions 1 to 15)

L = 10 11 01 11 11 01 10 00

Max number of buckets 2^(log n) = n

Access(6)

1. Select1(6) - 6 = 9 - 6 = 3, which is 011 in binary! These are the high bits

2. Access L at the 6th group of log (m/n) bits (ending at position 6 * log (m/n) = 12) to read the low bits 01. The result is 01101 = 13
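The construction and the Access(6) walk-through can be sketched as follows. This is a toy under stated assumptions: the function names ef_encode and ef_access are hypothetical, H is a plain Python list, L stores each group of low bits as a small integer, and Select1 is done by a linear scan instead of the O(1) structure the slides rely on.

```python
def ef_encode(values, m):
    """Encode an increasing sequence of integers in [0, m].

    Returns (H, L, low_width): H holds the bucket cardinalities in
    unary, L holds the low bits of each value as a small integer.
    """
    n = len(values)
    high_width = max(1, (n - 1).bit_length())         # about log n bits
    low_width = max(0, m.bit_length() - high_width)   # about log (m/n) bits
    H, L = [], []
    bucket = 0
    for v in values:
        high = v >> low_width
        low = v & ((1 << low_width) - 1)
        while bucket < high:   # write a 0 to close each finished bucket
            H.append(0)
            bucket += 1
        H.append(1)            # a 1 for each value
        L.append(low)
    H.append(0)                # close the last non-empty bucket
    return H, L, low_width


def ef_access(H, L, low_width, i):
    """Return the i-th value (i >= 1) as Select1(i) - i joined with L."""
    seen = 0
    for pos, bit in enumerate(H, start=1):   # linear-scan Select1
        seen += bit
        if bit == 1 and seen == i:
            high = pos - i   # number of 0s before the i-th 1
            return (high << low_width) | L[i - 1]
    raise IndexError(i)


S = [2, 3, 5, 7, 11, 13, 14, 24]
H, L, low_width = ef_encode(S, 24)
sixth = ef_access(H, L, low_width, 6)
decoded = [ef_access(H, L, low_width, i) for i in range(1, len(S) + 1)]
```

On the slide's sequence this reproduces the 15-bit H shown above, and Access(6) returns 13.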

Elias-Fano representation

Given a sequence of n (positive) integers summing up to m

i     1  2  3  4
S[i]  2  2  4  3

Trivial representation requires O(n log m) bits :-( Can we do better?

Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros

position  0  1  2  3  4  5  6  7  8  9  10
B         1  0  1  0  1  1  1  0  1  1  0

The i-th value of S is Select0(i)-Select0(i-1)

Example: Select0(3)=7 and Select0(2)=3, so the 3rd value is 7-3 = 4

Space: n log (m/n) + O(n) bits    Select0 in O(1)
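A small sketch of this encoding, with assumptions labeled: the function names are hypothetical, Select0 is a linear scan rather than an O(1) structure, and Select0(0) is taken to be -1 so that the formula also works for the first value (the slides do not state this convention).

```python
def unary_encode(values):
    """Write each positive integer x as x-1 ones followed by a zero."""
    B = []
    for x in values:
        B.extend([1] * (x - 1))
        B.append(0)
    return B


def select0(B, i):
    """0-based position of the i-th zero; -1 when i == 0 (assumed)."""
    if i == 0:
        return -1
    seen = 0
    for pos, bit in enumerate(B):
        if bit == 0:
            seen += 1
            if seen == i:
                return pos
    raise IndexError(i)


def value_at(B, i):
    """The i-th value (i >= 1) is Select0(i) - Select0(i-1)."""
    return select0(B, i) - select0(B, i - 1)


S = [2, 2, 4, 3]
B = unary_encode(S)
third = value_at(B, 3)
recovered = [value_at(B, i) for i in range(1, len(S) + 1)]
```

B has one bit per unit of the total m = 11 and one zero per value, so storing the positions of its zeros with the previous Elias-Fano scheme gives the stated space.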

Unicorn: A System for Searching the Social Graph

Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang

Facebook, Inc.

ABSTRACT

Unicorn is an online, in-memory social graph-aware indexing system designed to search trillions of edges between tens of billions of users and entities on thousands of commodity servers. Unicorn is based on standard concepts in information retrieval, but it includes features to promote results with good social proximity. It also supports queries that require multiple round-trips to leaves in order to retrieve objects that are more than one edge away from source nodes. Unicorn is designed to answer billions of queries per day at latencies in the hundreds of milliseconds, and it serves as an infrastructural building block for Facebook’s Graph Search product. In this paper, we describe the data model and query language supported by Unicorn. We also describe its evolution as it became the primary backend for Facebook’s search offerings.

1. INTRODUCTION

Over the past three years we have built and deployed a search system called Unicorn¹. Unicorn was designed with the goal of being able to quickly and scalably search all basic structured information on the social graph and to perform complex set operations on the results. Unicorn resembles traditional search indexing systems [14, 21, 22] and serves its index from memory, but it differs in significant ways because it was built to support social graph retrieval and social ranking from its inception. Unicorn helps products to splice interesting views of the social graph online for new user experiences.

Unicorn is the primary backend system for Facebook Graph Search and is designed to serve billions of queries per day with response latencies less than a few hundred milliseconds. As the product has grown and organically added more features, Unicorn has been modified to suit the product’s requirements. This paper is intended to serve as both a narrative of the evolution of Unicorn’s architecture, as well as documentation for the major features and components of the system.

¹ The name was chosen because engineers joked that, much like the mythical quadruped, this system would solve all of our problems and heal our woes if only it existed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy.

Proceedings of the VLDB Endowment, Vol. 6, No. 11. Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.

To the best of our knowledge, no other online graph retrieval system has ever been built with the scale of Unicorn in terms of both data volume and query volume. The system serves tens of billions of nodes and trillions of edges at scale while accounting for per-edge privacy, and it must also support realtime updates for all edges and nodes while serving billions of daily queries at low latencies.

This paper includes three main contributions:

• We describe how we applied common information retrieval architectural concepts to the domain of the social graph.

• We discuss key features for promoting socially relevant search results.

• We discuss two operators, apply and extract, which allow rich semantic graph queries.

This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building Unicorn, its design, and basic API. In Section 6, we describe how Unicorn was adapted to serve as the backend for Facebook’s typeahead search. We also discuss how to promote and rank socially relevant results. In Sections 7–8, we build on the implementation of typeahead to construct a new kind of search engine. By performing multi-stage queries that traverse a series of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections 8–10, we talk about privacy, scaling, and the system’s performance characteristics for typical queries.

2. THE SOCIAL GRAPH

Facebook maintains a database of the inter-relationships between the people and things in the real world, which it calls the social graph. Like any other directed graph, it consists of nodes signifying people and things; and edges representing a relationship between two nodes. In the remainder of this paper, we will use the terms node and entity interchangeably.

Facebook’s primary storage and production serving architectures are described in [30]. Entities can be fetched by their primary key, which is a 64-bit identifier (id). We also store the edges between entities. Some edges are directional while others are symmetric, and there are many thousands of edge-types. The most well known edge-type


101 102 103 104 1050

20

40

60

Inner Truncation Limit

RelativeDeviation

(%)

The first two plots above show that increasing the in-ner truncation limit leads to higher latency and cost, withlatency passing 100ms at approximately a limit of 5000.Query cost increases sub-linearly, but there will likely alwaysbe queries whose inner result sets will need to be truncatedto prevent latency from exceeding a reasonable threshold(say, 100ms).

This means that—barring architectural changes—we willalways have queries that require inner truncation. As men-tioned in section 7.1.2, good inner query ranking is typicallythe most useful tool. However, it is always possible to con-struct “needle-in-a-haystack” queries that elicit bad perfor-mance. Query planning and selective denormalization canhelp in many of these cases.

The final plot shows that relative deviation increases grad-ually as the truncation limit increases. Larger outer querieshave higher variance per shard as perceived by the rack ag-gregator. This is likely because network queuing delays be-come more likely as query size increases.

11. RELATED WORKIn the last few years, sustained progress has been made in

scaling graph search via the SPARQL language, and someof these systems focus, like Unicorn, on real-time responseto ad-hoc queries [24, 12]. Where Unicorn seeks to handle afinite number of well understood edges and scale to trillionsof edges, SPARQL engines intend to handle arbitrary graphstructure and complex queries, and scale to tens of millionsof edges [24]. That said, it is interesting to note that thecurrent state-of-the-art in performance is based on variantsof a structure from [12] in which data is vertically partitionedand stored in a column store. This data structure uses aclustered B+ tree of (subject-id, value) for each propertyand emphasizes merge-joins, and thus seems to be evolvingtoward a posting-list-style architecture with fast intersectionand union as supported by Unicorn.

Recently, work has been done on adding keyword searchto SPARQL-style queries [28, 15], leading to the integrationof posting lists retrieval with structured indices. This workis currently at much smaller scale than Unicorn. Startingwith XML data graphs, work has been done to search forsubgraphs based on keywords (see, e.g. [16, 20]). The focusof this work is returning a subgraph, while Unicorn returnsan ordered list of entities.

In some work [13, 18], the term ’social search’ in fact refersto a system that supports question-answering, and the socialgraph is used to predict which person can answer a question.

While some similar ranking features may be used, Unicornsupports queries about the social graph rather than via thegraph, a fundamentally di↵erent application.

12. CONCLUSIONIn this paper, we have described the evolution of a graph-

based indexing system and how we added features to make ituseful for a consumer product that receives billions of queriesper week. Our main contributions are showing how manyinformation retrieval concepts can be put to work for serv-ing graph queries, and we described a simple yet practicalmultiple round-trip algorithm for serving even more com-plex queries where edges are not denormalized and insteadmust be traversed sequentially.

13. ACKNOWLEDGMENTSWe would like to acknowledge the product engineers who

use Unicorn as clients every day, and we also want to thankour production engineers who deal with our bugs and helpextinguish our fires. Also, thanks to Cameron Marlow forcontributing graphics and Dhruba Borthakur for conversa-tions about Facebook’s storage architecture. Several Face-book engineers contributed helpful feedback to this paperincluding Philip Bohannon and Chaitanya Mishra. BretTaylor was instrumental in coming up with many of theoriginal ideas for Unicorn and its design. We would alsolike to acknowledge the following individuals for their con-tributions to Unicorn: Spencer Ahrens, Neil Blakey-Milner,Udeepta Bordoloi, Chris Bray, Jordan DeLong, Shuai Ding,Jeremy Lilley, Jim Norris, Feng Qian, Nathan Schrenk, San-jeev Singh, Ryan Stout, Evan Stratford, Scott Straw, andSherman Ye.

Open SourceAll Unicorn index server and aggregator code is written inC++. Unicorn relies extensively on modules in Facebook’s“Folly” Open Source Library [5]. As part of the e↵ort ofreleasing Graph Search, we have open-sourced a C++ im-plementation of the Elias-Fano index representation [31] aspart of Folly.

14. REFERENCES[1] Apache Hadoop. http://hadoop.apache.org/.[2] Apache Thrift. http://thrift.apache.org/.[3] Description of HHVM (PHP Virtual machine).

https://www.facebook.com/note.php?note_id=

10150415177928920.[4] Facebook Graph Search.

https://www.facebook.com/about/graphsearch.[5] Folly GitHub Repository.

http://github.com/facebook/folly.[6] HPHP for PHP GitHub Repository.

http://github.com/facebook/hiphop-php.[7] Open Compute Project.

http://www.opencompute.org/.[8] Scribe Facebook Blog Post. http://www.

facebook.com/note.php?note_id=32008268919.[9] Scribe GitHub Repository.

http://www.github.com/facebook/scribe.[10] The Life of a Typeahead Query. http:

//www.facebook.com/notes/facebook-engineering/

the-life-of-a-typeahead-query/389105248919.

Unicorn: A System for Searching the Social Graph

Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,

Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,

Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang

Facebook, Inc.

ABSTRACT

Unicorn is an online, in-memory social graph-aware indexing system designed to search trillions of edges between tens of billions of users and entities on thousands of commodity servers. Unicorn is based on standard concepts in information retrieval, but it includes features to promote results with good social proximity. It also supports queries that require multiple round-trips to leaves in order to retrieve objects that are more than one edge away from source nodes. Unicorn is designed to answer billions of queries per day at latencies in the hundreds of milliseconds, and it serves as an infrastructural building block for Facebook’s Graph Search product. In this paper, we describe the data model and query language supported by Unicorn. We also describe its evolution as it became the primary backend for Facebook’s search offerings.

1. INTRODUCTION

Over the past three years we have built and deployed a search system called Unicorn¹. Unicorn was designed with the goal of being able to quickly and scalably search all basic structured information on the social graph and to perform complex set operations on the results. Unicorn resembles traditional search indexing systems [14, 21, 22] and serves its index from memory, but it differs in significant ways because it was built to support social graph retrieval and social ranking from its inception. Unicorn helps products to splice interesting views of the social graph online for new user experiences.

Unicorn is the primary backend system for Facebook Graph Search and is designed to serve billions of queries per day with response latencies less than a few hundred milliseconds. As the product has grown and organically added more features, Unicorn has been modified to suit the product’s requirements. This paper is intended to serve as both a narrative of the evolution of Unicorn’s architecture, as well as documentation for the major features and components of the system.

To the best of our knowledge, no other online graph retrieval system has ever been built with the scale of Unicorn in terms of both data volume and query volume. The system serves tens of billions of nodes and trillions of edges at scale while accounting for per-edge privacy, and it must also support realtime updates for all edges and nodes while serving billions of daily queries at low latencies.

This paper includes three main contributions:

• We describe how we applied common information retrieval architectural concepts to the domain of the social graph.

• We discuss key features for promoting socially relevant search results.

• We discuss two operators, apply and extract, which allow rich semantic graph queries.

This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building Unicorn, its design, and basic API. In Section 6, we describe how Unicorn was adapted to serve as the backend for Facebook’s typeahead search. We also discuss how to promote and rank socially relevant results. In Sections 7–8, we build on the implementation of typeahead to construct a new kind of search engine. By performing multi-stage queries that traverse a series of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections 8–10, we talk about privacy, scaling, and the system’s performance characteristics for typical queries.

¹ The name was chosen because engineers joked that, much like the mythical quadruped, this system would solve all of our problems and heal our woes if only it existed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy.

Proceedings of the VLDB Endowment, Vol. 6, No. 11
Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.

2. THE SOCIAL GRAPH

Facebook maintains a database of the inter-relationships between the people and things in the real world, which it calls the social graph. Like any other directed graph, it consists of nodes signifying people and things; and edges representing a relationship between two nodes. In the remainder of this paper, we will use the terms node and entity interchangeably.

Facebook’s primary storage and production serving architectures are described in [30]. Entities can be fetched by their primary key, which is a 64-bit identifier (id). We also store the edges between entities. Some edges are directional while others are symmetric, and there are many thousands of edge-types. The most well known edge-type

[Plot: Relative Deviation (%), 0 to 60, versus Inner Truncation Limit, 10^1 to 10^5]

The first two plots above show that increasing the inner truncation limit leads to higher latency and cost, with latency passing 100ms at approximately a limit of 5000. Query cost increases sub-linearly, but there will likely always be queries whose inner result sets will need to be truncated to prevent latency from exceeding a reasonable threshold (say, 100ms).

This means that, barring architectural changes, we will always have queries that require inner truncation. As mentioned in Section 7.1.2, good inner query ranking is typically the most useful tool. However, it is always possible to construct “needle-in-a-haystack” queries that elicit bad performance. Query planning and selective denormalization can help in many of these cases.

The final plot shows that relative deviation increases gradually as the truncation limit increases. Larger outer queries have higher variance per shard as perceived by the rack aggregator. This is likely because network queuing delays become more likely as query size increases.

11. RELATED WORK

In the last few years, sustained progress has been made in scaling graph search via the SPARQL language, and some of these systems focus, like Unicorn, on real-time response to ad-hoc queries [24, 12]. Where Unicorn seeks to handle a finite number of well understood edges and scale to trillions of edges, SPARQL engines intend to handle arbitrary graph structure and complex queries, and scale to tens of millions of edges [24]. That said, it is interesting to note that the current state-of-the-art in performance is based on variants of a structure from [12] in which data is vertically partitioned and stored in a column store. This data structure uses a clustered B+ tree of (subject-id, value) for each property and emphasizes merge-joins, and thus seems to be evolving toward a posting-list-style architecture with fast intersection and union as supported by Unicorn.

Recently, work has been done on adding keyword search to SPARQL-style queries [28, 15], leading to the integration of posting lists retrieval with structured indices. This work is currently at much smaller scale than Unicorn. Starting with XML data graphs, work has been done to search for subgraphs based on keywords (see, e.g. [16, 20]). The focus of this work is returning a subgraph, while Unicorn returns an ordered list of entities.

In some work [13, 18], the term 'social search' in fact refers to a system that supports question-answering, and the social graph is used to predict which person can answer a question. While some similar ranking features may be used, Unicorn supports queries about the social graph rather than via the graph, a fundamentally different application.

12. CONCLUSION

In this paper, we have described the evolution of a graph-based indexing system and how we added features to make it useful for a consumer product that receives billions of queries per week. Our main contributions are showing how many information retrieval concepts can be put to work for serving graph queries, and we described a simple yet practical multiple round-trip algorithm for serving even more complex queries where edges are not denormalized and instead must be traversed sequentially.

13. ACKNOWLEDGMENTS

We would like to acknowledge the product engineers who use Unicorn as clients every day, and we also want to thank our production engineers who deal with our bugs and help extinguish our fires. Also, thanks to Cameron Marlow for contributing graphics and Dhruba Borthakur for conversations about Facebook’s storage architecture. Several Facebook engineers contributed helpful feedback to this paper including Philip Bohannon and Chaitanya Mishra. Bret Taylor was instrumental in coming up with many of the original ideas for Unicorn and its design. We would also like to acknowledge the following individuals for their contributions to Unicorn: Spencer Ahrens, Neil Blakey-Milner, Udeepta Bordoloi, Chris Bray, Jordan DeLong, Shuai Ding, Jeremy Lilley, Jim Norris, Feng Qian, Nathan Schrenk, Sanjeev Singh, Ryan Stout, Evan Stratford, Scott Straw, and Sherman Ye.

Open Source

All Unicorn index server and aggregator code is written in C++. Unicorn relies extensively on modules in Facebook’s “Folly” Open Source Library [5]. As part of the effort of releasing Graph Search, we have open-sourced a C++ implementation of the Elias-Fano index representation [31] as part of Folly.

14. REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Apache Thrift. http://thrift.apache.org/.
[3] Description of HHVM (PHP Virtual machine). https://www.facebook.com/note.php?note_id=10150415177928920.
[4] Facebook Graph Search. https://www.facebook.com/about/graphsearch.
[5] Folly GitHub Repository. http://github.com/facebook/folly.
[6] HPHP for PHP GitHub Repository. http://github.com/facebook/hiphop-php.
[7] Open Compute Project. http://www.opencompute.org/.
[8] Scribe Facebook Blog Post. http://www.facebook.com/note.php?note_id=32008268919.
[9] Scribe GitHub Repository. http://www.github.com/facebook/scribe.
[10] The Life of a Typeahead Query. http://www.facebook.com/notes/facebook-engineering/the-life-of-a-typeahead-query/389105248919.

Succinct representation of trees (1)[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

3

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

3

0 2 2

2 20 0

0 0 0 0

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0A tree is uniquely determined by the

degree sequence

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0A tree is uniquely determined by the

degree sequence

How reconstruct the tree?

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

Solution: write them in unary

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

Solution: write them in unary

B

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

Solution: write them in unary

B 1110

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

Solution: write them in unary

B 1110 0 110 110 0 0 110 110 0 0 0 0

[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits Best: 2n bits

Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

Solution: write them in unary

B 1110 0 110 110 0 0 110 110 0 0 0 0

B takes 2n - 1 bits! For each node we have a 0 and a 1

(but the root)

[LOUDS - Level-order unary degree sequence]

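Succinct representation of trees (1)

[LOUDS - Level-order unary degree sequence]

Trivial: O(n log n) bits. Best: 2n bits.

Write the degree sequence in level order.

[Figure: example tree with 12 nodes, each labelled with its degree. Level 1: 3; level 2: 0 2 2; level 3: 0 0 2 2; level 4: 0 0 0 0.]

D = 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(

Solution: write the degrees in unary.

B = 1110 0 110 110 0 0 110 110 0 0 0 0

B takes 2n - 1 bits! Every node contributes a 0 (the end of its own degree) and a 1 (inside its parent's degree), except the root, which has no parent.

Can we navigate the tree?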
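The encoding step above can be sketched in a few lines of Python. This is only an illustration: the function name louds_bits and the dictionary TREE (the slide's example tree, nodes numbered in level order) are assumptions made for the example, not part of the slides.

```python
from collections import deque

def louds_bits(children, root):
    """Return the LOUDS bit string of a tree (no super-root prefix).

    children maps a node to the list of its children, left to right.
    Each node, visited in level order, emits its degree in unary:
    one '1' per child followed by a terminating '0'.
    """
    bits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")
        queue.extend(kids)
    return "".join(bits)

# The 12-node example tree, nodes numbered in level order (assumed layout).
TREE = {1: [2, 3, 4], 3: [5, 6], 4: [7, 8], 7: [9, 10], 8: [11, 12]}
B = louds_bits(TREE, 1)
```

On the example tree B has 23 = 2n - 1 bits, and grouping them node by node gives back the slide's 1110 0 110 110 0 0 110 110 0 0 0 0.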

Succinct representation of trees (1)

[LOUDS - Level-order unary degree sequence]

B = 1 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0

(B is now prefixed with 1 0, so that the root also has its own 1. Positions in B start from 1.)

[Figure: the example tree with nodes numbered in level order. Level 1: 1; level 2: 2 3 4; level 3: 5 6 7 8; level 4: 9 10 11 12. The x-th 1 of B is labelled with node x.]

pos(x) = Select1(x)

Example: pos(5) = Select1(5) = 8

firstChild(x) = ?

  y = Select0(x) + 1   // start of x's children in B

  if B[y] == 0

    return -1          // x is a leaf

  else

    return y - x       // = Rank1(y)

Example: firstChild(3): y = Select0(3) + 1 = 8, so firstChild(3) = 8 - 3 = 5

degree(x) = Select0(x+1) - (Select0(x) + 1)

Example: Select0(4) = 10, so degree(3) = Select0(4) - (Select0(3) + 1) = 10 - 8 = 2

parent(x) = Rank0(pos(x))

All these operations in O(1) time!

subtreeSize(x) = ?

Not efficient! The nodes of a subtree are spread across B.
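The LOUDS formulas above can be checked on the example with a small Python sketch. Rank and Select are implemented here by plain linear scans, which is an assumption made to keep the example short: the O(1) bounds on the slide need the usual auxiliary rank/select structures, which are not shown. Function names are illustrative.

```python
# Slide's bit vector, including the leading "1 0"; positions are 1-based.
B = [int(c) for c in "10" + "11100110110001101100000"]

def rank(bit, i):
    """Occurrences of bit in B[1..i] (naive linear scan)."""
    return sum(1 for b in B[:i] if b == bit)

def select(bit, k):
    """1-based position of the k-th occurrence of bit (naive linear scan)."""
    seen = 0
    for p, b in enumerate(B, start=1):
        if b == bit:
            seen += 1
            if seen == k:
                return p
    raise ValueError("fewer than k occurrences of bit")

def pos(x):
    return select(1, x)

def first_child(x):
    y = select(0, x) + 1      # start of x's children in B
    if B[y - 1] == 0:         # B[y] in the slide's 1-based notation
        return -1             # x is a leaf
    return y - x              # equals rank(1, y)

def degree(x):
    return select(0, x + 1) - (select(0, x) + 1)

def parent(x):
    return rank(0, pos(x))    # 0 for the root
```

For instance, first_child(3) evaluates y = 8 and returns 5, degree(3) returns 2, and parent(5) returns 3, matching the worked examples on the slide.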

Succinct representation of trees (2)

[BP - Balanced parentheses]

[Figure: the example tree with nodes numbered in depth-first (preorder) order. Level 1: 1; level 2: 2 3 6; level 3: 4 5 7 10; level 4: 8 9 11 12. The x-th ( of B is labelled with node x, and so is its matching ).]

B = ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )

A tree is uniquely determined by the vector B. How do we reconstruct the tree?

Subtree of 6: a range of B in which ( and ) balance themselves.

Notation: Select_( and Rank_( refer to open parentheses, Rank_) to closed ones.

pos(x) = Select_( (x)

findClose(x) = returns the position of the ) matching the x-th (

Example: findClose(7)

enclose(x) = returns the position of the ( enclosing the x-th ( = position of the parent of x in B

Example: enclose(10)

They can be implemented in O(1) time.

parent(x) = Rank_( (enclose(x))

firstChild(x) =

  y = pos(x) + 1

  if B[y] == )

    return -1   // x is a leaf

  else

    return Rank_( (y)

sibling(x) = Rank_( (findClose(x) + 1)   (if any)

subtreeSize(x) = (findClose(x) - pos(x) + 1) / 2

depth(x) = Rank_( (pos(x)) - Rank_) (pos(x))

All these operations in O(1) time!

degree(x) = ?

Quite inefficient! Solved by repeatedly calling sibling to scan x's children.
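A minimal Python sketch of the BP operations on the example follows. As an assumption made for brevity, rank, select, findClose and enclose are implemented by linear scans rather than by the constant-time structures the slide refers to; function names are illustrative, and -1 is used for "no such node".

```python
# Slide's parenthesis sequence; positions are 1-based, node x is the x-th "(".
B = "( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )".replace(" ", "")

def rank(ch, i):
    return B[:i].count(ch)                 # occurrences of ch in B[1..i]

def select(ch, k):
    seen = 0
    for p, c in enumerate(B, start=1):
        if c == ch:
            seen += 1
            if seen == k:
                return p
    raise ValueError("fewer than k occurrences")

def pos(x):
    return select("(", x)

def find_close(x):
    excess = 0                             # scan right until the excess is 0
    for p in range(pos(x), len(B) + 1):
        excess += 1 if B[p - 1] == "(" else -1
        if excess == 0:
            return p

def enclose(x):
    excess = 0                             # scan left for the unmatched "("
    for p in range(pos(x) - 1, 0, -1):
        excess += 1 if B[p - 1] == ")" else -1
        if excess < 0:
            return p
    return None                            # x is the root

def parent(x):
    e = enclose(x)
    return -1 if e is None else rank("(", e)

def first_child(x):
    y = pos(x) + 1
    return -1 if B[y - 1] == ")" else rank("(", y)

def sibling(x):
    z = find_close(x) + 1
    if z > len(B) or B[z - 1] == ")":
        return -1                          # x is the last child
    return rank("(", z)

def subtree_size(x):
    return (find_close(x) - pos(x) + 1) // 2

def depth(x):
    return rank("(", pos(x)) - rank(")", pos(x))   # the root has depth 1

def degree(x):
    count, c = 0, first_child(x)           # scan the children via sibling
    while c != -1:
        count, c = count + 1, sibling(c)
    return count
```

With this layout the subtree of node 6 spans positions 10 to 23 of B, so subtree_size(6) is 7, while degree has to walk the children one by one, which is the inefficiency the slide points out.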

Succinct representation of trees (3)

[DFUDS - Depth First Unary Degree Sequence]

[Figure: the example tree with nodes numbered in depth-first (preorder) order. Level 1: 1; level 2: 2 3 6; level 3: 4 5 7 10; level 4: 8 9 11 12.]

B = ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

Highlighted on B: the subtree of 6; the children of 1; the children of 6.

pos(x) = Select_) (x)   // closing )

Example: pos(6)

id(pos) = Rank_) (pos)

degree(x) = Select_) (Rank_) (pos(x))) - x

child(x,i), parent(x), subtreeSize(x), …

All these operations in O(1) time!
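A short Python sketch of how the DFUDS sequence of the example is built and queried. The builder dfuds, the dictionary TREE and the linear-scan rank/select are assumptions for illustration. The degree is computed here directly from the layout, as the number of ( between the closing parentheses of nodes x-1 and x; this is one way to evaluate it on this encoding and is not a transcription of the slide's formula.

```python
def dfuds(children, root):
    """One leading '(' and then, in preorder, each node's degree in unary:
    one '(' per child followed by a ')'."""
    out = ["("]
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        out.append("(" * len(kids) + ")")
        stack.extend(reversed(kids))       # leftmost child is visited first
    return "".join(out)

# The 12-node example tree, nodes numbered in preorder (assumed layout).
TREE = {1: [2, 3, 6], 3: [4, 5], 6: [7, 10], 7: [8, 9], 10: [11, 12]}
B = dfuds(TREE, 1)

def rank_close(p):
    return B[:p].count(")")                # number of ')' in B[1..p]

def select_close(k):
    seen = 0
    for p, c in enumerate(B, start=1):
        if c == ")":
            seen += 1
            if seen == k:
                return p
    raise ValueError("fewer than k closing parentheses")

def pos(x):
    return select_close(x)                 # closing ')' of node x

def node_id(p):
    return rank_close(p)

def degree(x):
    prev = pos(x - 1) if x > 1 else 1      # position 1 is the extra leading '('
    return pos(x) - prev - 1
```

On the example, pos(6) is 14 and the two ( just before it (positions 12 and 13) stand for the two children of node 6.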

Patricia trie with DFUDS

B = ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

[Figure: the 12-node example tree (preorder numbering) drawn as a Patricia trie. Edge labels as extracted from the slide, each a character followed by a number: a,2 b,1 a,2 c,2 c,1 a,1 b,1 b,2 a,2 b,1 c,1.]

S = * a b c a c b a c b b a

L = * 2 1 2 2 1 1 1 1 1 2 2

Searching for P = cba: c? b? a?

Search P in O(|P|) time!

L: Elias-Fano representation, n log(m/n) + O(n) bits and O(1) time access.

S, L and B use m log σ + n log(m/n) + O(n) bits.

Project: implement a compact TST

Summary

D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }

n = |D|, m = total length of the strings in D

[Figure: trie of D with the scores 7, 2, 1, 4, 1, 6, 2 attached to the strings.]

Find the node “prefixed” by P: O(|P|) time, m log σ + n log(m/n) + O(n) bits

(to be compared with O(m log σ + n log m) bits)

Compute the top-k strings: O(k log k) time, O(n) bits

3-month query log at Yahoo!

≈600 million distinct (and clean) queries

Trie requires ≈50 Gbytes!

We will see how to reduce to ≈5 Gbytes!

n = 1 billion strings of average length 64

m = 64 × 10^9 symbols and m/n = 64
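The comparison between n log(m/n) and n log m can be made concrete with the numbers on the slide. The sketch below is only back-of-the-envelope arithmetic: the function names are hypothetical, logarithms are base 2 and rounded up, and both the m log σ term (common to the two bounds) and the O(n) term are left out.

```python
import math

def pointer_style_bits(n, m):
    """n * ceil(log2 m): one log m-bit integer per string."""
    return n * math.ceil(math.log2(m))

def succinct_bits(n, m):
    """n * ceil(log2(m / n)): the n log(m/n) term of the succinct bound."""
    return n * math.ceil(math.log2(m / n))

n = 10**9          # 1 billion strings
m = 64 * n         # average length 64, so m = 64 * 10^9 symbols

naive = pointer_style_bits(n, m)   # 36 bits per string
compact = succinct_bits(n, m)      # 6 bits per string
```

With these figures the n log m term costs 36 bits per string (about 4.5 Gbytes in total) against 6 bits per string (about 0.75 Gbytes) for n log(m/n).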
