by Luc Devroye · luc.devroye.org/HenriMertens-Stringology.pdf

DATA STRUCTURES FOR STRINGS, COMPRESSION (Notes by Gauri Mertens)



DATA STRUCTURES FOR STRINGS

Overview:

For collections of strings, words:
- Tries, PATRICIA trees, digital search trees

For texts, files:
- Suffix tries, suffix trees, suffix arrays


TRIES

From "retrieval" (Fredkin, 1960).

Data: n strings x_1, ..., x_n from an alphabet A, e.g.,

A = {0, 1} (binary)
A = {0, 1, ..., 9} (decimal)
A = {A, C, G, T} (DNA)
A = {a, b, c, ..., z}
A = {0, 1, ..., 9, a, ...} (text)

Every string is infinite (otherwise pad it).


If |A| = k, then a trie is a k-ary position tree.

Each string corresponds to a path.

For binary strings, a 0 indicates "go left" and a 1 indicates "go right".

[Figure: the path from the root for a binary string, and a binary trie with the strings at its leaves.]


ORDINARY TRIE

Cut each string minimally so that all strings end up in a unique leaf.

[Figure: binary trie for example data x_1 = (000...), x_2 = (0010...), ..., x_6 = (11111...).]

# Leaves = n = # Strings

Leaves point to the strings.
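As an illustration of the ordinary trie, here is a minimal Python sketch. It assumes the strings are distinct and that none is a prefix of another (the notes achieve this by padding). The names `Node`, `build_trie`, `leaves` and the example data are hypothetical and not taken from the notes.

```python
class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.string = None   # set only at leaves


def build_trie(strings, depth=0):
    """Ordinary trie. Assumes distinct strings, none a prefix of another."""
    node = Node()
    if len(strings) == 1:
        # The string is alone in this subtree: cut it here and make a leaf.
        node.string = strings[0]
        return node
    groups = {}
    for s in strings:
        groups.setdefault(s[depth], []).append(s)
    for symbol, group in groups.items():
        node.children[symbol] = build_trie(group, depth + 1)
    return node


def leaves(node):
    """Yield the strings that the leaves point to."""
    if node.string is not None:
        yield node.string
    for child in node.children.values():
        yield from leaves(child)


data = ["000", "0010", "0011", "10", "110", "111"]  # hypothetical example data
root = build_trie(data)
print(sorted(leaves(root)))
```

Each leaf holds exactly one string, so the number of leaves equals the number of strings.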


Remark: For finite strings that can be prefixes of other strings, one can store all strings in the trie, and mark the ends of strings.

[Figure: a trie in which the nodes where strings end are marked.]


Remark: Large alphabets.

Instead of having cells with arrays of pointers to children

[Figure: a node with an array of k child pointers.]

one can use linked lists of children (de la Briandais):

[Figure: a node whose children form a linked list.]

This saves space!
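The linked-list representation can be sketched as follows. This is only one possible layout (a first-child pointer plus a next-sibling pointer), combined with the end-of-string marks from the earlier remark. The names `ListNode`, `find_child`, `insert`, `contains` and the example words are hypothetical.

```python
class ListNode:
    def __init__(self, symbol):
        self.symbol = symbol
        self.child = None     # first child
        self.sibling = None   # next child of the same parent
        self.end = False      # marks the end of a stored string


def find_child(node, symbol):
    """Scan the linked list of children for the given symbol."""
    c = node.child
    while c is not None and c.symbol != symbol:
        c = c.sibling
    return c


def insert(root, word):
    node = root
    for symbol in word:
        nxt = find_child(node, symbol)
        if nxt is None:
            nxt = ListNode(symbol)
            nxt.sibling = node.child   # push onto the front of the child list
            node.child = nxt
        node = nxt
    node.end = True


def contains(root, word):
    node = root
    for symbol in word:
        node = find_child(node, symbol)
        if node is None:
            return False
    return node.end


root = ListNode(None)
for w in ["cat", "car", "dog"]:
    insert(root, w)
print(contains(root, "car"), contains(root, "ca"))
```

Each node uses two pointers regardless of the alphabet size, at the price of a linear scan over the children of a node.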


Remark: To control the tree size, one can collapse all unmarked 1-child nodes, as in the PATRICIA tree (Morrison, 1968).

[Figure: a chain of 1-child nodes collapsed into a single edge.]

Associate a substring with a left child.


In a PATRICIA tree, the # of nodes is ≤ 2n - 1.

Proof.

# Leaves = n

# Nodes = N_0 + N_1 + N_2 + ..., with N_1 = 0 (by collapse) (N_i = # nodes of degree i)

# Edges = # Nodes - 1 = 2 N_2 + 3 N_3 + 4 N_4 + ...
        ≥ 2 (N_2 + N_3 + N_4 + ...) = 2 (# Nodes - n)

⇒ # Nodes ≤ 2n - 1. □

Note: We assumed that N_1 = 0. However, in the presence of marked nodes, or when the root has only one child, N_1 > 0. Nevertheless, we still have # Nodes ≤ 2n in that case. (Exercise)
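The bound can be checked numerically with a small sketch. It counts the nodes of the collapsed trie without building it, assuming distinct strings with none a prefix of another, and it also collapses a one-child root so that N_1 = 0. The function name `patricia_nodes` and both data sets are hypothetical.

```python
def patricia_nodes(strings, depth=0):
    """Number of nodes after collapsing all 1-child nodes.

    Assumes distinct strings, none a prefix of another.
    """
    if len(strings) == 1:
        return 1                      # a leaf
    # Skip positions where all strings agree: these are the collapsed nodes.
    while len({s[depth] for s in strings}) == 1:
        depth += 1
    groups = {}
    for s in strings:
        groups.setdefault(s[depth], []).append(s)
    return 1 + sum(patricia_nodes(g, depth + 1) for g in groups.values())


binary = ["000", "0010", "0011", "10", "110", "111"]   # hypothetical data
dna = ["ACG", "ACT", "AG", "C", "T"]                   # hypothetical data
for data in (binary, dna):
    print(patricia_nodes(data), "<=", 2 * len(data) - 1)
```

For binary data every internal node has exactly two children, so the bound 2n - 1 is attained; with a larger alphabet the count can be smaller.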


DIGITAL SEARCH TREE (DST)

Add the strings one by one and associate each string with one node, namely, the first free node on its path.

Example. [Figure: the same data stored in a trie and in a DST, e.g., x_3 = 00101...]

Size of DST = n.
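A minimal sketch of DST insertion and search, assuming the strings are distinct and long enough (padded) that the walk never runs off the end of a string. The names `DSTNode`, `dst_insert`, `dst_search`, `dst_size` and the example data are hypothetical.

```python
class DSTNode:
    def __init__(self, string):
        self.string = string   # every node stores exactly one string
        self.children = {}     # symbol -> DSTNode


def dst_insert(root, s):
    """Place s in the first free node on its path; returns the root."""
    if root is None:
        return DSTNode(s)
    node, depth = root, 0
    while s[depth] in node.children:
        node = node.children[s[depth]]
        depth += 1
    node.children[s[depth]] = DSTNode(s)
    return root


def dst_search(root, s):
    """Follow the path of s, comparing with the string stored at each node."""
    node, depth = root, 0
    while node is not None:
        if node.string == s:
            return True
        node = node.children.get(s[depth])
        depth += 1
    return False


def dst_size(node):
    return 1 + sum(dst_size(c) for c in node.children.values())


data = ["00000", "00101", "01100", "10010", "11011", "11111"]  # hypothetical
root = None
for s in data:
    root = dst_insert(root, s)
print(dst_size(root))
```

Since each string occupies exactly one node, the size of the tree equals the number of strings.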


THE SUFFIX TRIE:

A trie for all suffixes in a text.

Text: T[1..n]

Its suffixes: x_1 = T[1..n], x_2 = T[2..n], ..., x_n = T[n..n]

SUFFIX TREE: Collapse the suffix trie's unary nodes as in a PATRICIA tree. Edges refer to substrings in the text, e.g., by a pair of positions [i, j], and leaves (and some internals) point to places in the text.

Text:     0 1 0 0 0 0 1 0
Position: 1 2 3 4 5 6 7 8

[Figure: suffix tree for this text. Edge labels are position pairs that point back to the text, e.g., 00010 = [4, 8] and 010 = [6, 8].]

Storage: O(n) (Exercise)
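To make the position-pair edge labels concrete, here is a naive sketch that builds the collapsed tree directly from the suffixes. It is a quadratic-time illustration, not an efficient suffix tree construction. As an assumption not in the notes, a terminal symbol `$` that occurs nowhere else is appended so that no suffix is a prefix of another. Positions are 1-indexed as in the notes. The names `suffix_tree` and `count_nodes` are hypothetical.

```python
def suffix_tree(text, starts=None, depth=0):
    """Naive collapsed suffix trie; text must end in a unique terminal symbol.

    A node is a dict: "leaf" holds the 1-indexed start of a suffix (or None),
    "edges" is a list of ((i, j), child) where T[i..j] is the edge label.
    """
    if starts is None:
        starts = list(range(len(text)))
    if len(starts) == 1:
        return {"leaf": starts[0] + 1, "edges": []}
    groups = {}
    for s in starts:
        groups.setdefault(text[s + depth], []).append(s)
    edges = []
    for group in groups.values():
        if len(group) == 1:
            # A lone suffix: the edge runs to the end of the text.
            length = len(text) - (group[0] + depth)
        else:
            # Extend the edge while all suffixes in the group agree.
            length = 1
            while len({text[s + depth + length] for s in group}) == 1:
                length += 1
        i = group[0] + depth + 1          # 1-indexed first position
        j = i + length - 1                # 1-indexed last position
        edges.append(((i, j), suffix_tree(text, group, depth + length)))
    return {"leaf": None, "edges": edges}


def count_nodes(node):
    return 1 + sum(count_nodes(child) for _, child in node["edges"])


text = "01000010$"   # the example text plus an assumed terminal symbol
tree = suffix_tree(text)
print(count_nodes(tree), "nodes for", len(text), "suffixes")
```

Every edge stores only two integers, so the storage is proportional to the number of nodes rather than to the total length of the edge labels.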


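Exercise: Find a binary suffix trie of size Ω(n^2).

Note: Searching for a string P = p_1 p_2 ... p_k involves following the path p_1 p_2 ... p_k from the root. The occurrences of P are all marked nodes in the subtree reached.

[Figure: the path p_1 ... p_k from the root, and below it the subtree with its marked nodes.]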
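The search can be sketched on an uncompressed suffix trie. In this sketch a node is marked with the 1-indexed start position of the suffix that ends there. The dictionary layout and the names `build_suffix_trie` and `find` are hypothetical choices for illustration.

```python
def build_suffix_trie(text):
    """Insert every suffix; mark its last node with its 1-indexed start."""
    root = {"children": {}, "mark": None}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node["children"].setdefault(
                symbol, {"children": {}, "mark": None})
        node["mark"] = start + 1
    return root


def find(root, pattern):
    """Return the sorted 1-indexed positions where pattern occurs."""
    node = root
    for symbol in pattern:            # follow the path p_1 ... p_k
        node = node["children"].get(symbol)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:                      # collect all marks in the subtree
        cur = stack.pop()
        if cur["mark"] is not None:
            found.append(cur["mark"])
        stack.extend(cur["children"].values())
    return sorted(found)


trie = build_suffix_trie("01000010")
print(find(trie, "00"))
```

Each marked node below the end of the path is a suffix that starts with P, so its start position is an occurrence of P in the text.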


SUFFIX ARRAY

An array of sorted suffixes.

Example: Text = HELLO

[Figure: the suffixes of HELLO in sorted order.]

Search for a substring proceeds by binary search.
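A minimal sketch of a suffix array with binary search, using a naive sort of the suffixes rather than a specialised construction algorithm. Positions are 1-indexed to match T[1..n]. The names `suffix_array` and `occurrences` are hypothetical.

```python
def suffix_array(text):
    """1-indexed start positions of the suffixes, in sorted suffix order."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]


def occurrences(text, sa, pattern):
    """All 1-indexed positions of pattern, found by two binary searches."""
    m = len(pattern)

    def prefix(k):
        # First m symbols of the k-th smallest suffix.
        start = sa[k] - 1
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:                    # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                    # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "HELLO"
sa = suffix_array(text)
print(sa)                             # suffix order: ELLO, HELLO, LLO, LO, O
print(occurrences(text, sa, "L"))
```

The suffixes that start with the pattern form one contiguous block of the array, which is why two binary searches suffice to delimit all occurrences.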