unit 06 : index and distributed caching comp 5323 web database technologies and applications 2014

Unit 06 : Index and Distributed Caching

COMP 5323Web Database Technologies and

Applications 2014

• This PowerPoint is prepared for educational purpose and is strictly used in the classroom lecturing.

• We have adopted the "Fair Use" doctrine in this PowerPoint which allows limited copying of copyrighted works for educational and research purposes.

Doctrine of Fair Use

http://ikonexpress-uwm.com/images/copyright_symbol.bmp

Learning Objectives

• Understand different path index techniques to improve the performance.

• Learn a distributed memory caching system which improves the performance of web database applications

Outline

1.Index for Semi-structured data2.Distributed Caching

1 Index for Semi-structured Data

6

Why is Indexing Needed?

• Allows fast access to data by replicating portions of the data in special purpose structures.

• Despite the additional cost (storage, maintenance and complexity) they have shown to be useful in evaluating queries.

Index Types

• Structural index– Accessing all elements of given name– Ancestor-descendant and parent-child

relationship between elements

• Content index– Accessing elements containing given

keywords– Supporting most text search functionalities

Classical Content Index

• Classically based on inverted lists – For each term, gives the

doc.ID + localization• Several variations allows

different search types– Offset, Relative, Proximity

• Generally stored in a B+-Tree to optimize search for a given word

• Size is an important issue– Memory and Disk

• (word, localization)– Fixed entry (word

repeated)

• (word, Frequency, (localization)*)– Variable length entry

Words Localization

- t1 : doc1-100, doc1-300, doc3-200, …

- t2 : doc2-30, doc4-70, …

- t3 : doc4-87, doc5-754, …

Short Reference: http://www.igi-global.com/dictionary/inverted-index/15654

9

Problem with XML

• Support of element addressing– Doc.ID should include

NodeId (Xpath) + Offset• Index size becomes very

large– XPath are long

• Support of typed data– Integer, float, simple types

of XML schema– Requires classical indexes

for certain elements

• Query processing– Structural joins– Text search– Exact search

• Support of updates– Incremental updates

would be a plus

Path-based approach

• Represent XML document into tree or graph structure

• Index XML document directly– Without the support of DTD

• Mainly use the memory as the index storage• Properties– Keep the structural information to improve query

performance– Easy to support query with regular path expression

Different Approaches• Patricia Trie

– Cooper et al. 2001 • DataGuides

– J. McHugh et al., 1997• T-Index

– Tova Milo and Dan Sucin• APEX (Adaptive Path Index for XML Data)

– C. W. Chung et al., 2002• Dewey Structure• K-ary Table• Path Table• OrdPath

Partricia Tries

• A compact representation of a trie in which any node that is an only child is merged with its parent.

• Also known as radix tree

Partricia Tries

• Cooper et al. 2001 • Idea:– Partitioned Partricia Tries to index strings– Encode XPath expressions as strings

(encode names, encode atomic values)

<book> <author>Whoever</author> <author>Not me</author> <title>No Kidding</title></book>

B A 1 WhoeverB A 2 Not meB T No Kidding

DataGuides

• World-Wide Web demonstrates that much of the information available online is semistructured.

• Graph-based data model called OEM, for Object Exchange Model

A sample OEM database

Lore Language (Lorel Query)

• Select Restaurant.Entree – returns all entrees served by any restaurant,

the set of objects {6, 10, 11}

• Select Restaurant.Name• where Restaurant.Entree = “Burger”– The answer to the query is the single object 5.

DataGuide

T-index

1-Index

1-Index DataGuides

2-Index

Representation of XML Data Structure

DataGuide

APEX

HAPEX GAPEX

Dewey - Structure• Each node is assigned a label that represents the path from

the document’s root to the node.• Each component of the label represents the local order of an

ancestor node.• Nodes with the same number of delimiters (“.”) in their label

are in the same level.Bib

book paper

paperauthor

Tim Sarah

author

(0)(0)

(0.0)(0.0)

(0.0.0)(0.0.0)

(0.0.0.0)(0.0.0.0)

(0.1)(0.1)

(0.2)(0.2)

(0.2.0)(0.2.0)

(0.2.0.0)(0.2.0.0)

Dewey – Supported Queries (1/3)

• Ancestors / Descendants– Node “X” is an ancestor of node “Y” if the label of

node “X” is a substring of the label of node “Y”.

Bib

book paper

paperauthor

Tim Sarah

author

(0)(0)

(0.0)(0.0)

(0.0.0)(0.0.0)

(0.0.0.0)(0.0.0.0)

(0.1)(0.1)

(0.2)(0.2)

(0.2.0)(0.2.0)

(0.2.0.0)(0.2.0.0)

Dewey – Supported Queries (2/3)• Parent / Child– Node “X” is parent of node “Y” if:

- The label of node “X” is a substring of the label of node “Y”

- And frags(X) = frags(Y) – 1, where frags(X) is the number of delimiters of the label of node X and frags(Y) is the number of delimiters of label of node Y.

Bib

book paper

paperauthor

Tim Sarah

author

(0)(0)

(0.0)(0.0)

(0.0.0)(0.0.0)

(0.0.0.0)(0.0.0.0)

(0.1)(0.1)

(0.2)(0.2)

(0.2.0)(0.2.0)

(0.2.0.0)(0.2.0.0)

Dewey – Supported Queries (3/3)• Siblings– Nodes “X” and “Y” are siblings if:

- They have the same number of delimiters in their labels - And X.prefix = Y.prefix, where prefix is the label of the

node without its positional identifier

Bib

book paper

paperauthor

Tim Sarah

author

(0)(0)

(0.0)(0.0)

(0.0.0)(0.0.0)

(0.0.0.0)(0.0.0.0)

(0.1)(0.1)

(0.2)(0.2)

(0.2.0)(0.2.0)

(0.2.0.0)(0.2.0.0)

Dewey – Updates• Insertion of new node– The label of the nodes in the subtree rooted at the

following sibling need to be updated– O(n) nodes need relabeling, where n is the number of

nodes of the XML fileBib

book paper

paperauthor

TimSarah

author

(0)(0)

(0.0)(0.0)

(0.0.0)(0.0.0)

(0.0.0.0)(0.0.0.0)

(0.1)(0.1)

(0.2)(0.2)

(0.2.0)(0.2.0)

(0.2.0.0)(0.2.0.0)

paper(0.2)(0.2)

(0.3)(0.3)

(0.3.0)(0.3.0)

(0.3.0.0)(0.3.0.0)

Dewey• Not efficient for dynamic XML files with many updates– Need to re-label many nodes

• As the depth of the tree increases:– Label size of a node increases rapidly • Storage size increases rapidly

– It becomes more costly to infer the supported queries between any two nodes (the string prefix matching becomes longer)

• Overflow problem– The original fixed length of bits assigned to store the

size of the label is not enough.

Document Tree

real node

virtual node

aa

3-ary tree

cc cc

dd ee ff gg

hh ii jj

ee ee ee ee ee

ee

ee

• Lee et al, ACM DL 1996.• Represent each document as a k-ary complete tree and assign a UID to each node

K-ary table

• Each document is assigned k, which is the maximum number of siblings in the document tree.

• Each element has an entry (row) in the K-ary table

• When a query is issued, the result set has pointers to the K-ary table.

Level and Element Type Number

• Level– Level means the level in the document tree– It gives a clue how many parent function is applied to

get to a target element• Element type number– A unique number is assigned to each element type in

DTD It enables to filter out unnecessary elements and accumulate the correct frequencies

• Element location– The unique position of an element instance in a

document tree

Element Labeling

Document(1)

Para(5)

Abstract(3)

Chapter(4)

Section(6)

Para(7)

Title(2)

UID

element UID element UID

D1 C1 C2 S1 S2

1 2 3 4 5

S3 S4 P1 P2 P3

8 9 14 15 16

Result of assigning UIDs

parent(i) = [(i-2)/k+1]

dd

3-ary tree

cc cc

ss ss ss ss

pp pp pp

ee ee ee ee ee

ee

• Unique element identifier

36

XML Index Path Table (Oracle)

BaseRid Path OrderKey

Value Locator NumValue

Rid1 po

Rid1 po.data 1 11

Rid1 po.data.item

1.1 “foo” 17

Rid1 po.data.pkg

1.2 “123” 32 123

Rid1 po.data.item

1.3 “bar” 46

<po> <data> <item>foo</item> <pkg>123</pkg> <item>bar</item> </data></po>

Some Typos

37

OrdPath

• ORDPATHs: Insert-Friendly XML Node Labels– Patrick O’Neil, Elizabeth O’Neil1, Shankar Pal,

Istvan Cseri, Gideon Schaller, Nigel Westbury– SIGMOD 2004– SQL Server 2005 implementation

38

OrdPath

• Aims to provide efficient insertion at any position of an XML tree, and also supports extremely high performance query plans for native XML queries.

• Tree modifications– new may be inserted– sub-trees be deleted– sub-trees may be moved around within the

tree

39

OrdPath• Encodes the parent-child relationship by extending the

parent’s ORDPATH label with a component for the child. – E.g.: 1.5.3.9 might be the parent ORDPATH, 1.5.3.9.1 the

child. • The various child components reflect the children’s relative

sibling order, so that byte-by-byte comparison of the ORDPATH labels of two nodes yields the proper document order.

• A new node (possibly a root node of a sub-tree) can be inserted under any designated parent node in an existing tree. – Its label is generated using an additional intermediate

“careting” component that falls between the components of its left and right siblings.

40

OrdPath• At the beginning– Only positive, odd integers are assigned during an initial

load; even-numbered and negative integer component values are reserved for later insertions into an existing tree

• Inserting in the middle– Even numbers are used as carets only. Do not count as

components that increase the depth of the nodes.– E.g. new nodes in between 3.5.5 and 3.5.7

• New siblings: 3.5.6.1, 3.5.6.2, …• A subtree: 3.5.6.1, 3.5.6.1.1, 3.5.6.3, 3.5.6.3.1, 3.5.6.3.3,

3.5.6.3.3.1, 3.5.6.3.3.3, 3.5.6.3.5, 3.5.6.5, 3.5.6.5.1

42

ORDPATH Label of Nodes

BOOK1

Section1.3

Figure1.3.5

Title1.3.1

Section1.5

Title1.5.1

Figure1.5.5

@ISBN1.1

CAPTION1.3.5.1

Nobody…1.3.3

tree frogs1.5.7

All right…1.5.3

43

Infoset Table

2 Memcached

What is memcached briefly?

• memcached is a high-performance, distributed memory object caching system, generic in nature

• It is a key-based cache daemon that stores data and objects wherever dedicated or spare RAM is available for very quick access

• It is a dumb distributed hash table. It does not provide redundancy, failover or authentication. If needed the client has to handle that.

Why was memcached made?

• It was originally developed by Danga Interactive to enhance the speed of LiveJournal.com

• It dropped the database load to almost nothing, yielding faster page load times for users, better resource utilization, and faster access to the databases on a memcache miss

• http://www.danga.com/memcached/

47

Memcached

Where does memcached reside?

• Memcache is not part of the database but sits outside it on the server(s).

• Over a pool of servers

Architecture

When should I use memcached?

• When your database is optimized to the hilt and you still need more out of it.– Lots of SELECTs are using resources that could be better

used elsewhere in the DB.– Locking issues keep coming up

• When table listings in the query cache are torn down so often it becomes useless

• To get maximum “scale out” of minimum hardware

Hit-rate Management

• anything what is more expensive to fetch from elsewhere, and has sufficient hitrate, can be placed in memcached– How often will object or data be used?– How expensive is it to generate the data?– What is the expected hitrate?– Will the application invalidate the data itself, or will TTL be

used? – How much development work has to be done to embed it?

Why use memcached?

• To reduce the load on the database by caching data BEFORE it hits the database

• Can be used for more then just holding database results (objects) and improve the entire application response time

• Feel the need for speed– Memcache is in RAM - much faster then hitting

the disk or the database

Why not use memcached?

• Memcache is held in RAM. This is a finite resource.

• Adding complexity to a system just for complexities sake is a waste. If the system can respond within the requirements without it - leave it alone

What are the limits of memcached?

• Keys can be no more then 250 characters• Stored data can not exceed 1M (largest typical

slab size)• There are generally no limits to the number of

nodes running memcache• There are generally no limits the the amount of

RAM used by memcache over all nodes– 32 bit machines do have a limit of 4GB though

Platform

• You can build and install memcached from the source code directly, or you can use an existing operating system package or installation.– on a RedHat, Fedora or CentOS host, use yum:

• root-shell> yum install memcached– on a Debian or Ubuntu host, use apt-get:

• root-shell> apt-get install memcached– on a Gentoo host, use emerge:

• root-shell> emerge install memcached– on OpenSolaris, use the pkg for SUNWmemcached:

• root-shell> pkg install SUNWmemcached

Port 11211

• Get the source from the website: http://www.danga.com/memcached/download.bml– Memcache has a dependancy on libevent so make

sure you have that also.

• Decompress, cd into the dir• ./configure;make;make install;• Memcached listens on port 11211 by default,

this can be changed with –p option.

http://www.danga.com/memcached/download.bml

http://www.danga.com/memcached/download.bml

How do I start memcached?

• Memcached can be run as a non-root user if it will not be on a restricted port (<1024) - though the user can not have a memory limit restriction

• shell> memcached • Default configuration - Memory: 64MB, all

network interfaces, port:11211, max simultaneous connections: 1024

Memcached options

• You can change the default configuration with various options.– -u <user> : run as user if started as root– -m <num> : maximum <num> MB memory to use for items

• If more then available RAM - will use swap• Don’t forget 4G limit on 32 bit machines

– -d : Run as a daemon– -l <ip_addr> : Listen on <ip_addr>; default to INDRR_ANY– -p <num> : port

How can I connect to memcached?

• Memcached uses a protocol that many languages implement with an API.

• Languages that implement it:– Perl, PHP, Python, Ruby, Java, C#, C, Lua, Postgres,

MySQL, Chicken Scheme

• And yes - because it is a protocol you can even use telnet– shell> telnet localhost 11211

Memcached protocol

• Three types of commands– Storage - ask the server to store some data

identified by a key• set, add, replace, append, prepend and cas

– Retrieval - ask the server to retrieve data corresponding to a set of keys• get, gets

Memcached protocol (con’t)– All others that don’t involve unstructured data • Deletion:delete • Statistics: stats, • flush_all: always succeeds, invalidate all existing items

immediately (by default) or after the expiration specified.• version, verbosity, quit

PHP and Memcached

• Make sure you have a working Apache/PHP install

• PHP has a memcached extension available through pecl.

• Installation:– shell> pecl install memcache

• Make sure the pear is installed (debian: apt-get install php-pear)

• Make sure that you also have php5-dev installed for phpize.– shell> apt-get install php5-dev

PHP Script Example• Information about the PHP API at

http://www.php.net/memcache

<?php // make a memcache object $memcache = new Memcache; // connect to memcache $memcache->connect('localhost', 11211) or die ("Could not connect"); //get the memcache version $version = $memcache->getVersion(); echo "Server's version: ".$version." \n";

http://www.php.net/memcache

PHP Script (con’t)// test data$tmp_object = new stdClass;$tmp_object->str_attr = 'test';$tmp_object->int_attr = 123;// set the test data in memcache$memcache->set('key', $tmp_object, false, 10) or die ("Failed to

save data at the server");echo "Store data in the cache (data will expire in 10 seconds) \

n";// get the data$get_result = $memcache->get('key');echo "Data from the cache: \n";echo ‘<pre>’, var_dump($get_result), ‘</pre>’;

MEMCACHE_COMPRESSED

PHP Script (con’t)// modify the data$tmp_object->str_attr = ‘boo’;$memcache->replace(‘key’, $tmp_object, false, 10) or die(“Failed to

save new data to the server \n”);Echo “Stored data in the cache changed \n”;// get the new data$get_result = $memcache->get(‘key’);Echo “New data from the cache: \n”;Echo ‘<pre>’, var_dump($get_result), “</pre>\n”;// delete the data$memcache->delete(‘key’) or die(“Data not deleted \n”);

MySQL’s memcached

• The API is consistent with the other API’s– Connect: mysql> SELECT

memc_servers_set('192.168.0.1:11211, 192.168.0.2:11211');• The list of servers used by the memcached UDFs is not

persistent over restarts of the MySQL server.– Set: mysql> SELECT memc_set('myid', 'myvalue');– Retreive: mysql> SELECT memc_get('myid');

Possible ways to secure memcached

• It has no authentication system - so protection is important

• Run as a non-priveledged user to minimize potential damage

• Specify the ip address to listen on using -l– 127.0.0.1, 192.168.0.1, specific ip address

• Use a non-standard port• Use a firewall

Memcached

69

• Slab AllocationSlab Allocation: : When you start to store data into the cache, memcached does not allocate the memory for the data on an item by item basis. Instead, a slab allocation is used to optimize memory usage and prevent memory fragmentation when information expires from the cache.

• Lazy Expiration + LRULazy Expiration + LRU• Lazy ExpirationLazy Expiration: : When an item is requested (a get request)

Memcached checks the expiration time to see if the item is still valid before returning it to the client.

• LRU LRU ((least recently used): Memcached is LRU per slab class, : Memcached is LRU per slab class, but not globally LRU. but not globally LRU.

Memcached Memory Management

70

Slab Allocation

71

Memcached Distributed Architecture

////Obtain a server ID based on Obtain a server ID based on Key Key valuevalueint getServerId(char *key, int serverTotal) {int getServerId(char *key, int serverTotal) { int c, hash = 0;int c, hash = 0; while (c = *key++) {while (c = *key++) { hash += c;hash += c; }} return hash % serverTotal;return hash % serverTotal;}}

////a list of serversa list of serversnode[0] => 192.168.0.1:11211node[0] => 192.168.0.1:11211node[1] => 192.168.0.2:11211node[1] => 192.168.0.2:11211node[2] => 192.168.0.3:11211node[2] => 192.168.0.3:11211

////get id get id int id = getServerId("test", 3);int id = getServerId("test", 3);

////get ip address and port numberget ip address and port numbernode[id] == node[1]node[id] == node[1]

Memcached Distributed Architecture

SQL Server Cache

• SQL Server cache mechanism :- Query plans- pages from the database files

• but it does NOT cache:- exact results from a query

REFERENCEhttp://searchsqlserver.techtarget.com/tip/SQL-Server-memory-configurations-for-procedure-cache-and-buffer-cache

Reference

• DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases

• Index Structure for Path Expressions• APEX: An Adaptive Path Index for XML Data

Reference Only• https://blog.couchbase.com/memcached-144-

windows-32-bit-binary-now-available• Memcached should run on Linux. It may not

work on some windows

https://blog.couchbase.com/memcached-144-windows-32-bit-binary-now-available

https://blog.couchbase.com/memcached-144-windows-32-bit-binary-now-available

Reference

Reference

• 0: <flag> • 60: timeout• 6: length

unit 06 : index and distributed caching comp 5323 web database technologies and applications 2014

Documents