unit 06 : index and distributed caching comp 5323 web database technologies and applications 2014
TRANSCRIPT
Unit 06 : Index and Distributed Caching
COMP 5323Web Database Technologies and
Applications 2014
• This PowerPoint is prepared for educational purpose and is strictly used in the classroom lecturing.
• We have adopted the "Fair Use" doctrine in this PowerPoint which allows limited copying of copyrighted works for educational and research purposes.
Doctrine of Fair Use
Learning Objectives
• Understand different path index techniques to improve the performance.
• Learn a distributed memory caching system which improves the performance of web database applications
Outline
1.Index for Semi-structured data2.Distributed Caching
1 Index for Semi-structured Data
6
Why is Indexing Needed?
• Allows fast access to data by replicating portions of the data in special purpose structures.
• Despite the additional cost (storage, maintenance and complexity) they have shown to be useful in evaluating queries.
Index Types
• Structural index– Accessing all elements of given name– Ancestor-descendant and parent-child
relationship between elements
• Content index– Accessing elements containing given
keywords– Supporting most text search functionalities
Classical Content Index
• Classically based on inverted lists – For each term, gives the
doc.ID + localization• Several variations allows
different search types– Offset, Relative, Proximity
• Generally stored in a B+-Tree to optimize search for a given word
• Size is an important issue– Memory and Disk
• (word, localization)– Fixed entry (word
repeated)
• (word, Frequency, (localization)*)– Variable length entry
Words Localization
- t1 : doc1-100, doc1-300, doc3-200, …
- t2 : doc2-30, doc4-70, …
- t3 : doc4-87, doc5-754, …
Short Reference: http://www.igi-global.com/dictionary/inverted-index/15654
9
Problem with XML
• Support of element addressing– Doc.ID should include
NodeId (Xpath) + Offset• Index size becomes very
large– XPath are long
• Support of typed data– Integer, float, simple types
of XML schema– Requires classical indexes
for certain elements
• Query processing– Structural joins– Text search– Exact search
• Support of updates– Incremental updates
would be a plus
Path-based approach
• Represent XML document into tree or graph structure
• Index XML document directly– Without the support of DTD
• Mainly use the memory as the index storage• Properties– Keep the structural information to improve query
performance– Easy to support query with regular path expression
Different Approaches• Patricia Trie
– Cooper et al. 2001 • DataGuides
– J. McHugh et al., 1997• T-Index
– Tova Milo and Dan Sucin• APEX (Adaptive Path Index for XML Data)
– C. W. Chung et al., 2002• Dewey Structure• K-ary Table• Path Table• OrdPath
Partricia Tries
• A compact representation of a trie in which any node that is an only child is merged with its parent.
• Also known as radix tree
Partricia Tries
• Cooper et al. 2001 • Idea:– Partitioned Partricia Tries to index strings– Encode XPath expressions as strings
(encode names, encode atomic values)
<book> <author>Whoever</author> <author>Not me</author> <title>No Kidding</title></book>
B A 1 WhoeverB A 2 Not meB T No Kidding
DataGuides
• World-Wide Web demonstrates that much of the information available online is semistructured.
• Graph-based data model called OEM, for Object Exchange Model
A sample OEM database
Lore Language (Lorel Query)
• Select Restaurant.Entree – returns all entrees served by any restaurant,
the set of objects {6, 10, 11}
• Select Restaurant.Name• where Restaurant.Entree = “Burger”– The answer to the query is the single object 5.
DataGuide
T-index
1-Index
1-Index DataGuides
2-Index
APEX
Representation of XML Data Structure
DataGuide
APEX
HAPEX GAPEX
Dewey - Structure• Each node is assigned a label that represents the path from
the document’s root to the node.• Each component of the label represents the local order of an
ancestor node.• Nodes with the same number of delimiters (“.”) in their label
are in the same level.Bib
book paper
paperauthor
Tim Sarah
author
(0)(0)
(0.0)(0.0)
(0.0.0)(0.0.0)
(0.0.0.0)(0.0.0.0)
(0.1)(0.1)
(0.2)(0.2)
(0.2.0)(0.2.0)
(0.2.0.0)(0.2.0.0)
Dewey – Supported Queries (1/3)
• Ancestors / Descendants– Node “X” is an ancestor of node “Y” if the label of
node “X” is a substring of the label of node “Y”.
Bib
book paper
paperauthor
Tim Sarah
author
(0)(0)
(0.0)(0.0)
(0.0.0)(0.0.0)
(0.0.0.0)(0.0.0.0)
(0.1)(0.1)
(0.2)(0.2)
(0.2.0)(0.2.0)
(0.2.0.0)(0.2.0.0)
Dewey – Supported Queries (2/3)• Parent / Child– Node “X” is parent of node “Y” if:
- The label of node “X” is a substring of the label of node “Y”
- And frags(X) = frags(Y) – 1, where frags(X) is the number of delimiters of the label of node X and frags(Y) is the number of delimiters of label of node Y.
Bib
book paper
paperauthor
Tim Sarah
author
(0)(0)
(0.0)(0.0)
(0.0.0)(0.0.0)
(0.0.0.0)(0.0.0.0)
(0.1)(0.1)
(0.2)(0.2)
(0.2.0)(0.2.0)
(0.2.0.0)(0.2.0.0)
Dewey – Supported Queries (3/3)• Siblings– Nodes “X” and “Y” are siblings if:
- They have the same number of delimiters in their labels - And X.prefix = Y.prefix, where prefix is the label of the
node without its positional identifier
Bib
book paper
paperauthor
Tim Sarah
author
(0)(0)
(0.0)(0.0)
(0.0.0)(0.0.0)
(0.0.0.0)(0.0.0.0)
(0.1)(0.1)
(0.2)(0.2)
(0.2.0)(0.2.0)
(0.2.0.0)(0.2.0.0)
Dewey – Updates• Insertion of new node– The label of the nodes in the subtree rooted at the
following sibling need to be updated– O(n) nodes need relabeling, where n is the number of
nodes of the XML fileBib
book paper
paperauthor
TimSarah
author
(0)(0)
(0.0)(0.0)
(0.0.0)(0.0.0)
(0.0.0.0)(0.0.0.0)
(0.1)(0.1)
(0.2)(0.2)
(0.2.0)(0.2.0)
(0.2.0.0)(0.2.0.0)
paper(0.2)(0.2)
(0.3)(0.3)
(0.3.0)(0.3.0)
(0.3.0.0)(0.3.0.0)
Dewey• Not efficient for dynamic XML files with many updates– Need to re-label many nodes
• As the depth of the tree increases:– Label size of a node increases rapidly • Storage size increases rapidly
– It becomes more costly to infer the supported queries between any two nodes (the string prefix matching becomes longer)
• Overflow problem– The original fixed length of bits assigned to store the
size of the label is not enough.
Document Tree
real node
virtual node
aa
3-ary tree
cc cc
dd ee ff gg
hh ii jj
ee ee ee ee ee
ee
ee
• Lee et al, ACM DL 1996.• Represent each document as a k-ary complete tree and assign a UID to each node
K-ary table
• Each document is assigned k, which is the maximum number of siblings in the document tree.
• Each element has an entry (row) in the K-ary table
• When a query is issued, the result set has pointers to the K-ary table.
Level and Element Type Number
• Level– Level means the level in the document tree– It gives a clue how many parent function is applied to
get to a target element• Element type number– A unique number is assigned to each element type in
DTD It enables to filter out unnecessary elements and accumulate the correct frequencies
• Element location– The unique position of an element instance in a
document tree
Element Labeling
Document(1)
Para(5)
Abstract(3)
Chapter(4)
Section(6)
Para(7)
Title(2)
UID
element UID element UID
D1 C1 C2 S1 S2
1 2 3 4 5
S3 S4 P1 P2 P3
8 9 14 15 16
Result of assigning UIDs
parent(i) = [(i-2)/k+1]
dd
3-ary tree
cc cc
ss ss ss ss
pp pp pp
ee ee ee ee ee
ee
• Unique element identifier
36
XML Index Path Table (Oracle)
BaseRid Path OrderKey
Value Locator NumValue
Rid1 po
Rid1 po.data 1 11
Rid1 po.data.item
1.1 “foo” 17
Rid1 po.data.pkg
1.2 “123” 32 123
Rid1 po.data.item
1.3 “bar” 46
<po> <data> <item>foo</item> <pkg>123</pkg> <item>bar</item> </data></po>
Some Typos
37
OrdPath
• ORDPATHs: Insert-Friendly XML Node Labels– Patrick O’Neil, Elizabeth O’Neil1, Shankar Pal,
Istvan Cseri, Gideon Schaller, Nigel Westbury– SIGMOD 2004– SQL Server 2005 implementation
38
OrdPath
• Aims to provide efficient insertion at any position of an XML tree, and also supports extremely high performance query plans for native XML queries.
• Tree modifications– new may be inserted– sub-trees be deleted– sub-trees may be moved around within the
tree
39
OrdPath• Encodes the parent-child relationship by extending the
parent’s ORDPATH label with a component for the child. – E.g.: 1.5.3.9 might be the parent ORDPATH, 1.5.3.9.1 the
child. • The various child components reflect the children’s relative
sibling order, so that byte-by-byte comparison of the ORDPATH labels of two nodes yields the proper document order.
• A new node (possibly a root node of a sub-tree) can be inserted under any designated parent node in an existing tree. – Its label is generated using an additional intermediate
“careting” component that falls between the components of its left and right siblings.
40
OrdPath• At the beginning– Only positive, odd integers are assigned during an initial
load; even-numbered and negative integer component values are reserved for later insertions into an existing tree
• Inserting in the middle– Even numbers are used as carets only. Do not count as
components that increase the depth of the nodes.– E.g. new nodes in between 3.5.5 and 3.5.7
• New siblings: 3.5.6.1, 3.5.6.2, …• A subtree: 3.5.6.1, 3.5.6.1.1, 3.5.6.3, 3.5.6.3.1, 3.5.6.3.3,
3.5.6.3.3.1, 3.5.6.3.3.3, 3.5.6.3.5, 3.5.6.5, 3.5.6.5.1
XML
42
ORDPATH Label of Nodes
BOOK1
Section1.3
Figure1.3.5
Title1.3.1
Section1.5
Title1.5.1
Figure1.5.5
@ISBN1.1
CAPTION1.3.5.1
Nobody…1.3.3
tree frogs1.5.7
All right…1.5.3
43
Infoset Table
2 Memcached
What is memcached briefly?
• memcached is a high-performance, distributed memory object caching system, generic in nature
• It is a key-based cache daemon that stores data and objects wherever dedicated or spare RAM is available for very quick access
• It is a dumb distributed hash table. It does not provide redundancy, failover or authentication. If needed the client has to handle that.
Why was memcached made?
• It was originally developed by Danga Interactive to enhance the speed of LiveJournal.com
• It dropped the database load to almost nothing, yielding faster page load times for users, better resource utilization, and faster access to the databases on a memcache miss
• http://www.danga.com/memcached/
47
Memcached
Where does memcached reside?
• Memcache is not part of the database but sits outside it on the server(s).
• Over a pool of servers
Architecture
When should I use memcached?
• When your database is optimized to the hilt and you still need more out of it.– Lots of SELECTs are using resources that could be better
used elsewhere in the DB.– Locking issues keep coming up
• When table listings in the query cache are torn down so often it becomes useless
• To get maximum “scale out” of minimum hardware
Hit-rate Management
• anything what is more expensive to fetch from elsewhere, and has sufficient hitrate, can be placed in memcached– How often will object or data be used?– How expensive is it to generate the data?– What is the expected hitrate?– Will the application invalidate the data itself, or will TTL be
used? – How much development work has to be done to embed it?
Why use memcached?
• To reduce the load on the database by caching data BEFORE it hits the database
• Can be used for more then just holding database results (objects) and improve the entire application response time
• Feel the need for speed– Memcache is in RAM - much faster then hitting
the disk or the database
Why not use memcached?
• Memcache is held in RAM. This is a finite resource.
• Adding complexity to a system just for complexities sake is a waste. If the system can respond within the requirements without it - leave it alone
What are the limits of memcached?
• Keys can be no more then 250 characters• Stored data can not exceed 1M (largest typical
slab size)• There are generally no limits to the number of
nodes running memcache• There are generally no limits the the amount of
RAM used by memcache over all nodes– 32 bit machines do have a limit of 4GB though
Platform
• You can build and install memcached from the source code directly, or you can use an existing operating system package or installation.– on a RedHat, Fedora or CentOS host, use yum:
• root-shell> yum install memcached– on a Debian or Ubuntu host, use apt-get:
• root-shell> apt-get install memcached– on a Gentoo host, use emerge:
• root-shell> emerge install memcached– on OpenSolaris, use the pkg for SUNWmemcached:
• root-shell> pkg install SUNWmemcached
Port 11211
• Get the source from the website: http://www.danga.com/memcached/download.bml– Memcache has a dependancy on libevent so make
sure you have that also.
• Decompress, cd into the dir• ./configure;make;make install;• Memcached listens on port 11211 by default,
this can be changed with –p option.
How do I start memcached?
• Memcached can be run as a non-root user if it will not be on a restricted port (<1024) - though the user can not have a memory limit restriction
• shell> memcached • Default configuration - Memory: 64MB, all
network interfaces, port:11211, max simultaneous connections: 1024
Memcached options
• You can change the default configuration with various options.– -u <user> : run as user if started as root– -m <num> : maximum <num> MB memory to use for items
• If more then available RAM - will use swap• Don’t forget 4G limit on 32 bit machines
– -d : Run as a daemon– -l <ip_addr> : Listen on <ip_addr>; default to INDRR_ANY– -p <num> : port
How can I connect to memcached?
• Memcached uses a protocol that many languages implement with an API.
• Languages that implement it:– Perl, PHP, Python, Ruby, Java, C#, C, Lua, Postgres,
MySQL, Chicken Scheme
• And yes - because it is a protocol you can even use telnet– shell> telnet localhost 11211
Memcached protocol
• Three types of commands– Storage - ask the server to store some data
identified by a key• set, add, replace, append, prepend and cas
– Retrieval - ask the server to retrieve data corresponding to a set of keys• get, gets
Memcached protocol (con’t)– All others that don’t involve unstructured data • Deletion:delete • Statistics: stats, • flush_all: always succeeds, invalidate all existing items
immediately (by default) or after the expiration specified.• version, verbosity, quit
PHP and Memcached
• Make sure you have a working Apache/PHP install
• PHP has a memcached extension available through pecl.
• Installation:– shell> pecl install memcache
• Make sure the pear is installed (debian: apt-get install php-pear)
• Make sure that you also have php5-dev installed for phpize.– shell> apt-get install php5-dev
PHP Script Example• Information about the PHP API at
http://www.php.net/memcache
<?php // make a memcache object $memcache = new Memcache; // connect to memcache $memcache->connect('localhost', 11211) or die ("Could not connect"); //get the memcache version $version = $memcache->getVersion(); echo "Server's version: ".$version."<br/>\n";
PHP Script (con’t)// test data$tmp_object = new stdClass;$tmp_object->str_attr = 'test';$tmp_object->int_attr = 123;// set the test data in memcache$memcache->set('key', $tmp_object, false, 10) or die ("Failed to
save data at the server");echo "Store data in the cache (data will expire in 10 seconds)<br/>\
n";// get the data$get_result = $memcache->get('key');echo "Data from the cache:<br/>\n";echo ‘<pre>’, var_dump($get_result), ‘</pre>’;
MEMCACHE_COMPRESSED
PHP Script (con’t)// modify the data$tmp_object->str_attr = ‘boo’;$memcache->replace(‘key’, $tmp_object, false, 10) or die(“Failed to
save new data to the server<br/>\n”);Echo “Stored data in the cache changed<br/>\n”;// get the new data$get_result = $memcache->get(‘key’);Echo “New data from the cache:<br/>\n”;Echo ‘<pre>’, var_dump($get_result), “</pre>\n”;// delete the data$memcache->delete(‘key’) or die(“Data not deleted<br/>\n”);
MySQL’s memcached
• The API is consistent with the other API’s– Connect: mysql> SELECT
memc_servers_set('192.168.0.1:11211, 192.168.0.2:11211');• The list of servers used by the memcached UDFs is not
persistent over restarts of the MySQL server.– Set: mysql> SELECT memc_set('myid', 'myvalue');– Retreive: mysql> SELECT memc_get('myid');
Possible ways to secure memcached
• It has no authentication system - so protection is important
• Run as a non-priveledged user to minimize potential damage
• Specify the ip address to listen on using -l– 127.0.0.1, 192.168.0.1, specific ip address
• Use a non-standard port• Use a firewall
Memcached
69
• Slab AllocationSlab Allocation: : When you start to store data into the cache, memcached does not allocate the memory for the data on an item by item basis. Instead, a slab allocation is used to optimize memory usage and prevent memory fragmentation when information expires from the cache.
• Lazy Expiration + LRULazy Expiration + LRU• Lazy ExpirationLazy Expiration: : When an item is requested (a get request)
Memcached checks the expiration time to see if the item is still valid before returning it to the client.
• LRU LRU ((least recently used): Memcached is LRU per slab class, : Memcached is LRU per slab class, but not globally LRU. but not globally LRU.
Memcached Memory Management
70
Slab Allocation
71
Memcached Distributed Architecture
////Obtain a server ID based on Obtain a server ID based on Key Key valuevalueint getServerId(char *key, int serverTotal) {int getServerId(char *key, int serverTotal) { int c, hash = 0;int c, hash = 0; while (c = *key++) {while (c = *key++) { hash += c;hash += c; }} return hash % serverTotal;return hash % serverTotal;}}
////a list of serversa list of serversnode[0] => 192.168.0.1:11211node[0] => 192.168.0.1:11211node[1] => 192.168.0.2:11211node[1] => 192.168.0.2:11211node[2] => 192.168.0.3:11211node[2] => 192.168.0.3:11211
////get id get id int id = getServerId("test", 3);int id = getServerId("test", 3);
////get ip address and port numberget ip address and port numbernode[id] == node[1]node[id] == node[1]
Memcached Distributed Architecture
SQL Server Cache
• SQL Server cache mechanism :- Query plans- pages from the database files
• but it does NOT cache:- exact results from a query
REFERENCEhttp://searchsqlserver.techtarget.com/tip/SQL-Server-memory-configurations-for-procedure-cache-and-buffer-cache
Reference
• DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
• Index Structure for Path Expressions• APEX: An Adaptive Path Index for XML Data
Reference Only• https://blog.couchbase.com/memcached-144-
windows-32-bit-binary-now-available• Memcached should run on Linux. It may not
work on some windows
Reference
Reference
• 0: <flag> • 60: timeout• 6: length