© 2013 a. haeberlen, z. ives nets 212: scalable and cloud computing 1 university of pennsylvania...
TRANSCRIPT
1University of Pennsylvania
© 2013 A. Haeberlen, Z. Ives
NETS 212: Scalable and Cloud Computing
Hierarchical data
November 21, 2013
© 2013 A. Haeberlen, Z. Ives
2University of Pennsylvania
Announcements
How is the PennBook project going? Code is (technically) due on December 10th Getting Started guide is now available (see Piazza
post)
Second midterm exam on December 10th
Comparable to first midterm (open-book, closed-Google, ...)
Covers all material up to, and including, December 5 lecture
Please schedule a check/review session!!!
Each team should already have contacted their grader
If you have not done this yet, please do so right after class!
© 2013 A. Haeberlen, Z. Ives
Data = Records… or not?
To this point, we have assumed record-oriented data throughout most of the semester:
(key, value) pairs (from, to) edges Comma or tab-delimited files Relational tuples
Except implicitly: objects with collection-typed member variables sets of objects to be reduced XML with repeated sub-elements (e.g., in AJAX)
3
© 2013 A. Haeberlen, Z. Ives
4University of Pennsylvania
Goals for today: Beyond relations!
Representational capacity Pig Latin
Language and examples Implementation
XQuery Basic constructs Applicability to distributed settings
NEXT
© 2013 A. Haeberlen, Z. Ives
Representational capacity
We know that a Turing-complete language just needs:
sequence, selection, recursion (or iteration)
What does a complete data representation need?
Typed binary relations: pred(src,target) OR equivalently, triples: edge(src,type,target)
But in general, it’s not convenient to be limited to these!
5
© 2013 A. Haeberlen, Z. Ives
Towards Pig #1: Beyond relations?
The relational data model allows us to have arbitrary numbers of relations
Each with its own schema that includes arbitrary numbers of attributes
But: No nested tables! These would be converted into multiple tables by 1NF
normalization Hence SQL has no nested collections at all, (sets, lists,
bags...)
Can we add support for these?6
© 2013 A. Haeberlen, Z. Ives
Towards Pig #2: Programming model
Hadoop MapReduce: file-oriented, procedural regularized “pipeline” – map, combine, shuffle, reduce arbitrary Java functions at each step
SQL: random access-storage-oriented (DBMS controls storage) compositional, tuple-collection-oriented query model declarative queries are automatically optimized can accommodate Java functions, but not naturally Hive: SQL queries file-oriented Map/Reduce
Is there something in between? Declarative is nice, but many data analysts are
'entrenched' procedural programmers... Pig and Pig Latin!
7
rigid dataflow
custom codeeven for very
common operations
opacity
what about"procedural
programmers"?
© 2013 A. Haeberlen, Z. Ives
8University of Pennsylvania
Goals for today: Beyond relations!
Representational capacity Pig Latin
Language and examples Implementation
XQuery Basic constructs Applicability to distributed settings
NEXT
© 2013 A. Haeberlen, Z. Ives
Pig Latin and Pig
Pig Latin: a compositional, collections-oriented dataflow language
Oriented towards parallel data processing & analysis Think of it as a more procedural SQL-like language
with nested collections Emphasizes user-defined functions, esp. those that have nice
algebraic properties (unlike SQL) Supports external data from files (like Hive)
By Chris Olston et al. at Yahoo! Research http://www.tomkinshome.com/site_media/papers/papers/
ORS+08.pdf
Pig: the runtime system
9
© 2013 A. Haeberlen, Z. Ives
Pig Latin: Basic constructs Collection-valued expressions whose results
get assigned to variables A program does a series of assignments in a dataflow It gets compiled down to a sequence of MapReduces
Similar to Hive, but Pig Latin has its own query language (not SQL)
Basic SQL-like operations are explicitly specified:
load … as [HDFS scan] Remapping: foreach … generate [Map] Filtering: filter by [Map] Intersecting: join [Reduce] Aggregating: group by [Reduce] Sorting: order [Shuffle] store [HDFS store]
10
© 2013 A. Haeberlen, Z. Ives
Simple example: Face detection
Each expression creates a named collection
load collections from files process them (e.g., per tuple) using a user-defined
function store the results into files
I = load ‘/mydata/images’ using ImageParser() as (id, image);F = foreach I generate id, detectFaces(image);store F into ‘/mydata/faces’;
11
© 2013 A. Haeberlen, Z. Ives
Example: Session classification Goal: Find web sessions that end on the 'best'
page (i.e., the page with the highest PageRank) We need to join two tables, and then compare the final rank
in the sequence to the other ranks
12
User URL Time
Alice www.cnn.com 7:00
Alice www.digg.com 7:20
Alice www.social.com 10:00
Alice www.flickr.com 10:05
Joe www.cnn.com/index.htm 12:00
URL PageRank
www.cnn.com 0.9
www.flickr.com 0.9
www.social.com 0.7
www.digg.com 0.2
PagesVisits
. . .
. . .
© 2013 A. Haeberlen, Z. Ives
The computation in Pig Latin
Visits = load ‘/data/visits’ as (user, url, time); Visits = foreach Visits generate user,
Canonicalize(url), time;
Pages = load ‘/data/pages’ as (url, pagerank);
VP = join Visits by url, Pages by url; UserVisits = group VP by user; Sessions = foreach UserVisits generate
flatten(FindSessions(*));HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';
13
© 2013 A. Haeberlen, Z. Ives
What does this query compile to? Parallel evaluation is really a
Map-Map/Reduce/Reduce chain:
Visit lists (filesystem)
Rank lists (filesystem)
⋈ ⋈ ⋈ ⋈ ⋈
…
…
Parallel joins (⋈)
Parallel group-by (g) session / choose best
© 2013 A. Haeberlen, Z. Ives
Pig Latin features Record-oriented transformations
Can work over, create nested collections (Resembles Nested Relational variants of SQL)
Basic operators expose parallelism; user-defined operators may not
Operations are explicit, not declarative Unlike SQL
15
operators: • FILTER
• FOREACH … GENERATE • GROUP
binary operators: • JOIN
• COGROUP• UNION
© 2013 A. Haeberlen, Z. Ives
Nesting: COGROUP & FLATTEN
Cogrouping: nesting groups into columns
Flattening: unnesting groups
16
© 2013 A. Haeberlen, Z. Ives
Pig Latin vs. MapReduce MapReduce combines 3 primitives:
process records create groups process groups
In Pig, these primitives are:– explicit– independent– fully composable
Pig adds primitives for:– filtering tables– projecting tables– combining 2 or more
tables
a = FOREACH input GENERATE flatten(Map(*));b = GROUP a BY $0;
c = FOREACH b GENERATE Reduce(*);
© 2013 A. Haeberlen, Z. Ives
Recap: Pig Latin
A dataflow language that compiles to MapReduce
Borrows many of the elements of SQL, but eliminates the reliance on declarative optimization
Incorporates primitives for nested collections
Quite successful: As of 2008: 25% of Yahoo Map/Reduce jobs from Pig Part of the Hadoop standard distribution
18
© 2013 A. Haeberlen, Z. Ives
Pig system implementation Let’s briefly look at the Pig implementation, and how it can do
a bit more because of the higher-level language:
executionplan
Pig compiler
cluster
parsedprogram
Parser
User
cross-joboptimizer
Pig Latin
program
MapReduce
MapReducejobs
MR Compilerjoin
output
filter
X
f( )
Y
© 2013 A. Haeberlen, Z. Ives
Key issue: Minimizing redundancy
Popular tables web crawl search log
Popular transformations eliminate spam pages group pages by host join web crawl with search log
Goal: Minimize redundant work
© 2013 A. Haeberlen, Z. Ives
Work-sharing techniques
Job 1
Job 2
Op1
Op2
Op3
A
A
execute similar jobs together
cache data transformations
cache data moves
JoinA & B
Worker 1 Worker 2
A
D CB
Job 1 Job 2
Op2Op1
A
A
© 2013 A. Haeberlen, Z. Ives
Recap: Pig and Pig Latin
Somewhere between a programming language and a DBMS
Allows distributed programming with explicit parallel dataflow operators
Supports explicit management of nested collections
Runtime system does caching and batching 22
© 2013 A. Haeberlen, Z. Ives
23University of Pennsylvania
Goals for today: Beyond relations!
Representational capacity Pig Latin
Language and examples Implementation
XQuery Basic constructs Applicability to distributed settings
NEXT
© 2013 A. Haeberlen, Z. Ives
One step further
Pig Latin gives us nested collections
Suppose we want to take a single hierarchical data model for everything!
XML, perhaps?
A functional language for XML: XQuery
24
© 2013 A. Haeberlen, Z. Ives
25
XQuery: Basic form Has an analogous form to SQL’s
SELECT..FROM..WHERE..GROUP BY..ORDER BY Builds upon XPath expressions (traversals from
the root) The model:
Bind nodes (or node sets) to variables Operate over each legal combination of bindings Produce a set of nodes
“FLWOR” statement [note case sensitivity!]:for {iterators that bind variables}let {collections}where {conditions}order by {order-conditions}return {output constructor}
© 2013 A. Haeberlen, Z. Ives
26
Example XML DataRoot
?xml dblp
mastersthesis university
mdatekey
authortitleyearschoolname loc
1992…
ms/Brown92
Kurt P….
PRPL…
1992
WISCWisconsin
USA
attributeroot
p-i element
text
article proceedings
year
1999
key
WISC
authortitleyear conf
Bob S….
PL…
1982
PLDI82
key
PLDI82
© 2013 A. Haeberlen, Z. Ives
27
XQuery: "Iterations"
A series of (possibly nested) FOR statements assigning the results of XPaths to variables
for $root in document(“http://my.org/dblp.xml”)for $sub in $root/dblp,
$sub2 in $sub/mastersthesis, …
Something like a template that pattern-matches, produces a “binding tuple”
For each of these, we evaluate the where and possibly output the return template
document() or doc() function specifies an input file as a URI
© 2013 A. Haeberlen, Z. Ives
28
Two XQuery examples
<root-tag> {for $p in document(“dblp.xml”)/dblp/proceedings, $yr in $p/yrwhere $yr = “1999”return <proc> {$p} </proc>
} </root-tag>
for $i in document(“dblp.xml”)/dblp/mastersthesis[author/text() = "John Doe"]
return <johns-theses> <title>{ $i/title/text() }</title> <key>{ $i/@key }</key> </johns-theses>
© 2013 A. Haeberlen, Z. Ives
29
XQuery: Nesting
Nesting XML trees is perhaps the most common operation
In XQuery, this is easy – put a subquery in the return clause where you want things to repeat!
for $u in document(“dblp.xml”)/dblp/university where $u/loc = “USA” return <ms-theses-99>
{ $u/name } { for $mt in $u/../mastersthesis where $mt/year/text() = “1999”
and $mt/school = $u/@key return $mt/title }
</ms-theses-99>
© 2013 A. Haeberlen, Z. Ives
30
XQuery: Collections & Aggregation
In XQuery, many operations return collections
XPaths, sub-XQueries, functions over these, … The let clause assigns the results to a variable
Aggregation simply applies a function over a collection, where the function returns a value
let $allpapers := document(“dblp.xml”)/dblp/*[author]return <article-author-counts>
<count> { fn:count(fn:distinct-values($allpapers/author)) } </count>
{ for $paper in allPaperslet $pauth := $paper/authorreturn <paper> {$paper/title}
<count> { fn:count($pauth) } </count> </paper>
} </article-author-counts>
© 2013 A. Haeberlen, Z. Ives
31
XQuery: Defining and querying metadata
Can get a node’s name by querying node-name():
for $x in document(“dblp.xml”)/dblp/*return node-name($x)
Can construct elements and attributes using computed constructors:
for $x in document(“dblp.xml”)/dblp/*,$year in $x/year,$title in $x/title/text(),
element node-name($x) {attribute {“year-” + $year} { $title }
}
Can't do this in SQL!
element name { contents }attribute name { value }
© 2013 A. Haeberlen, Z. Ives
32
XQuery: Functions
A function can return arbitrary XML…
declare function V() as element(content)* {for $r in doc(“R”)/root/tree, $a in $r/a, $b in $r/b, $c in $r/cwhere $a = “123”return <content>{$a, $b, $c}</content>
}
declare function fact($n as xs:integer) as xs:integer {
if ($n > 1) then $n * fact($n – 1) else 1
}
It can be queried over like a document…for $v in V()/content, $r in doc(“r”)/root/treewhere $v/b = $r/breturn <result>{$v, fact(1)} </result>
calls
calls
© 2013 A. Haeberlen, Z. Ives
33
Summary: XQuery
Flexible and powerful language for XML Supports arbitrary hierarchy, and conversion
between different structures Turing-complete: has been proposed as Web
programming language
Turing-complete, but… No “cloud” XQuery engines yet, though all major
commercial DBMSs support centralized XQuery It’s harder to parallelize hierarchical data!
A worthy alternative to SQL/Pig Latin?