© 2013 a. haeberlen, z. ives nets 212: scalable and cloud computing 1 university of pennsylvania...

1University of Pennsylvania

© 2013 A. Haeberlen, Z. Ives

NETS 212: Scalable and Cloud Computing

Hierarchical data

November 21, 2013



Announcements

How is the PennBook project going? Code is (technically) due on December 10th Getting Started guide is now available (see Piazza

post)

Second midterm exam on December 10th

Comparable to first midterm (open-book, closed-Google, ...)

Covers all material up to, and including, December 5 lecture

Please schedule a check/review session!!!

Each team should already have contacted their grader

If you have not done this yet, please do so right after class!


Data = Records… or not?

To this point, we have assumed record-oriented data throughout most of the semester:

(key, value) pairs (from, to) edges Comma or tab-delimited files Relational tuples

Except implicitly: objects with collection-typed member variables sets of objects to be reduced XML with repeated sub-elements (e.g., in AJAX)

3



Goals for today: Beyond relations!

Representational capacity Pig Latin

Language and examples Implementation

XQuery Basic constructs Applicability to distributed settings

NEXT


Representational capacity

We know that a Turing-complete language just needs:

sequence, selection, recursion (or iteration)

What does a complete data representation need?

Typed binary relations: pred(src,target) OR equivalently, triples: edge(src,type,target)

But in general, it’s not convenient to be limited to these!

5


Towards Pig #1: Beyond relations?

The relational data model allows us to have arbitrary numbers of relations

Each with its own schema that includes arbitrary numbers of attributes

But: No nested tables! These would be converted into multiple tables by 1NF

normalization Hence SQL has no nested collections at all, (sets, lists,

bags...)

Can we add support for these?6


Towards Pig #2: Programming model

Hadoop MapReduce: file-oriented, procedural regularized “pipeline” – map, combine, shuffle, reduce arbitrary Java functions at each step

SQL: random access-storage-oriented (DBMS controls storage) compositional, tuple-collection-oriented query model declarative queries are automatically optimized can accommodate Java functions, but not naturally Hive: SQL queries file-oriented Map/Reduce

Is there something in between? Declarative is nice, but many data analysts are

'entrenched' procedural programmers... Pig and Pig Latin!

7

rigid dataflow

custom codeeven for very

common operations

opacity

what about"procedural

programmers"?


Pig Latin and Pig

Pig Latin: a compositional, collections-oriented dataflow language

Oriented towards parallel data processing & analysis Think of it as a more procedural SQL-like language

with nested collections Emphasizes user-defined functions, esp. those that have nice

algebraic properties (unlike SQL) Supports external data from files (like Hive)

By Chris Olston et al. at Yahoo! Research http://www.tomkinshome.com/site_media/papers/papers/

ORS+08.pdf

Pig: the runtime system

9


Pig Latin: Basic constructs Collection-valued expressions whose results

get assigned to variables A program does a series of assignments in a dataflow It gets compiled down to a sequence of MapReduces

Similar to Hive, but Pig Latin has its own query language (not SQL)

Basic SQL-like operations are explicitly specified:

load … as [HDFS scan] Remapping: foreach … generate [Map] Filtering: filter by [Map] Intersecting: join [Reduce] Aggregating: group by [Reduce] Sorting: order [Shuffle] store [HDFS store]

10


Simple example: Face detection

Each expression creates a named collection

load collections from files process them (e.g., per tuple) using a user-defined

function store the results into files

I = load ‘/mydata/images’ using ImageParser() as (id, image);F = foreach I generate id, detectFaces(image);store F into ‘/mydata/faces’;

11


Example: Session classification Goal: Find web sessions that end on the 'best'

page (i.e., the page with the highest PageRank) We need to join two tables, and then compare the final rank

in the sequence to the other ranks

12

User URL Time

Alice www.cnn.com 7:00

Alice www.digg.com 7:20

Alice www.social.com 10:00

Alice www.flickr.com 10:05

Joe www.cnn.com/index.htm 12:00

URL PageRank

www.cnn.com 0.9

www.flickr.com 0.9

www.social.com 0.7

www.digg.com 0.2

PagesVisits

. . .

. . .


The computation in Pig Latin

Visits = load ‘/data/visits’ as (user, url, time); Visits = foreach Visits generate user,

Canonicalize(url), time;

Pages = load ‘/data/pages’ as (url, pagerank);

VP = join Visits by url, Pages by url; UserVisits = group VP by user; Sessions = foreach UserVisits generate

flatten(FindSessions(*));HappyEndings = filter Sessions by BestIsLast(*);

store HappyEndings into '/data/happy_endings';

13


What does this query compile to? Parallel evaluation is really a

Map-Map/Reduce/Reduce chain:

Visit lists (filesystem)

Rank lists (filesystem)

⋈ ⋈ ⋈ ⋈ ⋈

…

…

Parallel joins (⋈)

Parallel group-by (g) session / choose best


Pig Latin features Record-oriented transformations

Can work over, create nested collections (Resembles Nested Relational variants of SQL)

Basic operators expose parallelism; user-defined operators may not

Operations are explicit, not declarative Unlike SQL

15

operators: • FILTER

• FOREACH … GENERATE • GROUP

binary operators: • JOIN

• COGROUP• UNION


Nesting: COGROUP & FLATTEN

Cogrouping: nesting groups into columns

Flattening: unnesting groups

16


Pig Latin vs. MapReduce MapReduce combines 3 primitives:

process records create groups process groups

In Pig, these primitives are:– explicit– independent– fully composable

Pig adds primitives for:– filtering tables– projecting tables– combining 2 or more

tables

a = FOREACH input GENERATE flatten(Map(*));b = GROUP a BY $0;

c = FOREACH b GENERATE Reduce(*);


Recap: Pig Latin

A dataflow language that compiles to MapReduce

Borrows many of the elements of SQL, but eliminates the reliance on declarative optimization

Incorporates primitives for nested collections

Quite successful: As of 2008: 25% of Yahoo Map/Reduce jobs from Pig Part of the Hadoop standard distribution

18


Pig system implementation Let’s briefly look at the Pig implementation, and how it can do

a bit more because of the higher-level language:

executionplan

Pig compiler

cluster

parsedprogram

Parser

User

cross-joboptimizer

Pig Latin

program

MapReduce

MapReducejobs

MR Compilerjoin

output

filter

X

f( )

Y


Key issue: Minimizing redundancy

Popular tables web crawl search log

Popular transformations eliminate spam pages group pages by host join web crawl with search log

Goal: Minimize redundant work


Work-sharing techniques

Job 1

Job 2

Op1

Op2

Op3

A

A

execute similar jobs together

cache data transformations

cache data moves

JoinA & B

Worker 1 Worker 2

A

D CB

Job 1 Job 2

Op2Op1

A

A


Recap: Pig and Pig Latin

Somewhere between a programming language and a DBMS

Allows distributed programming with explicit parallel dataflow operators

Supports explicit management of nested collections

Runtime system does caching and batching 22


25

XQuery: Basic form Has an analogous form to SQL’s

SELECT..FROM..WHERE..GROUP BY..ORDER BY Builds upon XPath expressions (traversals from

the root) The model:

Bind nodes (or node sets) to variables Operate over each legal combination of bindings Produce a set of nodes

“FLWOR” statement [note case sensitivity!]:for {iterators that bind variables}let {collections}where {conditions}order by {order-conditions}return {output constructor}


26

Example XML DataRoot

?xml dblp

mastersthesis university

mdatekey

authortitleyearschoolname loc

1992…

ms/Brown92

Kurt P….

PRPL…

1992

WISCWisconsin

USA

attributeroot

p-i element

text

article proceedings

year

1999

key

WISC

authortitleyear conf

Bob S….

PL…

1982

PLDI82

key

PLDI82


27

XQuery: "Iterations"

A series of (possibly nested) FOR statements assigning the results of XPaths to variables

for $root in document(“http://my.org/dblp.xml”)for $sub in $root/dblp,

$sub2 in $sub/mastersthesis, …

Something like a template that pattern-matches, produces a “binding tuple”

For each of these, we evaluate the where and possibly output the return template

document() or doc() function specifies an input file as a URI

http://my.org/dblp.xml


28

Two XQuery examples

<root-tag> {for $p in document(“dblp.xml”)/dblp/proceedings, $yr in $p/yrwhere $yr = “1999”return <proc> {$p} </proc>

} </root-tag>

for $i in document(“dblp.xml”)/dblp/mastersthesis[author/text() = "John Doe"]

return <johns-theses> <title>{ $i/title/text() }</title> <key>{ $i/@key }</key> </johns-theses>


29

XQuery: Nesting

Nesting XML trees is perhaps the most common operation

In XQuery, this is easy – put a subquery in the return clause where you want things to repeat!

for $u in document(“dblp.xml”)/dblp/university where $u/loc = “USA” return <ms-theses-99>

{ $u/name } { for $mt in $u/../mastersthesis where $mt/year/text() = “1999”

and $mt/school = $u/@key return $mt/title }

</ms-theses-99>


30

XQuery: Collections & Aggregation

In XQuery, many operations return collections

XPaths, sub-XQueries, functions over these, … The let clause assigns the results to a variable

Aggregation simply applies a function over a collection, where the function returns a value

let $allpapers := document(“dblp.xml”)/dblp/*[author]return <article-author-counts>

<count> { fn:count(fn:distinct-values($allpapers/author)) } </count>

{ for $paper in allPaperslet $pauth := $paper/authorreturn <paper> {$paper/title}

<count> { fn:count($pauth) } </count> </paper>

} </article-author-counts>


31

XQuery: Defining and querying metadata

Can get a node’s name by querying node-name():

for $x in document(“dblp.xml”)/dblp/*return node-name($x)

Can construct elements and attributes using computed constructors:

for $x in document(“dblp.xml”)/dblp/*,$year in $x/year,$title in $x/title/text(),

element node-name($x) {attribute {“year-” + $year} { $title }

}

Can't do this in SQL!

element name { contents }attribute name { value }


32

XQuery: Functions

A function can return arbitrary XML…

declare function V() as element(content)* {for $r in doc(“R”)/root/tree, $a in $r/a, $b in $r/b, $c in $r/cwhere $a = “123”return <content>{$a, $b, $c}</content>

}

declare function fact($n as xs:integer) as xs:integer {

if ($n > 1) then $n * fact($n – 1) else 1

}

It can be queried over like a document…for $v in V()/content, $r in doc(“r”)/root/treewhere $v/b = $r/breturn <result>{$v, fact(1)} </result>

calls

calls


33

Summary: XQuery

Flexible and powerful language for XML Supports arbitrary hierarchy, and conversion

between different structures Turing-complete: has been proposed as Web

programming language

Turing-complete, but… No “cloud” XQuery engines yet, though all major

commercial DBMSs support centralized XQuery It’s harder to parallelize hierarchical data!

A worthy alternative to SQL/Pig Latin?



Stay tuned

Next time you will learn about: Peer-to-peer systems

htt

p:/

/ww

w.fl

ickr.co

m/p

hoto

s/ti

mb

od

on/2

32

84

84

29

4/

© 2013 a. haeberlen, z. ives nets 212: scalable and cloud computing 1 university of pennsylvania...

Documents

ives data

pig pig latin

ives pig latin

ives goals

recordoriented data

ives representational

ives nets

ives announcements