Using Cascalog and Hadoop for rapid graph processing and exploration



DESCRIPTION

Graphs are becoming increasingly popular as a way of modeling a wide variety of systems. As such, the label "graph processing" covers a range of objectives and architectural constraints. At Linkfluence (http://us.linkfluence.net/), we use graph processing on datasets produced by very different systems (Web crawler, Twitter and Facebook APIs, feed aggregator, etc.). We spend a lot of time doing exploratory programming, trying to use our eclectic datasets in interesting ways, and processing our data in asynchronous workflows.

We have come to see Hadoop (http://hadoop.apache.org/) and the processing framework Cascalog (https://github.com/nathanmarz/cascalog) as essential tools in our toolbox when dealing with graphs, since they give us architectural flexibility, scalability and the possibility of rapid prototyping.

Cascalog is an open source framework built on top of Hadoop and Cascading (http://www.cascading.org/). Its syntactic and semantic roots come from Datalog and Prolog, which have long been applied successfully to the manipulation of graphs. Also, its ability to directly embed the expressive Clojure (http://clojure.org/) language makes it very easy to define custom operations and ad-hoc processing.

In this talk, we will present the framework, consider its advantages and drawbacks compared to other approaches, and show concrete examples of its use for graph processing and how we use them to complement graph databases.

TRANSCRIPT

Page 1: Using Cascalog and Hadoop for rapid graph processing and exploration

Graph Analysis at Linkfluence Why Cascalog Introduction to Cascalog Conclusion

Using Cascalog and Hadoop for rapid graph processing and exploration

Nils Grunwald and Hugo Zanghi

Linkfluence

2012-02-05 - FOSDEM 2012 - Graph Devroom


Page 2: Using Cascalog and Hadoop for rapid graph processing and exploration

Outline

- Graph Analysis at Linkfluence
- Why Cascalog
- Introduction to Cascalog
- Conclusion

Page 3: Using Cascalog and Hadoop for rapid graph processing and exploration

What we do at Linkfluence

- Web data mining (blogs, media, etc.)
- Social network data mining (Twitter, Facebook)
- Use this data to build various search engines
- Visualize the data with various UIs (Gephi, maps, etc.)

Page 4: Using Cascalog and Hadoop for rapid graph processing and exploration

What we get

- Lots of nodes (users, pages, websites, words)
- Lots of edges (hyperlinks, comments, retweets, co-occurrences)
- These datasets are interconnected (Twitter users link to pages, words occur everywhere)

Page 7: Using Cascalog and Hadoop for rapid graph processing and exploration

The problem

- Collecting and processing this data as a graph is not the primary goal of our system
- But it is a very rich dataset that we want to explore for R&D purposes

Page 9: Using Cascalog and Hadoop for rapid graph processing and exploration

The constraints

- The graph processing should not compromise the rest of the system
- Low-maintenance
- Used for queries and rapid prototyping
- Flexible: it is hard to tell beforehand which fields or metadata will be used

Page 13: Using Cascalog and Hadoop for rapid graph processing and exploration

What is Cascalog

- Built on top of Hadoop and Cascading (workflow management)
- Inspired by the Datalog query syntax
- Hosted on the JVM through the Clojure language

Page 16: Using Cascalog and Hadoop for rapid graph processing and exploration

Hadoop for reliability and scalability

- Reliable and scalable
- Everything is dumped in text files; we reuse our existing rsyslog infrastructure
- We can reuse the existing Hadoop instances of our system
- No need to know beforehand about indexed fields or to have data in a perfectly uniform format

Page 20: Using Cascalog and Hadoop for rapid graph processing and exploration

Datalog for rapid prototyping

- Subset of Prolog
- Declarative, expressive and very concise way of writing queries
- Prolog has long been used for making queries over graphs

Page 23: Using Cascalog and Hadoop for rapid graph processing and exploration

Clojure for flexibility

- Only one language and one file for queries and business logic
- Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
- Allows complex algorithms to be concisely expressed

Page 26: Using Cascalog and Hadoop for rapid graph processing and exploration

The downsides

- Slow compared to in-memory computation or a non-distributed graph DB
- Cannot do real-time processing

Page 28: Using Cascalog and Hadoop for rapid graph processing and exploration

Use-cases

- Post-processing on a large number of edges
- Filtering or transforming a dataset before exporting to Gephi or Neo4j
- Back-processing old data with inconsistent fields and merging datasets from different sources

Page 31: Using Cascalog and Hadoop for rapid graph processing and exploration

Using Cascalog

- Declarative syntax
- Order of statements is arbitrary
- Syntax is LISP-like
- Operations are based on tuples
- Possibility to control the flow with custom operators (filter, mapcat, etc.)

Page 36: Using Cascalog and Hadoop for rapid graph processing and exploration

Anatomy of a Cascalog query (Aggregation)

Example (in-degree from cascalog.graph.core)

    ;; REPL setup assumed by the examples (the c/ alias is cascalog.ops)
    (use 'cascalog.api)
    (require '[cascalog.ops :as c])

    (defn in-degree                 ;; just a normal function
      "computes the in degrees"     ;; docstring
      [edges]
      (<-                           ;; returns a cascalog query
        [?dst ?in_d]                ;; returned tuple
        (edges ?dst _)              ;; destructuring on a generator
        (:distinct false)
        (c/count :> ?in_d)))        ;; infers aggregation on ?dst
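To sanity-check what an in-degree aggregation should return on a small sample, the same grouping can be mimicked in plain Clojure, without Hadoop or Cascalog. This is only a sketch under stated assumptions: `in-degree-local` and `sample-edges` are hypothetical names, the first field of each edge is assumed to be the destination (as in the `(edges ?dst _)` destructuring), and duplicate edges are assumed to count once each.

```clojure
;; Hypothetical sample: the first field of each edge is assumed to be
;; the destination, the second one the source.
(def sample-edges
  [["b" "a"]
   ["c" "a"]
   ["c" "b"]
   ["c" "b"]])   ; duplicate edge, counted twice in this sketch

(defn in-degree-local
  "Groups edges by their first field and counts the edges in each group.
  Returns a map from destination to in-degree."
  [edges]
  (frequencies (map first edges)))

(in-degree-local sample-edges)
;; => {"b" 1, "c" 3}
```

Node "a" has no incoming edge, so it does not appear in the result at all; a grouping-based aggregation only emits keys that occur in the input.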

Page 37: Using Cascalog and Hadoop for rapid graph processing and exploration

Anatomy of a Cascalog query (Filtering)

Example (filtering on in-degree)

    (defn filtered-nodes
      [edges threshold]
      ;; compute in-degree as a subquery
      (let [in-degrees (in-degree edges)]
        (<-
          [?node-id ?in-deg]
          ;; filters on computed in-degree
          (> ?in-deg threshold)
          ;; uses previous subquery as a generator
          (in-degrees ?node-id ?in-deg))))
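For comparison, here is a plain-Clojure sketch of the same filtering step applied to an in-memory map of in-degrees. The name `filtered-nodes-local` and the sample map are hypothetical; the predicate is the same strict `>` comparison used in the query.

```clojure
(defn filtered-nodes-local
  "Keeps the node/in-degree entries whose in-degree is strictly
  greater than threshold."
  [in-degrees threshold]
  (into {}
        (filter (fn [[_node deg]] (> deg threshold)))
        in-degrees))

;; Hypothetical in-degrees for three nodes.
(filtered-nodes-local {"a" 0, "b" 1, "c" 3} 1)
;; => {"c" 3}
```

Because the comparison is strict, a node whose in-degree equals the threshold ("b" here) is dropped.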

Page 38: Using Cascalog and Hadoop for rapid graph processing and exploration

Under the hood, this happens...

Example (using custom filter ops)

    (deffilterop over-threshold
      [deg threshold]
      (> deg threshold))

    (defn filtered-nodes
      [edges threshold]
      (let [in-degrees (in-degree edges)]
        (<-
          [?node-id ?in-deg]
          (in-degrees ?node-id ?in-deg)
          ;; use custom operator
          (over-threshold ?in-deg threshold))))

Page 39: Using Cascalog and Hadoop for rapid graph processing and exploration

Anatomy of a Cascalog query (Join)

Example (joining on heterogeneous datasets)

    (import 'java.net.URL)

    (defn get-website
      [url]
      (-> (URL. url)
          (.getHost)))

    (defn join-edges
      [backlinks rt]
      ;; join backlinks and retweets on the shared ?url
      (<-
        [?resolved]
        (backlinks ?src ?url)
        (rt _ ?url)
        (get-website ?url :> ?resolved)))
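The join works because the same variable, `?url`, appears in both generators, so only tuples that agree on it are kept. A rough plain-Clojure sketch of that inner join on in-memory data follows; `join-edges-local` and the sample datasets are hypothetical, and the field order of each dataset (`[src url]` and `[user url]`) is an assumption.

```clojure
(import 'java.net.URL)

(defn get-website
  "Extracts the host name from a URL string."
  [url]
  (.getHost (URL. url)))

;; Hypothetical samples: backlinks are [src url], retweets are [user url].
(def backlinks [["blog-a" "http://example.com/post/1"]
                ["blog-b" "http://example.org/page"]])
(def retweets  [["@alice" "http://example.com/post/1"]
                ["@bob"   "http://example.net/other"]])

(defn join-edges-local
  "Nested-loop inner join on the URL field: emits the host of every URL
  that appears in both datasets, once per matching pair."
  [backlinks retweets]
  (for [[_src url]     backlinks
        [_user rt-url] retweets
        :when (= url rt-url)]
    (get-website url)))

(join-edges-local backlinks retweets)
;; => ("example.com")
```

Only `http://example.com/post/1` is present on both sides, so it is the only URL whose host is emitted.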

Page 40: Using Cascalog and Hadoop for rapid graph processing and exploration

Further reading

- Cascalog home: https://github.com/nathanmarz/cascalog
- More advanced uses (PageRank and component detection): https://github.com/docteurZ/cascalog-contrib/tree/pagerank

Page 41: Using Cascalog and Hadoop for rapid graph processing and exploration

Thanks!

If you like this kind of problem, we’re hiring!
Contact us at [email protected]