Using Cascalog and Hadoop for rapid graph processing and exploration

Nils Grunwald and Hugo Zanghi

Linkfluence

2012-02-05 - FOSDEM 2012 - Graph Devroom



Outline

Graph Analysis at Linkfluence

Why Cascalog

Introduction to Cascalog

Conclusion




What we do at Linkfluence

- Web data mining (blogs, media, etc.)
- Social network data mining (Twitter, Facebook)
- Use this data to build various search engines
- Visualize the data with various UIs (Gephi, maps, etc.)




What we get

- Lots of nodes (users, pages, websites, words)
- Lots of edges (hyperlinks, comments, retweets, co-occurrences)
- These datasets are interconnected (Twitter users link to pages, words occur everywhere)








The problem

- Collecting and processing this data as a graph is not the primary goal of our system
- But it is a very rich dataset that we want to explore for R&D purposes




The constraints

- The graph processing should not compromise the rest of the system
- Low-maintenance
- Used for queries and rapid prototyping
- Flexible: it is hard to tell beforehand which fields or metadata will be used








What is Cascalog

- Built on top of Hadoop and Cascading (workflow management)
- Inspired by the Datalog query syntax
- Hosted on the JVM by the Clojure language








Hadoop for reliability and scalability

- Reliable and scalable
- Everything is dumped to text files, so we reuse our existing rsyslog infrastructure
- We can reuse the Hadoop instances our system already runs
- No need to know beforehand which fields will be indexed, or to have data in a perfectly uniform format










Datalog for rapid prototyping

- Subset of Prolog
- Declarative, expressive and very concise way of writing queries
- Prolog has long been used for making queries over graphs









Clojure for flexibility

- Only one language and one file for queries and business logic
- Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
- Allows complex algorithms to be concisely expressed









The downsides

- Slow compared to in-memory computation or a non-distributed graph DB
- Cannot do real-time processing




Use-cases

- Post-processing on a large number of edges
- Filtering or transforming a dataset before exporting to Gephi or Neo4j
- Back-processing old data with inconsistent fields and merging datasets from different sources








Using Cascalog

- Declarative syntax
- Order of statements is arbitrary
- Syntax is LISP-like
- Operations are based on tuples
- Possibility to control the flow with custom operators (filter, mapcat, etc.)
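
The filter and mapcat operators mentioned above behave much like their plain Clojure counterparts applied to a sequence of tuples. The following is a minimal in-memory sketch of that idea, not Cascalog code: the `tweets` sample data and the `split-words` helper are hypothetical names introduced only for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample tuples of the form [user text]
(def tweets
  [["alice" "graph processing rocks"]
   ["bob"   "hadoop"]])

(defn split-words
  "Emits one [user word] tuple per word of the text.
  This is the mapcat pattern: several output tuples per input tuple."
  [[user text]]
  (for [w (str/split text #"\s+")]
    [user w]))

;; mapcat-style step: each tweet expands into user -> word edges
(def user-word-edges (mapcat split-words tweets))

;; filter-style step: keep only tuples whose word is longer than 5 characters
(def long-word-edges
  (filter (fn [[_ w]] (> (count w) 5)) user-word-edges))
```

Here `user-word-edges` holds four tuples and `long-word-edges` keeps only the "processing" and "hadoop" tuples. In a Cascalog query the same two steps would be expressed as predicates over tuple fields rather than as calls on an in-memory sequence.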








Anatomy of a Cascalog query (Aggregation)

Example (in-degree from cascalog.graph.core)

;; assumes (use 'cascalog.api) and (require '[cascalog.ops :as c])
(defn in-degree                ;; just a normal function
  "computes the in degrees"    ;; docstring
  [edges]
  (<-                          ;; returns a cascalog query
    [?dst ?in_d]               ;; returned tuple
    (edges ?dst _)             ;; destructuring on a generator
    (:distinct false)
    (c/count :> ?in_d)))       ;; infers aggregation on ?dst
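
To check what this query should return on a small sample, it can help to compute the same counts in memory. The sketch below is plain Clojure, not Cascalog. It assumes edges are stored as `[dst src]` pairs, because the slide binds `?dst` to the first field of each edge tuple; if your tuples are `[src dst]`, use `second` instead of `first`. It also assumes duplicate edge tuples are counted individually. `sample-edges` and `in-degree-in-memory` are hypothetical names.

```clojure
;; Hypothetical sample edges, stored as [dst src] to mirror (edges ?dst _)
(def sample-edges
  [["b" "a"]
   ["b" "c"]
   ["c" "a"]
   ["b" "a"]])   ; duplicate edge tuple, counted again (assumption)

(defn in-degree-in-memory
  "Counts how many edge tuples point at each destination node."
  [edges]
  (frequencies (map first edges)))

(in-degree-in-memory sample-edges)
;; => {"b" 3, "c" 1}
```

Grouping by destination and counting is all the aggregation does; Cascalog infers the grouping from the output variable `?dst`, whereas here `frequencies` makes it explicit.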




Anatomy of a Cascalog query (Filtering)

Example (filtering on in-degree)

(defn filtered-nodes
  [edges threshold]
  ;; compute in-degree as a subquery
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      ;; filters on computed in-degree
      (> ?in-deg threshold)
      ;; uses previous subquery as a generator
      (in-degrees ?node-id ?in-deg))))




Under the hood, this happens...

Example (using custom filter ops)

(deffilterop over-threshold
  [deg threshold]
  (> deg threshold))

(defn filtered-nodes
  [edges threshold]
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      (in-degrees ?node-id ?in-deg)
      ;; use custom operator
      (over-threshold ?in-deg threshold))))
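
Both versions of filtered-nodes should keep only the nodes whose in-degree is strictly greater than the threshold. A minimal in-memory sketch of that behaviour, useful for validating a query on a handful of edges, is shown here. It is plain Clojure rather than Cascalog, it assumes `[dst src]` edge tuples as in the in-degree example, and the `-in-memory` function names are hypothetical.

```clojure
(defn in-degree-in-memory
  "Counts how many edge tuples point at each destination node.
  Assumes edges are [dst src] pairs."
  [edges]
  (frequencies (map first edges)))

(defn filtered-nodes-in-memory
  "Keeps the nodes whose in-degree is strictly greater than threshold."
  [edges threshold]
  (->> (in-degree-in-memory edges)
       (filter (fn [[_ deg]] (> deg threshold)))
       (into {})))

(filtered-nodes-in-memory [["b" "a"] ["b" "c"] ["c" "a"]] 1)
;; => {"b" 2}
```

The composition mirrors the query: the in-degree computation plays the role of the subquery, and the `filter` call plays the role of the `>` predicate or the custom filter operator.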




Anatomy of a Cascalog query (Join)

Example (joining on heterogeneous datasets)

(import 'java.net.URL)

(defn get-website
  [url]
  (-> (URL. url)
      (.getHost)))

(defn join-edges
  [backlinks rt]
  ;; join both datasets on the shared ?url variable
  (<-
    [?resolved]
    (backlinks ?src ?url)
    (rt _ ?url)
    (get-website ?url :> ?resolved)))
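
The join above is implicit: using the same `?url` variable in both generators keeps only the URLs present in both datasets. A rough in-memory equivalent is sketched below, in plain Clojure rather than Cascalog. The sample tuples and `join-edges-in-memory` are hypothetical, and returning a set assumes the query yields distinct hosts.

```clojure
(import 'java.net.URL)
(require '[clojure.set :as set])

(defn get-website
  "Extracts the host name from a URL string."
  [url]
  (.getHost (URL. url)))

;; Hypothetical sample tuples: backlinks are [src url], retweets are [user url]
(def backlinks
  [["http://blog.example.org/post"  "http://news.example.com/a"]
   ["http://blog.example.org/other" "http://site.example.net/b"]])

(def rt
  [["alice" "http://news.example.com/a"]
   ["bob"   "http://unseen.example.io/c"]])

(defn join-edges-in-memory
  "Returns the hosts of URLs that appear in both datasets."
  [backlinks rt]
  (let [linked    (set (map second backlinks))
        retweeted (set (map second rt))]
    (set (map get-website (set/intersection linked retweeted)))))

(join-edges-in-memory backlinks rt)
;; => #{"news.example.com"}
```

Only the URL shared by both datasets survives the intersection, and `get-website` then maps it to its host, which is the role `?resolved` plays in the query.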




Further reading

- Cascalog home: https://github.com/nathanmarz/cascalog
- More advanced uses (PageRank and components detection): https://github.com/docteurZ/cascalog-contrib/tree/pagerank


Thanks!

If you like this kind of problem, we’re hiring! Contact us at [email protected]


