apache giraph...agenda • a brief history of hadoop-based bigdata management • extracting graph...
TRANSCRIPT
![Page 1: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/1.jpg)
Apache Giraph!start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!
![Page 2: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/2.jpg)
Who’s this guy?
![Page 3: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/3.jpg)
Roman Shaposhnik • Sr. Manager at Pivotal Inc. building a team of ASF contributors • ASF junkie
• VP of Apache Incubator, former VP of Apache Bigtop • Hadoop/Sqoop/Giraph committer • contributor across the Hadoop ecosystem)
• Used to be root@Cloudera • Used to be a PHB at Yahoo! • Used to be a UNIX hacker at Sun microsystems
![Page 4: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/4.jpg)
Giraph in action (MEAP)
http://manning.com/martella/
![Page 5: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/5.jpg)
What’s this all about?
![Page 6: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/6.jpg)
Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads • Bulk Sequential Processing to the rescue • Apache Giraph: a Hadoop-based BSP graph analysis framework • Giraph application development • Demos! Code! Lots of it!
![Page 7: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/7.jpg)
On day one Doug created HDFS/MR
![Page 8: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/8.jpg)
Google papers • GFS (file system)
• distributed • replicated • non-POSIX
• MapReduce (computational framework) • distributed • batch-oriented (long jobs; final results) • data-gravity aware • designed for “embarrassingly parallel” algorithms
![Page 9: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/9.jpg)
One size doesn’t fit all • Key-value approach
• map is how we get the keys • shuffle is how we sort the keys • reduce is how we get to see all the values for a key
• Pipeline approach • Intermediate results in a pipeline need to be flushed to HDFS • A very particular “API” for working with your data
![Page 10: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/10.jpg)
It’s not about the size of your data;!it’s about what you do with it!
![Page 11: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/11.jpg)
Graph relationships • Entities in your data: tuples
• customer data • product data • interaction data
• Connection between entities: graphs • social network or my customers • clustering of customers vs. products
![Page 12: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/12.jpg)
Challenges • Data is dynamic
• No way of doing “schema on write” • Combinatorial explosion of datasets
• Relationships grow exponentially • Algorithms become
• explorative • iterative
![Page 13: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/13.jpg)
Graph databases • Plenty available
• Neo4J, Titan, etc. • Benefits
• Tightly integrate systems with few moving parts • High performance on known data sets
• Shortcomings • Don’t integrate with HDFS • Combine storage and computational layers • A sea of APIs
![Page 14: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/14.jpg)
Enter Apache Giraph
![Page 15: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/15.jpg)
Key insights • Keep state in memory for as long as needed • Leverage HDFS as a repository for unstructured data • Allow for maximum parallelism (shared nothing) • Allow for arbitrary communications • Leverage BSP approach
![Page 16: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/16.jpg)
Bulk Sequential Processing
time
communications
local processing
barrier #1
barrier #2
barrier #3
![Page 17: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/17.jpg)
BSP applied to graphs
@rhatr
@TheASF
@c0sin
Think like a vertex: • I know my local state • I know my neighbours • I can send messages to vertices • I can declare that I am done • I can mutate graph topology
![Page 18: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/18.jpg)
Bulk Sequential Processing
time
message delivery
individual vertex processing & sending messages
superstep #1
all vertices are “done”
superstep #2
![Page 19: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/19.jpg)
Giraph “Hello World” public class GiraphHelloWorld extends BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> { public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex, Iterable<NullWritable> messages) { System.out.println(“Hello world from the: “ + vertex.getId() + “ : “); for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) { System.out.println(“ “ + e.getTargetVertexId()); } System.out.println(“”); } }
![Page 20: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/20.jpg)
Mighty four of Giraph API
BasicComputation<IntWritable, // VertexID -- vertex ref IntWritable, // VertexData -- a vertex datum NullWritable, // EdgeData -- an edge label datum NullWritable>// MessageData -- message payload
![Page 21: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/21.jpg)
On circles and arrows • You don’t even need a graph to begin with!
• Well, ok you need at least one node • Dynamic extraction of relationships
• EdgeInputFormat • VetexInputFormat
• Full integration with Hadoop ecosystem • HBase/Accumulo, Gora, Hive/HCatalog
![Page 22: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/22.jpg)
Anatomy of Giraph run
3 1 2
1 2 1 3
1 2 1 3 3 1 2
HDFS mappers reducers HDFS
@rhatr
@TheASF
@c0sin
InputFormat
OutputForm
at
1
2
3
![Page 23: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/23.jpg)
Anatomy of Giraph run mappers or YARN containers
@rhatr
@TheASF
@c0sin
InputFormat
OutputForm
at
1
2
3
![Page 24: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/24.jpg)
A vertex view
MessageData1
VertexID
VertexData
MessageData2
![Page 25: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/25.jpg)
Turning Twitter into Facebook
@rhatr
@TheASF
@c0sin
rhatr
TheASF
c0sin
![Page 26: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/26.jpg)
Ping thy neighbours public void compute(Vertex<Text, DoubleWritable, DoubleWritable> vertex, Iterable<Text> ms ){ if (getSuperstep() == 0) { sendMessageToAllEdges(vertex, vertex.getId()); } else { for (Text m : ms) { if (vertex.getEdgeValue(m) == null) { vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE)); } } } vertex.voteToHalt(); }
![Page 27: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/27.jpg)
Demo time!
![Page 28: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/28.jpg)
But I don’t have a cluster! • Hadoop in pseudo-distributed mode
• All Hadoop services on the same host (different JVMs) • Hadoop-as-a-Service
• Amazon’s EMR, etc. • Hadoop in local mode
![Page 29: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/29.jpg)
Prerequisites • Apache Hadoop 1.2.1 • Apache Giraph 1.1.0-SNAPSHOT • Apache Maven 3.x • JDK 7+
![Page 30: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/30.jpg)
Setting things up $ curl hadoop.tar.gz | tar xzvf – $ git clone git://git.apache.org/giraph.git ; cd giraph $ mvn –Phadoop_1 package $ tar xzvf *dist*/*.tar.gz!!$ export HADOOP_HOME=/Users/shapor/dist/hadoop-1.2.1 $ export GIRAPH_HOME=/Users/shapor/dist/ $ export HADOOP_CONF_DIR=$GIRAPH_HOME/conf $ PATH=$HADOOP_HOME/bin:$GIRAPH_HOME/bin:$PATH
![Page 31: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/31.jpg)
Setting project up (maven) <dependencies> <dependency> <groupId>org.apache.giraph</groupId> <artifactId>giraph-core</artifactId> <version>1.1.0-SNAPSHOT</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>1.2.1</version> </dependency> </dependencies>
![Page 32: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/32.jpg)
Running it $ mvn package $ giraph target/*.jar giraph.GiraphHelloWorld \ -vip src/main/resources/1 \ -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \ -w 1 \ -ca giraph.SplitMasterWorker=false,giraph.logLevel=error
![Page 33: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/33.jpg)
Testing it public void testNumberOfVertices() throws Exception { GiraphConfiguration conf = new GiraphConfiguration(); conf.setComputationClass(GiraphHelloWorld.class); conf.setVertexInputFormatClass(TextDoubleDoubleAdjacencyListVertexInputFormat.class); … Iterable<String> results = InternalVertexRunner.run(conf, graphSeed); … }
![Page 34: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/34.jpg)
Simplified view mappers or YARN containers
@rhatr
@TheASF
@c0sin
InputFormat
OutputForm
at
1
2
3
![Page 35: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/35.jpg)
Master and master compute
@rhatr
@TheASF
@c0sin
1
2
3
Active Master compute()
Standby Master
Zookeeper
![Page 36: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/36.jpg)
Master compute • Runs before slave compute() • Has a global view • A place for aggregator manipulation
![Page 37: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/37.jpg)
Aggregators • “Shared variables” • Each vertex can push values to an aggregator • Master compute has full control
@rhatr
@TheASF
@c0sin
1
2
3
Active Master min: sum:
1 1
2 2
3 1
![Page 38: Apache Giraph...Agenda • A brief history of Hadoop-based bigdata management • Extracting graph relationships from unstructured data • A case for iterative and explorative workloads](https://reader033.vdocuments.us/reader033/viewer/2022042807/5f79bfd74558c37114005e6f/html5/thumbnails/38.jpg)
Questions?