Infovore: An Open Source MapReduce Framework for Processing Graph Data
DESCRIPTION
This talk describes Infovore, a tool that uses the Map/Reduce approach to clean up, filter, and combine RDF data sets to deliver purpose-built data sets for practical consumers of linked data.
TRANSCRIPT
![Page 1: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/1.jpg)
Infovore, an Open-Source Map/Reduce Framework for Processing Graph Data
Paul Houle
Ontology2
![Page 2: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/2.jpg)
![Page 3: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/3.jpg)
![Page 4: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/4.jpg)
![Page 5: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/5.jpg)
2+ billion facts, 20+ GB!
![Page 6: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/6.jpg)
the data your project needs
![Page 7: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/7.jpg)
Why handle complete data sets?
Quality Perimeter
Infovore
![Page 8: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/8.jpg)
RDF Tools vs. Invalid Triples
Image cc-by from arj03
![Page 9: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/9.jpg)
Scaling Limits of Triple Stores
[Diagram: several CPUs and Main Memory, with a random-access bottleneck at Hard Drive or Flash Storage]
![Page 10: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/10.jpg)
Map/Reduce conserves memory!
Image cc-by-sa from Anua22a
![Page 11: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/11.jpg)
Partitioning Data
md5("http://dbpedia.org/resource/Tree") = b78f8f508982ceb4e8dd3510fac75f62
… 330 331 332 333 334 335 …
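A minimal sketch of hash partitioning as this slide suggests it: hash the subject URI with MD5 and use the digest to pick a numbered bin, so every triple about the same subject lands in the same bin. The bin count (`NUM_PARTITIONS`), the modulo mapping, and the function names are assumptions for illustration; the slides do not say how Infovore turns the digest into a bin number.

```python
import hashlib

NUM_PARTITIONS = 1024  # assumption: the slides do not state the real bin count


def partition_for(subject_uri: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a subject URI to a bin number using its MD5 digest."""
    digest = hashlib.md5(subject_uri.encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and fold it into the bin range.
    return int(digest, 16) % num_partitions


def partition_triples(triples, num_partitions: int = NUM_PARTITIONS):
    """Group (subject, predicate, object) tuples into bins keyed by subject hash."""
    bins = {}
    for s, p, o in triples:
        bins.setdefault(partition_for(s, num_partitions), []).append((s, p, o))
    return bins
```

Because the bin depends only on the subject, each bin can be processed independently of the others.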
![Page 12: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/12.jpg)
If you really try it…
… 330 331 332 333 334 335 …
![Page 13: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/13.jpg)
Preprocessing Freebase
• Expand prefixes
• Remove
  • fbase:type.type.instance
  • fbase:type.type.expected_by
  • rdf:type w/ fbase:* subject
• Reverse
  • fbase:type.permission.controls
  • fbase:dataworld.gardening_hint.replaced_by
• Rewrite
  • fbase:type.object.type to rdf:type
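The predicate rules on this slide could be sketched roughly as follows, assuming triples are already prefix-expanded `(subject, predicate, object)` tuples. The namespace URI in `FBASE`, the reading of "Reverse" as swapping subject and object, and all names here are assumptions for illustration, not Infovore's actual code. The gardening-hint predicate is spelled as Freebase's `dataworld.gardening_hint.replaced_by`, and the rewrite target is the standard `rdf:type` property. The rule about type triples with an fbase:* subject is left out because its exact scope is not clear from the slide.

```python
FBASE = "http://rdf.freebase.com/ns/"  # assumed namespace URI for the fbase: prefix
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Predicates whose triples are dropped entirely.
REMOVE = {FBASE + "type.type.instance", FBASE + "type.type.expected_by"}
# Predicates whose triples get subject and object swapped (assumed meaning of "Reverse").
REVERSE = {
    FBASE + "type.permission.controls",
    FBASE + "dataworld.gardening_hint.replaced_by",
}
# Predicates that are renamed.
REWRITE = {FBASE + "type.object.type": RDF_TYPE}


def preprocess(triples):
    """Apply the remove / reverse / rewrite rules to a stream of triples."""
    for s, p, o in triples:
        if p in REMOVE:
            continue
        if p in REVERSE:
            s, o = o, s
        yield (s, REWRITE.get(p, p), o)
```

Each rule looks at one triple at a time, so a step like this fits naturally into a map phase.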
![Page 14: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/14.jpg)
Parallel Super Eyeball
![Page 15: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/15.jpg)
sort | uniq
:Surgeon a :Occupation .
:Surgeon rdfs:label "Surgeon"@en .
:Surgeon :mustHave :Md .
:Tree a :Plant .
:Tree rdfs:label "Tree"@en .
:Tree :has :Leaves .
:Victory a :AbstractConcept .
:Victory rdfs:label "Victory" .
:Victory :emotionalTone :Positive .
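As a rough illustration of why `sort | uniq` helps: once triple lines are sorted and deduplicated, all facts about one subject sit next to each other and can be handled one group at a time. The sketch below is an in-memory stand-in for the shell pipeline, assuming one non-empty triple per line with the subject as the first token; the function names are hypothetical.

```python
from itertools import groupby


def sort_uniq(lines):
    """In-memory equivalent of `sort | uniq` over triple lines."""
    return sorted(set(lines))


def by_subject(sorted_lines):
    """Yield (subject, [lines]) groups; the input must already be sorted."""
    # The subject is the first whitespace-delimited token of each line.
    for subject, group in groupby(sorted_lines, key=lambda line: line.split(None, 1)[0]):
        yield subject, list(group)
```

Only one subject's group needs to be held at a time while iterating, which is the property the memory argument in this talk relies on.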
![Page 16: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/16.jpg)
Huge scalability…
:Tree
:Victory
:Surgeon
Main memory
![Page 17: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/17.jpg)
Pig, Hadoop and All That…
Source: http://www.dbis.informatik.hu-berlin.de/forschung/projekte/query-optimization-in-rdf-databases.html
![Page 18: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/18.jpg)
Monitoring for Quality Control
Operational Statistics (RDF)
Preprocess Partition Clean Sort Classify Filter
![Page 19: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/19.jpg)
:basekb
![Page 20: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/20.jpg)
Parallel Loading into Triple Stores
… 330 331 332 333 334 335 …
OpenLink Virtuoso: 4x speedup
![Page 21: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/21.jpg)
:basekb lite
:Freebase
:Chosenfacts
:Rulebox
:Chosentopics
![Page 22: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/22.jpg)
rdf diff
![Page 23: Infovore: An Open Source MapReduce Framework For Processing Graph Data](https://reader034.vdocuments.us/reader034/viewer/2022042715/5599a62f1a28ab17698b479d/html5/thumbnails/23.jpg)
See for yourself
https://github.com/paulhoule/infovore/wiki