Project by: Anuj Shetye, Vinay Boddula
Project Overview
Introduction
Motivation
HBase
Our work
Evaluation
Related work
Future work and conclusion
Introduction
As RDF datasets keep growing, an RDF graph becomes much larger than a traditional graph:
the cardinality of its vertices and edges is much higher.
Large data stores are therefore required for:
Fast and efficient querying
Scalability
Motivation
Research has been done on mapping RDF datasets onto relational databases,
for example Virtuoso and Jena SDB, which map RDF triples into a relational database.
But the dataset is stored centrally, i.e. on a single server – a scalability problem.
Others store the RDF data as one large graph, but still on a single node,
for example Jena TDB – again a scalability problem.
HBase (contd.)
HBase is a NoSQL database:
High scalability, highly fault tolerant.
Fast reads and writes.
Dynamic database.
Integrated with Hadoop and other applications.
Column-family-oriented data layout.
Max data size: ~1 PB.
Read/write limits: millions of queries per second.
Who uses HBase/Bigtable?
Adobe, Facebook, Twitter, Yahoo, Gmail, Google Maps, etc.
Our Project
Our project creates a distributed data storage capability for RDF data using HBase.
We developed a system that takes the N-Triples file of an RDF graph as input and stores the triples in HBase as key-value pairs using MapReduce jobs.
The schema is simple:
we create a column family for each predicate
subjects are the row keys
objects are the values (see the loading sketch below)
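To make this concrete, here is a minimal, hedged sketch of such a map-only load job. It is not the project's actual code: the class names, the table name "rdf_triples", the naive whitespace-based N-Triples parsing, and the HBase 1.x+ client/mapreduce API are all assumptions made for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical loader: reads an N-Triples file line by line and writes each
// triple into HBase as  row key = subject, column family = predicate, value = object.
public class TripleLoader {

  public static class TripleMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    // HBase family names may not contain ':', so use the predicate's local name
    // (the text after the last '#' or '/') as the family, e.g. "hasAdvisor".
    private static String familyFor(String predicate) {
      String p = predicate.replaceAll("^<|>$", "");
      int cut = Math.max(p.lastIndexOf('#'), p.lastIndexOf('/'));
      return cut >= 0 ? p.substring(cut + 1) : p;
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Very rough N-Triples handling: "<subject> <predicate> <object> ."
      String[] parts = line.toString().trim().split("\\s+", 3);
      if (parts.length < 3) {
        return;                                   // skip blank or malformed lines
      }
      String subject = parts[0];
      String predicate = parts[1];
      String object = parts[2].replaceAll("\\s*\\.\\s*$", "");  // drop trailing " ."

      // row key = subject, column family = predicate, value = object
      Put put = new Put(Bytes.toBytes(subject));
      put.addColumn(Bytes.toBytes(familyFor(predicate)), Bytes.toBytes(""),
          Bytes.toBytes(object));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "ntriples-to-hbase");
    job.setJarByClass(TripleLoader.class);
    job.setMapperClass(TripleMapper.class);
    // Write Puts straight from the mapper into the (pre-created) table.
    TableMapReduceUtil.initTableReducerJob("rdf_triples", null, job);
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

One consequence of the one-family-per-predicate schema: HBase requires column families to be declared when the table is created (or added with an alter), so the set of predicates has to be known, or the table altered, before the load.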
Data Model
Row key   Data
Anuj      hasAdvisor: {'Dr. Miller'}   workedFor: {'UGA'}
Vinay     hasAdvisor: {'Dr. Ramaswamy'}   hasPapers: {'Paper 1', 'Paper 2'}   workedFor: {'IBM', 'UGA'}
Logical view as 'Records'.
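A short sketch of how one such logical record could be read back with the standard HBase client (assuming the HBase 1.x+ Connection/Table API and the table name "rdf_triples" from the loading sketch above; neither is taken from the slides):

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Fetch the logical record for row key "Vinay" and print predicate -> object pairs.
public class ReadRecord {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
      Get get = new Get(Bytes.toBytes("Vinay"));
      get.setMaxVersions();        // return all versions so multi-valued predicates
                                   // (e.g. both workedFor cells) show up
      Result result = table.get(get);
      for (Cell cell : result.rawCells()) {
        System.out.printf("Vinay %s %s%n",
            Bytes.toString(CellUtil.cloneFamily(cell)),   // predicate
            Bytes.toString(CellUtil.cloneValue(cell)));   // object
      }
    }
  }
}
```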
Data Model contd..
Physical Model

hasAdvisor column family:
Row Key   Column key   Timestamp   Value
Anuj      hasAdvisor   T1          Dr. Miller
Vinay     hasAdvisor   T2          Dr. Ramaswamy

hasPaper column family:
Row Key   Column key   Timestamp   Value
Vinay     hasPaper     T2          Paper1
Vinay     hasPaper     T1          Paper2

workedFor column family:
Row Key   Column key   Timestamp   Value
Anuj      workedFor    T1          'UGA'
Vinay     workedFor    T3          'UGA'
Vinay     workedFor    T2          'IBM'
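Because each column family is stored separately, a query that touches only one predicate only needs to read that family. A hedged sketch of such a per-family scan (again assuming the table name "rdf_triples" and the HBase 1.x+ client API):

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Scan only the 'workedFor' column family, mirroring the physical layout above.
public class ScanWorkedFor {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("workedFor"));   // read only this family's store files
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          for (Cell cell : row.rawCells()) {
            System.out.printf("%s workedFor %s (ts=%d)%n",
                Bytes.toString(row.getRow()),
                Bytes.toString(CellUtil.cloneValue(cell)),
                cell.getTimestamp());
          }
        }
      }
    }
  }
}
```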
Two major issues can be solved using HBase:
data insertion and data updates.
Versioning is possible via timestamps (see the sketch below).
Bulk loading of data comes in two types:
complete bulk load (HBase file formatter, our approach) and incremental bulk load.
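A minimal sketch of timestamp-based versioning with the plain client API (not the bulk-load path the slides mention; the table name, the numeric timestamps, and the assumption that the workedFor family keeps more than one version are all illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Insert and "update" the same cell with explicit timestamps; the older object
// stays readable as a version, which is how the data model above keeps both of
// Vinay's employers under the single 'workedFor' family.
public class VersionedPut {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("Vinay");
    byte[] family = Bytes.toBytes("workedFor");
    byte[] qualifier = Bytes.toBytes("");        // qualifier unused in this schema

    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {

      // Two writes to the same cell with explicit timestamps (T2 < T3).
      Put older = new Put(row).addColumn(family, qualifier, 2L, Bytes.toBytes("IBM"));
      Put newer = new Put(row).addColumn(family, qualifier, 3L, Bytes.toBytes("UGA"));
      table.put(older);
      table.put(newer);

      // Both values remain readable as versions, provided the family was created
      // with VERSIONS > 1; by default a Get returns only the newest one.
      Get get = new Get(row);
      get.setMaxVersions();
      Result result = table.get(get);
      System.out.println(result);                // shows IBM (ts=2) and UGA (ts=3)
    }
  }
}
```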
Related Work
CumulusRDF: Linked Data Management on Nested Key-Value Stores (SSWS 2011) works on distributed key-value indexing on data stores; it uses Cassandra as the data store.
Apache Cassandra is currently capable of storing RDF data and has an adapter to store data in a distributed management system.
Future Work and Conclusion
Our future work lies in developing an efficient interface for SPARQL, since querying HBase through SQL-like tools such as Hive is slower.
Testing of the system was done on a single node; testing it on multiple nodes would be the ultimate test of efficiency.