our projects on semantic web
DESCRIPTION
Managing Large RDF Graphs Vaibhav Khadilkar Dr. Bhavani Thuraisingham Department of Computer Science, The University of Texas at Dallas December 2008. - PowerPoint PPT PresentationTRANSCRIPT
Managing Large RDF Graphs
Vaibhav Khadilkar Dr. Bhavani Thuraisingham
Department of Computer Science,The University of Texas at Dallas
December 2008
Managing Large RDF Graphs
Vaibhav Khadilkar Dr. Bhavani Thuraisingham
Department of Computer Science,The University of Texas at Dallas
December 2008
Introduction
The Provost for the University of Texas at Dallas, Dr. B. Hobson Wildenthal, in conjunction with the Vice President for Research and Development, Dr. Bruce Gnade made a commitment on becoming a leader in emerging technologies recognizing that the university did not want to compete in legacy technologies.
After a detailed analysis and examination of unsolved problems the university committed to the Semantic Web and Cloud Computing as research areas. This was vetted through a large number of government and industrial clients. This resulted in the creation of the Semantic Web Lab.
Introduction
The Provost for the University of Texas at Dallas, Dr. B. Hobson Wildenthal, in conjunction with the Vice President for Research and Development, Dr. Bruce Gnade made a commitment on becoming a leader in emerging technologies recognizing that the university did not want to compete in legacy technologies.
After a detailed analysis and examination of unsolved problems the university committed to the Semantic Web and Cloud Computing as research areas. This was vetted through a large number of government and industrial clients. This resulted in the creation of the Semantic Web Lab.
Our Projects on Semantic WebOur Projects on Semantic Web Confidentiality, Privacy and Trust for the Semantic Web
Texas Enterprise Funds, 2005; NSF 2007
Building Geospatial Semantic Web Raytheon, 2006; NGA, 2007
Blackbook Experimentation Texas Enterprise Funds, 2007
Ontology Mining – part of Text mining project NASA 2007
Assured Information Sharing AFOSR MURI, 2008
Managing Large RDF Graphs and Ontology Homogenization IARPA, 2008
Confidentiality, Privacy and Trust for the Semantic Web Texas Enterprise Funds, 2005; NSF 2007
Building Geospatial Semantic Web Raytheon, 2006; NGA, 2007
Blackbook Experimentation Texas Enterprise Funds, 2007
Ontology Mining – part of Text mining project NASA 2007
Assured Information Sharing AFOSR MURI, 2008
Managing Large RDF Graphs and Ontology Homogenization IARPA, 2008
Managing Large RDF GraphsManaging Large RDF Graphs
Managing Large RDF GraphsManaging Large RDF Graphs
Current problems Semantic web does not scale Hinders ability to do reasoning and large graph
processing Current work focuses on load balancing and fault
tolerance, but the big bottleneck is memory Current systems can be broken with even 100,000
triples We work on load balancing and polynomial reasoning
but memory management breaks the systems even before any of the other problems can be addressed
Current problems Semantic web does not scale Hinders ability to do reasoning and large graph
processing Current work focuses on load balancing and fault
tolerance, but the big bottleneck is memory Current systems can be broken with even 100,000
triples We work on load balancing and polynomial reasoning
but memory management breaks the systems even before any of the other problems can be addressed
Solution History To solve this problem we only look at history In the 1960’s Dijkstra invented the multiprocess operating system This gave us general purpose resource management for files and memory In the 1970’s efforts were directed to taking the general purpose OS and placing database applications on top of them The drawback was that these systems did not scale In the 1980’s Robert Epstein and Michael Stonebreaker from UC Berkley defined specific algorithms for database processing like LRU/MRU These principles are accepted as a solved solution space resulting in ORACLE, MySQL and others
Solution History To solve this problem we only look at history In the 1960’s Dijkstra invented the multiprocess operating system This gave us general purpose resource management for files and memory In the 1970’s efforts were directed to taking the general purpose OS and placing database applications on top of them The drawback was that these systems did not scale In the 1980’s Robert Epstein and Michael Stonebreaker from UC Berkley defined specific algorithms for database processing like LRU/MRU These principles are accepted as a solved solution space resulting in ORACLE, MySQL and others
Managing Large RDF GraphsManaging Large RDF Graphs
Managing Large RDF GraphsManaging Large RDF Graphs Solution History
In 2001 we started with the Semantic Web Oracle, HP and others tried to apply database algorithms to graph processing We worked to expand resource management to use specific graph algorithms The solution is constructed so that memory is boundless (infinite graph) with deterministic reads that are an order of magnitude slower than pure memory solutions
Solution History In 2001 we started with the Semantic Web Oracle, HP and others tried to apply database algorithms to graph processing We worked to expand resource management to use specific graph algorithms The solution is constructed so that memory is boundless (infinite graph) with deterministic reads that are an order of magnitude slower than pure memory solutions
Mem MgtLRU/MRU
AB
C
Managing Large RDF GraphsManaging Large RDF Graphs
Relevance of problem This was an unsolved problem Critical in handling terabytes of data relevant in
today’s times Virtualize from memory space to disk space
Relevance of problem This was an unsolved problem Critical in handling terabytes of data relevant in
today’s times Virtualize from memory space to disk space
Managing Large RDF GraphsManaging Large RDF Graphs
Tools Used Jena
An open source Semantic Web framework used to build and manipulate large RDF graphs
Also gives the capability to handle RDFS and OWL Provides a query language SPARQL and a rule
based inference engine Developed by HP Labs Can represent RDF graphs as a model
Tools Used Jena
An open source Semantic Web framework used to build and manipulate large RDF graphs
Also gives the capability to handle RDFS and OWL Provides a query language SPARQL and a rule
based inference engine Developed by HP Labs Can represent RDF graphs as a model
Managing Large RDF GraphsManaging Large RDF Graphs
Tools Used Lucene
Lucene is a Java based text search engine library Is suitable for any application and is platform independent Does indexing and retrieval in a few milliseconds across
terabytes of data
MySQL An open source RDBMS used with the various database
representations in Jena (RDB, SDB, and, TDB) An easy to use alternative compared to other RDBMS’s
Tools Used Lucene
Lucene is a Java based text search engine library Is suitable for any application and is platform independent Does indexing and retrieval in a few milliseconds across
terabytes of data
MySQL An open source RDBMS used with the various database
representations in Jena (RDB, SDB, and, TDB) An easy to use alternative compared to other RDBMS’s
Managing Large RDF GraphsManaging Large RDF Graphs
In-memory Jena Model This solution formed the basis of the solution that we
will use for the RDB problem As nodes are added to the in-memory graph, memory
fills up Therefore we can handle medium sized graphs After a certain point when memory is full we get an out
of memory exception stopping program execution We want to solve this out of memory problem
In-memory Jena Model This solution formed the basis of the solution that we
will use for the RDB problem As nodes are added to the in-memory graph, memory
fills up Therefore we can handle medium sized graphs After a certain point when memory is full we get an out
of memory exception stopping program execution We want to solve this out of memory problem
Managing Large RDF GraphsManaging Large RDF Graphs
Memory Management Algorithm Graph representation
Memory Management Algorithm Graph representation
http://www.johnSmith.com
http://www.johnSmith.com/paper1
author
35
Age
Time
123-456-7890
Phone
ACM Society
Journal Society
Semantic Web Journal
Journal Name
Managing Large RDF GraphsManaging Large RDF Graphs Memory Management Algorithm
Graph Representation The graph is constructed in Jena by specifying nodes and their
properties. Triples are added in a monotonically increasing fashion. Nodes may be accessed at any time (this is a key point in the
algorithm) Data structure used in the algorithm
Create an in-memory LRU based cache For each node in the graph store an index number, a timestamp
value for when it was last accessed, and, the number of connections for that node
Each time the node is accessed or a triple added, update the associated cache entry
This structure will be used to determine the candidate node that will be written to disk
Memory Management Algorithm Graph Representation
The graph is constructed in Jena by specifying nodes and their properties.
Triples are added in a monotonically increasing fashion. Nodes may be accessed at any time (this is a key point in the
algorithm) Data structure used in the algorithm
Create an in-memory LRU based cache For each node in the graph store an index number, a timestamp
value for when it was last accessed, and, the number of connections for that node
Each time the node is accessed or a triple added, update the associated cache entry
This structure will be used to determine the candidate node that will be written to disk
Managing Large RDF GraphsManaging Large RDF Graphs
Memory Management Algorithm Algorithm
We use the LIMIT clause in MySQL to get back only a part of the results at a time
The triples retrieved are added to the revised in-memory Jena model
This leverages the memory management algorithm for the in-memory model
Since the revised in-memory model never runs out of memory this RDB solution does not run out of memory
Memory Management Algorithm Algorithm
We use the LIMIT clause in MySQL to get back only a part of the results at a time
The triples retrieved are added to the revised in-memory Jena model
This leverages the memory management algorithm for the in-memory model
Since the revised in-memory model never runs out of memory this RDB solution does not run out of memory
Managing Large RDF GraphsManaging Large RDF Graphs
Conclusions from In-Memory Jena Model As threshold increases the time required for the
calculations reduces As the memory size increases the time needed for the
calculations increases since more triples can be stored in memory
A node in memory takes about 35 ms whereas one cached to lucene takes about 300ms
The goal is for usage patterns to pull from memory.
Conclusions from In-Memory Jena Model As threshold increases the time required for the
calculations reduces As the memory size increases the time needed for the
calculations increases since more triples can be stored in memory
A node in memory takes about 35 ms whereas one cached to lucene takes about 300ms
The goal is for usage patterns to pull from memory.
Managing Large RDF GraphsManaging Large RDF Graphs
Conclusions from the RDF Jena Model Database creation times are almost the same as with the original
Jena implementation Database querying times vary depending upon the threshold value
set in the algorithm
General Conclusions Implemented an in-memory based LRU/Connectivity memory
management algorithm Solves the in-memory and RDB based models in Jena by creating
an infinite memory impression for the user
Conclusions from the RDF Jena Model Database creation times are almost the same as with the original
Jena implementation Database querying times vary depending upon the threshold value
set in the algorithm
General Conclusions Implemented an in-memory based LRU/Connectivity memory
management algorithm Solves the in-memory and RDB based models in Jena by creating
an infinite memory impression for the user
Managing Large RDF GraphsManaging Large RDF Graphs
Future Work Implement the memory management
algorithms for cloud computing Generalize the algorithm for all models Try various other memory management
algorithms which effect usage
Future Work Implement the memory management
algorithms for cloud computing Generalize the algorithm for all models Try various other memory management
algorithms which effect usage