Managing Large RDF Graphs

Vaibhav Khadilkar, Dr. Bhavani Thuraisingham

Department of Computer Science, The University of Texas at Dallas

December 2008

Introduction

The Provost of the University of Texas at Dallas, Dr. B. Hobson Wildenthal, in conjunction with the Vice President for Research and Development, Dr. Bruce Gnade, committed the university to becoming a leader in emerging technologies, recognizing that it did not want to compete in legacy technologies.

After a detailed analysis of unsolved problems, the university committed to the Semantic Web and Cloud Computing as research areas. This choice was vetted with a large number of government and industrial clients and resulted in the creation of the Semantic Web Lab.

Our Projects on Semantic Web

Confidentiality, Privacy and Trust for the Semantic Web: Texas Enterprise Funds, 2005; NSF, 2007

Building Geospatial Semantic Web: Raytheon, 2006; NGA, 2007

Blackbook Experimentation: Texas Enterprise Funds, 2007

Ontology Mining (part of the Text Mining project): NASA, 2007

Assured Information Sharing: AFOSR MURI, 2008

Managing Large RDF Graphs and Ontology Homogenization: IARPA, 2008

Managing Large RDF Graphs

Current problems

The Semantic Web does not scale, which hinders reasoning and large graph processing.

Current work focuses on load balancing and fault tolerance, but the big bottleneck is memory.

Current systems can be broken with as few as 100,000 triples.

We work on load balancing and polynomial reasoning, but memory management breaks the systems before any of the other problems can be addressed.

Solution History

To solve this problem, we only need to look at history.

In the 1960s, Dijkstra invented the multiprocess operating system. This gave us general-purpose resource management for files and memory.

In the 1970s, efforts were directed at taking the general-purpose OS and placing database applications on top of it. The drawback was that these systems did not scale.

In the 1980s, Robert Epstein and Michael Stonebraker of UC Berkeley defined specific algorithms for database processing, such as LRU/MRU.

These principles are now accepted as a solved problem space, resulting in Oracle, MySQL and others.

Solution History (continued)

In 2001, we started with the Semantic Web.

Oracle, HP and others tried to apply database algorithms to graph processing.

We worked to expand resource management to use graph-specific algorithms.

The solution is constructed so that memory appears boundless (an infinite graph), with deterministic reads that are an order of magnitude slower than pure in-memory solutions.

[Figure: memory management (LRU/MRU) over graph nodes A, B and C]


Relevance of problem

This was an unsolved problem.

It is critical for handling the terabytes of data that are common today.

The approach virtualizes from memory space to disk space.

Tools Used: Jena

An open source Semantic Web framework used to build and manipulate large RDF graphs.

Also handles RDFS and OWL.

Provides the SPARQL query language and a rule-based inference engine.

Developed by HP Labs.

Can represent RDF graphs as a model.

Tools Used: Lucene

Lucene is a Java-based text search engine library.

It is suitable for any application and is platform independent.

It performs indexing and retrieval in a few milliseconds across terabytes of data.

Tools Used: MySQL

An open source RDBMS used with the relational database representations in Jena (RDB and SDB; TDB uses its own native storage).

An easy-to-use alternative to other RDBMSs.

In-memory Jena Model

This solution forms the basis of the one we will use for the RDB problem.

As nodes are added to the in-memory graph, memory fills up, so we can handle only medium-sized graphs.

Once memory is full, we get an out-of-memory exception that stops program execution.

We want to solve this out-of-memory problem.

Memory Management Algorithm: Graph representation

[Figure: example RDF graph. Labels shown: http://www.johnSmith.com, http://www.johnSmith.com/paper1, author, 35, Age, Time, 123-456-7890, Phone, ACM Society, Journal Society, Semantic Web Journal, Journal Name.]

Memory Management Algorithm

Graph representation

The graph is constructed in Jena by specifying nodes and their properties.

Triples are added in a monotonically increasing fashion.

Nodes may be accessed at any time (this is a key point in the algorithm).

Data structure used in the algorithm

Create an in-memory LRU-based cache.

For each node in the graph, store an index number, a timestamp for when it was last accessed, and the number of connections for that node.

Each time a node is accessed or a triple is added, update the associated cache entry.

This structure is used to determine the candidate node that will be written to disk.
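
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the project: the names `CacheEntry`, `NodeCache` and `eviction_candidate` are hypothetical, a logical counter stands in for the timestamp, and the selection rule (least recently accessed node first, ties broken in favor of the node with fewer connections) is one plausible way to combine recency and connectivity, since the slides do not give the exact rule.

```python
import itertools
from dataclasses import dataclass


@dataclass
class CacheEntry:
    index: int         # index number assigned when the node is first seen
    last_access: int   # logical timestamp of the most recent access
    connections: int   # number of triple endpoints that land on this node


class NodeCache:
    """Per-node bookkeeping used to pick the node to write to disk."""

    def __init__(self):
        self._entries = {}
        self._clock = itertools.count(1)    # logical clock, not wall time
        self._next_index = itertools.count(0)

    def _entry(self, node):
        # Create the cache entry the first time a node is seen.
        if node not in self._entries:
            self._entries[node] = CacheEntry(next(self._next_index), 0, 0)
        return self._entries[node]

    def access(self, node):
        """Record a read of `node` by refreshing its timestamp."""
        self._entry(node).last_access = next(self._clock)

    def add_triple(self, subject, predicate, obj):
        """Record a new triple: both endpoints share one timestamp
        and each gains one connection."""
        now = next(self._clock)
        for node in (subject, obj):
            entry = self._entry(node)
            entry.last_access = now
            entry.connections += 1

    def eviction_candidate(self):
        """Assumed rule: oldest access first, then fewest connections."""
        return min(
            self._entries,
            key=lambda n: (self._entries[n].last_access,
                           self._entries[n].connections),
        )


# Usage with node labels borrowed from the example graph.
cache = NodeCache()
cache.add_triple("http://www.johnSmith.com", "author",
                 "http://www.johnSmith.com/paper1")
cache.add_triple("http://www.johnSmith.com", "Age", "35")
cache.access("http://www.johnSmith.com/paper1")
candidate = cache.eviction_candidate()
print(candidate)  # prints 35
```

In this run the paper node was touched last, and the person node ties with "35" on timestamp but has two connections, so "35" is chosen as the node to write to disk.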

Memory Management Algorithm

Algorithm

We use the LIMIT clause in MySQL to get back only a part of the results at a time.

The triples retrieved are added to the revised in-memory Jena model.

This leverages the memory management algorithm for the in-memory model.

Since the revised in-memory model never runs out of memory, this RDB solution does not run out of memory either.
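
The paging idea can be illustrated with a short Python sketch. Several things here are assumptions made so that the example runs on its own: SQLite (from the Python standard library) is swapped in for MySQL, the `triples` table and its columns are hypothetical and do not reflect Jena's actual RDB schema, and a plain list stands in for the revised in-memory Jena model. The `LIMIT ... OFFSET ...` form used below is also accepted by MySQL.

```python
import sqlite3


def fetch_triples_in_pages(conn, page_size):
    """Yield lists of (subject, predicate, object) rows, page_size at a time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT subject, predicate, object FROM triples "
            "ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += len(rows)


# Hypothetical triple table, filled with five sample triples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples ("
    "id INTEGER PRIMARY KEY, subject TEXT, predicate TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
    [(f"node{i}", "linksTo", f"node{i + 1}") for i in range(5)],
)

model = []        # stand-in for the revised in-memory Jena model
page_sizes = []
for page in fetch_triples_in_pages(conn, page_size=2):
    page_sizes.append(len(page))
    model.extend(page)   # the revised model would spill to disk when full

print(page_sizes)  # prints [2, 2, 1]
```

Only one page of results is held by the query layer at any time, so the memory bound is set by the page size and by the in-memory model's own eviction policy rather than by the size of the result set.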

Conclusions from In-Memory Jena Model

As the threshold increases, the time required for the calculations decreases.

As the memory size increases, the time needed for the calculations increases, since more triples can be stored in memory.

A node in memory takes about 35 ms, whereas one cached to Lucene takes about 300 ms.

The goal is for usage patterns to pull from memory.
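
A small calculation shows why pulling from memory matters. The sketch below assumes that the two reported times (about 35 ms and about 300 ms) can be treated as average per-node costs and combined as a simple weighted average; the function name and the `memory_hit_ratio` parameter are hypothetical and are not part of the original measurements.

```python
MEMORY_MS = 35    # reported time for a node held in memory
LUCENE_MS = 300   # reported time for a node cached to Lucene


def expected_node_time_ms(memory_hit_ratio):
    """Weighted average cost per node for a given share of in-memory hits."""
    if not 0.0 <= memory_hit_ratio <= 1.0:
        raise ValueError("memory_hit_ratio must be between 0 and 1")
    return (memory_hit_ratio * MEMORY_MS
            + (1.0 - memory_hit_ratio) * LUCENE_MS)


for ratio in (0.5, 0.9, 0.99):
    print(f"{ratio:.0%} from memory: {expected_node_time_ms(ratio):.1f} ms")
```

Under these assumptions, a workload served half from memory averages roughly 167.5 ms per node, while one served 90% from memory averages roughly 61.5 ms.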

Conclusions from the RDB Jena Model

Database creation times are almost the same as with the original Jena implementation.

Database querying times vary depending on the threshold value set in the algorithm.

General Conclusions

Implemented an in-memory LRU/connectivity-based memory management algorithm.

It addresses the out-of-memory problem for the in-memory and RDB-based models in Jena by giving the user the impression of infinite memory.

Future Work

Implement the memory management algorithms for cloud computing.

Generalize the algorithm for all models.

Try various other memory management algorithms that affect usage.
