Lijiulie Chen
1207-2725 Melfa Rd.
Vancouver BC V6T1N4
July 28, 2009
Jane Pavelich
Kaiser 3036
The Fred Kaiser Building
2332 Main Mall Vancouver BC
V6T 1Z4
Dear Ms Pavelich:
Re: EECE 496 Formal Report
Enclosed is a copy of my EECE 496 final report, “A Distributed Graph Analysis Tool
to Study Implicit Relations in Tagging Systems”. The report presents the design and
implementation of a software package, describes the work done to explore parallel
database systems for the tool's data processing, and discusses the challenges
encountered and directions for future development.
Yours sincerely,
Keven Chen
Enclosure
A Distributed Graph Analysis Tool to Study
Implicit Relations in Tagging Systems
Name: Lijiulie Chen – 57007064
Technical Supervisor: Matei Ripeanu
Abstract
Tagging systems have been widely used to help people search, browse and share online
content. For my EECE 496 project, I designed and implemented a software package that
helps characterize the implicit relationships among users, items and tags. These
relationships are often modeled as a graph, where nodes are users, items or tags and
edges connect these elements if they are similar according to some metric such as the
Jaccard Index. The project focuses on the creation of an extensible software package that
enables the analysis of these graphs. Since the datasets of tagging systems are commonly
very large in a research setting, the software package is also implemented to meet a
time-efficiency requirement by utilizing a computer cluster to speed up the graph
building and analysis process. In this project, MySQL [1] (a traditional open source
relational database solution) and Hive [2] (a data store solution based on MapReduce
[3] that allows applications to express queries in SQL) are used and their performance
is evaluated. In summary, we observe that MySQL achieves superior performance
compared to Hive on the graph creation workload. Other database systems should be
explored for performance evaluation in the future, such as HadoopDB [4], which provides
a hybrid solution combining traditional databases and the MapReduce paradigm.
Table of Contents
A Distributed Graph Analysis Tool to Study Implicit Relations in Tagging Systems ..................... i
Abstract .................................................................................................................................. ii
Glossary ..................................................................................................................................... iii
List of Tables .............................................................................................................................. v
List of Figures ............................................................................................................................ vi
List of Abbreviations ................................................................................................................. vii
1.0 Introduction ........................................................................................................................... 1
2.0 Tools and Methodology ......................................................................................................... 3
2.1 Tools ................................................................................................................................. 3
2.1.1 Software Tools ............................................................................................................ 3
2.1.2 Deployment Environment ........................................................................................... 4
2.2 Design Concepts ................................................................................................................ 5
2.2.1 Design Patterns ........................................................................................................... 5
3.0 DESIGNS ............................................................................................................................. 8
3.1 Data Preparation & Database Setup ................................................................................... 9
3.2 Data Access Layer ........................................................................................................... 10
3.3 Graph Factory ................................................................................. 12
3.4 Graph Analysis ................................................................................................................ 14
4.0 RESULTS ........................................................................................................................... 16
4.1 Project Summary ............................................................................................................. 16
4.2 Performance Evaluation ................................................................................................... 17
4.4 Project Difficulties........................................................................................................... 19
4.5 Future Development ........................................................................................................ 20
5.0 CONCLUSION ................................................................................................................... 21
7.0 REFERENCES .................................................................................................................... 23
Glossary
Computer Cluster a group of linked computers that act like a single system and enable high
availability, load balancing and parallel processing
Node (machine) a device attached to a computer network or other telecommunications
network, such as a computer or router, or a point in a network topology
at which lines intersect or branch
Tag (metadata) a tag is a non-hierarchical keyword or term assigned to a piece of
information. This kind of metadata helps describe an item and allows it
to be found again by browsing or searching
Hadoop a free Java software framework that supports data intensive distributed
applications
Hive a data warehouse infrastructure built on top of Hadoop that provides
tools to enable easy data summarization, ad-hoc querying and analysis of
large datasets stored in Hadoop files
MapReduce a software framework introduced by Google to support distributed
computing on large data sets on clusters of computers
MySQL a relational database management system (RDBMS)
SQL a database computer language designed for managing data in relational
database management systems
HadoopDB A hybrid of DBMS and MapReduce technologies that targets analytical
workloads
List of Tables
Table 1: Development Tools Used ………...………………………………………..4
Table 2: Cluster Node Specification Chart .…………………………………………5
List of Figures
Figure 1: two common data-oriented parallel-processing schemes ………………………6
Figure 2: system diagram …………………………………………………………………9
Figure 3: UML diagram of the data access layer ………………………………………..10
Figure 4: UML diagram of the graph factory module …………………………………. 12
List of Abbreviations
JDBC Java Database Connectivity
UBC the University of British Columbia
NetSysLab Networked Systems Laboratory
API Application Programming Interface
1.0 Introduction
This report presents the design and evaluation of a data analysis tool for tagging systems.
The objective of this project is to deliver a software package that processes activity
traces of a tagging system (i.e., del.icio.us) and prepares them for further graph analysis.
The main focus of my investigation is on how to accomplish large-scale data processing
more efficiently. This software package will play a significant role in future
tagging-system data analysis research in NetSysLab at UBC.

Considering the increasingly large amount of data generated daily by users of tagging
systems, the graph analysis tool should be designed to cope with large-scale activity
traces. Hence, the goal of this project is to evaluate the efficiency of a parallel
processing paradigm like Hadoop relative to a traditional standalone MySQL solution.
To achieve this, the software is designed to be agnostic of the data storage solution,
making it easy to experiment with different database applications.
I have successfully built and tested a Java software package that processes tagging
activity records and constructs graph representations. The software is able to utilize either
MySQL or Hive as its underlying data storage to operate in either single-node or parallel-
processing (i.e., using a computer cluster) mode. I have also written two graph analysis
modules as samples for this software package. The package is extensible enough to allow
its users to exploit different data warehouse infrastructure and develop their own graph
analysis modules based on their needs.
The scope of this project includes all design and implementation of data processing and
deployment. However, modification of existing parallel processing implementations to
improve performance is outside the scope of the project because of time constraints.
This report is divided into three main sections. First, it provides an overview of the tools
and methodology used. Next, it explains the design and implementation of the software
package in detail. Finally, it discusses the results and future directions of this project.
2.0 Tools and Methodology
The implementation of the graph analysis package makes use of the computer cluster
located in NetSysLab. This section explains the equipment I used and the methodology
behind this approach.
2.1 Tools
The tools used in this project are split up into two categories: software and hardware. I
will describe each category separately in this section.
2.1.1 Software Tools
Most of the data analysis is written in Java, since it is architecturally neutral and
integrates well with SQL and Hadoop. The code I wrote on a 32-bit machine can be
seamlessly compiled and run on a 64-bit system. Another big advantage of Java is that it
is widely supported by the open source community. Thus, a variety of libraries are
available to support the development of the data analysis package, and I can easily
integrate my project with existing projects that others have already developed.
As alternative data storage layers, I used both Hive/Hadoop to leverage a cluster of
computers to run the data analysis in parallel, and MySQL that provides a traditional
relational database backend data store.
Apache Hadoop (Hadoop) is an open-source implementation of the MapReduce
framework that enables applications to run on a cluster. Hive is a data warehouse
framework built on top of Hadoop. It provides users with a SQL-like query language for
querying data. Hive also implements a JDBC interface, which enables programmers to
write Java classes that access the file system.
Table 1 describes the tools used for this project and their corresponding version.
Tool Version Description
Eclipse 3.3.2 Development environment for Java
Java 1.6.0_04 Core programming language
MySQL Database 5.0.45 Database for storing dataset
Hadoop 0.19.0 Java software framework for creating parallel
processing in the cluster
Hive 0.19 Data warehouse infrastructure for saving data to
Hadoop File System
Table 1 – Development Tools Used
2.1.2 Deployment Environment
The hardware environment for the deployment of the project is a 22-node computer
cluster provided by NetSysLab. All the nodes are interconnected by a PowerConnect™
6248 48-port Gigabit Ethernet Layer 3 switch, which should provide high availability
and improved performance compared to a single computer. The table below describes
the basic specification of each node in the cluster.
Intel® Xeon® CPU E5354 @ 2.33 GHz (quad-core machine)
4 GB RAM
800 GB hard drive
Table 2 - Cluster Node Specification Chart
2.2 Design Concepts
In this project, I apply software design patterns in the project architecture to ensure I am
following a pragmatic software design approach. I have also used parallel processing to
utilize more computing power. These design concepts are explained in the following
sections.
2.2.1 Design Patterns
In the design of the project I apply two design patterns to help make the data analysis tool
easy to extend and its modules properly decoupled. These design patterns are Factory
Method and Singleton [5]. I explain the intent and application of these patterns in the
sections immediately following, and my implementation of them in more detail in
Section 3.0 of the report.
2.2.1.1 Factory Method
This design pattern lets a client obtain instances of an object type without knowledge of
its concrete implementation. In other words, an object that implements an interface can
be created without specifying the exact class of the object that will be created. [5] It
allows greater extensibility because new subclasses can easily be added later without
many changes to the existing code. It also decouples the abstract interface from concrete
implementations.
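As a minimal sketch of the pattern as it is used in this report (the class and method names here are illustrative, not the project's actual code):

```java
// Factory Method sketch: the client codes against the Database
// interface and never names a concrete class.
interface Database {
    String describe();
}

class MySQLDatabase implements Database {
    public String describe() { return "MySQL backend"; }
}

class HiveDatabase implements Database {
    public String describe() { return "Hive backend"; }
}

public class DatabaseFactorySketch {
    // The factory decides which concrete class to instantiate,
    // e.g. based on a configuration value.
    public static Database create(String config) {
        if ("hive".equalsIgnoreCase(config)) {
            return new HiveDatabase();
        }
        return new MySQLDatabase();
    }
}
```

The client only ever sees the Database interface; supporting a third backend means adding one class and one branch in the factory, with no changes to the client modules.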
2.2.1.2 Singleton
This design pattern ensures that there is only one instance of a certain class during the
execution of the program, and that this instance is globally accessible to outside classes.
Restricting a class to a single instance allows strict control over that instance and permits
refinement operations on it. [5]
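The pattern can be sketched as follows (a hypothetical class, not the project's own):

```java
// Singleton sketch: one globally accessible instance, created lazily.
public class ConfigSingleton {
    private static ConfigSingleton instance;
    private int accessCount = 0;

    private ConfigSingleton() { }  // no outside instantiation

    // synchronized so concurrent first calls cannot create two instances
    public static synchronized ConfigSingleton getInstance() {
        if (instance == null) {
            instance = new ConfigSingleton();
        }
        return instance;
    }

    public int touch() { return ++accessCount; }
}
```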
2.2.2 Parallel Processing
The target data sets used in the graph analysis can easily exceed ten gigabytes, so I
considered utilizing a computer cluster for parallel processing. Research [6] shows that
there are two common data-oriented parallel-processing schemes: one uses centralized
data storage and the other uses distributed data storage. In this section of the report, I
explain the general concept of parallel processing and the difference between the two
parallel-processing approaches. Figure 1 illustrates the two schemes.
Figure 1 - two common data-oriented parallel-processing schemes
In the centralized data storage scheme, data is stored on a master node, but is processed
on each individual node in a cluster. The master node has to first send the data over the
network to a child node, and then let the child nodes complete required computations. In
the end, all the child nodes have to send back their results to the master node. The
performance of this approach is contingent on how fast the network is rather than how
fast each individual node can compute. In contrast, the distributed data storage scheme
relies more on computing power of the individual nodes.
In the distributed scheme, data is stored and computed on individual nodes. The master
node has prior knowledge of which node is hosting a particular part of the data set. The
master node will issue a computation command to each individual node upon a data
processing request. Each node then accesses the data stored on the node itself, completes
requested computation and sends the results back to the master node for result assembly.
Unlike the previous approach, the performance of the distributed scheme is contingent on
the number of child nodes in the cluster. For this project, I am using the distributed data
storage approach because it fits the application performance requirement.
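The distributed scheme can be illustrated in miniature with threads standing in for cluster nodes: each worker computes only over its own data partition, and the master merely assembles the partial results. This is a hypothetical single-machine sketch, not cluster code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DistributedSketch {
    // Each "node" holds its own partition and computes a partial sum;
    // the master only assembles results, never ships the raw data.
    public static long parallelSum(List<long[]> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (long[] partition : partitions) {
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (long v : partition) sum += v;
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> f : partials) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The `f.get()` calls are the synchronization points where the master waits for the workers, analogous to the result-assembly step described above.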
3.0 DESIGNS
The overall design of the data analysis tool is divided into four main modules: database
storage layer, database access layer, graph creator and graph analyzer.
In general, the data set is imported into either the MySQL database or the Hadoop File
System [7]. The database layer implements a factory method that creates a connection to
either data store, depending on the user's configuration, and returns a single database
interface to the client modules. The interface allows the client modules (i.e., graph
analyzer, graph factory) to perform basic operations against the database (i.e., load data,
execute queries). This architecture offers developers the freedom to implement new Java
classes to access other types of data storage without changing the client modules, as long
as they keep the same interface. Furthermore, the queries are saved in separate files and
are accessed by the classes from their file paths. In this way, users can easily modify the
queries without changing the code at all. Figure 2 illustrates the system architecture.
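Loading queries from separate files can be sketched as follows; the `${name}` placeholder syntax and class names are assumptions for illustration, not the project's actual convention:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class QueryLoader {
    // Load a query template from disk so users can edit the SQL
    // without recompiling any code.
    public static String load(Path queryFile) throws IOException {
        return new String(Files.readAllBytes(queryFile)).trim();
    }

    // Substitute named placeholders such as ${uid} in the template.
    public static String bind(String template, Map<String, String> params) {
        String result = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }
}
```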
I will talk about each module of the system in the next few sections, starting with initial
data preparation and database setup.
Figure 2 - System Diagram
3.1 Data Preparation & Database Setup
In order to start the graph analysis phase, raw data sets must first be processed into
workable forms and stored in a database. In this section, I explain how I prepared raw
traces of activity from popular tagging systems for processing. I also present some
optimization techniques used in setting up the database structure. This is a crucial initial
step for the project, since data processing at this scale requires time-efficient data
storage.

To start, I obtained the dataset from a research lab's website [8] in CSV (comma-
separated values) format. The dataset used strings of hash codes as indexes, which I
converted to integers for faster processing.
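The hash-to-integer conversion can be sketched as a dense id assignment (an illustrative helper, not the project's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class IdMapper {
    private final Map<String, Integer> ids = new HashMap<>();

    // Assign each distinct hash-code string a small dense integer id,
    // so later joins and comparisons work on integers instead of strings.
    public int idFor(String hash) {
        Integer id = ids.get(hash);
        if (id == null) {
            id = ids.size();        // next unused id: 0, 1, 2, ...
            ids.put(hash, id);
        }
        return id;
    }
}
```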
Two data storage layers are configured: MySQL and Hive. This enables different use
case scenarios and performance comparison.
To ensure each database is structured efficiently, in the MySQL database setup I chose
the B-TREE indexing method, an implementation of a tree data structure that enables
faster queries based on relational operations (e.g., =, >). In the Hive database setup, I
used a feature called replication, which makes multiple copies of the same data set.
Replication improves the availability of the dataset and reduces data-request queuing, as
independent queries that read the same data blocks can run on different data nodes in
parallel.
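For reference, the corresponding MySQL DDL can be assembled as below; the table and column names are illustrative, and `USING BTREE` simply makes MySQL's default index type explicit:

```java
public class IndexDdl {
    // Build a CREATE INDEX statement for one column. MySQL's default
    // index type is a B-tree; USING BTREE states that explicitly.
    // (Table and column names here are illustrative.)
    public static String createIndex(String table, String column) {
        return "CREATE INDEX idx_" + column + " ON " + table
             + " (" + column + ") USING BTREE";
    }
}
```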
After the initial data preparation and database setup, the data access layer comes into
play for the next stage of data processing. In the next section, I explain the role of the
data access layer and its implementation.
3.2 Data Access Layer
The data access layer creates connections to various databases and returns a single object
that allows the client modules (i.e., GraphCreator and GraphAnalyzer) to invoke data
storage operations. This layer consists of three parts: the database connection pool, the
database factory and the concrete database implementations. The class diagram of the
data access layer is shown in Figure 3.
[Figure 3 content: a class diagram showing the Database interface (executeBatch,
executeQuery, executeUpdate, getConnectionPool, shutdown), the ConnectionPool class
(createDBConnection, getAvailableConnection, releaseConnection, shutdown), the
DatabaseFactory class (getDatabaseInstance), and the HiveDatabaseLayer and
MySQLDatabaseLayer implementation classes.]
Figure 3 - UML diagram of the data access layer
As I mentioned in the design concepts section, a factory method is implemented to create
an instance of the database object. Users can configure which database storage to use in
the config.properties file. If an instance of the database has already been created, that
particular instance is returned. This singleton object ensures that only one database object
is instantiated, so the number of connections to the database storage can be easily
controlled within the database layer. The ConnectionPool class keeps track of the number
of connections to the database. The upper limit on connections is set by users. Once the
number of connections exceeds the upper limit, a process that requires a new connection
is put to sleep until another process releases a connection. A synchronized method is
implemented in this class to prevent the race condition that could otherwise occur in
multi-threaded processes. The third module in the data access layer is the concrete
database implementation. There are currently two database classes implemented, for
MySQL and Hive. MysqlDatabaseLayer is the class used to access data saved in MySQL
databases; it is limited to accessing one machine at a time. The HiveDatabaseLayer, on
the other hand, accesses Hive data storage, which scales with the number of machines
and is able to utilize the computational power of the whole cluster. However, this setup
did not deliver the expected throughput when we tested it. I discuss the performance of
the program in the later part of the report.
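The blocking behavior described above can be sketched with a small generic pool (a simplified stand-in for the report's ConnectionPool, not its actual code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

public class PoolSketch<T> {
    private final Deque<T> available = new ArrayDeque<>();
    private final int capacity;
    private int leased = 0;

    public PoolSketch(int capacity) { this.capacity = capacity; }

    // Blocks while all connections are leased, as the report's
    // ConnectionPool does; synchronized guards the counters against
    // races between competing threads.
    public synchronized T acquire(Supplier<T> factory) {
        try {
            while (leased >= capacity) {
                wait();  // sleep until release() notifies us
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        leased++;
        return available.isEmpty() ? factory.get() : available.pop();
    }

    public synchronized void release(T conn) {
        available.push(conn);
        leased--;
        notifyAll();  // wake any thread waiting in acquire()
    }
}
```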
3.3 Graph Factory
The graph factory module is responsible for computing the edges and edge weights of the
graph based on the activity trace. It consists of three main parts: the graph factory, the
edge creator and the concrete graph implementation. The class diagram of the graph
factory is illustrated in Figure 4 below.
[Figure 4 content: a class diagram showing the IGraph interface (buildNodes,
buildEdges, getEdges, getEdgeWeight, getNodes, getNodeCount, getRelatedNodes), the
Node class, the NodeType (USER, TAG, ITEM), EdgeType (UserTagEdge, TagTagEdge,
TagItemEdge) and GraphType enumerations, and the EdgeCreator, TagPathwayGraph
and GraphFactory classes.]
Figure 4 - UML diagram of the graph factory module
The graph factory uses the same design pattern as the database factory does. It creates
and returns an instance of a graph object which implements the IGraph interface.
TagPathwayGraph is the class that implements the IGraph interface. This graph type is
defined to illustrate the network formulation that connects users to items via a path
composed of tags (see Figure 5). It is a graph whose nodes are the set of users U, the set
of tags T and the set of items I; edges are directed and divided into three disjoint sets.
More formally, G = (U ∪ T ∪ I, Eut ∪ Ett ∪ Eti). In particular, Eut contains edges that
connect users to tags, Ett contains edges that connect tags to tags, and Eti is the set of
edges that connects tags to items. The edge weights can be defined, respectively, as the
frequency with which a user uses a tag, the co-occurrence frequency between tags, and
the frequency with which a tag is assigned to an item. [9]
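As an illustration of the first of these weight definitions, the user-to-tag weight for a single user can be computed from that user's tag assignments as the fraction of assignments using each tag (hypothetical helper names, not the project's code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EdgeWeights {
    // Given one user's tag assignments, compute the user->tag weight as
    // the fraction of that user's assignments that use each tag.
    public static Map<String, Double> userTagWeights(List<String> tagsUsedByUser) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : tagsUsedByUser) {
            counts.merge(tag, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        int total = tagsUsedByUser.size();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weights.put(e.getKey(), e.getValue() / (double) total);
        }
        return weights;
    }
}
```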
Figure 5 - the network composed of users that are connected to items through tag paths
The dataset we analyze consists of the tagging assignments from del.icio.us during 2003-
2006.
3.4 Graph Analysis
The graph analysis module is responsible for computing characteristics of the graph built
by the graph factory module. The graph analyzer is well separated from the concrete
graph implementation; therefore, it can perform the analysis regardless of the type of the
graph. This is achieved through the definition of the IGraph interface and the factory
methods.
I have currently implemented two graph analyzers: node centrality analysis and weight
distribution analysis. The node centrality analysis counts the degree centrality of each
node and returns the average or the distribution of degree centrality over the graph. This
analysis is an example of performing analysis on the basis of the IGraph interface. The
weight distribution analysis, on the other hand, is an example of doing analysis against
the database layer directly: it performs queries against the database through the database
layer.
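A degree-centrality average over a plain adjacency list can be sketched as follows (a hypothetical standalone version; the project's analyzer works through the IGraph interface):

```java
import java.util.List;
import java.util.Map;

public class DegreeCentrality {
    // Average degree centrality over an adjacency-list graph, where
    // degree(v) is the number of neighbours of v.
    public static double averageDegree(Map<Integer, List<Integer>> adjacency) {
        if (adjacency.isEmpty()) return 0.0;
        long totalDegree = 0;
        for (List<Integer> neighbours : adjacency.values()) {
            totalDegree += neighbours.size();
        }
        return totalDegree / (double) adjacency.size();
    }
}
```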
4.0 RESULTS
The project was successful in achieving most of its objectives. I have completed the
development of a software package that will enable future researchers to conduct graph
analysis easily. This software package not only abstracts away many of the low-level
database operations involved in processing tagging activity traces, but also provides a
simple abstract graph model that lets users without any database knowledge conduct
simple graph analysis. Furthermore, efforts were made to ensure the software package is
extensible enough for users to add their own graph model implementations easily and
run customized graph analysis with special SQL/HiveQL instructions. Most importantly,
I have explored ways to achieve all of the above in a time-efficient manner for large data
sets.
In this section, I will first summarize project deliverables as a whole. Then I will evaluate
the performance of this software package on different database infrastructures. Lastly, I
will briefly state the difficulties this project faced and discuss its future directions.
4.1 Project Summary
A database layer has been successfully implemented and is able to access two types of
database storage systems: MySQL and Hive. It is also extensible to other storage types,
which can be added by implementing the Database interface without changing any
existing code. Furthermore, a graph factory and an analyzer interface have been defined,
and one graph building tool and two graph analysis tools have been implemented and
successfully run against the MySQL database. Due to a bug in the Hive API, the project
could not be fully conducted on Hive; however, by using a smaller dataset, we were able
to measure the performance of the Hive platform. I also investigated the performance of
the Hive infrastructure on a cluster. In the following section, I discuss the performance
of the project.
4.2 Performance Evaluation
The performance of the system is divided into two parts: the graph creation part and the
graph analysis part.
For the test of the graph creation module, a 1.2 GB tag trace dataset from del.icio.us was
used. The task is to extract the three types of edges described earlier from the trace and
save them back to the database. The MySQL command used for building the edges of
one node is shown below. This particular statement builds user-tag edges: it finds the
distinct (uid, tagid) pairs for one uid, computes the frequency of appearance of each pair,
and inserts the results back into the database.
INSERT INTO userToTag (uid, tagid, weight)
SELECT A.uid, A.tagid, A.Count1 * 1.0 / B.Count2 AS Freq
FROM
  (SELECT uid, tagid, COUNT(*) AS Count1
   FROM delicious2003_2006_main
   WHERE uid = ?
   GROUP BY uid, tagid) AS A
INNER JOIN
  (SELECT uid, COUNT(*) AS Count2
   FROM delicious2003_2006_main
   WHERE uid = ?
   GROUP BY uid) AS B ON A.uid = B.uid;
The SQL query used by Hive is similar in logic to the one shown above, but with slightly
different syntax.

This task was performed against both MySQL on a single machine and Hive on a cluster
of 17 machines. MySQL takes less than 1 second to finish one query like the one shown
above and about 3 days to finish the whole task. Hive, on the other hand, takes about 55
seconds to finish the same query and an estimated 6 months to complete the whole task
(estimated as the time to build one edge multiplied by the number of edges to build).
For the test of the graph analysis module, the task is to find the distribution of the edge
weights in a 24 GB dataset. The query used is shown below.

SELECT COUNT(weight) FROM useruserweight WHERE weight <= ?

This query involves only a WHERE clause, which is simpler than the one used to build
the graph. A single execution takes MySQL about 180 seconds and Hive about 150
seconds, so Hive performs slightly better than MySQL this time. It is important to
highlight, though, that Hive used 16 more nodes.

There are several reasons why Hive on a cluster of 17 machines is outperformed by
MySQL on a single machine. First, to execute a query, Hive computes an execution plan
that attempts to divide the work into smaller parallel sub-tasks. This process takes several
seconds to complete, which becomes a large overhead when executing a relatively small
task. Second, the data transfer time between nodes is also a large overhead. Hive splits
one query task into several jobs and assigns them to the machines. The result of each job
is then sent back to a central node for assembly of either intermediate results, which
leads to a new job submission, or the final result. This MapReduce process creates more
overhead because it introduces synchronization points that cannot be parallelized. When
the time to perform each task is small, that overhead becomes significant. To minimize
this overhead, I configured the HDFS layer to use three replicas of the dataset to increase
data availability and decrease the need for reduce stages. This change improved
performance by only 5%, at the cost of three times the disk space.
I hypothesized that Hive would have a greater advantage over MySQL for even larger
datasets. However, for the datasets available to our investigation (i.e., around 20-50 GB),
there is probably not much difference in execution time, because the overhead of data
distribution and setup is simply too large. For smaller datasets (i.e., under 15 GB), it is
clear that MySQL on a single node is much faster. This performance evaluation can serve
as a rough benchmark for future users of this software package when deciding which
database system to use.
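Timing comparisons like the ones above can be reproduced with a small harness. The sketch below uses Python's DB-API with sqlite3 as a stand-in backend; the table name useruserweight and the parameterized threshold follow the query shown earlier, but substituting a real MySQL or Hive connection is left as an assumption.

```python
import sqlite3
import time

def time_query(conn, sql, params, runs=3):
    """Run a parameterized query several times and return the mean latency."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        total += time.perf_counter() - start
    return total / runs

# Stand-in database; a real benchmark would connect to MySQL or Hive instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE useruserweight (weight REAL)")
conn.executemany("INSERT INTO useruserweight VALUES (?)",
                 [(w / 100.0,) for w in range(1000)])

# The query under test, parameterized on the weight threshold.
latency = time_query(
    conn, "SELECT COUNT(weight) FROM useruserweight WHERE weight <= ?", (0.5,))
print(f"mean latency: {latency:.6f} s")
```

Averaging over several runs smooths out caching effects, which matters when the per-query time is only a few seconds.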
4.4 Project Difficulties
The main difficulty I encountered was parallelizing the computation, which involves
utilizing the computing power of multiple CPUs and distributing the dataset across
multiple hard drives. Beyond the program design and database setup that parallelization
requires, database optimization was also a challenge. Although I made many optimization
attempts on Hive, including techniques mentioned in a recently published comparison
[10], none produced a significant performance improvement. These difficulties delayed
the intended project timeline considerably; however, I believe the solution I arrived at
meets or even exceeds the expectations set for the project.
4.5 Future Development
Future development for this project can proceed along two directions. One is to enhance
the current code base and database setup; the other is to extend the current software to
conduct complex graph analysis.
The first aspect involves improving the performance of the existing database setups. I
believe single-node MySQL and Hive are not the best options available. Other database
systems, especially in the parallel-processing domain, should be explored and measured
on the various tasks involved (e.g., data loading and query execution). One such system is
HadoopDB [4]. These systems should be compared against single-node MySQL and Hive,
and the comparisons should be documented to help users pick the most appropriate
database for their datasets.
The second aspect involves extending the current code base. Users can build on the
software package to perform more complex graph analyses by writing their own graph
analysis implementations against the current interfaces. They may also extend the
package to build their own graph models (i.e., write their own graph factories).
Overall, this software package is designed with further extensions in mind. Thus, future
development should be smooth with the provided code structure and database setup.
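As a sketch of what such an extension might look like, the snippet below models the two extension points described above: a graph factory that builds a graph from tagging data, and a graph analysis that consumes the result. The class and method names here are hypothetical illustrations, not the package's actual interfaces.

```python
from abc import ABC, abstractmethod

class GraphFactory(ABC):
    """Hypothetical extension point: build a graph from tagging records."""
    @abstractmethod
    def build(self, records):
        """Return an adjacency map {node: {neighbor: weight}}."""

class GraphAnalysis(ABC):
    """Hypothetical extension point: compute a result from a graph."""
    @abstractmethod
    def run(self, graph):
        """Return the analysis result."""

class CoTagGraphFactory(GraphFactory):
    """Example factory: connect two users whenever they tagged the same item."""
    def build(self, records):
        graph, by_item = {}, {}
        for user, item in records:
            by_item.setdefault(item, set()).add(user)
            graph.setdefault(user, {})
        for users in by_item.values():
            for u in users:
                for v in users:
                    if u != v:
                        # Edge weight counts the items both users tagged.
                        graph[u][v] = graph[u].get(v, 0) + 1
        return graph

class DegreeAnalysis(GraphAnalysis):
    """Example analysis: report each node's degree."""
    def run(self, graph):
        return {node: len(neigh) for node, neigh in graph.items()}

records = [("alice", "photo1"), ("bob", "photo1"), ("carol", "photo2")]
graph = CoTagGraphFactory().build(records)
print(DegreeAnalysis().run(graph))
```

A new graph model or metric then only requires implementing one of the two abstract classes, leaving the rest of the pipeline untouched.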
5.0 CONCLUSION
This report presented the development of a software package that enables future users to
conduct complex graph analysis, and explained the design and the details of setting up the
different database systems. The end product of this project provides a shortcut for
researchers analyzing tagging-system data. It also includes a basic graph factory and two
simple graph analysis implementations, both of which serve as example extensions of the
package; users can follow these examples to build their own graph factories and graph
analysis implementations. The database performance challenges are partially addressed
by using a parallel-processing solution, Hive. However, its performance on research-sized
datasets is not very different from that of a non-parallel MySQL database. Hence, further
exploration of other parallel systems, such as MySQL Cluster or a hybrid approach (i.e.,
HadoopDB), is needed to find a more suitable system for research-sized datasets. I will
perform this task if time permits. The objective of delivering an efficient software
package that assists graph analysis research has been successfully achieved.
6.0 ACKNOWLEDGMENTS
I would like to thank Professor Matei Ripeanu for providing technical guidance for this
project. I would also like to thank Elizeu Santos-Neto for his continuous support and his
insightful comments and feedback for this report.
7.0 REFERENCES
[1] MySQL. http://www.mysql.com
[2] Hive. http://hadoop.apache.org/Hive/
[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on
Large Clusters," 2004.
[4] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and
Alexander Rasin, "HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads," 2009.
[5] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns:
Elements of Reusable Object-Oriented Software, pp. 107-127, Addison-Wesley,
1994.
[6] PC farm for ATLAS Tier 3 analysis. http://atlas-service-enews.web.cern.ch/atlas-
service-enews/2009/features_09/features_pcfarm.php
[7] Hadoop. http://hadoop.apache.org/
[8] PINTS experiments datasets. http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/
Research/DataSets/PINTSExperimentsDataSets/index_html
[9] Elizeu Santos-Neto et al., "Tracking User Attention in Collaborative Tagging
Communities," in CAMA 2007. http://www.ece.ubc.ca/~elizeus/CAMA07_elizeu.pdf
[10] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt,
Samuel Madden, and Michael Stonebraker, "A Comparison of Approaches to
Large-scale Data Analysis," 2009.