How structured data (Linked Data) help in Big Data Analysis

Expand Patent Data with Linked Data Cloud

Lishan Zhang

Electrical Engineering and Computer Sciences

University of California at Berkeley

Technical Report No. UCB/EECS-2013-96

http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-96.html

May 17, 2013

Copyright © 2013, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

M.Eng Program

Lishan Zhang

24106243

Outline

Abstract
Introduction
Literature Review
    Unveil the underlying information among Big data
    Previous solutions
    Approaches
    Conclusion
Methodology
    SPARQL: query language for RDF data
    SPARQL Endpoint query
    HTTP request
    User Interface design
Discussion
    Results
        Explanation of Results
        What is different
        Limitation of this approach
    Evaluation
        User Study
        Heuristic Evaluation
    Future Work
Conclusions or Impact Statement
Bibliography
Appendix

Abstract

Big Data is currently a major topic worldwide. The term commonly describes data that exceeds the processing capacity of on-hand database management tools. Its characteristics are often summarized as the 4Vs: Volume, Variety, Velocity and Value. Big Data can be structured or unstructured, and it holds potential value. It is therefore vitally important to extract and analyze the valuable information in Big Data.

Linked Data, on the other hand, is a new concept for most people. It refers to a collection of interrelated datasets that can be published and shared on the web. Unlike Big Data, Linked Data is highly structured. It is used to build the Semantic Web, in which a huge amount of data on the web is available in a standard format. These technologies enable people to answer more advanced analytical questions by querying the data and drawing inferences using vocabularies.

In our project, we explore the potential use of Linked Data in analyzing Big Data. We build a search engine that combines information from Linked Data with patent data to see whether we can dig out more information about each patent. There is already a huge Linked Data cloud that contains a large amount of published open data, and we see the potential to connect this public data with patent data to answer advanced questions. When a user searches for an inventor name or a certain patent in the search interface, we query the Linked Data cloud and the patent database separately and return the combined result. In this way, we can combine the patent itself with the inventor information from DBpedia.

Introduction

Nowadays, we are generating much more data than at any point in history. The explosion of data is driven by two particular sources: social networks sharing information about our activities, and a variety of sensors collating information on our environment. [1]

Needless to say, there could be priceless value hidden in this booming data. If we make good use of it, we may discover valuable information and patterns inside the data. However, it will also become a threat if we cannot handle this ever-increasing amount of data.

Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. [2] Big Data is often identified by four main attributes: Volume, Velocity, Variety and Value. It can be structured or unstructured, and it holds potential value. The McKinsey Global Institute describes Big Data as “The next frontier for innovation, competition and productivity.” [3] But processing these big raw datasets poses challenges in both data management and algorithms. It is vitally important to extract and analyze the valuable information in Big Data.

The major difficulties in processing Big Data include capture, storage, search, sharing, analytics and visualization. [4] There are already several approaches to analyzing Big Data. For example, MapReduce is a programming model and an implementation for processing and generating large data sets. It runs on a large cluster of machines and is highly scalable. In addition, NoSQL systems employ non-relational data storage to process unstructured and semi-structured Big Data. Some institutes and companies have also developed their own mathematical models and algorithms to dig useful information out of Big Data.

We mainly focus on the variety of data in this thesis. Variety means that Big Data includes different types of data with various degrees of structure that do not fit into neat relational structures. It is a mix of structured, semi-structured and unstructured data such as text, sensor data, video, log files and more. Such data cannot be integrated into an application directly. [2]

Current approaches to Big Data, such as MapReduce and NoSQL, emphasize the ability to deal with volume and velocity. In this paper, we take a different approach: we are concerned with the variety of Big Data. Since most data is unstructured, it is hard to interlink different datasets and create valuable context from them. We see potential value in linking different datasets and expanding the value of a single dataset with the help of Linked Data.

Linked Data is used to organize and publish highly structured data with globally unique identifiers, which makes it easy to combine various datasets. Richard Cyganiak and Anja Jentzsch created the Linking Open Data cloud diagram, which shows how many datasets have been published on the web. [5] As the Linked Data cloud grows constantly, data integration is becoming more important in this field.

Fig 1: The Linking Open Data cloud diagram

In this paper, we try to determine the potential use of Linked Data in Big Data analysis by building a prototype of our concept. We use the U.S. utility patent dataset and link it with the public Linked Data cloud. We build a search engine for Patent Graph search that queries endpoints in the Linked Data cloud, such as DBpedia and Freebase, simultaneously queries the SQL data in the patent dataset, and shows the combined results in the interface. The diagram below illustrates the querying process:

Fig 2: The querying process of Patent Search Engine

In this way we can add more related information about a patent and even provide some recommendations for patent search. We expect this interconnection to create considerable value, and we believe Linked Data will become increasingly valued in Big Data analysis.

Literature Review

Unveil the underlying information among Big data

Big Data has become one of the hottest topics in the industry. In this data-booming world, some traditional technologies can no longer serve the need to analyze large volumes of data. New approaches must be introduced in order to keep up with the pace of Big Data. The Linked Data concept is a useful way to unveil valuable information, especially in data on the Internet.

Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. We are generating much more data than before with the boom of social networks and media, mobile devices, Internet transactions, and networked devices and sensors.

Big Data is too big, too fast and does not fit conventional database architectures. Due to the unique nature of Big Data, the first question we need to answer is whether we can find an alternative way to process the data. More importantly, can we dig useful information out of Big Data?

Big Data requires exceptional technologies to efficiently process large quantities of data. A huge number of valuable patterns and a great deal of information are hidden in Big Data, waiting to be extracted. Usually, four problems arise when it comes to Big Data: Volume, Velocity, Variety and Value (4V) [6].

Volume and Velocity

In this data-booming world, data grows exponentially. In particular, with the increasing popularity of social media, user-generated content has started to dominate. For example, roughly 60 hours of video are uploaded to YouTube every minute [7]. It is also astonishing that over 340 million tweets were generated daily in May 2012 [8]. To put this in perspective, the amount of information in the world doubles every five years [9]. There is more information in the daily edition of The New York Times than an individual man or woman in the 16th century had to process in their whole life.

A huge amount of data requires tremendous storage space and extremely fast processing. Dealing with it has always been challenging for any company, government or individual.

Variety and Value

Big Data relates not just to new information sources: it is equally applicable to gaining new insights from data that was previously inaccessible and to accelerating and easing existing analytical processes [10]. In fact, most Big Data is of low value until it is rolled up and analyzed, at which point it becomes valuable.

This is challenging because of Big Data’s variety. Big Data has different structures and shapes, which makes it very difficult to analyze with traditional technologies such as MySQL or Oracle. Integrating these data sources is a very expensive operation [11]. In addition, correlating different pieces of data and reconnecting them to make them more valuable, readable and accessible has always been an interesting problem.

Previous solutions

There have been several ways of processing and analyzing Big Data. Usually, they use advanced hardware and parallel processing techniques to break the speed bottleneck. Others have employed non-relational data storage systems to deal with unstructured and semi-structured Big Data. Meanwhile, many companies have been trying to apply unique mathematical models, advanced analytics and data visualization technology to extract insights from Big Data.

Approaches

MapReduce

MapReduce is a breakthrough concept announced by Google. It is a programming model and an implementation for processing and generating large data sets [12]. It runs on a large cluster of machines and is highly scalable.

MapReduce is not only successful at Google; it is also available to the public as open source in Hadoop, a highly scalable compute and storage platform [13]. Hadoop breaks a huge chunk of data into pieces and processes and analyzes them at the same time.
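The programming model can be illustrated with a minimal single-process sketch in Python. This is neither Google's implementation nor Hadoop; the function names `map_phase`, `shuffle` and `reduce_phase` and the word-count task are assumptions chosen only for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster, each document (or split) would be mapped on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "linked data is structured"])
```

Because each map call looks at only one record and each reduce call at only one key, both phases can be distributed across machines, which is where the scalability comes from.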

NoSQL

The term NoSQL was first used by Carlo Strozzi for a database that did not expose the standard SQL interface [14]. NoSQL works in conjunction with Hadoop to serve up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications [15].

Digging useful information

Various companies have taken action to dig useful information out of the varied data on the web. For example, Splunk is a small company that has been in the business for less than 5 years. Splunk’s mission is to make ambiguous Big Data more readable, useful and valuable to everyone. For example, one of its partners, Amazon, is asking Splunk to find out the habits of its customers.

Another company, Jive, is a software company in the social business software industry. It is also trying to help its customers consolidate the Big Data they are dealing with. One example is the price information of all merchandise: what price should be set in order to be the best price.

Downsides

However, none of these approaches is perfect. For example, Hadoop is a very young technology that is still developing. A Hadoop system is very hard to manage, and it does not support real-time data processing and analysis.

The drawback of NoSQL, on the other hand, is that most NoSQL databases trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. NoSQL also suffers from its ‘youth’: it has no mature management and monitoring tools.

Conclusion

Key results

Big Data holds tremendous value, and it is beneficial to understand what it really means. Many new technologies, such as MapReduce and NoSQL, have been applied to this problem. However, it is never safe to say that we already have the perfect tools for the job. As data continues to grow exponentially, new technologies such as Linked Data will be key to the next generation of analytics platforms and data management systems.

Shortcomings

Linked Data applications usually follow different architectures and patterns. For instance, one pattern requires the data to be replicated, so applications may work with stale data. Another pattern, the On-The-Fly Dereferencing pattern, works very slowly when dealing with complex operations.

Additional Work

Because Linked Data is a newly raised concept that is still undergoing many improvements, some of these disadvantages will be resolved soon. However, the fact that we are in a data-exploding era cannot be reversed. More and more data is coming to us, and the technology must keep evolving in order to keep up with the pace.

Artificial intelligence can be applied when dealing with Big Data. A ‘databot’ that can crawl Linked Data, infer relationships, and figure out what information can be extracted would be very useful.

Methodology

For this thesis, we build a use case in order to determine the potential use of Linked Data for patent data. More specifically, we build a search engine that we named “Patent Graph”. When people type a patent number or an inventor name, we show them relevant information such as a picture of the inventor, the inventor’s workplace, alma mater, doctoral advisor and biography. This information is obtained from DBpedia, which provides structured data extracted from Wikipedia. DBpedia makes this information available on the web so that people can easily link to the data. In addition, users can start a new search around the result simply by clicking the related information on the page. For example, if we are interested in a co-worker or the advisor of the inventor of the patent we searched for, we can just click the name, and a new search around that person and their patents is returned. We can also provide recommendations based on the search results. If time allows, we would also like to convert the patent data into RDF format and publish it on the web so that more people can benefit from it. In this way, Linked Data helps us analyze the patent data by expanding our patent datasets with related data and finding more useful information.

The patent data that we use is the Patent Inventor Database from the Fung Institute. The database disambiguates all inventor names in the U.S. utility patent database from 1979 to 2010. The Linked Data we use is DBpedia. The DBpedia dataset extracts structured content from the information created in Wikipedia, and it can be accessed online through a SPARQL query endpoint.

Since we are building a search engine that extracts information from both the Linked Data cloud and a relational database, we build it as a web service and use a Model-View-Controller (MVC) software architecture.

My part of the work includes implementing the search interface and querying the Linked Data cloud. The techniques involved are SPARQL endpoint queries, HTTP requests and user interface design.

SPARQL: query language for RDF data

Resource Description Framework (RDF) is a directed, labeled graph data format for describing resources on the web. It is designed to be read and understood by computers rather than people. Most RDF documents are written in XML, which can easily be exchanged between different computers and platforms. RDF is also a part of “The Semantic Web”. The Semantic Web is a set of standards and best practices for sharing data and the semantics of that data over the web for use by applications. [16] Rather than just putting data on the web, the Semantic Web is about making links so that a person or machine can explore the web of data. [17]

An RDF statement is defined as a triple of the form (Subject, Predicate, Object), and RDF uses uniform resource identifiers (URIs) to name the data objects. For example, to express “Tom is a man”, we represent it as Tom (Subject), sex (Predicate), man (Object). The data stored in the Linked Data cloud is RDF data.
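As a minimal sketch of this data model, the triples can be held in a plain Python list. The `http://example.org/` namespace, the extra facts about Ann and the helper `describe` are hypothetical and serve only to illustrate the (Subject, Predicate, Object) shape.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

triples = [
    Triple(EX + "Tom", EX + "sex", EX + "man"),      # "Tom is a man"
    Triple(EX + "Tom", EX + "knows", EX + "Ann"),    # an edge to another resource
    Triple(EX + "Ann", EX + "sex", EX + "woman"),
]

def describe(subject, triples):
    """Collect every (predicate, object) pair attached to one subject."""
    return [(t.predicate, t.object) for t in triples if t.subject == subject]

tom = describe(EX + "Tom", triples)
```

Because every term is a URI, the object of one triple (here Ann) can be the subject of another, which is what turns a list of statements into a graph.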

SPARQL stands for SPARQL Protocol and RDF Query Language. SPARQL is a standard query language designed for querying RDF databases. There are four different query forms in SPARQL: SELECT, ASK, DESCRIBE and CONSTRUCT; we use the SELECT form most of the time. [18] The main idea of SPARQL is pattern matching, so it is easy to traverse relationships by querying collections of triples. The syntax of SPARQL is quite similar to SQL. A simple SPARQL query example is as follows:

    PREFIX dbont: <http://dbpedia.org/ontology/>
    SELECT ?musician ?place
    WHERE {
        ?musician dbont:birthPlace ?place .
    }

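First we declare a namespace prefix, in this case http://dbpedia.org/ontology/. The query then finds every resource that has a birth place, binds the resource to ?musician and the birth place to ?place, and returns both. Note that the variable name alone does not restrict the results to musicians: the pattern matches any resource with a dbont:birthPlace value, so an additional pattern such as ?musician a dbont:MusicalArtist would be needed to return musicians only. Part of the result is shown below. We can run the example query at the DBpedia endpoint to get the full list.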

musician                                                place
http://dbpedia.org/resource/Federico_Garc%C3%ADa_Lorca  http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Trinidad_Jim%C3%A9nez       http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Ibn_Tufail                  http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Fran_Perea                  http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Ver%C3%B3nica_S%C3%A1nchez  http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Berni_Rodr%C3%ADguez        http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Jos%C3%A9_Celestino_Mutis   http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Pepe_Marchena               http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Antonio_de_Olivares         http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Tanya_Anne_Crosby           http://dbpedia.org/resource/Andalusia
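The pattern-matching idea behind the SELECT query above can be sketched in plain Python. This is not how a SPARQL engine is implemented; the function `match` and the tiny hand-made graph (including the `occupation` triple, which is illustrative and not taken from DBpedia) are assumptions that only show how a triple pattern with variables produces rows of bindings.

```python
def match(pattern, triples):
    """Match one triple pattern against a list of (s, p, o) tuples.

    Terms starting with '?' are variables; all other terms must match exactly.
    Returns one dict of variable bindings per matching triple.
    """
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable that appears twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed, so this triple is a solution.
            solutions.append(binding)
    return solutions

DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"

# A tiny hand-made graph; the second triple uses a different predicate.
graph = [
    (DBR + "Pepe_Marchena", DBO + "birthPlace", DBR + "Andalusia"),
    (DBR + "Pepe_Marchena", DBO + "occupation", DBR + "Singer"),
    (DBR + "Ibn_Tufail", DBO + "birthPlace", DBR + "Andalusia"),
]

rows = match(("?musician", DBO + "birthPlace", "?place"), graph)
```

Each row corresponds to one line of the result table: the fixed predicate filters the triples, and the variables pick out the columns.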

SPARQL Endpoint query

An endpoint is an association between a fully specified interface binding and a network address, specified by a URI. It is used to communicate with an instance of a web service. An endpoint indicates a specific location for accessing a web service using a specific protocol and data format. [19] A SPARQL endpoint enables users to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats such as XML or JSON. For simplicity, we can say that a SPARQL endpoint is the place where you send your SPARQL query and receive the result.
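As a rough illustration of "sending a query to an endpoint", the sketch below builds the URL for an HTTP GET request without actually contacting the server. It relies on the SPARQL protocol convention that the query text travels in a `query` parameter; the helper name `build_get_url` and the sample query are assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_get_url(endpoint, query):
    """Return the URL for a SPARQL query sent via HTTP GET.

    The SPARQL protocol passes the query text in the 'query' parameter;
    urlencode percent-encodes characters such as '?', '{' and spaces.
    """
    return endpoint + "?" + urlencode({"query": query})

query = "SELECT ?place WHERE { ?s <http://dbpedia.org/ontology/birthPlace> ?place } LIMIT 5"
url = build_get_url("http://dbpedia.org/sparql", query)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Opening such a URL in a browser or with any HTTP client is all that is needed to run the query, which is why an endpoint can be used from almost any programming environment.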

Commonly used SPARQL endpoints are listed below (SparqlEndpoints, 2013):

Data Source    Endpoint Address
DBpedia        http://dbpedia.org/sparql
U.S. Census    http://www.rdfabout.com/sparql
FactForge      http://factforge.net/sparql
data.gov.uk    http://data.gov.uk/sparql

In our project, we need to query the biographical information of the patent inventor from DBpedia through a SPARQL endpoint query. The information about a person is the same as what we see in Wikipedia, but in a different format. For example, for our professor David A. Patterson, the Wikipedia page and the DBpedia page are shown below. We can see that they represent the same information quite differently. In DBpedia, the data is machine-readable, and each value can be read from the property listed on the left side. We just need to select the properties we need in the SPARQL query to get the corresponding values more conveniently.

Fig 3: Screenshot of an example of Wikipedia

Fig 4: Screenshot of an example of DBpedia

HTTP request

The Hypertext Transfer Protocol (HTTP) works as a request-response protocol between a client and a server. An HTTP request consists of a request method, a request URL, header fields and a body. The request methods are GET, HEAD, POST, PUT, DELETE, OPTIONS and TRACE. [20] The two most commonly used HTTP request methods are GET and POST. While these two methods serve similar purposes, GET requests data from a specified resource, whereas POST submits data to be processed to a specified resource. We use the POST method here to avoid caching.

In our case, the client is the search interface, which uses JavaScript to submit an HTTP request containing the SPARQL query to the server endpoint. The server then returns a response to the client. The response contains status and content information about the request. Considering that JavaScript is not good at dealing with RDF data, we set the return format to JSON.
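The request and response handling can be sketched as follows. The prototype does this in JavaScript; Python is used here only for illustration, and the helper names `build_post_request` and `parse_results` are assumptions. The request is prepared but not sent, and the sample response is hand-written in the standard SPARQL JSON results layout rather than copied from a real server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_post_request(endpoint, query):
    """Prepare (but do not send) a POST request that asks for JSON results."""
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "application/sparql-results+json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")

def parse_results(json_text):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(json_text)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]

request = build_post_request(
    "http://dbpedia.org/sparql", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
)

# Hand-written sample in the SPARQL JSON results layout (not a real response).
sample = '''{"head": {"vars": ["place"]},
 "results": {"bindings": [
   {"place": {"type": "uri", "value": "http://dbpedia.org/resource/Andalusia"}}]}}'''
rows = parse_results(sample)
```

Passing `request` to `urllib.request.urlopen` would perform the actual call; the flattened rows are then easy to hand to a template or user interface.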

User Interface design

The User Interface (UI) design for our prototype is simple and clean. It looks like a simplified Wikipedia. We query both the patent data and the Linked Data cloud and display the output in the interface. The User Interface is structured as the patent information surrounded by information about the inventor of the patent. The screenshot is shown below:

Fig 5: Screenshot of Patent Search Interface

The left side contains the inventor’s basic information, including profile picture, workplace, alma mater and doctoral advisor. The upper right side is a biography of the inventor, followed by the inventor’s patent information retrieved from the relational database. Clicking a link on the left side leads to the corresponding Wikipedia page for more information. The UI design emphasizes the patent part while surrounding it with the relevant information.

The procedure

The procedure works as follows:

On the client side, when people search for a keyword, an HTTP request message is sent to the DBpedia web server. We wrote a wrapper class, “SPARQLWrapper.js”, in JavaScript that is similar to the SPARQL Endpoint interface to Python. [21]

The SPARQL endpoint is http://dbpedia.org/sparql. We send the request, with the searched title and some properties such as abstract and workplaces, to the server endpoint. By default it returns an HTML page, which is not what we need. So we set the Accept field in the request header to specify the return data type; here we need JSON. We use the GET and POST methods to send the SPARQL query.

The web server then provides the resources and returns a response message to the client. The response message is read by JavaScript, written into HTML and displayed in the User Interface.
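The query sent in this procedure could be assembled along the following lines. The text does not list the exact DBpedia properties or the matching rule used by the prototype, so the property names (`abstract`, `almaMater`, `doctoralAdvisor`), the use of `rdfs:label` with an English language tag, and the helper `build_inventor_query` are all assumptions; the label that DBpedia actually stores for a given person may also differ from the searched name.

```python
def build_inventor_query(name, properties):
    """Build a SELECT query for optional properties of a person with the given label."""
    # Escape backslashes and quotes so the name stays inside the string literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    optional = "\n".join(
        "  OPTIONAL {{ ?person dbo:{0} ?{0} . }}".format(p) for p in properties
    )
    variables = " ".join("?" + p for p in properties)
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?person {vars}\n"
        "WHERE {{\n"
        '  ?person rdfs:label "{name}"@en .\n'
        "{optional}\n"
        "}}"
    ).format(vars=variables, name=safe, optional=optional)

query = build_inventor_query(
    "David A. Patterson", ["abstract", "almaMater", "doctoralAdvisor"]
)
```

Wrapping each property in OPTIONAL keeps a person in the result even when one of the properties is missing, which matters because DBpedia entries are not uniformly complete.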

For the patent data part, we have two main potential approaches. One approach is to keep the patent data in a relational database and query it from a local database. The other approach is to convert it to RDF format and store it in a triple store, or even publish it on the web. The first approach is efficient because we just need to obtain the patent information for the search keyword, and a relational database is quite convenient to use. The bottleneck would be how to store the data. The whole dataset could be saved locally or uploaded to Google Datastore.
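A minimal sketch of the first approach, using Python's built-in `sqlite3` with an in-memory database. The table layout `patents(patent_number, title, inventor)`, the placeholder rows and the `linked_data` dictionary are hypothetical; the schema of the actual Patent Inventor Database is not described here.

```python
import sqlite3

def find_patents(conn, inventor):
    """Return patents for an inventor using a parameterized SQL query."""
    cur = conn.execute(
        "SELECT patent_number, title FROM patents "
        "WHERE inventor = ? ORDER BY patent_number",
        (inventor,),
    )
    return [{"patent_number": n, "title": t} for n, t in cur.fetchall()]

def combine(inventor, patents, linked_data):
    """Merge relational results with inventor facts obtained from Linked Data."""
    return {
        "inventor": inventor,
        "profile": linked_data.get(inventor, {}),
        "patents": patents,
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patents (patent_number TEXT, title TEXT, inventor TEXT)")
conn.executemany(
    "INSERT INTO patents VALUES (?, ?, ?)",
    [  # Placeholder rows, not real patent records.
        ("0000001", "Example patent A", "Jane Doe"),
        ("0000002", "Example patent B", "Jane Doe"),
        ("0000003", "Example patent C", "John Roe"),
    ],
)

# Placeholder for values that would come from the SPARQL endpoint.
linked_data = {"Jane Doe": {"almaMater": "Example University"}}

result = combine("Jane Doe", find_patents(conn, "Jane Doe"), linked_data)
```

The two sources are joined only on the inventor name, so an inventor without a Linked Data entry simply gets an empty profile.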

The second approach is more complex because we need to pre-process the whole dataset and convert it to RDF format. Since the patent data is quite large, many existing tools such as Google Refine cannot hold such a large amount of data. The advantage of the second approach is that the patent data can interlink with other Linked Data, which makes the patent data more widely available.
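To give a feel for what such a conversion involves, the sketch below turns one relational row into N-Triples lines, processing a row at a time so the whole dataset never has to be held in memory. The base URI, the predicate names and the sample row are hypothetical; a real conversion would choose an established vocabulary and link inventors to existing URIs such as those in DBpedia.

```python
def escape_literal(text):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def row_to_ntriples(row, base="http://example.org/patent/"):
    """Turn one relational row (a dict) into N-Triples lines.

    The patent number becomes the subject URI; every other column becomes
    a predicate with a plain string literal as its object.
    """
    subject = "<{}{}>".format(base, row["patent_number"])
    lines = []
    for column, value in row.items():
        if column == "patent_number":
            continue
        predicate = "<{}vocab/{}>".format(base, column)
        lines.append('{} {} "{}" .'.format(subject, predicate, escape_literal(value)))
    return lines

ntriples = row_to_ntriples(
    {"patent_number": "0000001", "title": 'A "sample" title', "inventor": "Jane Doe"}
)
```

Replacing the inventor literal with the inventor's URI in another dataset is the step that would actually interlink the patent data with the rest of the Linked Data cloud.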

Since the large amount of data is always a problem, we begin with a small subset and go from there. For example, we can use the patent data of Berkeley professors first.

Discussion

In this section, I mainly discuss the use case in which we bring Linked Data into patent data search. I also discuss how Linked Data helped in patent search, what the limitations are, and how Linked Data can be used in a broader context. Finally, I evaluate the User Interface of the search engine and test it with real users.

Results

Explanation of Results

For our Capstone Project, we explore the potential use of Linked Data to help Big Data analysis, and we build a patent search engine based on these two concepts. Linked Data has many advantages: it is highly structured, machine-readable and interlinked across different data sources. We take advantage of the structured format of Linked Data and use it to expand the patent search result and add more value to it. Our prototype supports the hypothesis that Linked Data works in this situation, and it has many other implications.

Here is our User Interface after searching for a certain patent:

Fig 6: Screenshot of Patent Search Result

From the screenshot we can easily see that associated information has been added to the patent search result. Here we add some Wikipedia information about the inventor. In this way, the user can easily identify the exact inventor by looking at the biography or related information such as workplace, alma mater and doctoral advisor. This helps disambiguate patents, since many people share the same name but work in different areas and hold totally different patents.

In addition, users can search for the patents of co-workers by clicking their names on the page. If users are interested in the workplace or alma mater, they can click the link, which leads them to the Wikipedia page of that item.

With the help of Open Linked Data, we have a new kind of patent association search that disambiguates the patent search and provides a broader context of patent-related information.

What is different

Our implementation differs from our initial ideas in several ways. First, for the patent data, we retain its relational database format and query it with SQL rather than converting it into RDF format. We did build a small prototype to convert the data using Google Refine, but this becomes really complex with a large amount of data, and converting the data format is not necessary in our use case. So we decided to query the relational database directly and combine the result with inventor information from Linked Data.

We also decided to store the patent data locally and use PHP to query the relational database and send the result back to the client side in JSON format. We found this to be the most efficient way at this stage. If time permits, we would probably move the data to a cloud server so that we can run the search engine remotely.

Limitation of this approach

Our patent search also has some limitations. First, we assume that the inventor has a Wikipedia page so that we can find the corresponding information in DBpedia. However, this is not always the case. Although more and more people have their own page in Wikipedia, not everyone who holds a patent does. In such cases, we cannot find their information in the Linked Data cloud, and no additional inventor information can be shown.

Second, the user needs to type the full name of the inventor in order to match the name in DBpedia and the inventor name in the patent database. Compared with Google Patent Search, this is limiting, because Google can find a lot of information based on selection rank even if we do not type the full name.

Third, we use the patent data in its original format and run two queries to search DBpedia and the relational database. This does not make the best use of Linked Data, because the advantage of Linked Data over other formats is that all data is in the same format and different datasets can be interlinked. It would be better to convert the patent data into RDF format and even publish it in the Open Linked Data cloud. In this way, the patent data would be interlinked with all the other data sources in the cloud and would make better use of the Linked Data concept.

Evaluation

In the evaluation part, I mainly discuss the User Interface we built for patent search and the effectiveness and convenience of the search experience for real users.

User Study

We asked some people from different areas to take part in usability experiments with the search engine, and we made some changes based on their feedback.

Most of them think that the patent association search result is better comparing it

with the traditional approach. They often encounter the problem whether they get

the right one when they search for patents. With our prototype they can easily get

the information of the inventor and therefore get correct and comprehensive

understanding of the information they retrieve.

They thought that our patent search presents clear output together with the associated information and that it also supports related searches. But they also pointed out a limitation of the approach: we only have basic information about the patent itself. If users want more details about a patent, we cannot provide them because that information is not in the patent database.

Heuristic Evaluation

We examined our user interface against the well-known 10 Usability Heuristics introduced by Jakob Nielsen. Heuristic evaluation is a usability engineering method for finding usability problems in a user interface design [22]. A small set of evaluators examined the interface against the recognized usability principles, scored each principle from one to ten, and we combined the results.

We gave the evaluators the goals of the system, asked them to go through a set of tasks we designed in our search interface, and also allowed them to try their own tasks. Afterwards, they filled out the Heuristic Evaluation sheet.

The Heuristic Evaluation Sheet is designed as follows:

Heuristic Evaluation principles                         | Points (1-10) | Comments
--------------------------------------------------------+---------------+---------
Visibility of system status                             |               |
Match between system and the real world                 |               |
User control and freedom                                |               |
Consistency and standards                               |               |
Error prevention                                        |               |
Recognition rather than recall                          |               |
Flexibility and efficiency of use                       |               |
Aesthetic and minimalist design                         |               |
Help users recognize, diagnose, and recover from errors |               |
Help and documentation                                  |               |

We analyzed the results provided by the real users and explain the evaluation result for each principle below. A principle is rated Good if its average score is more than 6 out of 10; otherwise it is rated Need to improve.
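The rating rule can be written down in a few lines. The sketch below averages a list of individual scores and applies the threshold of 6; the individual scores in `exampleScores` are hypothetical, since only the averages are reported here.

```javascript
const GOOD_THRESHOLD = 6; // a principle is Good only if its average exceeds this

// Average a non-empty list of 1-10 scores.
function averageScore(points) {
  if (points.length === 0) {
    throw new Error("no scores given");
  }
  const total = points.reduce((sum, p) => sum + p, 0);
  return total / points.length;
}

// Apply the rating rule to an average score.
function ratePrinciple(average) {
  return average > GOOD_THRESHOLD ? "Good" : "Need to improve";
}

// Hypothetical scores from three evaluators for one principle.
const exampleScores = [7, 8, 6];
const exampleAverage = averageScore(exampleScores);
console.log(exampleAverage.toFixed(1) + ": " + ratePrinciple(exampleAverage));
```

Note that the comparison is strict, so an average of exactly 6 falls into Need to improve.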

(1). Visibility of system status: Good (8.7)

Our interface has a clear layout, and the different components do not overlap when they are displayed. Users can easily see whether they have obtained a search result and what the information looks like.

(2). Match between system and the real world: Good (8.2)

The interface resembles the Wikipedia format for showing biographical information and puts the patent at the front of the page, so it is easy to understand.


(3). User control and freedom: Good (7.1)

Users can search for a new patent by using the textbox in the upper left corner or by simply clicking on information in the page.

(4). Consistency and standards: Need to improve (5.8)

The search textbox only accepts existing patent numbers and some inventor information, so users may be confused at first about what they should enter.

(5). Error prevention: Need to improve (5.0)

We did not build auto-completion or auto-correction, so users need to type their query correctly in order to get a result.

(6). Recognition rather than recall: Good (7.5)

We minimized the user’s memory load by making objects and actions visible. Users do not have to remember information; they can simply click on an earlier result.

(7). Flexibility and efficiency of use: Good (7.2)

The difference between novice and expert users is small because the search feature requires no complicated actions.

(8). Aesthetic and minimalist design: Good (6.8)

The interface shows the most relevant and needed information and gives extra information low visibility.


(9). Help users recognize, diagnose, and recover from errors: Need to improve (5.7)

If users type a name that does not exist in Wikipedia or make a typo, there is no error message that indicates the problem precisely.

(10). Help and documentation: Need to improve (5.5)

We did not implement any documentation to help users understand the functionality of the search engine. Normally people will still understand it because the interface looks like other search engines.

Future Work

Enriching the functionality of the Patent Search

So far we have focused only on how to combine Linked Data and relational data to make patent search more convenient, so we use only limited information collected from a single source in the Open Linked Data Cloud. In fact, there is much more we can do to enrich the functionality of the patent search. For example, we could take the geographic information in the patent data and visualize it using the GeoNames data from the Linked Data Cloud. We could also visualize a patent search graph to show the relationships between different inventors and their patents more explicitly.
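One possible starting point for such a graph is a plain node and edge list built from inventor and patent pairs, which most visualization libraries can consume. The sketch below is only an illustration: the pairs are hypothetical, and the `inventor:` and `patent:` identifier prefixes are an assumed convention for keeping the two kinds of node apart.

```javascript
// Hypothetical (inventor, patent) pairs, e.g. taken from search results.
const inventorPatentPairs = [
  { inventor: "Jane Doe", patent: "0000001" },
  { inventor: "John Roe", patent: "0000001" },
  { inventor: "Jane Doe", patent: "0000002" },
];

// Build a bipartite graph: one node per inventor, one per patent,
// and one edge for every "inventor holds patent" pair.
function buildPatentGraph(pairs) {
  const nodes = new Map(); // node id -> node, so each node appears once
  const edges = [];
  for (const { inventor, patent } of pairs) {
    const inventorId = "inventor:" + inventor;
    const patentId = "patent:" + patent;
    nodes.set(inventorId, { id: inventorId, kind: "inventor" });
    nodes.set(patentId, { id: patentId, kind: "patent" });
    edges.push({ source: inventorId, target: patentId });
  }
  return { nodes: Array.from(nodes.values()), edges: edges };
}

const patentGraph = buildPatentGraph(inventorPatentPairs);
console.log(patentGraph.nodes.length + " nodes, " + patentGraph.edges.length + " edges");
```

In this structure, two inventors who share a patent node are co-inventors, which is the kind of relationship the graph view is meant to expose.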


Querying a Collection of Datasets in Linked Data

We query data only from DBpedia in this project. But since Linked Data is interlinked, we may be able to query a collection of datasets through an existing SPARQL endpoint that provides access to copies of the relevant datasets. For example, OpenLink Software hosts a majority of the datasets from the LOD cloud behind a SPARQL endpoint [23].
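In practice, moving from DBpedia to an aggregating endpoint mostly means changing the endpoint URL, since the request follows the same SPARQL protocol. The sketch below builds (but does not send) the same POST request for two endpoints. The second URL is an assumed address for an LOD cloud cache and should be checked before use; the sample query is a placeholder.

```javascript
const ACCEPT_TYPES = new Map([
  ["json", "application/sparql-results+json"],
  ["xml", "application/sparql-results+xml"],
]);

// Describe a SPARQL protocol POST request without sending it.
function buildSparqlRequest(endpoint, query, format) {
  const accept = ACCEPT_TYPES.get(format) || ACCEPT_TYPES.get("json");
  return {
    url: endpoint,
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Accept: accept,
    },
    // encodeURIComponent also escapes "&", "+" and "#", which matter in a form body.
    body: "query=" + encodeURIComponent(query),
  };
}

const candidateEndpoints = [
  "http://dbpedia.org/sparql",
  "http://lod.openlinksw.com/sparql", // assumption: address of an LOD cloud cache
];
const sampleQuery = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"; // placeholder query

const sparqlRequests = candidateEndpoints.map((endpoint) =>
  buildSparqlRequest(endpoint, sampleQuery, "json")
);
console.log(sparqlRequests.map((r) => r.url).join("\n"));
```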

Applying the concept to other topics

Currently we combine the patent data with DBpedia in the Linked Data Cloud. There are many other sources in the Linked Data Cloud that we could use, such as GeoNames data, IMDB data, BBC Music and so on. We could make use of these sources to find other applications. For example, we could search for a certain singer and get the relevant biographical information along with their albums and songs from different data sources.


Conclusions or Impact Statement

Our capstone project is a research project that explores the potential use of Linked Data in Big Data. We did some research on Big Data to learn the existing approaches to analyzing Big Data and their strengths and weaknesses. We found that highly structured Linked Data might be a potential solution for unstructured Big Data analytics and could help uncover more of the value behind Big Data.

Based on that, we are building a search engine that demonstrates how Linked Data helps in Big Data analysis by expanding the patent data with the Open Linked Data Cloud. In this way, we can find patent association information through the Linked Data Cloud and combine it with the patent search to give a comprehensive answer.

Although we have learned a lot about the mechanism of Linked Data and used it in our prototype, there remains more to learn. For example, we query only a single source in the Linked Data Cloud; we could explore multiple queries across different sources, or directly convert the patent data into RDF format and publish it in the Linked Data Cloud.

The strength of Linked Data is its structured and uniform format, which lets information be shared among different datasets and read automatically by computers. Yet we still need to address drawbacks such as complicated pre-processing procedures and the question of how to protect the data available on the web.

Our prototype shows that Linked Data has clear advantages and can support data analysis in different situations. We see a bright future for making better use of Linked Data and the Semantic Web to help in Big Data analysis.


Bibliography

1. Ian Mitchell, Mark Wilson. Linked Data: Connecting and exploiting big data. London : Fujitsu UK, 2012.

2. Dumbill, Edd. What is big data? An introduction to the big data landscape. [Online] January 11, 2012. http://strata.oreilly.com/2012/01/what-is-big-data.html.

3. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. Big Data: The next frontier for innovation, competition, and productivity. s.l. : McKinsey Global Institute, 2011. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.

4. Roebuck, Kevin. Big Data: High-impact Strategies – What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. s.l. : Lightning Source Incorporated, 2011.

5. Richard Cyganiak, Anja Jentzsch. Linking Open Data cloud diagram. [Online] 2011. http://lod-cloud.net/.

6. Hopkins, Brian and Evelson, Boris. Expand Your Digital Horizon with Big Data. s.l. : Forrester, 2011.

7. Oreskovic, Alexei. YouTube, Google Inc's video website, is streaming 4 billion online videos every day, a 25 percent increase in the past eight months, according to the company. [Online] Jan. 23, 2012. [Cited: Nov. 30, 2012.] http://www.reuters.com/article/2012/01/23/us-google-youtube-idUSTRE80M0TS20120123.

8. twittersearch. The Engineering Behind Twitter’s New Search Experience. [Online] May 31, 2011. [Cited: Nov. 30, 2012.] http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.

9. O'Brien, Kevin. Why Media Literacy? A Catholic Reflection. [Online] [Cited: Nov. 30, 2012.] http://www.medialit.org/reading-room/why-media-literacy-catholic-reflection.

10. IDC European Software Predictions. Woodward, Alys, et al. 2012, IDC.

11. IDC Worldwide Big Data Taxonomy. Woo, Benjamin, et al. 2011.

12. Jeffrey Dean and Sanjay Ghemawat. 2004, OSDI, p. 13.

13. Jablonski, Joey. Introduction to Hadoop. Fremont : Dell Inc., 2011.

14. Adam Lith, Jakob Mattsson. Investigating storage solutions for large data. Goteborg : Chalmers University of Technology, 2010.

15. Kelly, Jeff. Big Data: Hadoop, Business Analytics and Beyond. Nov. 8, 2012.

16. DuCharme, Bob. Learning SPARQL. s.l. : O'Reilly, 2011.

17. Berners-Lee, Tim. Linked Data Design Issues. [Online] 06 18, 2009. http://www.w3.org/DesignIssues/LinkedData.html.

18. Matthews, Andrew. Understanding SPARQL. [Online] 2008. http://www.ibm.com/developerworks/xml/tutorials/x-sparql/section3.html.

19. SPARQL endpoint. [Online] 2011. http://semanticweb.org/wiki/SPARQL_endpoint.

20. HTTP Requests. [Online] http://docs.oracle.com/javaee/1.4/tutorial/doc/HTTP2.html.

21. Ivan Herman, Sergio Fernandez, Carlos Tejo. SPARQL Endpoint interface to Python. [Online] 2008. http://sparql-wrapper.sourceforge.net/.

22. Nielsen, Jakob. 10 Usability Heuristics for User Interface Design. [Online] 1995. http://www.nngroup.com/articles/ten-usability-heuristics/.

23. Hartig, Olaf. Querying Linked Data with SPARQL. [Online] 2009. http://www.slideshare.net/olafhartig/querying-linked-data-with-sparql.

24. Public Data Sets on AWS. [Online] http://aws.amazon.com/publicdatasets.

25. SparqlEndpoints. [Online] 2013. http://esw.w3.org/topic/SparqlEndpoints.


Appendix

Here I list some code snippets described in the methodology.

SPARQLWrapper.js

    (function (root, factory) {
      if (typeof define === "function") {
        define("SPARQLWrapper", factory); // AMD || CMD
      } else {
        root.SPARQLWrapper = factory(); // <script>
      }
    }(this, function () {
      'use strict';

      // Wraps a SPARQL endpoint URL; results default to JSON.
      function SPARQLWrapper(endpoint) {
        this.endpoint = endpoint;
        this.queryPart = "";
        this.type = "json";
      }

      SPARQLWrapper.prototype = {
        constructor: SPARQLWrapper,

        // Store the query as a form field. encodeURIComponent is used instead
        // of encodeURI so that "&", "+" and "#" in the query are escaped too.
        setQuery: function (query) {
          this.queryPart = "query=" + encodeURIComponent(query);
        },

        // Set the result format: "json", "xml" or "html".
        setType: function (type) {
          this.type = type.toLowerCase();
        },

        // Call as query(callback) or query(type, callback).
        query: function (type, callback) {
          callback = callback === undefined ? type : this.setType(type) || callback;
          var xhr = new XMLHttpRequest();
          xhr.open('POST', this.endpoint, true);
          xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
          // Map the short format name to the MIME type sent in the Accept header.
          switch (this.type) {
            case "json":
              type = "application/sparql-results+json";
              break;
            case "xml":
              type = "application/sparql-results+xml";
              break;
            case "html":
              type = "text/html";
              break;
            default:
              type = "application/sparql-results+json";
              break;
          }
          xhr.setRequestHeader("Accept", type);
          xhr.onreadystatechange = function () {
            if (xhr.readyState == 4) {
              var sta = xhr.status;
              if (sta == 200 || sta == 304) {
                callback(xhr.responseText);
              } else {
                console && console.error("Sparql query error: " + xhr.status + " " + xhr.responseText);
              }
              // Release the request object once the handler has finished.
              window.setTimeout(function () {
                xhr.onreadystatechange = new Function();
                xhr = null;
              }, 0);
            }
          };
          xhr.send(this.queryPart);
        }
      };

      return SPARQLWrapper;
    }));
