Post on 21-Dec-2015

Search for personal information using Yahoo BOSS

by

Evgeny Dosychev

Dmitry Kichin

Supervisor: Eddie Bortnikov

HomePage Project

Finding personal information on the web is not an easy task.

We want to create an automatic tool that finds and presents personal information about a requested person.

Technical Issues

We need an effective way to find information on the web.

We will use Yahoo BOSS.

Personal information on the web is not in a standard format.

We focus on working with IEEE pdf documents.

Technical Issues

How will we parse the information and identify the different details?

PDF to text: using a dedicated Java package.

Using the standard structure of IEEE documents.
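As an illustration of relying on the standard IEEE structure, the sketch below pulls the short summary out of already-extracted text. It assumes the common layout in which the summary starts with an "Abstract" label followed by a dash or colon and ends before "Index Terms", "Keywords", or the first numbered section. The class name `IeeeAbstractExtractor` and the pattern are hypothetical, not the project's actual parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IeeeAbstractExtractor {
    // Assumed layout: "Abstract", a dash or colon, the summary text, and then
    // one of the markers that usually follows an IEEE abstract.
    private static final Pattern ABSTRACT = Pattern.compile(
        "Abstract\\s*[\\u2014\\u2013:-]\\s*(.+?)\\s*(?=Index Terms|Keywords|I\\.\\s+INTRODUCTION)",
        Pattern.DOTALL);

    // Returns the abstract with line breaks collapsed, or empty if the
    // assumed layout is not found in the text.
    static Optional<String> extractAbstract(String text) {
        Matcher m = ABSTRACT.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }
}
```

Documents that deviate from this layout simply yield an empty result, so the caller can skip the summary rather than show garbage.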

How will we avoid confusion between different people with the same name (name ambiguity)?

Divide the information into clusters.

Let the user choose between the clusters*.

Technologies

• Java

Will be used to build the Windows desktop application.

• Yahoo! BOSS

Provides free access to the Yahoo search index.

• PDFbox

Java library used for extracting text from PDF documents.

BOSS

• Yahoo! Search BOSS (Build your Own Search Service) is a Yahoo! initiative that gives developers free access to the Yahoo! Search index.

• The results can be fed into the developer's application, which can then process them according to its needs.

• Up to 500 results can be retrieved.

Based on Wikipedia
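Retrieving up to 500 results usually means issuing several paged requests. The sketch below only plans the request URLs. The page size of 50, the parameter names (`appid`, `format`, `start`, `count`), and the base URL passed in by the caller are assumptions for illustration and should be checked against the BOSS documentation. `BossQueryPlanner` is a hypothetical name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class BossQueryPlanner {
    static final int MAX_RESULTS = 500; // limit stated on the slide
    static final int PAGE_SIZE = 50;    // assumed results per request

    // Percent-encodes the query so it can be placed in the URL path.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds one URL per page until MAX_RESULTS is covered.
    // The URL shape and parameter names are assumptions.
    static List<String> pageUrls(String baseUrl, String query, String appId) {
        List<String> urls = new ArrayList<>();
        for (int start = 0; start < MAX_RESULTS; start += PAGE_SIZE) {
            urls.add(baseUrl + "/" + encode(query)
                + "?appid=" + encode(appId)
                + "&format=xml&start=" + start
                + "&count=" + PAGE_SIZE);
        }
        return urls;
    }
}
```

With these assumed values the planner yields ten URLs, with `start` running from 0 to 450.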

HomePage functionality

Desktop Java application.

• Gets the search target from the user.

• Searches the web using Yahoo! BOSS.

• Downloads and parses PDF documents and images, and produces an HTML page with the information that was found.

(Currently: email, publication titles, short publication summaries, images, and links to the full documents.)
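Of the extracted fields, the email address is the easiest to locate in plain text. A minimal sketch, assuming a simplified address pattern (not a full RFC 5322 matcher) and a hypothetical `EmailFinder` class:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailFinder {
    // Simplified pattern: local part, "@", and a dotted domain name.
    private static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Returns the distinct addresses found in the text, lower-cased,
    // in order of first appearance.
    static Set<String> findEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }
}
```

Lower-casing the addresses makes them usable as stable keys when the same author appears with different capitalization.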

HomePage functionality

Divides the information into clusters (keyed by email).

Gets the user's choice of which clusters fit the requested person.

Produces an HTML page with all the details.
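The final step can be sketched as a small renderer that turns the chosen details into an HTML page. The page layout, the `HtmlPageWriter` name, and the choice of fields are assumptions for illustration, not the project's actual output format.

```java
import java.util.List;

class HtmlPageWriter {
    // Escapes characters that have special meaning in HTML.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Renders a person's name, email, and publication titles as a simple page.
    static String render(String personName, String email, List<String> titles) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(personName))
            .append("</title></head><body>\n");
        html.append("<h1>").append(escape(personName)).append("</h1>\n");
        html.append("<p>Email: ").append(escape(email)).append("</p>\n");
        html.append("<ul>\n");
        for (String title : titles) {
            html.append("<li>").append(escape(title)).append("</li>\n");
        }
        html.append("</ul>\n</body></html>\n");
        return html.toString();
    }
}
```

Escaping matters here because titles taken from parsed PDFs may contain characters such as `&` or `<`.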

Screenshots

Clustering algorithm

It is very hard for a computer to resolve name ambiguity.

We leave this task to the user.

Each group of information items (cluster) is defined by its key (email), and the user chooses among the clusters.

The result page is produced from the chosen clusters.
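The grouping described above can be sketched with a map from email to items. The `Item` type, its fields, and the `select` helper are hypothetical; the only part taken from the text is that the email is the cluster key and that the user picks which clusters to keep.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class EmailClustering {
    // Hypothetical information item: one publication tied to an email.
    static final class Item {
        final String email;
        final String title;

        Item(String email, String title) {
            this.email = email;
            this.title = title;
        }
    }

    // Groups items by normalized email; insertion order is preserved
    // so clusters can be shown to the user in a stable order.
    static Map<String, List<Item>> cluster(List<Item> items) {
        Map<String, List<Item>> clusters = new LinkedHashMap<>();
        for (Item item : items) {
            String key = item.email.trim().toLowerCase(Locale.ROOT);
            clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
        }
        return clusters;
    }

    // Collects the items of the clusters the user chose; unknown keys are ignored.
    static List<Item> select(Map<String, List<Item>> clusters, Collection<String> chosenKeys) {
        List<Item> result = new ArrayList<>();
        for (String key : chosenKeys) {
            result.addAll(clusters.getOrDefault(key, Collections.emptyList()));
        }
        return result;
    }
}
```

No attempt is made to merge clusters automatically. A person with two email addresses appears as two clusters, and the user selects both.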

Workflow

Class Diagram

Flow Diagram

Challenges

• PDFbox turned out to be unreliable and problematic. It is not the best solution for PDF parsing.

• Perhaps the main challenge was the semantic parsing (finding information in the text). We discovered that semantic parsing is itself a very hard task that requires time and resources beyond the project scope.

Conclusions

• We learned the principles of the BOSS project and used the power that it provides.

• We prepared a well-designed object-oriented infrastructure for the task.

• HomePage can serve as a good infrastructure for plugging in further algorithms that find more information in the texts.

• Extracting and identifying information in the text requires specialized algorithms and methods.