Project group knowAAN Final presentation Adrian Wilke info[REMOVE]@adrianwilke.de Computer Science Education Group University of Paderborn October 20th 2011


Page 1: knowAAN final presentation

Project group knowAAN
Final presentation

Adrian Wilke
info[REMOVE]@adrianwilke.de

Computer Science Education Group
University of Paderborn

October 20th 2011

Page 2: knowAAN final presentation

Overview

- Introduction
- System components & Work flow
- Demonstration
- Development process
- Summary & Outlook
- Time for further questions of detail

PG knowAAN 2

Page 3: knowAAN final presentation

Overview

Overview: First part

- Goals
- Extraction & Storage (of data)
- Exploration (of data)
- System components & Work flow
- Analysis & Visualization (of data)


Page 4: knowAAN final presentation

Goals

- Explore research networks
- Based on: Artifacts (scientific publications) and metadata
- Combination and analysis of data
- Computation of similarities of full texts
- Support for conference management system Ginkgo
- Data visualization
- Recommendations

(Source: PG knowAAN project description)
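
The goal "Computation of similarities of full texts" can be illustrated with a small sketch. The slides do not say which similarity measure the project used, so the following is only an assumption: cosine similarity over plain term-frequency vectors. The function names `term_frequencies` and `cosine_similarity` are hypothetical and do not come from the project.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts.

    Assumption: this measure is chosen for illustration only; the slides
    do not state how knowAAN compares full texts.
    """
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only terms that occur in both texts contribute to the dot product.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two texts with the same word counts score close to 1, and two texts without any shared word score 0. A real system would probably also remove stop words and weight terms, which this sketch leaves out.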


Page 5: knowAAN final presentation

Goals

Imagine you are interested in a conference. You have downloaded the papers of 2 or 3 years.

Now you have nearly 100 publications. How do you explore them?

100 publications. Do you know any tools for this?

Page 6: knowAAN final presentation

Extraction & Storage

First step: Extract data and store it.

Page 7: knowAAN final presentation

Extraction & Storage


Page 8: knowAAN final presentation

Exploration

Second step: Explore data.

Page 9: knowAAN final presentation

Exploration

Exploring a conference


Page 10: knowAAN final presentation

Exploration

Which extracted data is available for a publication?

→ Database schema

Page 11: knowAAN final presentation

publication
  id GUID
  lucuid VARCHAR(512)
  title VARCHAR(512)
  booktitle VARCHAR(512)
  normtitle VARCHAR(512)
  date VARCHAR(512)
  editor VARCHAR(512)
  journal VARCHAR(512)
  note VARCHAR(512)
  pages VARCHAR(512)
  publisher VARCHAR(512)
  tech VARCHAR(512)
  volume VARCHAR(512)
  number VARCHAR(512)
  rawstring VARCHAR(4096)
  xmlfile VARCHAR(512)
  pdffile VARCHAR(512)
  topicfile VARCHAR(512)
  created BIGINT
  modified BIGINT

author
  id GUID
  text VARCHAR(512)
  normtext VARCHAR(512)
  firstname VARCHAR(512)
  lastname VARCHAR(512)
  created BIGINT
  modified BIGINT

pub_aut
  publication_id GUID
  author_id GUID

affiliation
  id GUID
  text VARCHAR(512)
  location_id GUID

address
  id GUID
  text VARCHAR(512)
  location_id GUID

pub_aff
  publication_id GUID
  affiliation_id GUID

pub_add
  publication_id GUID
  address_id GUID

citation
  publication1_id GUID
  publication2_id GUID

discipline
  id GUID
  text VARCHAR(512)
  parent_id GUID

location
  id GUID
  latitude DOUBLE
  longitude DOUBLE
  text VARCHAR(512)

keyword
  id GUID
  text VARCHAR(512)

pub_key
  publication_id GUID
  keyword_id GUID
  score DOUBLE
  source VARCHAR(512)

pub_evt
  publication_id GUID
  event_id GUID

pub_dis
  publication_id GUID
  discipline_id GUID

pub_con
  publication_id GUID
  concept_id GUID
  score DOUBLE
  source VARCHAR(512)

concept
  id GUID
  text VARCHAR(512)

event
  id GUID
  text VARCHAR(512)
  filepath VARCHAR(512)
  predecessor_id GUID
  successor_id GUID

eventseries
  id GUID
  text VARCHAR(512)
  filepath VARCHAR(512)

evt_evs
  event_id GUID
  eventseries_id GUID

aut_add
  author_id GUID
  address_id GUID

aut_aff
  author_id GUID
  affiliation_id GUID

pub_cat
  publication_id GUID
  category_id GUID
  score DOUBLE
  source VARCHAR(512)

category
  id GUID
  text VARCHAR(512)

bib_coupling
co_author
co_citation
keyword_count
discipline_count
category_count
concept_count
evt_pub_aut_count
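
To make the relation between `publication`, `author` and the link table `pub_aut` more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Several points are assumptions for illustration: only a few columns are created, GUIDs are stored as text, the project's actual database system is not named in the slides, and the co-author query is just one guess at what the `co_author` entry in the schema might compute. The function names are hypothetical.

```python
import sqlite3
import uuid


def open_database():
    """Create an in-memory database with a reduced subset of the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE publication (id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE author (id TEXT PRIMARY KEY, text TEXT);
        CREATE TABLE pub_aut (publication_id TEXT, author_id TEXT);
    """)
    return conn


def add_publication(conn, title, author_names):
    """Insert a publication and link it to its authors via pub_aut."""
    pub_id = str(uuid.uuid4())
    conn.execute("INSERT INTO publication (id, title) VALUES (?, ?)",
                 (pub_id, title))
    for name in author_names:
        # Reuse an existing author row if the name is already known.
        row = conn.execute("SELECT id FROM author WHERE text = ?",
                           (name,)).fetchone()
        if row is None:
            author_id = str(uuid.uuid4())
            conn.execute("INSERT INTO author (id, text) VALUES (?, ?)",
                         (author_id, name))
        else:
            author_id = row[0]
        conn.execute(
            "INSERT INTO pub_aut (publication_id, author_id) VALUES (?, ?)",
            (pub_id, author_id))
    return pub_id


def co_author_counts(conn):
    """Count shared publications for every pair of authors (assumed query)."""
    return conn.execute("""
        SELECT a1.text, a2.text, COUNT(*) AS shared
        FROM pub_aut p1
        JOIN pub_aut p2 ON p1.publication_id = p2.publication_id
        JOIN author a1 ON a1.id = p1.author_id
        JOIN author a2 ON a2.id = p2.author_id
        WHERE a1.text < a2.text
        GROUP BY a1.id, a2.id
        ORDER BY shared DESC, a1.text, a2.text
    """).fetchall()
```

The condition `a1.text < a2.text` reports each pair only once and keeps authors from being paired with themselves. The same self-join pattern on the `citation` table would give counts in the spirit of `bib_coupling` and `co_citation`, again as an assumption about what those names stand for.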

Page 12: knowAAN final presentation

System components & Work flow

How is our system structured?

→ Some examples.

Page 13: knowAAN final presentation

System components & Work flow

Components

<<component>> FileStorage
<<component>> Backend
<<component>> xmlBuilder
<<component>> TopicExtraction
<<component>> TF-Component
<<component>> TrendDetection
<<component>> Roundtrip
<<component>> Recommendation
<<component>> PDFToText
<<component>> Clustering
<<component>> DB
<<component>> Parscit
<<component>> DataBase
<<component>> SolrWebServices
<<component>> DocBrowser
<<component>> FrontendReferenceExtraction
<<component>> ParscitTrainer

Further labels in the diagram: JDBC (2x), Model, WebServices (2x), FileSystem


Page 14: knowAAN final presentation

Participants: Languagedetection, DB, Solr, NounExtraction, Lemmatizer, Parscit, PDFToText, RoundTripExecutor, RoundTrip, DocumentBrowser

Calls and return values, in step order:

a/1) .addPDF
a/2) .writeToFS → Path
a/3) .createThread, .submitThread
b/1) .run
b/2) .getText → Text
b/3) .ParseFullText → ParscitXML
b/4) .extractBodyAndAstract → BodyAndAbstract
b/5) .getLanguage → LanguageString
b/6) .lemmatize → LemmatizedText
b/7) .extractNouns → NounsList
b/8) .lemmatizeNounslist → LemmatizedNouns
b/9) .ReduceToTopNouns → TopNouns
b/10) .writeToFiles → Paths
b/11) .addTexts → Solrid
b/12) .addPublication
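
Part of this sequence can be sketched in code. The following covers only steps b/5 and b/7 to b/9 and rests on assumptions: the data flow between the steps is guessed from the names of the return values, the language detector, noun extractor and lemmatizer are passed in as plain functions because the real components are external tools, and "top nouns" is taken to mean the most frequent ones. All names in the sketch are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProcessedDocument:
    """Result of the sketched part of the round trip."""
    language: str
    top_nouns: list = field(default_factory=list)


def reduce_to_top_nouns(nouns, limit=3):
    """Keep the `limit` most frequent nouns (assumed meaning of b/9)."""
    return [noun for noun, _ in Counter(nouns).most_common(limit)]


def run_round_trip(text, get_language, extract_nouns, lemmatize, limit=3):
    """Run steps b/5, b/7, b/8 and b/9 with caller-supplied stand-ins."""
    language = get_language(text)                           # b/5 .getLanguage
    nouns = extract_nouns(text)                             # b/7 .extractNouns
    lemmatized_nouns = [lemmatize(n) for n in nouns]        # b/8 .lemmatizeNounslist
    top_nouns = reduce_to_top_nouns(lemmatized_nouns, limit)  # b/9 .ReduceToTopNouns
    return ProcessedDocument(language, top_nouns)
```

For a quick trial, trivial stand-ins are enough: `run_round_trip("Graph graph node Graph node edge", lambda t: "en", str.split, str.lower, limit=2)` treats every word as a noun and lowercasing as lemmatization. Passing the steps in as functions keeps the sketch independent of any particular extraction tool.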

Page 15: knowAAN final presentation

System components & Work flow

Work flow


Page 16: knowAAN final presentation

Analysis & Visualization

Third step: Analyze and visualize data.

Page 17: knowAAN final presentation

Analysis & Visualization

Analysis of authors


Page 18: knowAAN final presentation

Analysis & Visualization

Analysis of scientific publications


Page 19: knowAAN final presentation

Demonstration

Now: Demo.

Image: http://www.flickr.com/photos/plaisanter/5525977163/

Page 20: knowAAN final presentation

Development process

Technologies

Jersey


Page 21: knowAAN final presentation

Development process

Methods of agile software development

FDD, XP, Scrum


Page 22: knowAAN final presentation

Development process

Methods of agile software development

- Weekly meetings
- Sit together (as much as possible)
- Automated building system
- Continuous integration
- Issue tracking


Page 23: knowAAN final presentation

Summary and Outlook

Summary and future work

Summary

- Integrated processing of scientific papers
- Aggregated visualization of authors, publications and events
- Computation of various analyses over the data
- Cleaning functionality for automatically processed data

Future work

- Parallelized clustering
- Additional graphical visualization
- Improved extraction of metadata from PDF files


Page 24: knowAAN final presentation

Summary and Outlook

Thank you for your attention

Questions?
