Fostering Serendipity through Big Linked Data. Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo. Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia

Upload: muhammad-saleem

Post on 24-Jan-2015

DESCRIPTION

Semantic Web Challenge - Big Data track winner at ISWC2013

TRANSCRIPT

Page 1: Fostering Serendipity through Big Linked Data

Fostering Serendipity through Big Linked Data

Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo

Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia

Page 2: Fostering Serendipity through Big Linked Data

Agenda

• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work

Page 3: Fostering Serendipity through Big Linked Data

Motivation

Fostering Serendipity through Big Data Triplification, Continuous Integration, and Visualization

Page 4: Fostering Serendipity through Big Linked Data

Triplification: Linked TCGA

• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data, with new data being submitted every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA > 30 billion triples (largest dataset of LOD)
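Triplification here means turning the rows of a raw data file into RDF triples. The following is a minimal sketch in Python, not the Linked TCGA RDFizer itself. It assumes a tab-separated file whose column names can be used directly as predicate names; the `http://example.org/tcga/` namespace, the `triplify` helper, and the sample row (including its barcode) are hypothetical.

```python
import csv
import io

BASE = "http://example.org/tcga/"  # hypothetical namespace, not the Linked TCGA one


def triplify(tsv_text, id_column):
    """Yield one N-Triples line per non-empty cell of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # One subject per row, named after the identifier column and row number.
        subject = f"<{BASE}result/{row[id_column]}-{row_number}>"
        for column, value in row.items():
            if value:  # skip empty cells
                literal = value.replace("\\", "\\\\").replace('"', '\\"')
                yield f'{subject} <{BASE}{column}> "{literal}" .'


# Made-up sample row; the column names are taken from predicates named later in the deck.
sample = (
    "bcr_patient_barcode\tchromosome\tstart\tstop\tseg_mean\n"
    "TCGA-XX-0001\t1\t100\t200\t0.25\n"
)
triples = list(triplify(sample, "bcr_patient_barcode"))
```

Each cell becomes a plain literal here; a real converter would likely type numeric values and link patients to other resources, which this sketch leaves out.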

Page 5: Fostering Serendipity through Big Linked Data

Triplification: PubMed

• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month

Page 6: Fostering Serendipity through Big Linked Data

Big Data Continuous Integration

[Figure: architecture diagram. Labels: SPARQL Query, TopFed (Parser, Federator, Optimizer, Integrator), Index, Sub-query, Results, three SPARQL endpoints each over an RDF store, PubMed, Entrez Utilities, RDFizer, Auto Loader, TCGA Data Portal]
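The diagram does not say how sub-queries reach the endpoints or how results are combined. As a rough sketch, assuming the standard SPARQL 1.1 Protocol over HTTP and the SPARQL JSON results format, one sub-query could be prepared and several result sets merged as follows. The endpoint URL, the helper names `build_request` and `integrate`, and the canned results are hypothetical, and plain concatenation is a simplification of whatever the Integrator actually does.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint_url, sparql_query):
    """Build (but do not send) a SPARQL Protocol GET request for one sub-query."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": sparql_query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def integrate(json_results):
    """Concatenate the bindings of several SPARQL JSON result documents."""
    merged = []
    for text in json_results:
        merged.extend(json.loads(text)["results"]["bindings"])
    return merged


request = build_request("http://example.org/sparql", "SELECT ?s WHERE { ?s ?p ?o }")

# Canned responses standing in for two endpoints; no network call is made.
fake_results = [
    '{"head": {"vars": ["s"]}, "results": {"bindings": '
    '[{"s": {"type": "uri", "value": "http://example.org/a"}}]}}',
    '{"head": {"vars": ["s"]}, "results": {"bindings": []}}',
]
rows = integrate(fake_results)
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without a live endpoint.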

Page 7: Fostering Serendipity through Big Linked Data

[Figure: tree of SPARQL endpoints grouped by colour and tumour range]

SPARQL endpoints and tumours (in slide order):
• Blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical): endpoints b1, b2; tumours 1-16, 17-33
• Pink (Exon-Expression): endpoints p1 to p6; tumours 1-5, 6-11, 12-16, 17-22, 23-27, 28-33
• Green (Methylation): endpoints g1 to g9; tumours 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33

A = {chromosome, result, bcr_patient_barcode}
B = {DNA-Methylation}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
F = {Expression-Exon}
G = {start, stop}
M = {beta_value, position}

For each query triple t(s, p, o) ∈ T:
• C-1 ∨ Category Colour = blue
• C-2 ∨ Category Colour = pink
• C-3 ∨ Category Colour = green

IF tumour lookup is successful, forward to the corresponding leaf; ELSE broadcast to all of them

C-1 = {{p ∈ D ∪ A ∪ G ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}

C-2 = {{p ∈ E ∪ A ∪ G ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}

C-3 = {{p ∈ M ∪ A ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}

Highly Scalable
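The three rules share one shape, which a short Python sketch can make concrete. The sets and tumour ranges are copied from the slide, with C-1, C-2 and C-3 mapped to blue, pink and green. Everything else is an assumption: the slide does not define S-Join and P-Join, so they are read here as "another pattern shares this pattern's subject" and "another pattern's object is this pattern's subject". The helper names and the pairing of the i-th tumour range with the i-th endpoint are also hypothetical.

```python
RDF_TYPE = "rdf:type"

# Sets copied from the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
B = {"DNA-Methylation"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
G = {"start", "stop"}
M = {"beta_value", "position"}

# Tumour ranges per colour, in slide order; pairing the i-th range with the
# i-th endpoint (b1, b2, p1, ...) is an assumption.
RANGES = {
    "blue": [(1, 16), (17, 33)],
    "pink": [(1, 5), (6, 11), (12, 16), (17, 22), (23, 27), (28, 33)],
    "green": [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 24),
              (25, 27), (28, 30), (31, 33)],
}


def mentions(t, terms):
    """Pattern t has a predicate in terms, or is rdf:type with its object in terms."""
    _, p, o = t
    return p in terms or (p == RDF_TYPE and o in terms)


def s_join(t, query, terms):
    """Assumed reading: another pattern shares t's subject and mentions terms."""
    return any(u is not t and u[0] == t[0] and mentions(u, terms) for u in query)


def p_join(t, query, terms):
    """Assumed reading: another pattern's object is t's subject and it mentions terms."""
    return any(u is not t and u[2] == t[0] and mentions(u, terms) for u in query)


def rule(t, query, predicates, classes, own, others):
    """Shared shape of C-1, C-2 and C-3."""
    _, p, o = t
    relevant = p in predicates or (p == RDF_TYPE and o in classes)
    joined_own = s_join(t, query, own) or p_join(t, query, own)
    joined_other = s_join(t, query, others) or p_join(t, query, others)
    return relevant and (joined_own or not joined_other)


def colours(t, query):
    """Colours whose rule (C-1 blue, C-2 pink, C-3 green) accepts pattern t."""
    rules = {
        "blue": (D | A | G, C, D | C, M | B | E | F),
        "pink": (E | A | G, F, E | F, M | B | D | C),
        "green": (M | A, B, M | B, E | F | D | C),
    }
    return {name for name, args in rules.items() if rule(t, query, *args)}


def leaves(colour, tumour=None):
    """Endpoints to contact: one leaf if the tumour lookup succeeds, else all."""
    names = [f"{colour[0]}{i}" for i in range(1, len(RANGES[colour]) + 1)]
    hits = [n for n, (lo, hi) in zip(names, RANGES[colour])
            if tumour is not None and lo <= tumour <= hi]
    return hits or names


query = [("?s", "bcr_patient_barcode", "?b"), ("?s", "beta_value", "?v")]
```

In this example `bcr_patient_barcode` on its own would qualify for all three colours, but the subject join with a `beta_value` pattern narrows it to green. Calling `leaves("green", 10)` then selects a single endpoint, while `leaves("green")` falls back to broadcasting to all nine.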

Page 8: Fostering Serendipity through Big Linked Data

Evaluation: Number of Sub-Query Submissions

• TopFed submits 1/3 as many sub-queries as FedX
• Number of ASK requests
– FedX: 480
– TopFed: 10

[Figure: number of sub-queries submitted by FedX and TopFed for queries 1-10 and their average; y-axis from 0 to 60]
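The gap in ASK requests comes from how sources are selected. An index-free engine such as FedX probes endpoints with SPARQL ASK queries to learn which of them can answer each triple pattern, whereas an index can answer the same question locally. The sketch below contrasts the two approaches under simplifying assumptions: no caching, a predicate-only index, and a stand-in `fake_ask` function instead of real HTTP requests. The endpoint names, index contents and helper names are hypothetical, and the probe count it produces is not the measured 480 or 10.

```python
def select_by_ask(patterns, endpoints, ask):
    """Probe every endpoint for every triple pattern; return sources and probe count."""
    sources, probes = {}, 0
    for pattern in patterns:
        sources[pattern] = []
        for endpoint in endpoints:
            probes += 1  # one ASK request per (pattern, endpoint) pair
            if ask(endpoint, pattern):
                sources[pattern].append(endpoint)
    return sources, probes


def select_by_index(patterns, index):
    """Look up sources by predicate in a local index; no network probes needed."""
    return {pattern: index.get(pattern[1], []) for pattern in patterns}


endpoints = ["b1", "b2", "g1"]
index = {"seg_mean": ["b1", "b2"], "beta_value": ["g1"]}  # hypothetical contents
patterns = [("?s", "seg_mean", "?x"), ("?s", "beta_value", "?y")]


def fake_ask(endpoint, pattern):
    """Stand-in for an ASK request; answers from the same hypothetical index."""
    return endpoint in index.get(pattern[1], [])


asked, probes = select_by_ask(patterns, endpoints, fake_ask)
indexed = select_by_index(patterns, index)
```

Both strategies pick the same sources here, but the ASK-based one needs a probe for every pattern and endpoint pair, so its cost grows with the number of endpoints.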

Page 9: Fostering Serendipity through Big Linked Data

Evaluation: Query Runtime

[Figure: query execution time (msec, log scale from 10 to 100000) of FedX and TopFed for queries 1-10 and their average]

• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query runtime of TopFed is about 1/3 of that of FedX
• TopFed's best runtimes (queries 2 and 3) are more than 75 times shorter than those of FedX

Page 10: Fostering Serendipity through Big Linked Data

Big Data Track Requirements

• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– PubMed: 10k new publications/month
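The slide does not say how mentions are extracted from abstracts. One simple way to do it is dictionary matching, sketched below. The `find_mentions` helper, the two short vocabularies, and the sample sentence are illustrative assumptions, not the authors' pipeline or data.

```python
import re


def find_mentions(abstract, vocabulary):
    """Return the vocabulary terms that occur as whole words in the abstract."""
    found = []
    for term in vocabulary:
        # Lookarounds keep "EGFR" from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, abstract, flags=re.IGNORECASE):
            found.append(term)
    return found


genes = ["TP53", "BRCA1", "EGFR"]            # illustrative gene symbols
cancers = ["glioblastoma", "breast cancer"]  # illustrative cancer names
abstract = "Mutations in TP53 and EGFR are frequent in glioblastoma."  # made-up sentence

gene_hits = find_mentions(abstract, genes)
cancer_hits = find_mentions(abstract, cancers)
```

Matching is case-insensitive and on whole words only, which avoids some false hits but does not handle synonyms or ambiguous symbols.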

Page 11: Fostering Serendipity through Big Linked Data

Big Data Visualization

Page 12: Fostering Serendipity through Big Linked Data

Tumor-wise Visualization

Page 13: Fostering Serendipity through Big Linked Data

PubMed Paper-wise Visualization

Page 14: Fostering Serendipity through Big Linked Data

Genome-wise Patients Results Visualization

Page 15: Fostering Serendipity through Big Linked Data

Everything is Public

• Demo: http://srvgal78.deri.ie/tcga-pubmed/
• TopFed: https://code.google.com/p/topfed/
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
• Utilities: http://goo.gl/kNrFdI
• Linked TCGA: http://tcga.deri.ie/

AKSW, University of Leipzig, Germany