Fostering Serendipity through Big Linked Data

Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo

Semantic Web Challenge at ISWC 2013, October 21-25, 2013, Sydney, Australia

Agenda

• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work

Motivation

Fostering Serendipity through Big Data Triplification, Continuous Integration, and Visualization

Triplification: Linked TCGA

• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB

• Only 46% of the total expected data has been submitted so far, with new data arriving every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Linked TCGA in total: more than 30 billion triples (the largest dataset in the LOD cloud)
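To make the triplification step concrete, here is a minimal sketch of how one tab-separated raw data file could be turned into N-Triples. The namespace `http://example.org/tcga/`, the function names, and the sample values are assumptions for illustration only; the column names (chromosome, start, stop, seg_mean) are taken from the predicate sets on the architecture slides, and the actual Linked TCGA RDFizer may work differently.

```python
from urllib.parse import quote

# Hypothetical namespace; the real Linked TCGA URIs may differ.
BASE = "http://example.org/tcga/"


def triplify_row(row_id, row):
    """Turn one parsed record (column -> value) into N-Triples lines."""
    subject = f"<{BASE}result/{quote(str(row_id))}>"
    triples = []
    for column, value in row.items():
        predicate = f"<{BASE}schema/{quote(column)}>"
        # Values are emitted as plain string literals with basic escaping.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


def triplify_tsv(text):
    """Triplify a tab-separated file whose first line is the header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    out = []
    for i, line in enumerate(lines[1:], start=1):
        out.extend(triplify_row(i, dict(zip(header, line.split("\t")))))
    return out


# Invented sample values, used only to show the shape of the output.
sample = "chromosome\tstart\tstop\tseg_mean\n1\t100\t200\t0.5\n"
for triple in triplify_tsv(sample):
    print(triple)
```

Each data row becomes one subject and each column one predicate, which is why the triple count grows so much faster than the number of raw files.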

Triplification: PubMed

• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month
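As an illustration of how publication metadata such as MeSH terms might be expressed in RDF, the sketch below maps one record to N-Triples. The resource namespace `http://example.org/pubmed/`, the function names, and the sample record (PMID 12345) are hypothetical; the Dublin Core terms vocabulary is used here only as a plausible choice, not as the vocabulary the authors used.

```python
# Hypothetical resource namespace; DCT is the Dublin Core terms vocabulary.
PUB = "http://example.org/pubmed/"
DCT = "http://purl.org/dc/terms/"


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def publication_to_triples(pmid, title, mesh_terms):
    """Emit one title triple and one subject triple per MeSH term."""
    subject = f"<{PUB}{pmid}>"
    triples = [f'{subject} <{DCT}title> "{escape(title)}" .']
    for term in mesh_terms:
        triples.append(f'{subject} <{DCT}subject> "{escape(term)}" .')
    return triples


# Made-up record, for illustration only.
for triple in publication_to_triples(12345, "A study", ["Neoplasms", "Genomics"]):
    print(triple)
```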

Big Data Continuous Integration

[Figure: system architecture. Labels: SPARQL Query, TopFed (Parser, Federator, Optimizer, Integrator), Index, Sub-query, Results, PubMed, Entrez Utilities, RDFizer, Auto Loader, TCGA Data Portal, and three SPARQL endpoints, each backed by RDF.]

[Figure: TopFed index and source selection over the SPARQL endpoints]

Predicate and class sets:
A = {chromosome, result, bcr_patient_barcode}
B = {DNA-Methylation}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
F = {Expression-Exon}
G = {start, stop}
M = {beta_value, position}

Data categories:
• C-1 (blue): CNV, SNP, E-Gene, miRNA, E-Protein, Clinical
• C-2 (pink): Exon-Expression
• C-3 (green): Methylation

For each query triple t(s, p, o) ∈ T:

C-1 = {{p ∈ D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}

C-2 = {{p ∈ E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}

C-3 = {{p ∈ M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}

Routing:
• C-1 category ∨ colour = blue
• C-2 category ∨ colour = pink
• C-3 category ∨ colour = green
• If the tumour lookup is successful, forward to the corresponding leaf; else broadcast to every leaf

SPARQL endpoints and the tumours (1-33) listed for them:
• Blue: b1, b2 with tumours 1-16, 17-33
• Pink: p1 to p6 with tumours 1-5, 6-11, 12-16, 17-22, 23-27, 28-33
• Green: g1 to g9 with tumours 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33

Highly Scalable
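The category rules C-1 to C-3 and the tumour lookup can be sketched as a small source-selection routine. This is an illustrative reading of the slide, not TopFed's actual code. It rests on three assumptions:

- S-Join is taken to mean that another triple pattern shares the same subject.
- P-Join is taken to mean that another triple pattern is chained through subject/object.
- Leaf endpoints are paired with tumour ranges in the order they are listed.

All function names are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Predicate and class sets as given on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
B = {"DNA-Methylation"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
G = {"start", "stop"}
M = {"beta_value", "position"}

# colour -> (own predicates, own classes, own join set, foreign join set)
RULES = {
    "blue": (D | A | G, C, D | C, M | B | E | F),   # C-1
    "pink": (E | A | G, F, E | F, M | B | D | C),   # C-2
    "green": (M | A, B, M | B, E | F | D | C),      # C-3
}

# Assumed pairing of leaf endpoints with inclusive tumour ranges.
LEAVES = {
    "blue": [("b1", 1, 16), ("b2", 17, 33)],
    "pink": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
             ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "green": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12), ("g4", 13, 16),
              ("g5", 17, 20), ("g6", 21, 24), ("g7", 25, 27),
              ("g8", 28, 30), ("g9", 31, 33)],
}


def feature(t):
    """The class for rdf:type patterns, otherwise the predicate."""
    s, p, o = t
    return o if p == RDF_TYPE else p


def s_join(t, others, names):
    """Assumed S-Join: another pattern with the same subject and a feature in names."""
    return any(u[0] == t[0] and feature(u) in names for u in others)


def p_join(t, others, names):
    """Assumed P-Join: another pattern chained via subject/object with a feature in names."""
    return any((u[0] == t[2] or u[2] == t[0]) and feature(u) in names for u in others)


def colours(t, query):
    """Return the colours whose category rule holds for triple pattern t."""
    others = [u for u in query if u is not t]
    s, p, o = t
    result = []
    for colour, (preds, classes, own, foreign) in RULES.items():
        matches = p in preds or (p == RDF_TYPE and o in classes)
        joins_own = s_join(t, others, own) or p_join(t, others, own)
        no_foreign = not s_join(t, others, foreign) and not p_join(t, others, foreign)
        if matches and (joins_own or no_foreign):
            result.append(colour)
    return result


def select_sources(t, query, tumour=None):
    """Forward to the matching leaf if the tumour is known, else broadcast."""
    selected = []
    for colour in colours(t, query):
        leaves = LEAVES[colour]
        hit = [name for name, lo, hi in leaves
               if tumour is not None and lo <= tumour <= hi]
        selected.extend(hit or [name for name, _, _ in leaves])
    return selected
```

Under these assumptions, a `chromosome` pattern that shares its subject with a `DNA-Methylation` type pattern is routed only to the green endpoints, even though `chromosome` alone is ambiguous across all three categories. That narrowing is what keeps the number of sub-queries low.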

Evaluation: Number of Sub-Query Submissions

• TopFed submits one third as many sub-queries as FedX
• Number of ASK requests
– FedX: 480
– TopFed: 10

[Figure: number of sub-queries submitted by FedX and TopFed for queries 1-10 and their average; the y-axis runs from 0 to 60]

Evaluation: Query Runtime

[Figure: query execution time in msec (log scale, 10 to 100,000) of FedX and TopFed for queries 1-10 and their average]

• TopFed significantly outperforms FedX on 90% of the queries
• On average, the query runtime of TopFed is about one third of that of FedX
• TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX

Big Data Track Requirements

• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– PubMed grows by 10,000 publications per month
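The slides do not say how gene and cancer mentions are extracted from abstracts. As one possible approach, the sketch below uses simple dictionary matching. The two small vocabularies, the function name, and the sample sentence are assumptions for illustration; a real pipeline would load complete gene and cancer lexicons and would need to handle synonyms and ambiguous symbols.

```python
import re

# Tiny illustrative vocabularies (assumptions, not the authors' lexicons).
GENES = {"TP53", "EGFR", "BRCA1"}
CANCERS = {"glioblastoma", "breast cancer", "lung adenocarcinoma"}


def find_mentions(abstract):
    """Return the gene symbols and cancer names mentioned in an abstract."""
    # Gene symbols are matched case-sensitively as whole tokens.
    tokens = re.findall(r"[A-Za-z0-9-]+", abstract)
    genes = sorted({tok for tok in tokens if tok in GENES})
    # Cancer names may span several words and are matched case-insensitively.
    lowered = abstract.lower()
    cancers = sorted(
        c for c in CANCERS
        if re.search(r"\b" + re.escape(c) + r"\b", lowered)
    )
    return {"genes": genes, "cancers": cancers}


text = "EGFR amplification and TP53 mutations are frequent in glioblastoma."
print(find_mentions(text))
```

Mentions found this way could then be emitted as triples linking a publication to the genes and cancer types it discusses.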

Big Data Visualization

Tumor-wise Visualization

PubMed Paper-wise Visualization

Genome-wise Patients Results Visualization

Everything is Public

• Demo: http://srvgal78.deri.ie/tcga-pubmed/
• TopFed: https://code.google.com/p/topfed/
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
• Utilities: http://goo.gl/kNrFdI
• Linked TCGA: http://tcga.deri.ie/

AKSW, University of Leipzig, Germany
