learning to create data-integration queries partha pratim talukdar, marie jacob, muhammad salman...

52
Learning to Create Data- Learning to Create Data- Integration Queries Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha VLDB2008 VLDB2008 Seminar Presented by Seminar Presented by Noel Gunasekar Noel Gunasekar CSE Department – SUNY Buffalo CSE Department – SUNY Buffalo

Upload: denis-page

Post on 18-Jan-2018

218 views

Category:

Documents


0 download

DESCRIPTION

Learning to Create Data Integration Queries 3 Introduction

TRANSCRIPT

Page 1: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data-Learning to Create Data-Integration QueriesIntegration Queries

Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto GuhaCrammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha

VLDB2008 VLDB2008

Seminar Presented bySeminar Presented byNoel GunasekarNoel Gunasekar

CSE Department – SUNY BuffaloCSE Department – SUNY Buffalo

Page 2: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries2

Learning to Create Data-Integration QueriesIntroduction

Motivation Example

Existing solutionsQ-System Solution

Q-System ArchitectureQuery and Query AnswersExecuting QueryLearning From feedback

ConclusionExperimental resultsFuture Work

Page 3: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries3

Introduction

Page 4: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries4

Motivation

Need for non-expert user to pose queries across multiple data resources.

• Non-expert user - Not familiar with querying languages

• Multiple resource - Databases, Data warehouses, Virtual integrated schemas

Page 5: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Bio-Science Field

Learning to Create Data Integration Queries 5

• Many "standardized" databases with overlapping and cross-referenced information

• Each site is being independently extended, corrected, and analyzed

• Differing levels of data quality/confidence

Protein Databases

Protein DataBase: - PDB information and service listings at Brookhaven National Laboratory [ BNL ]PIR: - Protein Identification Resource database at [ JHU ] PRF: - Protein Research Foundation database at GenomeNet SwissProt - Protein database at ExPASy [ Switzerland ]

Page 6: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Example

Learning to Create Data Integration Queries 6

genomics

Life Sciences

Researcher Disease Studies

Life Sciences researcher querying on data-sources like genomics, disease studies and pharmacology. Pharmacolog

y

“What are the proteins and genes associated with the disease Narcolepsy?”

http://www.expasy.org/uniprot/P04049

Page 7: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 7

Existing Solution

Using keyword based queries on Web-Forms

Match the keywords with terms in the tuples and form the query by joining different databases using foreign-keys

• Cost for the query is fixed and doesn’t accommodate the context of the query

http://www.expasy.org/uniprot/P0C852

Page 8: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 8

Proposed Solution - Q System

Automatically generate Web-Forms for given set of keywords

Pose queries across multiple data resources using the generated web-form

Page 9: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 9

Proposed Solution - Q System

Q SystemKeywords

Protein, gene, disease Reusable Web-Form For querying

User(Author)

Create re-usable web-form

Use web-form for Querying

Reusable Web-Form For querying

ParametersUsers

(Author + others)Query Results

Page 10: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries10

Q System Architecture

Page 11: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 11

Architecture of Q System

Four Components

•Initial Schema Loader

•Query Template Creation

•Query Execution

•Learning Through Feedback

Page 12: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 12

Architecture of Q System

Page 13: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 13

Initial Setup

Schema Loader

Input

• Given a set of data sources with its own schema• Foreign Keys and Links• Schema Mappings• Record Link

Output

• Schema Graph

Page 14: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 14

Example Schema Graph

Node: Databases and their attributes (UniProt database, Entrez GeneInfo db, term)Edge: Relation based on foreign keys/cross-references (UniProt to PIR)Cost: Reliability, completeness

cb

d

0.070.1 0.04

Initial Setup

Page 15: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 15

Query Template Creation

Page 16: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 16

Query Template Creation - Example

Input: “protein”, “plasma membrane”, “gene” and “disease”

Output:

Page 17: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Find trees connecting red

nodes

e

a cb

fd

0.070.1

0.1

0.10.04

0.1

Schema Graph

Rank = 2

Cost = 0.41

Rank = 1

Cost = 0.4e

a cb

fd

0.07

0.1

0.10.04

0.1

e

a b

fd

0.1

0.1

0.1

0.1

Query Keywords a, e, f

Q2Q1

Query Template Creation - Example

Page 18: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Query Formulation

•Trees can be easily written as executable queries:

Steiner Tree

Conjunctive query: a(x,y),b(y,z),d(z,w),e(w,u),f(w,v)

e

a b

fd

0.1

0.1

0.1

0.1

Page 19: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

View Refinement

Page 20: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Web-Form

Page 21: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 21

Query Execution

Page 22: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Input: Web-Form

Page 23: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Output: Result Answers

Q1

Q1

Q1,2

Q2

Q2Q2

System determines “producer” queries using provenance

Page 24: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 24

Query Execution

Query Processing Engine with

• Support for querying remote data sources• Record data provenance

Solution:

ORCHESTRA

http://www.cis.upenn.edu/~zives/orchestra/

Page 25: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 25

Orchestra Project

• The ORCHESTRA project focuses on the challenges of data sharing scenarios in the sciences

• Bioinformatics Scenario - many "standardized" databases with overlapping information, similar but not identical data and differing levels of data quality/confidence

• Each site is being independently extended, corrected, and analyzed

•ORCHESTRA collaborative data sharing system (CDSS) is on how to support reconciliation across different schemas, with disagreeing users

Page 26: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 26

Orchestra Project – Data Provenance

http://www.cis.upenn.edu/~zives/research/exchange.pdf

Page 27: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 27

Learning through Feedback

Page 28: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 28

• Input: Ranked Results + provenance

Q1

Q1

Q1,2

Q2

Q2Q2

Learning through Feedback

Page 29: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 29

• User provides feedback

Q1

Q1

Q1,2

Q2

Q2Q2

Learning through Feedback

Page 30: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Query Formulation - Recap

Find trees connecting red

nodes

e

a cb

fd

0.070.1

0.1

0.10.04

0.1

Schema Graph

Rank = 2

Cost = 0.41

Rank = 1

Cost = 0.4e

a cb

fd

0.07

0.1

0.10.04

0.1

e

a b

fd

0.1

0.1

0.1

0.1

Query Keywords a, e, f

Q2Q1

Page 31: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

e

a cb

fd

0.1

0.1

0.1 0.04

0.1

e

a cb

fd0.1

0.1 0.04

0.1

e

a b

fd

0.1

0.1

0.1

0.1

Change weights so Q2 is “cheaper” than Q1

Rank = 1

Cost = 0.4

Rank = 2

Cost = 0. 41

Rank = 2

Cost = 0.4

Rank = 1

Cost = 0.39

0.05

Q1 Q20.05

0.07

0.07

Learning through Feedback

Page 32: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 32

Iteration!

Page 33: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Q-System: Challenges

Computation of ranked queries which in turn produce ranked tuples: K-Best Steiner Tree Generation

Predicting new query rankings based on user feedback over tuples, and also generalizing feedback: Learning

Maintaining associations between tuples and queries: Query answers with provenance

Everything at interactive speed!

Page 34: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Cost of a Query

•Query Cost = Sum of edge costs in the tree.•Edge Cost = Sum of weights of features defined over it.

•Features are properties of the edges, e.g., nodes connected•Each feature has a corresponding weight. •Feature example:

Term Synonymf 1 w8

f = 1 if the edge connects Term and Synonym tables, else 0

Page 35: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Steiner Trees: Finding Lowest-Cost Queries A tree of minimal cost in a graph (G)

which includes all the required nodes (S).

Cost of a Steiner Tree is the sum of costs of edges present in the tree.

Steiner Tree is generalization of Minimum Spanning Tree (MST) [equivalent when S = all vertices in G].

e

a cb

fd

0.070.1

0.1

0.10.04

0.1

Page 36: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

K-Best Steiner Tree Algorithms

• Exact (practical for ~100 nodes and edges).• Integer Linear Program (ILP) based formulation

for finding K-best Steiner Trees in a graph. • The ILP uses ideas from multi-commodity network

flows

• Approximate (for 100s+ nodes and edges).• Novel Shortest Paths Complete Subgraph

Heuristic.• Significantly faster; in practice, often gives optimal

solution.

Page 37: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Multi-Commodity Flow Problem

Page 38: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

MIP for min-cost Steiner Tree

Page 39: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

MIP for K min-cost Steiner Tree

Page 40: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Constraints

C1 : Flow of commodity k starts at root rC2 : Flow of commodity k terminates at node kC3 : Conservation of flow at Steiner nodesC4 : Flow of an edge allowed only if that edge is

included ( Yij = 1 )C5 : Non-negativity constraintC6 : Defines value for YC7 : Ensures no incoming active edge into the rootC8 : Ensures that all nodes have at most one incoming

active edgeC9 : Flow of at least one commodity on all edges in IC10 : Ensures no flow on edges in X

Page 41: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Finding K-Best Steiner Trees

2-best Steiner trees connecting terminal nodes.

e

a cb

fd

0.070.1

0.1

0.10.04

0.1

Rank = 2

Cost = 0.41

Rank = 1

Cost = 0.4e

a cb

fd

0.07

0.1

0.10.04

0.1

e

a b

fd

0.1

0.1

0.1

0.1

Page 42: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

K-Best Steiner Tree Algorithms

• Approximate (for 100s+ nodes and edges).• Novel Shortest Paths Complete Subgraph

Heuristic.• Send “m” shortest path graph as input.• Shortest path between each pair of nodes in S• Significantly faster; in practice, often gives optimal

solution.

Page 43: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Q Challenge : Getting User Feedback

e

a b

fd

T

e

a b

fd

c

T*

Q

Q*

.

.

.

Query

.

.

.

Tuples

.

.

.

Top

Bottom

T and T* differ in 3 edges. This difference is termed loss: L(T, T*)

Page 44: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning: Update Weights

Term2TermTerm(T1)Edge Cost: 0.07

w8 = 0.06w25 = 0.01

Term2TermTerm(T1)Edge Cost: 0.05

w8 = 0.04w25 = 0.01

Edge Cost Update

Page 45: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Re-ranked Steiner Trees

Weight Update

Rank 1 Rank 2

e

a cb

fd

0.07

0.1

0.10.04

0.1

e

a b

fd

0.1

0.1

0.1

0.1

e

a cb

fd

0.05

0.1

0.10.04

0.1

e

a b

fd

0.1

0.1

0.1

0.1

Page 46: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Experimental Results

The Key Questions• Can the algorithm start with uninitialized weights and learn

expert (“gold standard”) ranking of queries?

• Can the results be generated at interactive speeds?

• Does the approach scale to larger graphs?

Page 47: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Results: Learning Expert Weights

•Graph: Start with the BioGuide bio sources, with 28 vertices and 96 edges.

•Goal: Learn the queries corresponding to the expert-set weights in BioGuide

• Methodology:• All weights are set to default. • Sequence of 25 searches•For each, user feedback identifies & promotes a tuple from the gold standard answer.

• After 40-60% searches with feedback, system finds the top query immediately.• For each individual search, a single feedback is enough to learn the top query.

# G

old

quer

ies

abse

nt in

top

-3 p

redi

ctio

ns

Page 48: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Results: Time to generate K-best Queries

K Time (s)

1 0.115 2.0010 4.0220 8.42

• Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007).

It is possible to generate the top query in < 1 sec and top 5 queriesin about 2 sec, all within interactive range. Query execution is pipelined.

Page 49: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Results: Scalability to Larger Graphs

K Speedup

Error

1 12 02 14.6 03 20.3 05 72.4 0

•Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL.

It is possible to do K-best inference in larger graphs quickly and with little or no loss (none

in this case).

Queries (K)

Page 50: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 50

Experimental Results

“Gold Standard”

• Using BioGuide – a biomedical information integration system

• BioGuide generates the schema graph based on keywords

• The edge cost in the schema graph are manually assigned by experts

• This expert given schema graph is called the “gold standard”

Experiment involves in comparing the result produced by the q system with the results produced by the gold standard.

Page 51: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 51

Learning against expert cost

• Started with an expert query template

“What are the related proteins in [DB1] and genes in [DB2] associated with disease Narcolepsy in [DB3] ?

• By instantiating the template 25 queries were formed

• Each time for a query the lowest-cost Steiner tree is computed

• The “gold standard” is used as the feedback and learning is done

Page 52: Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Learning to Create Data Integration Queries 52

Future Work

• Work on other approximation algorithms for computing the Steiner trees

• Evaluation against real biological applications

• Incorporating data-level keyword matches.