qiagram: a novel interface for scientists to mine and understand big data with natural-language...

Upload: biofortis

Post on 17-Jan-2015

DESCRIPTION

Life sciences is fast becoming a data problem. In this presentation we explore the challenges faced by scientists wishing to leverage life science and healthcare big data. We demonstrate Qiagram, a collaborative, visual, ad hoc query tool for exploring these large, complex data sets. Using examples from the Adverse Event Reporting System (AERS) database, MedDRA, and SNOMED, we illustrate how scientists with little IT knowledge can mine these data sets and unlock their potential.

TRANSCRIPT

Page 1: Qiagram: A Novel Interface for Scientists to Mine and Understand Big Data with Natural-Language Queries

Copyright © 2009 Proprietary & Confidential

Copyright © 2012 - Proprietary & Confidential

QIAGRAM: A NOVEL INTERFACE TO MINE AND UNDERSTAND LARGE DATA SETS WITH NATURAL LANGUAGE QUERIES

Making data useful by making data smarter

Matthew Clark, Ph.D., BioFortis

February 6, 2013

Page 2

Life sciences is fast becoming a data problem

“Beyond the obvious issues of scale and reproducibility, the complexity and diversity of …data poses the greatest challenge to unlocking knowledge and scientific discovery.”*

*Higdon et al. (2012), Unraveling the Complexities of Life Sciences Data, DOI: 10.1089/big.2012.1505

Page 3

Big Data Challenges

• Volume: Large amounts of data
• Veracity: The credibility/quality; how trusted is the data
• Velocity: Need for rapid analysis
• Value: Actionable outcomes for an organization
• Variety: Many disparate types

Page 4

A multitude of ‘omics and more

Genomics, proteomics, cellomics, metabolomics, lipidomics, transcriptomics

High-throughput technologies: NGS, imaging, mass spectrometry, high-capacity flow, arrays

Collecting data at a prodigious rate, but not always clear on how to use it

Other data: healthcare data (EMR), demographics, adverse events, clinical trials

Page 5

Potential is huge…

• Targeted trials
• Adverse events from pharmacy/clinical data
• Segment patients based on profiles/responses
• Biomarker discovery
• Observational/outcome studies
• “Virtual” clinical trials
• In-silico discovery

Enormous promise… Enormous challenges

Page 6

Big data challenges - general

Source: The Economist, 2011, Big Data: Harnessing a Game-Changing Asset

Page 7

Barriers to extracting value

Source: The Economist, 2011, Big Data: Harnessing a Game-Changing Asset

Page 8

Big Data in life sciences/healthcare

• Multiple disparate data sources
• Lack of integration: patient-molecular-clinical-assay-payer
• “Swiss cheese” problem
• Data cleansing/verification/credibility
• Standards for data interchange
• Privacy concerns
• Lack of good tools for cross-domain analytics

Page 9

Changing paradigms

• Hypothesis driven
  – Traditional: test the hypotheses, scientific method
    • What's the mechanism of action of this drug?
• Discovery driven
  – More open, questioning; enumerates elements to drive hypotheses
    • What data do I have? What's interesting?
• Hybrid
  – Discovery driven + hypotheses
    • Human Genome Project

Page 10

Data Exploration

Often neglected, but now key to getting value from life science big data

Page 11

What is Data Exploration?

• Occurs before in-depth statistics/analytics
• Explore and probe the data
• Determine “what's interesting”, “what's relevant”
• Generate hypotheses
• Ensure data is there to support hypotheses

Page 12

Asking questions of the data

Very complex query that touches many of the 5 V’s of Big Data

Page 13

Asking questions of the data

[Figure callouts: Multiple data sources; Domain expertise; More data sources; More data sources]

Requires considerable IT resources to program this query

Page 14

Data Exploration challenges

• Programming is hard
  – Mostly SQL, SAS
• Lack of a shared language to support collaboration
  – Multidisciplinary data requires domain experts
• Meaningful access to data
  – Sensitive to regulatory/compliance

“The hands-on analytics time to write the SAS code and specify clearly what you need for each hypothesis is very time-consuming,” Felix Frueh, CEO, Medco*

*Miller, K. Big Data Analytics, Biomedical Computation Review, Winter 2011/2012

Page 15

Clinical and molecular data

The Problem

Data managers, overwhelmed by researchers' questions on complex data sources

Researchers with many questions across disciplines

“Weeks to months to NEVER”

“Lost in Translation”

-- Samples from lung cancer patients on cetuximab with an EGFR observation of 1
SELECT DISTINCT S.PATIENT_ID, S.SAMPLE_ID, S.SAMPLE_NAME
FROM SAMPLE_INVENTORY S
INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
WHERE D.DIAGNOSIS_NAME = 'LUNG CANCER'
  AND M.MEDICATION_GENERIC_NAME = 'CETUXIMAB'
  AND B.BIOMARKER_NAME = 'EGFR'
  AND B.OBSERVATION = 1
ORDER BY S.PATIENT_ID, S.SAMPLE_NAME

No common language for questions
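To make the query on this slide concrete, here is a minimal sketch that runs it against a toy in-memory SQLite database. The table and column names come from the slide; the column types, the sample rows, and the choice to qualify the selected columns with the SAMPLE_INVENTORY alias are assumptions made only so the example can execute.

```python
import sqlite3

# Hypothetical toy schema: names follow the slide's query, but the column
# types and every row below are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENTS (PATIENT_ID INTEGER PRIMARY KEY);
CREATE TABLE SAMPLE_INVENTORY (SAMPLE_ID INTEGER, SAMPLE_NAME TEXT, PATIENT_ID INTEGER);
CREATE TABLE DIAGNOSIS (PATIENT_ID INTEGER, DIAGNOSIS_NAME TEXT);
CREATE TABLE MEDICATIONS (PATIENT_ID INTEGER, MEDICATION_GENERIC_NAME TEXT);
CREATE TABLE BIOMARKERS (PATIENT_ID INTEGER, BIOMARKER_NAME TEXT, OBSERVATION INTEGER);

INSERT INTO PATIENTS VALUES (1), (2), (3);
INSERT INTO SAMPLE_INVENTORY VALUES
    (10, 'serum-A', 1), (11, 'tissue-B', 1), (20, 'serum-C', 2), (30, 'serum-D', 3);
INSERT INTO DIAGNOSIS VALUES (1, 'LUNG CANCER'), (2, 'LUNG CANCER'), (3, 'COLON CANCER');
INSERT INTO MEDICATIONS VALUES (1, 'CETUXIMAB'), (2, 'ERLOTINIB'), (3, 'CETUXIMAB');
INSERT INTO BIOMARKERS VALUES (1, 'EGFR', 1), (2, 'EGFR', 1), (3, 'EGFR', 0);
""")

# The slide's query, with ambiguous column names qualified by table alias.
QUERY = """
SELECT DISTINCT S.PATIENT_ID, S.SAMPLE_ID, S.SAMPLE_NAME
FROM SAMPLE_INVENTORY S
INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
WHERE D.DIAGNOSIS_NAME = 'LUNG CANCER'
  AND M.MEDICATION_GENERIC_NAME = 'CETUXIMAB'
  AND B.BIOMARKER_NAME = 'EGFR'
  AND B.OBSERVATION = 1
ORDER BY S.PATIENT_ID, S.SAMPLE_NAME
"""

rows = conn.execute(QUERY).fetchall()
for patient_id, sample_id, sample_name in rows:
    print(patient_id, sample_id, sample_name)
```

Only patient 1 in the toy data satisfies all three conditions, so only that patient's samples come back. Even at this size, a four-way join is not something most bench researchers would write unaided, which is the translation gap the slide describes.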

Page 16

Overcoming the challenges

• Deep Collaboration
  – Easy access to dynamic data
  – Intuitive tools
  – Secure holistic view of data
  – Collaboration

Page 17

Big Data - Deep Collaboration

A single researcher in a silo can often go deep into the data, but may be limited by their domain expertise

Small groups of researchers may be able to collaborate on asking questions but can’t go very deep with the tools they have today

QIAGRAM

Deep Collaboration is when multiple groups of researchers can collaborate in asking questions deeper into the layers of data. Shared domain knowledge allows deeper insights

Page 18

Clinical and molecular data

Qiagram – Collaborative Scientific Intelligence

Researchers and data managers can collaborate on creating queries

Qiagram acts as a shared, visual language for queries

More efficient and effective query creation

Transparent to all stakeholders

Page 19

MINING AERS DATA

Examples

Page 20

Small Data Can Become Big Data

1000 Drugs

1000 Drug Categories

1000 Adverse Events

10^9 Possible Combinations
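The combination count on this slide is the product of the three dimension sizes. A short sketch of that arithmetic, using only the counts given on the slide (the dictionary keys are just labels):

```python
from math import prod

# Dimension sizes as stated on the slide.
dimension_sizes = {
    "drugs": 1000,
    "drug categories": 1000,
    "adverse events": 1000,
}

# Every drug can pair with every category and every adverse event.
combinations = prod(dimension_sizes.values())
print(f"{combinations:,} possible combinations")
```

Three modest lookup tables therefore span a billion candidate combinations, which is why the deck treats this as a big data problem.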

Page 21

Introduction to Qiagram

Page 22

Answer the Question – Which Sources Have Data on Cholestasis?

Page 23

Joining Data Sources is Simple

Page 24

Combining Data Sources

• SNOMED contains hierarchy of drug and medical terms
  – ~12M records
• AERS contains reports of adverse events
  – ~70M records
• MedDRA contains hierarchy of adverse event terms
  – 150k records

Page 25

SNOMED Ontology

Page 26

MedDRA Hierarchy

low level term | pref term | hlt pref term | hlgt pref term | soc term
abdominal migraine | migraine | migraine headaches | headaches | nervous system disorders
acute migraine | migraine | migraine headaches | headaches | nervous system disorders
band-like headache | tension headache | headaches nec | headaches | nervous system disorders
basilar migraine | basilar migraine | migraine headaches | headaches | nervous system disorders
cephalalgia | headache | headaches nec | headaches | nervous system disorders
cephalalgia or cephalgia | headache | headaches nec | headaches | nervous system disorders
cephalgia | headache | headaches nec | headaches | nervous system disorders
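The hierarchy on this slide lets a specific reported term be rolled up to a coarser grouping before counting. A minimal sketch of that roll-up, using only the seven rows shown on the slide; the `LEVELS` abbreviations and the `roll_up` helper are hypothetical names introduced for illustration, not part of MedDRA or Qiagram.

```python
# Rows copied from the slide:
# (low level term, pref term, hlt pref term, hlgt pref term, soc term)
MEDDRA_ROWS = [
    ("abdominal migraine", "migraine", "migraine headaches", "headaches", "nervous system disorders"),
    ("acute migraine", "migraine", "migraine headaches", "headaches", "nervous system disorders"),
    ("band-like headache", "tension headache", "headaches nec", "headaches", "nervous system disorders"),
    ("basilar migraine", "basilar migraine", "migraine headaches", "headaches", "nervous system disorders"),
    ("cephalalgia", "headache", "headaches nec", "headaches", "nervous system disorders"),
    ("cephalalgia or cephalgia", "headache", "headaches nec", "headaches", "nervous system disorders"),
    ("cephalgia", "headache", "headaches nec", "headaches", "nervous system disorders"),
]

# Hypothetical short names for the five columns, finest to coarsest.
LEVELS = ("llt", "pt", "hlt", "hlgt", "soc")


def roll_up(low_level_term: str, level: str) -> str:
    """Return the ancestor of a low level term at the requested level."""
    column = LEVELS.index(level)
    for row in MEDDRA_ROWS:
        if row[0] == low_level_term:
            return row[column]
    raise KeyError(low_level_term)


print(roll_up("cephalgia", "pt"))
print(roll_up("band-like headache", "hlt"))
print(roll_up("abdominal migraine", "soc"))
```

Spelling variants such as "cephalgia" and "cephalalgia" resolve to the same preferred term, so counts taken at the preferred-term level or above are not split by how a reporter happened to word the event.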

Page 27

AERS

• All drug-related adverse events reported to FDA since 2000

• Tables for Drugs, demographics, indications, therapy, reactions, outcomes

Page 28

Filter AERS Drugs by SNOMED Categories

AERS Drug list

Results in all analgesics in AERS, with associated case #s

Page 29

Count the various MedDRA high-level group terms reported for all drugs in AERS from the SNOMED “antibiotic” category

AERS Drug list

SNOMED Categories

MedDRA Hierarchy Mapping
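The three boxes in this diagram amount to a join followed by a grouped count. A rough sketch of that logic in plain Python follows; the rows, the dictionary-based lookups, and the `count_hlgt_for_category` helper are all assumptions for illustration, since the slides do not show the actual AERS, SNOMED, or MedDRA table layouts or how Qiagram executes the query.

```python
from collections import Counter

# Hypothetical toy data standing in for the three sources on the slide.
aers_reactions = [  # (case id, drug name, reported preferred term)
    (1, "amoxicillin", "rash"),
    (2, "amoxicillin", "nausea"),
    (3, "ciprofloxacin", "rash"),
    (4, "ibuprofen", "nausea"),
]
snomed_category = {  # drug name -> SNOMED-style category
    "amoxicillin": "antibiotic",
    "ciprofloxacin": "antibiotic",
    "ibuprofen": "analgesic",
}
meddra_hlgt = {  # preferred term -> high-level group term
    "rash": "epidermal and dermal conditions",
    "nausea": "gastrointestinal signs and symptoms",
}


def count_hlgt_for_category(category: str) -> Counter:
    """Count high-level group terms over reports for drugs in one category."""
    counts = Counter()
    for _case_id, drug, preferred_term in aers_reactions:
        if snomed_category.get(drug) == category:   # filter by SNOMED category
            counts[meddra_hlgt[preferred_term]] += 1  # roll up via MedDRA
    return counts


for term, count in count_hlgt_for_category("antibiotic").most_common():
    print(count, term)
```

Swapping the category string to "analgesic" changes the question without touching the rest of the logic.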

Page 30

Top Ten Antibiotic Adverse Event High-Level Group Terms

Count | hlgt pref term - MedDRA browser
692,058 | general system disorders nec
522,739 | epidermal and dermal conditions
491,409 | neurological disorders nec
385,375 | joint disorders
297,433 | gastrointestinal signs and symptoms
290,760 | respiratory disorders nec
278,723 | allergic conditions
230,134 | cardiac disorder signs and symptoms
208,985 | injuries nec
197,446 | gastrointestinal motility and defaecation conditions

Page 31

Data Quality

• With more data, more chance for inconsistencies.

• Need easy ways to dynamically check the data, identify errant records

• Example: AERS data

Page 32

Query to Locate Patients with Treatment Dates After Death Dates

Page 33

Example Results

isr - drugs | drug | age | death_dt | start_dt - therapies | days after death - therapies + demographics
4016857 | darbepoetin | 59 | 9/5/2002 | 8/12/3003 | 365,583
4006065 | interferon I | 46 | 8/23/2002 | 1/27/2991 | 361,017
6013473 | naloxone | 54 | 6/15/1953 | 6/15/2008 | 20,089
6038344 | combivent | 50 | 12/10/1958 | 12/8/2008 | 18,261
6038344 | levofloxacin | 50 | 12/10/1958 | 12/8/2008 | 18,261
6105245 | dexamethasone | 49 | 1/27/1959 | 7/15/2008 | 18,067
6105245 | bortezomib | 49 | 1/27/1959 | 7/12/2008 | 18,064
6252126 | enfuvirtide | 50 | 8/29/1956 | 7/8/2005 | 17,845
6252126 | efavirenz | 50 | 8/29/1956 | 5/10/2002 | 16,690
6252126 | didanosine | 50 | 8/29/1956 | 5/10/2002 | 16,690

Over 2,000 results
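The check behind these results is a date comparison across two tables. A small sketch of the same rule in Python: the first three records are copied from the results table, the fourth is a made-up consistent record, and the `treated_after_death` helper is a hypothetical name rather than anything from Qiagram.

```python
from datetime import date

# (isr, drug, death_dt, start_dt). The first three rows are copied from the
# results table; the last row is an invented record that passes the check.
records = [
    (4016857, "darbepoetin", date(2002, 9, 5), date(3003, 8, 12)),
    (6013473, "naloxone", date(1953, 6, 15), date(2008, 6, 15)),
    (6105245, "bortezomib", date(1959, 1, 27), date(2008, 7, 12)),
    (9999999, "hypothetical drug", date(2010, 3, 1), date(2009, 11, 20)),
]


def treated_after_death(rows):
    """Return (isr, drug, days after death) where therapy starts after death."""
    flagged = [
        (isr, drug, (start_dt - death_dt).days)
        for isr, drug, death_dt, start_dt in rows
        if start_dt > death_dt
    ]
    # Largest gap first, matching the ordering of the results table.
    return sorted(flagged, key=lambda row: row[2], reverse=True)


for isr, drug, days in treated_after_death(records):
    print(isr, drug, f"{days:,}")
```

The computed gaps reproduce the table's last column for these rows. The sizes of the gaps hint at the likely causes: a start year of 3003 looks like a typing error, and gaps of about 50 years that match the patient's age suggest a birth date entered in the death date field.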

Page 34

Answer Questions At the Speed of Thought

• Many “purpose built” systems answer pre-defined questions.

• However, in data exploration we need the ability to explore new questions

Page 35

Collaborative Experience

• Team of physicians, informaticians, and safety experts collaboratively explored questions based on large amounts of clinical (SDTM) data:
  – Did subjects who were pre-treated with certain drug classes have the most change in cardiac function?
  – What was in common with the subjects that were outliers in cardiac function change?
• Team defined baselines, changes, etc.

Page 36

CPRD

• The Clinical Practice Research Datalink (CPRD) is the new English NHS observational data and interventional research service, jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA).

• 6 large fact tables with 1 B to 2 B rows

• Example query
  – Identify patients with coronary artery disease who have taken aspirin, then study readmission rates.

Page 37

Thomson Reuters' MarketScan

• Several characteristics set MarketScan databases apart from other research databases. The core databases, Commercial, Medicare Supplemental, and Medicaid, are huge – over 170 million patients since 1995. 

• Over 25 fact tables, from 100 M up to 1.5 B rows
• Example
  – Identify cancer patients, look at opiate treatment, and study the duration of the escalation

Page 38

Premier Research Services

• Patient level data is available from more than 600 hospitals, 45 million records and 310 million hospital visits

• 5 large Fact tables from 100 M to over 4 B rows

Page 39

Cerner Health Facts Database

• 8 large fact tables, most in the tens of millions of records
• Example
  – Looking for type II diabetes patients, study infection rates of these patients based on hospital types.

Page 40

Launching in March 2013: cloud-based Qiagram offering with AERS and TCGA data