SystemT: Declarative Information Extraction

Yunyao Li, Scalable Natural Language Processing (SNaP), IBM Research | Almaden

Upload: yunyao-li

Post on 04-Jul-2015

DESCRIPTION

Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).

TRANSCRIPT

Page 1: SystemT: Declarative Information Extraction

Scalable Natural Language Processing (SNaP)
IBM Research | Almaden

SystemT: Declarative Information Extraction

Yunyao Li

Page 2

Text analytics is the key for unlocking big data value

• Customer 360º View: Social Media Analytics, CRM Analytics
• Security & Privacy: Email Analytics, Data Redaction, Digital Piracy
• Operations Analysis: Machine Data Analytics
• Financial Analytics: Systemic risk analysis, Event monitoring
• Healthcare Analytics: Patient care, drug discovery
• and many more …

Page 3

Text Analytics vs Information Extraction

[Diagram labels: Text Analytics; Information Extraction; Clustering; Classification; Regression; core operations such as POS tagging, deep parsing, sentence detection, etc.]

Information extraction: a key enabling technology for text analytics
• Input: Raw text and features from low-level natural language processing operations
• Output: Structured data suitable for general-purpose analysis software

Page 4

Information Extraction All Over

Use cases: Customer 360º, Security & Privacy, Operations Analysis, Financial Analytics

Data sources: Social Media, Email, Machine Data, Financial Statements, Medical Records, News, CRM, Patents

Examples:
• Social Media: "… hotel close to Seaworld in Orlando? Planning to go this coming weekend…" → Product:Hotel Location:Orlando
• Email: "my social security # is 400-05-4356" → Type:SSN Value:400054356
• Machine Data: "Storage Module 2 Drive 1 fault" → Event:DriveFail Loc:mod2.Slot1
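The email example on this slide (raw text in, a typed record out) can be illustrated with a minimal Java sketch. It is not how SystemT performs the extraction. The class name `SsnExtractionSketch`, the regular expression, and the output string format are assumptions chosen only to mirror the slide's "Type:SSN Value:400054356" record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of SystemT.
final class SsnExtractionSketch {
    // Assumed pattern: three digits, two digits, four digits, separated by hyphens.
    private static final Pattern SSN =
        Pattern.compile("\\b(\\d{3})-(\\d{2})-(\\d{4})\\b");

    /** Returns one "Type:SSN Value:<digits>" record per match, with hyphens removed. */
    static List<String> extract(String text) {
        List<String> records = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            records.add("Type:SSN Value:" + m.group(1) + m.group(2) + m.group(3));
        }
        return records;
    }

    public static void main(String[] args) {
        // Prints [Type:SSN Value:400054356]
        System.out.println(extract("my social security # is 400-05-4356"));
    }
}
```

A single regular expression is enough for this toy case; the hotel and drive-fault examples need context (dictionaries, relationships between mentions), which is where a dedicated extraction language becomes useful.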

Page 5

Extraction Example: Fact Extraction (Tables)

Singapore 2012 Annual Report (136-page PDF)

• Identify note breaking down Operating expenses line item, and extract opex components
• Identify line item for Operating expenses from Income statement (financial table in PDF document)

Page 6

Extraction Example: Sentiment Analysis

Social Media
• Mcdonalds mcnuggets are fake as shit but they so delicious.
• We should do something cool like go to Z (kidding).
• Makin chicken fries at home bc everyone sucks!
• You are never too old for Disney movies.

Different domain, different ball game!

Customer Reviews
• Bank X got me ****ed up today!
• Not a pleasant client experience. Please fix ASAP.
• I'm still hearing from clients that Merrill's website is better.
• X... fixing something that wasn't broken

Analyst Research Reports (Equity, Fixed Income, Currency)
• Intel's 2013 capex is elevated at 23% of sales, above average of 16%
• IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent
• We continue to rate shares of MSFT neutral.
• Sell EUR/CHF at market for a decline to 1.31000…
• FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.

Page 7

Enterprise Requirements

• Accuracy
  – Garbage-in garbage-out: Usefulness of analytics results is tied to the quality of extraction
• Scalability
  – Large data volumes, often orders of magnitude larger than classical NLP corpora
    • Social Media: Twitter alone has 400M+ messages / day; 1TB+ per day
    • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
    • Machine Data: One application server under moderate load at medium logging level produces 1GB of logs per day
• Transparency
  – Every customer's data and problems are unique in some way
  – Need to quickly implement new business logic
• Usability
  – Building an accurate IE system is labor-intensive
  – Professional programmers are expensive
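To make the scalability figures concrete, the sketch below converts the slide's Twitter volumes into average sustained rates. It assumes a uniform load over the day and decimal units (1 TB = 10^12 bytes); the class and method names are hypothetical. Real traffic is bursty, so peak rates would be higher.

```java
// Hypothetical illustration class: back-of-the-envelope rates from the slide's numbers.
final class FeedRateSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average items per second for a given daily volume, assuming uniform load. */
    static double perSecond(double perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double messagesPerSecond = perSecond(400e6);        // 400M messages / day
        double megabytesPerSecond = perSecond(1e12) / 1e6;  // 1 TB / day in MB/s
        System.out.println(messagesPerSecond + " messages/s");
        System.out.println(megabytesPerSecond + " MB/s");
    }
}
```

This works out to roughly 4,600 messages and about 11.6 MB per second on average, which an extractor has to sustain around the clock just to keep up.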

Page 8

SystemT's 5 Layered Architecture

• Tooling + UI: Eclipse tooling (editor, workflow analysis, pattern discovery, regex learner, profiler, …)
• Generic Libraries: NE libraries, Sentiment, informal-text normalizer, …
• Extraction Language: AQL
• Compiler + Optimizer: Cost-based optimizer
• Execution Engine and Core Operations: scalable, embeddable runtime; 20+ languages currently supported

Page 9

package com.ibm.avatar.algebra.util.sentence;

import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;

public class SentenceChunker
{
  private Matcher sentenceEndingMatcher = null;

  public static BufferedWriter sentenceBufferedWriter = null;

  private HashSet<String> abbreviations = new HashSet<String> ();

  public SentenceChunker ()
  {

  }

  /** Constructor that takes in the abbreviations directly. */
  public SentenceChunker (String[] abbreviations)
  {
    // Generate the abbreviations directly.
    for (String abbr : abbreviations) {
      this.abbreviations.add (abbr);
    }
  }

  /**
   * @param doc the document text to be analyzed
   * @return true if the document contains at least one sentence boundary
   */
  public boolean containsSentenceBoundary (String doc)
  {
    String origDoc = doc;

    /*
     * Based on getSentenceOffsetArrayList()
     */

    // String origDoc = doc;
    // int dotpos, quepos, exclpos, newlinepos;
    int boundary;
    int currentOffset = 0;

    do {
      /* Get the next tentative boundary for the sentenceString */
      setDocumentForObtainingBoundaries (doc);
      boundary = getNextCandidateBoundary ();

      if (boundary != -1) {
        String candidate = doc.substring (0, boundary + 1);
        String remainder = doc.substring (boundary + 1);

        /*
         * Looks at the last character of the String. If this last
         * character is part of an abbreviation (as detected by
         * REGEX) then the sentenceString is not a fullSentence and
         * "false" is returned
         */
        // while (!(isFullSentence(candidate) &&
        // doesNotBeginWithCaps(remainder))) {
        while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) {

          /* Get the next tentative boundary for the sentenceString */
          int nextBoundary = getNextCandidateBoundary ();
          if (nextBoundary == -1) {
            break;
          }
          boundary = nextBoundary;
          candidate = doc.substring (0, boundary + 1);
          remainder = doc.substring (boundary + 1);
        }

        if (candidate.length () > 0) {
          // sentences.addElement(candidate.trim().replaceAll("\n", " "));
          // sentenceArrayList.add(new Integer(currentOffset + boundary + 1));
          // currentOffset += boundary + 1;

          // Found a sentence boundary. If the boundary is the last
          // character in the string, we don't consider it to be
          // contained within the string.
          int baseOffset = currentOffset + boundary + 1;
          if (baseOffset < origDoc.length ()) {
            // System.err.printf("Sentence ends at %d of %d\n",
            // baseOffset, origDoc.length());
            return true;
          }
          else {
            return false;
          }
        }
        // origDoc.substring(0,currentOffset));
        // doc = doc.substring(boundary + 1);
        doc = remainder;
      }
    }
    while (boundary != -1);

    // If we get here, didn't find any boundaries.
    return false;
  }

  public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
  {
    ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();

    // String origDoc = doc;
    // int dotpos, quepos, exclpos, newlinepos;
    int boundary;
    int currentOffset = 0;
    sentenceArrayList.add (new Integer (0));

    do {
      /* Get the next tentative boundary for the sentenceString */
      setDocumentForObtainingBoundaries (doc);
      boundary = getNextCandidateBoundary ();

      if (boundary != -1) {
        String candidate = doc.substring (0, boundary + 1);
        String remainder = doc.substring (boundary + 1);

        /*
         * Looks at the last character of the String. If this last character
         * is part of an abbreviation (as detected by REGEX) then the
         * sentenceString is not a fullSentence and "false" is returned
         */
        // while (!(isFullSentence(candidate) &&
        // doesNotBeginWithCaps(remainder))) {
        while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) {

          /* Get the next tentative boundary for the sentenceString */
          int nextBoundary = getNextCandidateBoundary ();
          if (nextBoundary == -1) {
            break;
          }
          boundary = nextBoundary;
          candidate = doc.substring (0, boundary + 1);
          remainder = doc.substring (boundary + 1);
        }

        if (candidate.length () > 0) {
          sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
          currentOffset += boundary + 1;
        }
        // origDoc.substring(0,currentOffset));
        // doc = doc.substring(boundary + 1);

        doc = remainder;
      }
    }
    while (boundary != -1);

    if (doc.length () > 0) {
      sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
    }

    sentenceArrayList.trimToSize ();
    return sentenceArrayList;
  }

  private void setDocumentForObtainingBoundaries (String doc)
  {
    sentenceEndingMatcher = SentenceConstants.sentenceEndingPattern.matcher (doc);
  }

  private int getNextCandidateBoundary ()
  {
    if (sentenceEndingMatcher.find ()) {
      return sentenceEndingMatcher.start ();
    }
    else
      return -1;
  }

  private boolean doesNotBeginWithPunctuation (String remainder)
  {
    Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
    return (!m.find ());
  }

  private String getLastWord (String cand)
  {
    Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
    if (lastWordMatcher.find ()) {
      return lastWordMatcher.group ();
    }
    else {
      return "";
    }
  }

  /*
   * Looks at the last character of the String. If this last character is
   * part of an abbreviation (as detected by REGEX)
   * then the sentenceString is not a fullSentence and "false" is returned
   */
  private boolean isFullSentence (String cand)
  {
    // cand = cand.replaceAll("\n", " ");
    cand = " " + cand;

    Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
    if (validSentenceBoundaryMatcher.find ()) return true;

    Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);

    if (abbrevMatcher.find ()) {
      return false; // Means it ends with an abbreviation
    }
    else {
      // Check if the last word of the sentenceString has an entry in the
      // abbreviations dictionary (like Mr etc.)
      String lastword = getLastWord (cand);

      if (abbreviations.contains (lastword)) {
        return false;
      }
    }

    return true;
  }
}

Java Implementation of Sentence Boundary Detection

Easy to Develop & Maintain

create dictionary AbbrevDict from file 'abbreviation.dict';

create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([\.\?!]+\s)|(\n\s*\n))/
         on D.text as match
       from Document D ) R
where Not(ContainsDict('AbbrevDict',
          CombineSpans(LeftContextTok(R.match, 1), R.match)));

Equivalent AQL Implementation
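To show what the AQL view computes, here is a compact Java sketch of the same logic: find candidate boundaries with the same regular expression, then drop those whose left neighbor combined with the match is an abbreviation. It is only an approximation of the AQL semantics. The left token is taken to be the run of letters and digits immediately before the match, and dictionary matching is simplified to exact membership of the trimmed span in a set. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; approximates the AQL view, not SystemT's runtime.
final class BoundarySketch {
    // Same boundary regex as the AQL view: sentence punctuation followed by
    // whitespace, or a blank line.
    private static final Pattern BOUNDARY =
        Pattern.compile("(([\\.\\?!]+\\s)|(\\n\\s*\\n))");

    /** Returns start offsets of boundary matches that do not end an abbreviation. */
    static List<Integer> boundaries(String text, Set<String> abbreviations) {
        List<Integer> result = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        while (m.find()) {
            // Approximates LeftContextTok(R.match, 1): walk back over letters and digits.
            int tokStart = m.start();
            while (tokStart > 0 && Character.isLetterOrDigit(text.charAt(tokStart - 1))) {
                tokStart--;
            }
            // Approximates CombineSpans: left token through the end of the match, trimmed.
            String combined = text.substring(tokStart, m.end()).trim();
            // Simplified stand-in for Not(ContainsDict(...)): exact set membership.
            if (!abbreviations.contains(combined)) {
                result.add(m.start());
            }
        }
        return result;
    }
}
```

For the input "Mr. Smith arrived. He sat down. " with the abbreviation set {"Mr."}, the period after "Mr" is filtered out and the two remaining boundaries are reported. Even this sketch has to spell out the iteration and span arithmetic that the AQL view leaves to the engine.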

Page 10

Scalability via Optimization

SystemT needs less than 10% of the cores to keep up with Twitter's daily feed without a drop in quality.

[Chart comparing a state-of-the-art industrial system, a state-of-the-art open-source system, and SystemT; labels: 10x, 50x]

Page 11

SystemT in IBM Products

Product categories:
• Products utilizing SystemT for pre-determined extraction tasks
• Products fully exposing the AQL language and engine

Products: InfoSphere BigInsights, InfoSphere Streams, Social Media Analytics, eDiscovery Analyzer, Optim Data Lifecycle Management Solutions, SmartCloud Analytics: Log Analysis, Watson Discovery Advisor

Installed on 400+ client sites

Page 12

Case Study: Drug Discovery

Page 13

Drug Development Pipeline

• Laboratory: 50,000+ compounds
• Pre-Clinical: 250 compounds
• Clinical: 5 compounds
• FDA approval: 1 compound

Phases: 0-I, II, III, IV

Time to develop a drug: 12-15 years
Avg cost to develop a drug: 1.2 billion

"...Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure"

Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome)

Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile

Page 14

Case Study: Drug Discovery

Laboratory → Pre-clinical → Clinical → FDA Approval → Bedside

Objective: enable data-driven decision making: more efficient clinical trial design, data analytics and drug success / failure predictions

Page 15

Extract known facts about a drug

Facts: Structure, Indication, Mode of action / target, Effect level, Side Effect

Sources: structured and unstructured data sources, …

A 360° view of drugs allows comparison between drugs and therefore helps predict drug behavior.

Page 16

Information Extraction Example

Unstructured data contain chemical compound information in many different forms:
• As text: chemical names found in the text of documents
• As bitmap images: pictures of chemicals found in the document images
• Patents also have (manually created) Chemical Complex Work Units (CWUs)

Chemical nomenclature can be daunting

Page 17

From unstructured to structured content

Sources: internal reports; public Medline articles with limited structured info

POPULATION
  gender: FEMALE, MALE
  species: RAT
  strain: Sprague Dawley
  precondition: pregnant

Structure
• Renders content searchable and avoids redundant and expensive tests
• Building block for automation of the discovery and decision process

• Handle variants in entity reference
• Disambiguate based on context, etc…
• Extract relationships
• Assemble complex concepts

Combine rule-based extraction with Machine Learning

Page 18

1. Extract mentions
2. Assemble mentions into information

• Study: Study name, Size, Population description, Time interval, Intervention, Phase, Type, Authorship
• Authorship: Author, Author affiliation, Funding org
• Intervention: Compound, Target, Admin route, Dosage, Type (preventive / therapeutic)
• Population: Species, Subspecies, Gender, Age, Pre-existing conditions, Physical attributes, Geographic Location
• Efficacy / Results: In-life Observations, Hematology, Clinical chem, Comparative Macroscopic Results, Comparative Microscopic Results

Page 19

Example Use Case: Drug Development

Preliminary results suggest new p53 kinases

1. Quantify the "literature distances" between proteins (Bag of Words Model)
2. Cluster proteins by their literature distance: known p53 kinases cluster (green) and suggest others that may also phosphorylate p53.

Legend: GREEN – p53 Kinases; RED/ORANGE/YELLOW – Predicted Targets
Labeled proteins: BMPR2, CHEK2
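The slide does not say how the "literature distance" is computed beyond naming a bag-of-words model. As one plausible reading, the sketch below represents the text associated with each protein as word counts and uses cosine distance between the count vectors. The choice of cosine distance, the tokenization, and all names are assumptions made for illustration, not the method used in the study.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; the distance measure is an assumption.
final class LiteratureDistanceSketch {
    /** Counts lowercase word occurrences in a text (a bag-of-words vector). */
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine distance: 0.0 for identical word distributions, 1.0 for no shared words. */
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 1.0; // an empty document shares nothing with any other
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }
}
```

With pairwise distances like these in hand, any standard clustering method could group proteins whose literature looks similar, which is the step the slide describes as clustering known kinases together with candidate ones.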

Page 20

Summary

• SystemT
  – Declarative IE system based on an algebraic framework
  – Expressivity and performance advantages over grammar-based IE systems
  – Text-specific optimizations
  – Key enabling technology for Big Data analytics applications
• Ongoing Work
  – Tooling support for non-programmers
  – Improved optimization strategies
  – New operators for advanced features (e.g. co-reference resolution, text normalization)

Page 21

Thank you!

• For more information…
  – Visit our website: https://ibm.biz/BdF4GQ
  – Get a free copy:
    • IBM BigInsights Quick Start Edition: http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/
    • IBM Academic Initiative: http://www-03.ibm.com/ibm/university/academic/pub/page/academic_initiative (InfoSphere BigInsights 2.1.1 and Streams 3.0 are freely available for academic purposes)
  – Take free online classes: BigDataUniversity.com
  – Watch YouTube videos:
    • IBM InfoSphere BigInsights Text Analytics Channel: http://www.youtube.com/playlist?list=PL2C66C6E455AD0216
    • IBM Big Data Channel: http://www.youtube.com/user/ibmbigdata
    • IBM InfoSphere Streams Channel: http://www.youtube.com/playlist?list=PLCF04A48C22F34B19
  – Contact me
    • [email protected]

SystemT Team: Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu