making sense of unstructured data by turning strings into things


Upload: analyticsweek

Post on 01-Nov-2014


DESCRIPTION

We all know about the promise of Big Data Analytics to transform our understanding of the world. The analysis of structured data, such as inventory, transactions, close rates, and even clicks, likes and shares, is clearly valuable, but the curious fact about the immense volume of data being produced is that the vast majority of it is unstructured text. Content such as news articles, blog posts, product reviews, and yes, even the dreaded 140-character novella contains tremendous value, if only it could be connected to things in the real world – people, places and things. In this talk, we’ll discuss the challenges and opportunities that result when you extract entities from Big Text.

Speaker: Gregor Stewart – Director of Product Management for Text Analytics at Basis Technology

As Director of Product Management, Mr. Stewart helps to ensure that Basis Technology’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Masters from the London School of Economics.

Thanks to our amazing sponsors:

Microsoft NERD (http://microsoftnewengland.com/) for Venue

Basis Technology (http://basistech.com) for Food and Kindle Raffle

TRANSCRIPT

Page 1: Making sense of unstructured data by turning strings into things

Analyze, Extract, Match, Transform, Information Revealed, Connect

Page 4: Making sense of unstructured data by turning strings into things

Overview

• Very briefly introduce Basis

• Motivate the move from Strings to Things

• Review two enabling technologies:

– Entity Extraction: finding names in text (and classifying them)

– Entity Resolution: connecting names together and to things

• Give you three examples of things you can do:

– Entity-based search, illustrating:

• How entities and enriched typing can empower searchers

• How human corrections might be used to improve accuracy over time

– Get additional high quality enrichments from knowledge sources

– Recognize anomalies/outliers, by establishing rich norms


Page 5: Making sense of unstructured data by turning strings into things

Introduction: Basis Technology


Page 6: Making sense of unstructured data by turning strings into things

Introduction: Gregor Stewart


Page 7: Making sense of unstructured data by turning strings into things

Facebingler cares, should you?

Page 8: Making sense of unstructured data by turning strings into things

Facebingler cares, should you?


Page 9: Making sense of unstructured data by turning strings into things

Entity Extraction: What is it?


Page 10: Making sense of unstructured data by turning strings into things

Entity Extraction: How is it done?

[Diagram; labels recovered from the slide]

• Probabilistic Extractor

– Supervised Model

– Unsupervised Model

• Deterministic Extractor

– Exact Match (Gazetteer)

– Pattern Match (Regex)

• Entity Redactor

• Joining, Filtering, Adjudication

• Inputs and outputs: Input Text, Tagged Text, Domain Text, Annotated Text, User Defined Lists, User Defined Patterns
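
A minimal Python sketch of the deterministic side of this slide: exact match against a gazetteer, pattern match with regular expressions, and a simple adjudication step for overlapping hits. The list contents, the patterns, the type labels, and the longest-match adjudication rule are all illustrative assumptions, not a description of Basis Technology's implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int
    end: int
    text: str
    label: str
    source: str  # "gazetteer" or "regex"


# Hypothetical user-defined list (gazetteer): surface string -> entity type.
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "University of Edinburgh": "ORGANIZATION",
    "Madrid": "LOCATION",
}

# Hypothetical user-defined patterns: entity type -> regular expression.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "URL": re.compile(r"https?://\S+"),
}


def exact_match(text):
    """Yield every case-sensitive occurrence of a gazetteer entry."""
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            yield Mention(m.start(), m.end(), m.group(), label, "gazetteer")


def pattern_match(text):
    """Yield every match of a user-defined pattern."""
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            yield Mention(m.start(), m.end(), m.group(), label, "regex")


def adjudicate(mentions):
    """Resolve overlaps by keeping the longest mention (assumed rule)."""
    kept = []
    for m in sorted(mentions, key=lambda m: (-(m.end - m.start), m.start)):
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def extract(text):
    """Join the output of both extractors, then adjudicate."""
    return adjudicate(list(exact_match(text)) + list(pattern_match(text)))
```

Calling `extract("Basis Technology met in Madrid on 5/2/11.")` returns three mentions in text order: an ORGANIZATION, a LOCATION, and a DATE. A probabilistic extractor would feed its own mentions into the same adjudication step.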

Page 11: Making sense of unstructured data by turning strings into things

Entity Resolution: What is it? (1)

Page 12: Making sense of unstructured data by turning strings into things

Entity Resolution: What is it? (2)

[Diagram; text recovered from the slide]

• Name variants found in text: Alberto Amos Fernandez…, Alberto M. Fernandez…, Alberto Fernandez…, Alberto Fernandiz…, Albert Fernandez…, Alburto Fernandez…, Alberto Fernandez de la Puebla…

• The same string, different things:

– Alberto Fernandez… Chief of Cabinet… Argentina… Prof of Criminal Law…

– Alberto Fernandez… born Sept 7, 1984… cycling… Madrid

– Alberto Fernandez… born in Cuba… US Ambassador

• Questions posed on the slide:

– Ratio of Politicians to Sportsmen? 2:1

– Alberto Fernandez… Sportsmen? YES

– Nickname “El Galleta”? ?
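
As a rough illustration of the name-variant half of this problem, the sketch below scores spelling variants against a query name with Python's standard `difflib.SequenceMatcher`. The normalization rule and the 0.8 threshold are assumptions chosen for the example; production name matching typically handles much more, such as transliteration, nicknames, and reordered name parts.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def matches(query, names, threshold=0.8):
    """Return the names whose similarity to the query meets the threshold."""
    return [n for n in names if similarity(query, n) >= threshold]


variants = ["Alberto Fernandiz", "Albert Fernandez", "Jalal Talabani"]
close = matches("Alberto Fernandez", variants)
# close keeps the two spelling variants and drops the unrelated name
```

String similarity can group "Fernandiz" with "Fernandez", but it cannot tell the Argentine politician from the Spanish cyclist or the US ambassador, since their names are spelled identically. Telling them apart requires the surrounding context.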

Page 13: Making sense of unstructured data by turning strings into things

But it’s not just text (1)

Page 14: Making sense of unstructured data by turning strings into things

But it’s not just text (2)

?

Page 15: Making sense of unstructured data by turning strings into things

Entity Resolution: How is it Done? (1)

Page 16: Making sense of unstructured data by turning strings into things

Entity Resolution: How is it Done? (2)


Page 17: Making sense of unstructured data by turning strings into things

Entity Resolution: How is it Done? (3)


Page 18: Making sense of unstructured data by turning strings into things

Entity Resolution: How is it Done? (4)

[Diagram; labels recovered from the slide]

• Input: Entity Mention + Context

• Resolution Engine: Candidate Selection, Ranking, Link or Ghost

• Entity Index

• Knowledge Base: Seeded, Learned
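
The stages named on this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the entity ids, the context terms (loosely taken from the Alberto Fernandez example earlier in the deck), the overlap-based score, and the threshold. In particular, "ghost" is read here as "leave the mention unlinked because no knowledge base entry fits well enough", which the slide itself does not spell out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    context_terms: frozenset


# Hypothetical seeded knowledge base.
KNOWLEDGE_BASE = [
    Entity("kb:1", "Alberto Fernandez",
           frozenset({"chief", "cabinet", "argentina", "criminal", "law"})),
    Entity("kb:2", "Alberto Fernandez",
           frozenset({"cycling", "madrid", "born", "1984"})),
    Entity("kb:3", "Alberto Fernandez",
           frozenset({"cuba", "us", "ambassador", "born"})),
]


def build_index(kb):
    """Entity index: lowercased name -> entities carrying that name."""
    index = {}
    for entity in kb:
        index.setdefault(entity.name.lower(), []).append(entity)
    return index


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def resolve(mention, context, index, threshold=0.2):
    """Select candidates by name, rank by context overlap, then link or ghost."""
    candidates = index.get(mention.lower(), [])
    context_tokens = tokenize(context)
    scored = [
        (len(context_tokens & e.context_terms) / len(e.context_terms), e)
        for e in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if scored and scored[0][0] >= threshold:
        return ("link", scored[0][1].entity_id)
    return ("ghost", None)


INDEX = build_index(KNOWLEDGE_BASE)
```

With this toy data, a mention of "Alberto Fernandez" in a sentence about a cycling stage near Madrid links to `kb:2`, while the same name in a context that shares no terms with any candidate comes back as a ghost.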

Page 19: Making sense of unstructured data by turning strings into things

A (Convenient) Fiction…

• In a nearby place, not so long ago… the CIA was asked by the President to assess the likelihood that the Syrian opposition would use chemical weapons by mid-2014.

• As part of building that analysis, and because there are Al-Qaeda elements in the Syrian opposition, Alice the analyst was asked to: characterize Al-Qaeda’s attitude to using chemical weapons against Middle Eastern governments.


Page 20: Making sense of unstructured data by turning strings into things

From: Ayman Al-Zawahiri (?)
To: “Hafiz Sultan”

“Dear Brother, We need guidance from you on the issue of using chlorine gas technology. It was reported that the brothers in Iraq have used it, but this was implicitly denied in a statement issued by the Islamic State of Iraq.

The brothers where Mahmud is have the potential to use chlorine gas on the forces of the apostates, Jalal Talabani and Mas'ud Barzani, and have already considered using it.

However, I informed them that matters as serious as this require centralized [coordination] and permission from the senior [al-Qa'ida] leadership, because the gas could be difficult to control and might harm some people, which could tarnish our image, alienate people from us, and so on.”

A document that Alice needs to read (socom-2012-0000011)…

Pages 21-42: Making sense of unstructured data by turning strings into things

[Timeline slides; the images were not captured in the transcript. Recoverable labels: 3/28/07, 5/2/11, 5/3/11, 5/12/11, 5/14/11, Today]

Page 43: Making sense of unstructured data by turning strings into things

Advanced enrichment: Topics?

• Some knowledge sources have rich connectivity between things, concepts, etc.

• Developers often ask for “Topic”

• Even advanced Topic approaches often yield “howlers”

• Better labels might be derived from node info or graph walking.

Page 44: Making sense of unstructured data by turning strings into things

Advanced enrichment: Norms?

• By walking the graph in very specific ways, we can build one or more efficient representations of what is normal or expected context for an entity.

• This could be focused on particular entities, types of entities, relationships, etc.

• We could use these representations to affect result rankings or raise alerts.

• Note again that this is not specific to text: elementary parts of other unstructured sources such as images and video might be connected/used in the same way.
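
One way to picture a "norm" built by graph walking is sketched below in Python. The graph contents, the two-hop limit, and the scoring rule are hypothetical choices for this example only; the slide does not say how the walk or the representation is actually built.

```python
from collections import deque

# Hypothetical knowledge graph as an adjacency mapping.
GRAPH = {
    "Alberto Fernandez (cyclist)": {"cycling", "Madrid"},
    "cycling": {"Alberto Fernandez (cyclist)", "sport"},
    "Madrid": {"Alberto Fernandez (cyclist)", "Spain"},
    "sport": {"cycling"},
    "Spain": {"Madrid"},
    "chemical weapons": {"chlorine gas"},
    "chlorine gas": {"chemical weapons"},
}


def expected_context(graph, start, max_hops=2):
    """Breadth-first walk: everything within max_hops of start is 'normal'."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


def anomaly_score(observed, norm):
    """Fraction of co-occurring things that fall outside the norm."""
    observed = set(observed)
    if not observed:
        return 0.0
    return len(observed - norm) / len(observed)


norm = expected_context(GRAPH, "Alberto Fernandez (cyclist)")
expected = anomaly_score({"cycling", "Spain"}, norm)
unusual = anomaly_score({"cycling", "chlorine gas"}, norm)
```

A document that mentions the cyclist alongside cycling and Spain scores 0.0, while one that pairs him with chlorine gas scores 0.5. A score like this could be used to lower a result's rank or to raise an alert.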

Page 45: Making sense of unstructured data by turning strings into things

Summary

• Extraction and resolution components like REX and RES can reliably connect Strings to Things in a range of texts.

• This allows existing knowledge to be usefully applied:

– We can add properties (like types), and other advanced enrichments

– We can discover where existing knowledge is lacking

• Thing-based search can allow each query to be more precise and productive

– Fewer queries, fewer adjustments, fewer results to read

• By using abundant human feedback, KB quality and resolution accuracy can be increased.

– More subtle distinctions between entities can be learned, example by example.

• But…

Page 46: Making sense of unstructured data by turning strings into things

…these tools are like shoes…

Page 47: Making sense of unstructured data by turning strings into things

Thank you

[email protected]