oss 2013 - real world facets with entity resolution by benson margulies
DESCRIPTION
Solr’s ability to facet search results gives end-users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn’t want a “George Bush” facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for “George W. Bush” or even “乔治·沃克·布什” (a Chinese translation) that are limited to just one string. We’ll explore the benefits and challenges of empowering Solr users with real-world facets.
TRANSCRIPT
Real World Facets with Entity Resolution
Benson Margulies CTO
Basis Technology
The analyst’s dilemma
§ Andy’s job is to analyze interactions between European countries and Syria.
§ In particular, Andy wants to find unusual events on the Internet that he can report to his boss.
Government Analyst Andy
Haystack of raw data
§ The problem? Searching multiple messy datasets with both structured and unstructured data.
§ A common solution? Use complex, string-based queries to try to “structurize” messy data.
Analyst as Google keyword search pro
The limits of keyword search
The limitation of Google is that it is engaged with strings, not things.
What is Andy really looking for?
Co-occurrence
People Locations Organizations Date & Time Language
Entity resolution can help Andy…
§ Find references to REAL THINGS (people, places…)
§ Know that all mentions of SYRIA are one entity
§ Reference a master dataset to resolve entities, connecting names to knowledge sources
§ Ultimately, Andy can spend more time reading the right documents
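As a rough illustration of "strings, not things", the sketch below maps several surface strings to one entity ID and then counts documents per entity rather than per string. The alias table, the entity IDs (`E1`, `E2`) and the sample documents are all made up for the example; a real system would draw them from a master dataset, not a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical alias table: surface string -> entity ID (IDs are invented).
ALIASES = {
    "syria": "E1",
    "syrian arab republic": "E1",
    "سوريا": "E1",
    "laurent fabius": "E2",
    "fabius": "E2",
}

def resolve(mention):
    """Return the entity ID for a mention, or None if it is unknown."""
    return ALIASES.get(mention.strip().casefold())

def facet_counts(docs):
    """Count documents per entity; an entity counts once per document."""
    counts = Counter()
    for mentions in docs.values():
        entities = {resolve(m) for m in mentions} - {None}
        counts.update(entities)
    return counts

# Invented sample data: document ID -> mentions extracted from it.
DOCS = {
    "doc1": ["Syria", "Laurent Fabius"],
    "doc2": ["Syrian Arab Republic", "SYRIA"],
    "doc3": ["سوريا", "Fabius", "Unknown Person"],
}

counts = facet_counts(DOCS)
```

With this data, all three documents land under the single entity `E1` even though they spell it three different ways, which is the behavior the bullets above ask for.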
Facet search by location [Syria]
Filter by time range [Sept. 26-27, 2013]
Filter by person of interest [Laurent Fabius]
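The three steps on these slides (facet by location, filter by time range, filter by person) could be expressed as one Solr request along the following lines. This is only a sketch: the field names `location_id`, `person_id` and `published`, and the entity IDs passed in, are assumptions for illustration and do not come from the talk. The `q`, `fq`, `facet` and `facet.field` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_query(text, location_id, person_id, start, end):
    """Build a Solr query string that filters and facets on entity IDs.

    Field names are hypothetical; each fq narrows the result set.
    """
    params = [
        ("q", text),
        ("fq", f"location_id:{location_id}"),      # resolved location entity
        ("fq", f"person_id:{person_id}"),          # resolved person entity
        ("fq", f"published:[{start} TO {end}]"),   # date range filter
        ("facet", "true"),
        ("facet.field", "location_id"),
        ("facet.field", "person_id"),
    ]
    return urlencode(params)

query_string = build_query(
    "*:*",
    "E1",   # hypothetical ID for Syria
    "E2",   # hypothetical ID for Laurent Fabius
    "2013-09-26T00:00:00Z",
    "2013-09-27T23:59:59Z",
)
```

Filtering on entity IDs rather than name strings means the query matches every document that mentions the entity, however the name was written in the source text.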
What’s Hard About All This?
• Who goes with whom?
• Many John Smiths, and no one can spell Ghadaffi.
• What happens when new things appear?
• …and its implications for scale.
• What happens when the system is rwong?
• This is a system that makes decisions that stick around.
Addressing ambiguity
Addressing variety
The shock of the new
• Starting point: digest a knowledge base, match new arrivals.
• For example, Wikipedia
• Here comes someone new, we don’t want to:
• Decide that Jones Smyth is John Smith
• Decide that Jones Smtyh is different from Jones Smyth
• … with more limited evidence
• Relationship to scale
• Now we have a data structure that gets modified
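The bullets above can be sketched as a small store that is seeded from a knowledge base and then modified as new names arrive. This is a toy under stated assumptions, not the method used in the talk: it scores names with plain string similarity from Python's `difflib`, and the `EntityStore` class, the `0.85` threshold and the `kb-1` / `new-N` IDs are all invented for the example. A real resolver would use richer evidence than spelling alone.

```python
from difflib import SequenceMatcher

LINK_THRESHOLD = 0.85  # assumed cutoff; would need tuning on real data

class EntityStore:
    """Entity ID -> known names; grows as unseen entities arrive."""

    def __init__(self, seed):
        self.names = {eid: list(ns) for eid, ns in seed.items()}
        self._next = 1

    @staticmethod
    def _score(a, b):
        # Case-insensitive string similarity in [0, 1].
        return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

    def resolve(self, name):
        """Link name to the best-matching entity, or create a new one."""
        best_id, best = None, 0.0
        for eid, known in self.names.items():
            for k in known:
                s = self._score(name, k)
                if s > best:
                    best_id, best = eid, s
        if best_id is not None and best >= LINK_THRESHOLD:
            if name not in self.names[best_id]:
                self.names[best_id].append(name)  # remember the variant
            return best_id
        new_id = f"new-{self._next}"               # the store is modified
        self._next += 1
        self.names[new_id] = [name]
        return new_id

store = EntityStore({"kb-1": ["John Smith"]})
first = store.resolve("Jones Smyth")    # not similar enough to John Smith
second = store.resolve("Jones Smtyh")   # typo of the new arrival
```

Here "Jones Smyth" falls below the threshold against "John Smith" and becomes a new entity, while the misspelling "Jones Smtyh" links to that new entity. Because `resolve` writes to the store, every decision persists and affects later matches, which is where the scale and correction concerns come from.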
Evaluation / Confidence
Human Correction
Thank you.