Linked Open Data and RDF Stores meet
Search Based Applications
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX
Dr. Christoph Goller, Director Research, IntraFind AG
Outline
IntraFind
Search Based Applications (SBA) / Enterprise Search
Natural Language Processing (NLP): extract semantics from text
RDF Stores and Linked Open Data (LOD)
SBAs, RDF Stores and LOD: bringing it all together
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 2
IntraFind Software AG
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 3
IntraFind Business Model
Founding of the company: October 2000
More than 700 customers mainly in Germany, Austria, and Switzerland
Partner Network (> 30 VAR & embedding partners)
Employees: 25
Lucene Committers: B. Messer, C. Goller
Our Open Source Search Business:
Product Company: iFinder, Topic Finder, Knowledge Map, Tagging Service, …
Products are a combination of Open Source Components and in-house Development
Support (up to 7x24), Services, Training, Stable API
Automatic Generation of Semantics
Linguistic Analyzers for most European Languages
Named Entity Recognition
Text Classification
Semantic Search
Clustering
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 4
Search Based Applications (SBA)
“The elements that make search powerful are not necessarily the search box, but the ability to
bring together multiple types of information quickly and understandably, in real time, and at
massive scale. Databases have been the underpinning for most of the current generation of
enterprise applications; search technologies may well be the software backbone of the future”
---Susan Feldman, IDC LINK, June 9, 2010
bring together data from multiple sources (structured and unstructured)
Massive Scalability of Web Search (fulltext indexes are NoSQL DBs)
Agile: days vs. months
Semantic NLP Technologies to analyze, enrich, and aggregate source data
Usability of Web Search
Search Box (natural language)
Facets / Dynamic Clustering (refining navigation / web shops)
ad hoc reporting and analysis (dashboard tools, information visualization)
build BI / decision tools on structured and unstructured information
E.g. Social Media Opinion Mining
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 5
iFinder
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 6
Unique
Search Box
Result from
Structured Data
Result from
Unstructured Data
Facets
(correlated terms)
Facets
(named entities)
Facets
(other meta-data)
iFinder Knowledge Map
Tab browsing
Full-text search combined with Knowledge Maps
Content sources provide
filtering capabilities within
the Knowledge Map
Search-while-you-type
based on filters (in this
example: filter „author“)
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 7
Morphological Analyzer vs. Stemming
Morphological Analyzer:
Lemmatizer: maps words to their base forms
Decomposer: decomposes words into their compounds
Kinderbuch (children‘s book) -> Kind (Noun) | Buch (Noun)
Versicherungsvertrag (contract of insurance) -> Versicherung (Noun) | Vertrag (Noun)
Holztisch (wooden table), Glastisch (table made of glass)
Stemmer: usually simple algorithm
going -> go king -> k ???????????
Messer -> mess ??????
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 8
English German
going -> go (Verb) lief -> laufen (Verb)
bought -> buy (Verb) rannte -> rennen (Verb)
bags -> bag (Noun) Bücher -> Buch (Noun)
bacteria -> bacterium (Noun) Taschen -> Tasche (Noun)
Advantages of Morphological Analysis
Combines high Recall with high Precision for Search Applications
Improves subsequent statistical methods
Better suited as descriptions for semantic search / faceting / clustering than
artificial stems, categories used for phrase detection
Reliable Lookup in Lexicon Resources
Thesaurus / Ontologies
Cross-lingual search
Available Languages:
German, English, Spanish, French, Italian, Dutch, Russian, Polish,
Serbo-Croatian, Greek, Chinese, Arabian, Pasthu
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 9
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 10
Named Entity Recognition (NER)
Automated extraction of information from unstructured data
People names
Company names
Brands from product lists
Technical key figures from technical data
(raw materials, product types, order IDs,
process numbers, eClass categories)
Names of streets and locations
Currency and accounting values
Dates
Phone numbers, email addresses, hyperlinks
Search query: Which people / companies
strongly correlate with each other?
10
Named Entities: Applications in Search
Applications:
Facets
Search for „Experts“
Additional Query Types
Index Structure: Additional Tokens on the same position:
N_PersonName
N_Peter Müller
N_Peter
N_Müller
Search for a person named „Brown“
Search for a company near „founded“ and „Bill Gates“
Question Answering / Natural Language Queries (Demo)
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 11
Topic Learner Classifier
Rules (Parameters)
Definition of
Topics 1……N
Topic Classifier
Topic Association
Learning Phase
Classification Phase
New Document
Example Documents
Text Classification
Automatic Assignment of Documents to Topics based on their content
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 12
Applications of Text Classification
News: Newsletter-Management System
Spam-Filtering
Mail / Email Classification
Product Classification (Online Shops), ECLASS /UNSPSC
Subject Area Assignment Libraries & Publishing Companies
Opinion Mining / Sentiment Detection
Part of our Tagging Services
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 13
RDF Stores & Linked Open Data
RDF Stores:
Originally designed as Meta-Data model: machine-readable information
Triples: subject-predicate-object
General Data Model for Knowledge Representation
Query and Inference Languages: SPARQL
Linked Open Data:
method of publishing structured data so that it can be interlinked
Uses RDF
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 14
Linked Open Data
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 15
SBAs, RDF, and LOD: bringing it all together
Use RDF Stores to store structured data within SBAs
LOD (Ontologies / Thesauri) can be used for tagging, topic identification, disambiguation, Auto-Complete, Did-You-Mean, Semantic Search, Navigation / Facetts, Cross-Lingual
Use LOD/knowledge to better link structured and unstructured data
Use LOD/knwoledge (not only statistics) to better determine query-document matching
Use LOD-Resources for improving / developing Information Extraction NLP components
RDF Stores may benefit from SBAs in terms of scalability
(e.g. Project Sindice: Offloading RDF Content into a Lucene Store)
Other NoSQL DBs
RDF stores and LOD are designed for machine-usability but human usability is needed
Vision:
combine Fulltext Search an RDF Stores into a single (Enterprise) Content Repository
provide a simple unified intuitive user interface for accessing all (Enterprise)
knowledge, structured and unstructured
Current and Future Needs for SBAs: more Semantics, more NLP
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 16
Questions?
Dr. Christoph Goller
Director Research
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: [email protected]
Web: www.intrafind.com
IntraFind Software AG
Fraunhofer Strasse 15
82152 Planegg
Germany
PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 17