Elasticsearch: implementing document full-text search
Bastian Mathes
Elasticsearch Meetup Köln
2015-08-27
2 Introduction
• Elasticsearch is very successful as a log analysis tool
• but it is also a very good search engine
• ... with some unique features for handling structured data
• in this talk let's focus on unstructured data
3 What do we mean by unstructured data?
• can be websites ...
• but often this is a more diverse office mix
  • various formats
  • many languages
  • large in file size or pages
  • various source systems
4 Where is that used?
• Website search
• eCommerce search
• Enterprise search
  • one place to find all information inside the company
  • honor access rights
5 What are the challenges?
• Document conversion / text extraction
• Linguistics
• Secure search
• Source systems access
6 Document conversion
• Extraction of text (and metadata) from your documents
• Need converters
  • Commercial: Oracle Outside In, Microsoft IFilter, HP/Autonomy KeyView
  • Open source: Apache Tika
• Move processing out of the search cluster
  • Near the source system, somewhere in between
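
The "convert outside the cluster, index only plain text" idea can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the standard-library HTML parser stands in for a real converter such as Apache Tika, and the names `CONVERTERS`, `html_to_text`, and `to_index_document` as well as the document field names are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (crude stand-in for a real converter)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(raw: bytes) -> str:
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical converter registry, keyed by MIME type. A production setup
# would delegate to Tika or a commercial converter here.
CONVERTERS = {
    "text/plain": lambda raw: raw.decode("utf-8", errors="replace").strip(),
    "text/html": html_to_text,
}


def to_index_document(doc_id: str, mime_type: str, raw: bytes) -> dict:
    """Runs near the source system; only the extracted text travels to the search index."""
    convert = CONVERTERS.get(mime_type)
    if convert is None:
        raise ValueError(f"no converter registered for {mime_type}")
    return {"id": doc_id, "mime_type": mime_type, "content": convert(raw)}
```

The search cluster then only ever sees the small dictionary returned by `to_index_document`, never the original binary file.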
7 Linguistics
• at least tokenization (Standard Tokenizer)
• a lot of room for improvement
  • Language-specific tokenization
  • Stemming or, better, lemmatization
    • Handle tokens with the same meaning as equal
    • Raise recall, keep precision
    • Overstemming: universal, universe, university
  • Synonyms (Synonym Token Filter)
  • Decompounding
  • Named Entity Recognition
    • Detecting entities (locations, organizations, people) in the text
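
The overstemming example can be reproduced with a toy stemmer. The sketch below is deliberately crude and is not the algorithm of any real analyzer; the suffix list and the function name `crude_stem` are made up for illustration.

```python
# Hypothetical, deliberately crude suffix rules (not a real stemming algorithm).
SUFFIXES = ("ity", "al", "e")


def crude_stem(token: str) -> str:
    """Strips the first matching suffix, keeping at least three characters."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


words = ["universal", "universe", "university"]
stems = {word: crude_stem(word) for word in words}
```

All three words end up with the same stem, so a search for one of them matches documents about the other two: recall goes up, but precision suffers.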
8 Linguistics cont.
• Language detection in Apache Tika
• a lot of Analyzers in Elasticsearch/Lucene
• Try Hunspell for lemmatization
• Play with Stanford NER (English, German, Chinese) and Apache OpenNLP
• there is more in open-source academia, but very specific
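
As a rough sketch of how these pieces could be combined, the following builds index settings for a custom analyzer with the standard tokenizer, a synonym token filter, and a Hunspell token filter. The analyzer and filter names and the synonym entries are invented for illustration, and the Hunspell filter assumes that an `en_US` dictionary is installed on the Elasticsearch nodes.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical synonym list; real lists are usually kept in a file.
                "office_synonyms": {
                    "type": "synonym",
                    "synonyms": ["notebook, laptop", "invoice, bill"],
                },
                # Assumes an en_US Hunspell dictionary is available on each node.
                "en_hunspell": {"type": "hunspell", "locale": "en_US"},
            },
            "analyzer": {
                "document_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "office_synonyms", "en_hunspell"],
                }
            },
        }
    }
}

# JSON body that could be sent when creating the index.
body = json.dumps(index_settings, indent=2)
```

The filter order matters: tokens are lowercased first so that the lowercase synonym entries can match.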
9 Linguistics cont.
[Image: © Basis Technology]
10 Secure Search
• Document-level security
• Have a look at Shield for protecting cluster and indexes
• Basic idea is simple:
  • Transfer access rights from the source system to the search index
  • At search time, create a filter to show only the results the searcher is authorized to see
• Common pitfalls
  • User-to-groups resolution has to be cached
  • Multiple source systems with different authentication/authorization schemas
  • Domain migrations etc.
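
The basic idea can be sketched as follows, under several assumptions: each indexed document carries a hypothetical `allowed_principals` field copied from the source system's ACL, the in-memory `_DIRECTORY` stands in for a real directory service, and the query uses the `bool` query with a `filter` clause from the Elasticsearch Query DSL (older releases expressed the same thing with a `filtered` query).

```python
from functools import lru_cache

# Hypothetical directory data; a real system would ask LDAP, Active Directory, etc.
_DIRECTORY = {"alice": ["staff", "finance"], "bob": ["staff"]}


@lru_cache(maxsize=10_000)
def resolve_principals(user: str) -> tuple:
    """Resolves a user to the principals that may appear in document ACLs (cached)."""
    groups = _DIRECTORY.get(user, [])
    return tuple(sorted([f"user:{user}"] + [f"group:{g}" for g in groups]))


def secure_query(user: str, text: str) -> dict:
    """Full-text query restricted to documents the user is authorized to see."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": text}},
                "filter": {
                    "terms": {"allowed_principals": list(resolve_principals(user))}
                },
            }
        }
    }
```

`lru_cache` never expires entries, so a real deployment would likely need a time-limited cache to pick up group membership changes.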
11 Secure Search cont.
12 Connectors
• Get data from source system to search index
• Events vs. synchronization, track changes
• Access control lists
• Open source solutions: Apache Nutch, Apache ManifoldCF
• Make or buy
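
For the synchronization variant of change tracking, a connector can compare a snapshot of the source system with what is already indexed. The sketch below assumes both sides can be listed as a mapping from document ID to a version marker (for example a modification timestamp or content hash); the function name `plan_sync` is hypothetical.

```python
def plan_sync(source: dict, indexed: dict) -> dict:
    """Compares source and index snapshots and returns the work to be done.

    Both arguments map document ID -> version marker (timestamp, hash, ...).
    """
    # New documents and documents whose version marker changed must be (re)indexed.
    to_index = sorted(
        doc_id for doc_id, version in source.items() if indexed.get(doc_id) != version
    )
    # Documents that vanished from the source must be removed from the index.
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in source)
    return {"index": to_index, "delete": to_delete}


plan = plan_sync(
    source={"a": "v2", "b": "v1", "d": "v1"},
    indexed={"a": "v1", "b": "v1", "c": "v1"},
)
```

If the version marker also covers the access control list, permission changes are picked up by the same comparison. An event-based connector avoids the full listing but depends on the source system emitting reliable change notifications.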
Thank you
Bastian Mathes
[email protected]
Raytion GmbH
Benrather Strasse 18-20
40213 Düsseldorf
Germany
T +49. 211. 55 02 66. 0
www.raytion.com
© Copyright 2015 Raytion GmbH, Düsseldorf