kyiv.py #16 october 2015

Kyiv.py #16 Andrii Soldatenko 24 October 2015 @a_soldatenko

Upload: andrii-soldatenko

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Kyiv.py #16 october 2015

Kyiv.py #16

Andrii Soldatenko 24 October 2015 @a_soldatenko

Page 2: Kyiv.py #16 october 2015

ElasticSearch in Python world

Andrii Soldatenko 24 October 2015 @a_soldatenko

Page 3: Kyiv.py #16 october 2015

About me:

• Software Engineer in Test at

• Speaker at PyCon Russian 2015

• Speaker at PyCon Ukraine 2014

• Speaker at PyCon Belarus 2015

• in past:

Page 4: Kyiv.py #16 october 2015

Preface

Page 5: Kyiv.py #16 october 2015

Information Explosion

Page 6: Kyiv.py #16 october 2015

Text Search

grep --ignore-case --recursive foo books/

grep --ignore-case --recursive --file=words.txt books/

Entry.objects.get(headline__icontains='foo')

words = []
with open('words.txt', 'r') as f:
    # strip trailing newlines and skip blank lines
    words = [line.strip() for line in f if line.strip()]

Entry.objects.get(headline__icontains_in=words)
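Django has no built-in `icontains_in` lookup, so the last line on the slide reads as pseudocode for "match any of these words". The point of the slide is that every approach here scans every document for every word. A minimal pure-Python sketch of that linear scan, assuming an in-memory list of headlines (the `headlines` data and the `contains_any` helper are hypothetical, not part of Django):

```python
def contains_any(text, words):
    """Return True if any of the words occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in words if word)


# Hypothetical sample data standing in for Entry.headline values.
headlines = [
    "Foo fighters announce tour",
    "Python 3.5 released",
    "Nothing to see here",
]
words = ["foo", "python"]

# Every headline is checked against every word: O(documents * words).
matches = [h for h in headlines if contains_any(h, words)]
print(matches)
```

The cost grows with both the number of documents and the number of words, which is what a search index is meant to avoid.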

Page 7: Kyiv.py #16 october 2015

Full text search

Page 8: Kyiv.py #16 october 2015

Search index

Page 9: Kyiv.py #16 october 2015

Simple sentences

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer

Page 10: Kyiv.py #16 october 2015

Inverted index

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
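The table can be rebuilt in a few lines. This is a minimal sketch, assuming a naive whitespace tokenizer that keeps case as-is; `build_inverted_index` is a hypothetical helper name, not an ElasticSearch API:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each distinct term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenizer, case kept
            index[term].add(doc_id)
    return dict(index)


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Because case is preserved, `Quick` and `quick` end up as separate terms, exactly as in the table.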

Page 11: Kyiv.py #16 october 2015

Inverted index

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
-------------------------
Total   |   2   |  1
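The totals row corresponds to a search for "quick brown": each document is scored by how many query terms it contains. A minimal sketch under that assumption (the `search` helper is hypothetical, and real relevance scoring in ElasticSearch is more involved than a term count):

```python
from collections import Counter, defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each distinct term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Count, per document, how many query terms it contains."""
    totals = Counter()
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return totals.most_common()  # best match first


index = build_inverted_index(docs)
print(search(index, "quick brown"))
```

Only the postings for the two query terms are touched; the documents themselves are never rescanned.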

Page 12: Kyiv.py #16 october 2015

Inverted index: normalization

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |
-------------------------

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
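Normalization means lowercasing, reducing words to a root form, and mapping synonyms to one term before indexing. The sketch below uses tiny hand-written lookup tables (`STEMS` and `SYNONYMS` are hypothetical stand-ins; a real analyzer uses a stemmer and a synonym filter). In this sketch `the` is indexed for Doc_1 only, since sentence 2 does not contain it.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Toy normalization rules, just enough for the two example sentences.
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}


def normalize(token):
    """Lowercase, then apply the toy stemming and synonym tables."""
    token = token.lower()
    token = STEMS.get(token, token)
    return SYNONYMS.get(token, token)


def build_normalized_index(documents):
    """Inverted index over normalized terms."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return dict(index)


index = build_normalized_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

The query must go through the same `normalize` step, otherwise a search for "Quick" or "foxes" would find nothing in the normalized index.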

Page 13: Kyiv.py #16 october 2015

Search Engines

Page 14: Kyiv.py #16 october 2015

ElasticSearch

Page 15: Kyiv.py #16 october 2015

Who uses ElasticSearch?

Page 16: Kyiv.py #16 october 2015

ElasticSearch: Quick Intro

Relational DB    ElasticSearch
-------------------------------
Databases        Indices
Tables           Types
Rows             Documents
Columns          Fields

Page 17: Kyiv.py #16 october 2015

ElasticSearch: Quick Intro

PUT /haystack/user/1
{
    "first_name" : "Andrii",
    "last_name" : "Soldatenko",
    "age" : 30,
    "about" : "I love to go rock climbing",
    "interests": [ "sports", "music" ],
    "likes": [ "python", "django" ]
}
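The same request can be prepared from Python with only the standard library. This is a sketch that builds the request without sending it; the `build_index_request` helper is hypothetical and the node address `http://localhost:9200` is an assumption:

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = "{}/{}/{}/{}".format(base_url.rstrip("/"), index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


user = {
    "first_name": "Andrii",
    "last_name": "Soldatenko",
    "age": 30,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
    "likes": ["python", "django"],
}

request = build_index_request("http://localhost:9200", "haystack", "user", 1, user)
print(request.get_method(), request.full_url)

# To send it against a running node (assumed, not started by this script):
#   from urllib.request import urlopen
#   print(urlopen(request).read().decode("utf-8"))
```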

Page 18: Kyiv.py #16 october 2015

ElasticSearch: Locks

• Pessimistic concurrency control

• Optimistic concurrency control
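ElasticSearch takes the optimistic approach: each document carries a version number, and a write that names a stale version is rejected instead of blocking other writers. The toy in-memory store below illustrates the idea only; `VersionedStore` and `VersionConflict` are hypothetical names, not ElasticSearch classes.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy in-memory store that bumps a version number on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, document, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version {}, found {}".format(
                    expected_version, current_version))
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version


store = VersionedStore()
store.put("1", {"age": 30})                             # creates version 1
version, doc = store.get("1")
store.put("1", {"age": 31}, expected_version=version)   # accepted, version 2
try:
    store.put("1", {"age": 32}, expected_version=version)  # stale version 1
except VersionConflict as exc:
    print("conflict:", exc)
```

On a conflict the caller is expected to re-read the document, reapply its change, and retry.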

Page 19: Kyiv.py #16 october 2015

ElasticSearch: Setup

#!/bin/bash

VERSION=1.7.1

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-$VERSION.zip
unzip elasticsearch-$VERSION.zip
cd elasticsearch-$VERSION

# Download plugin marvel
./bin/plugin -i elasticsearch/marvel/latest

echo 'marvel.agent.enabled: false' >> ./config/elasticsearch.yml

# run elastic
./bin/elasticsearch -d

Page 20: Kyiv.py #16 october 2015

ElasticSearch: Setup

$ curl 'http://localhost:9200/?pretty'

{
  "status" : 200,
  "name" : "Dredmund Druid",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.1",
    "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
    "build_timestamp" : "2015-07-29T09:54:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

Page 21: Kyiv.py #16 october 2015

ElasticSearch: Settings

curl -X POST 'http://localhost:9200/<index_name>/_close'

curl -XPUT "http://localhost:9200/<index_name>/_settings" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}'

curl -X POST 'http://localhost:9200/<index_name>/_open'
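To see what the `stopwords` setting does to text, here is a rough pure-Python approximation of such an analyzer. It is a sketch only: the regular-expression tokenizer is an assumption and is much simpler than the tokenization ElasticSearch's standard analyzer performs.

```python
import re

STOPWORDS = {"and", "the"}  # same list as in the analyzer settings above


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split on non-word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(analyze("The quick brown fox and the lazy dog"))
```

Stop words never reach the index, so they cost no space and cannot be searched for.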

Page 22: Kyiv.py #16 october 2015

Haystack

Page 23: Kyiv.py #16 october 2015

Adding search functionality to Simple Model

$ cat myapp/models.py

from django.db import models
from django.contrib.auth.models import User


class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name

Page 24: Kyiv.py #16 october 2015

Haystack: Installation

$ pip install django-haystack

$ cat settings.py

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',

    # Added.
    'haystack',

    # Then your usual apps...
    'blog',
]

Page 25: Kyiv.py #16 october 2015

Haystack: Settings

$ pip install elasticsearch

$ cat settings.py
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
...

Page 26: Kyiv.py #16 october 2015

Haystack: Creating SearchIndexes

$ cat myapp/search_indexes.py

import datetime
from haystack import indexes
from myapp.models import Page


class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    # Assumes Page also defines a pub_date field (not shown in models.py above).
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Page

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(
            pub_date__lte=datetime.datetime.now())

Page 27: Kyiv.py #16 october 2015

Haystack: SearchQuerySet API

from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()

hello_results = SearchQuerySet().filter(content='hello')

unfriendly_results = SearchQuerySet().\
    exclude(content='hello').\
    filter(content='world')

# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
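For intuition, an exclude/filter chain like `unfriendly_results` corresponds roughly to a boolean query with a "must not" clause and a "must" clause. The sketch below builds such a request body as a plain dictionary. It is an illustration only, not the query Haystack's backend generates; `build_bool_query` is a hypothetical helper and the `content` field name is used here purely for illustration.

```python
import json


def build_bool_query(must=None, must_not=None):
    """Build a bool query body from {field: text} mappings."""

    def match_clauses(criteria):
        return [{"match": {field: text}}
                for field, text in (criteria or {}).items()]

    return {
        "query": {
            "bool": {
                "must": match_clauses(must),
                "must_not": match_clauses(must_not),
            }
        }
    }


body = build_bool_query(must={"content": "world"},
                        must_not={"content": "hello"})
print(json.dumps(body, indent=2))
```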

Page 28: Kyiv.py #16 october 2015

How to configure ElasticSearch?

https://github.com/django-haystack/django-haystack/blob/9d92d4da0a1ec75978fc3949375dda9a1707469f/haystack/backends/elasticsearch_backend.py#L41

Page 29: Kyiv.py #16 october 2015

ElasticSearch settings

Page 30: Kyiv.py #16 october 2015

ElasticStack backend

https://github.com/bennylope/elasticstack

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

ELASTICSEARCH_INDEX_SETTINGS = {}

ELASTICSEARCH_DEFAULT_ANALYZER = 'synonym_analyzer'
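The slide leaves `ELASTICSEARCH_INDEX_SETTINGS` empty. One way it might be filled in so that `synonym_analyzer` actually exists is sketched below. The analyzer and filter definitions use standard ElasticSearch analysis settings, but the synonym list is invented for illustration and the exact shape elasticstack expects should be checked against its README.

```python
# Sketch of a settings.py fragment; the synonym list is an assumption.
ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "synonym"],
                },
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms": ["leap, jump"],
                },
            },
        },
    },
}

ELASTICSEARCH_DEFAULT_ANALYZER = "synonym_analyzer"

# Sanity check: the default analyzer must be defined in the settings above.
analysis = ELASTICSEARCH_INDEX_SETTINGS["settings"]["analysis"]
assert ELASTICSEARCH_DEFAULT_ANALYZER in analysis["analyzer"]
```

With a synonym filter in place, a search for "jump" can also match documents that only say "leap", which is the normalization step from the inverted index slides.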

Page 31: Kyiv.py #16 october 2015

Keeping data in sync

# Update everything.
./manage.py update_index --settings=settings.prod

# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2

# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod

# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod

# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --settings=settings.prod

Page 32: Kyiv.py #16 october 2015

Signals

class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates the
    search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then disconnecting signals only for those.
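The mechanism behind the processor is a plain observer pattern: receivers are connected to save and delete signals, and each signal updates the index. The toy below mimics that flow without Django. `Signal` and `ToyRealtimeProcessor` are hypothetical stand-ins, not Django or Haystack classes, and a dictionary plays the role of the search index.

```python
class Signal:
    """Tiny stand-in for a Django signal: a list of receiver callables."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        if receiver not in self._receivers:
            self._receivers.append(receiver)

    def disconnect(self, receiver):
        if receiver in self._receivers:
            self._receivers.remove(receiver)

    def send(self, instance):
        for receiver in list(self._receivers):
            receiver(instance)


post_save = Signal()
post_delete = Signal()


class ToyRealtimeProcessor:
    """Keeps a dict-based 'index' in sync with save/delete signals."""

    def __init__(self, index):
        self.index = index

    def setup(self):
        post_save.connect(self.handle_save)
        post_delete.connect(self.handle_delete)

    def teardown(self):
        post_save.disconnect(self.handle_save)
        post_delete.disconnect(self.handle_delete)

    def handle_save(self, instance):
        self.index[instance["id"]] = instance["name"]

    def handle_delete(self, instance):
        self.index.pop(instance["id"], None)


search_index = {}
processor = ToyRealtimeProcessor(search_index)
processor.setup()
page = {"id": 1, "name": "Home"}
post_save.send(page)       # the index now holds the page
post_delete.send(page)     # and now it does not
processor.teardown()
post_save.send(page)       # ignored: no receivers are connected
print(search_index)
```

In a real project the processor is selected in settings.py with `HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'`; every save then triggers an index update, which trades write latency for freshness.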

Page 33: Kyiv.py #16 october 2015

Haystack: Pros and Cons

Pros:

• easy to set up

• looks like the Django ORM, but for searches

• search engine independent

• supports 4 engines (Elastic, Solr, Xapian, Whoosh)

Cons:

• poor SearchQuerySet API

• difficult to manage stop words

• loses performance because of the extra layer

• model-based

Page 34: Kyiv.py #16 october 2015

Final Thoughts

https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html

Page 35: Kyiv.py #16 october 2015

Thank You

@a_soldatenko

https://asoldatenko.com

Page 36: Kyiv.py #16 october 2015

Questions

?