Search Analytics: Business Value & NoSQL Backend
Otis Gospodnetić – Sematext International
@otisg ◦ @sematext ◦ sematext.com
sematext.com/search-analytics
Copyright 2011 Sematext Int'l. All rights reserved.
About Otis Gospodnetić
• ASF Member: Lucene, Solr, Nutch, Mahout
• Author: Lucene in Action 1 & 2
• Entrepreneur: Sematext, Simpy
Sematext Metrics
● 100% organic: no GMO, no VC
● 4 years old
● < 10 people
● 7 countries
● 3 timezones
● 2 continents
● > 100 customers
About Sematext
Products & Services
Consulting, Development, Tech Support:
● Search (Lucene, Solr, ElasticSearch...)
● Big Data (Hadoop, HBase, Voldemort...)
● Web Crawling (Nutch, Droids)
● Machine Learning (Mahout)
Agenda
● What is Search Analytics and why it matters
● Example reports and their value
● What we built, why, and how
Communication
● twitter.com/sematext
● twitter.com/otisg
● hashtags: #stsa or #stanalytics
● http://sematext.com/search-analytics/index.html
● Raise your hand!
● [email protected]
The Compass
Search logs are your Map
Search Analytics is your Compass
High Level Why
search users
search providers
search experience
High Level Why
search providers: "Cool, the latest search tweaks made our site really sticky!" "Awesome!"
search experience
search users: "This search sucks! It takes 17 tries to find anything here!" "F!?@#$%^&?!?"
Don't Be Like This Dude
Got Clue?
Search Analytics
Performance Monitoring
Quality Assurance
Tuning UI
More Concrete Why
● Measure and monitor everything. Introspection.
● Supports (re)design, navigation choices
● Helps with content acquisition & enhancement
● Improve search experience
● Moola (money)
The Moment of Truth
Question for the audience #1
What do you use for Search Analytics?
a) Home-grown stuff
b) Google Analytics
c) Omniture
d) Webtrends
e) Other
f) Nothing
Search Analytics Outline
● Collect: queries & clicks & interactions & ...
● Analyze: actions / transactions / conversions
● Output: reports – over time
● Output++: feedback loop
Remember this:
● The means, not the goal
● Ongoing, not one-off
Search vs. Web Analytics
● Search analytics sees stated user intent and information needs; web analytics has to infer them
● The two go hand in hand
● Ideally you can relate data from both or even unify it
Example Core Reports
● Rate & Volume, Latency (mean, avg, 90%)
● Click Through Rate, Mean Reciprocal Rank
● Top Queries by count, clicks, 0 hits...
● Query Trending
● Top Seen Docs, Top Clicked Docs (msft)
● Page & Click Depth
● Facet & Sort Usage
● ...
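Two of the reports above, Click Through Rate and Mean Reciprocal Rank, can be sketched in a few lines. This is an illustrative sketch only, not the Sematext implementation: the `SearchEvent` record and its field names are hypothetical, and it assumes the common convention that a search with no click contributes 0 to the MRR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchEvent:
    """One search, as it might be logged (hypothetical record layout)."""
    query: str
    clicked_rank: Optional[int]  # 1-based rank of the first click, None if no click


def click_through_rate(events: List[SearchEvent]) -> float:
    """Fraction of searches that led to at least one click."""
    if not events:
        return 0.0
    clicks = sum(1 for e in events if e.clicked_rank is not None)
    return clicks / len(events)


def mean_reciprocal_rank(events: List[SearchEvent]) -> float:
    """Average of 1/rank of the first click; searches without a click count as 0."""
    if not events:
        return 0.0
    total = sum(1.0 / e.clicked_rank for e in events if e.clicked_rank is not None)
    return total / len(events)


events = [
    SearchEvent("hbase", 1),
    SearchEvent("flume", 2),
    SearchEvent("solr", None),
    SearchEvent("lucene", 4),
]
ctr = click_through_rate(events)
mrr = mean_reciprocal_rank(events)
```

A falling MRR with a steady CTR would suggest users still click, but on results further down the page.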
More Reports in More Detail
● See "Search Analytics: What? Why? How?"
http://blog.sematext.com/tag/analytics/
Part Dos
Switching gears... Juno digs NoSQL
What We've Built
● Search Analytics SaaS
● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
● Trending over time
● Comparisons of time periods
● Top N reports
● Filter, slice and dice
Who Needs a Compass?
● We need it
● search-hadoop.com & search-lucene.com
● Our customers need it!
● You?
Sematext Search Analytics
Big Dreams
● SaaS
● Multitenant
● Large Scale – Massive Data
● Cloud
Storage Choices
● RDBMS: MySQL, PostgreSQL
● HDFS
● Hive
● HBase
● Cassandra
SaaS vs. In-House
Question for the audience #2
SaaS vs. in-house Search Analytics?
a) SaaS
b) in-house
Sematext Search Analytics
Data Flow
● See "Search Analytics with Flume and HBase"
http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
Data Collection
● See "Search Analytics with Flume and HBase"
http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
Core Tech
● JavaScript Beacons
● Metric Capture Web App aka Receiver
● Flume Agents, Collectors, Sinks
● HBase
● MapReduce Aggregations
● Search Analytics Reporting Web App
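To make the first two pieces concrete, here is a rough sketch of what a receiver might do with a single beacon request: turn the beacon URL's query string into a normalized event record. Everything specific here is an assumption for illustration: the receiver host, the parameter names (`tenant`, `q`, `hits`, `ms`) and the record layout are hypothetical, not the actual Sematext beacon format.

```python
from urllib.parse import parse_qs, urlsplit


def parse_beacon(url: str) -> dict:
    """Turn one beacon request URL into an event record (hypothetical format)."""
    params = parse_qs(urlsplit(url).query)

    def first(name: str, default: str = "") -> str:
        # parse_qs returns a list per parameter; take the first value.
        return params.get(name, [default])[0]

    return {
        "tenant": first("tenant"),
        # Normalize so "HBase " and "hbase" aggregate as the same query.
        "query": first("q").strip().lower(),
        "hits": int(first("hits", "0")),
        "latency_ms": int(first("ms", "0")),
    }


event = parse_beacon(
    "http://receiver.example.com/b.gif?tenant=acme&q=Flume+HBase&hits=12&ms=87"
)
```

In a pipeline like the one listed above, a record of this kind would then be handed to the collection layer rather than written to storage directly.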
What is Flume
● Distributed data/log collection service
● Scalable, configurable, extensible
● Centrally manageable, open source
● Agents get data from app, Collectors save it
● Abstractions: Source → Decorator(s) → Sink
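The Source → Decorator(s) → Sink abstraction can be illustrated with a tiny conceptual pipeline. This is plain Python and does not use Flume's API or configuration syntax; `list_source`, `drop_empty`, `add_host`, `MemorySink` and `run_pipeline` are hypothetical names made up for this sketch.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: produces one event per input line."""
    for line in lines:
        yield {"body": line}


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filters out events with a blank body."""
    for event in events:
        if event["body"].strip():
            yield event


def add_host(host: str) -> Decorator:
    """Decorator factory: tags every event with the originating host."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, "host": host}
    return decorate


class MemorySink:
    """Sink: stores events in memory (a real sink would write to HBase, HDFS...)."""
    def __init__(self) -> None:
        self.events: List[Event] = []

    def write(self, event: Event) -> None:
        self.events.append(event)


def run_pipeline(source: Iterator[Event], decorators: List[Decorator], sink: MemorySink) -> int:
    """Chain source -> decorators -> sink and return the number of events written."""
    stream = source
    for decorator in decorators:
        stream = decorator(stream)
    written = 0
    for event in stream:
        sink.write(event)
        written += 1
    return written


sink = MemorySink()
written = run_pipeline(
    list_source(["q=hbase", "   ", "q=flume"]),
    [drop_empty, add_host("web-1")],
    sink,
)
```

Because each decorator only wraps a stream of events, stages can be added or reordered without touching the source or the sink.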
What is HBase
● Scalable, reliable, distributed, column-oriented DB
● On top of HDFS
● MapReduce-able
Data Flow, Detailed
Why Flume
● Reliable delivery
  ● e.g. queue messages locally if destination unreachable
● Easy, centralized management via Web UI or console
● Good community, good progress, now @ASF
● But: more complex, more moving parts
● On Flume: slideshare.net/cloudera/inside-flume
● Alternatives: Kafka, Scribe...
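The "queue locally if the destination is unreachable" idea can be sketched as follows. This is a simplified in-memory illustration of the general technique, not how Flume implements it (a production agent would typically persist the queue to disk); `BufferingSender` and the `send` callable are hypothetical.

```python
from collections import deque
from typing import Callable, Deque


class BufferingSender:
    """Keeps messages in a local queue until the destination accepts them."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 10000) -> None:
        # `send` is assumed to raise ConnectionError when the destination is down.
        self._send = send
        # Bounded buffer: once full, the oldest message is dropped on append.
        self._pending: Deque[str] = deque(maxlen=max_buffered)

    def submit(self, message: str) -> int:
        """Queue a message, then try to deliver everything queued so far."""
        self._pending.append(message)
        return self.flush()

    def flush(self) -> int:
        """Deliver queued messages in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # destination unreachable: keep the rest for later
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        return len(self._pending)


received = []
destination = {"up": False}


def send(message: str) -> None:
    if not destination["up"]:
        raise ConnectionError("collector unreachable")
    received.append(message)


sender = BufferingSender(send)
first_try = sender.submit("q=hbase")    # destination down: stays queued
second_try = sender.submit("q=flume")   # still down
queued_while_down = sender.pending
destination["up"] = True
third_try = sender.submit("q=solr")     # destination back: all three delivered
```

A message is removed from the queue only after `send` returns without error, so a failure mid-flush never loses the message being sent.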
Why HBase
● Scalable raw & aggregate data storage
● MapReduce data input
● Fast scans for time ranges, fast key lookups
● Easy storage and compute power expansion
● Good looking roadmap, community, progress
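Fast time-range scans come from HBase keeping rows sorted by row key, so a key that starts with the tenant and ends with a fixed-width timestamp turns "all events for tenant X between t1 and t2" into one contiguous scan. The sketch below emulates that with a sorted in-memory list; it does not use the HBase client API, and the key layout (`tenant|timestamp`) is an assumption for illustration, not the schema used in the product.

```python
import bisect
from typing import Dict, List, Tuple


def row_key(tenant: str, epoch_seconds: int) -> str:
    """Tenant first, then a zero-padded timestamp so string order equals time order."""
    return f"{tenant}|{epoch_seconds:010d}"


class SortedStore:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self) -> None:
        self._keys: List[str] = []
        self._values: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: str) -> str:
        """Direct key lookup."""
        return self._values[key]

    def scan(self, start_key: str, stop_key: str) -> List[Tuple[str, str]]:
        """Rows with start_key <= key < stop_key, found by binary search."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
store.put(row_key("acme", 100), "q=hbase")
store.put(row_key("acme", 200), "q=flume")
store.put(row_key("acme", 300), "q=solr")
store.put(row_key("other", 150), "q=nutch")

# All "acme" events with 150 <= timestamp < 300.
in_range = store.scan(row_key("acme", 150), row_key("acme", 300))
```

The scan touches only the rows inside the requested range, regardless of how many other tenants or time periods are stored.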
Open Sourcing
● 2 open-source projects:
  github.com/sematext/HBaseWD
  github.com/sematext/HBaseHUT
● See sematext.com/open-source/index.html
● Patches for Flume and HBase: blog.sematext.com/tag/flume/
Challenges
● Data size. Solutions:
  ● Compression (4-5x smaller with LZO)
  ● Data pruning (variable levels)
● Query string distribution: very long-tail
● Lots of data to process, update, aggregate
● Young tools: Flume, HBase
● Poor IO on EC2
● Hadoop distributions
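The slide does not say how pruning is done, so the following is only one plausible way to tame a long-tail query distribution: keep exact counts for queries seen at least `min_count` times and fold everything rarer into a single bucket. The function name, the threshold and the `(other)` label are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def prune_long_tail(
    queries: Iterable[str], min_count: int = 2, other_label: str = "(other)"
) -> Dict[str, int]:
    """Keep per-query counts for frequent queries; sum rare ones into one bucket."""
    counts = Counter(q.strip().lower() for q in queries)
    kept = {q: c for q, c in counts.items() if c >= min_count}
    tail_total = sum(c for c in counts.values() if c < min_count)
    if tail_total:
        kept[other_label] = tail_total
    return kept


pruned = prune_long_tail(
    ["hbase", "HBase", "flume", "solr", "hbase ", "flume", "nutch"]
)
```

Raising `min_count` for older data is one way to read "variable levels": recent periods keep more detail, older ones keep only the head of the distribution. Total query volume is preserved either way, since the tail is summed rather than discarded.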
Output++
● AutoComplete - $MM improvement
● Better Did-You-Mean (DYM) spellchecker
● Related Searches
● Recommendations
● Relevance Feedback
● ...
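As an example of feeding analytics back into search, logged queries can drive AutoComplete directly: suggest the most frequent past queries that start with what the user has typed. This is a minimal sketch under that assumption, with hypothetical names; a production suggester would more likely use a prefix index than a linear scan.

```python
from collections import Counter
from typing import Callable, Iterable, List


def build_suggester(queries: Iterable[str]) -> Callable[..., List[str]]:
    """Build a prefix suggester ranked by how often each query was logged."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())

    def suggest(prefix: str, limit: int = 5) -> List[str]:
        p = prefix.strip().lower()
        matches = [(q, c) for q, c in counts.items() if q.startswith(p)]
        # Most frequent first; ties broken alphabetically for stable output.
        matches.sort(key=lambda qc: (-qc[1], qc[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


suggest = build_suggester(
    ["hbase", "hbase scan", "hbase", "hadoop", "flume", "hbase scan", "hbase"]
)
suggestions = suggest("h")
```

Ranking by logged frequency means the suggestions improve on their own as more queries are collected, which is the feedback loop in miniature.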
Closing the Loop
search users
search providers
search experience
Resource
Search Analytics for Your Site, by Louis Rosenfeld
http://rosenfeldmedia.com/books/searchanalytics/
We're Hiring
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open-source?
We're hiring world-wide!
http://sematext.com/about/jobs.html
Contact
sematext.com ◦ blog.sematext.com ◦ @sematext ◦ @otisg ◦ [email protected]
Want Search Analytics? Grab me or go to: sematext.com/search-analytics
Hashtags: #stsa or #stanalytics