download slides - nosql roadshow
TRANSCRIPT
![Page 1: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/1.jpg)
16.05.2013
Full-text Search with NoSQL Technologies
NoSQL Search Roadshow 2013, Berlin
Kai Spichale
![Page 2: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/2.jpg)
About me
► Kai Spichale
► Software Engineer at adesso AG
► NoSQL, Full-text searching, Spring, Java EE
16.05.2013 1
► adesso is among Germany‘s top IT service providers
► Consulting and software development focus
► More than 1,000 members of staff
► Some of the most important customers are Allianz, Hannover Rück, Westdeutsche Lotterie, Zurich Versicherung, DEVK, and DAK
![Page 3: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/3.jpg)
Motivation
NoSQL
► Exponential data growth
► Semi-structured data
► More connections
► 80 percent of business-relevant information is in unstructured form
Search
► Shift in data access:
> More full-text search
> Higher user expectations
► Keyword search and link directories become impractical
16.05.2013 2
![Page 4: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/4.jpg)
Agenda
► Lucene full-text search
► NoSQL:
> Architectural drivers
> MongoDB
> Neo4j
> Apache Cassandra
> Apache Hadoop
► Summary
16.05.2013 3
![Page 5: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/5.jpg)
Full-text search
► Techniques for searching documents in collections
► grep-like naive approach:
> Serial scanning is slow
> No negation
> No distinction between phrase and keyword search
► Build inverted index
> Term Document
> Contains references to documents for each token
16.05.2013 4
![Page 6: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/6.jpg)
Apache Lucene
► Java lib for full-text searches
► De facto standard for open source software
► Attributes:
> Application-agnostic
> Scalable, high performance
► Features:
> Ranked searching
> Multiple query types, faceting
> Sorting
> Multi-Index searching
16.05.2013 5
![Page 7: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/7.jpg)
Text Analysis
16.05.2013 6
Extraction
Parsing
Character
Filter
Tokenizer
Token Filter
Documents
de.GermanAnalyzer:
StandardTokenizer > StandardFilter
> LowerCaseFilter > StopFilter > GermanStemFilter
Inverteted
Index
![Page 8: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/8.jpg)
Text Analysis
16.05.2013 7
ID Term Document
1 come 2
2 dog 1
3 eat 1
4 exception 3
5 first 2
5 food 1
6 own 1
7 prove 3
8 rule 3
9 serve 2
10 your 1
Eat your
own dog
food.
First
come,
first
served.
The
exception
proves the
rule.
a
and
around
every
for
from
in
is
it
not
on
one
the
to
under
Stop word List
![Page 9: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/9.jpg)
Query types
Type Example
Term
(MUST, MUST_NOT, SHOULD)
+adesso –italy
Phrase „foo bar“
Wildcard fo*a?
Fuzzy fobar~
Range [A TO Z]
16.05.2013 8
![Page 10: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/10.jpg)
Agenda
► Lucene full-text search
► NoSQL:
> Architectural drivers
> MongoDB
> Neo4j
> Apache Cassandra
> Apache Hadoop
► Summary
16.05.2013 9
![Page 11: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/11.jpg)
NoSQL and Search
One size fits all approach
► Which NoSQL store satisfies our requirements best?
► Is full-text search supported?
5/16/2013 10
Data Structure Access Patterns
Volume Performance
Availability Updates
Consistency
![Page 12: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/12.jpg)
NoSQL and Search
5/16/2013 11
Let‘s take a closer look on:
► MongoDB
► Neo4j
► Apache Cassandra
► Apache Hadoop
![Page 13: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/13.jpg)
Document-oriented Databases
{ "_id" : ObjectId(„42"),
"firstname" : "John",
"lastname" : "Lennon",
"address" : { "city" : "Liverpool",
"street" : "251 Menlove Avenue“ }
}
5/16/2013 12
► Designed for storing and retrieving documents
► Semi-structured content such as BSON documents
![Page 14: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/14.jpg)
MongoDB
► Supports ad-hoc CRUD operations
db.things.find({firstname:"John"})
► Server-side execution of JavaScript
► Aggregations, MapReduce
► Simple keyword search with multikey indexes:
> Index array content as separate entries
16.05.2013 13
{ article : “some long text",
_keywords : [ “some" , “long" , “text“]
}
![Page 15: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/15.jpg)
MongoDB
► Version 2.4 supports text indexes
► Language-specific stemming based on Snowball
► Still a beta feature
16.05.2013 14
db.foo.runCommand(“text“, {search: “adesso –italy”, language: “english”})
![Page 16: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/16.jpg)
MongoDB
► Mongo Connector integrates MongoDB with another system (backup MongoDB cluster, Solr, elasticsearch,)
► System architecture with separate search engine possible
16.05.2013 15
MongoDB SolrMongo
Connector
1 2 3 4 5
update synccreatedocument index search
![Page 17: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/17.jpg)
Choosing the Right Approach
MongoDB MongoDB
+ Search Engine
Search Engine
No result set merging
Complex queries with
aggregations
Simple text search
(but experimental text
index)
Full-text search with
faceting
Complex queries with
aggregations
Result set merging
Increased complexity
(ops, dev)
No result set merging
Full-text search with
faceting
Backup?
Aggregations?
16.05.2013 16
![Page 18: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/18.jpg)
Graph Databases
► Stored data is represented as graph structures
> Nodes
> Edges (Relationships)
> Properties
► Universal datamodel
► Traversing
► Example: Neo4j
5/16/2013 17
id=1
name=“John“
id=2
name=“George“
id=3
name=“Paul“
friend friend
![Page 19: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/19.jpg)
Neo4j
► Traversing
> Visiting nodes by following relationships
> Breadth- and depth-first traversing
> Gremlin, Cypher
Result = George
5/16/2013 18
START john=node:peoplesearch(name=‘John’)
MATCH john<-[:friend]->afriend RETURN afriend
![Page 20: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/20.jpg)
Neo4j
► Database itself is a natural index consisting of its edges and nodes
> Example: „name“, „city“
► Auto indexing keeps track of property changes
16.05.2013 19
personRepository.findByPropertyValue("name", "John");
![Page 21: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/21.jpg)
Neo4j
► The default separate index engine used is Apache Lucene
16.05.2013 20
Index<PropertyContainer> index = template.getIndex("peoplesearch");
index.query("name", "Jo*");
@NodeEntity class Person {
@Indexed(indexName="peoplesearch", indexType=IndexType.FULLTEXT) private String name;
..
}
![Page 22: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/22.jpg)
Wide Column Store
► Google BigTable: „a sparse, distributed multi-dimensional sorted map“
► Data is organized in rows, column families, and columns
► Ideal for sharding (horizontal partitioning)
16.05.2013 21
jlennon
pmccart
gharris
name
„Lennon“
name
„McCartney“
name
„Harrison“
state
„UK“
address
„Liverpool ..“
address
„Liverpool ..“
different columns per row
unique
row keys
![Page 23: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/23.jpg)
Apache Cassandra
► BigTable clone
► Distributed Hash Table (Amazon Dynamo)
► Eventual consistency (configurable levels)
► Cassandra Query Language (CQL) = SQL dialect without joins
► Hadoop integration
5/16/2013 22
SELECT name FROM user WHERE firstname=„John“;
![Page 24: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/24.jpg)
Apache Cassandra
► Solandra = Solr using Cassandra as backend
► DataStax Enterprise Search
> One local Solr instance per Cassandra node
> Integration is based on secondary index API
> CQL supports Solr Queries
> Cassandra’s ring information is used to
construct Solr distributed search queries
16.05.2013 23
SELECT title FROM solr WHERE solr_query=‘name:jo*';
![Page 25: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/25.jpg)
Agenda
► Lucene full-text search
► NoSQL:
> Architectural drivers
> MongoDB
> Neo4j
> Apache Cassandra
> Apache Hadoop
► Summary
16.05.2013 24
![Page 26: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/26.jpg)
Apache Hadoop
► Hadoop:
> Framework for distributed processing of large data sets in computer clusters
> Distributed filesystem + MapReduce implementation
► Scalable and reliable platform of a comprehensive data analysis ecosystem
16.05.2013 25
![Page 27: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/27.jpg)
Hadoop MapReduce
5/16/2013 26
Persistent Data
Map Map Map Map
Transient Data
Persistent Data
Reduce Reduce Reduce
► Map Phase:
> Records are processed by map function
► Shuffle Phase:
> Distributed sort and grouping
► Reduce Phase:
> Intermediate results are processed by reduce function
![Page 28: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/28.jpg)
Hadoop MapReduce
► Data is processed by mappers and reducers
5/16/2013 27
map(k, v) -> [(K1,V1), (K2,V2), ... ]
reduce(Kn, [Vi, Vj, …]) ->
(Km, R)
Data
Mapper
Shuffle Reducer Result
![Page 29: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/29.jpg)
What kind of problems does MapReduce solve?
► Problems processed without reducer
> Searching
> File converting
> Sorting
> Map-side join
► Problem processed with reducer
> Grouping and aggregation
> Reduce-side join
► More complex problems:
> Solved by combinations of multiple MapReduce jobs
5/16/2013 28
![Page 30: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/30.jpg)
Hadoop MapReduce: Searching
► Search document including „A“
5/16/2013 29
1
2 1
Documents
1: A,B,C
3
4
5
Mapper emits only documents that fit
the searching criteria
4
5
2: D,E
3: B,E
4: A,D
5: A,C,E
Result = 1, 4, 5
![Page 31: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/31.jpg)
Hadoop MapReduce: Indexing
16.05.2013 30
HDFS
MapReduce
Job
Lucene Lucene
Index
► HDFS:
> Stores raw data
► Mapper:
> Extracts text (creates e.g. SolrInputDocument)
> Calls Lucene for indexing (calls e.g. StreamingUpdateSolrServer)
![Page 32: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/32.jpg)
Hadoop MapReduce: Indexing
16.05.2013 31
1
2
Daten
1: text
3
4
5
2: text
3: text
4: text
5: text
Mapper
@Override public void map( LongWritable key, Text val, OutputCollector<NullWritable, NullWritable> output, Reporter reporter) throws IOException { st = new StringTokenizer(val.toString()); lineCounter = 0; while (st.hasMoreTokens()) { doc= new SolrInputDocument(); doc.addField("id", fileName + key.toString() + lineCounter++); doc.addField("txt", st.nextToken()); try { server.add(doc); } catch (Exception exp) { … } }}
![Page 33: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/33.jpg)
Apache Tika
16.05.2013 32
HDFS
MapReduce
Job
Lucene Lucene
Index
Tika
► Extracts metadata and structured text content
> HTML, MS Office documents, PDF, etc.
► Stream parser can process large files
![Page 34: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/34.jpg)
Apache Solr / elasticsearch
5/16/2013 33
HDFS
MapReduce
Job
Lucene Lucene
Index
Tika
ela
sticse
arc
h
► Lucene is only a libary, not a standalone search engine
► Complete search engines:
> Solr
> ElasticSearch
![Page 35: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/35.jpg)
Apache Flume
5/16/2013 34
HDFS
MapReduce
Job
Lucene Lucene
Index
Tika
Flume
Web Server,Applikations
ela
sticse
arc
h
► Distributed service for collecting, aggregating and moving large amounts of data (e.g. log data)
► Streaming techniques
► Fault tolerant
![Page 36: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/36.jpg)
Alternatives
5/16/2013 35
HDFS
MapReduce
Job
Lucene Lucene
Index
Tika
Flume
Web Server,Apps, DBs
Crawler DistCp Sqoop
ela
sticse
arc
h
► Nutch Crawler creates one entry in CrawlDB per URL
► Hadoop DistCp copies data within and between hadoop systems
► Apache Sqoop transfers bulk data between Hadoop and RDMBS
![Page 37: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/37.jpg)
Apache Hadoop
5/16/2013 36
Web Content,Intranet
Loading
Tools
Hadoop
Search Analysis Export Visualization
► Fundamental mismatch:
> MapReduce for batch processing
> Lucene for interactive searching
► MapReduce for indexing large datasets
► Basis for (offline) BigData solutions
![Page 38: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/38.jpg)
Summary
► More semi-structured data
► Increasing relevance of full-text searching
► Combination of NoSQL and Lucene:
> MongoDB: integration via MongoDB Connector
> Neo4j: native Lucene integration
> Cassandra: Datastax‘s Solr integration
> Hadoop: indexing large datasets with MapReduce
► Alternative: search engine as document-oriented database
5/16/2013 37
![Page 39: Download slides - NoSQL Roadshow](https://reader031.vdocuments.us/reader031/viewer/2022021005/620382a8da24ad121e4a4252/html5/thumbnails/39.jpg)
Thank you for your attention!
16.05.2013 38