![Page 1: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/1.jpg)
SCALING THE DOCUMENT REPOSITORY
WITH ELASTICSEARCH
![Page 2: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/2.jpg)
SOME CONTEXT
What We Do and What Problems We Try to Solve
![Page 3: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/3.jpg)
NUXEO
we provide a Platform that developers can use to build highly
customized Content Applications
we provide components, and the tools to assemble them
everything we do is open source (for real)
various customers - various use cases
me: developer & CTO - joined the Nuxeo project 10+ years ago
Track game builds, Electronic Flight Bags, Central repository for Models, Food industry PLM
https://github.com/nuxeo
![Page 4: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/4.jpg)
DOCUMENT REPOSITORY
Store Documents / Assets / Objects
Blob objects
Complex data Structures
Hierarchy, references and links
Audit trail & Versioning
Data level security & encryption
Lifecycle, workflows ...
API (REST, CMIS, Java, JS...)
CRUD
Search
Service API
Heavily configurable : all data structures are flexible / customizable
Used by developers to build Content Applications on top of
the Nuxeo Repository
![Page 5: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/5.jpg)
OUR CHALLENGES
CRUD on large repository works
inject at 6,000 docs/s up to 1 Billion
not so many companies have that many documents anyway
Queries are the main scalability issue
impact of C_UD (create, update, delete) vs search
multi-criteria queries + full-text
security filtering
configurable data structures
user defined queries
UI heavily depends on search
Search API is the most used:
search is the main scalability challenge
![Page 6: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/6.jpg)
HISTORY: NUXEO & LUCENE
2006: Nuxeo CPS 3.6
(Python / Zope based)
Replace built-in index with
Lucene + XML-RPC server
PyLucene
(GCJ build + Python bindings!)
Complex setup
2007: Nuxeo Platform 5.1
JCR : queries (and backup) issues
Integrate Compass Core
transactional & storage abstraction
Missing sync & concurrency issues
2009: Nuxeo 5.2
VCS : Homebrew SQL based repository
Search in database but some real limitations
2013 / 2014: Nuxeo 5.9.3
Reintroduce Lucene in the stack via elasticsearch
Learn from our past mistakes
Leverage elasticsearch architecture
easy deployment
safe indexing
powerful search
... we are now happy with Elasticsearch
Lucene and Nuxeo have a long story ...
![Page 7: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/7.jpg)
REPOSITORY & SEARCH
Understanding the Issue
![Page 8: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/8.jpg)
REPOSITORY & SEARCH
Search API is the most used :
search is the main scalability challenge
![Page 9: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/9.jpg)
COMPLEX SQL QUERIES
Configurable Data Structure
+ User defined multi-criteria searches
=> multiple & complex SQL queries
SELECT "hierarchy"."id" AS "_C1"
FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
  AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
  AND ("hierarchy"."isversion" IS NULL)
  AND ("_F1"."lifecyclestate" <> 'deleted')
  AND ("_F2"."created" IS NOT NULL)
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
![Page 10: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/10.jpg)
ABOUT SQL LIMITATIONS
Scaling queries is complex
depends on indexes, I/O speed and available memory
cannot satisfy all types of queries
poor performance on unselective multi-criteria queries
some types of queries can simply not be fast in SQL
Scalability
Scale up is expensive
Scale out is complex at best (XA & MVCC)
Sharding requires a global index
Fulltext support is usually poor
limitations on features & impact on performance
SQL technology is not the solution
![Page 11: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/11.jpg)
IS NOSQL THE SOLUTION!?
![Page 12: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/12.jpg)
USING NOSQL FOR THE REPOSITORY
![Page 13: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/13.jpg)
ABOUT THE NOSQL OPTION
(sadly) NoSQL is no magic
it does work very well for CRUD and it scales easily, but
query options are limited and performance is not that good
multi-document transactions are usually not safe
better suited to DBs with billions of entries and simple queries
SQL has some real advantages
ACID (and MVCC) is good
Workflows and bulk updates are a typical use case
(even transient) lack of consistency is complex to explain to users
lots of existing tools (BI & reporting), lots of existing skills (DBA)
PGSQL (or AWS RDS) can be very cost effective
Neither a SQL nor a NoSQL repository is the solution
![Page 14: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/14.jpg)
KEEP THE REPOSITORY
SQL OR NOSQL
BUT
FIND A SUPER FAST INDEX ENGINE
![Page 15: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/15.jpg)
REPOSITORY & ELASTICSEARCH
Toward a Hybrid Storage
![Page 16: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/16.jpg)
HYBRID STORAGE
Use each storage solution for what it does best
SQL DB
store content in an ACID way
store & retrieve
queries that need ACID and MVCC
elasticsearch
provide powerful and scalable queries
do the heavy lifting that the RDBMS cannot do
scoring, native full-text, aggregates
distributed search
Route the query to the correct index depending on requirements
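The routing rule above can be sketched in a few lines. This is only an illustration, not the Nuxeo implementation: `SearchRequest`, `needs_transactional_view` and `choose_backend` are hypothetical names, and the single criterion used here (does the caller need ACID/MVCC semantics?) is an assumption drawn from the two bullets of the slide.

```python
from dataclasses import dataclass


@dataclass
class SearchRequest:
    """Hypothetical description of what a caller needs from a query."""
    nxql: str
    # True when the caller must see its own uncommitted writes (ACID / MVCC).
    needs_transactional_view: bool = False


def choose_backend(request: SearchRequest) -> str:
    """Return the name of the backend that should execute the query."""
    if request.needs_transactional_view:
        # Only the repository database can answer from inside the transaction.
        return "repository"
    # Everything else (full-text, scoring, aggregates) goes to the index.
    return "elasticsearch"
```

The point of the design is that the query itself stays the same; only the execution backend changes.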
![Page 17: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/17.jpg)
ELASTICSEARCH & REPOSITORY
One query
Several possible backends
![Page 18: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/18.jpg)
PERFORMANCE RESULTS
Fast indexing
No ACID constraints / No impedance issue
3,500 documents/s when using SQL backend
10,000 documents/s when using MongoDB
Super query performance
query on term using inverted index
very efficient caching
native full text support & distributed architecture
3,000 queries/s with 1 elasticsearch node
6,000 queries/s with 2 elasticsearch nodes
![Page 19: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/19.jpg)
SOME REAL LIFE FEEDBACK
Customer: “We are now testing the Nuxeo 6 stack in AWS. DB is Postgres SQL db.r3.8xlarge, which is a 32-CPU instance. Between 350 and 400 tps the DB CPU is maxed out.”
Nuxeo support: “Please activate nuxeo-elasticsearch!”
Customer: “We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?”
Nuxeo support: “It looks like you have some network congestion between your client and the servers.”
Customer: “...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...”
![Page 20: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/20.jpg)
SQL VS ELASTICSEARCH
Scalability is simply of another order of magnitude
![Page 21: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/21.jpg)
SCALE OUT
![Page 22: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/22.jpg)
UNIFIED INDEX ON SHARDED REPOSITORY
Tested with 10 PgSQL databases
10 x 100 Million documents => 1 Billion documents
1 elasticsearch cluster
![Page 23: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/23.jpg)
IS THIS MAGIC?
For users
it really looks like magic
For sales guys & solution architects
it is magic: it unleashes a lot of possibilities
performance is just one aspect
For Nuxeo Core Dev team
it was almost magic: some integration work was needed
![Page 24: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/24.jpg)
INTEGRATING ELASTICSEARCH
Inside the nuxeo-elasticsearch Plugin
![Page 25: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/25.jpg)
CHALLENGES TO ADDRESS
Keep index in sync with the repository
No transaction management
Do not lose anything
Without support for update
Mitigate eventually consistent effect
Avoid displaying transient inconsistent state
Handle security filtering
Without join
Without post-filtering
![Page 26: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/26.jpg)
SECURITY FILTERING
Constraints
Filtering must be done at index level : no post filtering
Join is not an option
cannot join with the DB or within Lucene (previously tested without success)
Solution
index the ReadACL as part of the JSON Document
list of groups / users who can read the resource
automatically add a filter clause on ACL
Consequences
Recursive indexing is needed
More pressure on the re-indexing process
as a last resort, the Document security is checked by the repository anyway
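A minimal sketch of the approach described above: the read ACL is stored on each indexed JSON document, and every search is wrapped with a filter on that field. The field name `ecm:acl`, the helper names, and the use of the `bool`/`filter` query syntax are assumptions made for illustration; the actual plugin may build its queries differently.

```python
def build_indexed_document(properties, read_acl, acl_field="ecm:acl"):
    """Return the JSON document to index, with the read ACL denormalized into it."""
    document = dict(properties)
    # Users and groups allowed to read the document, stored as plain terms.
    document[acl_field] = sorted(set(read_acl))
    return document


def with_read_acl_filter(user_query, principals, acl_field="ecm:acl"):
    """Wrap a query so that only documents readable by `principals` can match."""
    return {
        "bool": {
            "must": [user_query],
            # A filter clause does not affect scoring; it only restricts matches.
            "filter": [{"terms": {acl_field: sorted(set(principals))}}],
        }
    }
```

Because the ACL is copied onto each document, changing the permissions of a folder means re-indexing everything below it, which is the recursive indexing cost listed under Consequences.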
![Page 27: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/27.jpg)
SAFE INDEXING FLOW
Do not try to make it Transactional
Collect and de-duplicate Repository Events during Transaction
Wait for commit to be done at the repository level
then call elasticsearch
Do not lose any update
run Indexing Tasks in a distributed Job infrastructure
Jobs should be persisted
Jobs should be retried
Jobs should be monitored
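The flow above can be sketched as follows. It is an illustration under stated assumptions, not the plugin's code: `IndexingCollector`, `run_with_retry`, the event names and the in-memory queue are hypothetical, and a real deployment would use a persisted, monitored job queue rather than a Python list.

```python
class IndexingCollector:
    """Collects repository events during a transaction and de-duplicates them."""

    def __init__(self, enqueue):
        self._enqueue = enqueue  # callable that hands a job to the job queue
        self._pending = {}       # doc_id -> command, one entry per document

    def on_event(self, doc_id, event):
        if event == "deleted":
            self._pending[doc_id] = "delete"
        elif self._pending.get(doc_id) != "delete":
            # created / modified events collapse into a single index command
            self._pending[doc_id] = "index"

    def after_commit(self):
        # Only called once the repository commit has succeeded.
        for doc_id, command in self._pending.items():
            self._enqueue((command, doc_id))
        self._pending.clear()

    def after_rollback(self):
        # Nothing was committed, so nothing must reach the index.
        self._pending.clear()


def run_with_retry(job, send, max_attempts=3):
    """Run one indexing job, retrying on connection errors; returns attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(job)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
```

Nothing is sent to elasticsearch before the commit, so a rollback never leaves phantom entries in the index.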
![Page 28: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/28.jpg)
ASYNC INDEXING FLOW
![Page 29: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/29.jpg)
MITIGATE EVENTUALLY CONSISTENT
In the code :
use case: need to see results from within the transaction
query directly on the repository
leverage ACID and MVCC of SQL repository
full-text search and facets are usually not needed by the code
For the users :
use case: see changes in listings in "real time"
use pseudo-real time indexing
indexing actions triggered by UI threads are flagged
run as afterCompletion listener
refresh elasticsearch index
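The user-facing half of this slide can be sketched as below, assuming a hypothetical `after_completion` hook and a stand-in `RecordingIndex` class in place of a real elasticsearch client. Work flagged as coming from a UI thread is indexed right after the transaction completes and followed by an index refresh; everything else goes to the asynchronous job queue.

```python
class RecordingIndex:
    """Stand-in for an elasticsearch client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def index(self, doc_id):
        self.calls.append(("index", doc_id))

    def refresh(self):
        self.calls.append(("refresh",))


def after_completion(doc_ids, triggered_by_ui, index, job_queue):
    """Hypothetical hook run once the repository transaction has completed."""
    if triggered_by_ui:
        # Pseudo-sync: index now, then refresh so the next listing sees the change.
        for doc_id in doc_ids:
            index.index(doc_id)
        index.refresh()
    else:
        # Background work stays asynchronous and goes through the job queue.
        job_queue.extend(doc_ids)
```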
![Page 30: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/30.jpg)
PSEUDO-SYNC INDEXING FLOW
![Page 31: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/31.jpg)
DOES THIS WORK?
Live for about 18 months now
No missing sync issue
some customers asked for verification tools
but no problem was found
re-index in bulk mode is very fast anyway
No consistency issues
good usage of hybrid query engines
elasticsearch helped address several scaling challenges
but elasticsearch brings us much more than just scalability
![Page 32: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/32.jpg)
BONUS FROM ELASTICSEARCH
More than Raw Speed
![Page 33: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/33.jpg)
LEVERAGE AGGREGATES
Leverage elasticsearch aggregates
integrate with the Query system (PageProvider)
integrate with the Listing / UI model (ContentView)
Makes it easy to build and configure faceted search
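As an illustration of what a faceted search sends to elasticsearch, the sketch below builds a request with one `terms` aggregation per facet and reads the bucket counts back from a response. The helper names and the field `ecm:primaryType` are assumptions; how PageProvider and ContentView generate these requests is not shown in the slides.

```python
def facet_request(query, facets, bucket_count=10):
    """Build a search body with one terms aggregation per facet name -> field."""
    return {
        "query": query,
        "aggs": {
            name: {"terms": {"field": field, "size": bucket_count}}
            for name, field in facets.items()
        },
    }


def facet_counts(response, name):
    """Extract {value: document count} for one facet from a search response."""
    buckets = response["aggregations"][name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

A call such as `facet_request({"match_all": {}}, {"by_type": "ecm:primaryType"})` would give the UI one bucket per document type, each with its count.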
![Page 34: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/34.jpg)
ADVANCED INDEXING
Fine tuning of elasticsearch indexing
multi language support using multiple analyzers and copy_to
compound fields created using groovy scripts
Introduce elasticsearch hints into NXQL
select a specific elasticsearch index / analyzer
leverage elasticsearch operators
do geolocation search
-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for GeoQuery based on geo_hash_cell: location near a point using geohash
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell) */ osm:location IN ('40','-74','5')
leverage what comes for free with elasticsearch
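The multi-language bullet can be illustrated with a mapping in which one source field is copied, via `copy_to`, into one target field per language, each with its own analyzer. This is a sketch under assumptions: the target field names (`fulltext_<language>`) are hypothetical, the `text` type follows current elasticsearch syntax rather than the version used at the time of the talk, and the language names are expected to match built-in analyzers such as `english` or `french`.

```python
def multilingual_mapping(source_field, languages):
    """Build a mapping that copies one field into a per-language analyzed field."""
    targets = [f"fulltext_{lang}" for lang in languages]
    properties = {source_field: {"type": "text", "copy_to": targets}}
    for lang, target in zip(languages, targets):
        # Each target field is analyzed with the analyzer named after the language.
        properties[target] = {"type": "text", "analyzer": lang}
    return {"properties": properties}
```

A query can then target the field whose analyzer matches the language of the user.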
![Page 35: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/35.jpg)
INDEX AUDIT TRAIL WITH ELASTICSEARCH
Use elasticsearch to store & index Audit trail
all events are serialized in JSON and stored inside elasticsearch
Unleash Audit system power
can store a lot of events
can store and query arbitrary JSON structure
![Page 36: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/36.jpg)
ELASTICSEARCH PASS-THROUGH
Expose an HTTP pass-through API on top of Nuxeo integration
Integrate Authentication & Authorization
not all users can access workflow index
Integrate Security Filtering
activate data level security filtering
Expose "virtual index" via http
index + filter
Use elasticsearch API related components on Nuxeo data
Documents + Audit log
With embedded security
Easy real time data analytics on business data
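One way to picture the "virtual index = index + filter" idea is the sketch below: the pass-through resolves the virtual index name, checks that the caller may use it, and injects the security filter before forwarding the request. Everything named here is hypothetical (the `VIRTUAL_INDEXES` table, the index names `nuxeo` and `audit`, the `administrators` group, the `ecm:acl` field); it is not the actual pass-through code.

```python
# Hypothetical configuration: virtual index name -> real index and security rules.
VIRTUAL_INDEXES = {
    "documents": {"index": "nuxeo", "allowed_groups": None, "acl_field": "ecm:acl"},
    "workflow": {"index": "audit", "allowed_groups": {"administrators"}, "acl_field": None},
}


def rewrite_search(virtual_index, body, user, groups):
    """Return (real_index, secured_body) or raise PermissionError."""
    config = VIRTUAL_INDEXES[virtual_index]
    allowed = config["allowed_groups"]
    # Authorization: some virtual indexes are restricted to given groups.
    if allowed is not None and not allowed & set(groups):
        raise PermissionError(f"{user} cannot access {virtual_index}")
    query = body.get("query", {"match_all": {}})
    if config["acl_field"]:
        # Data level security: restrict matches to documents the caller can read.
        principals = sorted({user, *groups})
        query = {
            "bool": {
                "must": [query],
                "filter": [{"terms": {config["acl_field"]: principals}}],
            }
        }
    return config["index"], {**body, "query": query}
```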
![Page 37: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/37.jpg)
DATA ANALYTICS WITH ELASTICSEARCH
Queries on Documents + Audit: flexible reporting on workflows
![Page 38: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/38.jpg)
READ DOCUMENTS FROM ELASTICSEARCH
Full JSON Document is stored in elasticsearch
required to be able to do fast re-indexing
We can retrieve Documents from elasticsearch
execute full search & retrieve without touching the DB
By controlling indexing we can use the elasticsearch index
as a persistent cache on top of the repository
as a staging area for queries
_source
![Page 39: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/39.jpg)
NEXT STEPS
Leveraging Even More elasticsearch
![Page 40: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/40.jpg)
NEXT STEPS
Leverage elasticsearch percolator
push update on the nuxeo-drive clients
notify users about saved search
automatic categorization
Search result highlighting
not sure why it is still not there ...
Plug automatic denormalization
![Page 41: Scaling the Content Repository with Elasticsearch](https://reader033.vdocuments.us/reader033/viewer/2022050806/58ec1a6f1a28ab70538b458b/html5/thumbnails/41.jpg)
ANY QUESTIONS?
Thank You!
https://github.com/nuxeo
http://www.nuxeo.com/careers/