The Alfresco ECM 1 Billion Document Benchmark on AWS and Aurora: Benchmark Details and Scalability
TRANSCRIPT
Alfresco 1 Billion Document Benchmark
Infrastructure, use cases and performance considerations for an Enterprise Grade ECM implementation
Gabriele Columbro, Sr. Product Manager, Core Platform / API
5.1 Disclaimer
The following information is based on a development version of the unreleased Alfresco 5.1. Performance data is provisional and subject to change based on testing of the final Alfresco 5.1 release.
Alfresco reaches the 1B document mark on AWS
• 10 Alfresco 5.1 nodes, 20 Solr 4 nodes in sharding mode, 1 Aurora DB
• Loaded 1B documents at 1,000 docs/sec (86M per day)
• Indexed 1B documents in 5 days (> 2,000 docs/sec)
• No degradation in ingestion or content access upon content growth
• Tested up to 500 concurrent Share users and 200 concurrent CMIS sessions
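The headline throughput figures above are internally consistent; a quick sanity check of the rate arithmetic (the helper names below are mine, not part of the benchmark tooling):

```python
# Sanity-check the headline ingestion numbers from the benchmark slide.

def docs_per_day(docs_per_sec: float) -> float:
    """Sustained daily throughput at a constant per-second rate."""
    return docs_per_sec * 60 * 60 * 24

def days_to_load(total_docs: float, docs_per_sec: float) -> float:
    """Days needed to ingest total_docs at a constant rate."""
    return total_docs / docs_per_day(docs_per_sec)

if __name__ == "__main__":
    # 1,000 docs/sec sustained -> 86.4M per day, matching "86M per day"
    print(round(docs_per_day(1000) / 1e6, 1))   # 86.4
    # 1B documents at that rate -> ~11.6 days, i.e. the quoted "12 days"
    print(round(days_to_load(1e9, 1000), 1))    # 11.6
```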
“We applaud Alfresco’s ability to leverage Amazon Aurora to address business requirements of the modern digital enterprise, and enable more agile and cost-effective content deployments.”
Anurag Gupta, Vice President, Database Services, Amazon Web Services, Inc. – October 6, 2015
Highlights
Press release
ECM Use cases
Systems of record at scale
Enterprise Document Library
Loans & Policies
Claims & Case Processing
Transaction & Logistics Records
Research & Analysis
Real-time Video
Internet of Things
Medical & Personnel Records
Government Records & Archives
Discovery & Litigation
ECM Use cases
Systems of engagement use cases at scale
Document Library
Image Management
File Sync & Share
Search & Retrieval
Business Process
Management
Records Management
Case Management
Media Management
Information Archiving
Accelerate user adoption
Freedom to innovate
SIMPLE
SMART
OPEN
Drive digital transformation
Connect people, content, and processes to accelerate digital transformation
ECM BPM
Content in context
Consumer-like search & usability
Secure & mobile collaboration
Modular & scalable architecture
Effortless Information Governance
SIMPLE
SMART
OPEN
Cloud integration & sync
Highly extensible & open source
ECM BPM
Powerful metadata, rules & relationships
Easy process (app) creation & analysis
Divide et impera
Decomposing the problem of Alfresco Scalability
[Diagram: the Alfresco stack decomposed into layers: Customizations / Applications (Share or bespoke); Alfresco Repository (Content Services) and Alfresco Index Server (Search Services); Database, Storage, Network]
Sizing areas: Collaboration vs. Headless Content Platform

Search
• Collaboration: Search is usually only a small portion of the operation mix (around 10%).
• Headless Content Platform: In most cases, especially for very large repositories, there won't be full-text indexing/search.

Permissions
• Collaboration: Permission control happens at the Alfresco layer. The user authority structure will be complex, with users belonging to many groups on average.
• Headless Content Platform: Most of the time permission control happens elsewhere. Authority structures will generally be fairly simple.

Ingestion
• Collaboration: Ingestion rates are usually not critical; uploads are normally manually driven.
• Headless Content Platform: Ingestion rates are usually very important. Dedicated layers/nodes may be needed.

Repository Size
• Collaboration: Repository sizes are usually small (hundreds of thousands) or intermediate (millions).
• Headless Content Platform: Repository sizes are usually quite big (tens of millions to billions).

Customization
• Collaboration: The level of customization varies but in most cases concentrates at the front end (Share).
• Headless Content Platform: Customizations are usually significant, typically on the repository side. Custom solution code may live outside Alfresco, using CMIS, public APIs, etc.

Architecture
• Collaboration: Architecture options will generally be the standard ones provided by Alfresco (cluster, dedicated index/transformation layers, etc.).
• Headless Content Platform: Architecture options may vary considerably, with more high-scale and high-availability solutions in use: proxies, clustered and un-clustered layers, multi-repository Alfresco options, etc.

Concurrency
• Collaboration: Concurrent users may be many, with average and peak values important to consider.
• Headless Content Platform: Concurrent users will generally be few, but think times will be much smaller than for collaboration.

Interfaces
• Collaboration: Expect mostly the Share interface, but SPP, CIFS, IMAP, WebDAV and other public interfaces (CMIS, mobile) will also be very common.
• Headless Content Platform: Most of the load should concentrate on the public API (CMIS) and custom-developed REST APIs (Web Scripts).

Batch
• Collaboration: Batch operations are mostly human-interaction workflows and the standard Alfresco jobs.
• Headless Content Platform: Batch operations will usually matter considerably, including content ingestion processes (bulk import), custom workflows and scheduled jobs.
ECM Scenarios
ECM is not one-size-fits-all.
Benchmark Results
Introducing the 1B documents benchmark
• Repository layout
– 10k sites; 2 levels deep; 10 folders per level; 1,000 files per folder
– 100 KB average plain-text files with varying content complexity (for indexing purposes)
– Default content model
• Scenarios
– Share interaction (Enterprise Collaboration): first focused on the repository with no search, then with search, including Solr 4 sharding
– CMIS interaction (Headless Content Platform): Transactional Metadata Query (TMDQ) testing
• Fully cloud AWS environment (provisioned by chef-alfresco)
– Alfresco 5.1 + Share 5.1 (development code, unreleased)
– AWS EC2 / Aurora (MySQL-compatible and Alfresco supported)
– Ephemeral storage for indexes / EBS for content storage (spoofed)
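As a sanity check, the repository layout above multiplies out to the target corpus size (the helper name below is mine):

```python
# Check that the nominal benchmark layout yields ~1B files:
# 10k sites, 2 folder levels, 10 folders per level, 1,000 files per folder.

def repo_size(sites: int, folders_per_level: int,
              levels: int, files_per_folder: int) -> int:
    """Total files for a site tree with `levels` nested folder levels."""
    leaf_folders = folders_per_level ** levels  # 10 x 10 = 100 per site
    return sites * leaf_folders * files_per_folder

if __name__ == "__main__":
    print(repo_size(10_000, 10, 2, 1_000))  # 1000000000
```

The actual run overshot slightly (10,804 sites, 1,168,206,000 files), but the nominal layout lands exactly on 1B.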
Cloud stack
1.2B documents execution environment
• UI Test: 20 x m3.2xlarge simulating 500 users (Selenium / Firefox; 1 hour constant load; 10 sec think time)
• Alfresco: 10 x c3.2xlarge running Alfresco Repo and Share, behind an ELB
• Solr: 20 x m3.2xlarge, sharded Solr 4
• Aurora: 1 x db.r3.xlarge
• EBS; ingestion in place
Repository figures: 10,804 sites; 1,168,206 folders; 1,168,206,000 files; 15,475,064 transactions; 3,185 GB DB size
Cloud scale testing
How did we test it?
• Repository loaded using bm-dataload (with the file-spoofing option)
• 1B document benchmark, AKA BM-0004 (Testing Repository Limits), based on bm-share
• Scalability & sizing testing on the Enterprise Collaboration scenario (bm-share) and the Headless Content Platform scenario (bm-cmis)
• https://wiki.alfresco.com/wiki/Benchmark_Testing_with_Alfresco
• https://github.com/derekhulley/alfresco-benchmark
[Diagram: a Benchmark Server (Tomcat 7 hosting the REST API, UI and test services, backed by MongoDB instances for config data and test data) coordinating N Benchmark Drivers (Tomcat 7 with Selenium extras) that drive the target Servers / APIs through a load balancer]
Benchmark Results
Getting to 1B documents
• Ingestion
– With 10 nodes, 1,000 documents/second (3 million per hour, 86M per day, 12 days for the full repo); spoofed content comparable to in-place BFSIT loading
– Load rate consistent even beyond 1B documents
– Throughput grew linearly when adding ingestion nodes (100 docs/sec per node)
– Adding additional loading nodes would likely raise ingestion throughput further, as Aurora was only at 50% CPU
• Indexing
– Index distributed over 20 Alfresco Index Servers, sharding on ACLs (good for a site-based repository), with a dedicated Alfresco tracking instance
– Each shard holds approximately (in excess of) 50M nodes
– Re-indexing completed in about 5 days (each node tracks a subset of the 1B)
– Dynamic sharding auto-configuration (a 5.1 feature); NOTE: requires the Alfresco tracking nodes to be in the cluster
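The ACL sharding scheme above can be illustrated with a toy router (my own sketch, not Alfresco's actual implementation): every node carrying a given ACL lands on the same shard, which suits site-based repositories where each site's content shares a small set of ACLs.

```python
# Toy illustration of ACL-based index sharding: route each ACL (and all
# nodes carrying it) to one of NUM_SHARDS shards. Alfresco's real router
# differs; this only demonstrates the even-distribution idea.

NUM_SHARDS = 20  # shard count used in the benchmark

def shard_for_acl(acl_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route an ACL, and every node carrying it, to a single shard."""
    return acl_id % num_shards

def shard_sizes(acl_ids, nodes_per_acl: int, num_shards: int = NUM_SHARDS):
    """Approximate node count per shard for a collection of ACL ids."""
    sizes = [0] * num_shards
    for acl_id in acl_ids:
        sizes[shard_for_acl(acl_id, num_shards)] += nodes_per_acl
    return sizes

if __name__ == "__main__":
    # 10,000 sites/ACLs with ~100,000 nodes each -> ~50M nodes per shard,
    # matching the "approx 50M nodes" per shard figure in the slide.
    sizes = shard_sizes(range(10_000), 100_000)
    print(min(sizes), max(sizes))  # 50000000 50000000
```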
Benchmark Results
Testing Alfresco on 1B docs
• Repository-only test (500 Share users)
– Sub-second login times and good, linear response for other actions
• Open Library: 4.5s / Page Results: 1s / Navigate to Site: 2.3s
– CPU loads: Database 8-10% / Alfresco (each of 10 nodes) 25-30%
• Shows room for growth up to 1,000 concurrent users
• Repository + Search (100 Share users)
– Metadata and full-text search ~5s (on 1B documents)
– 1.2 searches/sec hitting the 20 shards
• TMDQ queries (database only, no index) via CMIS
– IN_FOLDER (sorted, limited): ~160ms at the CMIS interface
– cmis:name (=, LIKE): ~20ms at the CMIS interface
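The TMDQ figures come from queries the database can answer without touching the index. A hedged sketch of the two query shapes tested follows; the helper names and example object id are hypothetical, and note that CMIS QL has no LIMIT clause, so row limiting happens through the maxItems parameter of the query call:

```python
# Build the two TMDQ-eligible CMIS QL query strings from the benchmark.
# Helper names and example ids are hypothetical illustrations.

def in_folder_query(folder_id: str) -> str:
    """Sorted folder listing; limit rows via the maxItems query parameter."""
    return ("SELECT cmis:objectId, cmis:name FROM cmis:document "
            f"WHERE IN_FOLDER('{folder_id}') ORDER BY cmis:name")

def name_query(pattern: str, exact: bool = False) -> str:
    """Lookup by cmis:name, either exact (=) or wildcard (LIKE)."""
    op = "=" if exact else "LIKE"
    return ("SELECT cmis:objectId FROM cmis:document "
            f"WHERE cmis:name {op} '{pattern}'")

# With Apache Chemistry's cmislib (endpoint and credentials assumed):
#   repo = CmisClient(url, user, password).defaultRepository
#   results = repo.query(in_folder_query(some_folder_object_id))

if __name__ == "__main__":
    print(in_folder_query("abc-123"))
    print(name_query("report-%"))
```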
1B docs tests
Repository performance at 1B docs
500 concurrent Share users, no search
NOTE: Minor repo changes between 5.0.1 and 5.1; performance is comparable
[Chart: average response time (ms) and standard deviation (ms), 0-5,000 ms scale, for the Share actions share.doclib.page, share.doclib.selectFileUpload, share.doclib.uploadFile, share.login, share.nav.dashboard, share.nav.documentLibrary, share.navigateTo.site and share.navigateToShare]
Recommendations
Lessons Learned
• A single Alfresco repository can grow to 1B documents on AWS without notable issues, especially with a scalable DB like AWS Aurora
• As for the index: shard, shard, shard
– Shard to cope with content growth (a single Solr instance is tuned for about 50M docs / 32 GB)
– Shard for performance / SLA (improves search performance on large-scale repositories to hit SLA requirements)
– Shard for operational reasons (improves re-indexing time: 1B docs re-indexed in 5 days with 20 shards)
– NOTE: sharding has a cost in results post-ranking; use it reasonably
• No indications of any size-related bottlenecks with 1.1 billion documents
• DB indexes stayed optimized (no index scans) even at a 3.2 TB Aurora DB
• Low
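The post-ranking cost noted above comes from the distributed query path: in Solr 4, the receiving core fans a query out via the standard `shards` request parameter and merges the per-shard rankings. A minimal sketch of building such a request URL, with hypothetical hostnames and core name:

```python
# Sketch of a distributed Solr 4 query URL: the coordinator core queries
# every shard listed in the `shards` parameter and merges/re-ranks the
# results. Hostnames and the core name below are hypothetical.
from urllib.parse import urlencode

def sharded_query_url(coordinator: str, shard_hosts, core: str, q: str) -> str:
    """Build a /select URL that fans the query out over all shards."""
    shards = ",".join(f"{host}/solr/{core}" for host in shard_hosts)
    params = urlencode({"q": q, "shards": shards, "rows": 10})
    return f"http://{coordinator}/solr/{core}/select?{params}"

if __name__ == "__main__":
    hosts = [f"solr{i:02d}.internal:8080" for i in range(1, 21)]  # 20 shards
    url = sharded_query_url("solr01.internal:8080", hosts,
                            "alfresco", "TEXT:report")
    print(url)
```

Merging 20 ranked lists per query is the overhead the slide warns about, hence "use reasonably".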
Folder Size Matters
Limiting the number of files in a folder is a good best practice
[Charts: average response time (ms) at 1,000 docs/folder vs. 5,000 docs/folder]
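One reason folder size matters is that document-library views fetch children a page at a time, while the server still sorts the whole child list for every page. A small sketch of the skipCount/maxItems paging arithmetic used by CMIS-style getChildren calls, at the two folder sizes charted above (the helper name is mine):

```python
# Paging arithmetic for walking a folder with CMIS-style skipCount/maxItems:
# a 5x larger folder means 5x the page requests, each sorting a 5x list.

def page_offsets(total_children: int, page_size: int = 100):
    """(skipCount, maxItems) pairs a client issues to walk a folder."""
    return [(skip, min(page_size, total_children - skip))
            for skip in range(0, total_children, page_size)]

if __name__ == "__main__":
    print(len(page_offsets(1_000)))  # 10 pages at 1,000 docs/folder
    print(len(page_offsets(5_000)))  # 50 pages at 5,000 docs/folder
```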
Out of scope
1B document benchmark: requires further testing
• The following items were out of scope for the benchmark and will be tested in the future. Take this into account when using this information for sizing.
• Content store I/O
– Files were spoofed, so not on the filesystem (bm-dataload allows storing them)
– What does this mean from a scalability standpoint?
• For ingestion, comparable to an in-place ingestion of content with BFSIT
• For indexing, no difference: Alfresco provides Solr with on-the-fly generated content
• For performance testing, a difference in download, negligible with large files
• Transformation server / subsystem
– All files are plain-text files
– Can be added to testing at a later stage, as it's a separate dimension
– Trying to keep the problem 'testable'
Conclusions
• Alfresco can power Enterprise Grade deployments of several ECM use cases in a fully best-of-breed AWS cloud environment
• The Alfresco Repository can ingest and serve 1B documents without bottlenecks or notable performance issues
• The Alfresco Index Server, as of 5.1, leverages sharding to support large, distributed, high-performance indices
• Using Alfresco in conjunction with AWS Aurora is a powerful combination to reach high scalability without operational complexity
• Alfresco is investing in provisioning technologies like chef-alfresco to ensure a seamless experience for DevOps teams deploying Enterprise Grade architectures in the cloud
• This data is based on an Alfresco 5.1 development build: further testing is ongoing to provide additional data and the final Alfresco 5.1 sizing & scalability guidelines
Alfresco 5.1
Key Alfresco 5.1 scalability items to look forward to
• Alfresco Solr sharding
– On ACLs
– Tested up to 80M documents per shard and 20 shards
• Improved Transactional Metadata Queries
– Boolean, Double and OR constructs
• Easy deployment and scaling in AWS using provisioning technologies like chef-alfresco
• Alfresco support for Amazon Aurora (also available in Alfresco 5.0)
• Updated field collateral
– Scalability Blueprint for Alfresco 5.1
– Sizing Guide for Alfresco 5.1
– AWS reference architecture, implementation guide and CloudFormation template for Alfresco 5.0 and 5.1
Wrap up
Questions?
• Please send feedback to:
– [email protected]
– Twitter: @mindthegabz
• Participate in the Alfresco research process: help us help you. Our products are better with your input and thoughts. Sign up for research at http://bit.ly/alfresco-research-signup
• There are many ways to help:
– Research surveys
– Remote or in-person interviews
– Investigative workflow conversations or online design exercises