OpenStack Summit Tokyo - Know-how of Challenging Deploy/Operation NTT DOCOMO's Mail Cloud...
Copyright © 2015 NTT DATA Corporation
2015/10/27 NTT DATA Corporation
Know-how of Challenging Deploy/Operation NTT DOCOMO's Mail
Cloud System Powered by OpenStack Swift
Abstract
Docomo Mail is a 24/7 cloud mail system accessed by over 20 million people. It stores users' mail archives in an OpenStack Swift cluster with petabyte-scale capacity, deployed by NTT DATA. We have operated this service successfully since Sep 2014 without any downtime. In this session, we present the actual issues and challenges we faced and overcame.
Today’s contents and presenter
○Project Overview
Changes in the Japanese mobile market and an overview of this project
– Project Manager : Sosuke Kakehi
○Migrate process
Process of migrating Swift into the existing docomo mail system
– OpenStack Swift Engineer : Masaaki Nakagawa
○Technical challenges
Swift technical challenges on this project
– OpenStack Engineer : Ryosei Kasai
○Operating session
Large scale swift operation
– OpenStack Swift Engineer : Masaaki Nakagawa
Project Overview
Project Overview
1 NTT Docomo's Cloud Mail System
2 Project Background
3 Customer Requirements
Cloud Mail System
NTT Docomo's Cloud Mail System - System Summary
• Docomo Mail - NTT Docomo’s Cloud Mail Service
• Over 20 million users
• Powered by OpenStack Swift
[Diagram: smartphones and tablet PCs read later (more recent) mail from high-performance storage; archived mail is stored to object storage (OpenStack Swift)]
NTT Docomo's Cloud Mail System - System Scale
• Geographically distributed Swift cluster
• Over 6.4 petabytes of logical capacity
• Hundreds of servers
[Diagram: Site1 to Site4, hosting the proxy nodes and the storage nodes of Region1, Region2, and Region3]
Project Background
Shift from “Feature phone” to “Smart phone”
[Diagram: services on smartphones and tablet PCs now carry documents, text, photos, music, movies, and applications]
E-mail data size increased
Project Background
[Diagram: high-end storage units added one after another, with cost growing at each step]
Extending the high-end storage again and again = ever-growing cost
Customer Requirements
High Availability
Low Cost
High Scalability
OSS(Software Storage) + IA Server
Disaster Recovery
etc
Adopt OpenStack Swift
Migrate session
Overview of migration session
NTT DOCOMO launched the docomo mail service in Oct 2013, and Swift was installed in the docomo mail system in Jan 2015. The migration to Swift was carried out without stopping the user service.
In this section, I would like to introduce the overall docomo mail system and the migration process.
[Timeline]
Oct 2013: docomo mail service in
Jan 2015: Swift service in
May 2014: test users start to use Swift
Oct 2015: general users start test use of Swift
Swift migrate session: System construction overview
[Diagram: the Internet connects to the docomo mail frontend server (proxy of block storage and Swift); behind it sit high-speed block storage (later mail holder, holding user mail) and Swift (archived mail holder: one proxy and three storage nodes, each holding archived user mail)]
Swift migrate session: Mail access flow
[Diagram: an access device connects over the Internet to the docomo mail frontend server (proxy of block storage and Swift), which reaches user mail on block storage and archived user mail on Swift (proxy and storage nodes)]
User mail will be archived and stored to Swift
Swift migrate session: System construction (before Swift installed)
[Diagram: Internet, docomo mail frontend server, and block storage holding both user mail and archived user mail]
Swift migrate session: Migration 1st step – deploy Swift and test
• Deploy Swift
• Trouble test
• Tuning
[Diagram: Swift (proxy and storage nodes) is deployed next to the frontend server and block storage; user mail and archived user mail are still on block storage]
Swift migrate session: Migration 2nd step – copy test users' archived mail
• Copy test users' archived mail to Swift
• General users' mail is not copied
[Diagram: archived user mail now appears on the Swift storage nodes as well as on block storage; user mail stays on block storage]
Swift migrate session: Migration 3rd step – copy general users' archived mail
• Move general users' archived mail to Swift
• Keep the whole mail archive on block storage in case of Swift trouble
[Diagram: archived user mail is held on the Swift storage nodes and is also retained on block storage; user mail stays on block storage]
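The third step keeps the whole archive on block storage while Swift also holds it. The deck does not show how the frontend uses that retained copy, so the following is only a minimal Python sketch of one possible read path: try Swift first and fall back to block storage. `DictStore`, `StoreUnavailable`, and `read_archived_mail` are hypothetical names used for illustration, not part of the docomo mail system or of Swift.

```python
class StoreUnavailable(Exception):
    """Raised by a store that cannot serve requests (hypothetical)."""


class DictStore:
    """In-memory stand-in for Swift or block storage (illustration only)."""

    def __init__(self, data, healthy=True):
        self.data = dict(data)
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise StoreUnavailable(key)
        return self.data[key]  # raises KeyError if the object is missing


def read_archived_mail(key, swift, block):
    """Read from Swift first; fall back to the copy kept on block storage.

    Returns (payload, name of the store that served the read).
    """
    try:
        return swift.get(key), "swift"
    except (StoreUnavailable, KeyError):
        # Assumed behavior: the retained archive on block storage covers
        # both a Swift outage and an object that was not copied yet.
        return block.get(key), "block"
```

Keeping the fallback in one function makes it easy to remove once the retained copy is no longer needed.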
Swift migrate session: Migration 4th step – launch service
[Diagram: the frontend server serves archived user mail from Swift (proxy and storage nodes) and user mail from block storage]
Conclusion of migrate session
• At first, docomo mail had only block storage
• We needed to deploy Swift and migrate to it with no downtime
• To achieve this, we divided the migration into 4 steps
– Deploy
– Copy test users' mail to Swift
– Copy general users' mail to Swift while keeping it on block storage
– System durability check
• We achieved a migration with no service downtime
As I said, we overcame several technical challenges during the migration. In the next session, Mr. Kasai will introduce them.
Technical session
Our Technical Challenges
1 Durability assurance
2 Geographically distributed cluster
3 Quality
Challenge 1: Durability assurance
• Quality requirement in Japan
• This system needs very high quality.
• Everything should be under control
• System design for normal situations
• System design for failure situations
Even on a distributed system
• Analyze every behavior before building the system
Recovery test in a variety of failure patterns
• Variety of failure patterns
(1) The point of failure: Disk, NIC, Process, Node, …
(2) The number of failures: 1, 2, 3, 4, …
(3) The range of failures: 1 node, multiple nodes/zones/regions, …
100s of test cases!!
[Diagram: sample test cases (#001, #101, #201, #301, #501), each showing a proxy and storage nodes; in the larger cases the storage nodes are grouped into Zone1, Zone2, … within Region 1]
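The three failure axes on this slide (point, number, range) can be combined mechanically to produce a test matrix. The following Python sketch shows one way to do that. The concrete axis values, the validity rule, and the case numbering are assumptions for illustration and do not reproduce the deck's actual case list.

```python
from itertools import product

# Axes taken from the slide; the concrete values are illustrative assumptions.
FAILURE_POINTS = ["disk", "nic", "process", "node"]
FAILURE_COUNTS = [1, 2, 3, 4]
FAILURE_RANGES = ["one node", "multiple nodes", "multiple zones", "multiple regions"]


def is_valid(point, count, scope):
    """Assumed rule: a single failure cannot span more than one node."""
    return not (count == 1 and scope != "one node")


def build_test_cases():
    """Return numbered test cases for every valid combination of the axes."""
    combos = (
        c for c in product(FAILURE_POINTS, FAILURE_COUNTS, FAILURE_RANGES)
        if is_valid(*c)
    )
    return [
        {"case": f"#{i:03d}", "point": p, "count": n, "range": r}
        for i, (p, n, r) in enumerate(combos, start=1)
    ]
```

Even these small axes yield dozens of cases, which is consistent with the slide's point that a full matrix quickly reaches hundreds.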
Result of recovery test
• Extreme durability and recoverability of Swift
• Swift rarely loses the data it holds. Only a precisely targeted set of failures or a great disaster can cause data loss.
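The reason only a precisely targeted set of failures loses data can be shown with a small check: an object is gone only when every device holding one of its replicas has failed. The Python sketch below assumes three replicas per partition, a common Swift setting that the deck does not state. The placement table and the function name `lost_partitions` are hypothetical.

```python
def lost_partitions(placement, failed_devices):
    """Return partitions whose replicas all sit on failed devices.

    placement maps a partition id to the devices holding its replicas.
    """
    failed = set(failed_devices)
    return sorted(p for p, devices in placement.items() if set(devices) <= failed)


# Hypothetical placement: three replicas per partition, spread over devices.
PLACEMENT = {
    0: ["d1", "d4", "d7"],
    1: ["d2", "d5", "d8"],
    2: ["d3", "d6", "d9"],
}
```

With this placement, losing `d1` and `d4` leaves every partition with a surviving replica, while losing exactly `d1`, `d4`, and `d7` loses partition 0.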
Challenge 2: Geographically distributed cluster
• Geographically distributed Swift cluster to realize disaster recovery
• Important points to evaluate global distribution
1. Client request
2. Durability
[Diagram: Site 1 (proxy) and Sites 2, 3, and 4 (storage) connected by a private network, with 300 km or more between sites]
Pseudo-global cluster
• Pseudo-global cluster with simulated network latency
• Proxy and 3 Storage regions placed in different locations
• 10~200msec latency between locations simulated by tc
• TL msec latency for one way, 2*TL msec latency for round trip
[Diagram: proxy and storage regions 1, 2, and 3, with 10~200 msec latency on every link between them; a request from the client through the proxy to storage region 1 sees TL msec in each direction]
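The slide says the latency was simulated with tc but does not show the commands. A common way to add a fixed delay with tc is the netem queueing discipline, so the Python sketch below builds such command lines as strings without executing them. The use of netem, the device name, and the helper names are assumptions.

```python
def netem_commands(device, one_way_ms):
    """Build tc commands that add a fixed egress delay on one interface.

    Assumes the netem qdisc; the deck only states that tc was used.
    """
    if one_way_ms <= 0:
        raise ValueError("delay must be positive")
    return [
        f"tc qdisc add dev {device} root netem delay {one_way_ms}ms",
        f"tc qdisc show dev {device}",
    ]


def round_trip_ms(one_way_ms):
    """Round-trip latency when each direction is delayed by one_way_ms (TL)."""
    return 2 * one_way_ms
```

Applying the delay on the egress side of both hosts gives TL msec one way and 2*TL msec for the round trip, which matches the slide's definition.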
2 points of Pseudo-global cluster testing
1. Client request
• Object PUT/GET/DELETE from client
• Error rate
• Turnaround time for 1 request
• Throughput
• Latency between proxy and storage
2. Durability
• Auto recovery by object-replicator
• Error rate
• Turnaround time of 1 sync process
• Throughput
• Latency between storages
[Diagram: for test 1, a client sends PUT and GET requests through the proxy to storage regions 1, 2, and 3; for test 2, storage regions 1, 2, and 3 sync with each other]
Test1: Client request
Object PUT/GET/DELETE from client
• No error caused by latency
• Degradation of turnaround time
• No throughput degradation for concurrent requests
[Charts: turnaround time against latency (PUT/GET, DELETE); throughput against latency and concurrency, with the limitation of network bandwidth marked]
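Test 1 reports that turnaround time degrades with latency while throughput under concurrent requests does not. A simplified model makes that plausible: each request waits for its network round trips, but with enough requests in flight the total rate is capped by bandwidth rather than by latency. The Python sketch below is such a model under stated assumptions. The formulas and parameter names are illustrative and are not measurements from the deck.

```python
def turnaround_s(base_s, one_way_latency_s, round_trips=1):
    """Per-request turnaround: local service time plus network round trips."""
    return base_s + 2 * one_way_latency_s * round_trips


def throughput_rps(concurrency, turnaround, bandwidth_cap_rps):
    """Requests per second with `concurrency` requests in flight.

    The rate grows with concurrency until the assumed bandwidth cap is hit.
    """
    return min(concurrency / turnaround, bandwidth_cap_rps)
```

In this model a single request always pays the full latency, while a sufficiently concurrent workload reaches the same bandwidth cap at low and at high latency.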
Test2: Durability
Auto recovery by object-replicator
• No error caused by latency
• Performance degradation of a single process
• No throughput degradation for concurrent processes
[Diagram and chart: failure followed by recovery; throughput against latency and concurrency, with the limitation of network bandwidth marked]
Challenge 3: Quality
1. Software Quality
• Do all processes work well?
• Account / Container / Object
• server / replicator / updater / reaper
2. System Quality
• Is our system working well?
• All nodes
• All APIs
Software quality
Source Code Analysis and Customize
• Official patch (below)
• Original patch
Strict test all processes
Our official patch
1 Add process name checking into swift-init
2 Prevent redundant commenting by drive-audit
3 Remove invalid connection checking in db_replicator
4 Add timestamp checking in AccountBroker.is_status_deleted
5 Fix error log of proxy-server when cache middleware is disabled
and more …
System quality
• Automated testing tools for
1. APIs : All Swift APIs, including error cases
2. Nodes : All Swift nodes
• Extended Tempest and checking tool
[Diagram: Tempest tests all APIs through the proxy servers; the checking tool tests all nodes, including the storage servers]
Our solutions
Challenge: Solutions
1 Durability assurance: Recovery test in variety of failure pattern
2 Geographically distributed cluster: Performance test of frontend/backend with pseudo-global swift cluster
3 Quality: Source Code Analysis and Customize; Automated testing
Operating session
Overview of operating session
The operation scheme of Docomo mail is highly confidential.
Instead, we would like to introduce the operation of NTT DATA's Swift solution.
The Docomo mail system uses a customized version of NTT DATA's Swift solution.
Operating session: Large scale system makes operation costly
[Diagram: large-scale Swift and its operation tasks: scale out, management, repair, tuning]
Operating session: Reduce the amount of operating work
Parallel access (pssh / pscp)
Automatic deploy (kickstart)
Tuning (svn / puppet)
Master repository
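Tools such as pssh save work by running the same command on many nodes at once. The Python sketch below illustrates the same idea with a thread pool and the system ssh client. It is not how pssh is implemented, and `run_ssh` and `run_everywhere` are hypothetical helper names. The runner is injectable so the fan-out logic can be tried without any real hosts.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def run_ssh(host, command):
    """Run one command on one host via the ssh client; return (exit code, stdout)."""
    done = subprocess.run(["ssh", host, command], capture_output=True, text=True)
    return done.returncode, done.stdout


def run_everywhere(hosts, command, runner=run_ssh, max_workers=32):
    """Run `command` on every host in parallel and map host -> result.

    `hosts` must be a list, since it is iterated twice.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, results))
```

Collecting results per host makes it easy to spot the few nodes that behaved differently from the rest.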
Operating session: Reduce operation frequency
[Diagram: events arranged by their effect on service: disk failure, node down, server process down, backend process down (e.g. auditor process)]
Operating session: Stop monitoring low-priority items
[Diagram: low-priority monitoring alerts are replaced by periodic performance checks]
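Splitting events into immediate alerts and periodic checks can be expressed as a small routing table. The Python sketch below uses the event names from the operating slides, but which events count as high priority is an assumption made here for illustration, since the deck does not publish the real classification.

```python
# Assumed mapping for illustration: the slides name these events but do not
# state which of them raise an immediate alert.
ALERT_EVENTS = {"node down", "server process down"}
PERIODIC_EVENTS = {"disk failure", "backend process down"}


def route_event(event):
    """Return 'alert' for events assumed to affect service, else 'periodic-check'."""
    name = event.strip().lower()
    if name in ALERT_EVENTS:
        return "alert"
    if name in PERIODIC_EVENTS:
        return "periodic-check"
    # Unknown events are treated as urgent to stay on the safe side.
    return "alert"
```

Defaulting unknown events to an alert keeps a new failure type from being silently deferred.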
Conclusion of operating session
• Swift consists of many nodes
• Operating Swift tends to be costly
• NTT DATA has know-how to reduce Swift operation cost
– Using parallelized operation tools
– Customizing monitoring by priority
– Changing monitoring items to periodic checks
Conclusion of this presentation
We introduced the usage, challenges, and operation of OpenStack Swift in the docomo mail service system
• System migration with no service downtime
• Three technical achievements
• Reduced operating cost
Docomo mail has been in service with no downtime.
If you have any questions, please come to the NTT booth.
○Attention: All company names, product names, and service names mentioned are trademarks or registered trademarks of the respective companies.