Project Hecatonchire - The Lego Cloud: Status, Vision, Roadmap 08/2013

Project Hecatonchire Dr. Benoit Hudzia, TIP HANA Cloud Computing, Systems Engineering July 2013

Upload: benoit-hudzia

Post on 27-May-2015

Category: Technology

DESCRIPTION

Hecatonchire aims to address these gaps by developing key enabling technologies that allow virtualization to go beyond the boundaries of the underlying hardware, breaking the barrier of using a single physical server for a virtual machine. With Hecatonchire, compute, memory, and I/O resources will be provisioned across multiple hosts and can be consumed in dynamically changing quantities instead of in fixed bundles. This will effectively enable a fluid transformation of cloud infrastructures from commodity-sized physical nodes to very large virtual machines, to meet the rapidly growing demand for cloud-based services. This slide deck provides the vision, status, and roadmap update of the Hecatonchire project.

Website: http://hecatonchire.com/
Git: https://github.com/hecatonchire

TRANSCRIPT

Page 1: Project Hecatonchire - The Lego Cloud : Status, Vision, Roadmap 08/2013

Project Hecatonchire

Dr. Benoit Hudzia, TIP HANA Cloud Computing, Systems Engineering, July 2013

Page 2

The Need: Next Generation Datacentre

Page 3

© 2013 SAP AG. All rights reserved. Public

Advantages of Current Datacenter Designs

• Cheap, off-the-shelf, commodity parts
• No need for custom servers or networking kit
• (Sort of) easy to scale horizontally

• Runs standard software
• No need for “clusters” or “grid” OSs

• Ideal for VMs
• Highly redundant
• Homogeneous

Page 4

Lots of Problems

• Datacenters mix customers and applications
• Heterogeneous, unpredictable workload patterns
• Competition over resources
• How to achieve high reliability?
• Inefficient resource consumption model

• Heat and power
• Roughly 30 billion watts drawn worldwide
• May cost more than the machines

Page 5

Achieving the Right-sized Servers

• Eliminates unnecessary components
• Increased power efficiency
• Optimized performance by workload

Using high-volume components to build high-value systems

Page 6

The Lego Cloud: Original Concept

Page 7

Original Idea: Lego Cloud

Resource aggregation idea:
• Virtual machine compute and memory span multiple physical nodes
• Nodes can be specialized for specific functionality:
  – Memory servers
  – Compute servers
  – I/O servers
    o Storage
    o Accelerator (GPU)
    o Etc.

Hecatonchire value proposition:
• Optimal price/performance by using commodity hardware
• Seamless deployment within existing cloud
• Memory / CPU / I/O resource pooling
• Just-in-time resource consumption
• Heterogeneous workload and hardware
• Simplified Reliability, Availability and Serviceability model

[Figure: guest VMs (App, OS, virtual H/W) running across Server #1, Server #2, ... Server #n. Each server contributes CPUs, memory, and I/O, and the servers are connected by fast RDMA communication.]

Page 8

From Rack Scale to Cloud Scale

[Figure: pools of CPUs, memory, and I/O joined by fast RDMA communication, shown at rack scale and at datacentre scale. Labels: Auto Scaling VM(s); Efficient Resources Pooling.]

Page 9

Aggregating Resources with Fast Networking

Author: Chaim Bendalac

Technology               Bandwidth   Latency (4 kB)
PCIe SSD (fusionio)      15 Gbps     ~50 µsec
PCIe Fabric              64 Gbps     ~0.8 µsec
Infiniband QDR           40 Gbps     4 µsec
Infiniband FDR           54 Gbps     1 µsec
Iwarp 10GbE              10 Gbps     6-7 µsec
Iwarp 40 GbE             40 Gbps     1-2 µsec
Intel Silicon Photonic   100 Gbps    > 1 µsec
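As a rough sanity check on the figures above, the time needed just to put one 4 kB page on the wire can be estimated from the listed bandwidth. The sketch below is illustrative arithmetic only: it assumes "4 kB" means 4096 bytes, ignores protocol and software overhead, and the function name is ours rather than anything from the project.

```python
PAGE_BYTES = 4096  # assumption: "4 kB" on the slide means a 4096-byte page

# Bandwidth figures copied from the slide, in Gbps.
BANDWIDTH_GBPS = {
    "PCIe SSD (fusionio)": 15,
    "PCIe Fabric": 64,
    "Infiniband QDR": 40,
    "Infiniband FDR": 54,
    "Iwarp 10GbE": 10,
    "Iwarp 40 GbE": 40,
    "Intel Silicon Photonic": 100,
}


def serialization_time_us(bandwidth_gbps, nbytes=PAGE_BYTES):
    """Microseconds needed to push nbytes onto a link of the given bandwidth."""
    bits = nbytes * 8
    return bits / (bandwidth_gbps * 1e9) * 1e6


for name, gbps in BANDWIDTH_GBPS.items():
    print(f"{name:24s} {serialization_time_us(gbps):6.2f} us per 4 kB page")
```

On 10 GbE the wire time alone is about 3.3 µs, roughly half of the 6-7 µs listed for Iwarp 10GbE, so on that link bandwidth rather than protocol latency accounts for much of the cost.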

Page 10

The Memory Cloud: Focus On Memory Aggregation

Page 11

How do we offer better Memory TCO?

• Memory in the nodes of current clusters is overscaled in order to fit the requirements of “any” application
• It remains unused most of the time

• How can we unleash your memory-constrained application by using the memory in the rest of the nodes?

Memory that grows with your business, not before.

Page 12

Decouple memory aggregation from core aggregation

Many shared-memory parallel applications do not scale beyond a few tens of cores. However, they may benefit from large amounts of memory:

• In-memory databases
• Data mining
• VMs
• Scientific applications
• Etc.

Eliminate Physical Limitation of Cloud / DC

Page 13

The Idea: Turning memory into a distributed memory service

Breaks memory from the bounds of the physical box

Transparent deployment with performance at scale and Reliability

Page 14

Architecture: High Level Version

Page 15

High Level Architecture

• Lightweight: minimal set of hooks in the MMU (5)
• Module with 3 core components
• 3 existing consumption models
  – Native
  – User space library
  – KVM
• 4th model under development: Hgroup
  – Similar to cgroup; allows barebone memory extension without modification

[Figure: inside the Linux kernel, the module sits next to the kernel MMU and consists of the Coherency Engine, the RDMA Engine (on top of the RDMA drivers), and Management/Tracing. Apps reach it through the User Space Library, a char device, or hgroup.]

Page 16

Extending Virtual Machine Memory

• Instead of using swap for freeing up the RAM, push pages to remote hosts

• If a page is needed again, request it back
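The two bullets above can be sketched as a toy model. This is a minimal sketch and not the Hecatonchire implementation: the class names, the LRU eviction policy, and the in-process dictionary standing in for the RDMA transfer are all assumptions made for illustration.

```python
from collections import OrderedDict


class RemoteHost:
    """Stand-in for a remote memory host; a dict replaces the RDMA transfer."""

    def __init__(self):
        self.pages = {}

    def push(self, page_no, data):
        self.pages[page_no] = data

    def pull(self, page_no):
        return self.pages.pop(page_no)


class LocalMemory:
    """Keeps at most `capacity` pages locally (assumed LRU eviction policy)."""

    def __init__(self, capacity, remote):
        self.capacity = capacity
        self.remote = remote
        self.resident = OrderedDict()  # page_no -> data, least recently used first
        self.remote_faults = 0

    def _make_room(self):
        while len(self.resident) >= self.capacity:
            victim, data = self.resident.popitem(last=False)
            self.remote.push(victim, data)  # push out instead of swapping to disk

    def write(self, page_no, data):
        if page_no in self.resident:
            self.resident.pop(page_no)
        else:
            if page_no in self.remote.pages:
                self.remote.pull(page_no)  # remote copy is superseded by this write
            self._make_room()
        self.resident[page_no] = data

    def read(self, page_no):
        if page_no in self.resident:
            self.resident.move_to_end(page_no)
            return self.resident[page_no]
        data = self.remote.pull(page_no)  # page needed again: request it back
        self.remote_faults += 1
        self._make_room()
        self.resident[page_no] = data
        return data


remote = RemoteHost()
mem = LocalMemory(capacity=2, remote=remote)
for n in range(4):
    mem.write(n, f"page-{n}")
print(sorted(remote.pages))   # pages pushed out to the remote host
print(mem.read(0))            # triggers one remote fault
```

Reading a page that was never written raises `KeyError` in this sketch; a real system would instead map a fresh zeroed page.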

Page 17

How does it work (Simplified Version)

[Figure: page-fault flow between Physical Node A and Physical Node B across the network fabric. Diagram labels, in approximate flow order:
Node A: Virtual Address; MMU (+ TLB); Miss; Page Table Entry holding a Remote PTE (Custom Swap Entry); Coherency Engine; RDMA Engine; Page request.
Node B: RDMA Engine; Coherency Engine; Extract Page; Invalidate PTE; Invalidate MMU; Prepare Page for RDMA transfer; Page Response.
Node A: PTE write; Update MMU; Physical Address.]
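A minimal sketch of the flow in the diagram above, under heavy simplification. The `Node` class, its method names, and the tuple encoding of a page table entry are hypothetical; a direct method call stands in for the page request and page response that the real system sends over RDMA.

```python
class Node:
    """Toy physical node with a page table mapping virtual pages to PTE states."""

    def __init__(self, name):
        self.name = name
        # virtual page -> ("present", data) or ("remote", owner_node)
        self.page_table = {}

    def map_local(self, vpage, data):
        self.page_table[vpage] = ("present", data)

    def map_remote(self, vpage, owner):
        # Plays the role of the "custom swap entry": the PTE records who holds the page.
        self.page_table[vpage] = ("remote", owner)

    def serve_page_request(self, vpage, requester):
        state, data = self.page_table[vpage]
        if state != "present":
            raise LookupError(f"{self.name} does not hold page {vpage}")
        # Extract the page and invalidate the local PTE before handing it over.
        self.page_table[vpage] = ("remote", requester)
        return data  # page response

    def access(self, vpage):
        state, payload = self.page_table[vpage]
        if state == "present":
            return payload  # normal translation, no fault
        # Miss on a remote PTE: request the page from its current owner.
        data = payload.serve_page_request(vpage, requester=self)
        self.page_table[vpage] = ("present", data)  # PTE write / MMU update
        return data


node_a, node_b = Node("A"), Node("B")
node_b.map_local(7, b"hello")
node_a.map_remote(7, node_b)
print(node_a.access(7))           # page moves from B to A
print(node_b.page_table[7][0])    # B now holds only a remote entry
```

The sketch moves the single copy of a page between nodes, which corresponds to the simplest ownership-transfer case; read sharing and prefetching are left out.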

Page 18

Hecatonchire Memory Structure

• 2^64 groups (DSM-VM, e.g. meta-VM / process), each owning a single virtual memory address space
• 2^64 sub-groups (SVM, physical process)
• 2^64 memory regions (MR) per group, which can partition the 64-bit address space
• Each sub-group is associated with a single process and can own 0 or more MRs
• A virtual memory address space can be backed by more than one memory region (RRAIM)
• Each memory region is owned by one physical node (sub-group)
• Under development:
  – MR split / merge
  – MR relocation

[Figure: Heca hierarchy. Each group has one virtual memory address space and contains sub-groups; each sub-group owns zero or more memory regions (MR).]
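The group / sub-group / memory-region hierarchy described above could be modelled roughly as follows. This is a sketch under assumptions: the class and field names are illustrative and do not come from the Hecatonchire source, and overlapping regions are used here as one possible way to express an address range backed by more than one region.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    start: int      # first virtual address covered
    length: int     # size in bytes
    owner_id: int   # id of the sub-group that owns this region

    def contains(self, addr):
        return self.start <= addr < self.start + self.length


@dataclass
class SubGroup:
    sub_id: int
    node: str       # physical node running the associated process


@dataclass
class Group:
    """One group owns a single virtual address space, split into memory regions."""
    group_id: int
    sub_groups: dict = field(default_factory=dict)
    regions: list = field(default_factory=list)

    def add_sub_group(self, sub_id, node):
        self.sub_groups[sub_id] = SubGroup(sub_id, node)

    def add_region(self, start, length, owner_id):
        if owner_id not in self.sub_groups:
            raise KeyError(f"unknown sub-group {owner_id}")
        self.regions.append(MemoryRegion(start, length, owner_id))

    def owners_of(self, addr):
        """Nodes backing addr; more than one entry when regions overlap."""
        return [self.sub_groups[r.owner_id].node
                for r in self.regions if r.contains(addr)]


group = Group(group_id=1)
group.add_sub_group(0, "node-a")
group.add_sub_group(1, "node-b")
group.add_region(0x0000, 0x1000, owner_id=0)
group.add_region(0x1000, 0x1000, owner_id=1)
print(group.owners_of(0x1800))
```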

Page 19

Coherency Engine

• Different coherency models
  – Shared nothing
  – Write invalidate
  – Some variants
• Custom distributed index
  – Worst case: O(log n) hops
  – On a 16-node cluster, random R/W on each node: 97% < 2 hops, 99% < 3 hops
• Smart prefetching

[Figure: four nodes, each with a Coherency Engine, Index, Page Table/MMU, and RAM, prefetching pages from one another.]
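To illustrate the write-invalidate model named above, here is a minimal sketch that tracks which nodes hold copies of a page. It is a simplification on purpose: it uses a single central directory rather than the custom distributed index the slide describes, and the class and method names are hypothetical.

```python
class WriteInvalidateDirectory:
    """Tracks page copies: many readers, or exactly one writer."""

    def __init__(self):
        self.readers = {}        # page -> set of nodes holding a read-only copy
        self.writer = {}         # page -> node holding the writable copy
        self.invalidations = 0   # invalidation messages that would be sent

    def read(self, node, page):
        previous_writer = self.writer.pop(page, None)
        holders = self.readers.setdefault(page, set())
        if previous_writer is not None:
            holders.add(previous_writer)  # writer is downgraded to a reader
        holders.add(node)

    def write(self, node, page):
        holders = self.readers.pop(page, set())
        previous_writer = self.writer.get(page)
        if previous_writer is not None and previous_writer != node:
            holders.add(previous_writer)
        holders.discard(node)
        self.invalidations += len(holders)  # every other copy is invalidated
        self.writer[page] = node


directory = WriteInvalidateDirectory()
directory.read("A", 1)
directory.read("B", 1)
directory.write("C", 1)          # invalidates the copies on A and B
print(directory.invalidations)
```

For scale, if the O(log n) bound is read with base 2, a 16-node cluster has a worst case of 4 hops, which is consistent with the slide's measurement that almost all lookups finish in fewer than 3.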

Page 20

RDMA Engine

• In-kernel implementation
• Optimised memory use
• Fabric agnostic (as long as we have RDMA verbs)
• Tested on: Iwarp (Chelsio), IB (ConnectX2, QDR), SoftIwarp, RoCE, SoftRoCE
• Multiplex groups/apps
• Flow control / multi-queue support
• Under development:
  – Multi-NIC support for a single group
  – NIC multiplexing (active/active failover)

[Figure: three nodes, each with RAM and one RDMA Engine shared by several apps.]

Page 21

Raw Performance: How fast can we move a memory page?

Page 22

Hard Page Fault Resolution Performance (10 GB sequential R/W scan)

Fabric               Average resolution time, 4 KB page (μs)   IOPS, 4 KB page (with prefetch)
SoftIwarp (10 GbE)   330                                       50k/s
Iwarp (10GbE)        28                                        250k/s
Infiniband (QDR)     16                                        650k/s
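The two columns above can be related with some simple arithmetic. The sketch below assumes that the IOPS column counts 4 KB pages of 4096 bytes each and that the resolution time is for a single outstanding fault; both are our reading of the slide, and the function names are illustrative.

```python
PAGE_BYTES = 4096  # assumption: a "4 KB" page is 4096 bytes

# (average resolution time in microseconds, pages per second with prefetch)
RESULTS = {
    "SoftIwarp (10 GbE)": (330, 50_000),
    "Iwarp (10GbE)": (28, 250_000),
    "Infiniband (QDR)": (16, 650_000),
}


def serial_faults_per_second(resolution_us):
    """Faults per second if each fault had to finish before the next one started."""
    return 1_000_000 / resolution_us


def throughput_gbps(pages_per_second, page_bytes=PAGE_BYTES):
    """Payload bandwidth implied by a page rate."""
    return pages_per_second * page_bytes * 8 / 1e9


for fabric, (resolution_us, pages_per_second) in RESULTS.items():
    print(f"{fabric:20s} "
          f"serial: {serial_faults_per_second(resolution_us):9.0f} faults/s  "
          f"with prefetch: {throughput_gbps(pages_per_second):6.2f} Gbps")
```

Under these assumptions, 250k pages/s on Iwarp is about 8.2 Gbps of payload on a 10 GbE link, and the prefetch page rates are several times what strictly serial fault handling would allow (62,500 faults/s at 16 μs on Infiniband), which suggests that many page requests are in flight at once.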

Page 23

Average Compounded Page Fault Resolution Time (R/W With Prefetch)

[Chart: resolution time in microseconds (axis 1000 to 6000) versus thread count (1 to 8 threads). Series: IW 10GE Sequential, IB 40 Gbps Sequential, IW 10GE Binary split, IB 40Gbps Binary split, IW 10GE Random Walk, IB Random Walk, Avg IW, Avg IB.]

Page 24

Use Case: In Memory Database
Scaling Out HANA memory

Page 25

Memory Scale out for SAP HANA – Benchmark

Hardware:
• Server with Intel Xeon Westmere
  – 4 socket
  – 10 core
  – 1 TB RAM
• Fabric:
  – Infiniband QDR 40 Gbps switch + Mellanox ConnectX2
  – 10 GbE Ethernet switch + Chelsio T422 NIC

Virtual Machine:
• Large size: 1 TB RAM, 40 vCPU
• Hypervisor: KVM
• Note: 5-7% overhead due to virtualization vs barebone

• Application: SAP HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
  – Data size: ~2.5 TB uncompressed => 300 GB compressed
  – 18 different queries
  – 15 iterations of each query set

Page 26

Use Case: Scaling out Hana Virtual Memory

[Figure: a VM running on a Heca-enabled kernel is connected over the RDMA fabric to four memory processes (empty VMs), each on its own Heca-enabled kernel. Labels: "See 4 TB RAM"; "Constraint to 1 TB Physical RAM (scratch pad)"; "1 TB Virtual Memory allocated".]

We never consume more than 4 TB of physical RAM at any time.

Page 27

Scaling Out Hana (half of memory remote)

[Chart: query set completion time in seconds (axis 0 to 500) versus number of users (1, 20, 40, 60). Series: Baseline, Half-Half.]

Nb users   HECA overhead per query set
1          ~3%
20         ~3%
40         ~3%
60         ~3%

Virtual Machine:
• 1 TB RAM, 40 vCPU
• Application: HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
• ~300 GB data set, compressed

Hardware:
• Intel Xeon Westmere, 4 socket, 10 core, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX2

Page 28

Per Query Overhead

[Chart: overhead per query (axis -50% to 200%) for queries 1 to 18. Series: 1 User, 20 Users, 40 Users.]

Page 29

Scaling out HANA

Virtual Machine:
• 1 TB RAM, 40 vCPU
• Application: HANA (in-memory database)
• Workload: OLAP (TPC-H variant), 80 users
• ~300 GB data set, compressed

Memory ratio   HECA overhead per query set (80 users)
1:2            4%
1:3            5.6%
2:1:1          0.9%
1:1:1          2%

Hardware:
• Intel Xeon Westmere, 4 socket, 10 core, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX2

Page 30

High Availability of Memory Node (RRAIM) running HANA

Memory ratio (Medium 128 GB, SAP-H benchmark, 80 users)   RRAIM overhead vs Heca
1:2                                                       -0.2%
1:3                                                       -0.1%

[Figure: pairs of RAM nodes linked by HA mirroring.]

• Memory region backed by two remote nodes.

• Remote page faults and swap outs initiated simultaneously to all relevant nodes.

• No immediate effect on computation node upon failure of node.

• When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
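The mirroring behaviour in these bullets can be sketched as follows. This is an illustrative model only: the class names are hypothetical, writes go to the replicas one after another rather than simultaneously, and a replacement node is synchronized by copying from the surviving replica, which simplifies the synchronization with both the computation node and the mirror node that the slide describes.

```python
class MemoryNode:
    """Toy remote memory node."""

    def __init__(self, name):
        self.name = name
        self.pages = {}
        self.alive = True


class MirroredRegion:
    """Memory region backed by two remote nodes."""

    def __init__(self, primary, mirror):
        self.replicas = [primary, mirror]

    def _live(self):
        return [node for node in self.replicas if node.alive]

    def swap_out(self, page_no, data):
        live = self._live()
        if not live:
            raise RuntimeError("no live replica left")
        for node in live:          # every live replica receives the page
            node.pages[page_no] = data

    def fault_in(self, page_no):
        for node in self._live():  # any live replica can answer the fault
            if page_no in node.pages:
                return node.pages[page_no]
        raise KeyError(page_no)

    def replace_failed(self, new_node):
        live = self._live()
        if not live:
            raise RuntimeError("no live replica to synchronize from")
        if len(live) == len(self.replicas):
            return                 # nothing has failed
        survivor = live[0]
        new_node.pages = dict(survivor.pages)  # synchronize the newcomer
        self.replicas = [survivor, new_node]


node_a, node_b = MemoryNode("a"), MemoryNode("b")
region = MirroredRegion(node_a, node_b)
region.swap_out(1, "payload")
node_a.alive = False               # one memory node fails
print(region.fault_in(1))          # still served by the mirror
region.replace_failed(MemoryNode("c"))
```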

Page 31

Enterprise Class Features

• Online RAM deduplication (via KSM)
  [Figure: memory before dedup and after dedup.]
• Automatic memory tiering
  [Figure: tiers spanning the local node and the remote node: Local Memory, Remote Memory, Compressed Memory, SSD/NVRAM, HDD.]

Page 32

Next: Work in progress and possible future directions

Page 33

Transitioning to a Memory Cloud: Transparent Cloud Integration

[Figure: many physical nodes hosting a variety of VMs: Compute VM (memory demander), Memory VM (memory sponsor), and Combination VM (memory sponsor and demander). Apps draw memory from the Memory Cloud, which is managed by Memory Cloud Management Services (OpenStack) through Heca-NOVA. PoC Q4 2013.]

• Automatically reclaim underutilized or abandoned memory resources

• Automatically and intelligently redeploy memory workloads across the infrastructure

Page 34

Hecatonchire Cloud

• Hierarchical, coordinated, global system can set and manage power, budgets, respond to faults, support enclave components that leverage machine learning

• Resource management is hierarchical, and managers are stackable

• Resource managers are integrated

• Resource managers are customizable and adaptable

• Sharing is avoided whenever possible

• Strict enforcement is costly

Page 35

BareBone Memory Scale Out and Fine-grained Memory Sharing

BareBone Memory Scale Out (PoC Q4 2013)

• No need for virtualization and/or hypervisor
• Similar to Linux control groups
• Allows barebone memory scale out for HANA or any other application

Fine-grained Distributed Shared Memory for user space (PoC Q4 2013)

• Memory coherence model
  – Shared nothing
  – Write invalidate
  – Read-write protection
• POSIX-style shared memory

Page 36

Hierarchical Memory

• In the future, a significant portion of memory will be non-volatile
  – Helps reduce power
  – Helps with resilience
  – Helps with cost
  – Aim: make the use of storage-class memory transparent and/or easier
• Transparent integration with storage-class memory:
  – Local NVRAM
  – Remote NVRAM
  – Hybrid DRAM/NVRAM
  – Interface directly with NVMe
  – Persistence / HA

Page 37

CPU Aggregation

• Leverage ACPI virtualization feature from Intel/AMD
• “Titanomachie”: smart vCPU and data scheduling
  – Mixing prefetch, vCPU migration/placement, data/computation reordering
• Provide variable granularity support
  – From cache line
  – To huge page

Page 38

Incremental effort

• Prefetch

• Support New fabric

• Dirty Bit tracking and HW feature from next gen arch of AMD/ Intel

• MRIOV support

Page 39

Thank You!

Contact information:

Dr. Benoit Hudzia [email protected]

Hecatonchire Project:
WWW: http://www.hecatonchire.com
Github: https://github.com/hecatonchire/

Page 40

Backup Slides

Page 41

Raw Bandwidth Usage

HW: 4-core i5-2500 CPU @ 3.30 GHz; SoftIwarp 10GbE; Iwarp Chelsio T422 10GbE; IB ConnectX2 QDR 40 Gbps

[Chart: total Gbit/sec (axis 0 to 25) versus thread count (1 to 7 threads), in three panels: sequential walk, bin split walk, and random walk over 1 GB of shared RAM. Series in each panel: SIW, IW, IB.]

Chart annotations:
• Maxing out bandwidth
• Not enough cores to saturate (?)
• No degradation under high load
• Software RDMA has significant overhead

Page 42

Enabling Live Migration of HANA DB (Small Instance)

                                     Baseline   Pre-Copy (Standard)                              Post-Copy (Heca)
Downtime                             N/A        7.47 s                                           675 ms
Performance degradation (80 users)   0%         Benchmark failed (HANA crash, VM unresponsive)   5%

Page 43

Redundant Array of Inexpensive RAM: RRAIM

1. Memory region backed by two remote nodes. Remote page faults and swap outs initiated simultaneously to all relevant nodes.

2. No immediate effect on computation node upon failure of node.

3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.

Page 44

Quicksort Benchmark with Memory Constraint

Memory ratio (constrained using cgroup)   HECA overhead   RRAIM overhead
3:4                                       2.08%           5.21%
1:2                                       2.62%           6.15%
1:3                                       3.35%           9.21%
1:4                                       4.15%           8.68%
1:5                                       4.71%           9.28%

[Charts: Quicksort benchmark with 512 MB, 1 GB, and 2 GB datasets. Overhead (axis 0.00% to 10.00%) versus memory ratio (3:4, 1:2, 1:3, 1:4, 1:5). Series: DSM Overhead, RRAIM Overhead.]