
Using Elastic Secure Infrastructure (ESI) in (De)-Centralized Environments: Why and How?

Apoorve Mohan (Northeastern University), Sahil Tikale (Boston University)

Mass Open Cloud

Many Organizations Invest $$$$ in Bare-Metal Clusters

❏ Buy or Rent Bare-Metal Infrastructure

● private data center (on-premises)
● shared data center (co-located facility)
● rent from a service provider (hosting)

❏ Set Up Bare-Metal Clusters

● provision infrastructure with the intended software stack to manage and run workloads

Cluster size is determined by factors such as budget, historical usage, and workload type.

1. Fixed-Sized Bare-Metal Clusters Can Lead to Underutilization or Starvation

E.g. Microsoft Azure (Cloud)

Resource demand of short-lived VMs shows strong, predictable diurnal patterns.

[Figure: allocation pattern vs. maximum allocation]

Source: https://github.com/Azure/AzurePublicDataset

Underutilization!

E.g. LANL Mustang (HPC): Long Job Queue

Steady high utilization; jobs wait for long periods for resources to free up.

Source: http://ftp.pdl.cmu.edu/pub/datasets/ATLAS/

Starvation!

E.g. U.S. NET2 ATLAS (HTC)

Cluster size limited by budget; millions of short-lived, single-node batch jobs.

Source: http://egg.bu.edu

Starvation!

Time-Multiplexing of Unused Bare-Metal Clusters Not Possible Due to Siloing


2. Bare-Metal Siloing Can Lead to Poor Aggregate Resource Efficiency

E.g. Two Sigma Data Center (Analytics)

Siloing side effect: frequent use of the public cloud despite unused capacity.

[Figure: data center capacity vs. aggregate allocation]

Source: Two Sigma Investments

If Such Silos Are Co-located, Bare-Metal Multiplexing Can Improve Aggregate Resource Efficiency, BUT...

Are Potential Gains from Bare-Metal Multiplexing Marginal?

Common Concerns

❏ What Is the Impact of Multiplexing Speed?

❏ What About Spare Nodes in Clusters Running On-Demand Workloads? (e.g., bursty situations, hardware failures)

❏ What If HPC Runs Slowly on Commodity Hardware? (e.g., bare-metal servers from the cloud)

Potential Benefits: Bare-Metal Multiplexing

No Spare Nodes for On-Demand Workloads

~57% Utilization Improvement and ~7% Cost Savings, Irrespective of Multiplexing Cost

[Figure: aggregate utilization; cost savings]

Potential Benefits: Bare-Metal Multiplexing

25% Spare Nodes for On-Demand Workloads

Constant Slack Impacts Gains Significantly

[Figure: aggregate utilization; cost savings]

E.g. Microsoft Azure (Cloud)

Why Keep 25% Spare Nodes All the Time?

[Figure: allocation pattern vs. maximum allocation]

Source: https://github.com/Azure/AzurePublicDataset

Why not just account for the maximum expected slope for a given multiplexing cost?

Potential Benefits: Bare-Metal Multiplexing

Low Multiplexing Cost Is Crucial

[Figure: maximum observed slope for different multiplexing costs; aggregate utilization; cost savings]
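One way to read "maximum observed slope for a given multiplexing cost" is as the largest rise in demand seen within any window as long as the multiplexing delay: a cluster that can only obtain extra nodes after that delay must hold at least that many spare nodes. The Python sketch below computes that quantity under this assumption; the function name, the per-step node-count trace, and the window expressed in time steps are all hypothetical, not taken from the talk's analysis code.

```python
from typing import Sequence


def max_observed_slope(demand: Sequence[int], window: int) -> int:
    """Largest increase in demand over any span of at most `window` steps.

    `demand` holds node counts per time step; `window` is the multiplexing
    delay in time steps (hypothetical parameters for illustration).
    """
    if window < 1:
        raise ValueError("window must be at least 1 time step")
    worst = 0  # demand that never rises needs no spare nodes
    for start in range(len(demand)):
        # Look ahead no further than the multiplexing delay (or the trace end).
        last = min(start + window, len(demand) - 1)
        for end in range(start + 1, last + 1):
            worst = max(worst, demand[end] - demand[start])
    return worst


# Hypothetical trace: a slower multiplexing path needs more spare nodes.
trace = [10, 12, 15, 14, 20, 18]
print(max_observed_slope(trace, window=1))  # 6
print(max_observed_slope(trace, window=3))  # 8
```

The longer the multiplexing delay, the larger the window and the more slack a cluster must keep, which is one way to see why low multiplexing cost matters.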

Potential Benefits: Bare-Metal Multiplexing

HPC Jobs Degraded 100%

Significant Gains Despite Performance Degradation

[Figure: aggregate utilization; cost savings]

Bare-Metal Multiplexing

Questions?

❏ How fast can we multiplex practically?

❏ Multi-tenancy
➔ Security and Isolation
➔ System Visibility
➔ Control

❏ Performance

Elastic Secure Infrastructure (ESI)

Hardware Isolation Layer (Hennessy et al. [SoCC’16]) ➔ Network Isolation ➔ Multi-Tenancy

Bare Metal Imaging Service (Mohan et al. [IC2E’18]) ➔ Diskless Provisioning ➔ State Management

Tenant Controlled Bare-Metal Security (Mosayebzadeh et al. [ATC’19]) ➔ Trusted Boot and Remote Attestation ➔ Security and Verification

Together: Rapid and Secure Bare-Metal Multiplexing
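ESI composes network isolation, diskless provisioning, and attestation into one multiplexing step. The sketch below shows one plausible ordering of those steps for a single node move. It is illustrative only: `Node`, `multiplex`, and the `QUARANTINE` network are hypothetical names, attestation is a caller-supplied callback, and none of this is ESI's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical name for an isolated network used while a node is verified.
QUARANTINE = "quarantine-net"


@dataclass
class Node:
    name: str
    network: Optional[str] = None  # isolated network the node is attached to
    image: Optional[str] = None    # diskless boot image the node runs
    log: List[str] = field(default_factory=list)


def multiplex(node: Node, cluster: str, image: str,
              attest: Callable[[Node], bool]) -> bool:
    """Move `node` into `cluster`; return True if it joined the cluster."""
    # Network isolation: cut the node off from its previous cluster first.
    node.log.append(f"detached from {node.network}")
    node.network = QUARANTINE
    # Diskless provisioning (assumed): tenant state lives in remote storage,
    # so detaching the boot image leaves no tenant disk state on the node.
    node.image = None
    # Security and verification: only nodes that pass attestation move on.
    if not attest(node):
        node.log.append("attestation failed; node stays quarantined")
        return False
    # State management: boot the target cluster's image over the network.
    node.image = image
    # Multi-tenancy: attach the node to the target cluster's network.
    node.network = cluster
    node.log.append(f"joined {cluster} running {image}")
    return True


# Usage: move a Slurm node into a Spark cluster (always-passing attestation stub).
n1 = Node("n1", network="slurm-net", image="slurm.img")
print(multiplex(n1, "spark-net", "spark.img", attest=lambda node: True))  # True
print(n1.log)
```

Attesting before the node touches the target network means a compromised node never reaches another tenant, which is the property the security layer is meant to guarantee.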

Bare-Metal Multiplexing with ESI

Slurm to Spark

OpenStack to Spark

<7 minutes to multiplex 32 nodes in an unoptimized environment

Potential Benefits: Bare-Metal Multiplexing

Low Multiplexing Cost Is Crucial

With ESI

[Figure: aggregate utilization; cost savings; maximum observed slope]

Exploring Two Scenarios

❏ Single Organization Hosting Multiple Clusters

❏ Multiple Such Organizations Co-located in a Data Center

Case 1: Single Organization Hosting Multiple Clusters

Cost Model

[Diagram: Telemetry, ESI]

❏ Value-proposition-based model
❏ Clusters define independent value-proposition metrics
❏ Each value-proposition metric translates to $$
❏ Organization-level $$ maximization
❏ Clusters prevent SLO violations with the $$ they pay to acquire resources
❏ Support for dynamically changing SLOs
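To make "a value-proposition metric translates to $$" concrete, the sketch below assumes one simple translation: each node that serves demand earns a fixed amount, and running below demand counts as an SLO violation with a flat penalty. The marginal value of one more node is then what a cluster could justify paying for it, and the organization maximizes the sum. The function names and the penalty model are hypothetical, not the talk's actual cost model.

```python
from typing import Callable, Dict

# A value function maps a cluster's node count to $$ per scheduling interval.
ValueFn = Callable[[int], float]


def slo_value(demand: int, dollars_per_node: float, slo_penalty: float) -> ValueFn:
    """Hypothetical translation of a cluster metric into $$.

    Each node that serves demand earns `dollars_per_node`; running with fewer
    nodes than `demand` is treated as an SLO violation costing `slo_penalty`.
    """
    def value(nodes: int) -> float:
        served = min(nodes, demand)
        penalty = slo_penalty if nodes < demand else 0.0
        return served * dollars_per_node - penalty
    return value


def marginal_value(fn: ValueFn, nodes: int) -> float:
    """$$ a cluster could justify paying for one more node."""
    return fn(nodes + 1) - fn(nodes)


def organization_value(clusters: Dict[str, ValueFn], alloc: Dict[str, int]) -> float:
    """Organization-level objective: the sum of every cluster's $$ value."""
    return sum(fn(alloc[name]) for name, fn in clusters.items())


hpc = slo_value(demand=10, dollars_per_node=2.0, slo_penalty=50.0)
cloud = slo_value(demand=4, dollars_per_node=1.0, slo_penalty=10.0)
print(marginal_value(hpc, 9))   # 52.0: the node that avoids the SLO violation
print(marginal_value(hpc, 10))  # 0.0: extra nodes beyond demand add nothing
print(organization_value({"hpc": hpc, "cloud": cloud}, {"hpc": 10, "cloud": 5}))  # 24.0
```

Because each cluster supplies its own value function, clusters can use independent metrics and change their SLOs without changing the organization-level objective.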

Case 1: Single Organization Hosting Multiple Clusters

Meta-Scheduling Cycle of BareShala

[Diagram: HPC, Cloud, and Big-Data clusters; Telemetry; Future Demand Prediction; Cost Model; ESI]

1. Telemetry ➔ current value proposition of each cluster
2. Future demand prediction ➔ future value proposition of each cluster
3. Cost model ➔ cluster resize decisions ➔ ESI
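The three steps of the cycle can be sketched as one loop iteration. To keep it short, this sketch substitutes plain node demand for the value proposition, uses a mean of the last three telemetry samples as the demand predictor, and sizes each cluster for the larger of current and predicted demand. All of these choices and the function names are hypothetical assumptions, not BareShala's actual logic; executing the resulting moves through ESI is left out.

```python
import math
from typing import Dict, List


def predict_demand(history: List[int], k: int = 3) -> int:
    """Hypothetical predictor: ceiling of the mean of the last `k` samples."""
    if not history:
        raise ValueError("need at least one telemetry sample")
    recent = history[-k:]
    return math.ceil(sum(recent) / len(recent))


def meta_schedule_cycle(capacity: Dict[str, int],
                        telemetry: Dict[str, List[int]]) -> Dict[str, int]:
    """One pass of the cycle; returns a per-cluster resize delta.

    Positive values are nodes a cluster needs, negative values are nodes it
    can give back. Executing the moves (via ESI) is left out of this sketch.
    """
    decisions = {}
    for cluster, size in capacity.items():
        current = telemetry[cluster][-1]                # 1. current demand
        predicted = predict_demand(telemetry[cluster])  # 2. future demand
        # 3. Resize decision: plan for the larger of current and predicted demand.
        decisions[cluster] = max(current, predicted) - size
    return decisions


capacity = {"hpc": 10, "cloud": 8, "bigdata": 6}
telemetry = {"hpc": [9, 11, 13], "cloud": [4, 5, 6], "bigdata": [6, 6, 6]}
print(meta_schedule_cycle(capacity, telemetry))
# {'hpc': 3, 'cloud': -2, 'bigdata': 0}
```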

Case 2: Co-located Non-Trusting Organizations

HPC/HTC Cluster

● Unlimited CPU demand.
● Aggregated CPU usage per month.
● Happy to share if monthly CPU usage > HPC-owned CPU time.

Scalability Lab @ Red Hat

● High-volume demand: 1000s of servers.
● Predictable cyclical demands.

Security Sensitive Clusters

● Tedious and time-consuming to build.
● Utilization < 1%.
● Willing to share if compliant hardware is available when required.

● Dedicated data centers for national emergencies, mostly utilized at around 2%.
● Willing to share if they can use the shared pool to ramp up their systems during emergencies.

Government Agencies

● Interactive demand: short-term peaks.
● Would rather let others use the hardware than have it run idle.

Cloud and OS researchers:

● Need “exact-same hardware”.
● Willing to share if the exact same hardware is guaranteed to be available on demand.
● Peak demand: paper deadlines.

Cloud Lab


Case 2: Co-located Non-Trusting Organizations

Why should I share my servers?

How to encourage sharing of servers?

● Access to your own hardware whenever you want.
● Ability to reserve nodes for future use.
● Ability to request and offer specific hardware.
● Strong incentive to give up nodes when:
  ○ You do not need them, or
  ○ Someone else needs them more than you do.

Case 2: Co-located Non-Trusting Organizations

A Marketplace with an Underlying Economic Model

Case 2: Co-located Non-Trusting Organizations

FLOCX

[Diagram: Organization-X (ESI-X, BareShala-X, Cluster-A, Cluster-B, Trading Agent-X) and Organization-Y (ESI-Y, BareShala-Y, Cluster-1, Cluster-2, Cluster-3, Trading Agent-Y) exchange offers, bids, and contracts through the FLOCX Marketplace; Contract-XY links the two organizations.]

● Trading Agent-X ➔ Offer
● Trading Agent-Y ➔ Bid
● Auction Engine ➔ Contracts
● Example contract: Contract-id: 12bc4r, Renter: Org-X, Borrower: Org-Y
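The marketplace flow (offers and bids in, contracts out) can be illustrated with a deliberately simple auction engine. This sketch assumes greedy matching (highest bid first, against the cheapest compatible offer from a different organization) and pay-as-offer pricing; it also reads the slide's example contract as "the renter offers the node, the borrower bids for it". These rules and all names are assumptions for illustration, not FLOCX's actual auction mechanism.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class Offer:
    org: str
    node_type: str
    price: float  # lowest price the owner accepts per node


@dataclass
class Bid:
    org: str
    node_type: str
    price: float  # highest price the bidder pays per node


@dataclass
class Contract:
    contract_id: str
    renter: str     # organization that offered (lends) the node
    borrower: str   # organization that bid for (uses) the node
    node_type: str
    price: float


def run_auction(offers: List[Offer], bids: List[Bid]) -> List[Contract]:
    """Greedy matching: highest bids first, each against the cheapest compatible offer."""
    open_offers = sorted(offers, key=lambda o: o.price)
    contracts = []
    for bid in sorted(bids, key=lambda b: b.price, reverse=True):
        for offer in open_offers:
            if (offer.node_type == bid.node_type and offer.org != bid.org
                    and offer.price <= bid.price):
                # Pay-as-offer pricing is an assumption of this sketch.
                contracts.append(Contract(uuid.uuid4().hex[:6], offer.org,
                                          bid.org, offer.node_type, offer.price))
                open_offers.remove(offer)  # each offer fills at most one bid
                break
    return contracts


offers = [Offer("Org-X", "gpu", 5.0), Offer("Org-X", "cpu", 1.0)]
bids = [Bid("Org-Y", "cpu", 2.0), Bid("Org-Y", "gpu", 4.0)]
for c in run_auction(offers, bids):
    print(c.renter, c.borrower, c.node_type, c.price)  # Org-X Org-Y cpu 1.0
```

Because trading agents only exchange offers, bids, and contracts, neither organization has to trust the other's scheduler; each keeps control of its own ESI.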

Key Takeaways

● Silos of statically allocated clusters = poor aggregate resource efficiency.

● Elastic Secure Infrastructure (ESI):
  ○ Rapid and secure multiplexing of bare-metal servers is possible.

● BareShala:
  ○ Centralized meta-scheduler.
  ○ Improves aggregate resource efficiency across the clusters of a single organization.

● FLOCX:
  ○ Decentralized incentive system.
  ○ Improves aggregate resource efficiency across organizations.

● ESI + BareShala + FLOCX:
  ○ Efficient usage of the data center.
  ○ Support for current and future clusters.
  ○ Flexibility without giving up control.


Current Status

● Elastic Secure Infrastructure (ESI) is being productized as part of upstream multi-tenant Ironic.

● Prototypes of BareShala and FLOCX are being developed.

Case 1: Single Organization Hosting Multiple Clusters

Decision Model for Best Placement of Servers per Cluster

Total nodes that all clusters (HPC, Big-Data, Cloud) can give in this interval, e.g., a cluster can give 80%, 10%, or 40% of its capacity.

Total nodes that all clusters (HPC, Big-Data, Cloud) need in this interval, e.g., a cluster needs 50% or 40% more than its capacity, or nothing in this interval.

Objective: maximize the value gained by moving nodes from one cluster to another.
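A greedy sketch of "maximize the value gained by moving nodes" follows. It assumes each cluster has a constant $$ value per node it receives (`gain`) and per node it gives up (`loss`), then serves the highest-value receivers first from the cheapest donors, moving a node only when the gain exceeds the loss. The function name, parameters, and numbers are hypothetical; BareShala's actual decision model may use a different formulation.

```python
from typing import Dict, List, Tuple


def plan_moves(give: Dict[str, int], need: Dict[str, int],
               gain: Dict[str, float], loss: Dict[str, float]) -> List[Tuple[str, str, int]]:
    """Greedy node placement for one interval.

    give[c]: nodes cluster c can release; need[c]: extra nodes c wants;
    gain[c]: $$ value per node c receives; loss[c]: $$ value per node c gives up.
    Returns (source, destination, node_count) moves.
    """
    spare = dict(give)
    moves = []
    # Serve the clusters that value extra nodes most, first.
    for dst in sorted(need, key=lambda c: gain[c], reverse=True):
        wanted = need[dst]
        # Take nodes from the clusters that lose the least by giving them up.
        for src in sorted(spare, key=lambda c: loss[c]):
            if wanted == 0:
                break
            # Skip self-moves, empty sources, and moves that would not add value.
            if src == dst or spare[src] == 0 or gain[dst] <= loss[src]:
                continue
            count = min(wanted, spare[src])
            moves.append((src, dst, count))
            spare[src] -= count
            wanted -= count
    return moves


# Hypothetical interval: Cloud and Big-Data have spare nodes, HPC needs 50 more.
print(plan_moves(give={"cloud": 40, "bigdata": 10}, need={"hpc": 50},
                 gain={"hpc": 3.0}, loss={"cloud": 1.0, "bigdata": 2.0}))
# [('cloud', 'hpc', 40), ('bigdata', 'hpc', 10)]
```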
