Post on 17-Jan-2016

Page 1: IP Communication Fabric Mike Polston HP michael.polston@hp.com

IP Communication Fabric

Mike Polston

HP

michael.polston@hp.com

Page 2:

Agenda

• Data Center Networking Today

• IP Convergence and RDMA

• The Future of Data Center Networking

Page 3:

Communication Fabric versus Communication Network

Page 4:

Communication Fabrics

• The Need
– Fast, efficient messaging between two users of a shared network or bus
– Predictable response and fair utilization for any 2 users of the ‘fabric’

• Examples
– Telephone switch
– Circuit switch
– ServerNet
– Giganet
– InfiniBand
– RDMA over IP

Page 5:

How Many, How Far, How Fast?

[Chart: axes are Number of Systems Connected (exponential scale), Distance, and 1/Speed; regions shown are Bus, LAN, SAN, Internet, and “Fabrics”]

Page 6:

Data Center Connections

Connects for:
• Management
• Public Net Access
• Client (PC) Access
• Storage Access
• Server to Server Messaging
• Load Balancing
• Server to Server Backup
• Server to DBMS
• Server to Server HA

Page 7:

Fabrics Within the Data Center Today

• Ethernet Networks
– Pervasive Infrastructure
– Proven Technology
– IT Experience
– Management Tools
– Volume and Cost Leader
– Accelerated Speed Improvements

Page 8:

Fabrics Within the Data Center Today

• Clustering
– High Availability
– Computer Clustering
– Some on Ethernet
– Memory Channel
– Other Proprietary
– Async connections
– Early standards (ServerNet, Giganet, Myrinet)

Page 9:

Fabrics Within the Data Center Today

• Storage Area Networks
– Fibre Channel
– Mostly Standard
– Gaining Acceptance
– Record
– File
– Bulk Transfer

Page 10:

Fabrics Within the Data Center Today

• Server Management
– KVM Switches
– HP RILOE, iLO
– KVM over IP
– Private IP nets

Page 11:

Business Growth … And the Need for Scale

• Processors scale at Moore’s Law: doubling every 18 months
• Networks scale at Gilder’s Law: doubling every 6 months
• Memory bandwidth growth rate: only 10-15% per year

Solution to Scalability

[Diagram labels: Scale Up, Scale Out, Partitionable, Sea of Servers]
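The gap between these growth rates is easier to see when compounded. The following is a minimal sketch, not part of the original slides: the function names and the 5-year horizon are assumptions chosen for illustration, and only the doubling periods and the 10-15% range come from the slide.

```python
# Compound the growth rates quoted on the slide over a fixed horizon.
# Function names and the 5-year horizon are illustrative assumptions.

def growth_from_doubling(months_to_double: float, years: float) -> float:
    """Growth factor after `years` when capacity doubles every `months_to_double` months."""
    return 2.0 ** (years * 12.0 / months_to_double)


def growth_from_annual_rate(rate_per_year: float, years: float) -> float:
    """Growth factor after `years` of compound growth (0.10 means 10% per year)."""
    return (1.0 + rate_per_year) ** years


if __name__ == "__main__":
    years = 5
    print(f"processors, 18-month doubling: {growth_from_doubling(18, years):.1f}x")
    print(f"networks, 6-month doubling:    {growth_from_doubling(6, years):.1f}x")
    print(f"memory bandwidth, 10%/year:    {growth_from_annual_rate(0.10, years):.2f}x")
    print(f"memory bandwidth, 15%/year:    {growth_from_annual_rate(0.15, years):.2f}x")
```

Over five years these rates give roughly 10x for processors, 1024x for networks, and between 1.6x and 2.0x for memory bandwidth, which is why memory copies become the limiting cost as link speeds rise.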

Page 12:

Why Scale Out?

• Provide benefits by adding, not replacing …
• Fault Resilience
– HA Failover
– N + 1 Protection
• Modular System Growth
– Blades, Density
– Investment Protection
• Parallel Processing
– HPTC
– DBMS Processing
– Tiered Architectures

Page 13:

The “Hairy Golf Ball” Problem

Page 14:

Agenda

• Data Center Networking Today

• IP Convergence and RDMA

• The Future of Data Center Networking

Page 15:

IP Convergence

[Diagram: storage, networking, remote management, and clustering around a central “convergence” label]

Page 16:

Ethernet Bandwidth Evolution

• 1973: 3 Mbps
• 1979: 10 Mbps
• 1994: 100 Mbps
• 1998: 1 Gbps
• 2002: 10 Gbps
• 20xx: 1xx Gbps

Page 17:

Sockets Scalability: Where is the Overhead?

• Send Message
– 9000 Instructions
– 2 mode switches
– 1 memory registration
– 1 CRC Calculation
• Receive Message
– 9000 Instructions (less with CRC & LSS offload)
– 2 mode switches
– 1 buffer copy
– 1 interrupt
– 1 CRC calculation
• Systemic Effects
– Cache, Scheduling
• Single RPC Request
– = 2 sends & 2 receives

[Diagram: Traditional LAN Architecture Components, top to bottom: Application; OSV API (Winsock); User/Kernel boundary; OSV API Kernel Service(s); Protocol Stack(s); Device Driver; LAN Media Interface. Annotation: 50-150mS one-way]
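The per-message figures above can be totalled for one RPC. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the receive path uses the full 9000-instruction figure (ignoring the "less with offload" note), and the totals cover both hosts together.

```python
# Total the per-message costs from the slide for one RPC (2 sends, 2 receives).
# MessageCost and rpc_cost are illustrative names, not from the original deck.
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageCost:
    instructions: int
    mode_switches: int
    memory_registrations: int
    buffer_copies: int
    interrupts: int
    crc_calculations: int


# Figures taken from the slide; offload savings on receive are ignored.
SEND = MessageCost(9000, 2, 1, 0, 0, 1)
RECEIVE = MessageCost(9000, 2, 0, 1, 1, 1)


def rpc_cost(sends: int = 2, receives: int = 2) -> MessageCost:
    """Sum the costs of `sends` send operations and `receives` receive operations."""
    return MessageCost(
        instructions=sends * SEND.instructions + receives * RECEIVE.instructions,
        mode_switches=sends * SEND.mode_switches + receives * RECEIVE.mode_switches,
        memory_registrations=sends * SEND.memory_registrations,
        buffer_copies=receives * RECEIVE.buffer_copies,
        interrupts=receives * RECEIVE.interrupts,
        crc_calculations=sends * SEND.crc_calculations + receives * RECEIVE.crc_calculations,
    )


if __name__ == "__main__":
    print(rpc_cost())
```

With the slide's numbers, one request/response pair costs 36,000 instructions, 8 mode switches, 2 buffer copies, and 2 interrupts before any application work is done.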

Page 18:

What is RDMA?

• Remote DMA (RDMA)
– The ability to move data from the memory space of one process to another memory space, with minimal use of the remote node’s processor.
– Provides error-free data placement without CPU intervention or data movement at either node, a.k.a. Direct Data Placement (DDP).
• Capable of being submitted and completed from user mode without subverting memory protection semantics (OS bypass).
• Request processing for messaging and DMA handled by receiver without host OS/CPU involvement.

Page 19:

The Need for RDMA

• At 1Gbps and above, memory copy overhead is significant, and it’s not necessarily the CPU cycles
– Server designs don’t have 100MBytes/sec of additional memory bandwidth for buffer copying
– RDMA makes each segment self-describing, so it can be landed in the right place without copying and/or buffering
• Classic networking requires two CPUs to be involved in a request/response pair for data access
– End-to-end latency includes kernel scheduling events at both ends, which is guaranteed to be 10s-100s of milliseconds.
– TOE alone doesn’t help with the kernel scheduling latency
– RDMA initiates data movement from one CPU only, with no kernel transition. End-to-end latency is 10s of microseconds
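The memory-bandwidth point can be checked with simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, and it assumes each buffer copy reads and writes every byte once with no help from caches.

```python
# Estimate the extra memory traffic caused by copying received data at line rate.
# Assumption: one copy = one read plus one write of every byte, caches ignored.

def line_rate_bytes_per_sec(gbps: float) -> float:
    """Payload rate in bytes per second for a link of `gbps` gigabits per second."""
    return gbps * 1e9 / 8


def copy_memory_traffic(gbps: float, copies: int = 1) -> float:
    """Additional memory traffic in bytes per second for `copies` buffer copies."""
    return line_rate_bytes_per_sec(gbps) * 2 * copies


if __name__ == "__main__":
    for gbps in (1, 10):
        rate_mb = line_rate_bytes_per_sec(gbps) / 1e6
        extra_mb = copy_memory_traffic(gbps) / 1e6
        print(f"{gbps} Gbps: {rate_mb:.0f} MB/s on the wire, {extra_mb:.0f} MB/s extra for one copy")
```

Under these assumptions a 1 Gbps link carries 125 MB/s, and a single receive-side copy adds about 250 MB/s of memory traffic; at 10 Gbps the same copy costs about 2.5 GB/s.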

Page 20:

Typical RDMA Implementation

[Diagram components: Applications and DBMS Apps; OS Vendor API (WinSock, MPI, Other); User Agent (Verbs): Open/Close/Map Memory, Send/Receive/Read/Write; Kernel Agent; Kernel HW Interface; queue pairs (QP) and a completion queue (CQ); Fabric Media Interface (ServerNet, IB, Ethernet)]
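The queue pair (QP) and completion queue (CQ) in the diagram can be illustrated with a toy in-process model. This is a sketch only: the class and method names are hypothetical and do not correspond to any real verbs library, and delivery happens by direct memory write inside one Python process rather than over a fabric.

```python
# Toy model of queue pairs and completion queues. All names are hypothetical;
# this is not the API of any real RDMA verbs implementation.
from collections import deque
from typing import Optional, Tuple


class CompletionQueue:
    def __init__(self) -> None:
        self.entries: deque = deque()

    def poll(self) -> Optional[Tuple[str, str, int]]:
        """Return the oldest completion, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None


class QueuePair:
    """A receive queue of posted buffers plus a link to a peer queue pair."""

    def __init__(self, name: str, cq: CompletionQueue) -> None:
        self.name = name
        self.cq = cq
        self.recv_queue: deque = deque()  # buffers posted by the application
        self.peer: Optional["QueuePair"] = None

    def connect(self, other: "QueuePair") -> None:
        self.peer, other.peer = other, self

    def post_recv(self, buf: bytearray) -> None:
        self.recv_queue.append(buf)

    def post_send(self, data: bytes) -> None:
        """Place `data` in the peer's next posted buffer and complete both sides."""
        if self.peer is None:
            raise RuntimeError("queue pair is not connected")
        if not self.peer.recv_queue:
            raise RuntimeError("peer has no receive buffer posted")
        buf = self.peer.recv_queue[0]
        if len(data) > len(buf):
            raise ValueError("message larger than posted receive buffer")
        self.peer.recv_queue.popleft()
        buf[: len(data)] = data
        self.cq.entries.append(("send", self.name, len(data)))
        self.peer.cq.entries.append(("recv", self.peer.name, len(data)))


if __name__ == "__main__":
    cq_a, cq_b = CompletionQueue(), CompletionQueue()
    a, b = QueuePair("a", cq_a), QueuePair("b", cq_b)
    a.connect(b)
    buf = bytearray(16)
    b.post_recv(buf)
    a.post_send(b"hello")
    print(bytes(buf[:5]), cq_a.poll(), cq_b.poll())
```

The point of the model is the division of labour: the application posts work and polls for completions, while data lands directly in a buffer the receiver posted in advance.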

Page 21:

“Big Three” wins for RDMA

• Accelerate legacy sockets apps
– User space sockets -> SDP -> RDMA
– Universal 25% - 35% performance gain in Tier 2-3 application communication overhead
• Parallel commercial database
– <100us latency needed to scale real world apps
– Requires user space messaging and RDMA
• IP based storage
– Decades old block storage access model (iSCSI, SRP): Command/RDMA Transfer/Completion
– Emerging user space file access (DAFS, NFS, CIFS): Compaq experiment identified up to 40% performance advantage. First lab test beat hand-tuned TPC-C run by 25%

Page 22:

WHY IP versus IB?

• Ethernet Hardware Continues to Advance
– Speed, Low Cost, Ubiquity
• TCP Protocol Continues to Advance
• Management and Software Tools
• Internet WorldWide Trained Staff
• World Standards – Power, Phone, IP

Page 23:

RDMA Consortium (RDMAC)

• Formed in Feb, 2002
• Went public in May, 2002
• Founders were Adaptec, Broadcom, Compaq, HP, IBM, Intel, Microsoft, NetApp. Added EMC and Cisco
• Open group with no fees working fast and furious
• Deliverables include:
– Framing, DDP and RDMA Specifications
– Sockets Direct
– SCSI Mapping Investigation
• Deliverables to be submitted to the IETF as informational RFCs

Page 24:

The Stack

• RDMA – Converts RDMA Write, RDMA Read, and Sends into DDP message(s).
• DDP – Segments outbound DDP Messages into 1 or more DDP Segments; reassembles 1 or more DDP Segments into a DDP Message.
• MPA – Adds a backward marker at a fixed interval to DDP Segments. Also adds a length and CRC to each MPA Segment.
• TCP – Schedules outbound TCP Segments and satisfies delivery guarantees.
• IP – Adds necessary network routing information.

[Diagram: protocol stack, top to bottom: RDMA, DDP, MPA, TCP, IP]
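The segmentation and framing steps described for DDP and MPA can be sketched as follows. This is a simplified illustration, not the wire format: the function names are hypothetical, the periodic markers and the DDP/RDMA headers are omitted, and `zlib.crc32` stands in for whatever CRC the specification requires. Reassembly here relies on in-order delivery, which is the guarantee the slide assigns to TCP.

```python
# Simplified sketch: split a message into segments, then frame each segment
# with a 16-bit length prefix and a trailing 32-bit CRC. Markers and the real
# DDP/RDMA headers are left out; zlib.crc32 is only a stand-in checksum.
import struct
import zlib
from typing import Iterable, List


def ddp_segment(message: bytes, max_payload: int) -> List[bytes]:
    """Split `message` into segments of at most `max_payload` bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [message[i:i + max_payload] for i in range(0, len(message), max_payload)]


def frame(segment: bytes) -> bytes:
    """Prefix the segment length and append a CRC over length plus payload."""
    header = struct.pack("!H", len(segment))
    crc = zlib.crc32(header + segment) & 0xFFFFFFFF
    return header + segment + struct.pack("!I", crc)


def unframe(framed: bytes) -> bytes:
    """Check the CRC and return the payload; raise ValueError on corruption."""
    (length,) = struct.unpack("!H", framed[:2])
    body = framed[:2 + length]
    (crc,) = struct.unpack("!I", framed[2 + length:6 + length])
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return body[2:]


def reassemble(frames: Iterable[bytes]) -> bytes:
    """Rebuild the message from frames that arrive in order."""
    return b"".join(unframe(f) for f in frames)


if __name__ == "__main__":
    message = bytes(range(256)) * 5
    frames = [frame(s) for s in ddp_segment(message, 500)]
    print(len(frames), reassemble(frames) == message)
```

The length and CRC let a receiver find and validate each frame inside the TCP byte stream, which is what makes it possible to place a segment without waiting for the whole message.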

Page 25:

RDMA Architectural Goals

• Data transfer from local to remote system into an advertised buffer
• Data retrieval from a remote system to local from an advertised buffer
• Data transfer from a local to remote system into a non-advertised buffer
• Allow local system to signal completion to the remote system
• Provide for reliable sequential delivery from local to remote
• Provide for multiple stream support

Page 26:

RDMAP Data Transfer Operations

• Send
• Send with Invalidate
• Send with Solicited Event (SE)
• Send with SE and Invalidate
• RDMA Write
• RDMA Read
• Terminate

Page 27:

Direct Data Placement

• Contain Placement Information
– Relative Address
– Record Length
• Tagged Buffers
• UnTagged Buffers
• Allows NIC Hardware to access application memory (Remote DMA)
• Can be implemented with or without TOE
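The difference between tagged and untagged buffers can be sketched with a toy receiver. This is an illustration under stated assumptions: the class and method names are hypothetical, the integer tag stands in for whatever identifier the sender was given when the buffer was advertised, and the offset plays the role of the "relative address" on the slide.

```python
# Toy receiver showing tagged placement (by tag and offset) versus untagged
# placement (next posted buffer). All names here are illustrative assumptions.
from collections import deque
from typing import Deque, Dict


class PlacementTarget:
    def __init__(self) -> None:
        self.tagged: Dict[int, bytearray] = {}
        self.untagged: Deque[bytearray] = deque()

    def register_tagged(self, tag: int, size: int) -> bytearray:
        """Create a buffer that a sender can address by tag and offset."""
        buf = bytearray(size)
        self.tagged[tag] = buf
        return buf

    def post_untagged(self, size: int) -> bytearray:
        """Queue an anonymous buffer for the next incoming message."""
        buf = bytearray(size)
        self.untagged.append(buf)
        return buf

    def place_tagged(self, tag: int, offset: int, data: bytes) -> None:
        """Write a segment at its stated offset; arrival order does not matter."""
        buf = self.tagged[tag]
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("segment falls outside the advertised buffer")
        buf[offset:offset + len(data)] = data

    def place_untagged(self, data: bytes) -> bytearray:
        """Write a message into the oldest posted buffer and return that buffer."""
        if not self.untagged:
            raise RuntimeError("no untagged buffer posted")
        buf = self.untagged[0]
        if len(data) > len(buf):
            raise ValueError("message larger than posted buffer")
        self.untagged.popleft()
        buf[: len(data)] = data
        return buf


if __name__ == "__main__":
    target = PlacementTarget()
    buf = target.register_tagged(tag=7, size=10)
    target.place_tagged(7, 5, b"world")  # later segment arrives first
    target.place_tagged(7, 0, b"hello")
    print(bytes(buf))
```

Because each tagged segment carries its own offset and length, it can be written to its final location as soon as it arrives, with no intermediate reassembly buffer.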

Page 28:

RDMA over MPA/TCP Header Format

[Diagram: nested encapsulation of an RDMA message. Recoverable labels: Ethernet Header and Data; IP Datagram (IP Header, IP Data); TCP Segment (TCP Header, TCP Payload / TCP Data); Frame PDU (Length, ULP Payload, Marker); DDP Segment (DDP/RDMA Header(s), DDP/RDMA Payload); RDMA Message / ULP PDU (Operation, RDMA or Anonymous Buffer); logical operations: RDMA Read, RDMA Write, Send]

Page 29:

Agenda

• Data Center Networking Today

• IP Convergence and RDMA

• The Future of Data Center Networking

Page 30:

Emerging Fabric Adoption: Two Customer Adoption Waves as Solutions Evolve

[Chart: adoption over Time for InfiniBand and RDMA/TCP]

Wave 1
• First fabric solutions available (InfiniBand)
• Fabric evaluation within data centers begins

Wave 2
• Fabric Computing pervasiveness
• IP fabric solutions become the leading choice for data center fabric deployments
• Leverage existing investment and improve infrastructure performance and utilization
• InfiniBand used for specialized applications

Page 31:

Ethernet Roadmap: Continued Ethernet Pervasiveness in the Datacenter

[Diagram label: Today’s Ethernet Infrastructure]

Improved Ethernet Performance and Utilization
• 1 Gigabit Ethernet
• TCP/IP Offload & acceleration
• 10 Gigabit Ethernet
• iSCSI (Storage over IP)
• Lights-out Management (KVM over IP)

Revolutionary IP Improvements & Advancements
• Interconnect convergence
• Scalability & performance
• Resource virtualization
• RDMA/TCP Fabrics
• IP Sec Security acceleration

Page 32:

hp Fabric Leadership: Bringing NonStop Technologies to Industry Standard Computing

[Diagram labels: Robust, Scalable Computing; Technology & Expertise; High Volume Knowledge; Drive Fabric Standards; Introduced the First Switched Fabric; Breakthrough Fabric Economics; Fabric Computing; Leading Storage Fabric]

Page 33:

Fabrics Within Future Data Centers: Foundation for Future Adaptive Infrastructure Vision

Move from “tiers” to “elements”

• n-tier architecture, like DISA, replaced by element “pools” available over the fabric

• resource access managed by tools like ProLiant Essentials and hp OpenView

• centrally administered automation tools

Heterogeneous fabric “islands”

• data center fabric connecting “islands” of compute & storage resources

• RDMA/TCP enables practical fabric scaling across the datacenter

• protocol routers translate between islands

[Diagram labels: internet; edge router; firewall; routing switches; data center fabric; compute fabric (web servers, application servers, database servers); storage fabric (NAS, iSCSI SAN, virtualized functions); IP to FC router; Fibre Channel SAN; IP to IB router (UNIX); switches; management systems (provisioning, monitoring, resource mgmt by policy, service-centric); hp OpenView; ProLiant Essentials; hp utility data center]

Page 34:

Making the IP Fabric Connection