ip communication fabric
Mike Polston, hp
[email protected]
TRANSCRIPT
Agenda
• Data Center Networking Today
• IP Convergence and RDMA
• The Future of Data Center Networking
Communication Fabric versus Communication Network
Communication Fabrics
• The Need
– Fast, efficient messaging between two users of a shared network or bus
– Predictable response and fair utilization for any 2 users of the ‘fabric’
• Examples
– Telephone switch
– Circuit switch
– ServerNet
– Giganet
– InfiniBand
– RDMA over IP
How Many, How Far, How Fast?
[Chart: interconnect classes (bus, LAN, SAN, Internet, and “fabrics”) plotted by number of systems connected (exponential scale), distance, and 1/speed]
Data Center Connections
Connects for:
• Management
• Public Net Access
• Client (PC) Access
• Storage Access
• Server to Server Messaging
• Load Balancing
• Server to Server Backup
• Server to DBMS
• Server to Server HA
Fabrics Within the Data Center Today
• Ethernet Networks
– Pervasive Infrastructure
– Proven Technology
– IT Experience
– Management Tools
– Volume and Cost Leader
– Accelerated Speed Improvements
Fabrics Within the Data Center Today
• Clustering
– High Availability
– Computer Clustering
– Some on Ethernet
– Memory Channel
– Other Proprietary
– Async connections
– Early standards (ServerNet, Giganet, Myrinet)
Fabrics Within the Data Center Today
• Storage Area Networks
– Fibre Channel
– Mostly Standard
– Gaining Acceptance
– Record
– File
– Bulk Transfer
Fabrics Within the Data Center Today
• Server Management
– KVM Switches
– HP RILOE, iLO
– KVM over IP
– Private IP nets
• Processors scale at Moore’s Law: doubling every 18 months
• Networks scale at Gilder’s Law: doubling every 6 months
• Memory bandwidth growth rate: only 10-15% per year
[Chart: Scale Up versus Scale Out as the solution to scalability; scale-up on partitionable systems, scale-out on a “sea of servers”]
Business Growth … and the Need for Scale
Why Scale Out?
• Provide benefits by adding, not replacing …
• Fault Resilience
– HA Failover
– N + 1 Protection
• Modular System Growth
– Blades, Density
– Investment Protection
• Parallel Processing
– HPTC
– DBMS Processing
– Tiered Architectures
The “Hairy Golf Ball” Problem
Agenda
• Data Center Networking Today
• IP Convergence and RDMA
• The Future of Data Center Networking
IP Convergence
[Diagram: networking, storage, clustering, and remote management converging onto IP]
Ethernet Bandwidth Evolution
1973: 3 Mbps
1979: 10 Mbps
1994: 100 Mbps
1998: 1 Gbps
2002: 10 Gbps
20xx: 1xx Gbps
Sockets Scalability: Where is the Overhead?
• Send Message
– 9000 instructions
– 2 mode switches
– 1 memory registration
– 1 CRC calculation
• Receive Message
– 9000 instructions (fewer with CRC & LSS offload)
– 2 mode switches
– 1 buffer copy
– 1 interrupt
– 1 CRC calculation
• Systemic Effects
– Cache, scheduling
• Single RPC Request = 2 sends & 2 receives (see the sockets sketch below)
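To make this accounting concrete, the sketch below is a minimal, hypothetical sockets client in C; the server address, port, and request format are invented for illustration. Each RPC costs a send() and a recv() on each host, every call is a user/kernel mode switch, and the recv() path ends with the kernel copying the payload from its socket buffer into the application buffer: the copies, interrupts, and mode switches counted above.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* The client half of one classic sockets RPC: a send() for the request and a
 * recv() for the reply (the server performs the matching recv() and send()).
 * Each call is a user/kernel mode switch, and recv() finishes with the kernel
 * copying the payload from its socket buffer into 'reply'. */
static int do_rpc(int fd, const char *req, char *reply, size_t reply_len)
{
    if (send(fd, req, strlen(req), 0) < 0)          /* mode switch; data staged in a kernel buffer */
        return -1;
    ssize_t n = recv(fd, reply, reply_len - 1, 0);  /* mode switch; interrupt + kernel-to-user copy */
    if (n < 0)
        return -1;
    reply[n] = '\0';
    return 0;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7000);                        /* hypothetical server port */
    inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr);   /* hypothetical server address */
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    char reply[256];
    if (do_rpc(fd, "GET record 42\n", reply, sizeof(reply)) == 0)
        printf("reply: %s", reply);
    close(fd);
    return 0;
}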
Traditional LAN Architecture Components
[Diagram: Application and OSV API (Winsock) in user mode, over OSV API kernel service(s), protocol stack(s), device driver, and LAN media interface in kernel mode; roughly 50-150 ms one-way]
What is RDMA?
• Remote DMA (RDMA): the ability to move data from the memory space of one process to another memory space, with minimal use of the remote node’s processor.
• Provides error-free data placement without CPU intervention or data movement at either node, a.k.a. Direct Data Placement (DDP).
• Capable of being submitted and completed from user mode without subverting memory protection semantics (OS bypass).
• Request processing for messaging and DMA handled by the receiver without host OS/CPU involvement.
The Need for RDMA
• At 1 Gbps and above, memory copy overhead is significant, and it’s not necessarily the CPU cycles
– Server designs don’t have 100 MBytes/sec of additional memory bandwidth for buffer copying (1 Gbps of payload is roughly 125 MBytes/sec, so even one extra copy consumes that much again in memory bandwidth)
– RDMA makes each segment self-describing, so it can be landed in the right place without copying and/or buffering
• Classic networking requires two CPUs to be involved in a request/response pair for data access
– End-to-end latency includes kernel scheduling events at both ends, which is guaranteed to be 10s-100s of milliseconds
– TOE alone doesn’t help with the kernel scheduling latency
– RDMA initiates data movement from one CPU only, with no kernel transition; end-to-end latency is 10s of microseconds
Typical RDMA Implementation
[Diagram: Applications and DBMS apps use an OS vendor API (WinSock, MPI, other) or talk directly to a user agent exposing verbs (open/close/map memory, send/receive/read/write) over queue pairs (QP) and completion queues (CQ); a kernel agent and kernel HW interface handle setup below, over the fabric media interface (ServerNet, IB, Ethernet)]
(see the verbs sketch below)
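As an illustration of the verbs model on this slide (registered memory, queue pairs, and an RDMA Write posted from user mode), here is a minimal sketch in C using the OpenFabrics libibverbs API; that library is an assumption for illustration, not the interface named on the slide, and queue-pair setup plus the exchange of the peer’s buffer address and steering tag (rkey) are omitted.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Registration ("map memory" in the verbs above) pins the buffer and returns
 * the keys the adapter uses for protection checks; done once during setup:
 *   struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
 *           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
 */

/* Post one RDMA Write: place 'len' bytes from our registered buffer directly
 * into the peer's advertised buffer. The peer's CPU is not involved in
 * completing the transfer, and the post happens from user mode. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,       /* local registered memory */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* direct data placement at the peer */
        .send_flags = IBV_SEND_SIGNALED, /* ask for a completion on the CQ */
    };
    wr.wr.rdma.remote_addr = remote_addr; /* peer's advertised buffer */
    wr.wr.rdma.rkey        = rkey;        /* peer's steering tag */

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);  /* no kernel transition on the data path */
}

The point of the sketch is the shape of the slide’s diagram: the work request is posted to a QP from user space and its completion is reaped from a CQ, with the kernel agent involved only in setup.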
“Big Three” Wins for RDMA
• Accelerate legacy sockets apps
– User space sockets -> SDP -> RDMA
– Universal 25%-35% performance gain in Tier 2-3 application communication overhead
• Parallel commercial database
– <100 us latency needed to scale real-world apps
– Requires user space messaging and RDMA
• IP-based storage
– Decades-old block storage access model (iSCSI, SRP): Command / RDMA Transfer / Completion
– Emerging user space file access (DAFS, NFS, CIFS)
– Compaq experiment identified up to 40% performance advantage; first lab test beat a hand-tuned TPC-C run by 25%
Why IP versus IB?
• Ethernet hardware continues to advance: speed, low cost, ubiquity
• TCP protocol continues to advance
• Management and software tools
• Internet: worldwide trained staff
• World standards – power, phone, IP
RDMA Consortium (RDMAC)
• Formed in Feb 2002, went public in May 2002
• Founders were Adaptec, Broadcom, Compaq, HP, IBM, Intel, Microsoft, NetApp; added EMC and Cisco
• Open group with no fees, working fast and furious
• Deliverables include: framing, DDP, and RDMA specifications; Sockets Direct; SCSI mapping investigation
• Deliverables to be submitted to the IETF as informational RFCs
The Stack
• RDMA – converts RDMA Write, RDMA Read, and Send operations into DDP messages.
• DDP – segments outbound DDP messages into one or more DDP segments; reassembles one or more DDP segments into a DDP message.
• MPA – adds a backward marker at a fixed interval to DDP segments; also adds a length and CRC to each MPA segment.
• TCP – schedules outbound TCP segments and satisfies delivery guarantees.
• IP – adds necessary network routing information.
(a framing sketch follows the stack diagram below)
[Stack diagram, bottom to top: IP, TCP, MPA, DDP, RDMA]
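To make the MPA layer concrete, here is a rough C sketch of how a sender might frame one DDP segment (the ULPDU) into an MPA FPDU: a 16-bit length in front, padding to a 4-byte multiple, and a CRC32c at the end. The field sizes follow the MPA framing as it was later standardized; the backward markers inserted at fixed intervals in the TCP stream are omitted, and the CRC byte ordering shown is illustrative.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Frame one DDP segment (the ULPDU) as an MPA FPDU:
 *   [ 2-byte ULPDU length | ULPDU | pad to a 4-byte multiple | 4-byte CRC32c ]
 * 'out' must have room for ulpdu_len + 9 bytes. The FPDU is then handed to
 * TCP as ordinary stream data; marker insertion is omitted for brevity. */
static size_t mpa_frame(const uint8_t *ulpdu, uint16_t ulpdu_len, uint8_t *out)
{
    size_t off = 0;
    out[off++] = (uint8_t)(ulpdu_len >> 8);   /* ULPDU length, big-endian */
    out[off++] = (uint8_t)(ulpdu_len & 0xFF);
    memcpy(out + off, ulpdu, ulpdu_len);      /* the DDP segment itself */
    off += ulpdu_len;
    while (off % 4 != 0)
        out[off++] = 0;                       /* pad so the FPDU ends on a 4-byte boundary */
    uint32_t crc = crc32c(out, off);          /* CRC covers length, ULPDU, and pad */
    out[off++] = (uint8_t)(crc >> 24);
    out[off++] = (uint8_t)(crc >> 16);
    out[off++] = (uint8_t)(crc >> 8);
    out[off++] = (uint8_t)crc;
    return off;
}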
RDMA Architectural Goals
• Data transfer from the local to the remote system into an advertised buffer
• Data retrieval from a remote system to the local system from an advertised buffer
• Data transfer from the local to the remote system into a non-advertised buffer
• Allow the local system to signal completion to the remote system
• Provide reliable, sequential delivery from local to remote
• Provide multiple stream support
RDMAP Data Transfer Operations
• Send
• Send with Invalidate
• Send with Solicited Event (SE)
• Send with SE and Invalidate
• RDMA Write
• RDMA Read
• Terminate
(see the enum sketch below)
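As a quick reference for these operations, the hypothetical C enum below annotates what each one does; the names and ordering are illustrative and do not claim to be the wire encoding.

enum rdmap_opcode {
    RDMAP_SEND,               /* deliver into the next untagged (anonymous) receive buffer */
    RDMAP_SEND_INVALIDATE,    /* Send, then invalidate a steering tag (STag) at the receiver */
    RDMAP_SEND_SE,            /* Send with Solicited Event: may generate a completion event */
    RDMAP_SEND_SE_INVALIDATE, /* Solicited Event plus STag invalidation */
    RDMAP_WRITE,              /* place data directly into an advertised (tagged) buffer */
    RDMAP_READ,               /* pull data from the peer's advertised buffer */
    RDMAP_TERMINATE           /* report an error and tear down the stream */
};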
Direct Data Placement
• Segments contain placement information
– Relative address
– Record length
• Tagged buffers
• Untagged buffers
• Allows NIC hardware to access application memory (remote DMA)
• Can be implemented with or without TOE
(see the placement sketch below)
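To show how tagged placement lets the adapter land data directly in application memory, here is a hypothetical C sketch of the check a NIC (or a software emulation of one) might perform: the steering tag selects a registered buffer, the tagged offset says where the payload goes, and a bounds check preserves memory protection. The structure names and the registration table are invented for illustration.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One registered (advertised) application buffer: what "map memory" produces. */
struct reg_buffer {
    uint32_t stag;    /* steering tag advertised to the peer */
    uint8_t *base;    /* start of the pinned application buffer */
    size_t   length;  /* registered length, used for protection checks */
};

/* Hypothetical registration table; a real adapter keeps this state on-card. */
static struct reg_buffer reg_table[64];
static size_t reg_count;

/* Place one tagged DDP segment: data goes straight into the application
 * buffer at the given offset, with no intermediate buffering and no later
 * copy by the host CPU. */
static int ddp_place_tagged(uint32_t stag, uint64_t offset,
                            const uint8_t *payload, size_t len)
{
    for (size_t i = 0; i < reg_count; i++) {
        if (reg_table[i].stag != stag)
            continue;
        if (offset + len > reg_table[i].length)
            return -1;                 /* protection violation: reject the segment */
        memcpy(reg_table[i].base + offset, payload, len);
        return 0;                      /* segment landed in place */
    }
    return -1;                         /* unknown STag */
}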
RDMA over MPA/TCP Header Format
[Diagram (logical operation): an RDMA message (RDMA Read, RDMA Write, or Send, targeting an RDMA or anonymous buffer) is carried as a ULP PDU; the DDP segment adds DDP/RDMA header(s) ahead of the DDP/RDMA payload; MPA adds a marker and a length around the ULP payload; the result is TCP payload inside a TCP segment with its TCP header; the TCP segment rides in an IP datagram (IP header + IP data) inside an Ethernet frame (Ethernet header + data)]
Agenda
• Data Center Networking Today
• IP Convergence and RDMA
• The Future of Data Center Networking
Emerging Fabric Adoption: Two Customer Adoption Waves as Solutions Evolve
[Chart: adoption of InfiniBand and RDMA/TCP over time]
• Wave 1
– First fabric solutions available (InfiniBand)
– Fabric evaluation within data centers begins
• Wave 2
– Fabric computing pervasiveness
– IP fabric solutions become the leading choice for data center fabric deployments
– Leverage existing investment and improve infrastructure performance and utilization
– InfiniBand used for specialized applications
Ethernet Roadmap: Continued Ethernet Pervasiveness in the Datacenter
• Today’s Ethernet Infrastructure
• Improved Ethernet Performance and Utilization
– 1 Gigabit Ethernet
– TCP/IP offload & acceleration
– 10 Gigabit Ethernet
– iSCSI (storage over IP)
– Lights-out management (KVM over IP)
• Revolutionary IP Improvements & Advancements (interconnect convergence, scalability & performance, resource virtualization)
– RDMA/TCP fabrics
– IPsec security acceleration
hp Fabric Leadership: Bringing NonStop Technologies to Industry Standard Computing
[Diagram: robust, scalable computing technology & expertise plus high-volume knowledge feed fabric computing]
• Introduced the first switched fabric
• Drive fabric standards
• Breakthrough fabric economics
• Leading storage fabric
Fabrics Within Future Data Centers: Foundation for Future Adaptive Infrastructure Vision
Move from “tiers” to “elements”
• n-tier architecture, like DISA, replaced by element “pools” available over the fabric
• resource access managed by tools like ProLiant Essentials and hp OpenView
• centrally administered automation tools
Heterogeneous fabric “islands”
• data center fabric connecting “islands” of compute & storage resources
• RDMA/TCP enables practical fabric scaling across the datacenter
• protocol routers translate between islands
[Diagram: the internet enters through a firewall and routing switches to an edge router; a data center fabric of switches connects web servers, application servers, and database servers (compute fabric) with NAS, iSCSI SAN, and virtualized functions (storage fabric); IP-to-FC and IP-to-IB routers bridge to a Fibre Channel SAN and UNIX islands; management systems (hp OpenView, ProLiant Essentials, hp utility data center) provide provisioning, monitoring, resource management by policy, and a service-centric view]
Making the IP Fabric Connection