mission critical computing on x86 · 2018. 10. 29. · high application availability* minimal...
Mission Critical Computing on x86
Rob Kypriotakis
Datacenter Solutions Architect
Intel Corporation
Mission Critical Workloads
• 1,000s – 1,000,000+ online users
• Support large transactional databases
• 24 x 7 operation
• Enable all users
• Complex queries
• Multiple data sources
• Large data warehouse
• Large scalable enterprise databases
• No single point of failure
• Extremely fast operational speed
Transaction Processing Database Business Intelligence and Analytics
An Hour Of Downtime Can Mean
Millions In Lost Revenue
The Evolving Mission Critical Data Center
Today Long Term Mid Term
Infrastructure Silos Trapped in Legacy IT
Cloud Infrastructure for
Mainstream Enterprise
Dedicated Infrastructure for
Mission Critical Legacy
RISC
Legacy Mainframe
x86
Robust Cloud for All Workloads
Plans to adopt Cloud Infrastructure in the long term
Plans to scale resources for the deluge of data
Intel® Platforms Deliver Advanced Reliability
99.9976%
99.9973%
99.9971%
99.9962%
11X improvement
2.4X improvement
Source: ITIC Nov2011 Global Server Hardware &
Server OS Reliability Survey
Source: ITIC Jul2009 Global Server Hardware &
Server OS Reliability Survey
>4 nines
IT Managers Report comparable results between
IBM Power & x86 on unplanned downtime
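To put the availability percentages above in operational terms, the following is a minimal sketch that converts an availability figure into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; the slide does not state which percentage belongs to which platform, so the sketch treats them as plain numbers.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the expected minutes of downtime per year for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# The four availability figures quoted on the slide.
for pct in (99.9976, 99.9973, 99.9971, 99.9962):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} min/year")
```

Under these assumptions, 99.9976% availability corresponds to roughly 12.6 minutes of downtime per year, and 99.9962% to roughly 20 minutes.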
Why the large improvement in x86 server reliability results?
• Significant new OS releases with mission critical capabilities, including reliability
• Intel Xeon platform refreshes with advanced RAS
• Greater datacenter discipline for x86 servers (modeled after RISC)
• More x86-based mission critical deployments with higher uptime requirements
• Better IT tracking of x86 downtime
Causes of unplanned downtime
Operator error: 40%
Application failures: 40%
Hardware, OS & power: 20%
Source: Best Practices for Continuous Application
Availability, Gartner Data Center Conference 2008
Improvements due to a combination of New Platforms And Greater IT Focus
Tectonic Shift in Mission-Critical
Today Future
MISSION-CRITICAL
EPIC/RISC
x86
MAINFRAME
MISSION-CRITICAL
x86 EPIC/RISC
MAINFRAME
INDUSTRY STANDARD
MISSION-CRITICAL
ALLOWING GREATER FLEXIBILITY
STILL EXCEEDING
SLAs
Industry is innovating on Intel Architecture
Innovation first, and sometimes only on Intel Architecture… Follow the Innovation
Software Partners
HP Systems A few as example
Intel Xeon Processor E7 Family
Architected for scalable Windows and Linux performance with advanced reliability
Family of Mission Critical Processors
Intel Itanium Processor 9300 Series
Architected for Mission Critical UNIX with mainframe resiliency and scalability
Hardened OS OEM System
Capability
Application
Availability
OEM Service &
Support
Common Platform Strategy
Common Ingredients: Chipset | Interconnects | Buffers | Memory
Xeon Volume Economics to Itanium
Itanium RAS Capabilities to Xeon
OpenVMS
NonStop
Machine Check Architecture (MCA) Recovery
Supports recovery from otherwise fatal system errors*
Allows the software layers (OS, VMM, DBMS) to cooperate with the silicon layer to recover from uncorrectable data errors
Previously seen only in RISC, mainframe, and Itanium®-based systems
Silicon
Systems
Software
Integrated RAS Capabilities For Highly Available Deployments
*Errors detected using Patrol Scrub or Explicit Write-back from cache
Intel® Xeon® E7 Family RAS Philosophy
Repair Failing Data Connections
Recover From Uncorrectable
Errors
Minimize Planned Downtime Predict
Failures
Monitor
Heal
Detect & Correct Errors
Contain Uncorrected Errors
Continuous Self Monitoring and Self Healing
HW Un-correctable Errors
Machine Check Architecture Recovery How It Works
Normal Status With Error Prevention
System Recovery with SW
Error Corrected
Error Detected*
Error Contained
HW Correctable Errors Un-correctable Errors
System works in conjunction with OS,
VMM, or DBMS to recover or restart processes and
continue normal operation
Bad memory location flagged so data will not
be used by OS or applications
Error information passed to SW layer
MCA Recovery
*Errors detected using Patrol Scrub or Explicit Write-back from cache
Allows Recovery From Otherwise Fatal System Errors
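The flow on this slide (detect, contain, pass error information to software, then recover) can be sketched as a small state model. This is a toy illustration only, not Intel's implementation or any OS interface: the class name, the handler callback, and the returned status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set, Tuple


class ErrorKind(Enum):
    CORRECTABLE = "correctable"
    UNCORRECTABLE = "uncorrectable"


@dataclass
class McaRecoveryModel:
    """Toy model of the MCA Recovery flow described on the slide."""

    # Hypothetical software layer (OS, VMM, or DBMS). It receives the bad
    # address and returns True if it recovered or restarted the affected work.
    software_handler: Callable[[int], bool]
    flagged: Set[int] = field(default_factory=set)
    events: List[Tuple[int, str]] = field(default_factory=list)

    def handle(self, address: int, kind: ErrorKind) -> str:
        if kind is ErrorKind.CORRECTABLE:
            # Hardware corrects the error; software is not involved.
            self.events.append((address, "corrected"))
            return "normal operation"
        # Contain: flag the location so its data is not used again.
        self.flagged.add(address)
        self.events.append((address, "contained"))
        # Pass the error information to the software layer.
        if self.software_handler(address):
            self.events.append((address, "recovered"))
            return "normal operation"
        self.events.append((address, "unrecovered"))
        return "system restart required"
```

In this sketch an uncorrectable error only forces a restart when the software layer reports that it could not recover, which mirrors the slide's point that cooperation between silicon and software turns an otherwise fatal error into a contained one.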
Intel® Xeon® E7 Family Protection With Advanced Memory RAS
Repair Failing Data Connections
Recover From Uncorrectable Errors
Monitor
Heal
Detect & Correct Errors
Contain Uncorrected Errors
• ECC (cache, memory)
• Memory Address Parity Protection
• Memory Demand & Patrol Scrub
• Corrupt Data Containment Mode
• Memory Thermal Throttling
• Enhanced DRAM Double Device Data Correction (DDDC+1)
• Enhanced DRAM Single Device Data Correction (SDDC+1)
• Fine Grained Memory Mirroring
• Memory sparing & migration
• Intel® SMI Lane failover
• Intel® SMI Clock Fail Over
• Intel® SMI Packet Retry
• Machine Check Architecture (MCA) Recovery
• Failed DIMM Identification
• Memory Hot Add
• Corrected Machine Check Interrupt (CMCI) for Preventive Failure Analysis
Xeon E7 Extensive Memory RAS
Minimize Planned Downtime
Predict Failures
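The "Predict Failures" idea behind corrected-error reporting can be illustrated with a simple sliding-window counter: a component that produces many corrected errors in a short time is flagged for service before it fails outright. This is a sketch under stated assumptions, not the CMCI mechanism itself; the class name, the per-DIMM labels, the threshold, and the window length are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class CorrectedErrorMonitor:
    """Flag a DIMM when corrected errors within a time window reach a threshold."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        # One queue of recent error timestamps per DIMM label.
        self.recent: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, dimm: str, timestamp: float) -> bool:
        """Record one corrected error; return True if the DIMM should be flagged."""
        events = self.recent[dimm]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A monitor built with `threshold=3` and `window_seconds=60` would flag a DIMM on its third corrected error within a minute, while isolated errors spread over hours would never trigger it.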
Ensuring High Availability Eliminating Single Sources for Failure
Socket Redundancy & Failover • Dynamic OS Assisted Processor Socket Migration*
Memory Redundancy & Failover • Fine-Grained Memory Mirroring
• Intel® SMI Lane Failover
• Intel® SMI Clock Fail Over
• Intel® SMI Packet Retry
• Memory DIMM and Rank Sparing
• Dynamic Memory Migration
• Enhanced DRAM Double Device Data Correction (DDDC+1)
• Enhanced DRAM Single Device Data Correction (SDDC+1)
Intel® QPI Redundancy & Failover • QPI Self-Healing
• QPI Clock Fail Over
• Intel QPI Packet Retry
[Diagram: four Xeon® E7 sockets connected by Intel® QPI, with memory attached and two IOHs providing PCI Express* 2.0]
Intel® QPI = Intel® QuickPath Interconnect Intel® SMI = Intel® Scalable Memory Interconnect
Built-In Redundancy, Failover, & Self-Healing
Intel® QuickPath Interconnect (QPI) Self-Healing
Intel® QPI Self-Healing maintains system availability in the event of persistent interconnect errors
On detecting persistent errors the QPI port automatically reduces to half the current width and keeps operating at a reduced level
The system administrator sets the threshold at which to go into self-healing mode
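The self-healing behavior described above can be sketched as a counter with an administrator-set threshold: once persistent errors reach the threshold, the port halves its current width and keeps running. The class below is a hypothetical model for illustration; the width is expressed as a fraction of full width, and the minimum-width floor is an assumption that the slide does not specify.

```python
class QpiPortModel:
    """Toy model of QPI self-healing: halve the port width on persistent errors."""

    def __init__(self, error_threshold: int, min_width: float = 0.25) -> None:
        self.error_threshold = error_threshold  # set by the administrator
        self.min_width = min_width              # assumed floor, not from the slide
        self.width = 1.0                        # fraction of full width
        self.persistent_errors = 0

    def report_persistent_error(self) -> float:
        """Count one persistent error and return the resulting port width."""
        self.persistent_errors += 1
        can_halve = self.width / 2 >= self.min_width
        if self.persistent_errors >= self.error_threshold and can_halve:
            self.width /= 2
            self.persistent_errors = 0  # start counting again at the new width
        return self.width
```

With `error_threshold=3`, the first two persistent errors leave the port at full width and the third drops it to half width, so the link stays up at reduced bandwidth instead of failing.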
[Diagram: a QPI port between Socket / IOH / Node Controller endpoints, shown at full width and at half width, in a platform with IOH, DDR3, and PCI Express* 2.0 Technology]
Intel® Scalable Memory Interconnect (SMI) Lane Failover
Intel® SMI allows the memory interconnect to automatically failover and recover from partial link failures maintaining availability and performance
Intel® SMI provides an additional interconnect lane in each direction (memory write & read)
If a single lane failure is detected, the failed lane is automatically mapped out by the CPU and the spare lane is enabled
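Lane failover as described above amounts to swapping a failed lane for the one spare in that direction. The following is a minimal sketch of that bookkeeping for a single direction of the link; the class name and the lane count passed in are hypothetical, and the real mapping is done in hardware by the CPU.

```python
from typing import List, Optional


class SmiLinkDirection:
    """Toy model of one direction of an SMI link with a single spare lane."""

    def __init__(self, data_lanes: int) -> None:
        self.active: List[int] = list(range(data_lanes))  # lanes carrying traffic
        self.spare: Optional[int] = data_lanes            # one extra idle lane
        self.disabled: List[int] = []

    def fail_lane(self, lane: int) -> bool:
        """Map out a failed lane; return True if the spare absorbed the failure."""
        if lane not in self.active:
            raise ValueError(f"lane {lane} is not active")
        if self.spare is None:
            return False  # spare already used; this model cannot absorb more
        slot = self.active.index(lane)
        self.active[slot] = self.spare  # spare lane takes over the failed slot
        self.disabled.append(lane)
        self.spare = None
        return True
```

Because the number of active lanes is unchanged after the swap, the model reflects the slide's claim that a single lane failure is absorbed without losing link width.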
[Diagram, memory read example: the processor-to-memory-buffer link before failover (lane failure detected, spare lane idle) and after failover (failed lane disabled, spare lane enabled)]
Static / Physical Partitioning Xeon® E7 Reliability Features
Allows a system to be divided into multiple machines, each capable of running its own OS and applications, by enabling/disabling links
Max number of partitions is determined by the number of IOHs in the platform
Isolation among partitions is guaranteed by hardware
Repartitioning requires shutting down the system, reconfiguring, and a system-wide reboot
Requires no OS support, little or no FW support, and some BMC support
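The partitioning rules above can be expressed as a small validity check on a proposed partition plan. This is a sketch only: the function name and the plan layout are hypothetical, and the requirement that every partition owns at least one IOH is inferred from the statement that the IOH count limits the number of partitions.

```python
from typing import Dict, List, Set


def validate_partition_plan(
    cpus: Set[str], iohs: Set[str], plan: List[Dict[str, Set[str]]]
) -> List[str]:
    """Return a list of problems with a static partition plan (empty if valid)."""
    problems: List[str] = []
    if len(plan) > len(iohs):
        problems.append("more partitions than IOHs")
    used_cpus: Set[str] = set()
    used_iohs: Set[str] = set()
    for index, part in enumerate(plan):
        part_cpus, part_iohs = part["cpus"], part["iohs"]
        if not part_cpus:
            problems.append(f"partition {index} has no CPU")
        if not part_iohs:
            problems.append(f"partition {index} has no IOH")
        if not part_cpus <= cpus or not part_iohs <= iohs:
            problems.append(f"partition {index} names unknown hardware")
        # Hardware isolation: no CPU or IOH may belong to two partitions.
        if part_cpus & used_cpus or part_iohs & used_iohs:
            problems.append(f"partition {index} shares hardware with another partition")
        used_cpus |= part_cpus
        used_iohs |= part_iohs
    return problems
```

Since repartitioning needs a full shutdown and reboot, a check like this would run offline, before the new configuration is applied.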
[Diagram, shown twice: CPU1 to CPU4, four IOHs each with an ICH and attached IO, and a BMC hosting the Partition Manager]
Machine Check Architecture Recovery Extensible Error Recovery Architecture
Growing Software Industry Support for MCA Recovery
¹ Errors detected using Patrol Scrub or Explicit Write-back from cache
Uncorrectable data errors isolated and corrected by OS¹
Affected application may require restart
System remains up and running
Windows Server* 2008 R2
RHEL* 6
U8+
SLES11* SP1
Uncorrectable data error isolated to a single VM / guest OS¹
Affected VM may require restart
System and all other VMs remain up and running
Uncorrectable data error isolated within the DBMS buffer pool¹
Affected (non-critical) buffer is transparently reloaded from disk
System and DBMS remain up and running
HANA In-Memory DBMS
* Other names and brands may be claimed as the property of others
vSphere* 5.0
RHEL* 6 - KVM
Database Related Innovation on Intel® Xeon® Processor
Oracle* Exa-Series
• Portfolio of OLTP / OLAP Appliances
• Offered on Intel® Xeon® Processors
"Across the stack, increases of 50 percent in both core count and cache drive up performance on the Intel® Xeon ® processor
5600 series." – Marie-Anne Neimat, VP Embedded Databases, Oracle*
Exa-Series Portfolio
Exadata Exalogic
Exalytics
SAP HANA
• Appliance offered on Intel® Xeon® processor 7500 from key OEMs
• Instant response times to real-time events
“Intel and SAP, through joint engineering, have optimized SAP HANA…enabling greater business agility and innovative usage models that let customers respond to changing conditions in real time.” - Press Announcement, December 2010
[Chart: HANA, time in seconds (lower is better)]
Source: SAP HANA Benchmark Study
* Other names and brands may be claimed as the property of others. Copyright © 2012, Intel Corporation.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. Configurations: see Notes section. For more information go to http://www.intel.com/performance
System Performance1
IBM DB2 pureScale
• DB2 Mainframe clustering capability made available by IBM on Xeon based platforms!
“Two complementary advances make it possible to get high volume, low cost computing for online transaction processing: X5 servers based on the Intel® Xeon® Processor 7500 series and IBM pureScale”
Sal Vella, VP DB2 Development, IBM
Where Does Platform RAS Matter?
Individual System Resiliency: Lower to Higher
Basic Server RAS | Advanced Server RAS
High Application Availability
Minimal Application State
or Data
Cluster Deployment with
Mission Critical Servers
Redundant Deployment Model
Standard or Mission Critical Servers
Distributed Workload Centralized Workload
* 99.999%
System Availability
State is Critical
Data Intensive
Time Critical
Business Criticality and Application Type Determine RAS Requirements
Intel Xeon E7 Intel Xeon E5
Advanced Reliability Starts with the Right Silicon Intel® Xeon® Processor E7 Family Reliability, Availability, and Serviceability (RAS) Features
Repair Failing Data Connections
• Memory Thermal Throttling
• Enhanced DRAM Device Data Correction
• Fine-Grained Memory Mirroring
• Memory Sparing & Migration
Detect & Correct Errors
• ECC (cache, memory)
• Memory Address Parity Protection
• Memory Demand & Patrol Scrub
• Corrected Machine Check Interrupt (CMCI)
Enhance Responsiveness
• Intel® QuickPath Interconnect
• QPI Packet Retry
• QPI Protocol Protection via CRC
Increase Availability
• Machine Check Architecture (MCA) recovery
• Physical CPU Hot Add/Replace
• OS CPU On-lining
Minimize Planned Downtime
• Physical IOH Hot Add
• OS IOH On-lining
• PCI-E Hot Plug
Over 20 major features added in the last two processor generations …and committed to continue to focus on RAS
• Advanced error detection, correction, & containment
• Intel Machine Check Architecture Recovery (MCA Recovery)
• Partial memory mirroring
• Advanced error logging and management
• Control groups
• Advanced error reporting
• PCI hot plug
• Multipath I/O
• Hardware-based checksumming
• KVM hypervisor
Comprehensive Mission-Critical RAS
Reliability Availability Serviceability
Mission Critical Your Way
Expanding your mission critical systems with Intel and Red Hat
Future Direction
Strong Server Platform Roadmap Sustained Server Microprocessor Leadership
45nm: Penryn (Tick), Nehalem (Tock)
32nm: Westmere (Tick), Sandy Bridge (Tock)
22nm: Ivy Bridge (Tick), Haswell (Tock)
65nm: Tukwila (Tock)
32nm: Poulson (Tock)
22nm: Kittson (Tock)
Future
15nm
System Failures
Latest Intel Manufacturing Advances
Process node (year): 65nm (2005), 45nm (2007), 32nm (2009), 22nm (2011*), 15nm (2013*), 11nm (2015*), 8nm (2017*), beyond (2019+)
Phases shown: Manufacturing, Development, Research
*projected
Intel’s process roadmap visibility is out at least 10 years
Poulson: The Most Significant Intel Itanium® Processor
Key Highlights
• 2x the cores, 2x the instruction throughput
• >2x the performance, 2x memory density
• New RAS and performance features
• Still completely compatible with existing software; no recompilation required
Poulson continues Itanium advances and is on track for later this year
Database Solution Cost Comparison Intel® Xeon® E7 family vs. Power* with Oracle* database
Item | 8-socket DL980 | 8-socket IBM Power 770*
Hardware cost | $99,269 | $304,530
Oracle EE CPU licenses ($47,500 per license) | 80 cores x 0.5 = 40 core licenses | 64 cores x 1.0 = 64 core licenses
Oracle EE CPU cost | $1,900,000 | $3,040,000
Total Hardware/Software Acquisition Cost | $2.00M | $3.34M
Source: Hardware Cost- HP Alinean TCO tool (http://h71028.www7.hp.com/enterprise/us/en/migrate-to-hp/tco-challenge.html ); Software- Oracle.com
Intel solution provides ~40% lower Total Cost of Acquisition
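The arithmetic behind the ~40% figure can be checked directly from the numbers on the slide. The sketch below uses only the hardware costs, core counts, core factors, and per-license price quoted above; the function and variable names are made up for illustration.

```python
ORACLE_EE_LICENSE_USD = 47_500  # per-license price quoted on the slide


def acquisition_cost(hardware_usd: float, cores: int, core_factor: float) -> float:
    """Hardware cost plus Oracle EE licensing (cores x core factor x license price)."""
    licenses = cores * core_factor
    return hardware_usd + licenses * ORACLE_EE_LICENSE_USD


def relative_savings(cheaper: float, dearer: float) -> float:
    """Fraction by which the cheaper option undercuts the dearer one."""
    return 1.0 - cheaper / dearer


xeon_total = acquisition_cost(99_269, cores=80, core_factor=0.5)
power_total = acquisition_cost(304_530, cores=64, core_factor=1.0)
print(f"Xeon E7:   ${xeon_total:,.0f}")
print(f"Power 770: ${power_total:,.0f}")
print(f"Savings:   {relative_savings(xeon_total, power_total):.1%}")
```

The totals come to $1,999,269 and $3,344,530, which match the rounded $2.00M and $3.34M on the slide. Most of the gap comes from the licensing term, where the 0.5 core factor offsets the higher core count.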
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
• Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance
• Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Intel Turbo Boost Technology requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost
• Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.
• 64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.
• Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Copyright © 2012 Intel Corporation. All rights reserved.