Building a business critical system – technology, architecture, process

Larry Mead – CTO Platform Modernization Team – Microsoft
Rob Shiveley – Data Center – Intel
Scott Rosenbloom – Platform Strategy – Microsoft

SESSION CODE: WSV202


Session Objectives and Takeaways

Session Objectives:
• Definition of Mission Critical
• Windows Server 2008 R2 support for Mission Critical in conjunction with Intel technology
• Considerations at the Application and Tools for Mission Critical

Key Takeaways:
• Positioning Windows Server and SQL Server for mission critical
• Guidance on mission critical considerations


Mission Critical - Defined

Mission Critical*: Vital function (such as production and sales) without which a firm cannot operate or remain viable. If a critical business function is interrupted, a firm could suffer serious financial, legal, or other damages or penalties.

* Business Dictionary: http://www.businessdictionary.com/definition/critical-business-function.html

System attributes to support Mission Critical

• Reliable – system is tolerant of various component failures
• Available – application is accessible across system outages
• Serviceable – systems are monitored, self-correct, and notify when necessary
• Scalability / Performance – systems can scale to the needs of the business while maintaining consistent and

“We’ve enjoyed faultless performance with Windows Server and SQL Server.” “We haven’t had any unscheduled downtime.”
- Dr. Elnaggar, IT Director at Bavarian Auto Group.


Mission Critical Application

DEMO


Hardware and Operating System


Combined Power of Windows Server 2008 R2, SQL Server 2008 R2 and Intel Xeon 7500 is Mission Critical

Intel and Microsoft delivering together

Scale-up and scale-out capabilities
• Windows Server clustering
• Hyper-V Virtualization

Business continuity and manageability
• Multi-site management
• Enterprise class error checking and recovery

And …


Synergy of Windows Server 2008 R2 + Intel Xeon 7500

Power Management
• Timer coalescing
• Tick skipping
• Core parking
• Report power consumption to OS via ACPI
• Accessible via WMI (reading/writing of power plans – active plan can be changed remotely)

Virtualization
• SLAT
• VMQ
• Jumbo Frames
• Intel VT

Scalability
• 256 Logical Processors
• Turbo Boost
• QuickPath
• 16 MB L3 Cache (7400)
• Multi-site manageability

RAS
• Memory Mirroring – writes to 2 locations to compensate for DRAM failure
• Memory Sparing – predicts a failing DIMM and copies data to a spare DIMM
• I/O Hot plug
• MCA Recovery
• WHEA – root cause
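The memory mirroring item in the RAS list can be pictured with a toy model. The sketch below is only an illustration in Python, not how the Xeon 7500 memory controller is implemented: the class name, the `inject_failure` hook, and the list-backed "banks" are all assumptions made for the example.

```python
class MirroredMemory:
    """Toy model of memory mirroring: every write lands in two banks."""

    def __init__(self, size):
        self.primary = [0] * size
        self.mirror = [0] * size
        self.failed = set()  # addresses whose primary copy is unreadable

    def write(self, addr, value):
        # A mirrored write updates both copies.
        self.primary[addr] = value
        self.mirror[addr] = value

    def inject_failure(self, addr):
        # Hypothetical test hook: simulate a DRAM failure in the primary bank.
        self.failed.add(addr)

    def read(self, addr):
        # Fall back to the mirror when the primary copy has failed.
        if addr in self.failed:
            return self.mirror[addr]
        return self.primary[addr]
```

A read after a simulated failure still returns the last written value, which is the property the slide describes as compensating for DRAM failure.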


Machine Check Architecture - Recovery

Video


Built-In Redundancy & Failover Throughout the Platform

Socket Redundancy & Failover
• Dynamic OS Assisted Processor Socket Migration*
• Electronically Isolated (Static) Partitioning

Memory Redundancy & Failover
• Inter-socket Memory Mirroring
• Intra-socket Memory Mirroring
• Intel® SMI Lane Failover
• Intel® SMI Clock Fail Over
• Intel® SMI Packet Retry
• Memory DIMM and Rank Sparing
• Dynamic Memory Migration
• Fail Over from Single DRAM Device Failure (SDDC)
• Recovery from Single DRAM Device Failure (SDDC) plus random bit error

Intel® QPI Redundancy & Failover
• QPI Self-Healing
• QPI Clock Fail Over
• Intel QPI Packet Retry

[Diagram: four NHM-EX sockets connected by Intel® QPI, with Memory, two IOHs (PCI Express* 2.0), and ICH10]

Intel® QPI = Intel® QuickPath Interconnect

Intel® SMI = Intel® Scalable Memory Interconnect


Machine Check Architecture Recovery

Allows Recovery From Otherwise Fatal System Errors

Normal Status With Error Prevention

First Machine Check Recovery in Xeon®-based Systems

*Errors detected using Patrol Scrub or Explicit Write-back from cache

Previously seen only in RISC, mainframe, and Itanium-based systems

[Diagram: DDR3 memory (REG and DRAM devices) attached through SMB to SMI]

HW Correctable Errors
• Error Corrected

Un-correctable Error
• Error Detected* – Patrol Scrubber scans memory for errors
• Error Contained – Bad memory location flagged so data will not be used by OS or applications
• System Recovery with OS – Error information passed to OS / VMM. System works in conjunction with OS or VMM to recover or restart processes and continue normal operation


Windows Hardware Error Architecture

Introduced in Windows Server 2008*

• Better root cause analysis
  – Error reporting via common error record format, richer data content (e.g. FRU info)
  – Platform and the OS flows are well integrated which allows both to contribute information to the log
• Better support for hardware error recovery
  – Built in infrastructure for error injection
  – Platform Specific Hardware Error Driver (PSHED) Plugins allow for platform participation in error recovery
• Error avoidance with health monitoring
  – Allows for applications to register for hardware error event notification
  – PFA apps can be used to monitor platform health
• WHEA enhancements on Intel® Architecture in Windows Server 2008* R2
  – Support for Nehalem-EX MCA recoverable errors
  – Corrected Machine Check Interrupt (CMCI) error handling support

Intel® server processors codename Nehalem-EX


MCA Recovery: Explicit Write Back Error

[Diagram: CPU with Cores (Core0 to Core7) and UnCore (LLC, Memory Controller, Link) attached to Memory. Labels: Error detected, WB Data, Poison Tag, New Data]

• EWB Error detected; data stored with poison bit
• Log the error
  – MCi_Status.Valid = 1
  – MCi_Status.EN = 1 (Error enabled)
  – MCi_Status.UC (uncorrected error) = 1
  – MCi_Status.PCC (Process context corrupt) = 0
  – MCi_Status.OVER (overflow) = 0
  – MCi_Status.MCA_error_codes indicates which error is detected
  – MCG_Status.RIPV = 1
  – MCi_Status.ADDRV = 1
  – MCi_Status.MISCV = 1
  – MCi_Status.MSCOD = poison (model specific)
• Broadcast MCE to all threads
• System Software recovers the error
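The status fields logged on this slide can be decoded from a raw register value. The following Python sketch assumes the bit positions published for IA32_MCi_STATUS and IA32_MCG_STATUS in the Intel Software Developer's Manual; the function names and the `looks_recoverable` rule are hypothetical and simply mirror the field values listed on the slide, not the actual Windows or WHEA decision logic.

```python
# Bit positions assumed from the Intel SDM layout of IA32_MCi_STATUS.
VAL, OVER, UC, EN, MISCV, ADDRV, PCC = 63, 62, 61, 60, 59, 58, 57
RIPV = 0  # bit 0 of IA32_MCG_STATUS


def bit(value, pos):
    """Return the single bit of value at position pos."""
    return (value >> pos) & 1


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into the fields named on the slide."""
    return {
        "Valid": bit(status, VAL),
        "OVER": bit(status, OVER),
        "UC": bit(status, UC),
        "EN": bit(status, EN),
        "MISCV": bit(status, MISCV),
        "ADDRV": bit(status, ADDRV),
        "PCC": bit(status, PCC),
        "MSCOD": (status >> 16) & 0xFFFF,     # model-specific error code
        "MCA_error_code": status & 0xFFFF,    # architectural error code
    }


def looks_recoverable(mci_status, mcg_status):
    """Illustrative rule: matches the combination of values shown on the slide."""
    f = decode_mci_status(mci_status)
    return bool(
        f["Valid"] and f["EN"] and f["UC"]
        and not f["PCC"] and not f["OVER"]
        and f["ADDRV"] and f["MISCV"]
        and bit(mcg_status, RIPV)
    )
```

With Valid, UC, EN, MISCV and ADDRV set, PCC and OVER clear, and MCG_Status.RIPV set, `looks_recoverable` returns True; setting PCC makes it return False, since the processor context can then no longer be trusted.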


MCA Predictive Failure Notification

[Diagram: CPU with Cores (Core0 to Core7) and UnCore (LLC, Memory Controller, Link) attached to Memory. Labels: Memory Error detected, New Data]

1. Memory Error is Detected and Corrected
2. Corrected Error Count is Incremented
3. Error Count Exceeds Threshold
4. Uncore Issues CMCI to the OS Handler

Example: OS Initiates Fail-over to Spares
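The four numbered steps can be sketched as a small counter with a callback. This is a minimal Python illustration under stated assumptions: `CorrectedErrorMonitor`, `make_sparing_handler`, the per-DIMM dictionary, and the reset of the count after notification are hypothetical, and the callback only stands in for the OS handler that receives the CMCI.

```python
class CorrectedErrorMonitor:
    """Counts corrected errors per DIMM and notifies a handler past a threshold."""

    def __init__(self, threshold, on_cmci):
        self.threshold = threshold
        self.on_cmci = on_cmci  # stands in for the OS CMCI handler
        self.counts = {}

    def record_corrected_error(self, dimm):
        # Steps 1 and 2: the error was already corrected; increment the count.
        self.counts[dimm] = self.counts.get(dimm, 0) + 1
        # Steps 3 and 4: once the count exceeds the threshold, notify the handler.
        if self.counts[dimm] > self.threshold:
            self.on_cmci(dimm, self.counts[dimm])
            self.counts[dimm] = 0  # assumed: start counting afresh afterwards


def make_sparing_handler(spares, remapped):
    """Build a handler that fails a suspect DIMM over to the next free spare."""
    def handler(dimm, count):
        if spares:
            remapped[dimm] = spares.pop(0)
    return handler
```

With a threshold of 2, the third corrected error on the same DIMM triggers the handler, which records the fail-over to a spare, as in the slide's example.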


Software Error Recovery Motivation

Ability to isolate uncorrectable errors and achieve containment
– Allows for OS to terminate/restart an application mapped to that address or the VMM to terminate the guest OS
– System remains active running other applications or guest OSs
– Increase the system up time (RAS); important requirement for servers

These errors are detected ahead of software consumption
– Provide software an opportunity to attempt to recover from an uncorrected error before the error brings down the machine

Potential error recovery cases
– Uncorrected errors detected outside of program execution have potential for error recovery
  - e.g. DRAM patrol scrubber

Potential to extend architecture capability in future to cover cases where software consumes erroneous data
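The containment idea on this slide, terminating only the application or guest mapped to the bad address while everything else keeps running, can be sketched in a few lines of Python. The `page_owners` mapping and the function name are hypothetical bookkeeping invented for the illustration; a real OS or VMM tracks page ownership very differently.

```python
def contain_poisoned_page(page_owners, poisoned_page):
    """Return (terminated, survivors) for one poisoned physical page.

    page_owners maps a physical page number to the name of the process or
    guest that has it mapped (a hypothetical structure for this sketch).
    """
    victim = page_owners.get(poisoned_page)
    all_owners = set(page_owners.values())
    if victim is None:
        # The page is unused, so no software would consume the bad data.
        return set(), all_owners
    # Only the owner of the poisoned page is terminated or restarted.
    return {victim}, all_owners - {victim}
```

For owners `{0: "db", 1: "web", 2: "db"}` and a poisoned page 1, only `"web"` is terminated and `"db"` keeps running, which is the uptime benefit the slide describes.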


Application & System Tools


Microsoft Virtualization for Server Applications

[Diagram labels: Virtualization Platform; Mission Critical Applications; Management Platform; Business Applications; Enterprise Applications; Line Of Business (LOB); Custom Applications; Microsoft Server Applications; Database; Communication; Collaboration; Hyper-V™]

Microsoft Virtualization = Windows Server 2008 R2 Hyper-V + System Center


Virtual Memory & Second-Level Translation

• With Virtualization an additional level of mapping is required
• Second Level Address Translation (SLAT) provides the extra translation into Virtual Machine address spaces
• Performance advantage over non-enabled CPUs

Physical Memory Pages

The Virtual / Process view The Physical / real view

Virtual Machine 1

Hypervisor

Virtual Machine 1

Virtual Machine 3

Operating System
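The extra level of mapping that SLAT provides can be shown as two chained lookups: the guest OS maps a guest virtual address to a guest physical address, and the hypervisor's table maps that to a host physical address. The Python sketch below is a simplification under stated assumptions: the flat dictionaries, the 4 KB page size, and the function names are illustrative only, whereas real SLAT hardware walks multi-level page tables without software involvement.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(page_table, address):
    """Map an address through one flat page table (page number -> frame number)."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"page {page} is not mapped")
    return page_table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(guest_table, slat_table, guest_virtual):
    """Chain the guest OS mapping with the hypervisor's second-level mapping."""
    guest_physical = translate(guest_table, guest_virtual)  # first level
    return translate(slat_table, guest_physical)            # second level
```

For example, with `guest_table = {0: 5}` and `slat_table = {5: 42}`, guest virtual address `0x123` resolves to host physical address `42 * 4096 + 0x123`. Without hardware support the hypervisor has to maintain this combined mapping in software, which is where the performance advantage noted on the slide comes from.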


Second Level Address Translation

DEMO


What Makes a System Mission Critical?


Scale Up Configuration

• LOB Apps / Windows Server AppFabric
• SQL Server 2008 R2 / Windows Server 2008 R2
• Fiber Optic channel to SAN
• SAN for SQL and Files


Scale Out Configuration

• HP BL465: CICS COBOL apps / Micro Focus Server EE / Windows Server 2008
• HP BL465: CICS COBOL apps / Micro Focus Server EE / Windows Server 2008
• SQL Server 2008 R2 / Windows Server 2008 R2
• LOB Apps / Windows Server AppFabric / Windows Server 2008 R2
• Dual Gigabit Ethernet on PCIe bus
• Fiber Optic channel to SAN
• SAN based for SQL and Files


Scale Out Virtualized

• SQL Server 2008 R2 / Windows Server 2008 R2
• Windows Server 2008 R2 Hyper-V Virtualization Server
  – LOB Apps / App Server / Windows Server 2008 R2
  – LOB Apps / App Server / Windows Server 2008 R2
  – LOB Apps / App Server / Windows Server 2008 R2
• Dual Gigabit Ethernet on PCIe bus
• Fiber Optic channel to SAN
• SAN based SQL and files


Core Parking & SQL Scale Up

DEMO


Operational Best Practices

Operations practices based on Information Technology Infrastructure Library (ITIL) / Microsoft® Operations Framework (MOF)
• Change management
• Incident management
• Problem management

Dedicated Service Operations Center (SOC)
• Focused on BPO
• Experts in online collaboration services

Dedicated service administration team

ISO 27001 aligned operational procedures


Hardware Provisioning

Deployment, Patching and State Mgmt

Virtual Workload Provisioning

Mobile Device Management

Performance and Health Monitoring

Backup & Disaster Recovery


The Microsoft Platforms are Mission Critical Today

SunGard
• 1024-Core Computing Grid running Windows Server 2008 and SQL Server 2008
• Asset Liability management (ALM) - Near Linear scalability

bwin
• 30,000 Transactions per Second at peak
• 1 Million bets per day
• 100 Terabytes of data

Siemens
• PLM system supports 5,000 concurrent users
• Gained 50% of space through compression

SunGard - http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000006391
bwin - http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004138
Siemens - http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004826


Mission Critical Wrap-up

• Windows Server 2008 R2 and SQL Server 2008 R2 are mission critical
• Hardware partners provide scale-up and resilient platform
• Windows Server + Intel Xeon 7500 can detect and recover from hardware errors
• Democratizing Mission Critical


Related Content

Breakout Sessions
• Deploying, Virtualizing, and Managing Linux and UNIX with Hyper-V
• Manage Your Enterprise from a Single Seat: Windows PowerShell Remoting
• Living in a Mixed Environment: Integrating Your Heterogeneous Infrastructure
• Building a Business Critical System: Technology, Architecture, and Process

Interactive Sessions
• Next Generation VDI with Microsoft RemoteFX
• Lighting Up Nehalem EX with Windows Server 2008 R2

Hands-on Labs
• Implementing High Availability

Product Demo Stations
• Windows Server 2008 R2 Failover Clustering


Resources

• Sessions On-Demand & Community: www.microsoft.com/teched
• Microsoft Certification & Training Resources: www.microsoft.com/learning
• Resources for IT Professionals: http://microsoft.com/technet
• Resources for Developers: http://microsoft.com/msdn


Complete an evaluation on CommNet and enter to win!


Sign up for Tech·Ed 2011 and save $500 starting June 8 – June 31st

http://northamerica.msteched.com/registration

You can also register at the North America 2011 kiosk located at registration

Join us in Atlanta next year


© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


JUNE 7-10, 2010 | NEW ORLEANS, LA