
  • IBM Systems Group

    9/29/2005 © 2004 IBM Corporation

    Greg Rodgers, Peter Morjan
    Sept 27, 2005

    MareNostrum Training

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 2)

    Agenda

    Date: Tuesday, Sept 27
      9:30-11:00   Blade Cluster Architecture, JS20 Overview, MareNostrum Layout   (Greg Rodgers)
      11:30-1:00   Network Overview and Linux Services                             (Greg Rodgers)
      1:00-2:30    LUNCH
      2:30-4:00    Storage Subsystem                                               (Greg Rodgers)
      4:30-6:00    DIM and Image Management                                        (Greg Rodgers & Peter Morjan)

    Some detail on these charts will be added during class. Final charts will be available after class.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 3)

    High-Capacity Multi-Network Linux Cluster Model

    [Diagram: a multi-purpose, multi-user supercomputer built from POWER servers connected by a
    Myrinet high-speed fabric, a reliable Gigabit network, and a service LAN.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 4)

    Multiple Networks in BladeCenter Clusters

    3 networks in the BladeCenter cluster architecture:
      - Service network: out-of-band systems management
      - Reliable gigabit network: global access, net boot, image service, and GPFS
      - High-speed fabric: distributed-memory applications (e.g. MPI) and optional IO

    Features:
      - The out-of-band service network gives physical security; the systems management network is
        isolated from users. BladeCenters are controlled by SNMP commands on the service network.
      - The cluster can be brought up without the high-speed fabric; the highly reliable GbE network
        helps to diagnose and recover from complex high-speed fabric issues.
      - GbE bandwidth is sufficient for the root file system, which allows diskless image management.
      - Independent IO traffic: heavy file IO won't impact a concurrent MPI user.
      - A 2nd gigabit interface is available for expansion.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 5)

    The MareNostrum Blade Cluster

    [Diagram: 172 BladeCenters (2406 blades) connected to a 2560-port Myrinet 2000 switch, a FORCE10
    gigabit network, and a service LAN; p615 servers front 20 DS4100 storage nodes.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 6)

    BladeCenter System Management Methodology

    [Diagram: the cluster management server connects over Ethernet to the Management Module in each
    BladeCenter chassis. The Management Module reaches each processor blade's service processor
    (VPD, LEDs, voltage, temperature, CPU I2C interface, flash update) and also manages the chassis
    components: Ethernet switch module, CD-ROM/floppy, control panel, blowers, and power. Redundant
    system components are not shown.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 7)

    JS20 Blades, BladeCenter and Compute Racks

    [Diagram: peak performance per level — one JS20 blade = 17.6 GF, one BladeCenter (14 blades)
    = 246 GF, one compute rack (6 BladeCenters, 84 blades) = 1.48 TF.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 8)

    New Technologies used in MareNostrum

    - IBM advanced semiconductor technology (CMOS10S, 90nm): high speeds at low power
    - 2.2GHz PowerPC 970FX processor: industry-leading 64-bit commodity processor; record
      price/performance in HPC workloads
    - IBM BladeCenter integration: record cluster density; improved cluster operating efficiency
      (power, space, cooling); speed of installation
    - IBM e1350 support: provides cluster-level testing, integration, and fulfillment
    - High-density Myrinet interconnect: significant reduction in switching hardware; MPI performance
      that scales
    - High-density gigabit switch w/ 48-port linecards
    - Enterprise scale-out FAStT IBM storage (TotalStorage 4100) with GPFS on 2000+ nodes: reliable
      and scalable global-access filesystem
    - Linux 2.6: enterprise and performance features to exploit the POWER architecture (VMX, large
      pages, modular boot)
    - Diskless node capability: improved node reliability; reduced installation and maintenance
      costs; flexibility to change node personality

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 9)

    Agenda

    Date: Tuesday, Sept 27
      9:30-11:00   Blade Cluster Architecture, JS20 Overview, MareNostrum Layout   (Greg Rodgers)
      11:30-1:00   Network Overview and Linux Services                             (Greg Rodgers)
      1:00-2:30    LUNCH
      2:30-4:00    Storage Subsystem                                               (Greg Rodgers)
      4:30-6:00    DIM and Image Management                                        (Greg Rodgers & Peter Morjan)

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 10)

    Anatomy of a Blade

    [Board diagram: CPU 1, CPU 2, DIMM 1-4, battery, buzzer, switches SW3/SW4, connector to the I/O
    expansion (daughter) card, DRIVE 1, DRIVE 2, and connectors to the front panel and the midplane.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 11)

    JS20 Blade

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 12)

    JS20 Processor Blade

    - 2-way PPC970 SMP
    - Northbridge with memory controller and HyperTransport I/O bus
    - AMD HyperTransport tunnel to PCI-X
    - AMD southbridge
    - 2 or 4 DIMMs (up to 4-8 GB)
    - BladeCenter service processor
    - 2x 1Gb Ethernet on board, PCI-X attached (Broadcom)
    - Optional additional IO daughter card, PCI-X attached: 2x 1Gb Ethernet (Broadcom), or
      2x 2Gb FibreChannel (QLogic), or Myrinet
    - Single-wide blade; 14 blades per chassis; 84 servers (168 processors) in a 42U rack

    [JS20 blade logic diagram (Baier/Lichtenau, Nov 15 2002): two PPC 970 processors and DDR DIMMs on
    the U3 northbridge; 16-bit HT to the AMD 8131 HyperTransport PCI-X tunnel (BCM5704S gigabit
    Ethernet and optional Fibre Channel or external gigabit on PCI-X); 8-bit HT to the AMD 8111
    HyperTransport I/O hub (flash, Super I/O, USB keyboard/mouse, USB FDD/CD-ROM, IDE drives, NVRAM,
    serial port); Hawk service processor on SMBus and RS-485; VRM; HDM connectors to the midplane.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 16)

    VMX vs MMX/SSE/SSE2

    VMX
    - 32 x 128-bit VMX registers
    - No interference with FP registers
    - No context or mode switching
    - Max. throughput: 8 Flops/cycle
    - Element types: char, short, int, float

    MMX / SSE / SSE2
    - 8 x 128-bit SSE registers plus 8 x 64-bit MMX registers
    - MMX registers == FP registers; MMX stalls FP
    - Context switching required for MMX
    - Max. throughput: 2 Flops/cycle
    - Element types: char, short, int, long, float, double

    Much more about VMX on Friday.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 17)

    Agenda

    Date: Tuesday, Sept 27
      9:30-11:00   Blade Cluster Architecture, JS20 Overview, MareNostrum Layout   (Greg Rodgers)
      11:30-1:00   Network Overview and Linux Services                             (Greg Rodgers)
      1:00-2:30    LUNCH
      2:30-4:00    Storage Subsystem                                               (Greg Rodgers)
      4:30-6:00    DIM and Image Management                                        (Greg Rodgers & Peter Morjan)

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 18)

    MareNostrum Rack Summary

    34 xSeries e1350 racks:
      - 29 compute racks (RC01-RC29): 171 BC chassis w/ OPM & Gb ESM; 2406 JS20+ nodes w/ Myrinet card
      - 1 gigabit network rack (RN01): 1 Force10 E600 for the Gb network; 4 Cisco 3550 48-port switches
      - 4 Myrinet racks (RM01-RM04): 10 Clos 256+256 Myrinet switches; 2 Myrinet Spine 1280s

    8 pSeries 7014-T42 racks:
      - 1 operations rack (RH01): 1 7316-TF3 display; 2 p615 mgmt nodes; 2 HMC 7315-CR2;
        3 Remote Async Nodes; 3 Cisco 3550 (installed on site); 1 BCIO (installed on site)
      - 7 storage server racks (RS01-RS07): 40 p615 storage servers; 20 FAStT100 controllers;
        20 EXP100 expansion drawers; 560 250GB SATA disks

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 19)

    27 BladeCenter 1350 xSeries racks (RC01-RC27)

    Box summary per rack: 6 BladeCenter chassis (7U each)

    Cabling, external:
      - 6 10/100 cat5 from MM
      - 6 Gb from ESM to E600
      - 84 LC cables to Myrinet switch
    Cabling, internal:
      - 24 OPM cables to 84 LC cables

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 21)

    MareNostrum Rack Names

    [Floor plan showing rack placement (front/back marked): storage racks RS01-RS07 and the
    operations rack RH01 in one row; compute racks RC01-RC07 with the network rack RN01 in the next;
    RC08-RC11 with the four Myrinet racks RM01-RM04; and the remaining compute racks RC12-RC27 in the
    back rows. Legend: Blade Centers, Myrinet switches, storage servers, operations rack and display,
    gigabit switch, 10/100 Cisco switches.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 23)

    MareNostrum Logical Names

    [Floor plan with logical names overlaid on the rack layout of the previous chart (RS01-RS07,
    RH01, RN01, RM01-RM04, RC01-RC27). Each BladeCenter chassis is labeled sNNcM together with its
    management module (mm), where sNN (s01 through s41) identifies the p615 image/storage server
    serving that chassis and cM the chassis number within that group (e.g. s01c1 ... s41c3). Myrinet
    switches carry mcN (Clos) and msN (spine) names; the gigabit switch is e600, the 10/100 switches
    cisco1-cisco5, and the HMCs hmc1 and hmc2. Legend: Blade Centers, Myrinet switches, storage
    servers, operations rack and display, gigabit switch, 10/100 Cisco switches.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 24)

    MareNostrum Scaled Floor Plan v14

    31 x 11 tiles (60cm x 60cm); 18.6m x 6.6m = 123 sq m; 18.6m x 8.2m = 153 sq m (including AC)

    [Scaled floor plan: the five rack rows (Row 1-5) with front/back orientation marked, the E600 and
    Cisco switch positions, and the back door to the loading dock. Legend: Blade Centers, Myrinet
    switches, storage servers, operations rack and display, gigabit switch, 10/100 Cisco switches.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 26)

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 27)

    1 operations pSeries rack (RH01)

    Box summary:
      - 1 display
      - 2 HMC
      - 2 p615
      - 3 16-port Remote Async Nodes (RAN#0-2)
      - BCIO (manually installed)
      - 3 Cisco 3550 (manually installed)

    Cabling, external:
      - 2 Gb for p615 to E600
      - 40 serial lines from RAN#0-2 to p615s
      - 8 Gb for BCIO to E600
      - 40 cat5 from p615s to cisco
      - 4 cat5 uplinks from ciscos in RN01
    Cabling, internal:
      - HMC to RAN#0; RAN#0 to RAN#1; RAN#1 to RAN#2; 2 p615s to RAN#0
      - KVM display to HMC
      - p615s cat5 to cisco; BCIO MM cat5 to cisco; 2 cat5 uplinks from cisco to cisco

    Note: one of the p615s in this operations rack will do diskless image support for 3 BladeCenters.

    [Rack elevation RH01: p615 (4U) x2, HMC 7135-C02 (4U), display (1U), 3 RANs serial mux, backup
    HMC 7135-C02, BCIO, BladeCenter (7U). Final placement subject to on-site analysis.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 29)

    Agenda

    Date: Tuesday, Sept 27
      9:30-11:00   Blade Cluster Architecture              (Greg Rodgers)
      11:30-1:00   Network Overview and Linux Services     (Greg Rodgers)
      1:00-2:30    LUNCH
      2:30-4:00    Storage Subsystem                       (Greg Rodgers)
      4:30-6:00    DIM and Image Management                (Greg Rodgers & Peter Morjan)

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 30)

    The MareNostrum Blade Cluster

    [Diagram: 172 BladeCenters (2406 blades) connected to a 2560-port Myrinet 2000 switch, a FORCE10
    gigabit network, and a service LAN; p615 servers front 20 DS4100 storage nodes.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 31)

    BladeCenter I/O Switch Flexibility

    Ethernet switch options: Layer 2/3 Cisco switch, Layer 3 Nortel switch, Layer 3/7 Nortel switch,
    D-Link switch, Pass-Thru Module, Optical Pass-thru.

    [Diagram also shows the connection to the service network.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 32)

    MareNostrum Networks

    - Gigabit Network
    - Myrinet Network
    - Service Network
    - Serial Network — the p615s' remote management network

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 33)

    MareNostrum Networks

    [Network diagram; labels not recoverable from the transcript.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 34)

    1 network xSeries 1350 rack (RN01)

    Box summary:
      - 1 FORCE10 E600 (16U)
      - 4 Cisco 3550

    Cabling, external (VERY HEAVY — 390 total cables):
      - 162 Gb from BCs in compute racks to E600
      - 8 Gb from BCIO to E600
      - 42 Gb from p615s to E600
      - 163 cat5 from MMs to cisco
      - 12 cat5 from Myrinet switches to cisco
      - 3 cisco uplinks to cisco in RH01
      - Future option: 42 fiber GigE cables to an E600 fiber card
    Cabling, internal:
      - 1 10/100 from the Force10 service port to Cisco

    [Rack elevation RN01: FORCE10 E600 (16U), free (24U).]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 36)

    Gigabit Network: Force10 E600

    - Interconnection of BladeCenters; used for system boot of every BladeCenter
    - 212 internal network cables: 170 for blades, 42 for file servers
    - 76 ports available for external links

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 38)

    MareNostrum Network Review

    [Diagram: 172 BladeCenters (2406 blades) connected to a 2560-port Myrinet 2000 switch, a FORCE10
    gigabit network, and a service LAN; p615 servers front 20 DS4100 storage nodes.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 39)

    Myrinet Switch Internals

    LED diagnostic display

    PPC Linux Diagnostic Module

    14U Aluminum chassis with handles

    Integrated quad-ported spine slots 4x64

    16x16 host port slots

    Front to rear air flow

    Hot swap redundant power supply

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 40)

    Myrinet Switch Cabling

    - 126 host cables per side, from a full 84-blade rack bundle and half of a rack bundle. Call
      these H84B and H42B bundles. Each switch manages three racks.
    - 64 quad cables routed vertically upward to the spine from the 4 center cards. Call this bundle
      a Q64B. There are 10 Q64Bs.
    - Avoid blocking the air intake at the bottom. Worst case is blockage by 2 Q64Bs in the top
      switch. Ensure enough slack to swap the middle power supplies.
    - The LCD display will not be blocked.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 41)

    4 Myrinet xSeries 1350 racks (RM01-RM04)

    Box summary:
      - 10 Clos256x256 switches (14U each)
      - 2 Spine 1280 (14U each)

    Cabling, external:
      - 12 10/100 cat5
      - 2364 LC cables
    Cabling, internal:
      - 640 quad spine cables (over top)

    [Rack elevations RM01-RM04: 10 Clos256x256 switches and 2 Spine 1280s, three 14U units per rack.]

    Complex Myrinet cabling is covered in more detail in the next charts.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 42)

    Myrinet Spine Cabling

    10 Q64B bundles from the center 4 cards in the switches provide 640 quad cables at the top of the
    Myrinet racks, to be redistributed into 4 Q160B bundles. Cables are 5m.

    [Rack elevation diagram repeating the Clos256x256 and Spine 1280 units, with the Q64B and Q160B
    bundles marked across the rack tops.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 43)

    Myrinet Cable Bundle Summary

    - 640 quad 5m interswitch cables (Q64B/Q160B bundles)
    - 2364 host fiber cables

    Note: 8 racks have the 84-way bundle split into two 42-way bundles below the Myrinet rack.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 44)

    Myrinet 2560-port full bisection

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 51)

    Local Customization of Linux Services

    - All scripts for Linux services should be in /etc/init.d
    - All scripts should be installed with the insserv command
    - All scripts should follow the rules for specifying dependencies (a sketch of such a script
      follows below)
    - See: man init.d, man insserv
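
    As an illustration only (not from the original charts), a minimal LSB-style init script skeleton
    of the kind these rules describe, plus the insserv call that registers it; the service name
    dim_example and its dependency list are made up for the sketch:

        #!/bin/sh
        # /etc/init.d/dim_example -- hypothetical local service, for illustration only
        ### BEGIN INIT INFO
        # Provides:          dim_example
        # Required-Start:    $network $remote_fs
        # Required-Stop:     $network $remote_fs
        # Default-Start:     3 5
        # Default-Stop:      0 1 2 6
        # Description:       Example LSB-style service script with declared dependencies
        ### END INIT INFO

        case "$1" in
            start)  echo "Starting dim_example" ;;   # real start logic goes here
            stop)   echo "Stopping dim_example" ;;   # real stop logic goes here
            status) echo "dim_example status unknown" ;;
            *)      echo "Usage: $0 {start|stop|status}"; exit 1 ;;
        esac

    Registering it so the boot order respects the declared dependencies:

        insserv /etc/init.d/dim_example    # reads the LSB header and creates the runlevel links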

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 53)

    Agenda

    Date: Tuesday, Sept 27
      9:30-11:00   Blade Cluster Architecture              (Greg Rodgers)
      11:30-1:00   Network Overview                        (Greg Rodgers)
      1:00-2:30    LUNCH
      2:30-4:00    Storage Subsystem and p615 management   (Greg Rodgers)
      4:30-6:00    DIM and Image Management                (Greg Rodgers & Peter Morjan)

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 54)

    MareNostrum Network Review

    [Diagram: 172 BladeCenters (2406 blades) connected to a 2560-port Myrinet 2000 switch, a FORCE10
    gigabit network, and a service LAN; p615 servers front 20 DS4100 storage nodes.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 55)

    MareNostrum Storage Subsystem

    [Diagram: two POWER storage servers, each with local SCSI disks, share one FAStT100-controlled
    Fibre Channel storage node. Repeat 20 times.]

    - Each POWER storage server manages 56 blades and half of a 7TB Fibre Channel storage node, for
      redundancy.
    - Root filesystems are contained on the SCSI disks; the Fibre Channel storage is used for the
      parallel file system.
    - Storage servers are both image servers and GPFS storage servers.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 56)

    140TB Storage Subsystem

    20 x 7TB storage server nodes. Each storage server node consists of:
      - 2 p615
      - 1 FAStT100 controller with 3.5TB
      - 1 EXP100 SATA drawer with 3.5TB

    2 p615 = 8U, FAStT100 = 3U, EXP100 = 3U; total storage node = 14U; 3 nodes per rack.

    [Rack elevations RS01-RS07 showing storage nodes SN01-SN20 (p615, FAStT100, EXP100, p615 per
    node); one rack has 14U free.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 57)

    6 storage pSeries racks with 3 storage nodes each (RS01, RS02, RS03, RS05, RS06, RS07)

    Box summary (per rack):
      - 6 p615
      - 3 FAStT100
      - 3 EXP100

    Cabling, external:
      - 12 10/100 cat5
      - 12 Gb
      - 12 Myrinet
      - 6 serial
    Cabling, internal:
      - 2 p615 to FAStT100
      - FAStT100 to EXP100

    [Rack elevation: three nodes of p615 (4U), p615 (4U), FAStT100 (3U), EXP100 (3U).]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 58)

    1 storage pSeries rack with 2 storage nodes (RS04)

    Box summary:
      - 4 p615
      - 2 FAStT100
      - 2 EXP100

    Cabling, external:
      - 8 10/100 cat5
      - 8 Gb
      - 8 Myrinet
      - 4 serial
    Cabling, internal:
      - 2 p615 FC to FAStT100
      - FAStT100 to EXP100

    [Rack elevation RS04: two nodes of p615 (4U), p615 (4U), FAStT100 (3U), EXP100 (3U); 14U free.]

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 62)

    p615 Remote Control

    - The HMC can remotely power and provide a console to any p615.
    - See the HMC manual for instructions: "Effective System Management using the IBM Hardware
      Management Console for pSeries". The manual has lots of material about partitioning that is
      not relevant to MareNostrum.
    - Remember: no partitioning means one partition per system.
    - Two key commands you'll need to learn: mkvterm and chsysstate.
    - If you write scripts on the HMC, back them up somewhere else; an HMC reload will wipe them out.
    - Recommend scripting the remote console command (a sketch follows below).
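
    A minimal sketch of the kind of wrapper the slide recommends, shown as an illustration rather
    than the course's actual script; the option syntax is typical HMC CLI usage and should be
    verified against the HMC level in use (man chsysstate, man mkvterm):

        #!/bin/sh
        # p615con -- illustrative wrapper for HMC remote power/console to a p615
        # Usage: p615con <managed-system> {on|off|console}
        # Assumes it runs on the HMC (or over ssh to the HMC).
        SYS="$1"; ACTION="$2"

        case "$ACTION" in
            on)      chsysstate -m "$SYS" -r sys -o on ;;    # power the managed system on
            off)     chsysstate -m "$SYS" -r sys -o off ;;   # hard power off
            console) mkvterm -m "$SYS" ;;                    # open a virtual terminal; may also need
                                                             # "-p <partition>" for the single partition
            *)       echo "Usage: $0 <managed-system> {on|off|console}"; exit 1 ;;
        esac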

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 63)

    Serial cabling for the p615 service network

    - 3 RANs are needed; only 2 are shown.
    - No connection to a 7040 frame is required.
    - The managed system is each p615 server.

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 64)

    p615 performance

    - Optimal adapter placement depends on the bus structure.
    - The built-in 10/100/1000 is an optimal IO interface; the built-in 10/100 is used for the
      service network.
    - Adapters on the MareNostrum p615s: two Myrinet 4-meg cards, 1 Emulex fibre channel adapter,
      1 fiber gigabit card (not used).

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 65)

    p615 performance continued

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 66)

    Agenda

    Date: Tuesday, Sept 27
      9:30-11:00   Blade Cluster Architecture     (Greg Rodgers)
      11:30-1:00   Network Overview               (Greg Rodgers)
      1:00-2:30    LUNCH
      2:30-4:00    Storage Subsystem              (Greg Rodgers)
      4:30-6:00    DIM and Image Management       (Greg Rodgers & Peter Morjan)

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 67)

    DIM

    - DIM = Diskless Image Management
    - DIM is not a great name; it is not really diskless. It was prototyped on MareNostrum to operate
      blades as if they were diskless.
    - DIM is utility software, copyright IBM, made available to BSC, not for redistribution.
    - Other advantages:
      - Asynchronous image management
      - Single image maintenance
      - Speed: no noticeable performance degradation even with oversubscribed Ethernet
      - Zero blade install time
      - No Linux distro modification
      - Efficient image management: over 2 million rpms on MareNostrum; efficient yet not minimalistic
      - Local hard drive available for user /scratch

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 68)

    Basic DIM Process

    - Install Linux and manage the image on master blade(s).
      - Multiple blades can be used for different images: gnode (s41c3b13), mnode (s41c3b14).
    - Clone the blade image using dim_update_master — a brute-force rsync of the root directories.
    - Distribute the master copy to the clones:
      - read-only and read-write parts
      - intelligent rsync with filters
      - can be done with blades up or down
    A rough sketch of the rsync steps follows below.
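
    The following is only a rough illustration of what the brute-force clone and the filtered
    distribution could look like at the shell level; dim_update_master's real behavior is a DIM
    internal not shown in the charts, so every path, blade list, and exclude rule below is a made-up
    placeholder:

        #!/bin/sh
        # Illustrative only -- NOT the dim_update_master implementation.
        MASTER=s41c3b13                   # master blade holding the golden image (name from the charts)
        IMGROOT=/dim/images/gnode         # hypothetical image tree on the image server

        # 1. Pull the master blade's root filesystem onto the image server (brute force).
        rsync -aHx --numeric-ids --delete \
              --exclude=/proc/ --exclude=/sys/ --exclude=/tmp/ \
              root@${MASTER}:/ ${IMGROOT}/readonly/

        # 2. Distribute to clone images: shared read-only part plus a per-blade read-write part.
        for blade in s41c3b01 s41c3b02; do     # placeholder blade names
            rsync -aHx --delete --exclude-from=${IMGROOT}/rw.filter \
                  ${IMGROOT}/readonly/ ${IMGROOT}/clones/${blade}/   # filter keeps per-blade rw files
        done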

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 69)

    DIM versus Warewulf

    - DIM scales to thousands of nodes with a 2-level hierarchy.
    - DIM has large shared read-only parts of the image:
      - fast update
      - storage efficiency
      - allows a complete distro, not minimalistic
      - exploits caching on the image server
    - DIM can update images during operation, with or without a running client. Warewulf (like Rocks)
      rebuilds the image for any change, such as a new rpm or a new user.
    - DIM uses loopback-mounted filesystems on the image server to control quota (see the sketch
      below). It also allows several types of network data transport, including NFS, NBD, and iSCSI.
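
    As a generic illustration of the loopback-image idea (not DIM's actual layout; the size, paths,
    and filesystem type are arbitrary here), a per-blade read-write area can be kept in a fixed-size
    file so that its growth is capped by construction:

        # Create a 2 GB image file, put a filesystem in it, and loop-mount it.
        dd if=/dev/zero of=/dim/rw/s01c1b01.img bs=1M count=2048   # file size acts as the quota
        mkfs.ext3 -F /dim/rw/s01c1b01.img                          # -F: operate on a regular file
        mkdir -p /dim/rw/s01c1b01
        mount -o loop /dim/rw/s01c1b01.img /dim/rw/s01c1b01        # export this directory, e.g. via NFS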

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 70)

    Diskless Image Management

    Extensive use of Linux 2.6 dynamic loading and linuxrc (a minimal linuxrc sketch follows below).
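
    For illustration only, a stripped-down linuxrc of the kind an initrd for network-rooted blades
    might use; the driver module, addresses, and export path are placeholders, not MareNostrum's
    actual configuration:

        #!/bin/sh
        # linuxrc -- hypothetical initrd script: bring up the network and mount the root over NFS
        /sbin/insmod /modules/tg3.ko                 # load the on-board Ethernet driver (placeholder)
        /sbin/ip addr add 10.0.1.13/16 dev eth0      # address would really come from DHCP / kernel ip=
        /sbin/ip link set eth0 up

        mount -t nfs -o ro,nolock 10.0.0.1:/dim/clones/s01c1b13 /mnt   # read-only root from image server
        cd /mnt
        pivot_root . initrd                          # assumes an initrd/ mount point exists in the image
        exec chroot . /sbin/init                     # hand off to init on the new root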

  • IBM Systems Group

    MareNostrum Training Class | 9/29/2005 © 2004 IBM Corporation (slide 72)

    DIM Services Required

    - DHCP: /etc/dhcpd.conf is the master database.
    - NFS: an NFS server is required on all DIM image servers, and an NFS client on the DIM blades.
    - rsync: required on DIM image servers.
    - ssh
    - tftp
    - xinetd
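
    To make the "master database" point concrete, here is a generic ISC dhcpd host stanza of the sort
    a net-booted blade needs; the host name, MAC address, IP addresses, and boot file name are
    invented placeholders, not values from the MareNostrum configuration:

        # Append one illustrative host stanza to the DHCP master database.
        cat >> /etc/dhcpd.conf <<'EOF'
        host s01c1b01 {
            hardware ethernet 00:11:22:33:44:55;   # blade's Ethernet MAC (placeholder)
            fixed-address 10.0.1.1;                # the blade's gigabit-network address (placeholder)
            next-server 10.0.0.1;                  # tftp/image server for this blade (placeholder)
            filename "zImage.initrd";              # network boot image served via tftp (placeholder)
        }
        EOF
        rcdhcpd restart    # SUSE-style restart; "/etc/init.d/dhcpd restart" works as well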