ISC 12 BoF: InfiniBand? Problems? Do you care?

science + computing ag IT services for sophisticated computer environments Tübingen | München | Berlin | Düsseldorf InfiniBand? Problems? Do you care? Christian Kniep / Jan Wender

Upload: sciecomp

Post on 25-Dec-2014


Page 1: ISC 12 BoF: InfiniBand? Problems? Do you care?

science + computing ag
IT services for sophisticated computer environments
Tübingen | München | Berlin | Düsseldorf

InfiniBand? Problems? Do you care?

Christian Kniep / Jan Wender

Page 2: ISC 12 BoF: InfiniBand? Problems? Do you care?

© 2012 science + computing ag

Page

BoF InfiniBand | 2012-06-19

Agenda

This is an interactive session!
▪ Who is on the podium?
▪ Living Histogram?
▪ Getting some statistics
  ▪ Living Histogram
▪ Existing Monitoring Solutions
▪ Discussion
  ▪ Quick and Dirty Analysis
  ▪ Conclusions

Page 3: ISC 12 BoF: InfiniBand? Problems? Do you care?

On the podium


Page 4: ISC 12 BoF: InfiniBand? Problems? Do you care?

science + computing at a glance

Founding year: 1989
Locations: Tübingen, München, Berlin, Düsseldorf
Employees: 270
Shareholder: Bull S.A. (100%)
Revenue 10/11: 27 million euros
Partners: Daikin Industries, Japan; NICE srl, Italy; Exa Corporation, USA; Platform Computing, Canada

Page 5: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram?

Brian L. Joiner, International Statistical Review / Revue Internationale de Statistique, Vol. 43, No. 3 (Dec. 1975), pp. 339-340.

Page 6: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram

Size of Fabric
▪ <10
▪ <50
▪ <500
▪ >500

Page 7: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram

Switch Structure
▪ Switch size
  ▪ single switch (mlx4036, qlogic12300)
  ▪ modular switch (mlx5600, qlogic12800)
▪ Amount
  ▪ few
  ▪ many

Page 8: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram

Focus
▪ Stability
  ➡ maintenance cost
▪ High-Performance
  ➡ extremely optimized

Page 9: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram

Type of Use
▪ Cluster Purpose
  ▪ Single Purpose Cluster
  ▪ Multi Purpose Cluster
▪ Usage
  ▪ One Job at a time
  ▪ Multiple Jobs

Page 10: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram

Kind/Amount of Problems
▪ Impact
  ▪ minor
  ▪ major
▪ Amount
  ▪ few
  ▪ many

Page 11: ISC 12 BoF: InfiniBand? Problems? Do you care?

Living Histogram

Problem solving
▪ Iterative
  ➡ reseat / reboot
▪ Analytic
  ➡ dig into the problem
  ➡ try to wipe it out

Page 12: ISC 12 BoF: InfiniBand? Problems? Do you care?

Monitoring Solutions

stable (but not useful to admins?)
▪ infiniband-diags
  ▪ ibcheckerrors
  ▪ ibdiagpath
▪ plugin to non-IB systems
  ▪ nagios
  ▪ collectl
▪ hardware vendor suites
  ▪ Unified Fabric Manager (Mellanox)
  ▪ InfiniBand Fabric Suites (QLogic)

unstable (individually carved)
▪ wrapper of infiniband-diags
▪ INAM (Ohio State University)
▪ QNIB
▪ .....

not listed stuff
▪ ...
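To make the "wrapper of infiniband-diags" entry a little more concrete, here is a minimal sketch of what such a wrapper might do with counter text once it has been collected. It is not the speakers' tooling. The sample text is illustrative and only assumed to resemble the `Name:....value` layout of per-port counter output; the thresholds and the function names `parse_counters` and `exceeded` are hypothetical.

```python
import re

# Illustrative text only; assumed to resemble per-port counter output
# in a "Name:....value" layout. Not captured from a real fabric.
SAMPLE = """\
# Port counters: Lid 21 port 1
SymbolErrorCounter:..............12
LinkDownedCounter:...............0
PortRcvErrors:...................3
"""

# Hypothetical thresholds for illustration, not recommendations.
THRESHOLDS = {"SymbolErrorCounter": 10, "LinkDownedCounter": 0, "PortRcvErrors": 5}

# A counter line: name, colon, a run of filler dots, then an integer.
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)$")


def parse_counters(text):
    """Return {counter_name: value} for every line that looks like a counter."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def exceeded(counters, thresholds):
    """Return only the counters whose value is above its threshold."""
    return {
        name: value
        for name, value in counters.items()
        if name in thresholds and value > thresholds[name]
    }


counters = parse_counters(SAMPLE)
alerts = exceeded(counters, THRESHOLDS)
```

A result like `alerts` could then be handed to a general-purpose monitoring system such as the ones listed under "plugin to non-IB systems".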

Page 14: ISC 12 BoF: InfiniBand? Problems? Do you care?

Modular Switches

switchguid=0xac1(ac1)     # Spine 1
Switch  36 "S-ac1"        # "A1" enhanced port 0 lid 11 lmc 0
[1]     "S-bc1"[1]        # "B1" lid 21 4xQDR
[2]     "S-bc2"[1]        # "B2" lid 22 4xQDR
[3]     "S-bc3"[1]        # "B3" lid 23 4xQDR

switchguid=0xac2(ac2)     # Spine 2
Switch  36 "S-ac2"        # "A2" enhanced port 0 lid 12 lmc 0
[1]     "S-bc1"[2]        # "B1" lid 21 4xQDR
[2]     "S-bc2"[2]        # "B2" lid 22 4xQDR
[3]     "S-bc3"[2]        # "B3" lid 23 4xQDR

switchguid=0xbc1(bc1)     # Line 1
Switch  36 "S-bc1"        # "B1" enhanced port 0 lid 21 lmc 0
[1]     "S-ac1"[1]        # "A1" lid 11 4xQDR
[2]     "S-ac2"[1]        # "A2" lid 12 4xQDR
[3]     "H-1"[1](f1)      # "Host1" lid 101 4xQDR

switchguid=0xbc2(bc2)     # Line 2
Switch  36 "S-bc2"        # "B2" enhanced port 0 lid 22 lmc 0
[1]     "S-ac1"[2]        # "A1" lid 11 4xQDR
[2]     "S-ac2"[2]        # "A2" lid 12 4xQDR
[3]     "H-2"[1](f2)      # "Host2" lid 102 4xQDR

switchguid=0xbc3(bc3)     # Line 3
Switch  36 "S-bc3"        # "B3" enhanced port 0 lid 23 lmc 0
[1]     "S-ac1"[3]        # "A1" lid 11 4xQDR
[2]     "S-ac2"[3]        # "A2" lid 12 4xQDR
[3]     "H-3"[1](f3)      # "Host3" lid 103 4xQDR
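A listing like the one on this slide can be turned into a list of links with a few lines of code. The following is a rough sketch only: the sample text reuses two of the slide's switch blocks, the regular expressions are written against exactly that layout, and the name `parse_links` is hypothetical. Real topology dumps may contain more record types than this handles.

```python
import re

# Two switch blocks in the layout shown on the slide (shortened sample).
TOPOLOGY = '''\
switchguid=0xac1(ac1)     # Spine 1
Switch  36 "S-ac1"        # "A1" enhanced port 0 lid 11 lmc 0
[1]     "S-bc1"[1]        # "B1" lid 21 4xQDR
[2]     "S-bc2"[1]        # "B2" lid 22 4xQDR

switchguid=0xbc1(bc1)     # Line 1
Switch  36 "S-bc1"        # "B1" enhanced port 0 lid 21 lmc 0
[1]     "S-ac1"[1]        # "A1" lid 11 4xQDR
[3]     "H-1"[1](f1)      # "Host1" lid 101 4xQDR
'''

# 'Switch <ports> "<name>"' opens a block for one switch.
SWITCH_RE = re.compile(r'^Switch\s+(\d+)\s+"([^"]+)"')
# '[<port>] "<peer>"[<peer port>]' describes one link of that switch.
LINK_RE = re.compile(r'^\[(\d+)\]\s+"([^"]+)"\[(\d+)\]')


def parse_links(text):
    """Return (switch, port, peer, peer_port) tuples in file order."""
    links = []
    current = None  # name of the switch whose block is being read
    for line in text.splitlines():
        switch = SWITCH_RE.match(line)
        if switch:
            current = switch.group(2)
            continue
        link = LINK_RE.match(line)
        if link and current is not None:
            links.append(
                (current, int(link.group(1)), link.group(2), int(link.group(3)))
            )
    return links


links = parse_links(TOPOLOGY)
```

Each switch-to-switch cable appears twice in a full listing, once from each end, so a consumer that counts cables would need to deduplicate.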

Page 15: ISC 12 BoF: InfiniBand? Problems? Do you care?

Modular Switches

(Same topology listing as on Page 14, with the label Chassis1 added.)

Diagram labels:
Spine1 Spine2
Line1 Line2 Line3
Host1 Host2 Host3

Page 16: ISC 12 BoF: InfiniBand? Problems? Do you care?

Modular Switches

(Same topology listing as on Page 14, with the label Chassis1 added.)

Diagram labels:
Spine1 Spine2
Line1 Line2 Line3
Host1 Host2 Host3

Chassis1
Host1 Host2 Host3
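The diagram labels suggest that the spine and line switches of one modular switch can be viewed as a single chassis with hosts attached. One possible way to compute such a collapsed view is sketched below. The link list is taken from the line-switch blocks of the listing; the `CHASSIS` mapping and the name `collapse` are hypothetical, since the listing itself does not say which switches share a chassis.

```python
from collections import Counter

# Links from the line-switch blocks of the listing:
# (node, port, peer, peer_port), one entry per cable.
LINKS = [
    ("S-bc1", 1, "S-ac1", 1), ("S-bc1", 2, "S-ac2", 1), ("S-bc1", 3, "H-1", 1),
    ("S-bc2", 1, "S-ac1", 2), ("S-bc2", 2, "S-ac2", 2), ("S-bc2", 3, "H-2", 1),
    ("S-bc3", 1, "S-ac1", 3), ("S-bc3", 2, "S-ac2", 3), ("S-bc3", 3, "H-3", 1),
]

# Hypothetical mapping of switch names to the chassis that contains them.
CHASSIS = {name: "Chassis1" for name in ("S-ac1", "S-ac2", "S-bc1", "S-bc2", "S-bc3")}


def collapse(links, chassis):
    """Merge switches of one chassis into a single node.

    Returns (external, internal): a Counter of cables between distinct
    collapsed nodes, and the number of cables hidden inside a chassis.
    """
    external = Counter()
    internal = 0
    for node, _port, peer, _peer_port in links:
        a = chassis.get(node, node)  # nodes without a chassis keep their name
        b = chassis.get(peer, peer)
        if a == b:
            internal += 1
        else:
            external[tuple(sorted((a, b)))] += 1
    return external, internal


external, internal = collapse(LINKS, CHASSIS)
```

With this mapping the six spine-to-line cables disappear into the chassis and only the three host links remain visible.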

Page 22: ISC 12 BoF: InfiniBand? Problems? Do you care?

Discussion - Quick Analysis

Fabric size
▪ small -> easy as pie?
▪ big -> critical mass for real analysis?
Switch structure
▪ what is your routing algorithm?
Focus
▪ 80:20 rule?
Type of use
▪ willing/forced to share
Problem type / amount
▪ runs smoothly enough
Problem solving
▪ learning curve starts steep

[Chart labels: performance, maintenance]

Page 23: ISC 12 BoF: InfiniBand? Problems? Do you care?

Discussion - Quick Analysis

Fabric size
▪ small -> easy as pie?
▪ big -> critical mass for real analysis?
Switch structure
▪ what is your routing algorithm?
Focus
▪ 80:20 rule?
Type of use
▪ willing/forced to share
Problem type / amount
▪ runs smoothly enough
Problem solving
▪ learning curve starts steep

[Chart labels: performance, maintenance]

Page 25: ISC 12 BoF: InfiniBand? Problems? Do you care?

Discussion - Quick Analysis

Fabric size
▪ small -> easy as pie?
▪ big -> critical mass for real analysis?
Switch structure
▪ what is your routing algorithm?
Focus
▪ 80:20 rule?
Type of use
▪ willing/forced to share
Problem type / amount
▪ runs smoothly enough
Problem solving
▪ learning curve starts steep

[Chart: axis ticks 0, 25, 50, 75, 100; labels: performance, maintenance]

Page 29: ISC 12 BoF: InfiniBand? Problems? Do you care?

Discussion - Conclusions

Monitoring
▪ what approach?

Do we scare you?
▪ not intending to spread Fear, Uncertainty and Doubt

Our conclusions

Your conclusions

Page 34: ISC 12 BoF: InfiniBand? Problems? Do you care?

Thank you for your attention and participation!

science + computing ag
www.science-computing.de

Phone: +49 (0)7071 9457 - 0
E-Mail: [email protected]