© 2011 IBM Corporation
Toolkit for Event Analysis and Logging
Education
Dec 2011
Contents
Overview
Locations
Commands
Alerts and Connectors
Debug
References
Overview
Common HPC Event Analysis Framework
– Combined best aspects and lessons learned from BlueGene ELA and Federation ELA
– Addressed new P7-IH requirements
Common Event Repository
– First release: CNM, Service Focal Point (HMC), PNSD, LL, GPFS (coming soon)
Analysis of Events to create Alerts
– Rules-based engine
– Flexible alert delivery, for example RMC and e-mail
Real-time Analysis and Historic Analysis
– Real-time to be proactive and react immediately to events
– Historic allows for deeper debug on-site and off-site
Robust framework to prevent loss of alerts and events
– Handles event flooding
– Checkpoint/Shutdown/Restart
Open Source (pyteal.sourceforge.net)
– Using ODBC
– Python, C/C++, and Perl
RAS Strategy
[Diagram: RAS strategy flow. Detect (monitors, observation); Centralize (database, event adapters); Analyze (generic analysis, custom analysis, rules); Alert (generic filters and listeners, custom; delivered by query, e-mail, RMC); Correct (auto-recovery, custom, recommended actions, manual analysis). A Find, Resolve, Refine loop covers TEAL itself: get data, debug and analyze behavior, release new rules, fix framework, maintenance package escape. Further labels: "Shouldn't be manual?", data mining queries, historical analysis, data collection as enabled. Grayed-out boxes are future possibilities.]
TEAL Concepts
[Diagram: connectors (CNM and others) log events to the Event Log (table in xCAT DB) and signal the monitor through a semaphore. Events flow from the monitor to analyzers, which produce alerts; alerts pass through filters to listeners and are stored in the Alert Log (table in xCAT DB). The components are configured through teal.conf.]
P7-IH Usage
Output is to an alert database
– Monitored by the administrator and operators
– Various methods of monitoring will be described
– Commands are used to query the database
Primary users are the administrator and operator
Runs on the EMS
– Commands are issued via the EMS command line
– SSRs may run commands under engineering direction
Event database may be collected to work on new analysis algorithms or bugs
P7-IH Implementation
[Diagram: CNM, SFP, GPFS, LL and PNSD feed TEAL, which stores events in the Event Log and analyzed events (alerts) in the Alert Log, both tables in the cluster DB. Other labels: HMC(s), systems, "SFP to TEAL", "network events to SFP", and customer notification of the admin and operator by e-mail, RMC, or query.]
Locations
Points to a specific event location
Can be physical, logical, or a mixture of both
Is hierarchical in nature
– Simple: one type of item per level
– Complex: multiple types of items per level
Operations
– Scoping
– Validation
– Casting (platform specific)
XML-based description
– /opt/teal/data/ibm/teal/xml/percs_location.xml
– Can use it to remind yourself of the location formats
Location Code Examples
Simple
– Hierarchy innate in description
– Example: <node>-<program>-<pid>
  comp01-firefox-1234
  comp01-vncserver-4567
Complex
– Compact ID
– Optional instance values
– Example: H:FR008-CG03-SN000-DR0
– Level IDs shown in the diagram: FR, CG, SN, DR, HB, LL, OM, HF, LR, LD, RM
P7-IH Locations
Application
– A:c250mgrs20-pvt.ppd.pok.ibm.com##teal.py##28327
– Expect this from PNSD and GPFS (apps in general)
Job
– J:z25c4s9.ppd.pok.ibm.com.1.3
– Expect this from LoadLeveler
Hardware (aka logical hardware)
– H:FR008-CG03-SN000-DR0-HB1-OM27-LR22
– Expect this from ISNM
pSeries (aka service/physical)
– P:U9125.F2C.0286C66
– Expect this from SFP
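The location formats above can be illustrated with a small parser. This is only a sketch, not TEAL's own location code: the function names are hypothetical, the "two letters plus digits" level pattern is an assumption drawn from the hardware example, and scoping is modeled simply as truncating the location at a given level ID.

```python
import re

# Assumed pattern for one hardware level: two uppercase letters, then digits
# (for example FR008 or HB1). Inferred from the slide's example only.
_LEVEL = re.compile(r"^([A-Z]{2})(\d+)$")


def split_location(text):
    """Split '<type>:<location>' into (type letter, location string)."""
    loc_type, sep, loc = text.partition(":")
    if not sep or len(loc_type) != 1:
        raise ValueError("expected '<type>:<location>', got %r" % text)
    return loc_type, loc


def parse_hardware_location(text):
    """Return [(level ID, instance number), ...] for an 'H:' location."""
    loc_type, loc = split_location(text)
    if loc_type != "H":
        raise ValueError("not a hardware location: %r" % text)
    levels = []
    for part in loc.split("-"):
        match = _LEVEL.match(part)
        if match is None:
            raise ValueError("unrecognized level %r" % part)
        levels.append((match.group(1), int(match.group(2))))
    return levels


def scope_to(text, level):
    """Truncate a location at the given level ID (hypothetical scoping)."""
    loc_type, loc = split_location(text)
    kept = []
    for part in loc.split("-"):
        kept.append(part)
        if part.startswith(level):
            return loc_type + ":" + "-".join(kept)
    raise ValueError("level %r not present in %r" % (level, text))
```

For example, scope_to("H:FR008-CG03-SN000-DR0-HB1-OM27-LR22", "DR") keeps everything up to and including the drawer level. The authoritative format definitions remain in percs_location.xml.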
Commands
TEAL EMS Command Line (/opt/teal/bin)
teal          runtime and historic modes
tllsevent     list events
tlrmevent     prune events from the event log
tllsalert     list alerts
tlchalert     change the state of an alert
tlrmalert     prune alerts from the alert log
tllschkpt     list checkpoints
tltab (sbin)  database table maintenance
Managing Alerts
Closing Alerts
1. tllsalert
2. tlchalert --id 1543 --state close
Querying Alerts
• tllsalert -q "creation_time>2010-12-30 creation_time<2011-02-01"
• tllsalert -q "event_loc=P" -f text
• tllsalert -q "event_loc=H:FR007-CG03-SN016-DR0-HB0 event_scope=hub"
• tllsalert --with-assoc -f text
Output options: csv, json, text, "brief"
Removing Alerts
• tlrmalert --older-than 2011-01-01-12:00:00
Can only remove alerts that are
• closed
• not a duplicate
Can take a long time
Managing Events
Listing Events
• tllsevent
• tllsevent -q "src_loc=H:FR007-CG03-SN016-DR0-HB0 src_scope=hub"
• tllsevent -e
• tllsevent -q "time_logged=2011-04"
Removing Events
• tlrmevent --older-than 2011-01-01-12:00:00
Only removes events not associated with
• an alert
• a checkpoint
Cleaning Out the DB
1. Close (by resolving) any active alerts (tlchalert)
2. Remove all closed alerts (tlrmalert --older-than)
3. Remove all events not associated with an alert (tlrmevent --older-than)
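The three cleanup steps can be scripted. The sketch below only builds the command lines and does not run them; the function name and the idea of passing in a list of active alert IDs are assumptions for illustration, while the flags and the timestamp format are the ones shown on the earlier slides.

```python
from datetime import datetime

TEAL_BIN = "/opt/teal/bin"  # command location given in the slides


def cleanup_commands(active_alert_ids, cutoff):
    """Build the argv lists for the three cleanup steps, in order.

    active_alert_ids: rec_ids of alerts to close (hypothetical input,
    for example collected from tllsalert output beforehand).
    cutoff: datetime used for both --older-than options.
    """
    stamp = cutoff.strftime("%Y-%m-%d-%H:%M:%S")
    commands = []
    # Step 1: close each active alert.
    for rec_id in active_alert_ids:
        commands.append(
            [TEAL_BIN + "/tlchalert", "--id", str(rec_id), "--state", "close"]
        )
    # Step 2: remove closed alerts older than the cutoff.
    commands.append([TEAL_BIN + "/tlrmalert", "--older-than", stamp])
    # Step 3: remove events no longer associated with an alert.
    commands.append([TEAL_BIN + "/tlrmevent", "--older-than", stamp])
    return commands


if __name__ == "__main__":
    for argv in cleanup_commands([1543], datetime(2011, 1, 1, 12, 0, 0)):
        print(" ".join(argv))
```

On an EMS, each argv list could be passed to subprocess.check_call; keeping the order matters because events stay protected while an alert still references them.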
Checkpoints
tllschkpt
CnmEventAnalyzer R 35301
PNSDEventAnalyzer R None
LLEventAnalyzer S None
SFPEventAnalyzer R None
monitor_event_queue R 35301
MAX_event_rec_id 3530
Callouts:
– Analyzer rows: state when the analyzer last checkpointed
– monitor_event_queue: last event processed by the monitor (last recovery type and start rec_id)
– MAX_event_rec_id: maximum rec_id in the event log
tllschkpt -f text shows additional data
GEAR-based analyzers contain pool checkpoint information
Historic Analysis - Reanalyzing
User can set up a query for the criteria of interest
Filters and listeners in the configuration file for historic mode or all modes are executed
Choice of committing or not committing (default) the generated alerts
To capture all alerts produced, use a file or print listener that does not specify any filters
Time occurred or time logged can be used for analysis
teal --historic --query="src_comp=CNM time_occurred>2011-02-01-10:00:00"
TEAL historic and tlls* Options
rec_id (=,<,>,<=,>=): a single value or a comma-separated list of IDs
event_id (=): a single value or a comma-separated list of IDs
time_occurred (=,<,>,<=,>=): a single value in the format yyyy-mm-dd hh:mm:ss
time_logged (=,<,>,<=,>=): a single value in the format yyyy-mm-dd hh:mm:ss
src_comp (=): a single value or a comma-separated list of values
src_loc_type:src_loc (=): the location is optional; without it, all events with the same location type are included
src_scope (=): level to scope all source locations to. Only valid if the source location type is specified
rpt_comp (=): a single value or a comma-separated list of values
rpt_loc_type:rpt_loc (=): the location is optional; without it, all events with the same location type are included
rpt_scope (=): level to scope all reporting locations to. Only valid if the reporting location type is specified
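The operator rules in the option list lend themselves to a small pre-check before a query string is handed to teal or a tlls* command. The helper below is a hypothetical sketch: build_query is not part of TEAL, the field table is transcribed from the list above, and conditions are joined with spaces as in the command examples on the earlier slides.

```python
# Operators accepted per field, transcribed from the option list.
ALL_OPS = ("=", "<", ">", "<=", ">=")
FIELD_OPS = {
    "rec_id": ALL_OPS,
    "event_id": ("=",),
    "time_occurred": ALL_OPS,
    "time_logged": ALL_OPS,
    "src_comp": ("=",),
    "src_loc": ("=",),
    "src_scope": ("=",),
    "rpt_comp": ("=",),
    "rpt_loc": ("=",),
    "rpt_scope": ("=",),
}


def build_query(conditions):
    """Join (field, operator, value) triples into one query string.

    Values are passed through unchanged; only the field name and the
    operator are checked against FIELD_OPS.
    """
    parts = []
    for field, op, value in conditions:
        if field not in FIELD_OPS:
            raise ValueError("unknown field: %s" % field)
        if op not in FIELD_OPS[field]:
            raise ValueError("operator %s not allowed for %s" % (op, field))
        parts.append("%s%s%s" % (field, op, value))
    return " ".join(parts)
```

Calling build_query([("src_comp", "=", "CNM"), ("time_occurred", ">", "2011-02-01-10:00:00")]) reproduces the query used in the historic analysis example. The helper does not validate values, so location and timestamp formats still need to match what the commands expect.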
Sample output – csv and json
csv – good for reading into spreadsheets, or program parsing
rec_id,event_id,time_occurred,time_logged,src_comp,src_loc,src_loc_type,rpt_comp,rpt_loc,rpt_loc_type,event_cnt,elapsed_time
91455,BD700041,2011-02-09 15:06:19,2011-02-09 15:06:19,CNM,BB03-FR007-SN000-DR0-HB0-LD00,H,CNM,"TRMD",A,,
json – good for program parsing
{"src_comp": "CNM", "rpt_loc_type": "A", "event_id": "BD700041", "src_loc_type": "H", "time_occurred": "2011-02-09 15:06:19", "rec_id": 91455, "event_cnt": null, "rpt_loc": "TRMD", "elapsed_time": null, "rpt_comp": "CNM", "time_logged": "2011-02-09 15:06:19", "src_loc": "BB03-FR007-SN000-DR0-HB0-LD00"}
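Both formats can be read with the Python standard library. The sketch below parses the two sample records from this slide; the function names are illustrative, and it assumes each JSON record arrives as one complete object, which is all the sample shows.

```python
import csv
import io
import json

# The csv and json samples from the slide, embedded as text.
CSV_TEXT = (
    "rec_id,event_id,time_occurred,time_logged,src_comp,src_loc,"
    "src_loc_type,rpt_comp,rpt_loc,rpt_loc_type,event_cnt,elapsed_time\n"
    "91455,BD700041,2011-02-09 15:06:19,2011-02-09 15:06:19,CNM,"
    'BB03-FR007-SN000-DR0-HB0-LD00,H,CNM,"TRMD",A,,\n'
)
JSON_TEXT = (
    '{"src_comp": "CNM", "rpt_loc_type": "A", "event_id": "BD700041", '
    '"src_loc_type": "H", "time_occurred": "2011-02-09 15:06:19", '
    '"rec_id": 91455, "event_cnt": null, "rpt_loc": "TRMD", '
    '"elapsed_time": null, "rpt_comp": "CNM", '
    '"time_logged": "2011-02-09 15:06:19", '
    '"src_loc": "BB03-FR007-SN000-DR0-HB0-LD00"}'
)


def read_csv_events(text):
    """Return one dict per csv row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_event(text):
    """Return the dict for a single json record."""
    return json.loads(text)


csv_event = read_csv_events(CSV_TEXT)[0]
json_event = read_json_event(JSON_TEXT)
```

One practical difference shows up immediately: csv yields every field as a string, with empty strings for missing values, while json keeps rec_id as an integer and maps null to None.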
Alerts and Connectors
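CNM and TEAL EMS
[Diagram: FSPs send network events to ISNM/CNM (NM), which passes them to TEAL on the EMS. Inside TEAL, events pass through the monitor and the analyzer (driven by rules) to produce alerts, which pass through the filter to the listener. Other labels: SFP, Init.]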
Network Hardware Events
Events reported by the HFI, ISR or Optical Module:
HFI Events
– HFI Down: reported for completeness of network status
Link Events
– Link types are HFI-to-ISR links, Llocal (intra-drawer), Lremote (intra-SN), and D-link (inter-SN)
– Port Down/Port Up
– Threshold events: CRC, dropped flit, flit retry
– Correctable/uncorrectable errors on port-level routing structures
– Packet flow events, e.g. credit overflow, sender hang informational
Optical Module Events
– Module-level events affect a single D port or two LR ports
– Channel-level events affect a single D port. May affect one or two LR ports depending on which channels are affected.
– Some OM events are thresholded by LNMC
Frame Events
Reported directly to CNM by frame (BPA) firmware
ISNM uses these events for analysis only, i.e. it suppresses network events caused by frame events
– BPA creates any serviceable events for the problems it detects
Sample frame events that may affect the ISR network:
– CEC power dropped due to MCM Over Temperature
– CEC DCCA errors
– High ambient temperature
[Diagram: BPA and FSPs reporting to CNM]
Example CNM Alert
>[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BD700025"
rec_id : 9673
alert_id : BD700025
creation_time : 2011-08-16 15:15:11.146044
severity : E
urgency : S
event_loc : FR052-CG03-SN000-DR0-HB1-OM12-LD12
event_loc_type : H
fru_loc : None
recommendation : There is a problem with a D-Link.
Record the alert ID.
Record the location in the alert message.
Contact IBM Service.
Log on to the Management Server.
To isolate to the proper FRU, run Link Diags and perform the actions that it recommends.
If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
reason : D-link down between frame FR052 cage CG03 (superNode SN000 drawer DR0) hub HB1 port LD12 and frame FR052 cage CG06 (superNode SN003 drawer DR0) hub HB1 port LD15 (D Link Port Down)
src_name : CnmEventAnalyzer
state : 1
raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }","nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B506"}
CNM FRU list format in alerts
raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }","nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B506"}
• Multiple FRUs, each one contained in braces
• Fields: part number, FRU type, FRU location, part serial number, ECID, CCIN
Part Number | FRU type | FRU location | Part Serial Number | ECID | CCIN
HFI_DDG | Isolation Procedure | | | |
HFI_CAB | Symbolic Procedure | U78A9.001.20C1000-P1-T17-T6 | | |
CBLCONT | Symbolic Procedure | U78A9.001.311B001-P1-T16-T5 | | |
52Y3020 | FRU | U78A9.001.20C1000-P1-R2 | YA193P203586 | ABC123 | TRMD
52Y3020 | FRU | U78A9.001.311B001-P1-R2 | YA193P399669 | ABC123 | TRMD
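The brace-delimited FRU list can be unpacked in two steps: decode raw_data as JSON, then split the fru_list string. The sketch below assumes raw_data is valid JSON, as it is in this sample, and that every FRU entry carries exactly six comma-separated fields; the function and field names are illustrative, not taken from TEAL.

```python
import json
import re

# Column names follow the table on the slide.
FRU_FIELDS = ("part_number", "fru_type", "fru_location",
              "serial_number", "ecid", "ccin")

# raw_data from the example alert, embedded as text.
RAW_DATA = (
    '{"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },'
    '{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },'
    '{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },'
    '{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },'
    '{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }",'
    '"nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H",'
    '"pwr_enc":"78AC-100BC50052",'
    '"eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log",'
    '"encl_mtms":"9125-F2C/028B506"}'
)


def parse_cnm_fru_list(fru_list):
    """Turn '{ a,b,c,d,e,f },{ ... }' into a list of field dicts."""
    frus = []
    for body in re.findall(r"\{([^{}]*)\}", fru_list):
        values = [value.strip() for value in body.split(",")]
        if len(values) != len(FRU_FIELDS):
            raise ValueError("unexpected FRU entry: %r" % body)
        frus.append(dict(zip(FRU_FIELDS, values)))
    return frus


frus = parse_cnm_fru_list(json.loads(RAW_DATA)["fru_list"])
```

The list order is preserved, which matters because the recommendation text says to replace the FRUs in the order listed.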
Example CNM Compound Alert
>[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BDFF0060" -w
rec_id : 13304
alert_id : BDFF0060
creation_time : 2011-08-26 19:02:53.971854
severity : W
urgency : O
event_loc : FR052-CG04-SN001-DR0
event_loc_type : H
fru_loc : None
recommendation : A large number of HFI network links attached to a drawer are down without an accompanying power event.
Contact IBM Service and report the alert ID.
If a drawer lost power, then this is a secondary effect.
reason : Drawer level event occurred on frame FR052 cage CG04 (superNode SN001 drawer DR0). (Suspicious Drawer)
src_name : CnmEventAnalyzer
state : 1
raw_data : {"fru_list":"{ HFI_IDR,Isolation Procedure,,,, }","nbr_loc":"FR052-CG04-SN001-DR0-HB7-OM09-LD09","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B5D6"}
Condition Alerts: []
Condition Events: [32873,32874,32875,32876,32877,32878,32879,32880,32881,32882,32883,32884]
Duplicate Alerts: []
Suppression Alerts: []
Suppression Events: []
Example CNM Alert with suppression
>[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BD700022" -w
rec_id : 8507
alert_id : BD700022
creation_time : 2011-08-11 14:39:00.244292
severity : E
urgency : S
event_loc : FR052-CG10-SN007-DR0-HB3-OM09-LD09
event_loc_type : H
fru_loc : None
recommendation : There is a problem with a D-Link.
Record the alert ID and call IBM Service.
Log on to the Management Server.
To isolate to the proper FRU, run Link Diags and perform the actions that it recommends.
If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
reason : D Link Port Lane Width Change between frame FR052 cage CG10 (superNode SN007 drawer DR0) hub HB3 port LD09 and frame FR052 cage CG09 (superNode SN006 drawer DR0) hub HB3 port LD08 (D Link Port Lane Width Change)
src_name : CnmEventAnalyzer
state : 1
raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.30CK001-P1-T14-T1,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.312N005-P1-T14-T2,,, },{ 52Y3020,FRU,U78A9.001.30CK001-P1-R5,YA193P400322,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.312N005-P1-R5,YA193N035309,ABC123,TRMD }","nbr_loc":"FR052-CG09-SN006-DR0-HB3-OM08-LD08","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B5F6"}
Condition Alerts: []
Condition Events: [26388]
Duplicate Alerts: [8511]
Suppression Alerts: []
Suppression Events: [26389,26390]
Example CNM Event
>[c250mgrs52]>/opt/teal/bin/tllsevent -f text -q "event_id=BD700025" -e
rec_id : 22877
event_id : BD700025 - D Link Port Down
time_occurred : 2011-08-01 14:52:14
time_logged : 2011-08-01 14:52:14.369687
src_comp : CNM
src_loc : FR052-CG07-SN004-DR0-HB0-OM14-LD14
src_loc_type : H
rpt_comp : CNM
rpt_loc : c250mgrs52##cnmd
rpt_loc_type : A
event_cnt : None
elapsed_time : None
ext.eed_loc_info : c250mgrs52:/var/opt/isnm/cnm/log
ext.encl_mtms : 9125-F2C/028B596
ext.global_counter : None
ext.isnm_raw_data : REG_BEGIN ISR_GLOBAL_COUNTER_REGISTER = 0x000005347ecda480 ISR_ID_REGISTER = 0x004800d01c000000 ISR_D14D15_FIR = 0x4000000000000000 D_PORT_14_SEND_NEIGHBOR_ID = 0x000800d01ee00000 OLL_LLD14_LINK_STATUS = 0xc1d6000100000000 REG_END
ext.local_om1 : U78A9.001.30CM002-P1-R2-R1,52Y3020,YA193P407777,ABC122,TRMD
ext.local_om2 :
ext.local_planar : U78A9.001.30CM002-P1,74Y0601,YH10HA0BH002,ABC122,2E00
ext.local_port : U78A9.001.30CM002-P1-T17-T7
ext.local_torrent : U78A9.001.30CM002-P1-R2,52Y3020,YA193P407777,ABC123,TRMD
ext.nbr_om1 : U78A9.001.30CK001-P1-R2-R4,52Y3020,YA193P399201,ABC123,TRMD
ext.nbr_om2 :
ext.nbr_planar : U78A9.001.30CK001-P1,74Y0601,YH10HA0BJ003,ABC123,2E00
ext.nbr_port : U78A9.001.30CK001-P1-T15-T8
ext.nbr_torrent : U78A9.001.30CK001-P1-R2,52Y3020,YA193P399201,ABC123,TRMD
ext.neighbor_loc : H: FR052-CG04-SN006-DR0-HB0-OM11-LD11
ext.pwr_ctrl_mtms : 78AC-100BC50052
ext.recovery_file_path : /var/opt/isnm/cnm/log
SFP Connector
Uses RMC and xCAT monitoring support
Retrieves batches of events from HMC
[Diagram: FSP, HMC, TEAL]
[c250mgrs14][/]> nodels hmc
c250hmc05_a
[c250mgrs14][/]> lscondresp
Displaying condition with response information:
Condition Response Node State
"AllServiceableEvents_HB" "TealLogSfpEvent_HB" "c250mgrs14" "Active"
SFP Event
rec_id : 8490
event_id : B1812A80
time_occurred : 2011-04-20 09:57:41
time_logged : 2011-04-20 09:58:46.187401
src_comp : SFP
src_loc : U9125.F2C.P7IH165
src_loc_type : P
rpt_comp : 7042CR5/KQZAAAT
rpt_loc : c250hmc05.ppd.pok.ibm.com##AllServiceableEvents_B
rpt_loc_type : A
event_cnt : None
elapsed_time : None
ext.call_home : N
ext.description : Platform firmware (0x81) reported an error.
ext.fru_list : [['FSPSP04', 'ACT04219I Isolate procedure', '', '', '', ''], ['45D7208', 'ACT04216I FRU', 'U78A9.001.1122233-P1-R5', 'YH30HA022005', '', '2A3A'], ['FSPSP06', 'ACT04219I Isolate procedure', '', '', '', '']]
ext.prob_num : 320
ext.sfp_raw_data : {'FRURecentlyReplaced': ['No', 'No', 'No'], 'FRULogicControllingCECMachineSerialNumber': ['P7IH165', 'P7IH165', 'P7IH165'], 'HSCBiosName': 'KQZAAAT', 'CreatedTimeStamp': '04/20/2011 06:16:49', 'CECMachineModel': 'F2C', 'FDAdditionalMachine': ['9125-F2C-P7IH165'], 'EventType': 'open', 'SystemRefCode': 'B1812A80', 'CreatorID': 'E', 'FRUEnclosureMachineSerialNumber': ['P7IH165', 'P7IH165', 'P7IH165'], 'FRUEnclosureMachineTypeModel': ['9125-F2C', '9125-F2C', '9125-F2C'], 'DuplicateCount': '0', 'EventSeverity': '32', 'CECMachineType': '9125', 'SubsystemID': '129', 'FRULogicControllingCECMachineTypeModel': ['9125-F2C', '9125-F2C', '9125-F2C'], 'CalledHome': 'No', 'FRUReplacementPriority': ['80', '50', '25'], 'CECMachineSerialNumber': 'P7IH165', 'LastReportedTimeStamp': '04/20/2011 06:16:49', 'HSCBiosId': '7042CR5', 'PlatformLogID': '1346333000'}
SFP Alert
rec_id : 8040
alert_id : 14020079
creation_time : 2011-05-17 12:58:58.661058
severity : E
urgency : N
event_loc : U9458.100.BPCF007
event_loc_type : P
fru_loc : None
recommendation :
reason : Power/Cooling subsystem & control (0x60) reported an error.
src_name : SFPEventAnalyzer
state : 1
raw_data : {'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], ['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], 'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}
SFP FRU list format in alerts
raw_data : {'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], ['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], 'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}
• Multiple FRUs, each one contained in brackets
• Fields: part number, FRU type, FRU location, part serial number, ECID, CCIN
Part Number | FRU type | FRU location | Part Serial Number | ECID | CCIN
IQYRISC | Isolate Procedure | | | |
PU_BOOK | FRU | U78A9.001.1122233 | | |
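Unlike the CNM alerts, the SFP raw_data shown here uses single quotes, so it reads as a Python literal rather than as JSON. A minimal sketch, assuming the field always has this shape, is to parse it with ast.literal_eval; the function and field names below are illustrative only.

```python
import ast

# Column names follow the table on the slide.
FRU_FIELDS = ("part_number", "fru_type", "fru_location",
              "serial_number", "ecid", "ccin")

# raw_data from the example SFP alert, embedded as text.
RAW_DATA = (
    "{'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], "
    "['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], "
    "'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}"
)


def parse_sfp_raw_data(text):
    """Return (SFP host, problem number, list of FRU dicts)."""
    # literal_eval only accepts Python literals, so it does not execute code.
    data = ast.literal_eval(text)
    frus = [dict(zip(FRU_FIELDS, entry)) for entry in data["FRU List"]]
    return data["SFP"], data["Problem Number"], frus


sfp_host, problem_number, frus = parse_sfp_raw_data(RAW_DATA)
```

json.loads would reject this text because of the single quotes, which is why the sketch does not reuse the JSON approach that fits the CNM data.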
LoadLeveler Connector
New support for LoadLeveler 5.1
DB table polling via TEAL connector daemon
LoadLeveler must be configured to use the DB
[Diagram labels: teal_ll, TLL_Raslog, teal]
[root@c250mgrs20-pvt log]# service teal_ll status
[ OK ]
loadleveler.py (pid 17583) is running...
[c250mgrs14][/]> lssrc -s teal_ll
Subsystem Group PID Status
teal_ll 5701830 active
LoadLeveler Alert
===================================================
rec_id : 9
alert_id : LL001000
creation_time : 2011-05-19 13:26:34.559391
severity : E
urgency : N
event_loc : z25c4s12.ppd.pok.ibm.com
event_loc_type : A
fru_loc : None
recommendation : Call next level of support
reason : LoadL_schedd on machine z25c4s12.ppd.pok.ibm.com is down.
src_name : LLEventAnalyzer
state : 1
raw_data :
LL alert_id:
LL0010xx = Daemon down
LL0020xx = Job failures
PNSD Connector
Multi-tiered configuration through service nodes using RMC and xCAT monitoring support
Uses pnsd_stat command to get statistics
May cause jitter on compute nodes, so may not be enabled in all cases
[Diagram: Compute, Svc Node, TEAL]
xcatmn2:~ # lscondresp
Displaying condition with response information:
Condition Response Node State
"TealAnyNodePnsdStat_H" "TealLogPnsdEvent_H" "xcatmn2" "Active"
PNSD Alert
===================================================
rec_id : 12
alert_id : PNSD0001
creation_time : 2011-01-26 23:03:40
severity : E
urgency : N
event_loc : compute37##TealPnsdStat
event_loc_type : A
fru_loc : None
recommendation : Call next level of support
reason : Packet retransmit threshold has been exceeded on node compute37
src_name : PNSDEventAnalyzer
state : 1
raw_data : 0.046
PNSD alert_id:
PNSD0001 = Retransmit threshold exceeded
Installation
Packaging
Multi-platform
– AIX: installp
– Linux: RPM
Base
– Pipeline
– Base services
  • Logging
  • DB access
  • Configuration
  • Locations
– Rules engine
– Common filters/listeners
– Command line
– xCAT extensions
Component
– Connector library/program
– Rules
– Alert/event metadata
– Extension data format
– User-specific filters/listeners
– Configuration file
[Diagram: TEAL Base with component packages ISNM, GPFS, PNSD, LL, Service Focal Point, ...]
Configuration Files
Stanza-based
Used during startup (/etc/teal)
Separate files per package (teal.conf => base framework features)
Configures processing pipeline
Additional parameters for specialized function
Enabled in different modes
[alert_listener.RmcAlertListener]
class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
enabled = false
[alert_listener.FileAlertListener]
class = ibm.teal.listener.file_alert_listener.FileAlertListener
enabled = historic
filters = DuplicateAlertFilter
format = text
file = /var/log/teal/cluster_alert.log
mode = write
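Because the files are stanza-based, a standard INI parser can be used to inspect which listeners a given mode would enable. The sketch below is not how TEAL itself loads its configuration; the function name is hypothetical, and treating the value "all" as "enabled in every mode" is an assumption that goes beyond the two stanzas shown.

```python
import configparser

# The two stanzas from the slide, embedded as text.
CONF_TEXT = """\
[alert_listener.RmcAlertListener]
class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
enabled = false

[alert_listener.FileAlertListener]
class = ibm.teal.listener.file_alert_listener.FileAlertListener
enabled = historic
filters = DuplicateAlertFilter
format = text
file = /var/log/teal/cluster_alert.log
mode = write
"""


def listeners_for_mode(text, mode):
    """Return the names of alert listeners enabled for the given mode."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    active = []
    for section in parser.sections():
        if not section.startswith("alert_listener."):
            continue
        enabled = parser.get(section, "enabled", fallback="false")
        # "all" as a catch-all value is an assumption, not shown on the slide.
        if enabled.strip().lower() in (mode, "all"):
            active.append(section.split(".", 1)[1])
    return active
```

With the stanzas above, listeners_for_mode(CONF_TEXT, "historic") reports only FileAlertListener, since the RMC listener is disabled.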
Adding an e-mail listener
Add the definition where TEAL will pick it up:
– Add to base configuration file (/etc/teal/teal.conf)
– Add in a file in the configuration directory (/etc/teal/my.conf)
– For temporary use: copy conf file(s) to own directory, modify, and use during historic analysis (more often for writing out alerts)
[alert_listener.SmtpAlertListener]
class = ibm.teal.listener.smtp_alert_listener.SmtpAlertListener
enabled = realtime
filters = DuplicateAlertFilter
server = ems1234.cluster.net
(recipient and sender e-mail address settings are obscured in the source)
Directory Structure
[Diagram: directory tree under / showing /opt/teal (with subdirectories /data, /ibm, /bin), /usr/lib and /etc/teal. Callouts: start-up configuration (default); libraries; code; component rules & metadata, location, and extended data definitions.]
When Things Go Wrong
/var/log/teal has TEAL logs (default)
On AIX look at the console (alog -t console -o)
Note the following important fields, with their TEAL and SFP equivalents:
– TEAL alert_id, SFP refcode
– TEAL src_loc, SFP reporting MTMS
– TEAL reason, SFP problem description
– FRU list in TEAL and SFP
Specific alert data or range (text format)
– /opt/teal/bin/tllsalert -f text -q "[query to narrow down]"
– -f json or -f csv can be handier for grepping out certain records
– -d to show duplicates
Specific event data or range (text, with extended and raw data)
– /opt/teal/bin/tllsevent -f text -e -r -q "[query to narrow down]"
– -f json or -f csv can be handier for grepping out certain records
– -x to show which alerts it is associated with
When Things Go Wrong (continued)
Data dump
– /opt/teal/sbin/tltab -d -p <path to dump file>
– Restore:
  • /opt/teal/sbin/tltab -c # Drop and recreate the tables
  • /opt/teal/sbin/tltab -r -p <path to returned file> # Restore the tables with the user data
See TEAL on SourceForge (pyteal.sourceforge.net)
Look at the service pack for known issues, hints/tips, etc.
– http://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs
References
TEAL SourceForge Project: http://pyteal.sourceforge.net
– Command reference
– Install/Configuration Instructions
– Design Overview & other goodies
– Mailing List
– Problem Tickets
xCAT HPC Software Installation
– http://sourceforge.net/apps/mediawiki/xcat/index.php?title=IBM_HPC_Stack_in_an_xCAT_Cluster
– LoadLeveler
– GPFS
– RSCT/RMC
Cluster Guide
– https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+HPC+Clustering+with+Power+775+-+Cluster+Guide
Cluster Service Pack readme
– https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs