
Quality health plans & benefits · Healthier living · Financial well-being · Intelligent solutions

Alexander Norris

Video

How Aetna Uses Splunk and Prelert to Improve the Consumer Directed Healthcare Experience


Our Values · Splunk · Prelert · Path · Cool


©2014 Aetna Inc.

Our values drive us to do better… for people


History

Founded in 1853

$47.2 billion revenue in 2013

23.1 million medical members


Our  healthy:  

Health  care  built  around  people  

“Our healthy as a company is to find solutions that help people live healthier lives and to help them manage their health versus their health care.”

—  Mark  Bertolini,  CEO,  Aetna  


Our Analytics Journey

The level of analytics maturity depends on how much of the decision process is automated and how much is left for human intervention.

Descriptive analytics (dashboards and query/drill-down tools): the intelligence is entirely left to the human.

Advanced analytics: more of the intelligence is automated.

Moving from descriptive analytics to more advanced analytics gets us closer to decisions and actions (i.e. direct business impact).


Our key to open machine data is Splunk

Analytics Across Users and Silos, Creating a Single ‘Glass Pane’

Platforms: Websphere, IIS, IHS, Logviewer, BPM, DataPower, UDB, MQ, Pure, San, Omnibus, Windows, VDI, Esx, ICE, Proxy, Netscaler, SAN, Avaya, Cloud, Linux, F5, AIX, IRONMAIL, Mainframe (CICS, DB2, WAS)

Products: ASD, ATV, AVA, NAV, Docfind (FAST/DSE), Lifesuite, Workability, IQS, IUS, Incedo, QRS, APMCAS, AQC, Dynamo, NICE


Our Prelert Splunk Application

• Data across Splunk made available in Prelert
• Let the machine learn what is ‘normal’
• Create actionable anomalies
• People intersect with other technologies as Prelert maps together the picture
• Dedicated Search Heads mean workloads outside of conventional Splunk usage

Diagram labels: Prelert, WAS/JAVA, I.H.S, Compuware Vantage, z/OS CICS, COBOL, Datapower, MQ, DB2/UDB


Source: Prelert

Our Prelert Splunk Application


Our Prelert Splunk Application Tactics

• Retrospective (auto-detect): a Prelert UI for writing searches against any log(s) that already exist in Splunk and identifying anomalies
• Real-time: Prelert searches are pre-defined and set up to run at regular intervals to establish and adjust baselines on metrics

Transforms this to a handful of meaningful anomalies per day
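The idea of establishing a baseline on a metric and flagging departures from it can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a simple trailing-window z-score, which is not Prelert's algorithm, and the function name, window size, threshold, and sample counts are all hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], window: int = 12,
                   threshold: float = 3.0) -> List[int]:
    """Return the indexes of points that deviate from a trailing baseline.

    The baseline for each point is the mean and sample standard deviation
    of the previous `window` points. A point is flagged when it lies more
    than `threshold` standard deviations from that mean.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all departs from the baseline.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical per-interval error counts; the spike at index 14 is invented.
error_counts = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 5, 7, 6, 48, 6, 5]
print(find_anomalies(error_counts))
```

Because the baseline is recomputed from the most recent window at every step, it adjusts as the metric drifts, which is the behaviour the real-time bullet describes in general terms.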


Our Prelert Splunk Application: Continuous Improvement Approach

Customer

Plan: Identify an opportunity and plan for predictive analytics (anomaly detection)

Do: Implement the search on a small scale in auto-detect mode

Check: Use data to analyze the results of the search and determine whether it identified anomalies

Act: If the search was successful, implement it in real time on a wider scale for continuous benchmarking and alerting. If the search needs improvement (i.e. addition/exclusion of new metrics), begin the cycle again


Reactive Path: Case Study

An outage to the flagship self-service application where source and impact were unknown

Starting at 4:10 PM, a response team was activated as Navigator transactions hung. Self-registration JVMs crashed. ODR failure followed

At 6:30 PM the Navigator UDB instance was recycled, ultimately allowing the return to stability


Reactive Path: Focal Point / Transaction Alert Event

The first wave of misbehavior began at 4:10 PM when self-registration failed

JVM Request : *
Method : N/A
SQL : N/A
Resource : Resident Time
Trap Condition : Application
Offending Content : /registration/MyAssist/ListResults.jsp
Threshold : 120,000 milliseconds
MaxMin : Maximum
Number of Hits : 7
Offending Value : 180,029 milliseconds
Severity : Low

 


Reactive Path: Failure Log Output

[1/13/14 16:08:14:169 EST] 00000ae4 TCPChannel    W   TCPC0004W: TCP Channel TCP_4 has exceeded the maximum number of open connections 100.
[1/13/14 16:09:09:971 EST] 0000002e ThreadMonitor W   WSVR0605W: Thread "WebContainer : 2" (00000040) has been active for 395389 milliseconds and may be hung.  There is/are 1 thread(s) in total in the server that may be hung.
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at com.ibm.db2.jcc.t4.z.b(z.java:199)
        at com.ibm.db2.jcc.t4.z.c(z.java:289)
        at com.ibm.db2.jcc.t4.z.c(z.java:402)
        at com.ibm.db2.jcc.t4.z.v(z.java:1170)
        at com.ibm.db2.jcc.t4.ab.c(ab.java:137)
        at com.ibm.db2.jcc.t4.b.Wc(b.java:1308)
        at com.ibm.db2.jcc.t4.b.b(b.java:1227)
        at com.ibm.db2.jcc.t4.b.a(b.java:5983)
        at com.ibm.db2.jcc.t4.b.c(b.java:792)
        at com.ibm.db2.jcc.t4.b.b(b.java:735)
        at com.ibm.db2.jcc.t4.b.a(b.java:402)
        at com.ibm.db2.jcc.t4.b.<init>(b.java:331)
        at com.ibm.db2.jcc.DB2PooledConnection.<init>(DB2PooledConnection.java:84)

Log file analysis in Splunk showed the threads hung in DB2 and connection limits reached
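The hung-thread warnings in the log output above have a regular shape, so they can be pulled out with a regular expression. The sketch below is a hypothetical illustration of that kind of extraction outside Splunk. The pattern is written against the two sample lines from this slide only, and the names `HungThread` and `hung_threads` are invented for the example.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class HungThread(NamedTuple):
    timestamp: str
    thread_name: str
    active_ms: int


# Pattern derived from the WSVR0605W sample line on this slide; other
# WebSphere versions or locales may format the message differently.
HUNG_THREAD = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+\S+\s+ThreadMonitor\s+W\s+WSVR0605W:\s+'
    r'Thread "(?P<name>[^"]+)" \(\w+\) has been active for '
    r'(?P<ms>\d+) milliseconds'
)


def hung_threads(lines: Iterable[str],
                 min_active_ms: int = 0) -> Iterator[HungThread]:
    """Yield one record per hung-thread warning at or above min_active_ms."""
    for line in lines:
        match = HUNG_THREAD.search(line)
        if match and int(match.group("ms")) >= min_active_ms:
            yield HungThread(match.group("ts"), match.group("name"),
                             int(match.group("ms")))


sample = [
    '[1/13/14 16:08:14:169 EST] 00000ae4 TCPChannel    W   TCPC0004W: '
    'TCP Channel TCP_4 has exceeded the maximum number of open connections 100.',
    '[1/13/14 16:09:09:971 EST] 0000002e ThreadMonitor W   WSVR0605W: '
    'Thread "WebContainer : 2" (00000040) has been active for 395389 '
    'milliseconds and may be hung.',
]

for record in hung_threads(sample):
    print(record.timestamp, record.thread_name, record.active_ms)
```

Only the second sample line matches; the connection-limit warning is a different message ID and would need its own pattern.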


Reactive Path: Failure SQL Deficiency Isolation

These SQL integrity constraint violations communicated resource contention


Reactive Path: Failure

We continued to understand the connection interactions with this shared DB resource within Tivoli. This picture is the DB connection escalation pattern


Reactive Path: Ethernet Traffic Sniff, DB Isolation (Compuware DCRUM)

These are DB views from watching traffic to the database. They helped visualize DB interactions


Reactive Path: Conflict Detection in Splunk

2014-01-21-12.54.05.315574-300 E14573A538          LEVEL: Warning
PID     : 10158854             TID  : 75911        PROC : db2sysc 0
INSTANCE: navudbp1             NODE : 000          DB   : DSRGP000
APPHDL  : 0-9855               APPID: IP.HERE.54686.140121173034
AUTHID  : LRTUSER
EDUID   : 75911                EDUNAME: db2agent (DSRGP000) 0
FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:3
MESSAGE : ADM5502W  The escalation of "700" locks on table "GSRGP00D.T007000"
          to lock intent "X" was successful.

2014-01-21-12.56.02.803084-300 I15112A544          LEVEL: Error
PID     : 10158854             TID  : 84770        PROC : db2sysc 0
INSTANCE: navudbp1             NODE : 000          DB   : DSRGP000
APPHDL  : 0-7410               APPID: IP2.HERE.44561.140121151200
AUTHID  : LRTUSER
EDUID   : 84770                EDUNAME: db2agent (DSRGP000) 0
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
DATA #1 : Hexdump, 2 bytes
0x070000019BFF4D42 : 0036                                  .6

• DB log output communicated the conflict to SME
• Data showed contention caused by overlap of online and batch systems
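The lock escalation message in the DB log output above names the table, the number of locks, and the lock intent. As a rough sketch of how those fields could be extracted from raw log text, the following uses a regular expression written against the single ADM5502W sample on this slide. The function name `lock_escalations` is hypothetical, and this is not the Splunk search the team used.

```python
import re
from typing import List, Tuple

# Pattern based on the ADM5502W sample on this slide. The message wraps
# across lines in the log, so whitespace is matched with \s+.
ESCALATION = re.compile(
    r'ADM5502W\s+The escalation of "(?P<locks>\d+)" locks on table '
    r'"(?P<table>[^"]+)"\s+to lock intent "(?P<intent>\w+)" was successful'
)


def lock_escalations(log_text: str) -> List[Tuple[str, int, str]]:
    """Return (table, lock_count, intent) for each escalation message."""
    return [
        (m.group("table"), int(m.group("locks")), m.group("intent"))
        for m in ESCALATION.finditer(log_text)
    ]


sample_log = '''
FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:3
MESSAGE : ADM5502W  The escalation of "700" locks on table "GSRGP00D.T007000"
          to lock intent "X" was successful.
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
'''

for table, locks, intent in lock_escalations(sample_log):
    print(f"{table}: {locks} locks escalated to intent {intent}")
```

An escalation to intent "X" means the table is held exclusively, which is consistent with the contention between online and batch work that the slide reports.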


Reactive Path: Tivoli Enterprise Portal (TEP)

Memory and CPU doubled for the batch server, so jobs did not run within their timeframes


Reactive Path: Summary

• No hardware or infrastructure software issues were found on any of the Websphere, Frameworks, or DB servers
• The team determined that lock waits for DB2 came from a batch system, which caused the crash of several online JVMs
• DBA Support implemented a monitoring change to capture database lock waits in excess of 20 seconds. This also captures lock timeouts and deadlocks
• DBA and AD are reviewing SQL, adding “WITH UR” to increase concurrency where needed

FROM COMMAND LINE TO SPLUNK = FASTER INFORMATION


Predictive Path: Aetna Voice Advantage (AVA)

• Award-winning natural voice recognition system
• Customers opting out to “live” service reps due to delays in backend systems
• Defined the unknown: applied Prelert machine learning to understand ‘normal’ events from many platforms and products

This was impossible before


Predictive Path: AVA Continuous Improvement

Timeline: December, January, February, March

Prelert Anomaly Recognition with AVA in the I.H.S/ODR/DataPower System Out Data

Manual isolation and identification of interdependencies beyond a single silo
• Direct Results:
  • Coding defect in AVA
  • Datapower firmware deficiency (IBM Product Request)
  • Limitation of CICS job modified
• Indirect Results:
  • Created a working team comprised of affected components
  • Integrated into existing specialized alerting to enhance process

Explored auto-learn dynamic correlation between silos, which enhanced insight and fed continued research
• Direct Results:
  • An association between CICS job failures and AVA OPT OUTS is recognized
• Indirect Results:
  • Understanding the tool. A lot of cross-discipline teamwork.

Implemented enhanced correlated search for AVA
• Direct Results:
  • A new live search will consume a larger set of machine data, enhancing anomaly detection.
• Indirect Results:
  • The anomaly detection will initiate a deeper knowledge set, enhancing results.


Predictive Path: AVA Continuous Improvement, Anomaly Detection Results

This is the aggregate dump time for CICS zone (z/OS) online transaction impacting

April: 17.5 minutes
May: 8 minutes
June: 2.5 minutes
July: 30 seconds

System Dump Counts by Month
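To put the monthly dump times above on a common scale, the figures can be converted to seconds and compared. This is a small worked calculation using only the four values from the slide; the helper name `percent_reduction` is made up for the example.

```python
# Aggregate dump time per month from the slide, converted to seconds.
dump_seconds = {
    "April": 17.5 * 60,
    "May": 8 * 60,
    "June": 2.5 * 60,
    "July": 30,
}


def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


months = list(dump_seconds)
# Month-over-month change.
for previous, current in zip(months, months[1:]):
    change = percent_reduction(dump_seconds[previous], dump_seconds[current])
    print(f"{previous} -> {current}: {change:.1f}% less dump time")

# Overall change across the four months.
overall = percent_reduction(dump_seconds["April"], dump_seconds["July"])
print(f"April -> July: {overall:.1f}% less dump time")
```

From April to July the aggregate falls from 1,050 seconds to 30 seconds, a reduction of roughly 97%.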


Predictive Path: AVA Continuous Improvement, Anomaly Detection Results

Weekly count of timeouts for the anomalous AVA queue Q151P.P001.NTWKCATINQ.ONLINE.AVA.REQ


Predictive Path: AVA Continuous Improvement, Anomaly Detection Results

The number of critical events resulting from synthetic Hammer transactions for AVA has dropped off, from over 400 per month to single digits (3 and 8 in June and July)


Predictive Path: Results Summary

• Improved application resiliency for applications beyond the targeted systems (AVA, NAV, ACAS, OPT)
• Lowered ‘stop the world’ dump activity in the CICS region from over 17 minutes in April to 30 seconds in July
• CICS region stability resulted in fewer backend timeouts in the Datapower platform
• Prelert anomalies got people paying attention, added legitimacy to the engagement, and prompted engagement to answer the question of ‘why?’
• The more transactions that are self-serviced, the lower the cost to service


Cool: Hackathon & Datathon

Two contests were held this year, igniting the spirit of innovation. All employees and any company that works with Aetna were invited to the challenge

• The first was a ‘Happy New Year Hackathon’; the challenge was to create an original solution to better people’s health
• The most recent challenge, ‘Big Datathon’, was to create a prediction algorithm for claims costs for our international business. Hadoop!


Cool: Hunk & Other Stuff

Splunk’s Hunk is in our lab. Possibilities are endless

Other Stuff:

Top Products & Platform Resiliency Effort
OOM/Coding Defects (Dev Ops feedback)
Mainframe data (Syncsort)


Thank  you