Automating IT Analytics to Optimize Service Delivery and Cost at Safeway - A #GartnerDC Presentation
Copyright © 2012 TeamQuest Corporation. All Rights Reserved.
TeamQuest and the TeamQuest logo are registered trademarks in the US, EU and elsewhere. All other trademarks and service marks are the property of their respective owners.
Automating IT Analytics to Optimize Service Delivery and Cost at Safeway
David Wagner – TeamQuest Advocate
Chris Lynn – Safeway Capacity Manager/Performance Analyst
December 11, 2013
Agenda
• TeamQuest Perspectives
• Safeway Experiences
Desired State: Continuous Optimization
• Continuously financially optimized IT environment
– Always know where and when performance problems will affect the bottom line
– Identify cost and performance inefficiencies in support of business processes and eliminate them
• Continuously optimized customer experience
– Understand when, where and why customer experiences fail
– Resolve, predict and prevent customer dissatisfaction issues
Continuous IT Optimization Results
• Significantly reduce initial CapEx and ongoing OpEx
– Make, and keep making, more money!
• Optimize resources for systems of customer engagement
• Deploy and refresh new applications faster
– e.g. retailers need to capture their share of mobile commerce as it grows from $6B to $31B (2016)
• Respond faster to business spikes
• Prevent business-impacting outages and slowdowns
How: Aligned Business and IT Analytics
Aligned Business and IT Intelligence
• Correlate business & IT performance
• Insight into how business process changes impact IT
• Understand and optimize IT costs by business unit/process and technology
• Insight into business performance across technology stack
[Diagram: Enterprise IT Optimization. Labels: Business Intelligence, Big Data Collection, Applications, Services, Outage Management, Customer Operations, Distribution Automation, Asset Management, Underlying IT Infrastructure, Server/OS, Storage, Network]
TeamQuest’s Approach: Federated IT Analytics
• Federates existing data/information into a purpose-designed optimization process
– Technology data (server, network, storage, etc.)
– Service data (catalog, metrics, tickets, etc.)
– Financial data
– Business data (analytics, KPIs, plans, TXNs, etc.)
• Automates IT analytics across all data sources
– Flexible and adaptive to dynamic environments
– Raw (commodity) data -> actionable information for IT
• Single-pane-of-glass IT Optimization
Result: Automated Application Financial Optimization
• Continuous Optimization
– Pre-purchase validation
– Re-purposing
– Consolidation
• Fully automated, low cost
• Integrated with Risk and Service Management
• Changed new VMware clusters from every 6 weeks to:
– None for 18+ months…
– Consolidated 1000s of VMs (saving $M)
Result: IT Optimized... Future Assured
• Continuous IT Optimization
– Peak IT Performance
– Ideal Resource Capacity
– Optimized Resource Costs
• Automated IT Analytics
– Predictive
– Federated
• Aligned IT and Business Management
– Performance
– Capacity
– Financial
Automating IT Analytics to Optimize Service Delivery and Cost at Safeway
Chris Lynn – Safeway
December 11, 2013 2:30-3:00
Topics
• Background
• Server Storage Forecasting and Optimization
• Application Capacity Analysis – Dashboards to Details
• Business KPI Analytics
• VMware High Level Analytics
Background
• Manager of Safeway Capacity and Performance Team
– [email protected]
– http://www.linkedin.com/pub/chris-lynn/2/65/309/
• Environment Supported
– ~4000 servers (~1700 physical)
– ~200 significant applications
– Unix, Windows, Mainframe, Teradata, Tandem, etc.
– Thousands of internal IT customers, and millions of shoppers
Server Storage Forecasting and Optimization
• Optimizing Availability (reducing incidents)
• Optimizing Enterprise Capacity
• Reducing Risk
• Automated replacing Manual
• Embedded expertise
Storage Capacity Incident Avoidance: Old (on server) Manual Method
$ df
Filesystem                          size  used  avail  capacity  Mounted on
/dev/vx/dsk/rootvol                 3.9G  2.5G  1.4G   65%       /
/dev/vx/dsk/var                     1.9G  832M  1.0G   45%       /var
swap                                9.5G  16K   9.5G   1%        /var/run
swap                                1.0G  2.6M  1021M  1%        /tmp
/dev/vx/dsk/patrol                  1.9G  1.5G  227M   88%       /appl/patrol
/dev/vx/dsk/home                    486M  347M  91M    80%       /export/home
/dev/vx/dsk/openv                   1.4G  583M  717M   45%       /usr/openv
/dev/vx/dsk/performdg/usrlocal      1.9G  542M  1.3G   29%       /usr/local
/dev/vx/dsk/performdg/oracle        3.9G  1.1G  2.8G   28%       /appl/oracle
/dev/vx/dsk/performdg/apache        128M  27M   95M    22%       /appl/apache
/dev/vx/dsk/rootdg/opswarelv        241M  234K  216M   1%        /var/opt/opsware
/dev/vx/dsk/performdg/b1home        12G   432M  11G    4%        /appl/perform/best1home
/dev/vx/dsk/performdg/spool_apache  256M  145M  104M   59%       /appl/spool/apache
/dev/vx/dsk/performdg/manage        180G  122G  54G    70%       /appl/perform/manager
/dev/vx/dsk/performdg/workspace     90G   59G   29G    68%       /appl/perform/workspace
/dev/vx/dsk/performdg/collect       480G  442G  37G    93%       /appl/perform/collect
Automated Storage Forecasting: File System Exceptions
• Weekly automated prioritized scan
– 4500 servers
– 45000 filesystems
• Focused on meaningful exceptions
• A proactive shift from find to fix
– Was: 50 minutes looking for potential problems, 10 minutes to fix
– Now: 5 minutes looking for potential problems, 55 minutes fixing them
• Impossible to do manually
New Automated (global exception) File System Forecast Analytic Details
• Complex multi-level thresholds
1. Is file system utilization above 90% AND growing by >0.2% for the interval?
2. Is file system utilization above 75% AND growing by >2% for the interval?
3. Is file system utilization above 15% AND growing by >15% for the interval?
4. Is /appl/patrol above 90% AND growing for the interval?
• Individual exclusions and special cases
• Physical and virtual servers in the same report, but each can be treated uniquely
• Sorted by the date/time each file system is most likely to fill up
• Shows all candidates for a single server together (sorted by the highest one) to minimize the time for operations to respond
• Includes the historical trend, compared to just a point in time (e.g. df)
• Forecasts the utilization trend into the future (multiple statistical options)
• Requires more than 24 hours of data, to avoid temporary file systems
• Requires recent data, to avoid shut-down servers
• Final measured number must not be below the threshold
• A final number >99.5% is always reported, which catches the very full file systems that might not be growing
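The rules above can be sketched in Python. This is a minimal illustration, not the TeamQuest implementation: the names FsSample, is_exception and hours_until_full are hypothetical, growth is assumed to be measured in percentage points over the interval, the 48-hour staleness limit is an invented placeholder, and a simple linear extrapolation stands in for the "multiple statistical options" mentioned on the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Rules 1-3 from the slide: (utilization floor %, growth floor in points per interval).
GLOBAL_RULES = [(90.0, 0.2), (75.0, 2.0), (15.0, 15.0)]
# Rule 4: special case for one mount point; any positive growth above 90% counts.
SPECIAL_RULES = {"/appl/patrol": (90.0, 0.0)}
# Very full file systems are reported even when they are not growing.
ALWAYS_FLAG_ABOVE = 99.5


@dataclass
class FsSample:
    """One file system's measurements over the scan interval (hypothetical shape)."""
    server: str
    mount: str
    first_pct: float         # utilization at the start of the interval
    last_pct: float          # final measured utilization
    interval_hours: float    # how much history is available
    hours_since_last: float  # age of the most recent measurement


def is_exception(s: FsSample, max_staleness_hours: float = 48.0) -> bool:
    """Return True if the file system should appear in the exception report."""
    if s.interval_hours <= 24:                    # too little data: likely a temp FS
        return False
    if s.hours_since_last > max_staleness_hours:  # stale data: likely a shut-down server
        return False
    if s.last_pct > ALWAYS_FLAG_ABOVE:
        return True
    growth = s.last_pct - s.first_pct
    rules = list(GLOBAL_RULES)
    if s.mount in SPECIAL_RULES:
        rules.append(SPECIAL_RULES[s.mount])
    # The final measured number must itself be above the rule's utilization floor.
    return any(s.last_pct > floor and growth > min_growth
               for floor, min_growth in rules)


def hours_until_full(s: FsSample) -> Optional[float]:
    """Linear extrapolation to 100%; None if the file system is not growing."""
    rate = (s.last_pct - s.first_pct) / s.interval_hours
    if rate <= 0:
        return None
    return (100.0 - s.last_pct) / rate


def report(samples):
    """Exceptions sorted by soonest projected fill time; non-growing ones last."""
    flagged = [s for s in samples if is_exception(s)]
    return sorted(flagged, key=lambda s: (hours_until_full(s) is None,
                                          hours_until_full(s) or 0.0))
```

Keeping the thresholds in plain data (GLOBAL_RULES, SPECIAL_RULES) is one way to handle the "individual exclusions and special cases" without changing the evaluation logic.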
Executive Capacity Dashboards
[Dashboard: Capacity Risk Indicators. Status levels: Highly Stressed, Stressed, Well Used, Under Used]
Application Capacity Dashboards
[Chart: Capacity Risk Indicators, 0% to 100% per application. Applications: FastForward-PR, Corpcomm-PR, DataStg B2B, Crystal-PRAM, Autonomy-PR, EXE-PR, WLSJ-PR, WLVJ-PR, peoplesoft-PR, JOE-PR, Workbrain-PR, Corema-PR. Legend: Highly Stressed, Stressed, Well Used, Under Used]
Application Capacity Analysis
• Automated Application Triage
• All relevant metrics
• Embedded expertise
• Enterprise perspective of true capacity
[Chart: Capacity Risk Candidates (OS--# of systems), 0% to 100% per OS. Categories: AIX--252, Solaris--240, Linux--2181, Windows--1445, ESX Host--252. Legend: Under Used, Well Used, Stressed, Highly Stressed]
Integration With Business Metrics
System/Platform capacity data:
• Physical servers
• Virtual servers
• Tandem capacity systems
• Teradata capacity systems
• Datacenter facilities
Business perspective:
• Business transaction volumes
• Resource utilization
VMware Aggregate Capacity
Aggregate shows the worst individual status
VMware Aggregate Capacity
Aggregate shows the worst individual metric status
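A "worst status wins" roll-up like the one described above might look as follows in Python. This is only a sketch under stated assumptions: the Status enum reuses the four dashboard levels and assumes Highly Stressed is the worst, while host_status, cluster_status and the metric names are hypothetical and not taken from the presentation.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Dashboard levels, ordered so that a larger value means higher capacity risk."""
    UNDER_USED = 0
    WELL_USED = 1
    STRESSED = 2
    HIGHLY_STRESSED = 3


def host_status(metric_statuses: Dict[str, Status]) -> Status:
    """A host is as bad as its worst individual metric."""
    return max(metric_statuses.values())


def cluster_status(hosts: Dict[str, Dict[str, Status]]) -> Status:
    """The aggregate is as bad as its worst individual host."""
    return max(host_status(metrics) for metrics in hosts.values())


# Hypothetical example: one stressed metric on one host drives the aggregate.
example_cluster = {
    "esx01": {"cpu": Status.WELL_USED, "memory": Status.UNDER_USED},
    "esx02": {"cpu": Status.WELL_USED, "memory": Status.STRESSED},
}
print(cluster_status(example_cluster).name)
```

Taking the maximum rather than an average keeps a single stressed host or metric from being hidden by many idle ones, which matches the stated intent of showing the worst individual status.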
Lessons Learned / Value Gained
• Reduced service risk
• More proactive, less reactive
• Established a baseline to optimize capacity, and a mechanism to measure the progress
• Business and IT alignment
– Performance and capacity to the business
– Management and technical personnel
• Launch slowly in phases to not overwhelm the groups
• People really do care about formatting and color choice, not just content