Self Healing Storage for Performance Issues
Ramya Krishnamurthy, HPE
2019 Storage Developer Conference. © HPE. All Rights Reserved.



Abstract

While the storage sub-system should always be architected and designed for the calculated performance and multiple 9's of availability, what matters most is how fast storage services can be restored to the accepted level. Today, the majority of troubleshooting is reactive, aimed at bringing services back to the desired level. Restoration is heavily dependent on how quickly an administrator can diagnose, pinpoint, and remediate the issue. Storage service issues can be broadly categorized into availability issues and performance issues. This presentation aims at self-healing, or automated healing, of the storage sub-system for performance issues. The proposal is to read the workloads running on storage systems in different environments and chart read and write performance graphs over a prolonged period of time. It further intends to draw correlations between performance tunables and the performance graphs with respect to the workloads. Restoration of performance then happens either by recommending the exact tunables or in a fully automated way.


Agenda

- A day in the life of a storage administrator
- Current approach to resolving performance issues
- Proposed approach for automated remediation
- Example with HPE Storage


A day in the life of a storage administrator

- End user calls in about a slow application
- Storage admin spends hours or days in top-down troubleshooting
- Relies on knowledge bases, cross-functional teams, and tribal knowledge
- Relies on monitoring tools to troubleshoot the environment

Unfortunately, this has meant staff spending dozens of hours mining log files and interpreting graphs, all in an effort to gain some insight into the cause of a disruption so it could be resolved.


Current Approach

Troubleshooting is similar to a treasure hunt!

- Pinpoint that it is a storage issue
- Determine the congestion point
- Perform manual calculations to assign suitable cache, back-end IOPS, etc.
- Apply the performance tunables
- Monitor to see if there is an improvement
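The "manual calculations" step above typically means translating front-end IOPS into back-end IOPS using the RAID write penalty. A minimal sketch of that arithmetic, with illustrative workload numbers and standard RAID penalties (not figures from this talk):

```python
# Sketch of the manual back-end IOPS calculation an admin performs today.
# Workload numbers are illustrative; penalties are the usual RAID values.

RAID_WRITE_PENALTY = {"RAID1": 2, "RAID5": 4, "RAID6": 6}

def backend_iops(front_end_iops, read_fraction, raid_level):
    """Front-end reads map 1:1 to back-end reads; each front-end write
    costs the RAID level's write penalty in back-end I/Os."""
    reads = front_end_iops * read_fraction
    writes = front_end_iops * (1 - read_fraction)
    return reads + writes * RAID_WRITE_PENALTY[raid_level]

# Example: 10,000 front-end IOPS, 75% reads, on RAID5
print(backend_iops(10_000, 0.75, "RAID5"))  # 7500 reads + 2500*4 = 17500
```

Doing this by hand for every volume and RAID set is exactly the tedium the automated approach on the next slides tries to remove.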


Proposed approach - high level

- Enable data collection on similar systems deployed globally
- Collect performance measurements from these systems
- Perform correlation on the collected data
- Continuously learn from the global install base and improve
- Build a recommendation engine (using ML and AI) to prevent and address performance issues
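The "perform correlation" step can be sketched very simply: relate one tunable to one performance metric across a fleet of comparable systems. The data values below are invented for illustration; a real engine would control for model, configuration, and workload, as later slides describe.

```python
# Minimal sketch of correlating a tunable (flash cache size) with a
# performance metric (read latency) across a fleet. Data is made up.
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

flash_cache_gib = [0, 64, 128, 256, 512]      # tunable, one value per system
read_latency_ms = [9.0, 6.5, 4.8, 3.1, 2.2]   # observed metric per system

print(pearson(flash_cache_gib, read_latency_ms))  # strongly negative
```

A strongly negative coefficient here would tell the recommendation engine that, for this class of system and workload, growing the flash cache lowers read latency.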


Proposed solution approach

What is the role of machine learning?

Humans tend to learn from the past and from others' experience, then use that experience to do better in the future under similar circumstances!

Use machine learning to:

- Collect performance data from global systems and learn from it
- Automatically tune the system to deliver greater performance

Use predictive modeling to understand all the operating, environmental, and telemetry parameters within each system in the infrastructure stack.

Pair domain experts with AI to enable machine-learning algorithms to identify causation from historical events and predict the most complex and damaging problems.


Example of how we can improve small random read performance

There are four basic workloads that a storage array must be able to service:

1. Sequential writes
2. Random writes
3. Sequential reads
4. Random reads

The first three workloads (sequential writes, random writes, and sequential reads) all benefit from cache on a storage array controller, because the algorithms designed into the array software leverage the array cache to service them, improving their I/O and hence reducing the I/O latency experienced by a host.

The fourth workload, random reads, generally does not benefit from array cache. Because the data being read is random in nature, the array algorithms cannot anticipate which data will be requested and prestage it into read cache before a host asks for it. As a result, random reads produce a lot of "read miss" or "cache miss" events, where the requested data is not in read cache and must be retrieved from the back end of the array.

The solution is to use SSDs as a Level-2 read cache to hold small-block random read data and improve overall random read performance. The HPE 3PAR implementation that uses flash (SSD) storage as a Level-2 read cache on an HPE 3PAR StoreServ array is called Adaptive Flash Cache. The flash cache effectively extends the system cache without adding more physical memory. Creating more cache space from the SSDs allows the 3PAR StoreServ Storage to deliver commonly accessed data at greater speed. The space for the flash cache on the SSDs is automatically reserved by the system; there is no need to specify which SSDs to use. This feature does not require a separate license.
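The decision described above, that small random reads missing DRAM cache are the candidates for an SSD Level-2 read cache, can be sketched as a simple heuristic. The counter names and thresholds below are illustrative assumptions, not 3PAR internals:

```python
# Sketch of the "random-read signature" check: a high DRAM cache-miss
# ratio combined with small average I/O size suggests an SSD Level-2
# read cache would help. Thresholds are illustrative assumptions.

def l2_cache_candidate(read_iops, cache_miss_iops, avg_io_kib,
                       miss_threshold=0.5, small_io_kib=16):
    """Recommend an SSD read cache when a large share of reads miss
    DRAM cache and the I/Os are small."""
    miss_ratio = cache_miss_iops / read_iops
    return miss_ratio >= miss_threshold and avg_io_kib <= small_io_kib

print(l2_cache_candidate(8000, 5600, 8))   # 70% misses, 8 KiB reads -> True
print(l2_cache_candidate(8000, 800, 64))   # mostly hits, large I/O  -> False
```

Sequential and large-block workloads fail the size test and stay on the DRAM path, matching the slide's point that only random reads need the Level-2 cache.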


Enable Flash Cache
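This slide presumably showed the CLI step as a screenshot. As a hedged sketch: the HPE 3PAR CLI provides a `createflashcache` command for this; the size argument and the `setflashcache` enable syntax shown here are assumptions based on HPE documentation and may differ across HPE 3PAR OS versions.

```
# Reserve flash cache space on the SSDs (size per node pair);
# the system picks the SSDs automatically, as noted earlier.
cli% createflashcache 128g

# Enable flash cache for all virtual volumes (syntax may vary by release)
cli% setflashcache enable sys:all
```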


Check stats of Flash Cache
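The statistics slide presumably showed cache statistics output. The `statcache` command name comes from HPE 3PAR CLI documentation; the options and output shape sketched below are illustrative, not exact.

```
# Show CMP (DRAM) and FMP (flash) cache hit statistics, per node
cli% statcache

# Per virtual volume, refreshing every 5 seconds
cli% statcache -v -d 5
```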


Example of AI using HPE InfoSight with HPE Storage

HPE InfoSight offers predictive analytics that extend across the infrastructure lifecycle, from planning to expanding.

It right-sizes new infrastructure by anticipating the performance and resources needed, based on the different applications seen in the installed base.

Once arrays are deployed, the predictive analytics transform the product: they constantly look for leading indicators of problems and automatically resolve them before customers even realize there was an issue.

It accurately predicts future capacity, performance, and bandwidth needs based on historical use, autoregressive models, and simulations.
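As a toy illustration of the autoregressive idea mentioned above (not InfoSight's actual model): fit an average month-over-month growth ratio to historical capacity samples and project it forward. The usage numbers are invented.

```python
# Toy autoregressive-style capacity forecast. The monthly samples and
# the simple growth-ratio model are illustrative assumptions only.

def forecast_capacity(history, months_ahead):
    """Estimate the average month-over-month growth ratio, then project it."""
    ratios = [b / a for a, b in zip(history, history[1:])]
    growth = sum(ratios) / len(ratios)
    value = history[-1]
    for _ in range(months_ahead):
        value *= growth
    return value

used_tib = [40, 44, 48.4, 53.24]                 # monthly capacity samples (TiB)
print(round(forecast_capacity(used_tib, 3), 1))  # ~10%/month growth -> 70.9
```

A real predictor would use richer models and simulation across the install base, but the shape of the output, "you will need N TiB by month M", is the same.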


Proposal to improve small random read performance using AI

HPE InfoSight will examine the flash cache (Flash Memory Page, FMP) and DRAM cache (Cache Memory Page, CMP) statistics by node or by virtual volume for a given HPE 3PAR system. It will also examine the flash cache activities.

HPE InfoSight will compare the flash cache size with that of the global install base for similar models and system configurations, and tune it accordingly.

- Monitor the read performance for the given HPE 3PAR system
- If the performance decreases, roll back the tuning
- Continue the cycle, as workloads change continuously
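The tune/monitor/rollback cycle above can be sketched as a small control loop. `apply()` and `measure()` stand in for real tuning and telemetry hooks; they are hypothetical placeholders, not InfoSight or 3PAR APIs.

```python
# Sketch of the tune -> monitor -> roll back loop. apply() and measure()
# are hypothetical stand-ins for real tuning and telemetry interfaces.

def tune_with_rollback(apply, measure, old_size_gib, new_size_gib):
    """Apply a new flash cache size; keep it only if read latency improves."""
    baseline = measure()
    apply(new_size_gib)
    after = measure()
    if after >= baseline:          # no improvement: roll back
        apply(old_size_gib)
        return old_size_gib
    return new_size_gib

# Tiny simulated environment where latency improves with more cache.
state = {"size": 128}
def apply(size): state["size"] = size
def measure(): return 10.0 - 0.01 * state["size"]   # latency in ms

print(tune_with_rollback(apply, measure, 128, 256))  # improvement kept: 256
```

Because workloads change continuously, the loop would be re-run periodically, which is exactly the last bullet above.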


References

https://www.hpe.com/us/en/resources/storage/ai-autonomous-infosight.html?parentPage=/us/en/solutions/infosight

https://h20195.www2.hpe.com/v2/getpdf.aspx/4AA5-5397ENW.pdf


Thank You