EVA Architecture Introduction (Transcript)

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
EVA Architecture Overview
Rodger Daniels, EVA Replication Architect
Apr 8, 2023 2
Disclosure information
The following information is “HP Confidential” and is intended only for a limited audience within HP who fulfill a “need to know” requirement. The information contained herein is to be handled in accordance with HP’s policy for this classification of information.
http://legalweb.corp.hp.com/legal/files/labels.asp
This information may NOT be shared outside HP.
Online – HP Confidential – NSS
EVA Features
• EVA
– Excellent random R/W performance
– Excellent cache read-hit numbers
– Fault-tolerant, scalable virtualization mapping scheme (garbage-collection free)
– Mirrored write cache
– Volatile read cache
– Metadata in volatile memory (Policy Memory)
• Backend disks provide a non-volatile metadata store
– Replication features
• Snapclones, Snapshots (fully allocated, space efficient)
• CA disaster-tolerant remote replication
– RAID0, RAID1, RAID5
– Active mirroring between controllers through FC Mirror Port(s)
– GL – on-the-fly XOR
– XL – inline parity calculation
Architectural Discussion Objects
• NSC (Network Storage Controller)
– Refers to a controller from a VCS perspective
• VCS
– Virtual Controller Software (firmware)
– Becomes XCS for the XL family
• Physical Store
– An unused drive (in the process of becoming a usable part of the system, but still needing to be incorporated into an RSS)
• Volume
– A used disk drive; can accept customer data at this point
• Storage Cell
– EVA controllers, shelves and disks that have been initialized by the firmware. Can be logically constructed into Disk Groups (LDADs), Logical Disks and Virtual Disks, and then used for customer data
• Disk Group (LDAD – Logical Disk Address Domain)
– A group of disks that function as a separate storage pool. A virtual disk is contained within a single disk group and cannot span disk groups. A disk group is made up of one or more redundant store sets. User data for a virtual disk is striped across the entire disk group.
Architectural Discussion Objects
• Quorum
– A set of disks that contains copies of the SCS database
• Logical Disk
– Logical representation of a virtual disk; at the CS component level, the representation of a virtual disk
• Virtual Disk
– A virtual representation of a logical disk, for external use by a host
• Presented Unit
– The presentation of a virtual disk, i.e., it is mounted and usable by a host
• RSS (Redundant Store Set)
– A subset of disks within a disk group that represents a smaller fault domain than the disk group.
RSS Example
2C12D: one disk group containing all 64 drives, divided into eight RSSs (RSS 1 through RSS 8).
[Architecture block diagram: Host Port, SCMI Services, Cache Manager, RAID Services, FC Services, DRM (Core, Log, FC, Copy), Container Services (CS), Storage Cell State (SCS), Fault Manager and EXEC/RTOS, plus the EMU (Environmental Monitor Unit) and OCP (Operational Control Panel), all tied together by the Code Highway (CNODE). Front-end HP Tachyons, back-end device Tachyons and the mirror Tachyon carry FC traffic. Inter-component descriptors shown include HTB, EETB, XD, SEST/ERQ/IMQ, FED, MFCD, TDCB, DTD, SCSCB, EIRP/TEIRP, EIP, and the CSIO/CSLDREADY and ALLOC/DEALLOC interfaces.]
Architecture Component Overview
• Host Port
– Front-end FC services; decodes and sequences instructions; controller responses to host; assigns work to the Code Highway; passes commands to SCMI; supports the SCSI interface; handles AAA logic (V4)
• SCMI (Storage Cell Management Interface)
– Architected interface that allows external management agents (Command View/Bridge) to manage the EVA
• SCS (Storage Cell State)
– Inoperative/operative unit handling; SCMI requests to add/remove objects from the system; returning info about objects; unit presentation; pullover; failover; meltdowns; meltdown recovery; ILF disk management; system database; RSS management; add/remove devices; cell mastership; error reporting
• Cache Manager
– Read/write cache management; full-stripe writes; assigns work to RAID Services; RAID5 write recovery/parity recovery
• DRM (Data Replication Manager) – Continuous Access
– Remote disaster-tolerant replication
• Container Services
– Virtualization (map management); local replication (snapclone/snapshot); sparing; leveling
• RAID Services
– Services supporting RAID0, RAID1 and RAID5
• FCS
– Backend FC services; mirroring and DRM FCS support; disk drive handling
• FM (Fault Manager)
– Manages event logs, termination codes, etc.
Architecture Component Overview
• Host Port
• (SCMI) Storage Cell Management Interface
• (SCS) Storage Cell State
• Cache Manager
• Container Services
• Data Replication Manager (DRM) – Continuous Access
• FM – Fault Manager
Host Port
Host Port and EVA GL Operation
• EVA is made up of a controller pair
– 2 host ports per controller module
• One controller is the master and the other the slave
– Actions affecting storage cell structures and the database are restricted to the master controller
– An example is VDisk (LUN) creation
• EVA GL (VCS 3.XXX and earlier) is an asymmetrical virtual RAID controller
– Asymmetrical LUN access
• A unit is ready for read/write on one controller while it is not ready on the other controller
• Simultaneous access to a LUN is only supported via ports on the same controller
– One queue per LUN; ordering based on command arrival
• Host ports only support Fabric connection
– 1Gb and 2Gb switches supported
– Highest available link speed is auto-negotiated
EVA GL and Host Port IDs
• EVA Controller Pair
– Defined as a single node
• Assigned a SCSI-3 WWID
– Two control units, each containing two host ports
• Each host port is defined by a unique port WWID
– Node and port identifiers are 64-bit IEEE registered numbers, with a portion assigned by a company ID and the rest by an HP-specific method to ensure uniqueness of the identifiers.
Host Port Module
• Handles front-end FC services
• Decodes and sequences instructions
• Controller responses to host
• Assigns work to Code Highway
• Passes SCMI commands along to the SCMI module
• Supports SCSI interface
SCMI
SCMI (Storage Cell Management Interface)
• Architected interface used by external management agents (Command View/Bridge) to communicate with the EVA
• Communication via SCSI Send/Receive Diagnostics
– All SCMI commands are made through LUN0
– Commands come in via an SCMI command packet
– Responses go out via an SCMI response packet
– The original design limited a response to a single attribute
• To reduce message traffic, super SCMI commands were developed which return a lot of information via a single response
SCMI (Storage Cell Management Interface)
• The external management agent uses SCMIApi or RealSCMI to communicate with the EVA
• The SCMI server processes the command inside of VCS
• SEND DIAGNOSTIC command – uses page code 90 (vendor specific). Contains the SCMI command packet and command buffers (2). 64KB max buffer size.
• RECEIVE DIAGNOSTIC command – returns the result in the SCMI response packet and response buffers (2).
• The Host Port layer handles matching of the send/receive pair and rejecting illegal combinations.
• Built-in security mechanism based on an established password (the password is transmitted encrypted).
• The agent (client) must “log in” using the correct password to be able to send SCMI commands for execution
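The send/receive framing described above can be sketched roughly as follows. This is a toy illustration only: the page code (0x90) and the 64KB buffer limit come from the slide, but the field layout, the opcode value and the `build_scmi_send` helper are invented for illustration and are not the real SCMI wire format.

```python
import struct

SCMI_PAGE_CODE = 0x90        # vendor-specific diagnostic page (per the slide)
MAX_BUFFER = 64 * 1024       # 64KB max buffer size (per the slide)

def build_scmi_send(opcode: int, payload: bytes) -> bytes:
    """Pack a hypothetical SCMI command packet plus one command buffer."""
    if len(payload) > MAX_BUFFER:
        raise ValueError("command buffer exceeds 64KB limit")
    # page code, reserved byte, page length; then a toy command header
    header = struct.pack(">BBH", SCMI_PAGE_CODE, 0, 4 + len(payload))
    command = struct.pack(">HH", opcode, len(payload))
    return header + command + payload

pkt = build_scmi_send(0x0001, b"lookup-vdisk")
assert pkt[0] == SCMI_PAGE_CODE
```

The matching RECEIVE DIAGNOSTIC would unpack an SCMI response packet in the same fashion; the Host Port layer pairs the two.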
SCMI (Storage Cell Management Interface)
• Limitations
– The system processes one send/receive diagnostic at a time
– This means that when the system is synchronously executing a command via send/receive diagnostic, the next management command is held up until that command completes
– When a management command is held up, the management agent loses manageability of the array for that time
– Example: asynchronous background delete
• A consideration when designing commands that take a long time to execute
• See SCMI Spec sections 6.7, 5.2.5, 4.57.1
Storage Cell State (SCS)
SCS Functionality
• Storage Cell State (SCS – State)
– Inoperative/operative unit handling
– SCMI requests to add/remove objects from the system
– Return info about objects
– Unit presentation
– Pullover
– Failover
– Meltdowns
– Meltdown recovery
– ILF disk management
– State database (Object Store Management)
– RSS management
– Add/remove devices
– Cell mastership
– Error reporting
SCS Components
• Cell State Manager (CSM)
– Makes all state decisions; controls the state of the EVA
– Active only on the master controller
– Manages quorum disks
– Owns the SCS database
– SCMI command processing
– Cell realization
– Unit failover
• Cell Volume Manager (CVM)
– Volume transitions
– RSS membership
– Meltdown level
• Cell State Agent (CSA)
– Manipulates volatile data structures on behalf of the CSM
• Device Discovery
SCS Components
• Quorum Disks
– RSS0 is a special RSS that tracks the quorum disks
• It is the only RSS that has disks from multiple disk groups
• It is the only RSS whose disks are all members of other RSSs
– At least 5 disks mirrored, max 16; 1 per disk group, 1 per shelf
• The master owns the quorum drives; the slave cannot access them
• Read one, write all – n-way write
• The user is notified when all quorum disks are lost
• Special quorum disks called golden quorum are used in a single-controller configuration
• Kept in sync using an incarnation number
– In the event of a crash, check all incarnation numbers
– The SCS database resides on the quorum disks
– The SCS database keeps information about the current storage cell configuration
• Storage Cell, Disk Groups, VDisks, DR Groups
– Journals for metadata updates (can be a performance issue)
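The incarnation-number recovery rule above (after a crash, check all incarnation numbers) can be sketched as follows. The data shape and the `newest_quorum_copy` helper are illustrative assumptions, not the actual SCS database layout.

```python
def newest_quorum_copy(copies):
    """Pick the quorum database copy with the highest incarnation number.

    copies: list of (incarnation_number, database_blob) tuples, one per
    readable quorum disk.
    """
    if not copies:
        # the slide notes the user is notified when all quorum disks are lost
        raise RuntimeError("all quorum disks lost")
    return max(copies, key=lambda c: c[0])

copies = [(41, b"old"), (42, b"new"), (42, b"new")]
assert newest_quorum_copy(copies) == (42, b"new")
```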
RSS Management
• RSS Membership
– A disk is not available for storage if it is not a member of an RSS
– When new drives are added to the system they must be added to existing RSSs, or new RSSs must be created
– When drives are removed from the system, RSSs may need to be merged
• RSS Size
– RSSs are 6 to 12 drives
– When an RSS drops below 6 drives it will merge with another RSS to create a larger RSS
– When an RSS grows beyond 11 drives it will be split to create 2 RSSs
– A merge can force a split
– The optimal size targeted by the system is 8 drives
RSS Management
• RSS Goals
– Size is important
• Optimal size targeted by the system is 8
• Must be greater than 5 and less than 12
• When an RSS goes to 5 or less it is merged with another RSS, if another RSS is available
• When an RSS grows to 12 or greater it is split into two smaller RSSs of size 6 or greater
– Every member has a mirror partner
• Talk about VA R1 geometry vs EVA R1 geometry
– Mirror partners should be on different shelves
– RSS members should be on different shelves
– Mirror partners should be the same size
– RSS members should be the same size
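The merge/split thresholds above reduce to a simple rule, sketched here. The function name and return values are illustrative, not firmware interfaces:

```python
OPTIMAL_RSS_SIZE = 8   # target size stated on the slide

def rss_action(size: int, other_rss_available: bool) -> str:
    """Decide whether an RSS of the given size should merge or split."""
    if size <= 5:
        # too small: merge, but only if another RSS exists to merge with
        return "merge" if other_rss_available else "none"
    if size >= 12:
        # too large: split into two RSSs, each of size >= 6
        return "split"
    return "none"  # 6..11 members is a legal size

assert rss_action(5, True) == "merge"
assert rss_action(12, True) == "split"
assert rss_action(8, True) == "none"
```

Note that a merge producing an RSS of 12 or more would immediately trigger a split, which is how "a merge can force a split" (previous slide) arises.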
RSS Management
• Adding a Single Drive to an LDAD
– Add the disk to the RSS with the smallest odd membership
• If there is more than one to choose from, select based on shelf numbers and disk sizes
• Adding Multiple Drives to an LDAD
– Try to mate all unpaired disks
– Try to arrange it so every disk has a partner on a different shelf
– If there are more than 5 disks, try to create as many new RSSs of size 8 as possible, plus a new smaller RSS with what’s left
• Things Not Guaranteed
– Mirror partners will be on different shelves
– All RSS members will be on different shelves
– Good RSSs are not torn apart to make RSSs with drives on different shelves
– Four 6-member RSSs are not reorganized into three 8-member RSSs
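The multi-drive sizing heuristic above can be sketched as follows. This ignores shelf placement and disk mating, which the real firmware also weighs; `plan_new_rsss` is an invented name, and the borrow step for small remainders is an assumption consistent with the 6..11 size rule, not a documented algorithm.

```python
def plan_new_rsss(n_disks: int):
    """Propose sizes for new RSSs formed from n_disks added drives.

    Prefers RSSs of 8 (the optimal size) and keeps every RSS in the
    legal 6..11 range; splits anything of 12 or more.
    """
    if n_disks <= 5:
        return []  # too few to form a new RSS on their own
    sizes = []
    while n_disks >= 14:           # can take a full 8 and still leave >= 6
        sizes.append(8)
        n_disks -= 8
    if n_disks >= 12:              # 12..13 must split into two of >= 6
        sizes.append(n_disks - 6)
        sizes.append(6)
    else:                          # 6..11 is a legal single RSS
        sizes.append(n_disks)
    return sizes

assert plan_new_rsss(16) == [8, 8]
assert plan_new_rsss(13) == [7, 6]
assert plan_new_rsss(5) == []
```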
Cache and Battery
Cache and Battery State
• Cache Policy:
– The battery capacity (i.e., write cache holdup time) is a major input for determining what is called the Cache Policy
– The Cache Policy determines whether or not a unit is presented to hosts, which controller it is presented through, and whether it operates in write-back or write-through mode
Battery Holdup and Cache Policy
The Storage Cell and Cache Policy

Master battery system Bad:
– Slave battery Bad: no unit presentation except SACD
– Slave battery Low: all units write-through on the Storagecell Slave
– Slave battery Good: all units write-back on the Storagecell Slave
Master battery system Low:
– Slave battery Bad: all units write-through on the Storagecell Master
– Slave battery Low: all units write-through on both Storagecell Master and Slave
– Slave battery Good: all units write-back on the Storagecell Slave
Master battery system Good:
– Slave battery Bad: all units write-back on the Storagecell Master
– Slave battery Low: all units write-back on the Storagecell Master
– Slave battery Good: all units write-back on both Storagecell Master and Slave

Adapted from “VCS Battery Manager Overview” by Bryan Walder (Aug 29, 02).
• When one controller’s battery system is no longer good, units move to the other controller if its battery state is better
Battery Holdup Times
• GL
– Two batteries
– Low holdup time: 96 hours
• XL Lite (4000, 6000)
– One battery
– Low holdup time in write-through is about 96 hours
– Normal holdup time in write-back mode is up to 242 hours
• XL (8000)
– Two batteries
– Low holdup time in write-through is about 96 hours
– Normal holdup time in write-back mode is up to 244 hours
Cache Management for Dummies
• Terminology:
– Dirty Data
• Write cache data that has not been flushed to disk
– Write-back caching
• Committing data when it reaches write cache and is mirrored on the other controller, to reduce write latencies
– Write-through caching
• Disabling write cache and forcing a write to complete successfully to disk before returning successful status
– Atomic Write
• Guarantee that for any write up to 128K that does not cross a 128K boundary, a read of the data will return either all old data or all new data
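The atomic-write condition above is easy to state as code: the write must be at most 128K, and its first and last byte must fall in the same 128K-aligned window. A minimal sketch (the function name is invented for illustration):

```python
K128 = 128 * 1024

def is_atomic(offset: int, length: int) -> bool:
    """True if a write of `length` bytes at `offset` meets the guarantee."""
    if length > K128:
        return False
    # boundary-crossing test: first and last byte in the same 128K window
    return offset // K128 == (offset + length - 1) // K128

assert is_atomic(0, K128)               # exactly one aligned 128K window
assert not is_atomic(K128 - 512, 1024)  # straddles a 128K boundary
```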
Cache Management for Dummies
• Terminology:
– Fail-over
• Process of failing over a controller’s write cache to the other controller
– Crash-over
• The process of reconstructing local cache data structures following a controller power cycle
– Volatile Memory
• Non-battery-backed memory – assumed not to survive a power cycle
– Non-volatile Memory
• Battery-backed memory – assumed to survive a power cycle
– SACD (Storage Array Control Device)
Cache Benefits
• Benefits of Caching:
– The cache acts as a holding point between front-end and back-end operations for a given piece of data
– Reduced host port command latency (disk vs. electronic speed):
• Read hits to already-cached data
• Write-back for absorbing bursty write data at electronic speed: the cache can absorb new host writes at electronic speed as long as it doesn’t fill up and, over time, the average host write data rate is less than the rate at which the media can absorb the data.
Cache Buffers
• Cache Buffers:
– Block = 512 bytes
– GL Buffer = 2048 bytes (populated with 1 to 4 blocks of user data)
– XL Buffer = 8192 bytes (populated with 1 to 16 blocks of user data)
– Cache Page = 128 kilobytes
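A quick arithmetic check of the figures above; the derived buffers-per-page counts are implied by the sizes, not stated on the slide:

```python
BLOCK = 512                 # bytes per block
PAGE = 128 * 1024           # bytes per cache page
GL_BUFFER = 2048            # holds 1 to 4 blocks of user data
XL_BUFFER = 8192            # holds 1 to 16 blocks of user data

assert GL_BUFFER // BLOCK == 4      # max blocks per GL buffer
assert XL_BUFFER // BLOCK == 16     # max blocks per XL buffer
assert PAGE // GL_BUFFER == 64      # GL buffers per cache page
assert PAGE // XL_BUFFER == 16      # XL buffers per cache page
```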
Cache Layout GL and XL (4000, 6000)
Cache-A:
– A Write Primary, 256MB non-volatile
– B Write Mirror, 256MB non-volatile
– A Read, 512MB volatile
Cache-B:
– B Write Primary, 256MB non-volatile
– A Write Mirror, 256MB non-volatile
– B Read, 512MB volatile
XL (8000)
Cache-A:
– A Write Primary, 512MB non-volatile
– B Write Mirror, 512MB non-volatile
– A Read, 1024MB volatile
Cache-B:
– B Write Primary, 512MB non-volatile
– A Write Mirror, 512MB non-volatile
– B Read, 1024MB volatile
Cache Manager Operations
• Host Port Reads/Writes (HP Interface)
• Mirroring write data to other controller
• Cooperation with DRM for order preservation
• Full stripe write aggregation for RAID5 to avoid RMW penalty
• R5 parity recovery
• World Peace
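Why full-stripe write aggregation avoids the read-modify-write penalty: once the whole stripe's data is in cache, parity is a pure XOR over the new data and needs no disk reads. A minimal sketch over byte strings (the helper name is invented for illustration):

```python
from functools import reduce

def full_stripe_parity(data_chunks):
    """XOR all data chunks of a full RAID5 stripe into the parity chunk.

    With a full stripe in cache this is computable without reading the
    old data or old parity from disk, unlike a partial-stripe RMW.
    """
    return bytes(
        reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data_chunks)
    )

stripe = [b"\x0f" * 4, b"\xf0" * 4, b"\xff" * 4]
parity = full_stripe_parity(stripe)
assert parity == b"\x00" * 4
```

A partial-stripe update would instead need the old data and old parity read back (new_parity = old_parity XOR old_data XOR new_data), which is the RMW penalty the aggregation avoids.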
Active-Active Controller Support on EVA
EVA 3000, 5000 (VCS 4.XXX); EVA 4000, 6000, 8000

Active-Active Controller Support
− Active-active multi-pathing
− Vdisk failover
− Controller failover
Active-Active Multi-Pathing
What is active-active multi-pathing?
− On the EVA 3000/5000 a Vdisk is preferred to a controller and can only be accessed by that preferred controller
• To read or write the vdisk from the other controller, it must be moved to that other controller
Active-Active Multi-Pathing
What is active-active multi-pathing?
− On the EVA 4000/6000/8000 a Vdisk is “mastered” by one controller in the controller pair but it can be read from and written to via the “slave” controller in the controller pair
− This ability to access the Vdisk through either controller allows for active-active load balancing, path failover, and the support of native failover software on the servers
Active-Active Multi-Pathing
Vdisk access via the master controller
− All read and write requests are sent to the master controller
− The only data that moves across the mirror port between controllers is write data being mirrored to the slave controller’s mirror write cache
Active-Active Multi-Pathing

[Diagram: Vdisk access via the master controller. The server’s host reads and writes go to the master controller’s read cache and primary write cache; the only transfer across the mirror ports is write data being mirrored into the slave controller’s mirror write cache.]
Active-Active Multi-Pathing
Vdisk access via the slave controller
− All read and write requests are sent to the master controller via the controller mirror ports
− Both read data and write data are moved between the controllers via the controller mirror ports
Active-Active Multi-Pathing
Vdisk reads via the slave controller
− Read and write requests are received by the slave controller
− All requests are sent to the master controller
− Reads are fulfilled from the read cache on the master controller via the mirror port between the controllers
− A performance penalty is paid for read requests on the slave controller
Active-Active Multi-Pathing
Vdisk writes via the slave controller
− Writes are fulfilled by first putting the data in the mirror half of the write cache on the slave controller and then sending the data to the master controller via the mirror port where it goes into the primary write cache
− Vdisks being replicated by Continuous Access can be written via the slave controller
− Minimal performance penalty for write requests to the slave controller
Active-Active Multi-Pathing

[Diagram: Vdisk read via the non-mastering (slave) controller. The host read request arrives at the slave controller and is forwarded to the master controller; read data returns to the server across the mirror ports.]
Active-Active Multi-Pathing

[Diagram: Vdisk write via the non-mastering (slave) controller. The host write lands in the slave controller’s mirror write cache and is transferred across the mirror ports into the master controller’s primary write cache.]
Vdisk Failover
Vdisk failover on XL
− Vdisk failover results in the slave controller becoming the master controller for one or more vdisks
− Can occur in one of two ways
• Implicit failover – the EVA decides to change the Vdisk master
• Explicit failover – an administrator or host-based software decides to change the Vdisk master
− Causes
• HBA failure
• SAN failure
• Controller failure
• Administrative decision
Vdisk Failover
Implicit Vdisk failover
− Implicit transition of a vdisk between controllers is initiated by the EVA, based on which controller the majority of read I/O requests are received on
− Measurements are taken on an hourly basis
− Implicit failover occurs if >= 2/3 of the reads occur on the slave controller
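The implicit-failover rule above is a simple threshold over the hourly read counts. A sketch, using integer arithmetic to avoid floating-point edge cases; the function name and the behavior for a zero-read hour are assumptions:

```python
def should_failover(reads_on_master: int, reads_on_slave: int) -> bool:
    """True when >= 2/3 of this hour's reads arrived via the slave.

    Writes are deliberately ignored: per the next slide, writing through
    the slave carries almost no penalty, while reads do.
    """
    total = reads_on_master + reads_on_slave
    if total == 0:
        return False  # no read traffic, nothing to rebalance (assumption)
    return reads_on_slave * 3 >= total * 2  # slave share >= 2/3

assert should_failover(100, 200)        # exactly 2/3 on the slave
assert not should_failover(200, 100)
```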
Vdisk Failover
Implicit Vdisk failover
− Based on reads because reading through the slave controller incurs a fairly large performance penalty
− Almost no performance penalty when writing through slave controller so writes are ignored
− Considered giving the administrator control of the measurement window but decided in the end not to provide this access
Vdisk Failover
Explicit Vdisk failover
− Explicit transition of a vdisk between controllers is performed either by the storage administrator or by host path failover software
• Tru64
• OpenVMS
− Not allowed if the controller is in write-through mode during a fully allocated snapshot or snapclone creation
Vdisk Failover
Vdisk failover
− Can fail over about 1TB per second, regardless of the number of Vdisks being failed over
− Failover is done by group (a DR Group for CA, or a Vdisk and its snaps otherwise), one vdisk at a time
Vdisk Failover
Vdisk failover
− When a Vdisk failover occurs, the Vdisk is first put into write-through mode and dirty cache entries for the Vdisk are flushed
− Metadata from the mastering controller’s policy memory is then written to the disk group reserved metadata area (a hidden Vraid 1 disk owned by the controllers)
• Metadata changes are also written to journals that reside on the quorum disk, but it is faster to use the disk group metadata area than the journal
− The metadata is then read from the reserved metadata area by the new mastering controller
− The new controller takes over the Vdisk
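The four steps above can be sketched as a sequence. Every class and method name here is an illustrative stand-in, not a VCS/XCS interface:

```python
class ReservedMetadataArea:
    """Stands in for the hidden Vraid 1 metadata area in the disk group."""
    def __init__(self):
        self._store = {}
    def write(self, vdisk, meta):
        self._store[vdisk] = meta
    def read(self, vdisk):
        return self._store[vdisk]

class Controller:
    def __init__(self):
        self.policy_memory = {}
        self.write_through = set()
        self.owned = set()
    def set_write_through(self, vdisk):
        self.write_through.add(vdisk)
    def flush_dirty_cache(self, vdisk):
        pass  # dirty cache entries for the vdisk would be flushed here
    def take_ownership(self, vdisk):
        self.owned.add(vdisk)

def failover_vdisk(vdisk, old_master, new_master, area):
    old_master.set_write_through(vdisk)                  # 1. write-through
    old_master.flush_dirty_cache(vdisk)                  #    and flush
    area.write(vdisk, old_master.policy_memory[vdisk])   # 2. to hidden Vraid 1
    new_master.policy_memory[vdisk] = area.read(vdisk)   # 3. new master reads
    new_master.take_ownership(vdisk)                     # 4. takes over
```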
Vdisk Failover

[Diagram: Vdisk failover. The cache enters write-through mode and dirty cache entries for the Vdisk are flushed; metadata moves from the old master controller’s policy memory to the hidden Vraid 1 metadata area, from which the new master controller reads it.]
Controller Failure
Controller failure
− When a controller failure occurs all Vdisks mastered on the controller are failed over
− Can failover about 1TB per second regardless of the number of Vdisks being failed over
− Failover is done by group (DR Group for CA or Vdisk and snaps for other Vdisk) one vdisk at a time
Controller Failure
Controller failure
− Metadata changes in the failed controller’s policy memory do not get written to the hidden metadata area or the quorum drive, because the controller has failed
− The new master controller reads the metadata from the hidden metadata area and reads the metadata journal entries from the quorum drive
− The metadata journal entries from the quorum drive are applied to the metadata from the hidden metadata area to recover any metadata changes that were in process at the time of the controller failure
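The recovery described above is a classic journal replay: start from the (possibly stale) copy in the hidden metadata area, then apply the journal entries from the quorum drive in order. A sketch with invented data shapes:

```python
def recover_metadata(hidden_area_copy: dict, journal_entries):
    """Rebuild metadata after a controller failure.

    hidden_area_copy: last metadata written to the hidden Vraid 1 area.
    journal_entries: ordered (key, value) changes from the quorum drive
    that may postdate the hidden-area copy.
    """
    meta = dict(hidden_area_copy)
    for key, value in journal_entries:  # replay in order; later wins
        meta[key] = value
    return meta

base = {"vdisk1": "v5", "vdisk2": "v3"}     # hidden area (stale for vdisk2)
journal = [("vdisk2", "v4")]                # in-flight change at failure time
assert recover_metadata(base, journal) == {"vdisk1": "v5", "vdisk2": "v4"}
```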
Controller Failover

[Diagram: controller failover after a controller failure. The surviving controller becomes master for the failed controller’s Vdisks; it reads metadata from the hidden VRaid 1 metadata area and applies the metadata journal entries stored on the quorum drive.]