
A High-Availability Software Update Method for Distributed Storage Systems

Dai Kobayashi,1 Akitsugu Watanabe,1 Toshihiro Uehara,2 and Haruo Yokota1,3

1Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo, 152-8552 Japan

2NHK Science and Technical Research Laboratories, Tokyo, 157-8510 Japan

3Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, 152-8552 Japan

SUMMARY

In this paper, we propose a nonstop system upgrade method without significant performance degradation for data management software. To reduce disk accesses and network traffic, we construct logical nodes inside a physical node and migrate data between the symbiotic logical nodes. This logical migration is assisted by storage management functions that hide data location and migration from users. We also show the effectiveness of our method using experimental results on a system based on the Autonomous Disks we have proposed as a highly available storage system technology. © 2006 Wiley Periodicals, Inc. Syst Comp Jpn, 37(10): 35–46, 2006; published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20503. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J88-D-I, No. 3, March 2005, pp. 684–697.

Key words: network storage; high availability; ceaseless update; storage management.

1. Introduction

The complexity of the software that manages data on storage systems has recently increased as the systems themselves have grown larger, yet the software is also required to provide continued high availability.

Large-scale storage systems are constructed from a great number of magnetic hard disk drives (HDDs), or nodes, with consideration for both costs and performance. In such systems, the software to manage data provides several functions: storage virtualization to provide a unified storage image, data redundancy control to protect data under disk failure, and so on.

It is crucial that software is updated with no service downtime. The software may require updating after distribution for many reasons, such as bug fixes and scope creep. However, systems cannot simply be stopped, because doing so would cause great losses of productivity, service opportunities, and so on.

For the following discussion, we introduce two models of the relationship between the location of the storage management and the data store: a separated management model, which stores and manages data in different nodes, and a united management model, in which one type of node both stores and manages data. With the latter model, how to provide continuous access to the data is an important consideration when updating the software. The details are discussed in Section 2.

Typical high-availability clustered systems update their management software without stopping their services by delegating all functions of a target node to spare nodes, taking the target node off-line, and updating its software. This method is also used for failure recovery. However, it is unacceptable for storage systems using the united management model because of performance degradation. When a cold spare is used for the target node, delegating services requires transferring all data in the target node through the network, which degrades system performance. Using hot spare nodes for all nodes makes the system larger and increases its costs, and also degrades system performance because of data consistency control between nodes.

A method using failure recovery is also employed. It takes the target node off-line without any preparation, and the system's failure recovery functions disguise the action [2, 3]. This method can be applied to all kinds of updates if newer versions can communicate with older versions. However, this method also degrades system performance because of the HDD accesses and network transfers caused by recovering data redundancy. The method's execution time is proportional to the amount of stored data.

In this paper, we propose a novel method to update data management software on storage systems employing the united management model without stopping services. The method uses logical nodes, several of which can exist in one physical node. All data in a logical node with older software can migrate to another logical node with newer software without disk accesses or network transfers, when they exist in the same physical node. The old node can then be released from managing data and its version updated rapidly, with no interruption to services.

We demonstrate the efficiency of our method with experimental results on Autonomous Disks [1], a simulation environment for a kind of parallel storage system we have proposed in an earlier paper. The results show that our method degrades performance by about 5% relative to normal operation, whereas existing methods degrade it by 49 to 55%.

The paper is organized as follows. Section 2 discusses the current technology of parallel storage systems and introduces the two management models. Section 3 explains the concept of logical nodes and gives details of our method to update data management software, and Section 4 provides experimental results and discussion. Section 5 discusses the applicability of our method and the separated management model that is beyond the scope of this paper. Related work is introduced in Section 6. We conclude the paper with some suggestions for future research directions.

2. Parallel Storage Systems and Data Management Software

In this section, we discuss current parallel storage systems and briefly describe some issues in updating data management software.

2.1. Parallel storage systems

Current large-scale storage systems are constructed from a number of data storage devices combined with networks. The systems provide services such as read and write accesses for users. In addition, these systems provide advanced data management functions for reducing management labor cost and increasing usability: storage virtualization, which provides a unified storage image; transparent data migration, which can relocate data for some reason without interrupting service; load balancing functions, which remove hot spots; data redundancy control, so as not to lose data under disk failure; and so on.

2.1.1. Two data management models

For the following discussion, we introduce two models of the relationship between the locations of storage management and the data store: a separated management model (Fig. 1), which has different nodes to store and manage data; and a united management model (Fig. 2), in which one kind of node both stores and manages data.

Fig. 1. A separated management model dividing data management and storage.

Examples of systems employing the united management model are systems constructed from multiple NAS (Network Attached Storage) devices, that is, parallel NAS systems. Each data storage node in a parallel NAS system has its own data management software. The software on parallel NAS systems such as X-NAS [4] and Autonomous Disks [1] controls part of the data management function, and the system as a whole provides overall data management functions through cooperation between nodes. Storage systems constructed from clustered PCs, such as GFS [3] and Oracle RAC, have the same feature.

In this paper, we focus on the united management model. The separated management model is discussed in Section 5.

2.1.2. Storage virtualization

Current parallel storage systems employ storage virtualization. It provides a unified storage image, hiding complex structure such as data location and data storage format from the users. In other words, users can access data through the virtualization layer without regard to where the data is stored.

There are several implementations of storage virtualization. Some parallel NAS systems place a single storage virtualization control unit between users and the system [5]. Others use distributed indexes maintained by cooperation between nodes [1, 4].

2.1.3. Transparent data migration

Enhancing storage virtualization makes it possible to hide data migration. A system can move data from one node to another without long-term blocking of accesses to the data. Actual systems apply transparent data migration for data relocation in HSM (Hierarchical Storage Management) [6] and for removing hot spots [1]. It is also possible to detach some nodes from a system without long-term blocking of access with transparent data migration.

2.1.4. Data redundancy control

Transparent data migration and replica data can make systems highly reliable. When a node fails, the system rewrites data location information dynamically to avoid failed accesses. Although some storage systems use erasure codes like RAID 3 to 5 for node failures, large-scale storage systems prefer replicas because they offer higher availability and scalability. Such systems therefore treat a RAID group as a node.

2.2. Issues for updating data management software

Data management software on parallel storage systems has ever-increasing complexity, because it must include the various functions discussed above. Complex software typically has a number of bugs, and users call for enhancements (“scope creep”). Therefore, the need to update the software after distribution is increasing. Users also require that the software should be updated without stopping any services.

The basic service is read and write accesses for data stored in the system. Therefore, if software is to be updated without stopping services, the process must always allow all the data stored in the system to be read, and must not block new data being written in the system. In this paper, we define these requirements as providing all accesses within a given acceptable latency.

To simplify the problem, we assume that a node that is updating its software cannot provide data access services. To provide access to the data in the updating node, the data must be temporarily (or permanently) moved to other nodes that can provide access to users.

The problem at this time is the performance degradation related to data transfer. Systems employing the united management model (Fig. 2) are able to move data during software updates using the failure recovery function to release a node from data management, as discussed above. That process must copy data from one node to another to ensure data redundancy. However, the copy process causes performance degradation because of its disk accesses and network transfers. The process also takes a long time, because all data must be transferred at least once if all nodes are to be updated.

The following discussion focuses on how to avoid performance degradation when data management nodes also store data.

3. Proposal

In this section, we propose a novel method to update data management software without stopping the services of the storage system. First, we introduce the concept of logical nodes. Then we explain the details of our method.

Fig. 2. A united management model binding data management with storage.


3.1. Logical nodes

Parallel storage systems are constructed from storage nodes. Each storage node has data and some part of the data management function. Stored data is managed as data segments, based on a fixed size or other semantics. Data location information is managed by the storage virtualization mechanism.

Logical nodes have the same functions and interfaces as the physical nodes. They can store data and behave as part of the data management function. There are multiple logical nodes in a physical node. To clients and other physical nodes, logical and physical nodes present the same view. Therefore, only logical nodes that are in the same physical node can distinguish between logical and physical nodes.

Figure 3 illustrates an example with two logical nodes in one physical node. Each logical node has its own unique network address, local data location information, metadata for the stored data, management tables for distributed concurrency control, and processes to manage them. The node also has exclusive permission for the part of the stored data that it manages. A logical node can delegate this permission to other logical nodes on the same physical node, using a data migration protocol that interchanges management information and metadata but does not move the data itself. We call this logical data migration.
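To make this structure concrete, the following Java sketch shows one possible shape of the per-node state and the delegation step. It is only an illustration under assumed names (LogicalNode, delegateSegment, and the table fields are ours, not the paper's implementation).

```java
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the per-logical-node state described above.
// Class, field, and method names are illustrative assumptions, not the paper's code.
final class LogicalNode {
    final InetAddress address;  // unique (aliased) network address of this logical node
    // segment id -> local location information (e.g., an LBA); ownership implied by presence
    final Map<String, Long> locationTable = new ConcurrentHashMap<>();
    // segment id -> metadata for the stored data
    final Map<String, String> metadata = new ConcurrentHashMap<>();
    // management tables for distributed concurrency control would also live here

    LogicalNode(InetAddress address) {
        this.address = address;
    }

    // Logical data migration: delegate ownership of one segment to a symbiotic logical
    // node in the same physical node. Only management information and metadata are
    // exchanged; the data blocks on disk are neither read nor copied.
    void delegateSegment(String segmentId, LogicalNode successor) {
        Long location = locationTable.remove(segmentId);   // give up exclusive permission
        if (location == null) {
            throw new IllegalStateException("segment not owned by this node: " + segmentId);
        }
        String meta = metadata.remove(segmentId);
        successor.locationTable.put(segmentId, location);  // successor now owns the same on-disk blocks
        if (meta != null) {
            successor.metadata.put(segmentId, meta);
        }
    }
}
```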

3.1.1. Restrictions on construction of logical nodes

Some systems have restrictions on data and replica placement policy that affect the location of logical nodes.

One such restriction concerns replicas. Replicas of data for failure recovery must be located in different physical nodes, because most node failures are caused by the physical failure of a device. Therefore, a data item and its replica must not be placed on two logical nodes that are in the same physical node. For example, if the system employs node groups in which all nodes in a group are replicas, the nodes in a group must be located on different physical nodes.

Another restriction is caused by simplification or optimization. Simplifications of the structure of distributed indices, or optimizations of data locations that consider disk head positions, limit data elements to migrating to only a few specific nodes. An example is range partitioning, that is, horizontal partitioning, in relational databases. Each node stores data that falls in a range of keys. Therefore, data can be migrated only to a node that stores the appropriate range.

Because our method uses logical data migration to avoid data transfers, it must be able to migrate all data on a logical node to another logical node in the same physical node. In such cases, some restrictions must be relaxed to apply our method. We discuss this point further in Section 3.2.5.

3.1.2. Logical data migration

Data migration between logical nodes on the same physical node, which we call logical data migration, is achieved by delegating permissions for the managed data segments and hiding the delegation process with storage virtualization. In other words, logical data migration changes the owner node of the data segments. To make the change, logical nodes exchange data location information such as LBA (Logical Block Addressing). It is similar in action to mv within a single partition of a filesystem. The transparent data migration function, discussed above, hides this process from users.

Because logical data migration itself requires few disk accesses, the costs of transparent data migration, such as transferring metadata and rewriting data location information, dominate the disk access cost. A large part of the network transfer cost is caused by the interchange of small amounts of data such as data location information, because logical data migration does not require transfers of data segments. In addition, because there is no data duplication, little disk space is consumed by logical data migration. Storage is only required for the logical node program code.

Fig. 3. The concept of logical nodes.

Accesses to logically migrating data are processed without interruptions to services. Transparent data migration achieves this with the following protocol. First, an access request arrives at the storage virtualization layer. Let us assume that at that time the target data are being logically migrated. Then the access request is held at the storage virtualization layer, because the part of the data location information associated with the data is locked by the transparent migration function. When the migration has finished and the information is unlocked, the access request is forwarded to the new logical node controlling the data. All later access requests are forwarded to the new logical node. The duration of the wait for access is short enough not to count as a service interruption, because logical migration involves few disk accesses and network transfers.

In a similar way, logical migration processes are blocked by read or write access requests queued at the time the migration is requested, and do not disturb those accesses.
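The following Java sketch illustrates one way a virtualization layer could hold requests briefly while a location entry is rewritten and then forward them to the new owner. The class and method names (VirtualizationLayer, resolveOwner, migrateEntry) are our own assumptions, not the system's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of request handling during a logical migration (assumed names, not the paper's code).
final class VirtualizationLayer {
    private static final class Entry {
        volatile String ownerNode;   // logical node that currently owns the segment
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        Entry(String ownerNode) { this.ownerNode = ownerNode; }
    }

    private final Map<String, Entry> index = new ConcurrentHashMap<>();

    void register(String segmentId, String ownerNode) {
        index.put(segmentId, new Entry(ownerNode));
    }

    // Read and write requests take the shared lock; they wait only while a migration holds the entry.
    String resolveOwner(String segmentId) {
        Entry e = index.get(segmentId);
        e.lock.readLock().lock();
        try {
            return e.ownerNode;      // the request is then forwarded to this logical node
        } finally {
            e.lock.readLock().unlock();
        }
    }

    // A logical migration briefly takes the exclusive lock to rewrite the owner of the entry.
    void migrateEntry(String segmentId, String newOwner) {
        Entry e = index.get(segmentId);
        e.lock.writeLock().lock();   // queued accesses wait here, but only for a short time
        try {
            e.ownerNode = newOwner;  // no data segments are touched
        } finally {
            e.lock.writeLock().unlock();
        }
    }
}
```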

3.1.3. Other communications between nodes

There are various communications between nodes in current storage systems; examples are system configuration updates and replica relocation notifications. Logical nodes are defined to behave as physical nodes from the viewpoint of other physical nodes, so that they can use the existing storage management functions. Therefore, these communications are also processed normally, as if each logical node were a physical node. Hence, the cost of applying our method to an existing architecture can be kept low.

3.2. Software update method with logical nodes

We propose a method that avoids data transfers when the data management software is updated, using logical data migration. We first present the strategy and detailed steps. Then we discuss the features of our method and its cost. Finally, some examples of applying the method to restricted systems are given.

3.2.1. Strategy

As discussed in Section 2.2, to provide accesses to data in an updating node, the data must be moved to other nodes that can communicate with users, but data transfer causes performance degradation.

Our method therefore uses logical data migration to release a node from data management.

3.2.2. Steps

We assume that the newer version of the software can interchange data segments and data management information with the older version. The method, consisting of the following steps, results in a physical node with a logical node running the newer version.

1. Create a logical node with the newer version of the software in a physical node that has a logical node with the older version (Fig. 4).

2. Send a command to the system to insert the new logical node. The system configuration information is updated for the new logical node.

3. Logically migrate all data segments managed by the old logical node to the new logical node.

4. Detach the old logical node from the system.

To update all nodes in the system, these steps are repeated for all logical nodes. Because the method does not reduce data redundancy, the steps can run simultaneously in multiple nodes. A minimal sketch of the per-node sequence is given below.
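The sketch below strings the four steps together for one physical node, reusing the illustrative LogicalNode class sketched in Section 3.1. ClusterConfig and SoftwareUpdater are hypothetical helpers standing in for the system's configuration and process management facilities, not the actual implementation.

```java
import java.net.InetAddress;
import java.util.List;

// Hypothetical interface to the system configuration (steps 2 and 4).
interface ClusterConfig {
    void insertNode(LogicalNode node);   // register a new logical node in the configuration
    void detachNode(LogicalNode node);   // remove a logical node from the configuration
}

// Sketch of the per-node update sequence of Section 3.2.2; the same sequence can run
// on many physical nodes at once because no data redundancy is lost.
final class SoftwareUpdater {
    static void updatePhysicalNode(LogicalNode oldNode, ClusterConfig config,
                                   InetAddress newAddress) {
        // Step 1: start a logical node running the newer software in the same physical node.
        LogicalNode newNode = new LogicalNode(newAddress);

        // Step 2: insert it into the system configuration.
        config.insertNode(newNode);

        // Step 3: logically migrate every segment the old node manages (no data blocks are copied).
        for (String segmentId : List.copyOf(oldNode.locationTable.keySet())) {
            oldNode.delegateSegment(segmentId, newNode);
        }

        // Step 4: detach the now-empty old logical node; its process can then be stopped and replaced.
        config.detachNode(oldNode);
    }
}
```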

3.2.3. Features

Our method can update software with little performance degradation because it involves neither disk accesses for data segments nor their transfer on the network.

It can also upgrade without stopping services, because transparent data migration on storage virtualization hides the logical data migration from users, as discussed above. It does not reduce data redundancy, because the detached logical node has already been released from data management. Hence, regeneration of a replica, which causes severe performance degradation, does not occur after running the method.

The method can update software on many nodes simultaneously, which is impossible with existing methods that require detaching physical nodes from the system, because the system must keep most of its nodes available to maintain system performance. Simultaneous updates reduce the total time consumed by updating.

It is impossible to apply our method to software that does not allow independent logical nodes on a physical node. Examples of such software are operating system kernels and device drivers. Scope creep and bug fixes do not usually involve updating such code.

Fig. 4. Detaching an old version node by a logical move between logical nodes.


3.2.4. Data transfer cost

In this section, we estimate the costs of our method and compare them to those of existing methods. We use the term cost to mean the factor causing performance degradation. The results show that our method has consistently lower costs than the methods that update software by detaching physical nodes, and the difference increases when fewer data segments are stored.

The cost of detaching a physical node, C_phy, is the sum of the cost of updating the system configuration information, C_mdfy, and the cost of moving data, C_move:

C_phy = C_mdfy + C_move

C_move is proportional to the product of the number of stored data segments n and the sum of the cost to transfer the metadata of each segment, C_mmeta, the cost to transfer each data segment, C_mdata, and the cost to update the data location information for each segment, C_mmap:

C_move = n × (C_mmeta + C_mdata + C_mmap)

When n is large enough, C_mdfy becomes negligible. In actual systems, metadata are much smaller than data, so C_mmeta << C_mdata. Therefore, C_mmap and C_mdata are dominant in C_phy.

The cost of our method with two logical nodes in a physical node, C_lg, is as follows. Assuming we use the same storage virtualization and transparent data migration implementations as the existing method, the cost to transfer metadata between logical nodes, C_mmeta^lg, and the cost to update data location information, C_mmap, are dominant in C_lg:

C_lg = C_mdfy + n × (C_mmap + C_mmeta^lg)

C_mmeta nearly equals C_mmeta^lg: C_mmeta causes high degradation over a short duration, because it involves reading and writing in different physical nodes, whereas C_mmeta^lg causes low degradation over a long duration, because the transfer remains within the same physical node.

Consequently, using the same storage size results in the following:

• When the data segment size is large, that is, n is small and C_mdata is large, C_lg << C_phy;

• When the data segment size is small, that is, n is large and C_mdata is small, C_lg → C_phy;

• C_phy is always larger than C_lg, because C_phy must include C_mdata.

Note that C_mmap arises from the locking of data location information for concurrency control. The performance degradation of C_mdata is caused by consuming system resources such as disk accesses and network transfers, and the duration of C_mdata is proportional to the amount of stored data. The duration of C_mmap + C_mmeta is proportional to n × (metadata size).
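The comparison can be summarized in one display. The following simply restates the two cost expressions defined above, dropping the negligible C_mdfy and using C_mmeta ≈ C_mmeta^lg as argued; no new quantities are introduced.

```latex
\begin{align*}
C_{\mathrm{phy}} &\approx n\,\bigl(C_{\mathrm{mmeta}} + C_{\mathrm{mdata}} + C_{\mathrm{mmap}}\bigr),\\
C_{\mathrm{lg}}  &\approx n\,\bigl(C_{\mathrm{mmap}} + C^{\mathrm{lg}}_{\mathrm{mmeta}}\bigr),\\
C_{\mathrm{phy}} - C_{\mathrm{lg}} &\approx n\,C_{\mathrm{mdata}} \;>\; 0 .
\end{align*}
```

Thus the gap between the two methods grows in proportion to the total amount of data that physical detaching must copy, n × C_mdata.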

3.2.5. Strategy for restricted data locations

As already discussed, the restrictions on data locations in some systems must be relaxed to apply our method.

An example is systems employing both chained declustering [7] and range partitioning in their replica placement policy (Fig. 5). To apply our method without loss of reliability and usability, replica location restrictions must be modified.

Chained declustering is rotational mirroring for replica placement to avoid loss of data; Figure 5 illustrates an example. Range partitioning is a strategy that divides data into chunks based on ranges of keys associated with the data and stores each chunk in a separate node. It is usually used in parallel relational databases. In this strategy, a datum in a chunk can only be moved to chunks that are neighbors of the original chunk. With both strategies, replicas are best placed in neighboring nodes for rapid recovery. If a logical node is inserted as a neighbor node of an existing logical node, the data on both logical nodes are lost when the physical node fails.
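As a concrete reference point, the following Java sketch encodes the textbook placement rules just described: range partitioning maps a key to a partition, and chained declustering places the replica of each partition on the neighboring node. It does not reproduce the modified two-chain layout of Fig. 6, and the class and method names are our own.

```java
// Sketch of range partitioning plus plain chained declustering [7].
// Names and the mapping of partitions to nodes are illustrative assumptions.
final class ChainedDeclustering {
    private final long[] rangeUpperBounds;   // sorted upper bounds of the key ranges, one per partition
    private final int nodeCount;

    ChainedDeclustering(long[] rangeUpperBounds, int nodeCount) {
        this.rangeUpperBounds = rangeUpperBounds.clone();
        this.nodeCount = nodeCount;
    }

    // Range partitioning: the first range whose upper bound covers the key names the partition.
    int partitionOf(long key) {
        for (int i = 0; i < rangeUpperBounds.length; i++) {
            if (key <= rangeUpperBounds[i]) return i;
        }
        throw new IllegalArgumentException("key outside all ranges: " + key);
    }

    // Primary copy: partition i is stored on node i (modulo the number of nodes).
    int primaryNode(long key) {
        return partitionOf(key) % nodeCount;
    }

    // Chained declustering: the replica lives on the neighboring node in the chain.
    int backupNode(long key) {
        return (primaryNode(key) + 1) % nodeCount;
    }
}
```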

In this case, to apply our method, the replica placement policy must be modified. An example of a modified placement is shown in Fig. 6. We define two chains: one includes only odd nodes and the other includes only even nodes. Logical nodes are inserted not only at the target node, but also at neighbor nodes. This modification maintains the reliability of the system, keeping data redundancy between the replica node and the updating node. It also maintains system performance, keeping replica relocation within the same physical nodes.

Fig. 5. Distributed storage systems using a range partitioning strategy and chained declustering replication.

Fig. 6. A small modification of the replication strategy to apply our method.

Thus, a minor modification to the data placement policy makes the system able to employ our method.

4. Experiment

To evaluate the performance advantage of our method, we present the results of experiments on our Autonomous Disks [1] simulation system. We use performance metrics such as system throughput, response time, and execution duration, and compare them to those of the existing method, which involves detaching physical nodes.

In this section we describe Autonomous Disks and the experimental setup, and then present and discuss the results of the experiments.

4.1. Autonomous Disks

Autonomous Disks is a parallel storage system we have proposed for high availability, scalability, and reliability. It employs the united management model. The system is constructed from disk nodes combined with a commodity network. Using controllers and cache memories on the disk nodes, it implements advanced data management functions such as load balancing and transparent data migration. There is no centralized control point.

The system based on Autonomous Disks employs a distributed B-tree as the distributed data index, to realize storage virtualization, and also employs chained declustering as the replica placement policy.

The system realizes several other advanced data management functions based on ECA rules and transaction processing, such as data relocation, load balancing, and failure recovery.

4.2. Environment

The experimental system was implemented in Java on a Linux cluster system. Our method was implemented as follows. The logical nodes are realized as UNIX processes. The network address for each logical node is allocated using IP aliasing. The data storage space shared by all logical nodes on a physical node is a filesystem partition formatted with the ext3 filesystem. Data segments are implemented as files.
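A minimal Java sketch of this mapping is shown below: a logical node is just a process that binds its service socket to its own aliased address and keeps its segments as files in a directory on the shared partition. The class name and command-line arguments are illustrative, not the experimental code.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.file.Path;

// Sketch of one logical node as an ordinary process (assumed names and arguments).
final class LogicalNodeProcess {
    public static void main(String[] args) throws Exception {
        // IP alias assigned to this logical node (e.g., configured with IP aliasing on Linux).
        InetAddress aliasAddress = InetAddress.getByName(args[0]);
        int port = Integer.parseInt(args[1]);

        // Directory on the shared ext3 partition; each data segment is simply a file here.
        Path segmentDir = Path.of(args[2]);

        try (ServerSocket server = new ServerSocket()) {
            server.bind(new InetSocketAddress(aliasAddress, port));
            System.out.println("logical node listening on " + aliasAddress + ":" + port
                    + ", segments under " + segmentDir);
            // Request handling (reads, writes, and the logical migration protocol) would follow here.
        }
    }
}
```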

Specifications of each node are given in Table 1. The nodes are connected to a large switch that has sufficient capacity. The following experiments use a system with six physical nodes.

We chose the aB+-tree [8] as the distributed index for storage virtualization. It combines a list-structured global index, which is duplicated on all nodes, with local B+-tree indices that indicate the locations of data segments. Modification notifications for the global index are sent synchronously to all nodes. The effects of the index structure on performance are discussed in Section 4.4. The metadata are stored in 4-Kbyte blocks in the leaf pages of a B-tree.
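To illustrate the two-level lookup this implies, the sketch below pairs a small global map of key ranges (duplicated on every node) with a per-node TreeMap standing in for the local B+-tree. The class and method names are assumptions for illustration, not the aB+-tree implementation of [8].

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a two-level distributed index: global key ranges -> logical node,
// then the node's local tree -> file holding the segment. Names are illustrative.
final class TwoLevelIndex {
    // Global index: highest key handled by each logical node; kept identical on all nodes.
    private final TreeMap<Long, String> globalIndex = new TreeMap<>();
    // Local indices: per logical node, key -> path of the file storing the segment.
    private final Map<String, TreeMap<Long, String>> localIndices = new HashMap<>();

    void assignRange(long upperKey, String logicalNode) {
        globalIndex.put(upperKey, logicalNode);
        localIndices.computeIfAbsent(logicalNode, n -> new TreeMap<>());
    }

    void insertSegment(long key, String filePath) {
        localIndices.get(ownerOf(key)).put(key, filePath);
    }

    // Global lookup: the first range whose upper bound is >= key names the owning logical node.
    String ownerOf(long key) {
        Map.Entry<Long, String> entry = globalIndex.ceilingEntry(key);
        if (entry == null) {
            throw new IllegalArgumentException("key outside all ranges: " + key);
        }
        return entry.getValue();
    }

    // Full resolution: global index picks the node, local index locates the file.
    String locateSegment(long key) {
        return localIndices.get(ownerOf(key)).get(key);
    }
}
```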

The data segment size is fixed for simplicity. The total amount of data was fixed at 24 Gbytes, so each node has about 4 Gbytes of data and 4 Gbytes of replica data of another node. The experiments use two data sets: 1536 data items of size 16 Mbytes, and 3072 data items of size 8 Mbytes.

Accesses to data, that is, the workload, were generated by six PCs with the same specifications as the data storage nodes. Each PC has six threads that generate read accesses to all stored data evenly, creating a well-balanced workload.

The experiments compare throughput, response time, and execution duration for the following four settings:

• Normal operation with no system update. These experiments used only 16-Mbyte segments.

• While detaching a physical node.

• While using logical data migration to release a logical node from data management.

• While using logical data migration to release a logical node from data management on each of the six physical nodes. These experiments used only 16-Mbyte segments.

4.3. Results

4.3.1. Additional overhead of logical node

The first experiment observes the performance overhead required to construct logical nodes in physical nodes. Figure 7 shows the 95% confidence intervals of results over five replications of the experiment. While the usual system achieves a throughput of about 51.7 Mbytes per second, the system with two logical nodes in a physical node achieves about 51.4 Mbytes per second. This shows that the overhead is small enough to ignore.

Table 1. Specifications of PCs used as storage nodes and clients

4.3.2. Throughput and execution duration

The next experiments observed the degradation of system throughput and the execution duration. Figure 8 shows 95% confidence intervals of results over five replications.

Compared to normal system operation, detaching a physical node decreases the average throughput by 49% for 16-Mbyte data segments and 55% for 8-Mbyte data segments. The smaller data segments result in larger throughput degradation because the number of data migrations per unit size increases. In contrast, logical data migration decreases the average throughput by 5% for both 16- and 8-Mbyte data segments.

The duration of the execution to migrate all stored data in a node to another node is shown in Fig. 9 with 95% confidence intervals. Detaching a physical node requires 1160 seconds for 16-Mbyte data segments and 1075 seconds for 8-Mbyte data segments. In contrast, logical data migration requires 61 seconds for 16-Mbyte data segments and 91 seconds for 8-Mbyte data segments.

4.3.3. Response time

Response time is defined as the duration between the issue of a read request and receipt of the whole data segment.

Figure 10 shows the relative cumulative frequency of observed response times with 16-Mbyte data segments. The experiment with 8-Mbyte data segments shows similar trends. A point (x, y) on the graph indicates that y% of total accesses are processed within x seconds.

The figure shows that the fraction of accesses consuming more than 30 seconds in the experiment with physical detaching is twice as high as that in the experiment with logical data migration. Although the fraction of accesses consuming from 10 to 30 seconds increases in the experiment with logical data migration, its maximum response time is about the same as that of the normal system. This shows that the system maintains its availability when our method is used.

Fig. 7. The average throughput when two logical nodes run in a physical node.

Fig. 8. The average throughput while executing the process that detaches a node physically or logically.

Fig. 9. Mean execution times to detach a node with 4 Gbytes of data.

Fig. 10. Relative cumulative frequency of response time.

The performance degradation with physical detaching is caused by disk accesses and network congestion, which increase the response times for requests from users. The performance degradation with logical data migration is mostly caused by the updating of data location information on the distributed index. However, the resources required to update the distributed index are much less than those for requests, so the response time does not increase much.

4.3.4. Behavior under multiple migration

In this experiment, we performed logical data migrations on all six nodes. The duration is 90.1 ± 9.7 seconds. This is only half as much again as the duration of logical data migration for one node. Figure 11 shows the results. Migrating one node generates four updates of the global index per second, while migrating all six nodes generates 17 updates per second.

The throughput in this experiment was 49.3 ± 2.4 Mbytes per second, which is nearly equal to that in the normal case, 49.4 ± 4.6 Mbytes per second. The simulation system uses range partitioning and a B-tree-based global index. Logical data migration updates only the part of the index range associated with the target logical node, so the process requires an exclusive lock only on that part. This results in fewer conflicts between global index updates, and thus less performance degradation. In systems that use a storage virtualization scheme with more update conflicts, performance metrics such as throughput and execution duration will degrade more than in this experiment.

4.4. Discussion of experimental results

4.4.1. Performance comparison

In this section, we discuss the factors affecting performance degradation. If we define performance degradation as the product of the decrease in throughput and the duration of that decrease, our method causes less than one percent of the performance degradation caused by the existing method that detaches physical nodes. As discussed in Section 3.2.4, the causes of performance degradation are the cost to update data location information, C_mmap, the cost to transfer metadata, C_mmeta, and the cost to transfer data segments, C_mdata. Our approach of delegating data management permission cannot avoid C_mmap and C_mmeta.

The difference between the two methods is C_mdata. It is a disk access cost in these experiments, because the experimental results, which indicate 50 Mbytes per second of throughput, show that the network bandwidth has spare capacity. Our observations indicate that the CPU also has spare capacity.

The experiments with detaching a physical node require similar durations regardless of the data segment size and the number of data segments for the same amount of data. Therefore, the total amount of data is dominant in those experiments. In contrast, with logical data migration, the experiments using 3072 8-Mbyte data segments consumed half as much time again as those using 1536 16-Mbyte data segments. This shows that the results of those experiments depend on the number of data segments. It is clear that our method results in less performance degradation when migrating multiple data segments.

The causes of performance degradation in logical data migration are C_mmap and C_mmeta. In detail, C_mmap affects both the increase in duration and the decrease in throughput, while C_mmeta affects only the latter. The experiments in this paper implement a B+-tree as the distributed data index for storage virtualization. The global index of this structure, which is a list of key ranges, is duplicated in all nodes. Updating all the global indices involved in migrating a data segment is the main factor that increases the duration, and more nodes result in longer durations. Employing a more efficient structure as the distributed index solves this problem. An example of such a structure is the Fat-Btree [9], which requires update notifications to be propagated to only a few nodes rather than all of them.

The decrease in throughput with logical data migration is caused by waiting for mechanical movement in the HDD, such as head positioning and rotational latency, and by congestion of concurrency control, in which users must wait for the data migration process to release its exclusive lock on data location information. In the experiments, the latter is the main cause of the decrease in throughput. This is shown by the experiment in Section 4.3.4, in which the multiple-node updates resulted in throughputs similar to those of a single-node update, even though the former involves six times as many metadata accesses as the latter. The effect of concurrency control congestion depends on the ratio of the number of accesses per unit time to the amount of data migration per unit time.

Fig. 11. Execution time for logical migration at six nodes concurrently.

Consequently, our method with logical data migration performs better than the method that detaches physical nodes for both throughput and duration, and meets our expectations. In addition, our method improves with large data segments. The 8- and 16-Mbyte data segments used in our experiments are larger than those used in many filesystems, but smaller than the chunks in the Google FS [3]. A system that stores many small data segments should migrate multiple data segments collectively for better performance.

4.4.2. Execution duration

In the experiments with 20 Gbytes of data, the migration to empty a node required from a few minutes to 20 minutes. In actual use, it is conceivable that each node could store more than 100 Gbytes and a system could have thousands of nodes. The durations in such a system would be much longer than those in our experiments.

We can estimate the duration with larger nodes. Assuming we store 100 Gbytes of data in each node, which is 20 times more than in our experiment, updating all nodes with physical detaching would require 40 hours (= 20 minutes × 20 × 6 nodes). In contrast, with the same assumptions, we estimate that our method would consume 2 hours (= 1 minute × 20 × 6 nodes). In addition, our method can update multiple nodes simultaneously, and so would consume only 30 minutes (= 1.5 minutes × 20).

Consequently, our method is much faster than the existing method.

5. Discussion of Proposed Method

5.1. Applicability

In this section, we discuss the applicability of our method.

As discussed in Section 3.2.2, there will be a period when both the newer and older versions of the software being upgraded are present in the system. Therefore, it is required that (i) the newer version is compatible with the older version. This requirement has two aspects: (i-a) the update does not modify the communication design, and (i-b) the new version is designed to communicate with the old version. Condition (i) will normally be met, because the existing method with physical detaching also requires it.

As discussed in Section 3.2.3, it is also required that (ii) there can be multiple data management entities in a physical node. When this condition holds, our method can be applied without resorting to the more complex techniques for patching low-level code introduced in related work, such as Kerninst [11].

Consequently, existing systems meet conditions (i) and (ii), so it is possible to apply our method.

5.2. Software update issues with the separated management model

We have focused on systems employing the united management model. In this section, we discuss the alternative model of separated management.

Examples of systems employing the separated management model are systems constructed from servers and storage nodes combined in a Storage Area Network (SAN). In a SAN system, data storage nodes such as HDDs and RAID groups are separate from data management nodes.

Updating software on such systems is readily achieved by detaching the data management node physically using a failure recovery mechanism. Permission to manage data on the data management node considered to have failed is delegated to other normal data management nodes. This delegation does not require the transfer of large amounts of data; instead, only a small amount of management information must be transferred, because the data management nodes in such systems do not hold data segments. Therefore, there is not the same performance problem as with the united management model.

If the data management nodes have replicas of the data for performance gain, there are the same problems as with the other model.

6. Related Work

There are various approaches to updating software on high-availability systems without stopping their services.

Ajmani proposed using a spare node with sensitive scheduling and simultaneous running of different versions of the software [10]. The method allows systems to have nodes running different versions of software, as does our method. However, it does not solve the performance problem caused by data transfers.

There are other approaches that patch modifications into software in memory dynamically, such as Kerninst [11]. This approach can theoretically be applied to all kinds of updates and causes no performance degradation; thus, actual availability is high in operating systems that can implement it. A problem with the approach is the complexity of making patches that can be applied to dynamic memory images, which restricts its applicability. In contrast, our method requires no complex patches and exploits existing functions such as transparent data migration and storage virtualization. However, software that cannot use our method, such as operating systems, must use methods like Kerninst or other methods with high performance degradation.

7. Conclusions

In this paper, we have proposed a novel method to update data management software on parallel storage systems without stopping their services, using logical data migration between logical nodes. We have presented the concept of logical nodes and logical data migration, which delegates permission to manage data rapidly by transferring metadata and storage management information to other logical nodes in the same physical node, and hides the migration from users with storage virtualization. We have also illustrated how to release an old version node from managing data by logical data migration. The proposed method can update software with lower performance degradation than existing methods, and without loss of system reliability, availability, or usability. We have also shown an example of applying our method to a system employing chained declustering, which has restrictions on its data placement policy.

The experiments with our proposed Autonomous Disks system, an advanced parallel storage system, have shown the effectiveness of our method compared to existing methods involving physical node detaching. Our method has less performance degradation than the existing method, and has shorter execution times because it can run simultaneously in all physical nodes. We have also shown that systems that store large numbers of small data segments should migrate multiple data segments collectively for better performance.

It is important to develop methods to update software involving low-level processing, such as operating systems and firmware, on high-availability systems. Although this paper focuses only on read and write accesses as the storage service, it is also important to apply our methods to more intelligent storage systems. Some studies [12] have shown that it is efficient to integrate high-level functions, such as part of a DBMS mechanism, into storage systems.

Acknowledgments. This research was carried out in part with assistance from the Japan Science and Technology Agency CREST; the Storage Research Consortium (SRC); a Grant-in-Aid for Scientific Research on Priority Areas (16016232) from the Ministry of Education, Culture, Sports, Science and Technology; and the 21st Century COE Program “Framework for Systemization and Application of Large-Scale Knowledge Resources.”

REFERENCES

1. Yokota H. Autonomous Disks for advanced database applications. Proc International Symposium on Database Applications in Non-Traditional Environments (DANTE’99), p 441–448.

2. Ito D, Yokota H. Cluster reconfiguration of autonomous disks in disk failures or reinforcement. Proc 12th IEICE Data Engineering Workshop 2001 (DEWS2001), 2B-4.

3. Ghemawat S, Gobioff H, Leung ST. The Google file system. 19th ACM Symposium on Operating Systems Principles, 2003.

4. Yasuda Y, Kawamoto S, Ebata A, Okitsu J, Higuchi T, Hamanaka N. Scalability of X-NAS: A clustered NAS system. IPSJ Transactions on Advanced Computing Systems, 2003.

5. Katsurashima W, Yamakawa S, Torii T, Ishikawa J, Kikuchi Y, Yamaguti K, Fujii K, Nakashima T. NAS switch: A novel CIFS server virtualization. IEEE Symposium on Mass Storage Systems, p 82–86, 2003.

6. Hulen H, Graf O, Fitzgerald K, Watson RW. Storage area networks and high performance storage system. 10th NASA Goddard Conference on Mass Storage Systems and Technologies, UCRL-JC-146951, 2002.

7. Hsiao HI, DeWitt DJ. Chained declustering: A new availability strategy for multiprocessor database machines. Proc Sixth International Conference on Data Engineering, 1990.

8. Lee ML, Kitsuregawa M, Ooi BC, Tan KL, Mondal A. Towards self-tuning data placement in parallel database systems. Proc ACM SIGMOD, p 225–236, 2000.

9. Yokota H, Kanemasa Y, Miyazaki J. Fat-Btree: An update-conscious parallel directory structure. Proc 15th International Conf on Data Engineering, p 448–457, 1999.

10. Ajmani S. Automatic software upgrades for distributed systems. Ph.D. thesis proposal, Massachusetts Institute of Technology, 2003.

11. Tamches A, Miller BP. Fine-grained dynamic instrumentation of commodity operating system kernels. 3rd Symposium on Operating Systems Design and Implementation (OSDI), 1999.

12. Riedel E, Gibson GA, Faloutsos C. Active storage for large-scale data mining and multimedia applications. Proc 1998 Very Large Data Bases Conference (VLDB).


AUTHORS (from left to right)

Dai Kobayashi graduated from the Department of Computer Science at Tokyo Institute of Technology in 2003, completed his master’s course in the Graduate School of Information Science and Engineering in 2005, and is now a Ph.D. student. He is engaged in research on data engineering and storage systems.

Akitsugu Watanabe graduated from the Department of Computer Science at Tokyo Institute of Technology in 2000, completed his master’s course in the Graduate School of Information Science and Engineering in 2002, and is now a Ph.D. student. He is engaged in research on data engineering.

Toshihiro Uehara (member) graduated from the Department of Electrical Engineering at Keio University in 1979, completed his master’s course in 1981, and joined the Japan Broadcasting Corporation (NHK). He moved to NHK Science and Technical Research Laboratories in 1984, and is now a senior research engineer there. He is engaged in research and development on digital VTR, server-type broadcasting, and autonomic storage systems. He received the Ichimura Academic Achievement Award in 1991 and an award from the Japan Institute of Invention and Innovation in 1994.

Haruo Yokota (member) received his B.E. degree from the Department of Physical Electronics at Tokyo Institute of Technology in 1980, completed his master’s course in computer science in 1982, and joined Fujitsu Ltd. He moved to the Institute of New Generation Computer Technology (ICOT) for the Japanese 5th Generation Computer Project. He returned to Fujitsu Laboratories Ltd. in 1986. He became an associate professor of information science at the Japan Advanced Institute of Science and Technology in 1992. He moved to the Graduate School of Information Science and Engineering at Tokyo Institute of Technology in 1998, and has been a professor at the Global Scientific Information and Computing Center and the Department of Computer Science since 2001. He holds a D.Eng. degree. His research interests include the general research area of data engineering, information storage systems, and dependable computing. He is a member of the board of trustees of the Database Society of Japan, and a member of the board of the Japan Chapter of ACM SIGMOD. He was chairman of the board of IEICE’s Technical Group of Data Engineering from 2003 to 2005. He is a member of the Information Processing Society of Japan, the Japan Society for Artificial Intelligence, the Institute of Electrical and Electronics Engineers, and the Association for Computing Machinery.
