
Unit-III

Cluster Computing

Introduction

A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability. Computer clusters have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world.

Basic concepts

The desire to get more computing horsepower and better reliability by orchestrating a number of low-cost, commercial off-the-shelf computers has given rise to a variety of architectures and configurations.

The computer clustering approach usually (but not always) connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast local area network. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as, by and large, one cohesive computing unit, e.g. via a single system image concept.

Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing, which also use many nodes but with a far more distributed nature.

A computer cluster may be a simple two-node system which just connects two personal computers, or may be a very fast supercomputer. A basic approach to building a cluster is that of a Beowulf cluster, which may be built with a few personal computers to produce a cost-effective alternative to traditional high-performance computing. An early project that showed the viability of the concept was the 133-node Stone Soupercomputer. [4] The developers used Linux, the Parallel Virtual Machine toolkit and the Message Passing Interface library to achieve high performance at a relatively low cost.

Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance. The TOP500 organization's semiannual list of the 500 fastest supercomputers often includes many clusters; e.g. the world's fastest machine in 2011 was the K computer, which has a distributed-memory cluster architecture. [6][7]

Attributes of clusters

Computer clusters may be configured for different purposes, ranging from general-purpose business needs such as web-service support to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach. Note that the attributes described below are not exclusive and a "compute cluster" may also use a high-availability approach, etc.


Figure: A load-balancing cluster with two servers and four user stations.

"Load-balancing" clusters are configurations in which cluster-nodes share computationalworkload to provide better overall performance. For example, a web server cluster may assigndifferent queries to different nodes, so the overall response time will be optimized. However,approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with differentalgorithms from a web-server cluster which may just use a simple round-robin method byassigning each new request to a different node.

"Computer clusters" are used for computation-intensive purposes, rather than handling IO-oriented operations such as web service or databases.For instance, a computer cluster mightsupport computational simulations of weather or vehicle crashes. Very tightly coupled computerclusters are designed for work that may approach "supercomputing".

"High-availability clusters" (also known as failover clusters, or HA clusters) improve theavailability of the cluster approach. They operate by having redundant nodes, which are thenused to provide service when system components fail. HA cluster implementations attempt to useredundancy of cluster components to eliminate single points of failure. There are commercialimplementations of High-Availability clusters for many operating systems. The Linux-HAproject is one commonly used free software HA package for the Linux operating system.

Design and configuration

One of the issues in designing a cluster is how tightly coupled the individual nodes may be. For instance, a single computer job may require frequent communication among nodes: this implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. The other extreme is where a computer job uses one or few nodes and needs little or no inter-node communication, approaching grid computing.

For instance, in a Beowulf system, the application programs never see the computational nodes (also called slave computers) but only interact with the "master", which is a specific computer handling the scheduling and management of the slaves. In a typical implementation the master has two network interfaces, one that communicates with the private Beowulf network for the slaves, the other for the general-purpose network of the organization. The slave computers typically have their own version of the same operating system, and local memory and disk space. However, the private slave network may also have a large and shared file server that stores global persistent data, accessed by the slaves as needed.


By contrast, the special-purpose 144-node DEGIMA cluster is tuned to running astrophysical N-body simulations using the Multiple-Walk parallel treecode, rather than general-purpose scientific computations.

Due to the increasing computing power of each generation of game consoles, a novel use has emerged where they are repurposed into high-performance computing (HPC) clusters. Some examples of game console clusters are Sony PlayStation clusters and Microsoft Xbox clusters. Another example of a consumer game product is the Nvidia Tesla Personal Supercomputer workstation, which uses multiple graphics accelerator processor chips.

Computer clusters have historically run on separate physical computers with the same operating system. With the advent of virtualization, the cluster nodes may run on separate physical computers with different operating systems that are overlaid with a virtual layer to look similar. The cluster may also be virtualized in various configurations as maintenance takes place. An example implementation is Xen as the virtualization manager with Linux-HA. [11]

Cluster Network Computing Architecture

The topic of clustering draws a fair amount of interest from the computer networking community, but the concept is a generic one. To use a non-technical analogy: saying you are interested in clustering is like saying you are interested in food. Does your interest lie in cooking? In eating? Perhaps you are a farmer interested primarily in growing food, or a restaurant critic, or a nutritionist. As with food, a good explanation of clustering depends on your situation.

The computing world uses the term "clustering" in at least two distinct ways. For one, "clustering" or "cluster analysis" refers to algorithmic approaches to determining similarity among objects. This kind of clustering might be very appealing if you like math... but this is not the sort of clustering we mean in the networking sense.

What Is Cluster Computing?

In a nutshell, network clustering connects otherwise independent computers to work together in some coordinated fashion. Because clustering is a term used broadly, the hardware configuration of clusters varies substantially depending on the networking technologies chosen and the purpose (the so-called "computational mission") of the system. Clustering hardware comes in three basic flavors: so-called "shared disk," "mirrored disk," and "shared nothing" configurations.

Shared Disk Clusters

One approach to clustering utilizes central I/O devices accessible to all computers ("nodes") within the cluster. We call these systems shared-disk clusters, as the I/O involved is typically disk storage for normal files and/or databases. Shared-disk cluster technologies include Oracle Parallel Server (OPS) and IBM's HACMP.

Shared-disk clusters rely on a common I/O bus for disk access but do not require shared memory. Because all nodes may concurrently write to or cache data from the central disks, a synchronization mechanism must be used to preserve coherence of the system. An independent piece of cluster software called the "distributed lock manager" assumes this role.

Shared-disk clusters support higher levels of system availability: if one node fails, other nodes need not be affected. However, higher availability comes at the cost of somewhat reduced performance in these systems because of the overhead of using a lock manager and the potential bottlenecks of shared hardware generally. Shared-disk clusters make up for this shortcoming with relatively good scaling properties: OPS and HACMP support eight-node systems.

Shared Nothing Clusters

A second approach to clustering is dubbed shared-nothing because it does not involve concurrent disk accesses from multiple nodes. (In other words, these clusters do not require a distributed lock manager.) Shared-nothing cluster solutions include Microsoft Cluster Server (MSCS). MSCS is an atypical example of a shared-nothing cluster in several ways. MSCS clusters use a shared SCSI connection between the nodes, which naturally leads some people to believe this is a shared-disk solution. But only one server (the one that owns the quorum resource) needs the disks at any given time, so no concurrent data access occurs. MSCS clusters also typically include only two nodes, whereas shared-nothing clusters in general can scale to hundreds of nodes.

Mirrored Disk Clusters

Mirrored-disk cluster solutions include Legato's Vinca. Mirroring involves replicating all application data from primary storage to a secondary backup (perhaps at a remote location) for availability purposes. Replication occurs while the primary system is active, although the mirrored backup system -- as in the case of Vinca -- typically does not perform any work outside of its role as a passive standby. If a failure occurs in the primary system, a failover process transfers control to the secondary system. Failover can take some time, and applications can lose state information when they are reset, but mirroring enables a fairly fast recovery scheme requiring little operator intervention. Mirrored-disk clusters typically include just two nodes.

Data sharing and communication

Data sharing

As the early computer clusters were appearing during the 1970s, so were supercomputers. One of the elements that distinguished the two classes at that time was that the early supercomputers relied on shared memory. To date, clusters do not typically use physically shared memory, while many supercomputer architectures have also abandoned it.

However, the use of a clustered file system is essential in modern computer clusters. Examples include the IBM General Parallel File System, Microsoft's Cluster Shared Volumes and the Oracle Cluster File System.

Message passing and communication

Two widely used approaches for communication between cluster nodes are MPI, the Message Passing Interface, and PVM, the Parallel Virtual Machine.

PVM was developed at the Oak Ridge National Laboratory around 1989, before MPI was available. PVM must be directly installed on every cluster node and provides a set of software libraries that present the node as a "parallel virtual machine". PVM provides a run-time environment for message passing, task and resource management, and fault notification. PVM can be used by user programs written in C, C++, or Fortran.
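As a rough illustration of the PVM library calls described above, the sketch below shows a master program that spawns one worker task and unpacks a single integer from it. The worker executable name ("worker") and the message tag are assumptions made for this example; the worker itself would use pvm_parent(), pvm_initsend(), pvm_pkint() and pvm_send() to reply.

/* Hedged PVM sketch: a master enrolls in the virtual machine, spawns one
   worker task (an assumed executable named "worker"), and receives one
   integer back on message tag 1. */
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int mytid = pvm_mytid();          /* enroll this process in the PVM */
    int worker_tid;
    int spawned = pvm_spawn("worker", (char **)0, PvmTaskDefault,
                            "", 1, &worker_tid);

    if (spawned == 1) {
        int result;
        pvm_recv(worker_tid, 1);      /* block for a message with tag 1 */
        pvm_upkint(&result, 1, 1);    /* unpack one integer from the buffer */
        printf("master %x got %d from worker %x\n", mytid, result, worker_tid);
    }

    pvm_exit();                       /* leave the virtual machine */
    return 0;
}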


MPI emerged in the early 1990s out of discussions among 40 organizations. The initial effort was supported by ARPA and the National Science Foundation. Rather than starting anew, the design of MPI drew on various features available in commercial systems of the time. The MPI specifications then gave rise to specific implementations. MPI implementations typically use TCP/IP and socket connections. MPI is now a widely available communications model that enables parallel programs to be written in languages such as C, Fortran, and Python. Thus, unlike PVM, which provides a concrete implementation, MPI is a specification which has been implemented in systems such as MPICH and Open MPI.
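To complement the description of MPI above, here is a minimal, generic C example (not taken from the text) in which rank 1 sends one integer to rank 0; the payload value and message tag are arbitrary choices.

/* Minimal MPI sketch: rank 1 sends one integer to rank 0.
   Compile with an MPI wrapper compiler (e.g. mpicc) and run with mpirun. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                    /* join the MPI job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */

    if (size >= 2) {
        int value;
        if (rank == 1) {
            value = 42;                        /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank 1\n", value);
        }
    }

    MPI_Finalize();                            /* leave the MPI job */
    return 0;
}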

Cluster management

One of the challenges in the use of a computer cluster is the cost of administering it, which can at times be as high as the cost of administering N independent machines if the cluster has N nodes. In some cases this provides an advantage to shared-memory architectures with lower administration costs. This has also made virtual machines popular, due to the ease of administration.

Task scheduling

When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes a challenge. The MapReduce approach was suggested by Google in 2004, and frameworks such as Hadoop have since implemented it along with related algorithms.

However, given that in a complex application environment the performance of each job depends on the characteristics of the underlying cluster, mapping tasks onto CPU cores and GPU devices presents significant challenges. [17] This is an area of ongoing research, and algorithms that combine and extend MapReduce and Hadoop have been proposed and studied. [17]
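To make the MapReduce model mentioned above concrete, the following single-process C sketch counts words: a "map" step emits (word, 1) pairs and a "reduce" step groups and sums them. The sample input and table size are arbitrary; a real MapReduce or Hadoop job would distribute the map and reduce phases across cluster nodes.

/* Single-process illustration of the MapReduce idea: "map" emits (word, 1)
   pairs, the pairs are grouped by key, and "reduce" sums the counts. */
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 64

struct kv { char key[32]; int count; };

static struct kv table[MAX_WORDS];
static int nkeys = 0;

/* "Reduce" side: fold an emitted (word, 1) pair into the grouped table. */
static void reduce_emit(const char *word)
{
    for (int i = 0; i < nkeys; i++) {
        if (strcmp(table[i].key, word) == 0) {
            table[i].count++;           /* existing key: add to its count */
            return;
        }
    }
    if (nkeys < MAX_WORDS) {            /* new key: start a new group */
        strncpy(table[nkeys].key, word, sizeof(table[nkeys].key) - 1);
        table[nkeys].count = 1;
        nkeys++;
    }
}

/* "Map" side: split a line into words and emit (word, 1) for each. */
static void map_line(char *line)
{
    for (char *w = strtok(line, " \t\n"); w; w = strtok(NULL, " \t\n"))
        reduce_emit(w);
}

int main(void)
{
    /* Arbitrary sample input standing in for a large distributed data set. */
    char doc1[] = "cluster computing uses many nodes";
    char doc2[] = "many nodes form a cluster";

    map_line(doc1);
    map_line(doc2);

    for (int i = 0; i < nkeys; i++)
        printf("%s\t%d\n", table[i].key, table[i].count);
    return 0;
}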

Node failure management

When a node in a cluster fails, strategies such as "fencing" may be employed to keep the rest of the system operational. Fencing is the process of isolating a node or protecting shared resources when a node appears to be malfunctioning. There are two classes of fencing methods: one disables the node itself, and the other disallows access to resources such as shared disks.

The STONITH method stands for "Shoot The Other Node In The Head", meaning that the suspected node is disabled or powered off. For instance, power fencing uses a power controller to turn off an inoperable node.

The resource fencing approach disallows access to resources without powering off the node. This may include persistent reservation fencing via SCSI-3, Fibre Channel fencing to disable the Fibre Channel port, or global network block device (GNBD) fencing to disable access to the GNBD server.
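Purely as a sketch of the power-fencing idea, the following C loop assumes a heartbeat file that a peer node touches periodically and a hypothetical helper script /usr/local/sbin/power_off_node that talks to the power controller; production HA stacks such as Linux-HA use dedicated fencing agents rather than anything this simple.

/* Hedged sketch of a STONITH-style power-fencing loop. The peer node name,
   heartbeat file path, and power-off helper are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

#define HEARTBEAT_TIMEOUT 30   /* seconds without a heartbeat before fencing */

/* Seconds since the peer last updated its heartbeat file. */
static long heartbeat_age(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;                         /* no heartbeat file at all */
    return (long)(time(NULL) - st.st_mtime);
}

int main(void)
{
    const char *peer = "node02";           /* illustrative peer node name */
    const char *hb_file = "/var/run/cluster/node02.heartbeat";

    for (;;) {
        long age = heartbeat_age(hb_file);
        if (age < 0 || age > HEARTBEAT_TIMEOUT) {
            char cmd[256];
            /* Power the suspect node off before taking over its resources. */
            snprintf(cmd, sizeof(cmd),
                     "/usr/local/sbin/power_off_node %s", peer);
            system(cmd);
            break;
        }
        sleep(5);                          /* poll the heartbeat periodically */
    }
    return 0;
}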

Software development and administration

Parallel programming

Load-balancing clusters such as web servers use cluster architectures to support a large number of users; typically each user request is routed to a specific node, achieving task parallelism without multi-node cooperation, given that the main goal of the system is providing rapid user access to shared data. However, "computer clusters" which perform complex computations for a small number of users need to take advantage of the parallel processing capabilities of the cluster and partition "the same computation" among several nodes. [20]


Automatic parallelization of programs remains a technical challenge, but parallel programming models can be used to achieve a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors. [20][21]
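As an illustration of partitioning "the same computation" among nodes, the following generic MPI sketch scatters an array across ranks, lets each rank sum its slice, and reduces the partial sums on rank 0. The array size and its contents are arbitrary assumptions, chosen so the array length divides evenly by the number of processes.

/* Sketch of partitioning one computation (a sum) across cluster nodes
   with MPI_Scatter and MPI_Reduce. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define TOTAL 1024                 /* arbitrary problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = TOTAL / size;                       /* slice per process */
    int *data = NULL;
    if (rank == 0) {
        data = malloc(TOTAL * sizeof(int));
        for (int i = 0; i < TOTAL; i++)
            data[i] = 1;                            /* so the global sum is TOTAL */
    }

    int *slice = malloc(chunk * sizeof(int));
    MPI_Scatter(data, chunk, MPI_INT, slice, chunk, MPI_INT,
                0, MPI_COMM_WORLD);

    long local = 0, total = 0;
    for (int i = 0; i < chunk; i++)                 /* each node works on its part */
        local += slice[i];

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %ld\n", total);

    free(slice);
    free(data);
    MPI_Finalize();
    return 0;
}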

Debugging and monitoring

The development and debugging of parallel programs on a cluster requires parallel language primitives as well as suitable tools, such as those discussed by the High Performance Debugging Forum (HPDF), which resulted in the HPD specifications. [13][22] Tools such as TotalView were then developed to debug parallel implementations on computer clusters which use MPI or PVM for message passing.

The Berkeley NOW (Network of Workstations) system gathers cluster data and stores it in a database, while a system such as PARMON, developed in India, allows for the visual observation and management of large clusters. [13]

Application checkpointing can be used to restore a given state of the system when a node fails during a long multi-node computation. [23] This is essential in large clusters, given that as the number of nodes increases, so does the likelihood of node failure under heavy computational loads. Checkpointing can restore the system to a stable state so that processing can resume without having to recompute results. [23]
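The following minimal C sketch shows the basic idea of application-level checkpointing: a long-running loop periodically saves its progress to a file and, on restart, resumes from the last saved state. The file name, state layout, and checkpoint interval are assumptions for the example; large clusters normally use dedicated checkpoint/restart facilities.

/* Minimal application-level checkpoint/restart sketch: a long loop saves
   its loop counter and partial result every CHECKPOINT_EVERY iterations,
   and reloads them if a checkpoint file already exists on startup. */
#include <stdio.h>

#define TOTAL_STEPS      1000000
#define CHECKPOINT_EVERY 100000
#define CKPT_FILE        "state.ckpt"   /* illustrative file name */

static void save_checkpoint(long step, double partial)
{
    FILE *f = fopen(CKPT_FILE, "w");
    if (f) {
        fprintf(f, "%ld %.17g\n", step, partial);
        fclose(f);
    }
}

static int load_checkpoint(long *step, double *partial)
{
    FILE *f = fopen(CKPT_FILE, "r");
    if (!f)
        return 0;                        /* no checkpoint: start from scratch */
    int ok = fscanf(f, "%ld %lf", step, partial) == 2;
    fclose(f);
    return ok;
}

int main(void)
{
    long step = 0;
    double partial = 0.0;

    if (load_checkpoint(&step, &partial))
        printf("resuming from step %ld\n", step);

    for (; step < TOTAL_STEPS; step++) {
        partial += (double)step;         /* stand-in for real computation */
        if ((step + 1) % CHECKPOINT_EVERY == 0)
            save_checkpoint(step + 1, partial);
    }

    printf("result = %.17g\n", partial);
    return 0;
}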

Price performance

Clustering can provide significant performance benefits versus price. The System X supercomputer at Virginia Tech, the 28th most powerful supercomputer on Earth as of June 2006, was a 12.25 TFlops computer cluster of 1100 Apple XServe G5 2.3 GHz dual-processor machines (4 GB RAM, 80 GB SATA HD) running Mac OS X and using an InfiniBand interconnect. The cluster initially consisted of Power Mac G5s; the rack-mountable XServes are denser than desktop Macs, reducing the aggregate size of the cluster. The total cost of the previous Power Mac system was $5.2 million, a tenth of the cost of slower mainframe supercomputers. (The Power Mac G5s were sold off.)

A Cluster Computer and its Architecture

A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone computers working together as a single, integrated computing resource.

A computer node can be a single or multiprocessor system (PC, workstation, or SMP) with memory, I/O facilities, and an operating system. A cluster generally refers to two or more computers (nodes) connected together. The nodes can exist in a single cabinet or be physically separated and connected via a LAN. An interconnected (LAN-based) cluster of computers can appear as a single system to users and applications. Such a system can provide a cost-effective way to gain features and benefits (fast and reliable services) that have historically been found only on more expensive proprietary shared-memory systems. The typical architecture of a cluster is built from the components listed below.


The following are some prominent components of cluster computers:
• Multiple High Performance Computers (PCs, Workstations, or SMPs)
• State-of-the-art Operating Systems (layered or micro-kernel based)
• High Performance Networks/Switches (such as Gigabit Ethernet and Myrinet)
• Network Interface Cards (NICs)
• Fast Communication Protocols and Services (such as Active and Fast Messages)
• Cluster Middleware (Single System Image (SSI) and System Availability Infrastructure)
  o Hardware (such as Digital (DEC) Memory Channel, hardware DSM, and SMP techniques)
  o Operating System Kernel or Gluing Layer (such as Solaris MC and GLUnix)
  o Applications and Subsystems
    - Applications (such as system management tools and electronic forms)
    - Runtime Systems (such as software DSM and parallel file systems)
    - Resource Management and Scheduling software (such as LSF (Load Sharing Facility) and CODINE (COmputing in DIstributed Networked Environments))
• Parallel Programming Environments and Tools (such as compilers, PVM (Parallel Virtual Machine), and MPI (Message Passing Interface))
• Applications
  o Sequential
  o Parallel or Distributed

The network interface hardware acts as a communication processor and is responsible for transmitting and receiving packets of data between cluster nodes via a network/switch. Communication software offers a means of fast and reliable data communication among cluster nodes and to the outside world. Often, clusters with a special network/switch like Myrinet use communication protocols such as active messages for fast communication among their nodes. These protocols potentially bypass the operating system and thus remove the critical communication overheads, providing direct user-level access to the network interface.

The cluster nodes can work collectively, as an integrated computing resource, or they can operate as individual computers. The cluster middleware is responsible for offering an illusion of a unified system image (single system image) and availability out of a collection of independent but interconnected computers.

Programming environments can offer portable, efficient, and easy-to-use tools for the development of applications. They include message passing libraries, debuggers, and profilers. It should not be forgotten that clusters could be used for the execution of sequential or parallel applications.

Clusters Classifications

Clusters offer the following features at a relatively low cost:
• High Performance
• Expandability and Scalability
• High Throughput
• High Availability

Cluster technology permits organizations to boost their processing power using standard technology (commodity hardware and software components) that can be acquired/purchased at a relatively low cost. This provides expandability, an affordable upgrade path that lets organizations increase their computing power while preserving their existing investment and without incurring a lot of extra expenses. The performance of applications also improves with the support of a scalable software environment. Another benefit of clustering is a failover capability that allows a backup computer to take over the tasks of a failed computer located in its cluster.

Clusters are classified into many categories based on various factors, as indicated below.

1. Application Target - Computational science or mission-critical applications.
• High Performance (HP) Clusters
• High Availability (HA) Clusters

2. Node Ownership - Owned by an individual or dedicated as a cluster node.
• Dedicated Clusters
• Nondedicated Clusters

The distinction between these two cases is based on the ownership of the nodes in a cluster. In the case of dedicated clusters, a particular individual does not own a workstation; the resources are shared so that parallel computing can be performed across the entire cluster. The alternative, nondedicated, case is where individuals own workstations and applications are executed by stealing idle CPU cycles. The motivation for this scenario is based on the fact that most workstation CPU cycles are unused, even during peak hours. Parallel computing on a dynamically changing set of nondedicated workstations is called adaptive parallel computing.

In nondedicated clusters, a tension exists between the workstation owners and remote users who need the workstations to run their applications. The former expect fast interactive response from their workstation, while the latter are only concerned with fast application turnaround by utilizing any spare CPU cycles. This emphasis on sharing the processing resources erodes the concept of node ownership and introduces the need for complexities such as process migration and load balancing strategies. Such strategies allow clusters to deliver adequate interactive performance as well as to provide shared resources to demanding sequential and parallel applications.

3. Node Hardware - PC, Workstation, or SMP.
• Clusters of PCs (CoPs) or Piles of PCs (PoPs)
• Clusters of Workstations (COWs)
• Clusters of SMPs (CLUMPs)


4. Node Operating System - Linux, NT, Solaris, AIX, etc.
• Linux Clusters (e.g., Beowulf)
• Solaris Clusters (e.g., Berkeley NOW)
• NT Clusters (e.g., HPVM)
• AIX Clusters (e.g., IBM SP2)
• Digital VMS Clusters
• HP-UX Clusters
• Microsoft Wolfpack Clusters

5. Node Configuration - Node architecture and the type of OS it is loaded with.
• Homogeneous Clusters: All nodes have similar architectures and run the same OS.
• Heterogeneous Clusters: Nodes have different architectures and run different OSs.

6. Levels of Clustering - Based on the location of nodes and their count.
• Group Clusters (#nodes: 2-99): Nodes are connected by SANs (System Area Networks) like Myrinet, and they are either stacked into a frame or exist within a center.
• Departmental Clusters (#nodes: 10s to 100s)
• Organizational Clusters (#nodes: many 100s)
• National Metacomputers (WAN/Internet-based) (#nodes: many departmental/organizational systems or clusters)
• International Metacomputers (Internet-based) (#nodes: 1000s to many millions)

Individual clusters may be interconnected to form a larger system (clusters of clusters) and, in fact, the Internet itself can be used as a computing cluster. The use of wide-area networks of computer resources for high-performance computing has led to the emergence of a new field called metacomputing.

Commodity Components for Clusters

The improvements in workstation and network performance, as well as the availability of standardized programming APIs, are paving the way for the widespread usage of cluster-based parallel systems. In this section, we discuss some of the hardware and software components commonly used to build clusters and nodes.

Processors

Over the past two decades, phenomenal progress has taken place in microprocessor architecture (for example RISC, CISC, VLIW, and vector), and this is making single-chip CPUs almost as powerful as processors used in supercomputers. Most recently, researchers have been trying to integrate the processor and memory or network interface into a single chip. The Berkeley Intelligent RAM (IRAM) project is exploring the entire spectrum of issues involved in designing general-purpose computer systems that integrate a processor and DRAM onto a single chip, from circuits, VLSI design, and architectures to compilers and operating systems. Digital, with its Alpha 21364 processor, is trying to integrate the processor, memory controller, and network interface into a single chip.

Memory and Cache

Originally, the memory present within a PC was 640 KBytes, usually 'hardwired' onto the motherboard. Typically, a PC today is delivered with between 32 and 64 MBytes installed in slots, with each slot holding a Single In-line Memory Module (SIMM); the potential capacity of a PC is now many hundreds of MBytes. Computer systems can use various types of memory, including Extended Data Out (EDO) and fast page mode. EDO allows the next access to begin while the previous data is still being read, and fast page mode allows multiple adjacent accesses to be made more efficiently.

The amount of memory needed for the cluster is likely to be determined by the cluster's target applications. Programs that are parallelized should be distributed such that the memory, as well as the processing, is distributed between processors for scalability. Thus, it is not necessary to have a RAM that can hold the entire problem in memory on each system, but it should be enough to avoid too much swapping of memory blocks (page misses) to disk, since disk access has a large impact on performance.

Access to DRAM is extremely slow compared to the speed of the processor, taking up to orders of magnitude more time than a CPU clock cycle. Caches are used to keep recently used blocks of memory for very fast access if the CPU references a word from that block again. However, the very fast memory used for cache is expensive, and cache control circuitry becomes more complex as the size of the cache grows. Because of these limitations, the total size of a cache is usually in the range of 8 KB to 2 MB. Within Pentium-based machines it is not uncommon to have a 64-bit wide memory bus as well as a chipset that supports 2 MBytes of external cache. These improvements were necessary to exploit the full power of the Pentium and to make the memory architecture very similar to that of UNIX workstations.

Disk and I/O

Improvements in disk access time have not kept pace with microprocessor performance, which has been improving by 50 percent or more per year. Although magnetic media densities have increased, reducing disk transfer times by approximately 60 to 80 percent per year, overall improvement in disk access times, which rely upon advances in mechanical systems, has been less than 10 percent per year. Grand challenge applications often need to process large amounts of data and large data sets. Amdahl's law implies that the speed-up obtained from faster processors is limited by the slowest system component; therefore, it is necessary to improve I/O performance such that it balances with CPU performance. One way of improving I/O performance is to carry out I/O operations in parallel, which is supported by parallel file systems based on hardware or software RAID. Since hardware RAIDs can be expensive, software RAIDs can be constructed by using the disks associated with each workstation in the cluster.
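As a generic worked illustration of the Amdahl's law argument above (the numbers are arbitrary, not from the text): if a fraction f of a job can run in parallel on N nodes, the overall speedup is 1 / ((1 - f) + f/N). With f = 0.9 and N = 16, the speedup is 1 / (0.1 + 0.9/16) = 6.4, so the 10 percent serial portion, which typically includes slow I/O, limits the system long before the processor count does.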

System Bus

The initial PC bus (AT, now known as the ISA bus) was clocked at 5 MHz and was 8 bits wide. When first introduced, its abilities were well matched to the rest of the system. PCs are modular systems and, until fairly recently, only the processor and memory were located on the motherboard; other components were typically found on daughter cards connected via a system bus. The performance of PCs has increased by orders of magnitude since the ISA bus was first used, and it has consequently become a bottleneck which has limited machine throughput. The ISA bus was extended to be 16 bits wide and was clocked in excess of 13 MHz. This, however, is still not sufficient to meet the demands of the latest CPUs, disk interfaces, and other peripherals. A group of PC manufacturers introduced the VESA local bus, a 32-bit bus that matched the system's clock speed. The VESA bus has largely been superseded by the Intel-created PCI bus, which allows 133 MBytes/s transfers and is used inside Pentium-based PCs. PCI has also been adopted for use in non-Intel based platforms such as the Digital AlphaServer range. This has further blurred the distinction between PCs and workstations, as the I/O subsystem of a workstation may be built from commodity interface and interconnect cards.


Cluster Applications

Clusters have been employed as an execution platform for a range of application classes, ranging from supercomputing and mission-critical ones, through to e-commerce and database-based ones.

Clusters are being used as execution environments for Grand Challenge Applications (GCAs) [57] such as weather modeling, automobile crash simulations, life sciences, computational fluid dynamics, nuclear simulations, image processing, electromagnetics, data mining, aerodynamics and astrophysics. These applications are generally considered intractable without the use of state-of-the-art parallel supercomputers. The scale of their resource requirements, such as processing time, memory, and communication needs, distinguishes GCAs from other applications. For example, the execution of scientific applications used in predicting life-threatening situations such as earthquakes or hurricanes requires enormous computational power and storage resources. In the past, these applications would be run on vector or parallel supercomputers costing millions of dollars in order to calculate predictions well in advance of the actual events. Such applications can be migrated to run on commodity off-the-shelf-based clusters and deliver comparable performance at a much lower cost. In fact, in many situations expensive parallel supercomputers have been replaced by low-cost commodity Linux clusters in order to reduce maintenance costs and increase overall computational resources.

Clusters are increasingly being used for running commercial applications. In a business environment, for example in a bank, many activities are automated. However, a problem will arise if the server that is handling customer transactions fails. The bank's activities could come to a halt and customers would not be able to deposit or withdraw money from their accounts. Such situations can cause a great deal of inconvenience and result in loss of business and confidence in a bank. This is where clusters can be useful. A bank could continue to operate even after the failure of a server by automatically isolating failed components and migrating activities to alternative resources as a means of offering an uninterrupted service.

With the increasing popularity of the Web, computer system availability is becoming critical, especially for e-commerce applications. Clusters are used to host many new Internet service sites. For example, free email sites like Hotmail and search sites like Hotbot use clusters. Cluster-based systems can be used to execute many Internet applications:
• Web servers
• Search engines
• Email
• Security
• Proxy
• Database servers

In the commercial arena these servers can be consolidated to create what is known as an enterprise server. The servers can be optimized, tuned, and managed for increased efficiency and responsiveness depending on the workload through various load-balancing techniques. A large number of low-end machines (PCs) can be clustered along with storage and applications for scalability, high availability, and performance.

The Linux Virtual Server [66] is a cluster of servers connected by a fast network. It provides a viable platform for building a scalable, cost-effective and more reliable Internet service than a tightly coupled multi-processor system, since failed components can be easily isolated and the system can continue to operate without any disruption. The Linux Virtual Server directs clients' network connection requests to the different servers according to a scheduling algorithm and makes the parallel services of the cluster appear as a single virtual service with a single IP address. Prototypes of the Linux Virtual Server have already been used to build many sites that cope with heavy loads. Client applications interact with the cluster as if it were a single server. The clients are not affected by the interaction with the cluster and do not need modification. Application performance and scalability are achieved by adding one or more nodes to the cluster, by automatically detecting node or daemon failures, and by reconfiguring the system appropriately to achieve high availability.

Clusters have proved themselves to be effective for a variety of data mining applications. The data mining process involves both compute-intensive and data-intensive operations. Clusters provide two fundamental roles:
• Data-clusters that provide the storage and data management services for the data sets being mined.
• Compute-clusters that provide the computational resources required by the data filtering, preparation and mining tasks.

The Terabyte Challenge [69] is an open, distributed testbed for experimental work related to managing, mining and modelling large, massive and distributed data sets. The Terabyte Challenge is sponsored by the National Scalable Cluster Project (NSCP) [70], the National Center for Data Mining (NCDM) [71], and the Data Mining Group (DMG) [72]. The Terabyte Challenge's testbed is organized into workgroup clusters connected with a mixture of traditional and high-performance networks. They define a meta-cluster as a 100-node workgroup of clusters connected via TCP/IP, and a supercluster as a cluster connected via a high-performance network such as Myrinet. The main applications of the Terabyte Challenge include:
• A high energy physics data mining system called EventStore;
• A distributed health care data mining system called MedStore;
• A web documents data mining system called Alexa Crawler [73];
• Other applications such as distributed BLAST search, textual data mining and economic data mining.

An underlying technology of the NSCP is a distributed data mining system called Papyrus. Papyrus has a layered infrastructure for high-performance, wide-area data mining and predictive modelling. Papyrus is built over a data-warehousing layer, which can move data over both commodity and proprietary networks. Papyrus is specifically designed to support various cluster configurations; it is the first distributed data mining system to be designed with the flexibility of moving data, moving predictive models, or moving the results of local computations.