Lecture 12
Scalable Computing
Graduate Computer Architecture
Fall 2005
Shih-Hao Hung
Dept. of Computer Science and Information Engineering
National Taiwan University
Scalable Internet Services
• Lessons from Giant-Scale Services
  http://www.computer.org/internet/ic2001/w4046abs.htm
  – Access anywhere, anytime.
  – Availability via multiple devices.
  – Groupware support.
  – Lower overall cost.
  – Simplified service updates.
Giant-Scale Services: Components
Network Interface
• A simple network connecting two machines
• Message
Network Bandwidth vs Message Size
Switch: Connecting More than 2 Machines
Switch
Network Topologies
• Relative performance for 64 nodes
Packets
Load Management
• Balancing loads (load balancer)
  – Round-robin DNS
  – Layer-4 (transport layer, e.g. TCP) switches
  – Layer-7 (application layer) switches
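The simplest of these, round-robin DNS, can be sketched as a rotation over a replica pool. This is a minimal illustration with made-up addresses, not a DNS implementation:

```python
from itertools import cycle

# Hypothetical replica pool; round-robin DNS hands out these
# addresses in rotation so requests spread evenly across servers.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
rotation = cycle(servers)

def resolve():
    """Return the next server address in round-robin order."""
    return next(rotation)

# Six lookups cycle through the pool twice.
assigned = [resolve() for _ in range(6)]
print(assigned)
```

Layer-4 and layer-7 switches make the same decision per connection or per request, but can also use server health and request content rather than blind rotation.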
The 7 OSI (Open System Interconnection) Layers
• Application (Layer 7) This layer supports application and end-user processes. Communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. Everything at this layer is application-specific: file transfers, e-mail, and other network software services are provided here. Telnet and FTP are examples of layer-7 applications.
• Presentation (Layer 6) This layer provides independence from differences in data representation (e.g., encryption) by translating from application to network format, and vice versa. The presentation layer works to transform data into the form that the application layer can accept. This layer formats and encrypts data to be sent across a network, providing freedom from compatibility problems. It is sometimes called the syntax layer.
• Session (Layer 5) This layer establishes, manages and terminates connections between applications. The session layer sets up, coordinates, and terminates conversations, exchanges, and dialogues between the applications at each end. It deals with session and connection coordination.
• Transport (Layer 4) This layer provides transparent transfer of data between end systems, or hosts, and is responsible for end-to-end error recovery and flow control. It ensures complete data transfer. TCP (of the TCP/IP suite) operates at this layer.
• Network (Layer 3) This layer provides switching and routing technologies, creating logical paths, known as virtual circuits, for transmitting data from node to node. Routing and forwarding are functions of this layer, as well as addressing, internetworking, error handling, congestion control and packet sequencing.
• Data Link (Layer 2) At this layer, data packets are encoded and decoded into bits. It furnishes transmission protocol knowledge and management and handles errors in the physical layer, flow control and frame synchronization. The data link layer is divided into two sublayers: The Media Access Control (MAC) layer and the Logical Link Control (LLC) layer. The MAC sublayer controls how a computer on the network gains access to the data and permission to transmit it. The LLC layer controls frame synchronization, flow control and error checking.
• Physical (Layer 1) This layer conveys the bit stream (electrical impulse, light, or radio signal) through the network at the electrical and mechanical level. It provides the hardware means of sending and receiving data on a carrier, including defining cables, cards, and physical aspects. Fast Ethernet, RS-232, and ATM are protocols with physical-layer components.
OSI
Simple Web Farm
Search Engine Cluster
High Availability
• High availability is a major driving requirement behind giant-scale system design.
  – Uptime: typically measured in "nines"; traditional infrastructure systems such as the phone system aim for four or five nines ("four nines" implies 0.9999 uptime, or roughly 60 seconds of downtime per week).
  – Mean-time-between-failures (MTBF)
  – Mean-time-to-repair (MTTR)
  – uptime = (MTBF – MTTR)/MTBF
  – yield = queries completed/queries offered
  – harvest = data available/complete data
  – DQ Principle: data per query × queries per second → constant
  – Graceful degradation
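The metrics above can be made concrete with a short sketch. The MTBF/MTTR and DQ numbers below are hypothetical, chosen only to show the arithmetic and the harvest-vs-yield trade-off the DQ principle forces:

```python
# Availability metrics with made-up numbers (hours for MTBF/MTTR).
mtbf = 500.0                      # mean time between failures
mttr = 1.0                        # mean time to repair
uptime = (mtbf - mttr) / mtbf     # fraction of time the service is up

yield_ = 9_990 / 10_000           # queries completed / queries offered
harvest = 95 / 100                # data available / complete data

# DQ principle: data-per-query x queries-per-second is roughly
# constant for a given installation. Losing half the nodes halves
# DQ, forcing a choice between harvest and yield.
DQ = 1000.0                       # hypothetical full-capacity DQ value
qps_full = DQ / 1.0               # 1000 qps at full harvest (1.0 data/query)

dq_after_failure = DQ / 2
qps_if_harvest_kept = dq_after_failure / 1.0         # keep harvest, yield halves
harvest_if_yield_kept = dq_after_failure / qps_full  # keep yield, harvest halves

print(uptime, qps_if_harvest_kept, harvest_if_yield_kept)
```

Graceful degradation is the deliberate choice of which of the two ratios to sacrifice when capacity drops.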
Clusters in Giant-Scale Services
– Scalability
– Cost/performance
– Independent components
Cluster Example
Lessons Learned
• Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management.
• Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime.
• Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact.
• Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ.
• Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy.
• Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning.
• Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.
Deep Scientific Computing
Kramer et al., IBM J. R&D, March 2004
• High-performance computing (HPC)
  – Resolution of a simulation
  – Complexity of an analysis
  – Computational power
  – Data storage
• New paradigms of computing
  – Grid computing
  – Network
Themes (1/2)
• Deep science applications must now integrate simulation with data analysis. In many ways this integration is inhibited by limitations in storing, transferring, and manipulating the data required.
• Very large, scalable, high-performance archives, combining both disk and tape storage, are required to support this deep science. These systems must respond to large amounts of data—both many files and some very large files.
• High-performance shared file systems are critical to large systems. The approach here separates the project into three levels—storage systems, interconnect fabric, and global file systems. All three levels must perform well, as well as scale, in order to provide applications with the performance they need.
• New network protocols are necessary as the data flows are beginning to exceed the capability of yesterday's protocols. A number of elements can be tuned and improved in the interim, but long-term growth requires major adjustments.
Themes (2/2)
• Data management methods are key to being able to organize and find the relevant information in an acceptable time.
• Security approaches are needed that allow openness and service while providing protection for systems. The security methods must understand not just the application levels but also the underlying functions of storage and transfer systems.
• Monitoring and control capabilities are necessary to keep pace with the system improvements. This is key, as the application developers for deep computing must be able to drill through virtualization layers in order to understand how to achieve the needed performance.
Simulation: Time and Space
More Space
NERSC System
High-Performance Storage System (HPSS)
Networking for HPC Systems
• End-to-end network performance is a product of
  – Application behavior
  – Machine capabilities
  – Network path
  – Network protocol
  – Competing traffic
• Difficult to ascertain the limiting factor without monitoring/diagnostic capabilities
  – End host issues
  – Routers and gateways
  – Deep security
End Host Issues
• Throughput limit
  – Time to copy data from user memory to the kernel across the memory bus (2 memory cycles)
  – Time to copy from the kernel to the NIC (1 I/O cycle)
  – If limited by memory BW:
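The copy costs above give a back-of-envelope throughput bound: each sent byte crosses the memory bus twice (the user-to-kernel copy reads and writes it) and the I/O bus once (the kernel-to-NIC DMA), so the sustainable rate is set by whichever bus saturates first. The "2 + 1 crossings" model follows the slide; the concrete bandwidths below are illustrative, not measurements:

```python
# Upper bound on send rate given bus bandwidths; a sketch of the
# 2-memory-cycle + 1-I/O-cycle model, not a measured result.
def max_send_mbps(mem_bw_mbs, io_bw_mbs):
    """Bound on send rate (Mbit/s) from bus bandwidths in MB/s."""
    mem_limit = mem_bw_mbs / 2.0   # two memory-bus crossings per byte
    io_limit = io_bw_mbs           # one I/O-bus crossing per byte
    return min(mem_limit, io_limit) * 8   # MB/s -> Mbit/s

# Example: 850 MB/s memory bus with 32-bit/33 MHz PCI (132 MB/s);
# here plain PCI, not memory bandwidth, is the bottleneck.
print(max_send_mbps(850, 132))
```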
Memory & I/O Bandwidth
• Memory BW
  – DDR: 650-2500 MB/s
• I/O BW
  – 32-bit/33 MHz PCI: 132 MB/s
  – 64-bit/33 MHz PCI: 264 MB/s
  – 64-bit/66 MHz PCI: 528 MB/s
  – 64-bit/133 MHz PCI-X: 1056 MB/s
  – PCI-E x1: ~1 Gbit/s
  – PCI-E x16: ~16 Gbit/s
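The PCI figures above follow directly from bus width times clock rate; a quick sanity check of the arithmetic:

```python
# Peak parallel-bus bandwidth: width (bits) x clock (MHz) / 8 -> MB/s.
def pci_bw_mbs(width_bits, clock_mhz):
    return width_bits * clock_mhz / 8.0

print(pci_bw_mbs(32, 33))    # 132.0 MB/s
print(pci_bw_mbs(64, 33))    # 264.0 MB/s
print(pci_bw_mbs(64, 66))    # 528.0 MB/s
print(pci_bw_mbs(64, 132))   # 1056.0 MB/s (the PCI-X clock is nominally 133 MHz)
```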
Network Bandwidth
• VT600, 32-bit/33 MHz PCI, DDR400, AMD 2700+, 850 MB/s memory BW
  – 485 Mbit/s
• 64-bit/133 MHz PCI-X, 1100-2500 MB/s memory BW
  – Limited to 5000 Mbit/s
  – Also limited by DMA overhead
  – Only reaches half of a 10 Gb NIC's bandwidth
• I/O architecture
  – On-chip NIC?
• OS architecture
  – Reduce the number of memory copies? Zero copy?
  – TCP/IP overhead
  – TCP/IP offload
  – Maximum Transmission Unit (MTU)
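MTU matters because per-packet header overhead is fixed, so the payload fraction of each frame grows with the MTU. A small sketch, assuming plain Ethernet framing with IPv4 + TCP and no header options:

```python
ETH_HDR = 14      # Ethernet header bytes (outside the MTU)
IP_TCP_HDR = 40   # IPv4 (20) + TCP (20) header bytes (inside the MTU)

def payload_efficiency(mtu):
    """Fraction of each frame carrying application payload."""
    payload = mtu - IP_TCP_HDR
    return payload / (mtu + ETH_HDR)

print(round(payload_efficiency(1500), 3))   # standard Ethernet: 0.964
print(round(payload_efficiency(9000), 3))   # jumbo frames: 0.994
```

Larger MTUs also cut per-packet interrupt and protocol-processing costs, which is often the bigger win at 10 Gb speeds.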
Conclusion
• High-performance storage and network
• End-host performance
• Data management
• Security
• Monitoring and control
Petaflop Computing
Science-driven System Architecture
• Leadership Computing Systems
  – Processor performance
  – Interconnect performance
  – Software: scalability & optimized libraries
• Blue Planet
  – Redesigned Power5-based HPC system
    • Single-core node
    • High memory bandwidth per processor
  – ViVA (Virtual Vector Architecture) allows the eight processors in a node to be treated as a single processor with a peak performance of 60+ Gigaflop/s.
ViVA-2: Application Accelerator
• Accelerates particular application-specific or domain-specific features.
  – Irregular access patterns
  – High load/store issue rates
  – Low cache-line utilization
• ISA enhancements
  – Instructions to support prefetching of irregular data accesses
  – Instructions to support sparse, non-cache-resident loads
  – More registers for software pipelining
  – Instructions to initiate many dense/indexed/sparse loads
• Proper compiler support will be a critical component
Leadership Computing Applications
• Major computational advances
  – Nanoscience
  – Combustion
  – Fusion
  – Climate
  – Life sciences
  – Astrophysics
• Teamwork
  – Project team
  – Facilities
  – Computational scientists
Supercomputers 1993-2000
• Clusters vs MPPs
Clusters
• Cost-performance
Total Cost of Ownership (TCO)
• Built with lots of PCs
• 80 PCs in one rack
Google
• Performance
  – Latency: < 0.5 s
  – Bandwidth: scales with # of users
• Cost
  – Cost of PCs keeps shrinking
  – Switches, racks, etc.
  – Power
• Reliability
  – Software failures
  – Hardware failures (1/10 of SW failures)
    • DRAM (1/3)
    • Disks (2/3)
  – Switch failures
  – Power outages
  – Network outages