cis 415 lecture 3 - big data platform elements - part 1
Big Data Platform Elements - Part 1
CIS 415 Lecture 3
Hina Arora
Announcements
• We have a Grader!
o Anirudh Dhawan ([email protected])
o Office Hours: Thur 10am-12pm; BA Suite 318
• Show of hands – did you complete last week’s required readings?
o Contents of Lecture-1 Deck and any supplemental notes you took in class
o Review: “Vocabulary” Section in Lecture-1 Deck
o Review: List of common data applications here
o Watch: The Beauty of Data Visualization
Big Data Platform Elements
• Virtualization
• Cloud Computing
• Parallel Programming
• Map Reduce
What will we cover today?
• Virtualization
• Cloud Computing
Virtualization
What is Virtualization?
• “Virtualization means that Applications can use a resource without any concern for where it resides, what the technical interface is, how it has been implemented, which platform it uses, and how much of it is available” ~Rick F. Van der Lans in Data Virtualization for Business Intelligence Systems
• We’ll look at a few different types of virtualization:
o Server Virtualization – can be HW-level or OS-level Virtualization
o Storage Virtualization
o Network Virtualization
o Desktop Virtualization
o Application Virtualization
Server Virtualization: HW-level virtualization
• Ability to run multiple Virtual Machines (VMs or guests) on a single Physical Machine (host).
• Each Virtual Machine emulates the underlying physical hardware and has an Operating System (OS).
• Guest VMs are largely isolated from each other.
• Each guest VM can run a different OS.
• Hypervisors (or Virtual Machine Monitors or VMMs) are used to create and run VMs. There are two types of hypervisors:
o Type-1, Native or Bare-metal Hypervisors:
‒ Run directly on the host's hardware.
‒ Example: Hyper-V Hypervisor.
o Type-2 or Hosted Hypervisors:
‒ Run on the host’s OS.
‒ Example: VMware Player, VirtualBox
• Server Virtualization provides improved utilization and scalability.
Reference: https://en.wikipedia.org/wiki/Hypervisor
[Figure: two stacks side by side. Type-1: Server > Hypervisor Type-1 > VMs, each with Guest OS, Bins/Libs, App. Type-2: Server > Host OS > Hypervisor Type-2 > VMs, each with Guest OS, Bins/Libs, App.]
Server Virtualization: OS-level virtualization
• Ability to run multiple isolated Containers (user-space instances or guests) on a single Physical Machine (host).
• Containers do not emulate the underlying HW and don’t have their own OS (they share the host OS). This lighter footprint allows hosts to support a higher density of guest Containers than guest VMs. On the flip side, sharing the host OS raises security concerns.
• Containers can also share binaries and libraries with other Containers.
• Each Container typically runs a single Application.
• Example: Docker
Reference: https://en.wikipedia.org/wiki/Operating-system-level_virtualization
[Figure: container stack. Server > Host OS > Docker > Containers, each running an App on top of Bins/Libs (which may be shared between Containers).]
Review: Storage Definitions
• Block
o A sequence of bytes.
o Storage systems typically provide access to blocks.
o The OS typically abstracts other logical views like files and records.
• Striping
o Sequential blocks of data are stored on different physical storage devices in (typically) round-robin fashion.
o Example: Disk1 <A, C, E>; Disk2 <B, D, F>
o Striping is useful when data is requested faster than a single storage device can deliver it. Striping data across multiple storage devices allows for concurrent access to data, thereby improving performance.
• Mirroring
o Replication of data onto separate disks in real time.
o Example: Disk1 <A, B, C>; Disk2 <A, B, C>
o Improves data redundancy and reliability.
• Parity
o Data on a crashed disk can be reconstructed using data on the other disks (using the XOR operation).
o Example: Disk1 <A:11010011>; Disk2 <B:10011001>; Disk3 <PAB:01001010>
o Essentially, PAB = A XOR B, so if any one disk crashes, you can reconstruct its contents by XORing the other two.
o Improves data redundancy.
• File System
o Controls how data is managed, stored and retrieved.
o Without a file system, we would just have a large blob of data with no way to identify different connected pieces of information.
o File systems are organized around groups of data called files, and groups of files called directories or folders.
o Distributed file systems are file systems that are spread across multiple servers.
Reference: Wikipedia
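The parity example above can be checked with a few lines of Python. This is a minimal sketch using the slide's bit patterns; the function name `reconstruct` is made up for illustration.

```python
# Parity sketch using the bit patterns from the Parity example.
A = 0b11010011  # block stored on Disk1
B = 0b10011001  # block stored on Disk2

# The parity block is the bitwise XOR of the data blocks.
P_AB = A ^ B


def reconstruct(surviving_block: int, parity_block: int) -> int:
    """Rebuild the lost data block from the surviving block and the parity block."""
    return surviving_block ^ parity_block


print("Parity block:", format(P_AB, "08b"))
print("Disk1 rebuilt correctly:", reconstruct(B, P_AB) == A)
print("Disk2 rebuilt correctly:", reconstruct(A, P_AB) == B)
```

This works because XOR is its own inverse: (A XOR B) XOR B gives back A.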
Storage Virtualization
• Data is abstracted into what appears to be a single storage unit, while the physical storage actually spans multiple heterogeneous devices and often locations
• Storage Virtualization provides location independence, improved utilization, performance, reliability and availability
• Example: RAID (redundant array of independent/inexpensive disks)
Popular RAID Types
(Striping provides excellent performance; Mirroring provides excellent redundancy; Parity provides good redundancy)
• RAID 0
o Striping: Yes | Mirroring: No | Parity: No
o Minimum Number of Disks: 2
o Example (Disk: Blocks): Disk 1: A, C, E; Disk 2: B, D, F
o Comments: Excellent Performance. No Redundancy. Do not use for critical applications.
• RAID 1
o Striping: No | Mirroring: Yes | Parity: No
o Minimum Number of Disks: 2
o Example (Disk: Blocks): Disk 1: A, B, C; Disk 2: A, B, C
o Comments: Good Performance. Excellent Redundancy.
• RAID 5
o Striping: Yes | Mirroring: No | Parity: Yes (Distributed Parity)
o Minimum Number of Disks: 3
o Example (Disk: Blocks): Disk 1: A, C, PEF; Disk 2: B, PCD, E; Disk 3: PAB, D, F
o Comments: Good Performance. Good Redundancy. Most cost effective. Fast Reads; Slow Writes.
• RAID 10
o Striping: Yes | Mirroring: Yes | Parity: No
o Minimum Number of Disks: 4
o Example (Disk: Blocks): Disk 1: A, C, E; Disk 2: A, C, E; Disk 3: B, D, F; Disk 4: B, D, F
o Comments: Excellent Performance. Excellent Redundancy. Great for mission critical applications. Not as cost-effective as RAID 5.
Reference: https://en.wikipedia.org/wiki/RAID
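To see how the example block layouts for RAID 0, RAID 1 and RAID 10 come about, here is a rough sketch that places blocks A to F on disks. The function names are made up for illustration, and real RAID controllers work on fixed-size blocks rather than letters; RAID 5 is left out because it also needs parity placement.

```python
from typing import Dict, List


def stripe(blocks: List[str], n_disks: int) -> Dict[str, List[str]]:
    """RAID 0: place sequential blocks on the disks in round-robin order."""
    disks: Dict[str, List[str]] = {f"Disk {i + 1}": [] for i in range(n_disks)}
    for index, block in enumerate(blocks):
        disks[f"Disk {index % n_disks + 1}"].append(block)
    return disks


def mirror(blocks: List[str], n_copies: int) -> Dict[str, List[str]]:
    """RAID 1: every disk holds a full copy of the data."""
    return {f"Disk {i + 1}": list(blocks) for i in range(n_copies)}


def stripe_of_mirrors(blocks: List[str], n_pairs: int) -> Dict[str, List[str]]:
    """RAID 10: stripe the blocks, then store each stripe on a mirrored pair of disks."""
    layout: Dict[str, List[str]] = {}
    disk_number = 1
    for stripe_blocks in stripe(blocks, n_pairs).values():
        for _ in range(2):  # two identical disks per mirrored pair
            layout[f"Disk {disk_number}"] = list(stripe_blocks)
            disk_number += 1
    return layout


data = list("ABCDEF")
print("RAID 0: ", stripe(data, 2))
print("RAID 1: ", mirror(data[:3], 2))
print("RAID 10:", stripe_of_mirrors(data, 2))
```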
Review: Network Definitions
• Local Area Network (LAN)
o A computer network with interconnected devices within a limited geographical area such as a house or building.
• Wide Area Network (WAN)
o A computer network that spans large geographical areas.
• IP Address
o Address of a device participating in a network
o IPv4: 32 bits | IPv6: 128 bits
o Example: 11000000.10101000.00000101.10000010 (192.168.5.130)
o Higher order bits determine the network (indicated by the subnet mask), and lower order bits determine the host (device)
• Subnetting
o Dividing a network into smaller parts
o This affects the total number of hosts that can be addressed
• Switch
o Connects devices together on a computer network
• Router
o Carries traffic from one network/subnet to another
o Routers maintain routing tables to determine whether traffic is meant for this LAN, a connected LAN or a different network.
o Example: the home router connects home computers to the internet (these are similar networks since they both share the TCP/IP protocol)
• Gateway
o Typically connects two or more (dissimilar) computer networks
Reference: Wikipedia
Image Source: http://netprivateer.com/lanwan.html
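The IP address and subnetting bullets above can be explored with Python's standard `ipaddress` module. This is only a sketch: the slide gives the address 192.168.5.130 but no subnet mask, so the /24 prefix used here is an assumption for illustration.

```python
import ipaddress

# Address from the slide; the /24 prefix length (mask 255.255.255.0) is assumed.
iface = ipaddress.ip_interface("192.168.5.130/24")

# Dotted binary form of the 32-bit IPv4 address.
binary = ".".join(format(octet, "08b") for octet in iface.ip.packed)

# Higher-order bits select the network, lower-order bits select the host.
network = iface.network
host_part = int(iface.ip) & int(iface.hostmask)

# Subnetting: split the network into two smaller parts (one extra prefix bit).
subnets = list(network.subnets(prefixlen_diff=1))

# Each subnet reserves a network and a broadcast address, so usable hosts shrink.
usable_before = network.num_addresses - 2
usable_after = sum(subnet.num_addresses - 2 for subnet in subnets)

print("Binary:      ", binary)
print("Network:     ", network, "mask", iface.netmask)
print("Host part:   ", host_part)
print("Subnets:     ", [str(subnet) for subnet in subnets])
print("Usable hosts:", usable_before, "before,", usable_after, "after subnetting")
```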
Network Virtualization
• Creation of logical, virtual networks that are decoupled from the (limitations of) underlying physical hardware.
• Example: VLAN, VPN
o Virtual Local Area Network (VLAN)
‒ Allows for grouping of hosts within a virtual LAN regardless of geographical location
‒ Provides scalability, flexibility, simplified administration, and security
o Virtual Private Network (VPN)
‒ Securely extends a private network over a public network such as the internet
‒ Users can remotely communicate with the private network as though they were directly connected to it with the same functionality, security and administrative policies
‒ Provides flexibility, simplified administration, and security
Image Source: link
Image Source: https://en.wikipedia.org/wiki/Virtual_private_network
(Remote) Desktop Virtualization
• Enables access to applications on a remote OS using a virtual desktop.
• The remote OS carries the application and data, and only the display, keyboard, and mouse information are communicated with the local client device.
• Users (on the local client devices) must establish a session and be connected with the remote server to access the application.
• Makes installation, upgrades and management of applications easier for IT.
• Two kinds: RDS, VDI
• Remote Desktop Services (RDS) aka Terminal Services
o Provides remote desktop to multiple users on a Host OS
o Provides users session-based isolation (session virtualization) - users share the Host OS
o Users have no admin privileges on the Host OS
o Can support higher user density
• Virtual Desktop Infrastructure (VDI)
o Provides remote desktop to multiple users on Guest OSs
o Provides users VM-based isolation - each user gets a dedicated Guest OS
o Users have admin privileges on the Guest OS
o Supports lower user density
Application Virtualization
• Application Virtualization separates the Application from the OS, so Applications can be more easily deployed and delivered.
• The application is packaged and streamed from the server down the network to the client and, instead of being installed on the client device, is executed on the local device in a virtual bubble that is completely isolated from the client OS.
• Applications are streamed intelligently.
o Only required parts are streamed as and when they are used.
o Once the application has been streamed, it is cached on the client device so it doesn’t have to be streamed every time a user uses it on the client. This also means the application can be used even when the client is not connected to the server.
o When an application upgrade is available, the server copy is upgraded, and the upgrades are streamed down to the clients the next time the application is used on the client.
• Makes installation, upgrades and management of applications easier for IT.
• Examples: VMware ThinApp, Citrix XenApp and Microsoft App-V
Reference: http://blogs.msdn.com/b/ianm/archive/2010/06/11/microsoft-virtual-desktop-101-making-sense-of-vdi-rds-app-v-med-v-and-desktop-virtualisation.aspx
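The "streamed intelligently" behaviour described above (stream a part on first use, cache it, keep working offline, pick up upgrades on the next connected use) can be sketched as a toy model. The classes `AppServer` and `StreamingClient` are hypothetical and do not reflect how ThinApp, XenApp or App-V are implemented; as a simplification, an upgrade here discards the whole cache and parts are re-streamed on demand.

```python
from typing import Dict, Optional


class AppServer:
    """Hypothetical server holding one packaged application, split into named parts."""

    def __init__(self, version: str, parts: Dict[str, str]) -> None:
        self.version = version
        self.parts = dict(parts)


class StreamingClient:
    """Hypothetical client that streams parts on first use and caches them locally."""

    def __init__(self) -> None:
        self.cached_version: Optional[str] = None
        self.cache: Dict[str, str] = {}
        self.fetches = 0  # number of parts streamed over the network

    def use(self, part_name: str, server: Optional[AppServer] = None) -> str:
        """Use one part of the application; server=None models a disconnected client."""
        if server is not None and server.version != self.cached_version:
            # The server copy was upgraded (or this is the first use): drop stale parts.
            self.cache.clear()
            self.cached_version = server.version
        if part_name not in self.cache:
            if server is None:
                raise RuntimeError(f"'{part_name}' is not cached and the client is offline")
            self.cache[part_name] = server.parts[part_name]  # stream only what is needed
            self.fetches += 1
        return self.cache[part_name]


server = AppServer("1.0", {"editor": "editor-v1", "spellcheck": "spell-v1"})
client = StreamingClient()
client.use("editor", server)   # first use: streamed and cached
client.use("editor", server)   # second use: served from the cache
client.use("editor")           # offline use still works for cached parts
print("Parts streamed so far:", client.fetches)
```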
Cloud Computing
Have you used Applications Hosted on the Cloud?
What are some characteristics these applications have in common*?
• You typically sign up for service (free with ads, free trial, or subscription)
• You connect to the internet for access
• You don’t need to “install” application software, and “version upgrades” are pushed seamlessly
• You expect reliable, on-demand, self-service of the application
• You expect the ability to instantaneously upgrade (e.g. more storage, no ads, etc.)
• You rely on the service provider for infrastructure (e.g. you don’t set up a mail server)
• You rely on the service provider for security and privacy
• You rely on the service provider for backup and recovery
*Note: a lot of these services come with client apps – we are not considering that scenario here.
What is Cloud Computing?
• “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
• Key enabling technologies include: (1) fast wide-area networks, (2) powerful, inexpensive server computers, and (3) high-performance virtualization for commodity hardware.”
Source: http://www.nist.gov/itl/cloud/
http://www.intel.com/content/www/us/en/cloud-computing/cloud-101-video.html
Deployment Models
There are 3 basic deployment models in cloud computing:
• Private Cloud
o Two kinds of private clouds:
‒ On-Prem Private Cloud: On-Prem Data Center + Network Virtualization + Cloud Orchestration Software
‒ Externally Hosted Private Cloud (also called Virtual Private Cloud): Logically isolated, user-defined, and user-controlled portion of a 3rd party hosted cloud (like AWS or Microsoft).
o Provides high degree of Control
o Good for highly-sensitive data and applications
• Public Cloud
o Third-Party Provides Cloud Services (3 different service models - IaaS, PaaS, or SaaS)
o Typically pay-as-you-go model (you pay for what you use)
o Service Provider held to agreed upon availability, reliability, privacy and security standards
o Provides high degree of Scalability
o Example: Amazon AWS, Microsoft Azure, Google Cloud
• Hybrid Cloud
o Combination of Private and Public Cloud
o Allows you to pick desired level of Control vs Scalability
Service Models
There are 4 basic service models in cloud computing, based on what parts of the stack the User controls vs what the Cloud Provider manages.
• Private: User controls everything from the networking to the applications. Example: user’s on-premise datacenter.
• IaaS: User controls the application down to the underlying OS, and the Cloud Provider manages the virtualization layer and the hardware. Example: getting a virtual server in the cloud.
• PaaS: User controls application and data, and the Cloud Provider provisions the underlying supporting infrastructure, typically including operating system, programming-language execution environment, database, and web servers. This allows developers to focus on application development instead of worrying about underlying hardware and software layers.
• SaaS: User gains access to application software and databases. Cloud providers install and operate application software, and manage the infrastructure and platforms that run the applications. Example: O365 in the cloud.
Image Source: http://cloudcomputing.sys-con.com/node/2932264
Reference: https://en.wikipedia.org/wiki/Cloud_computing
* Note: “Managed by Microsoft” is just an example – it’s essentially cloud provider of your choice…
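The user-versus-provider split in the four service models can be written down as a small lookup. This is a sketch: the layer names in `STACK` are an assumption (one common way to draw the stack) and are not taken from the slide's image.

```python
from typing import Dict, List, Optional

# Assumed stack layers, ordered from the bottom (hardware) to the top (application).
STACK: List[str] = [
    "networking", "storage", "servers", "virtualization",
    "operating system", "middleware", "runtime", "data", "application",
]

# Lowest layer the user still controls in each model, following the bullets above.
# None means the provider manages the whole stack.
USER_CONTROLS_FROM: Dict[str, Optional[str]] = {
    "Private": "networking",
    "IaaS": "operating system",
    "PaaS": "data",
    "SaaS": None,
}


def responsibilities(model: str) -> Dict[str, List[str]]:
    """Split the stack into provider-managed and user-controlled layers for a model."""
    start = USER_CONTROLS_FROM[model]
    split = len(STACK) if start is None else STACK.index(start)
    return {"provider": STACK[:split], "user": STACK[split:]}


for model in USER_CONTROLS_FROM:
    split = responsibilities(model)
    print(f"{model}: user controls {split['user'] or 'nothing below the application UI'}")
```

Moving from Private to SaaS, the boundary simply slides up the stack: the user gives up control layer by layer in exchange for less to manage.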
Key Characteristics
• On-demand self-service: A consumer can provision computing capabilities automatically, as needed, without requiring human interaction with each service provider.
• Device and location independence: Users can access the service using a web browser regardless of location or device used (e.g., PC, mobile phone).
• Resource pooling: Computing resources are pooled to serve multiple consumers, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.
• Scalability and elasticity: Dynamic on-demand provisioning of resources on a fine-grained, self-service basis in near real-time without users having to engineer for peak loads.
• Measured service: Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Reference: http://www.nist.gov/itl/cloud/ and https://en.wikipedia.org/wiki/Cloud_computing
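To make "without users having to engineer for peak loads" concrete, here is a back-of-the-envelope sketch comparing elastic provisioning with provisioning for the peak. The hourly load figures and the per-server capacity are invented for illustration only.

```python
import math
from typing import List

# Hypothetical numbers, not from the lecture.
HOURLY_LOAD: List[int] = [120, 80, 300, 950, 400, 150]  # requests/sec in each hour
CAPACITY_PER_SERVER = 100                                # requests/sec one server handles


def servers_needed(load: int, capacity: int) -> int:
    """Smallest number of servers that can carry the given load."""
    return math.ceil(load / capacity)


# Elastic: provision each hour only what that hour's load needs (measured, pay-per-use).
elastic_server_hours = sum(servers_needed(load, CAPACITY_PER_SERVER) for load in HOURLY_LOAD)

# Fixed: provision for the peak hour and keep that many servers running all the time.
peak_server_hours = servers_needed(max(HOURLY_LOAD), CAPACITY_PER_SERVER) * len(HOURLY_LOAD)

print("Elastic server-hours:         ", elastic_server_hours)
print("Peak-provisioned server-hours:", peak_server_hours)
```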
Advantages and Risks
• Advantages
o Scalability and elasticity by design (dynamic on-demand provisioning of resources)
o Convenience by design (device and location independence)
o Continuous Availability by design (on-demand self-service)
o Improved Reliability due to use of multiple redundant sites
o Faster Deployment since infrastructure set up is quick, and software integration is easier
o Cost Reduction due to savings on sunk cost of infrastructure, licenses, and maintenance
• Risks
o Limited Control over infrastructure, software, and data
o Security and Privacy of data is at the mercy of the Service Provider
o Dependency on the Provider can lead to vendor lock-in and migration challenges
o Downtime of service can occur due to Service Provider outage or network access issues
Reference: https://en.wikipedia.org/wiki/Cloud_computing
What did we learn today?
• Four key elements make up big data platforms:
o Virtualization, Cloud Computing, Parallel Programming and Map Reduce.
• “Virtualization means that Applications can use a resource without any concern for where it resides, what the technical interface is, how it has been implemented, which platform it uses, and how much of it is available.”
o Virtualization can occur at different levels of the stack: Server, Storage, Network, Desktop and Application.
• “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
o Three Deployment Models: Private, Public, Hybrid.
o Four Service Models: Private, IaaS, PaaS, SaaS.
o There are Advantages and Risks involved in Cloud Computing that one must be aware of.
Required Readings for this Lecture
• Contents of this Deck
o Note: Anything I’ve linked to as “Source”, “Reference”, or “Optional Reading” in the deck is not required reading.
• Supplemental notes you take during class
• Homework - spend 5-10 minutes on each of these Sites: Amazon AWS, Microsoft Azure, Google Cloud
o Do you now see a number of familiar terms on these sites?
o What deployment models do they cover?
o What service models do they cover?
o Note how they all have very similar competing offers (including free trials to improve adoption).