BIG DATA: HADOOP
(Transcript of a PowerPoint presentation)
Background
The exponential growth of data – challenges for Google, Yahoo, Amazon & Microsoft in web search and indexing
• The volume of data being made publicly available increases every year. Organizations' success in the future will be dictated to a large extent by their ability to extract value from other organizations' data.
• Volume, Velocity and Variety of data – the three Vs (V3)
• Data Storage & Analysis – The storage capacity of hard drives has increased, but access speeds have not kept up. A 1-terabyte disk is now the norm, while transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data from a single disk; reading zettabytes of data takes far longer.
– The alternative solution: read from multiple disks in parallel.
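The arithmetic above can be checked in a few lines of Python (the 100-disk cluster size is an illustrative assumption, not from the slides):

```python
# Time to read 1 TB from one disk at 100 MB/s,
# versus reading it in parallel from many disks.
TB = 10**12          # bytes
SPEED = 100 * 10**6  # bytes/second per disk

single_disk_hours = TB / SPEED / 3600
print(f"one disk:  {single_disk_hours:.2f} hours")      # 2.78 hours

disks = 100          # assumed number of disks, for illustration
parallel_minutes = TB / (SPEED * disks) / 60
print(f"{disks} disks: {parallel_minutes:.2f} minutes")  # 1.67 minutes
```

This is the core motivation for Hadoop: spread the data over many disks and read them all at once.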
Background
• Data Storage & Analysis – problems with reading from and writing to multiple disks:
– More hardware pieces means more failures, so the probability of data loss is high
– The solution to data loss is replication; RAID, for example, works through replication
– Analysis needs to combine data from many disks, which brings further challenges
– What is needed is a reliable, shared storage and analysis system
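A back-of-the-envelope sketch of why replication addresses data loss (the per-disk failure probability is an illustrative assumption; 3 is HDFS's default replication factor):

```python
# If each disk independently fails with probability p during some window,
# a replicated block is lost only when *every* replica fails.
p = 0.01       # assumed per-disk failure probability
replicas = 3   # HDFS's default replication factor

loss_without_replication = p
loss_with_replication = p ** replicas  # all replicas must fail together

print(loss_without_replication)  # 0.01
print(loss_with_replication)     # about one in a million
```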
• Hello, Hadoop!
• The Nutch project, by Doug Cutting
• Google's GFS and MapReduce: distributed data storage and processing
• Development continued as a Yahoo project
• Doug Cutting released Apache Hadoop as an open-source framework
• "Hadoop" is a made-up name
Hadoop vs. Other Systems
HADOOP
• Best fit for ad hoc analysis
• Write once, read many times
• Variety of data
• Petabytes of data
• Batch analysis
• Dynamic schema
• Data locality
• Data flow is implicit
• Shared-nothing architecture
• Scales out on commodity hardware
• Key/value pairs
RDBMS
• Good for low-latency data access
• Organized/structured data
• Gigabytes of data
• Interactive and batch
• Static schema
• Scaling is expensive
• Table structure
HPC, GRID & VOLUNTEER COMPUTING
• Distribution of work across the cluster
• Data-intensive applications are limited by network bandwidth
• Compute nodes sit idle while waiting for data
• MPI (Message Passing Interface) gives flexibility, but at the cost of complex data-flow handling
• SETI@home is an example of volunteer computing
• Volunteers donate CPU cycles, not bandwidth
• Volunteer computing uses untrusted computers and has no data locality
HADOOP ARCHITECTURE
Hadoop = HDFS + MapReduce (MR)

HDFS is designed for:
• Very large files
• Streaming data access patterns
• Commodity hardware
• High throughput rather than low latency
• Running on top of the existing file system

HDFS is a poor fit for:
• Lots of small files
• Low-latency data access
• Multiple writers
MapReduce (MR):
1) MAP
2) REDUCE
3) The user writes the code for the MR job
4) Automatic parallelization
5) Fault tolerance
• Jobs can be written in Java, Python, etc.
• Housekeeping is built in
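The map/shuffle/reduce flow above can be sketched in plain Python, with no Hadoop cluster required (the word-count job and input lines are illustrative assumptions; in real Hadoop the shuffle, partitioning, and fault tolerance are handled by the framework):

```python
from collections import defaultdict

# MAP: emit (key, value) pairs -- here, (word, 1) for each word in a line.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

# REDUCE: combine all values seen for one key -- here, sum the counts.
def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(lines):
    # SHUFFLE/SORT: group mapper output by key. In Hadoop this
    # "housekeeping" (partitioning, sorting, moving data) is built in.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

counts = run_mapreduce(["hello hadoop", "hello big data"])
print(counts)  # {'big': 1, 'data': 1, 'hadoop': 1, 'hello': 2}
```

The user supplies only `map_fn` and `reduce_fn` (point 3 above); everything between them is the framework's job.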