Apache Hadoop

TRANSCRIPT

  1. 1. Apache Hadoop Sheetal Sharma Intern At IBM Innovation Centre
  2. 2. Why Data? Get insights to offer a better product More data usually beats better algorithms Get insights to make better decisions Avoid guesstimates
  3. 3. What Is Challenging? Store data reliably Analyze data quickly Do it in a cost-effective way Use an expressive, high-level language
  4. 4. Fundamental Ideas A big system of machines, not a big machine Failures will happen Move computation to data, not data to computation Write complex code only once, but right
  5. 5. Apache Hadoop An open-source Java framework Stores and processes very large data sets Runs on clusters of commodity machines Uses a simple programming model
  6. 6. Apache Hadoop Two main components: HDFS - a distributed file system MapReduce - a distributed processing layer
  7. 7. HDFS The Purpose Of HDFS Store large datasets in a distributed, scalable and fault-tolerant way High throughput Very large files Streaming reads and writes (no edits)
  8. 8. HDFS Mis-Usage Do NOT use it if you have Low-latency requests Random reads and writes Lots of small files Then it is better to consider an RDBMS
  9. 9. Splitting Files And Replicating Blocks Split a very large file into smaller (but still large) blocks Store them redundantly on a set of machines
  10. 10. Splitting Files Into Blocks The default block size is 64MB This minimizes the overhead of a disk seek operation (less than 1%) A file is simply sliced into chunks after each 64MB (or so)
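As a rough illustration of the per-file block size, here is a minimal sketch (not from the slides) that writes a file through the Hadoop FileSystem API with an explicit 128 MB block size; the path reuses the example path from slide 17, and the buffer size and replication values are illustrative assumptions.

    // Sketch: creating an HDFS file with an explicit block size.
    // The 128 MB value, buffer size, and path are illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            long blockSize = 128L * 1024 * 1024;   // 128 MB instead of the 64 MB default
            short replication = 3;                 // default replication factor
            int bufferSize = 4096;
            FSDataOutputStream out = fs.create(
                    new Path("/toplist/2013-05-15/poland.txt"),
                    true, bufferSize, replication, blockSize);
            out.writeBytes("example record\n");
            out.close();
        }
    }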
  11. 11. Replicating Blocks The default replication factor is 3 It can be changed per file or per directory
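A minimal sketch of changing the replication factor of an existing file via the Java API; it is the programmatic equivalent of the hadoop fs -setrep shell command, and the path and target factor of 2 are illustrative assumptions.

    // Sketch: lowering the replication factor of one file to 2.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask the NameNode to keep only 2 copies of this file's blocks from now on.
            boolean scheduled = fs.setReplication(
                    new Path("/toplist/2013-05-15/poland.txt"), (short) 2);
            System.out.println("Replication change scheduled: " + scheduled);
        }
    }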
  12. 12. Master And Slaves The Master node keeps and manages all metadata information The Slave nodes store blocks of data and serve them to the client Master node (called NameNode) Slave nodes (called DataNodes)
  13. 13. Classical* HDFS Cluster (*no NameNode HA, no HDFS Federation) The NameNode manages metadata The Secondary NameNode does some house-keeping operations for the NameNode The DataNodes store and retrieve blocks of data
  14. 14. HDFS NameNode Performs all the metadata-related operations Keeps information in RAM (for fast look up) The file system tree Metadata for all files/directories (e.g. ownership, permissions) Names and locations of blocks
  15. 15. HDFS DataNode Stores and retrieves blocks of data Data is stored as regular files on a local filesystem (e.g. ext4) e.g. blk_-992391354910561645 (+ checksums in a separate file) A block itself does not know which file it belongs to! Sends a heartbeat message to the NN to say that it is still alive Sends a block report to the NN periodically
  16. 16. HDFS Secondary NameNode NOT a failover NameNode Periodically merges a prior snapshot (fsimage) and editlog(s) (edits) Fetches current fsimage and edits files from the NameNode Applies edits to fsimage to create the up-to-date fsimage Then sends the up-to-date fsimage back to the NameNode
  17. 17. Reading A File From HDFS Block data is never sent through the NameNode The NameNode redirects a client to an appropriate DataNode The NameNode chooses a DataNode that is as close as possible Lots of data comes from DataNodes to a client Block locations $ hadoop fs -cat /toplist/2013-05-15/poland.txt
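For reference, a minimal Java sketch of the same read path as the hadoop fs -cat command on the slide: the client asks the NameNode for block locations via the FileSystem API and then streams the block data directly from DataNodes. The 4096-byte buffer size is an illustrative assumption.

    // Sketch: Java equivalent of `hadoop fs -cat /toplist/2013-05-15/poland.txt`.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CatExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/toplist/2013-05-15/poland.txt"));
            try {
                // Block data flows from DataNodes; only metadata came from the NameNode.
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                in.close();
            }
        }
    }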
  18. 18. HDFS And Local File System Runs on top of a native file system (e.g. ext3, ext4, xfs) HDFS is simply a Java application that uses a native file system underneath
  19. 19. HDFS Data Integrity HDFS detects corrupted blocks When writing Client computes the checksums for each block Client sends checksums to a DN together with data When reading Client verifies the checksums
  20. 20. HDFS NameNode Scalability Stats based on Yahoo! Clusters An average file is about 1.5 blocks (block size = 128 MB) An average file takes about 600 bytes of NameNode RAM (1 file object and 2 block objects) 100M files then need about 60 GB of metadata (100M x 600 bytes)
  21. 21. HDFS NameNode Performance Read/write operations throughput limited by one machine ~120K read ops/sec ~6K write ops/sec MapReduce tasks are also HDFS clients Internal load increases as the cluster grows
  22. 22. HDFS Main Limitations Single NameNode Keeps all metadata in RAM Performs all metadata operations Becomes a single point of failure
  23. 23. MapReduce MapReduce Model Programming model inspired by functional programming map() and reduce() functions processing key-value pairs Useful for processing large data sets
  24. 24. Map And Reduce Functions (diagram of the map and reduce phases)
  25. 25. Map And Reduce Functions - Counting Words (diagram)
  26. 26. MapReduce Job Input data is divided into splits and converted into key-value pairs Invokes the map() function multiple times Keys are sorted, values are not (but could be) Invokes the reduce() function multiple times
  27. 27. MapReduce Example: ArtistCount Artist, Song, Timestamp, User Key is the offset of the line from the beginning of the file We could specify which artist goes to which reducer (HashPartitioner is the default one)
  28. 28. MapReduce Example: ArtistCount
      map(Integer key, EndSong value, Context context):
          context.write(value.artist, 1)
      reduce(String key, Iterator values, Context context):
          int count = 0
          for each v in values:
              count += v
          context.write(key, count)
      Pseudo-code in non-existing language ;)
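A possible runnable Java version of the ArtistCount pseudo-code above, using the org.apache.hadoop.mapreduce API. It assumes plain-text input where each line is tab-separated with the artist in the first column; the EndSong record type from the pseudo-code is replaced by parsing the Text line directly, so the field layout is an assumption.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ArtistCount {

        public static class ArtistMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text artist = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // key = byte offset of the line in the file, value = the line itself
                String[] fields = value.toString().split("\t");
                artist.set(fields[0]);
                context.write(artist, ONE);   // emit (artist, 1)
            }
        }

        public static class ArtistReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int count = 0;
                for (IntWritable v : values) {
                    count += v.get();         // sum the 1s for this artist
                }
                context.write(key, new IntWritable(count));
            }
        }
    }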
  29. 29. MapReduce Combiner Make sure that the Combiner is fast and actually combines enough records (otherwise it only adds overhead)
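A minimal driver sketch, assuming the ArtistCount classes from the previous example; it registers the reducer as the combiner, which is safe here because summing partial counts is associative and commutative. Class and argument names are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ArtistCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "artist count");
            job.setJarByClass(ArtistCountDriver.class);
            job.setMapperClass(ArtistCount.ArtistMapper.class);
            job.setCombinerClass(ArtistCount.ArtistReducer.class);  // local pre-aggregation of map output
            job.setReducerClass(ArtistCount.ArtistReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }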
  30. 30. MapReduce Implementation Batch processing system Automatic parallelization and distribution of computation Fault-tolerance Deals with all messy details related to distributed processing Relatively easy to use for programmers
  31. 31. JobTracker Responsibilities Manages the computational resources Available TaskTrackers, map and reduce slots Schedules all user jobs
  32. 32. TaskTracker Responsibilities Runs map and reduce tasks Reports to JobTracker Heartbeats saying that it is still alive Number of free map and reduce slots Task progress
  33. 33. Apache Hadoop Cluster It can consist of 1, 5, 100 or 4000 nodes
  34. 34. Thank You!