Big data with Hadoop - Introduction
Post on 16-Apr-2017
Big Data
Hadoop
Tomy Rhymond | Sr. Consultant | HMB Inc. | email@example.com | 614.432.9492
"Torture the data, and it will confess to anything." - Ronald Coase, Economics, Nobel Prize Laureate
"The goal is to turn data into information, and information into insight." - Carly Fiorina
"Data are becoming the new raw material of business." - Craig Mundie, Senior Advisor to the CEO at Microsoft
"In God we trust. All others must bring data." - W. Edwards Deming, statistician, professor, author, lecturer, and consultant
Agenda
- Data
- Big Data
- Hadoop
- Microsoft Azure HDInsight
- Hadoop Use Cases
- Demo
  - Configure Hadoop Cluster / Azure Storage
  - C# MapReduce
  - Load and Analyze with Hive
  - Use Pig Script to Analyze Data
  - Excel Power Query
Houston, we have a data problem.
- IDC estimates put the size of the digital universe at 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010.
- By 2020, emerging markets will supplant the developed world as the main producer of the world's data.
- This flood of data is coming from many sources:
  - The New York Stock Exchange generates about 1 terabyte of trade data per day.
  - Facebook hosts approximately one petabyte of storage.
  - The Large Hadron Collider produces about 15 petabytes of data per year.
  - The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
- Mobile devices and social networks contribute to the exponential growth of data.
1 ZB = 1 billion TB
1 PB = 1000 TB
- Grocery chains know what you are buying every week.
- Restaurants know what you eat.
- Cable companies know what you watch.
- Search engines know what you browse.
- Retailers know what you like to wear and what gadgets you are interested in.
- Social networks know what is on your mind.

You are no longer a User123 in some database; you now have a profile. This gives a 360-degree view.
85% Unstructured, 15% Structured
- The data as we know it is structured.
- Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable.
- Not all the data we collect conforms to a specific, pre-defined data model. It tends to be human-generated, people-oriented content that does not fit neatly into database tables.
- 85 percent of business-relevant information originates in unstructured form, primarily text.
- Lack of structure makes compilation a time- and energy-consuming task.
- This data is so large and complex that it becomes difficult to process using on-hand management tools or traditional data processing applications.
- This type of data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it.
What are some examples of Unstructured Data?
- E-mails
- Reports
- Excel files
- Word documents
- PDF documents
- Images (e.g., .jpg or .gif)
- Media (e.g., .mp3, .wma, or .wmv)
- Text files
- PowerPoint presentations
- Social media
- Internet forums
Relational data: SQL data
Unstructured data: Twitter feed
Semi-structured data: JSON
Unstructured data: Amazon review
So what is Big Data?
- Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
- Capturing and managing lots of information; working with many new types of data.
- Exploiting these masses of information and new data types in applications to extract meaningful value from big data.
- The process of applying serious computing to seriously massive and often highly complex sets of information.
- Big data is arriving from multiple sources at an alarming velocity, volume and variety.
- More data leads to more accurate analyses. More accurate analyses may lead to more confident decision making.
The 4 Vs of Big Data
Volume: We currently see exponential growth in data storage, as data is now more than text. There are videos, music and large images on our social media channels. It is very common for enterprises to have storage systems holding terabytes or petabytes.
Velocity: Velocity describes the frequency at which data is generated, captured and shared. Recent developments mean that not only consumers but also businesses generate more data in much shorter cycles.
Variety: Today's data no longer fits into neat, easy-to-consume structures. New types include content, geo-spatial, hardware data points, location-based, log data, machine data, metrics, mobile, physical data points, process, RFID, etc.
Veracity: This refers to the uncertainty of the data available. Veracity isn't just about data quality, it's about data understandability. Veracity has an impact on the confidence in data.
Volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage, and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.
Velocity, the increasing rate at which data flows into an organization, has followed a similar pattern to that of volume. The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider.
Variety: Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse and doesn't fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these things come ready for integration into an application.
Veracity (Uncertainty): The lack of certainty. A state of having limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome.
Big Data vs Traditional Data

            Traditional                  Big Data
Data Size   Gigabytes                    Petabytes
Access      Interactive and Batch        Batch
Updates     Read and Write many times    Write once, read many times
Structure   Static Schema                Dynamic Schema
Integrity   High                         Low
Scaling     Nonlinear                    Linear
Data Storage
- The storage capacity of hard drives has increased massively over the years.
- On the other hand, the access speeds of the drives have not kept up.
- A drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so it could read all its data in about 5 minutes.
- Today one-terabyte drives are the norm, but the transfer rate is around 100 MB/s.
  - It takes more than two and a half hours to read all the data.
  - Writing is even slower.
- The obvious way to reduce the time is to read from multiple disks at once.
  - Have 100 disks, each holding one hundredth of the data. Working in parallel, we could read all the data in under 2 minutes.
- Move computing to the data rather than bringing data to the computing.
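The read-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and perfectly even striping across disks; the function name is only illustrative.

```python
# Back-of-the-envelope check of the drive read times quoted on the slide.
# Assumption: decimal units, 1 TB = 1,000,000 MB, and ideal parallelism.
TB_IN_MB = 1_000_000


def read_time_seconds(capacity_mb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Seconds to read capacity_mb spread evenly over `disks` drives read in parallel."""
    return capacity_mb / disks / rate_mb_per_s


drive_1990 = read_time_seconds(1370, 4.4)               # 1990-era drive
drive_today = read_time_seconds(TB_IN_MB, 100)           # single 1 TB drive
parallel = read_time_seconds(TB_IN_MB, 100, disks=100)   # 100 drives in parallel

print(f"1990 drive:       {drive_1990 / 60:.1f} minutes")
print(f"1 TB drive:       {drive_today / 3600:.2f} hours")
print(f"100 drives:       {parallel / 60:.2f} minutes")
```

The single-drive case works out to roughly 2.8 hours and the 100-drive case to 100 seconds, which matches the "more than two and a half hours" and "under 2 minutes" claims.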
Why big data should matter to you
The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable:
- cost reductions
- time reductions
- new product development and optimized offerings
- smarter business decision making

By combining big data and high-powered analytics, it is possible to:
- Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
- Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
- Quickly identify customers who matter the most.
- Generate retail coupons at the point of sale based on the customer's current and past purchases.
- Optimize routes for many thousands of package delivery vehicles while they are on the road.
- Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
- Recalculate entire risk portfolios in minutes.
- Use clickstream analysis and data mining to detect fraudulent behavior.
OK, I've got Big Data. Now what?
- The huge influx of data raises many challenges.
- Analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information.
- To analyze and extract meaningful value from these massive amounts of data, we need optimal processing power.
- We need parallel processing, which therefore requires many pieces of hardware.
- When we use many pieces of hardware, the chance that one will fail is fairly high.
- A common way of avoiding data loss is through replication.
  - Redundant copies of the data are kept.
- Data analysis tasks need to combine data.
  - The data from one disk may need to be combined with data from 99 other disks.
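Why failure becomes likely with many machines, and why replication helps, can be shown with a small probability sketch. The 1% per-machine failure probability and the assumption that machines fail independently are hypothetical values chosen for illustration, not figures from this deck.

```python
# Sketch: failure odds in a cluster, assuming independent machine failures.
# The per-machine failure probability below is a made-up illustrative value.


def p_any_failure(p_machine: float, machines: int) -> float:
    """Probability that at least one of `machines` fails."""
    return 1 - (1 - p_machine) ** machines


def p_all_replicas_lost(p_machine: float, replicas: int) -> float:
    """Probability that every machine holding a copy of a block fails."""
    return p_machine ** replicas


P_MACHINE = 0.01  # hypothetical: 1% chance a given machine fails in some period

single = p_any_failure(P_MACHINE, 1)
cluster = p_any_failure(P_MACHINE, 100)
lost_with_3_copies = p_all_replicas_lost(P_MACHINE, 3)

print(f"1 machine:    {single:.1%} chance of a failure")
print(f"100 machines: {cluster:.1%} chance of at least one failure")
print(f"3 replicas:   {lost_with_3_copies:.6%} chance all copies of a block are lost")
```

Under these assumptions a failure somewhere in a 100-machine cluster is more likely than not, while losing all three copies of a block is about a one-in-a-million event.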
Challenges of Big Data
- Information Growth: Over 80% of the data in the enterprise consists of unstructured data, which is growing at a much faster pace than traditional data.
- Processing Power: The approach of using a single, expensive, powerful computer to crunch information doesn't scale for Big Data.
- Physical Storage: Capturing and managing all this information can consume enormous resources.
- Data Issues: Lack of data mobility, proprietary formats and interoperability obstacles can make working with Big Data complicated.
- Costs: Extract, transform and load (ETL) processes for Big Data can be expensive and time consuming.
Apache Hadoop
- Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.
- It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
- All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
- Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS).
History of Hadoop
- Hadoop is not an acronym; it's a made-up name. It is named after a yellow stuffed elephant belonging to the son of Doug Cutting (project creator).
- 2002-2004: Nutch Project - a web-scale, open source, crawler-based search engine.
- 2003-2004: Google published papers describing GFS (Google File System) and MapReduce.
- 2005-2006: GFS-style storage and MapReduce implementations were added to Nutch.
- 2006-2008: Yahoo hired Doug Cutting and his team. They spun out the storage and processing parts of Nutch to form Hadoop.
- 2009: Sorted 500 GB in 59 seconds (on 1400 nodes) and 100 TB in 173 minutes (on 3400 nodes).
Hadoop Modules
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other related modules:
- Cassandra: A scalable multi-master database with no single points of failure.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- ZooKeeper: A high-performance coordination service for distributed applications.
- Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Apache Hadoop YARN (short, in self-deprecating fashion, for Yet Another Resource Negotiator) is a cluster management technology. It is one of the key features in second-generation Hadoop.
HDFS Hadoop Distributed File System
The heart of Hadoop is HDFS.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS is designed on the following assumptions and goals:
- Hardware failure is the norm rather than the exception.
- HDFS is designed more for batch processing than for interactive use by users.
- Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
- HDFS applications use a write-once-read-many access model. A file, once created, written, and closed, need not be changed.
- A computation requested by an application is much more efficient if it is executed near the data it operates on. In other words, moving computation is cheaper than moving data.
- HDFS is easily portable from one platform to another.
HDFS has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

** Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

** Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates.

** Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

** Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

** Moving Computation is Cheaper than Moving Data
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

** Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
- NameNode: The NameNode is the node that stores the filesystem metadata, i.e. which file maps to which block locations and which blocks are stored on which datanode.
- Secondary NameNode: The NameNode is the single point of failure.
- DataNode: The data node is where the actual data resides.
  - All datanodes send a heartbeat message to the namenode every 3 seconds to say that they are alive.
  - The data nodes can talk to each other to rebalance data, move and copy data around, and keep the replication high.
- Job Tracker / Task Tracker:
  - The primary function of the job tracker is resource management (managing the task trackers), tracking resource availability, and task life cycle management (tracking progress, fault tolerance, etc.).
  - The task tracker has the simple function of following the orders of the job tracker and periodically updating the job tracker with its progress status.
The namenode maintains two in-memory tables, one which maps the blocks to datanodes (one block maps to 3 datanodes for a replication value of 3) and a datanode to block number mapping. Whenever a datanode reports a disk corruption of a particular block, the first table gets updated and whenever a datanode is detected to be dead (because of a node/network failure) both the tables get updated.
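The two in-memory tables described above can be sketched as a pair of dictionaries. This is only an illustration of the update rules in the note (a corrupt-block report touches the first table, a dead datanode touches both); the class and method names are hypothetical and are not Hadoop's actual internals.

```python
# Illustrative sketch of the namenode's two in-memory tables.
# Class and method names are hypothetical, not Hadoop's real data structures.
from collections import defaultdict


class BlockMaps:
    def __init__(self) -> None:
        self.block_to_nodes: dict[str, set[str]] = defaultdict(set)  # table 1
        self.node_to_blocks: dict[str, set[str]] = defaultdict(set)  # table 2

    def add_replica(self, block: str, node: str) -> None:
        self.block_to_nodes[block].add(node)
        self.node_to_blocks[node].add(block)

    def report_corrupt(self, block: str, node: str) -> None:
        # Disk corruption of one block: only the block -> datanodes table changes.
        self.block_to_nodes[block].discard(node)

    def node_dead(self, node: str) -> None:
        # Dead datanode: both tables are updated.
        for block in self.node_to_blocks.pop(node, set()):
            self.block_to_nodes[block].discard(node)

    def under_replicated(self, replication: int = 3) -> list[str]:
        return sorted(b for b, nodes in self.block_to_nodes.items()
                      if len(nodes) < replication)


maps = BlockMaps()
for dn in ("dn1", "dn2", "dn3"):
    maps.add_replica("b1", dn)
for dn in ("dn1", "dn2", "dn4"):
    maps.add_replica("b2", dn)

maps.report_corrupt("b1", "dn3")
after_corruption = maps.under_replicated()

maps.node_dead("dn1")
after_dead_node = maps.under_replicated()
print(after_corruption, after_dead_node)
```

Blocks returned by `under_replicated` are the ones a namenode would schedule for re-replication onto other datanodes.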
The data node is where the actual data resides. Some of its interesting traits are as follows:
- All datanodes send a heartbeat message to the namenode every 3 seconds to say that they are alive. If the namenode does not receive a heartbeat from a particular data node for 10 minutes, it considers that data node to be dead/out of service and initiates replication of the blocks that were hosted on that data node onto some other data node.
- The data nodes can talk to each other to rebalance data, move and copy data around, and keep the replication high.
- When the datanode stores a block of information, it maintains a checksum for it as well. The data nodes update the namenode with the block information periodically, and verify the checksums before updating. If the checksum is incorrect for a particular block, i.e. there is a disk-level corruption for that block, the datanode skips that block while reporting the block information to the namenode. In this way, the namenode is aware of the disk-level corruption on that datanode and takes steps accordingly.
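The heartbeat and checksum behaviour in this note can be sketched as follows. The 3-second and 10-minute values come from the note; the class and function names are hypothetical, and CRC32 from Python's zlib module stands in for whatever checksum a real datanode uses.

```python
# Sketch of heartbeat-based liveness and checksum-filtered block reports.
# Names are hypothetical; zlib.crc32 is a stand-in checksum for illustration.
import zlib

HEARTBEAT_INTERVAL_S = 3   # datanodes report every 3 seconds
DEAD_AFTER_S = 10 * 60     # silent for 10 minutes means dead


class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list[str]:
        return sorted(n for n, seen in self.last_seen.items()
                      if now - seen >= DEAD_AFTER_S)


def block_report(blocks: dict[str, bytes], checksums: dict[str, int]) -> list[str]:
    """Report only blocks whose stored data still matches its checksum."""
    return sorted(b for b, data in blocks.items()
                  if zlib.crc32(data) == checksums[b])


# dn1 keeps sending heartbeats; dn2 goes silent after time 0.
monitor = HeartbeatMonitor()
monitor.heartbeat("dn2", 0)
for t in range(0, 601, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", t)
dead = monitor.dead_nodes(now=600)

# Block b2 is corrupted on disk after its checksum was recorded.
blocks = {"b1": b"hello", "b2": b"world"}
checksums = {name: zlib.crc32(data) for name, data in blocks.items()}
blocks["b2"] = b"w0rld"
report = block_report(blocks, checksums)
print(dead, report)
```

A block missing from the report is how the namenode learns about the corruption and can re-replicate that block from a healthy copy.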
HDFS - InputSplit
- InputFormat: Splits the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing.
- RecordReader: A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs.
Suppose a data set is composed of a single 300 MB file, spanning 3 different blocks (blocks of 128 MB), and suppose that we are able to get 1 InputSplit for each block. Let's now imagine 3 different scenarios.
The first Reader will start reading bytes from Block B1, position 0. The first two EOLs will be met at 50 MB and 100 MB respectively. Two lines (L1 and L2) will be read and sent as key/value pairs to the Mapper 1 instance. Then, starting from byte 100 MB, we will reach the end of our split (128 MB) before having found the third EOL. This incomplete line will be completed by reading the bytes in Block B2 until position 150 MB. The first part of line L3 will be read locally from Block B1, the second part will be read remotely from Block B2 (by means of FSDataInputStream), and a complete record will finally be sent as key/value to Mapper 1.

The second Reader starts on Block B2, at position 128 MB. Because 128 MB is not the start of a file, there is a strong chance that our pointer is located somewhere in an existing record that has already been processed by the previous Reader. We need to skip this record by jumping to the next available EOL, found at position 150 MB. The actual start of RecordReader 2 will be at 150 MB instead of 128 MB.

We can wonder what happens when a block starts exactly on an EOL. By jumping to the next available record (through the readLine method), we might miss one record. Before jumping to the next EOL, we actually need to decrement the initial start value to start - 1. Being located at least 1 offset before the EOL, we ensure that no record is skipped.
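The boundary rules described above can be tried out on a small in-memory example. This is a simplified Python sketch of the idea, not Hadoop's RecordReader code: the function name, the sample data, and the 5-byte split size are all made up for illustration.

```python
# Simplified sketch of line-oriented record reading across split boundaries.
# Not Hadoop's implementation; names, data and split size are illustrative.


def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines that belong to the split [start, end)."""
    pos = start
    if start != 0:
        # Step back one byte, then skip to just after the next EOL. If the
        # split begins exactly at the start of a record, the byte at start - 1
        # is the EOL itself, so that record is not skipped.
        eol = data.find(b"\n", start - 1)
        if eol == -1:
            return []
        pos = eol + 1
    records = []
    # A record belongs to this split if it starts before `end`; the read may
    # run past `end` to finish the last line (a remote read in HDFS terms).
    while pos < end and pos < len(data):
        eol = data.find(b"\n", pos)
        if eol == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:eol])
        pos = eol + 1
    return records


data = b"aaaa\nbbbbbb\ncc\ndddd\n"
split_size = 5
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]
per_split = [read_split(data, s, e) for s, e in splits]
all_records = [r for recs in per_split for r in recs]
print(per_split)
```

Here the second split starts exactly at the beginning of the `bbbbbb` record and the fourth at the beginning of `dddd`; thanks to the one-byte step back, neither record is lost, and no record is read twice.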
MapReduce
- Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
- The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
  - The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
  - The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
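The map, sort and reduce phases can be illustrated with the classic word count, run here in a single Python process. This is a sketch of the programming model only; it does not use Hadoop, and the function names are arbitrary.

```python
# Single-process sketch of the MapReduce model using word count.
# It mimics the map -> sort/group -> reduce flow; no Hadoop APIs are involved.
from collections import defaultdict
from itertools import chain
from typing import Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map: break a line into (word, 1) key/value pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> list[tuple[str, list[int]]]:
    """Group values by key and sort by key, as the framework does between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a smaller result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> list[tuple[str, int]]:
    mapped = chain.from_iterable(map_fn(line) for line in lines)
    return [reduce_fn(key, values) for key, values in shuffle(mapped)]


counts = run_job(["big data", "Big Hadoop data"])
print(counts)
```

In a real cluster each input chunk would be mapped on a different node and the grouped keys would be partitioned across several reducers; the data flow is the same.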
Hadoop Distributions
- Microsoft Azure HDInsight
- IBM InfoSphere BigInsights
- Hortonworks
- Amazon Elastic MapReduce
- Cloudera CDH
Hadoop Meets the Mainframe
- BMC: Control-M for Hadoop is an extension of BMC's larger Control-M product suite, which was born in 1987 as an automated mainframe job scheduler.
- Compuware: APM is an application performance management suite that also spans the arc of enterprise data center computing, including distributed commodity servers.
- Syncsort: Syncsort offers Hadoop Connectivity to move data between Hadoop and other platforms including the mainframe, Hadoop Sort Acceleration, and Hadoop ETL for cross-platform data integration.
- Informatica: HParser can run its data transformation services as a distributed application on Hadoop's MapReduce engine.
HDInsight makes Apache Hadoop available as a service in the cloud.
- Process, analyze, and gain new insights from big data using the power of Apache Hadoop.
- Drive decisions by analyzing unstructured data with Azure HDInsight, a big data solution powered by Apache Hadoop.
- Build and run Hadoop clusters in minutes.
- Analyze results with Power Pivot and Power View in Excel.
- Choose your language, including Java and .NET.
- Query and transform data through Hive.
HDInsight also provides a cost efficient approach to the managing and storing of data using Azure Blob storage.
Scale elastically on demand - HDInsight is a Hadoop distribution powered by the cloud. This means HDInsight was architected to handle any amount of data, scaling from terabytes to petabytes on demand.
Crunch all data: structured, semi-structured, unstructured - Since it's 100% Apache Hadoop, HDInsight can process unstructured or semi-structured data from web clickstreams, social media, server logs, devices and sensors, and more. This allows you to analyze new sets of data, which uncovers new business possibilities to drive your organization forward.
Develop in your favorite language - HDInsight has powerful programming extensions for languages including C#, Java, .NET, and more.
No hardware to acquire or maintain - With HDInsight, you can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There's also no time-consuming installation or setup. Azure does it for you. You can launch your first cluster in minutes.
Use Excel to visualize your Hadoop data - Because it's integrated with Excel, HDInsight lets you visualize and analyze your Hadoop data in compelling new ways in a tool familiar to your business users. From Excel, users can select Azure HDInsight as a data source.
Connect on-premises Hadoop clusters with the cloud - HDInsight is also integrated with Hortonworks Data Platform, so you can move Hadoop data from an on-site datacenter to the Azure cloud for backup, dev/test, and cloud bursting scenarios. Using the Microsoft Analytics Platform System, you can even query your on-premises and cloud-based Hadoop clusters at the same time.
Includes NoSQL transactional capabilities - HDInsight will also include Apache HBase, a columnar NoSQL database that runs on top of the Hadoop Distributed File System (HDFS). This allows you to do large transactional processing (OLTP) of nonrelational data, enabling use cases like having interactive websites or sensor data write to Azure Blob storage.
Sqoop is a tool that transfers bulk data between Hadoop and relational databases such as SQL, or other structured data stores, as efficiently as possible. Use Sqoop to import data from external structured data stores into HDFS or related systems like Hive. Sqoop can also extract data from Hadoop and export the extracted data to external relational databases, enterprise data warehouses, or any other structured data store type.
Familiar Business Intelligence (BI) tools such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services retrieve, analyze, and report data integrated with Windows Azure HDInsight using ODBC drivers. The Hive ODBC driver and Hive Add-in for Excel are available for download on the HDInsight dashboard.
Hive is a distributed data warehouse managing data stored in HDFS. It is the Hadoop query engine. Hive is for analysts with strong SQL skills, providing an SQL-like interface and a relational data model. Hive uses a language called HiveQL, a dialect of SQL. Hive, like Pig, is an abstraction on top of MapReduce, and when run, Hive translates queries into a series of MapReduce jobs. Scenarios for Hive are closer in concept to those for an RDBMS, and so are appropriate for use with more structured data. For unstructured data, Pig is a better choice. Windows Azure HDInsight includes an ODBC driver for Hive, which provides direct real-time querying from business intelligence tools such as Excel into Hadoop.
HDInsight
- The combination of Azure Storage and HDInsight provides an ultimate framework for running MapReduce jobs.
- Creating an HDInsight cluster is quick and easy: log in to Azure, select the number of nodes, name the cluster, and set permissions.
- The cluster is available on demand, and once a job is completed, the cluster can be deleted but the data remains in Azure Storage.
- Use PowerShell to submit MapReduce jobs.
- Use C# to create MapReduce programs.
- Supports Pig Latin, Avro, Sqoop and more.
Having the data securely stored in the cloud before, after, and during processing gives HDInsight an edge compared with other types of Hadoop deployments. Storing the data this way is particularly useful in cases where the Hadoop cluster does not need to stay up for long periods of time. It is worth noting that some other usage patterns, such as the data exploration pattern (also known as the data lake pattern), require the Hadoop cluster and the data to be persisted at all times. In these usage patterns, users analyze the data directly on Hadoop, and for these cases, other Hadoop solutions, such as the Microsoft Analytics Platform System or Hortonworks Data Platform for Windows, are more suitable.
Azure Storage is massively scalable, so you can store and process hundreds of terabytes of data to support the big data scenarios required by scientific, financial analysis, and media applications. Or you can store the small amounts of data required for a small business website.
Use cases
- A 360-degree view of the customer
  - Businesses want to know how to utilize social media postings to improve revenue.
- Utilities: Predict power consumption
- Marketing: Sentiment analysis
- Customer service: Call monitoring
- Retail and marketing: Mobile data and location-based targeting
- Internet of Things (IoT)
- Big Data Service Refinery
Utilities: Predict power consumption
Utility companies have rolled out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate huge volumes of interval data that need to be analyzed. Utilities also run big, expensive, and complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics. To gain operating efficiency, the company must monitor the data delivered by the sensors. A big data solution can analyze power generation (supply) and power consumption (demand) data using smart meters.

Marketing: Sentiment analysis
Marketing departments use Twitter feeds to conduct sentiment analysis to determine what users are saying about the company and its products or services, especially after a new product or release is launched. Customer sentiment must be integrated with customer profile data to derive meaningful results. Customer feedback may vary according to customer demographics.

Customer service: Call monitoring
IT departments are turning to big data solutions to analyze application logs to gain insight that can improve system performance. Log files from various application vendors are in different formats; they must be standardized before IT departments can use them.

Retail and marketing: Mobile data and location-based targeting
Retailers can target customers with specific promotions and coupons based on location data. Solutions are typically designed to detect a user's location upon entry to a store or through GPS. Location data combined with customer preference data from social networks enables retailers to target online and in-store marketing campaigns based on buying history. Notifications are delivered through mobile applications, SMS, and email.
Demo
- Configure HDInsight Cluster
- Create Mapper and Reducer Program using Visual Studio C#
- Upload Data to Blob Storage using Azure Storage Explorer
- Run Hadoop Job
- Export output to Power Query for Excel
- Hive Example with HDInsight
- Pig Script with HDInsight
Resources for HDInsight for Windows Azure

Microsoft: HDInsight
- Welcome to Hadoop on Windows Azure - the welcome page for the Developer Preview for the Apache Hadoop-based Services for Windows Azure.
- Apache Hadoop-based Services for Windows Azure How To Guide - Hadoop on Windows Azure documentation.
- Big Data and Windows Azure - Big Data scenarios that explore what you can build with Windows Azure.

Microsoft: Windows and SQL Database
- Windows Azure home page - scenarios, free trial sign up, development tools and documentation that you need to get started building applications.
- MSDN SQL - MSDN documentation for SQL Database.
- Management Portal for SQL Database - a lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
- Adventure Works for SQL Database - download page for the SQL Database sample database.

Microsoft: Business Intelligence
- Microsoft BI PowerPivot - a powerful data mashup and data exploration tool.
- SQL Server 2012 Analysis Services - build comprehensive, enterprise-scale analytic solutions that deliver actionable insights.
- SQL Server 2012 Reporting - a comprehensive, highly scalable solution that enables real-time decision making across the enterprise.

Apache Hadoop:
- Apache Hadoop - software library providing a framework that allows for the distributed processing of large data sets across clusters of computers.
- HDFS - Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
- Map Reduce - a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

Hortonworks:
- Sandbox - a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
About Me
Tomy Rhymond
Sr. Consultant, HMB, Inc.