Hadoop and Big Data Analytics

Sysfore Technologies

Gartner defines Big Data as “high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”.

Big data is data that, by virtue of its velocity, volume, or variety (the three Vs), cannot be easily stored or analysed with traditional methods.

The term covers every piece of data your organization has stored to date, whether it resides on-premises or in the cloud: paper records, digital files, and structured as well as unstructured data across your company.

A deluge of structured and unstructured data is generated every second, and analysing it helps customers turn that data into insights. AWS provides a broad platform of managed services, infrastructure and tools for your next Big Data project, enabling you to collect, store, process, analyse and visualize Big Data in the cloud. Because AWS supplies the hardware, software and infrastructure, and maintains and scales them for you, you can focus on building your application.

Some of the common Big Data customer scenarios include Web & E-Tailing, Telecommunications, Government, Healthcare & Life Science, Banking & Financial Services and Retail, where Big Data is continuously generated.

How Big Data is consumed by businesses: Businesses can gain considerable insight into how their product is being consumed by analysing the Big Data it generates. Big Data analytics is an area of rapidly growing diversity. Analysing large data sets requires significant compute capacity that can vary in size based on the amount of input data and the analysis required. This characteristic of Big Data workloads is ideally suited to the pay-as-you-go cloud computing model, where applications can easily scale up and down based on demand.

Using Big Data analytics gives you a clear picture of how your data is being generated and consumed by customers, and it can be used for predictive marketing and plans to grow your business. It provides:

- Early key indicators and insights into business trends that shape business fortunes.
- Analytics that translate into business advantage.
- More precise analysis and results as more data becomes available.

Limitations of traditional analytics methods: Advances in technology have resulted in huge volumes of data being generated every second. Storing, processing and analysing that data to get quality results is time-consuming, costly and ineffective with traditional methods.

- Only a limited amount of high-fidelity raw data is available for analysis.
- Storage is strained by the high volume of raw data that is continuously generated.
- Moving data to the computation does not scale.
- Data is archived regularly to conserve space, which limits the amount of data available to analytical tools.
- The perception that traditional data warehousing processes are too slow and limited in scalability.
- The ability to converge data from multiple data sources, both structured and unstructured.
- The realization that time-to-information is critical to extracting value from data sources that include mobile devices, RFID, the web and a growing list of automated sensory technologies.

As requirements change, you can easily resize your environment (horizontally or vertically) on AWS to meet your needs.

In addition, there are at least four major developmental segments that underscore the diversity to be found within Big Data analytics: MapReduce, scalable databases, real-time stream processing and Big Data appliances.

Using Hadoop for Big Data analytics: There is a big difference between Big Data and Hadoop: the former is an asset, often a complex and ambiguous one, while the latter is a framework that accomplishes a set of goals and objectives for dealing with that asset.

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop is a framework that allows processing of large data sets, completing in minutes tasks that would take an RDBMS hours.

Hadoop has two main components:

- HDFS, the Hadoop Distributed File System (for storage)
- MapReduce (for processing)

How the Hadoop Distributed File System works: The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. An HDFS cluster contains one or more data nodes. Incoming data is split into segments and distributed across the data nodes to support parallel processing. Each segment is then replicated on multiple data nodes so that processing can continue in the event of a node failure.
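
As a minimal sketch of how an application hands data to HDFS, the Java FileSystem API below copies a local file into the cluster and reads back its replication factor. The NameNode address and file paths are placeholders, not values from this document.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the NameNode; host and port are placeholders.
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; the cluster splits it into blocks
            // and replicates each block across data nodes.
            fs.copyFromLocalFile(new Path("/tmp/input.log"),
                                 new Path("/data/input.log"));

            // Inspect how the file is stored.
            FileStatus status = fs.getFileStatus(new Path("/data/input.log"));
            System.out.println("Replication factor: " + status.getReplication());
            System.out.println("Block size: " + status.getBlockSize());

            fs.close();
        }
    }

By default HDFS keeps three replicas of each block, which is the property that lets processing continue when a node fails.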

While HDFS protects against some types of failure, it is not entirely fault tolerant. A single NameNode located on a single server is required; if this server fails, the entire file system shuts down. A secondary NameNode periodically backs up the primary. The backup data is used to restart the primary but cannot be used to maintain operation.

HDFS is typically used in a Hadoop installation, yet other distributed file systems are also supported. The Amazon S3 file system can be used, but it does not maintain information on the location of data segments, reducing the ability of Hadoop to survive server or rack failures. Other file systems, such as the open-source CloudStore and the MapR file system, do maintain location information.
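
For illustration, here is a small sketch of reaching S3 through the same FileSystem API, assuming the s3a connector that ships with recent Hadoop releases is on the classpath; the bucket name and credentials are placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ListExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY"); // placeholder
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY"); // placeholder

            // The FileSystem API is the same as for HDFS, but S3 reports no
            // block locations, so tasks cannot be scheduled next to the data.
            FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket"), conf);
            for (FileStatus f : s3.listStatus(new Path("/input/"))) {
                System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
            }
        }
    }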

Distributed processing is handled by MapReduce: The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
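
To make the map and reduce steps concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API; the class names and tokenizing logic are illustrative assumptions, not taken from this document.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in this node's input split.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the "filter" that condenses the mapped data by
    // summing the counts emitted for each distinct word.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

The framework groups the mapper's output by key between the two phases, so each reduce call sees one word together with all of its counts.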

The MapReduce layer consists of one JobTracker and multiple TaskTrackers. Client applications submit jobs to the JobTracker, which assigns each job to a TaskTracker node. When HDFS or another location-aware file system is in use, the JobTracker takes advantage of knowing the location of each data segment: it attempts to assign processing to the same node on which the required data has been placed.
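
A sketch of the client side: a driver class packages the mapper and reducer above into a job and submits it to the cluster for scheduling, where classic MapReduce tries to place each map task near its data segment. Class names and paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Client application: builds a job description and submits it to the
    // cluster, then waits for the distributed tasks to finish.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It would typically be launched with input and output paths as arguments, for example: hadoop jar wordcount.jar WordCountDriver /data/input /data/output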

Apache Hadoop users typically build their own parallelized computing clusters from commodity servers, each with dedicated storage in the form of a small disk array or solid-state drive (SSD) for performance. These are commonly referred to as “shared-nothing” architectures.

Big Data is getting bigger and more important: As more and more data is collected, analysing it requires scalable, flexible and high-performing tools that deliver analysis and insight in a timely fashion. Big Data analytics is a growing field, and the need to parse large data sets from multiple sources and to produce information in real time or near-real time is gaining importance. IT organizations are exploring various analytics technologies to parse web-based data sources and extract value from the social networking boom.