
Combining Hadoop with Big Data Analytics


Contents

Hadoop for Big Data Puts Architects on Journey of Discovery

‘Big Data’ Analytics Programs Require Tech Savvy, Business Know-How

This E-Guide explores the growing popularity of leveraging Hadoop, an open source software framework, to tackle the challenges of big data analytics, as well as some of the key business and technical barriers to adoption. Gain expert advice on the skill sets and roles needed for successful big data analytics programs, and discover how organizations can design and assemble the right team.

Hadoop for Big Data Puts Architects on Journey of Discovery By: Jessica Twentyman, Contributor

"Problems don't care how you solve them," said James Kobielus, an analyst with Forrester Research, in a recent blog post on the topic of 'big data'. "The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal."

In recent years, that imperative -- to solve the problems posed by big data -- has led data architects at many organizations on a journey of discovery. Put simply, the traditional database and business intelligence tools that they routinely use to slice and dice corporate data are just not up to the task of handling big data.

To put the challenge into perspective, think back a decade: few enterprise data warehouses grew as large as a terabyte. By 2009, Forrester analysts were reporting that over two-thirds of enterprise data warehouses (EDWs) deployed were in the 1 to 10 TB range. By 2015, they claim, a majority of EDWs in large organizations will be 100 TB or larger, "with petabyte-scale EDWs becoming well entrenched in sectors such as telecoms, financial services and web commerce."

What's needed to analyze these big data stores are special new tools and approaches that deliver "extremely scalable analytics," said Kobielus. By "extremely scalable," he means analytics that can deal with data that stands out in four key ways: for its volume (from hundreds of terabytes to petabytes and beyond); its velocity (up to and including real-time, sub-second delivery); its variety (diverse structured, unstructured and semi-structured formats); and its volatility (wherein scores to hundreds of new data sources come and go from new applications, services, social networks and so on).

Hadoop for Big Data on the Rise

One of the principal approaches to emerge in recent years has been Apache Hadoop, an open source software framework that supports data-intensive distributed analytics involving thousands of nodes and petabytes of data. The underlying technology was originally invented at search engine giant Google by in-house developers looking for a way to usefully index textual data and other "rich" information and present it back to users in meaningful ways. They called this technology MapReduce, and today's Hadoop is built around an open-source implementation of it, used by data architects for analytics that are deep and computationally intensive.

In a recent survey of enterprise Hadoop users conducted by Ventana Research on behalf of Hadoop specialist Karmasphere, 94% of Hadoop users reported that they can now perform analyses on large volumes of data that weren't possible before; 88% said that they analyze data in greater detail; and 82% can now retain more of their data.

It's still a niche technology, but Hadoop's profile received a serious boost over the past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata. EMC bought analytic database specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, InfoSphere BigInsights, in May 2011.

What stood out about Greenplum for EMC was its x86-based, scale-out MPP, shared-nothing design, said Mark Sears, architect at the new organization EMC Greenplum. "Not everyone requires a set-up like that, but more and more customers do," he said.

"In many cases," he continued, "these are companies that are already adept at analysis, at mining customer data for buying trends, for example. In the past, however, they may have had to query a random sample of data relating to 500,000 customers from an overall pool of data on 5 million customers. That opens up the risk that they might miss something important. With a big data approach, it's perfectly possible and relatively cost-effective to query data relating to all 5 million."

Hadoop Ill Understood by the Business

While there's a real buzz around Hadoop among technologists, it's still not widely understood in business circles. In the simplest terms, Hadoop is engineered to run across a large number of low-cost commodity servers and distribute data volumes across that cluster, keeping track of where individual pieces of data reside using the Hadoop Distributed File System (HDFS). Analytic workloads are performed across the cluster, in a massively parallel processing (MPP) model, using tools such as Pig and Hive. Results, meanwhile, are delivered as a unified whole.
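The MapReduce model that underpins those tools can be sketched in miniature. The following is a plain-Python simulation of a word count, the canonical MapReduce example; it is an illustration only, not real Hadoop code, and on an actual cluster the map and reduce phases would run in parallel across many nodes against HDFS blocks:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in this input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the emitted pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# On a real cluster, HDFS splits the input across nodes and each node
# runs the map phase locally; here everything runs in one process.
docs = ["big data needs big tools", "hadoop handles big data"]
pairs = [p for doc in docs for p in map_phase(doc)]
result = reduce_phase(pairs)
```

The appeal of the model is that the per-record map logic and the per-key reduce logic are all a developer writes; the framework handles distribution, scheduling and fault tolerance.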

Earlier this year, Gartner analyst Marcus Collins mapped out some of the ways he's seeing Hadoop used today: by financial services companies, to discover fraud patterns in years of credit-card transaction records; by mobile telecoms providers, to identify customer churn patterns; and by academic researchers, to identify celestial objects from telescope imagery.

"The cost-performance curve of commodity servers and storage is putting seemingly intractable complex analysis on extremely large volumes of data within the budget of an increasingly large number of enterprises," he concluded. "This technology promises to be a significant competitive advantage to early adopter organizations."

Early adopters include daily deal site Groupon, which uses the Cloudera distribution of Hadoop to analyze transaction data generated by over 70 million registered users worldwide, and NYSE Euronext, which operates the New York Stock Exchange as well as other equities and derivatives markets across Europe and the US. NYSE Euronext uses EMC Greenplum to manage transaction data that is growing, on average, by as much as 2 TB each day. US bookseller Barnes & Noble, meanwhile, is using Aster Data nCluster (now owned by Teradata) to better understand customer preferences and buying patterns across three sales channels: its retail outlets, its online store and e-reader downloads.

Skills Deficit in Big Data Analytics

While Hadoop's cost advantage might give it widespread appeal, the skills challenge it presents is more daunting, according to Collins. "Big data analytics is an emerging field, and insufficiently skilled people exist within many organizations or the [wider] job market," he said.

Skills-related issues that stand in the way of Hadoop adoption include the technical nature of the MapReduce framework, which requires users to develop Java code, and the lack of institutional knowledge on how to architect the Hadoop infrastructure. Another impediment to adoption is the lack of support for analysis tools used to examine data residing within the Hadoop architecture.

That, however, is starting to change. For a start, established BI tools vendors are adding support for Apache Hadoop: one example is Pentaho, which introduced support in May 2010 and, subsequently, specific support for the EMC Greenplum distribution.

Another sign of Hadoop's creep towards the mainstream is support from data integration suppliers, such as Informatica. One of the most common uses of Hadoop to date has been as an engine for the 'transformation' stage of ETL (extraction, transformation, loading) processes: the MapReduce technology lends itself well to the task of preparing data for loading and further analysis in both conventional RDBMS-based data warehouses and those based on HDFS/Hive. In response, Informatica announced in May 2011 that integration is underway between EMC Greenplum and its own Data Integration Platform.
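As a rough illustration of that transformation role, a map-side cleanup step might look like the plain-Python sketch below. The three-field record layout and field names are hypothetical; under Hadoop's Streaming interface, the same function would read raw lines from stdin and write cleaned records to stdout, with each mapper handling one HDFS block in parallel:

```python
import json

def transform(raw_line):
    """Map-side ETL: parse a raw comma-separated record, drop bad rows,
    and emit a cleaned JSON record ready for warehouse loading.
    The (user_id, amount, country) layout is a hypothetical example."""
    parts = raw_line.strip().split(",")
    if len(parts) != 3:
        return None  # discard malformed rows
    user_id, amount, country = parts
    try:
        amount = round(float(amount), 2)  # normalize to two decimal places
    except ValueError:
        return None  # discard rows with a non-numeric amount
    return json.dumps({"user_id": user_id.strip(),
                       "amount": amount,
                       "country": country.strip().upper()})

# Simulated input: one well-formed row, one malformed, one well-formed.
raw = ["42, 19.999, uk", "bad row", "7, 3.5, de"]
cleaned = [rec for rec in (transform(line) for line in raw) if rec]
```

Because each line is transformed independently, the work parallelizes trivially, which is exactly why the transformation stage of ETL maps so naturally onto MapReduce.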


Additionally, there's the emergence of Hadoop appliances, as evidenced by EMC's Greenplum HD (a bundle that combines MapR's version of Hadoop, the Greenplum database and a standard x86-based server), announced in May, and more recently (August 2011), the Dell/Cloudera Solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera's Hadoop distribution and its Cloudera Enterprise management tools).

Finally, Hadoop is proving well suited to deployment in cloud environments, a fact which gives IT teams the chance to experiment with pilot projects on cloud infrastructures. Amazon, for example, offers Amazon Elastic MapReduce as a hosted service running on its Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The Apache distribution of Hadoop, moreover, now comes with a set of tools designed to make it easier to deploy and run on EC2.

"Running Hadoop on EC2 is particularly appropriate where the data already resides within Amazon S3," said Collins. If the data does not reside on Amazon S3, he said, then it must be transferred: the Amazon Web Services (AWS) pricing model includes charges for network costs, and Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and S3 pricing. "Given the volume of data required to run big data analytics, the cost of transferring data and running the analysis should be carefully considered," he cautioned.
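That caution can be made concrete with back-of-the-envelope arithmetic. The sketch below uses placeholder rates, not actual AWS prices, which vary by region and change over time:

```python
def estimate_job_cost(data_gb, transfer_per_gb, nodes, hours, node_hourly):
    """Back-of-the-envelope cost of a cloud big data job: a one-time
    transfer of the data set into S3 plus Elastic MapReduce cluster
    time. All rates here are placeholders, not real AWS prices."""
    transfer_cost = data_gb * transfer_per_gb
    compute_cost = nodes * hours * node_hourly
    return transfer_cost + compute_cost

# Hypothetical example: moving 5 TB into S3 at $0.10/GB, then running
# a 20-node cluster for 6 hours at $0.25 per node-hour.
cost = estimate_job_cost(5 * 1024, 0.10, nodes=20, hours=6, node_hourly=0.25)
```

Even with made-up rates, the shape of the result is instructive: for large data sets the transfer term dominates, which is why Collins recommends running Hadoop on EC2 chiefly when the data already lives in S3.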

The bottom line is that "Hadoop is the future of the EDW and its footprint in companies' core EDW architectures is likely to keep growing throughout this decade," said Kobielus.

‘Big Data’ Analytics Programs Require Tech Savvy, Business Know-How By: Rick Sherman, Contributor

Articles and case studies written about "big data" analytics programs proclaim the successes of organizations, typically with a focus on the technologies used to build the systems. Product information is useful, of course, but it's also critical for enterprises embarking on these projects to be sure they have people with the right skills and experience in place to successfully leverage big data analytics tools.

Knowledge of new or emerging big data technologies, such as Hadoop, is often required, especially if a project involves unstructured or semi-structured data. But even in such cases, Hadoop know-how is only a small portion of the overall staffing requirements; skills are also needed in areas such as business intelligence (BI), data integration and data warehousing. Business and industry expertise is another must-have for a successful big data analytics program, and many organizations also require advanced analytics skills, such as predictive analytics and data mining capabilities.

Data scientist is a new job title that is getting some industry buzz. But it's really a combination of skills from the areas listed above, including predictive modeling, both statistical and business analysis, and data integration development. I've seen some job postings for data scientists that require a statistics degree as well as an industry-specific one. Having both degrees and matching professional experience is great, but what are the chances of finding job candidates who do? Pretty slim!

Taking a more realistic look at what to plan for, I describe below the key skill sets and roles that should be part of big data analytics deployments. Instead of listing specific job titles, I've kept it general; organizations can bundle together the required roles and responsibilities in various ways based on the capabilities and experience levels of their existing workforce and the new employees that they hire.

Business knowledge. In addition to technical skills, companies engaging in big data analytics projects need to involve people with extensive business and industry expertise. That must include knowledge of a company's own business strategy and operations as well as its competitors, current industry conditions and emerging trends, customer demographics and both macro- and microeconomic factors.


Much of the business value derived from big data analytics comes not from textbook key performance indicators but rather from insights and metrics gleaned as part of the analytics process. That process is part science (i.e., statistics) and part art, with users doing what-if analysis to gain actionable information about an organization's business operations. Such findings are only possible with the participation of business managers and workers who have firsthand knowledge of business strategies and issues.

Business analysts. This group also has an important role to play in helping organizations understand the business ramifications of big data. In addition to doing analytics themselves, business analysts can be tasked with things such as gathering business and data requirements for a project; helping to design dashboards, reports and data visualizations for presenting analytical findings to business users; and assisting in measuring the business value of the analytics program.

BI developers. These people work with the business analysts to build the required dashboards, reports and visualizations for business users. In addition, depending on internal needs and the BI tools that an organization is using, they can enable self-service BI capabilities by preparing the data and the required BI templates for use by business executives and workers.

Predictive model builders. In general, predictive models for analyzing big data need to be custom-built. Deep statistical skills are essential to creating good models; too often, companies underestimate the required skill level and hire people who have used statistical tools, including models developed by others, but don't have knowledge of the mathematical principles underlying predictive models.

Predictive modelers also need some business knowledge and an understanding of how to gather and integrate data into models, although in both cases they can leverage other people's more extensive expertise in creating and refining models. Another key skill for modelers to have is the ability to assess how comprehensive, clean and current the available data is before building models. There often are data gaps, and a modeler has to be able to close them.
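That pre-modeling assessment can start with something as simple as profiling each field for missing values. A minimal sketch, using hypothetical customer records and field names:

```python
def missing_rates(records, fields):
    """Return the fraction of records in which each field is absent,
    None or empty -- a first-pass data quality check a modeler can
    run before committing to building a predictive model."""
    rates = {}
    for field in fields:
        missing = sum(1 for rec in records if not rec.get(field))
        rates[field] = missing / len(records)
    return rates

# Hypothetical customer data with gaps in the 'income' field.
customers = [
    {"age": 34, "income": 52000},
    {"age": 41, "income": None},
    {"age": 29, "income": 47000},
    {"age": 55},
]
rates = missing_rates(customers, ["age", "income"])
```

A profile like this tells the modeler which fields are reliable inputs and which gaps must be closed, by imputation or by sourcing additional data, before any model is trusted.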


Data architects. A big data analytics project needs someone to design the data architecture and guide its development. Typically, data architects will need to incorporate various data structures into an architecture, along with processes for capturing and storing data and making it available for analysis. This role involves traditional IT skills such as data modeling, data profiling, data quality, data integration and data governance.

Data integration developers. Data integration is important enough to require its own developers. These folks design and develop integration processes to handle the full spectrum of data structures in a big data environment. In doing so, they ideally ensure that integration is done not in silos but as part of a comprehensive data architecture.

It's also best to use packaged data integration tools that support multiple forms of data, including structured, unstructured and semi-structured sources. Avoid the temptation to develop custom code to handle extract, transform and load operations on pools of big data; hand coding can increase a project's overall costs -- not to mention the likelihood of an organization being saddled with undocumented data integration processes that can't scale up as data volumes continue to grow.

Technology architects. This role involves designing the underlying IT infrastructure that will support a big data analytics initiative. In addition to understanding the principles of designing traditional data warehousing and BI systems, technology architects need to have expertise in the new technologies that often are used in big data projects. And if an organization plans to implement open source tools, such as Hadoop, the architects need to understand how to configure, design, develop and deploy such systems.

It can be exciting to tackle a big data analytics project and to acquire new technologies to support the analytics process. But don't be lulled into thinking that tools alone spell big data success. The real key to success is ensuring that your organization has the skills it needs to effectively manage large and varied data sets as well as the analytics that will be used to derive business value from the available data.

About the Author:

Rick Sherman is the founder of Athena IT Solutions, which provides consulting, training and vendor services on business intelligence, data integration and data warehousing. Sherman has written more than 100 articles and spoken at dozens of events and webinars; he also is an adjunct faculty member at Northeastern University's Graduate School of Engineering. He blogs at The Data Doghouse and can be reached at rsherman@athena-solutions.com.

