
Combining Hadoop with Big Data Analytics

Contents

Hadoop for Big Data Puts Architects on Journey of Discovery

Big Data Analytics Programs Require Tech Savvy, Business Know-How

    This E-Guide explores the growing popularity of leveraging Hadoop, an open source software framework, to tackle the challenges of big data analytics, as well as some of the key business and technical barriers to adoption. Gain expert advice on the skill sets and roles needed for successful big data analytics programs, and discover how organizations can design and assemble the right team.

Hadoop for Big Data Puts Architects on Journey of Discovery

By: Jessica Twentyman, Contributor

"Problems don't care how you solve them," said James Kobielus, an analyst with Forrester Research, in a recent blog post on the topic of big data. "The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal."

In recent years, that imperative -- to solve the problems posed by big data -- has led data architects at many organizations on a journey of discovery. Put simply, the traditional database and business intelligence tools that they routinely use to slice and dice corporate data are just not up to the task of handling big data.

To put the challenge into perspective, think back a decade: few enterprise data warehouses grew as large as a terabyte. By 2009, Forrester analysts were reporting that over two-thirds of enterprise data warehouses (EDWs) deployed were in the 1 to 10 TB range. By 2015, they claim, a majority of EDWs in large organizations will be 100 TB or larger, with petabyte-scale EDWs becoming well entrenched in sectors such as telecoms, financial services and web commerce.

"What's needed to analyze these big data stores are special new tools and approaches that deliver extremely scalable analytics," said Kobielus. By extremely scalable, he means analytics that can deal with data that stands out in four key ways: its volume (from hundreds of terabytes to petabytes and beyond); its velocity (up to and including real-time, sub-second delivery); its variety (diverse structured, unstructured and semi-structured formats); and its volatility (wherein scores to hundreds of new data sources come and go from new applications, services, social networks and so on).

    Hadoop for Big Data on the Rise

One of the principal approaches to emerge in recent years has been Apache Hadoop, an open source software framework that supports data-intensive distributed analytics involving thousands of nodes and petabytes of data. The underlying technology was originally invented at search engine giant Google by in-house developers looking for a way to usefully index textual data and other rich information and present it back to users in meaningful ways. They called this technology MapReduce, and today's Hadoop is an open source implementation of it, used by data architects for analytics that are deep and computationally intensive.

In a recent survey of enterprise Hadoop users conducted by Ventana Research on behalf of Hadoop specialist Karmasphere, 94% of Hadoop users reported that they can now perform analyses on large volumes of data that weren't possible before; 88% said that they analyze data in greater detail; and 82% can now retain more of their data.

It's still a niche technology, but Hadoop's profile received a serious boost over the past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata. EMC bought data warehousing specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, InfoSphere BigInsights, in May 2011.

"What stood out about Greenplum for EMC was its x86-based, scale-out MPP, shared-nothing design," said Mark Sears, architect at the new organization EMC Greenplum. Not everyone requires a set-up like that, but more and more customers do, he said.

In many cases, he continued, these are companies that are already adept at analysis -- at mining customer data for buying trends, for example. In the past, however, they may have had to query a random sample of data relating to 500,000 customers from an overall pool of data on 5 million customers. That opens up the risk that they might miss something important. With a big data approach, it's perfectly possible and relatively cost-effective to query data relating to all 5 million.
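The sampling risk Sears describes can be made concrete with a small simulation. The customer counts and segment rate below are hypothetical, chosen only to echo the 500,000-of-5-million example: a rare segment is represented by only a few hundred records in the sample, so segment-level estimates are far noisier than a full scan.

```python
import random

# Hypothetical illustration of sampling risk: a rare but important customer
# segment (0.1% of the base) is well represented in a full scan of 5 million
# customers, but only thinly represented in a 500,000-customer sample.
random.seed(42)

N = 5_000_000
SAMPLE = 500_000
RARE_RATE = 0.001  # assumed share of the rare segment

# Full scan: count every rare-segment customer in the simulated base.
rare_total = sum(random.random() < RARE_RATE for _ in range(N))

# 10% sample: the same segment yields roughly a tenth as many records,
# so any statistic computed on it carries much more sampling error.
sample_hits = sum(random.random() < RARE_RATE for _ in range(SAMPLE))

print(f"rare customers seen in full scan:   ~{rare_total}")
print(f"rare customers seen in 10% sample:  ~{sample_hits}")
```

The absolute numbers are fabricated; the point is the order-of-magnitude gap in how much evidence each approach has about the rare segment.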

Hadoop Ill-Understood by the Business

While there's a real buzz around Hadoop among technologists, it's still not widely understood in business circles. In the simplest terms, Hadoop is engineered to run across a large number of low-cost commodity servers and distribute data volumes across that cluster, keeping track of where individual pieces of data reside using the Hadoop Distributed File System (HDFS). Analytic workloads are performed across the cluster, in a massively parallel processing (MPP) model, using tools such as Pig and Hive. Results, meanwhile, are delivered as a unified whole.
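The MapReduce model that underpins this can be sketched in miniature. The following is not Hadoop itself, just an in-process Python sketch of the three phases -- map, shuffle-and-sort, reduce -- using the classic word-count example; on a real cluster, Hadoop would run the map and reduce functions on many nodes and handle the shuffle between them.

```python
from itertools import groupby

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (input must be sorted by key)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

def word_count(lines):
    # The sort stands in for Hadoop's shuffle-and-sort step, which routes
    # all values for a given key to the same reducer.
    shuffled = sorted(map_phase(lines))
    return dict(reduce_phase(shuffled))

if __name__ == "__main__":
    print(word_count(["big data big analytics", "big hadoop"]))
```

Because each map call and each per-key reduction is independent, the same logic scales from this toy to thousands of nodes -- which is the property that makes the model attractive for the workloads described above.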

Earlier this year, Gartner analyst Marcus Collins mapped out some of the ways he's seeing Hadoop used today: by financial services companies, to discover fraud patterns in years of credit-card transaction records; by mobile telecoms providers, to identify customer churn patterns; and by academic researchers, to identify celestial objects from telescope imagery.

"The cost-performance curve of commodity servers and storage is putting seemingly intractable complex analysis on extremely large volumes of data within the budget of an increasingly large number of enterprises," he concluded. "This technology promises to be a significant competitive advantage to early adopter organizations."

Early adopters include daily deal site Groupon, which uses the Cloudera distribution of Hadoop to analyse transaction data generated by over 70 million registered users worldwide, and NYSE Euronext, which operates the New York Stock Exchange as well as other equities and derivatives markets across Europe and the US. NYSE Euronext uses EMC Greenplum to manage transaction data that is growing, on average, by as much as 2 TB each day. US bookseller Barnes & Noble, meanwhile, is using Aster Data nCluster (now owned by Teradata) to better understand customer preferences and buying patterns across three sales channels: its retail outlets, its online store and e-reader downloads.

    Skills Deficit in Big Data Analytics

While Hadoop's cost advantage might give it widespread appeal, the skills challenge it presents is more daunting, according to Collins. "Big data analytics is an emerging field, and insufficiently skilled people exist within many organizations or the [wider] job market," he said.

Skills-related issues that stand in the way of Hadoop adoption include the technical nature of the MapReduce framework, which requires users to develop Java code, and the lack of institutional knowledge on how to architect the Hadoop infrastructure. Another impediment to adoption is the lack of support for analysis tools used to examine data residing within the Hadoop architecture.

That, however, is starting to change. For a start, established BI tool vendors are adding support for Apache Hadoop: one example is Pentaho, which introduced support in May 2010 and, subsequently, specific support for the EMC Greenplum distribution.

Another sign of Hadoop's creep towards the mainstream is support from data integration suppliers, such as Informatica. One of the most common uses of Hadoop to date has been as an engine for the transformation stage of ETL (extraction, transformation, loading) processes: the MapReduce technology lends itself well to the task of preparing data for loading and further analysis in both conventional RDBMS-based data warehouses and those based on HDFS/Hive. In response, Informatica announced in May 2011 that integration is underway between EMC Greenplum and its own Data Integration Platform.
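The reason the transformation stage suits MapReduce is that each record can be cleaned independently of every other record. The sketch below is a hypothetical example (the pipe-delimited sales format and field names are invented for illustration) of the kind of per-record transform that a map task would run on every node.

```python
from datetime import datetime

# Hypothetical ETL "transform" step: parse raw pipe-delimited sale records
# into clean, typed rows ready for loading. Because each line is handled
# independently, the work splits naturally across any number of map tasks.

def transform(raw_line):
    """Turn one raw 'customer_id|date|amount' line into a dict, or None."""
    parts = [p.strip() for p in raw_line.split("|")]
    if len(parts) != 3:
        return None  # malformed input is dropped (or routed to a reject file)
    customer_id, date_str, amount_str = parts
    try:
        return {
            "customer_id": int(customer_id),
            "sale_date": datetime.strptime(date_str, "%Y-%m-%d").date(),
            "amount": round(float(amount_str), 2),
        }
    except ValueError:
        return None  # unparseable fields are also rejected

def transform_all(raw_lines):
    # In a real Hadoop job this loop is the mapper, run in parallel per split.
    return [rec for rec in map(transform, raw_lines) if rec is not None]

if __name__ == "__main__":
    rows = ["1001 | 2011-05-03 | 19.99", "bad row", "1002|2011-05-04|5"]
    for rec in transform_all(rows):
        print(rec)
```

A production job would add reject-file handling and schema checks, but the shape -- stateless record-at-a-time cleansing -- is what makes the "T" of ETL such a natural Hadoop workload.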


Additionally, there's the emergence of Hadoop appliances, as evidenced by EMC's Greenplum HD (a bundle that combines MapR's version of Hadoop, the Greenplum database and a standard x86-based server), announced in May, and more recently (August 2011), the Dell/Cloudera Solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera's Hadoop distribution and its Cloudera Enterprise management tools).

Finally, Hadoop is proving well suited to deployment in cloud environments, a fact which gives IT teams the chance to experiment with pilot projects on cloud infrastructures. Amazon, for example, offers Amazon Elastic MapReduce as a hosted service running on its Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The Apache distribution of Hadoop, moreover, now comes with a set of tools designed to make it easier to deploy and run on EC2.

Running Hadoop on EC2 is particularly appropriate where the data already resides within Amazon S3, said Collins. If the data does not reside on Amazon S3, he said, then it must be transferred: the Amazon Web Services (AWS) pricing model includes charges for network costs, and Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and S3 pricing. Given the volume of data required to run big data analytics, the cost of transferring data and running the analysis should be carefully considered, he cautioned.
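Collins' caution is easy to quantify with back-of-the-envelope arithmetic. The rates below are purely illustrative placeholders, not real AWS prices, but the structure -- a per-gigabyte transfer charge plus a per-node-hour cluster charge -- matches the volume-based pricing model he describes.

```python
# Back-of-the-envelope cloud-cost sketch. Both rates are invented for
# illustration only; actual AWS pricing differs and changes over time.

TRANSFER_USD_PER_GB = 0.10    # assumed network transfer rate into the cloud
EMR_USD_PER_NODE_HOUR = 0.30  # assumed EC2 instance rate plus EMR surcharge

def job_cost(data_gb, nodes, hours):
    """Estimate one analysis run: move the data in, then rent the cluster."""
    transfer = data_gb * TRANSFER_USD_PER_GB
    compute = nodes * hours * EMR_USD_PER_NODE_HOUR
    return transfer + compute

# Example: a 2 TB data set analyzed on a 20-node cluster for 4 hours.
# Note that at big data volumes the transfer term dominates the compute term,
# which is why Collins stresses running EMR against data already in S3.
cost = job_cost(data_gb=2048, nodes=20, hours=4)
print(f"estimated cost of one run: ${cost:.2f}")
```

Data already resident in S3 drops the first term to zero, turning a transfer-dominated bill into a compute-only one -- the crux of the recommendation above.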

The bottom line is that Hadoop is the future of the EDW, and its footprint in companies' core EDW architectures is likely to keep growing throughout this decade, said Kobielus.

Big Data Analytics Programs Require Tech Savvy, Business Know-How

By: Rick Sherman, Contributor

Articles and case studies written about big data analytics programs proclaim the successes of organizations, typically with a focus on the technologies used to build the systems. Product information is useful, of course, but it's also critical for enterprises embarking on these projects to be sure they have people with the right skills and experience in place to successfully leverage big data analytics tools.

Knowledge of new or emerging big data technologies, such as Hadoop, is often required, especially if a project involves unstructured or semi-structured data. But even in such cases, Hadoop know-how is only a small portion of the overall staffing requirements; skills are also needed in areas such as business intelligence (BI), data integration and data warehousing. Business and industry expertise is another must-have for a successful big data analytics program, and many organizations also require advanced analytics skills, such as predictive analytics and data mining capabilities.

Data scientist is a new job title that is getting some industry buzz. But it's really a combination of skills from the areas listed above, including predictive modeling, both statistical and business analysis, and data integration development. I've seen some job postings for data scientists that require a statistics degree as well as an industry-specific one. Having both degrees and matching professional experience is great, but what are the chances of finding job candidates who do? Pretty slim!

Taking a more realistic look at what to plan for, I describe below the key skill sets and roles that should be part of big data analytics deployments. Instead of listing specific job titles, I've kept it general; organizations can bundle together the required roles and responsibilities in various ways based on the capabilities and experience levels of their existing workforce and the new employees that they hire.

Business knowledge. In addition to technical skills, companies engaging in big data analytics projects need to involve people with extensive business and industry expertise. That must include knowledge of a company's own business strategy and operations as well as its competitors, current industry conditions and emerging trends, customer demographics and both macro- and microeconomic factors.


Much of the business value derived from big data analytics comes not from textbook key performance indicators but rather from insights and metrics gleaned as part of the analytics process. That process is part science (i.e., statistics) and part art, with users doing what-if analysis to gain actionable information about an organization's business operations. Such findings are only possible with the participation of business managers and workers who have firsthand knowledge of business strategies and issues.

Business analysts. This group also has an important role to play in helping organizations to understand the business ramifications of big data. In addition to doing analytics themselves, business analysts can be tasked with things such as gathering business and data requirements for a project, helping to design dashboards, reports and data visualizations for presenting analytical findings to business users, and assisting in measuring the business value of the analytics program.

BI developers. These people work with the business analysts to build the required dashboards, reports and visualizations for business users. In addition, depending on internal needs and the BI tools that an organization is using, they can enable self-service BI capabilities by preparing the data and the required BI templates for use by business executives and workers.

Predictive model builders. In general, predictive models for analyzing big data need to be custom-built. Deep statistical skills are essential to creating good models; too often, companies underestimate the required skill level and hire people who have used statistical tools, including models developed by others, but don't have knowledge of the mathematical principles underlying predictive models.

Predictive modelers also need some business knowledge and an understanding of how to gather and integrate data into models, although in both cases they can leverage other people's more extensive expertise in creating and refining models. Another key skill for modelers to have is the ability to assess how comprehensive, clean and current the available data is before building models. There often are data gaps, and a modeler has to be able to close them.
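The distinction Sherman draws -- running statistical tools versus understanding the mathematics behind them -- can be illustrated with the simplest predictive model there is. This toy sketch (not any vendor's product, and using invented example data) fits a one-variable regression from its closed-form definition: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).

```python
# Toy predictive model built from first principles: ordinary least squares
# for a single predictor. A modeler who knows this derivation can diagnose
# a model; one who only knows the tool's "fit" button often cannot.

def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    return intercept + slope * x

if __name__ == "__main__":
    # Invented data: marketing spend (x) vs. sales (y), exactly linear here.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]
    a, b = fit_ols(xs, ys)
    print(f"fitted model: y = {a:g} + {b:g}x")
```

Real big data models are far richer than this, but the staffing point stands: the formula, its assumptions and its failure modes are the skill, not the button that runs it.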


Data architects. A big data analytics project needs someone to design the data architecture and guide its development. Typically, data architects will need to incorporate various data structures into an architecture, along with processes for capturing and storing data and making it available for analysis. This role involves traditional IT skills such as data modeling, data profiling, data quality, data integration and data governance.

Data integration developers. Data integration is important enough to require its own developers. These folks design and develop integration processes to handle the full spectrum of data structures in a big data environment. In doing so, they ideally ensure that integration is done not in silos but as part of a comprehensive data architecture.

It's also best to use packaged data integration tools that support multiple forms of data, including structured, unstructured and semi-structured sources. Avoid the temptation to develop custom code to handle extract, transform and load operations on pools of big data; hand coding can increase a project's overall costs -- not to mention the likelihood of an organization being saddled with undocumented data integration processes that can't scale up as data volumes continue to grow.

Technology architects. This role involves designing the underlying IT infrastructure that will support a big data analytics initiative. In addition to understanding the principles of designing traditional data warehousing and BI systems, technology architects need to have expertise in the new technologies that often are used in big data projects. And if an organization plans to implement open source tools, such as Hadoop, the architects need to understand how to configure, design, develop and deploy such systems.

It can be exciting to tackle a big data analytics project and to acquire new technologies to support the analytics process. But don't be lulled into thinking that tools alone spell big data success. The real key to success is ensuring that your organization has the skills it needs to effectively manage large and varied data sets as well as the analytics that will be used to derive business value from the available data.

About the Author:

Rick Sherman is the founder of Athena IT Solutions, which provides consulting, training and vendor services on business intelligence, data integration and data warehousing. Sherman has written more than 100 articles and spoken at dozens of events and webinars; he also is an adjunct faculty member at Northeastern University's Graduate School of Engineering. He blogs at The Data Doghouse and can be reached at rsherman@athena-solutions.com.
