Five Reasons Your Data Analytics Strategy Must Include Hadoop


Overcoming Current Limitations

Source: The Data Warehousing Institute

Introduction

Inspired by the Google File System and MapReduce, Hadoop has emerged as an open-source software framework that supports data-intensive distributed applications. Early adopters such as Facebook, Twitter and Yahoo are successfully using Hadoop to tackle their big data analytic challenges. However, many other organizations struggle to get their Hadoop projects off the ground because the framework lacks end-user tools and has a steep learning curve.

Datameer offers the first big data analytics solution that brings the power of Hadoop analytics to end-users. This paper explains the compelling advantages of Hadoop-based analytics, Hadoop’s challenges for end-users and how the Datameer Analytics Solution overcomes these challenges.

[Chart: survey results, percent of organizations: expect more than 10 TB of data, 34; current solution can’t scale, 37; cost of scaling is too expensive, 33; looking for a more cost-effective platform, 57]

“Hadoop promises to become a ubiquitous framework for large scale business intelligence, but right now it is difficult for many developers to use. I see tremendous value in Datameer’s approach - making Hadoop accessible to more users who need scalable analytic power for their organization’s big data requirements.”

Shawn Rogers, VP of Research, Business Intelligence, Enterprise Management Associates

1. Cost Effective Scale from Terabytes to Petabytes of Data

According to the May 2011 McKinsey report “Big data: The next frontier for innovation, competition, and productivity”, the amount of data generated globally grows 40% per year while global IT spending will increase only 5% per year. McKinsey also estimates that big data offers retailers a potential 60% increase in operating margins and that 1.5 million data-savvy managers are needed to take full advantage of big data in the US alone.
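To make the gap between the two growth rates concrete, the short Python sketch below compounds both figures from the same starting point. It assumes, purely for illustration, that each rate holds constant year over year; the function names are hypothetical and not part of any product.

```python
def compound(start: float, annual_rate: float, years: int) -> float:
    """Value after `years` of growth at `annual_rate` (0.40 means 40% per year)."""
    return start * (1.0 + annual_rate) ** years


def growth_gap(years: int, data_rate: float = 0.40, spend_rate: float = 0.05) -> float:
    """How many times faster data has grown than IT spending after `years`.

    Both quantities start at 1.0, so the result is a unitless ratio.
    The default rates are the McKinsey figures quoted above; holding them
    constant over several years is an assumption made for this sketch.
    """
    return compound(1.0, data_rate, years) / compound(1.0, spend_rate, years)


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(f"after {years} year(s): data outgrows IT spending by {growth_gap(years):.2f}x")
```

Under these assumptions, the ratio passes 4x within five years, which is the budget pressure that makes cost per terabyte the deciding factor.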

BIG DATA FACTS

Enterprise data will grow 650% in the next five years - 80% will be unstructured – Gartner

Enterprise data doubles every three years – Forrester

Unstructured data grows at 61.7% CAGR – IDC

Structured data grows at 21.8% CAGR – IDC

By 2013, the amount of traffic flowing over the Internet annually will reach 667 exabytes – The Economist

Enterprise Data Warehouses


In the 2009 survey from The Data Warehousing Institute, 34% of all organizations expect their data warehouses to grow to more than ten terabytes of data. However, 37% believe that their current solution can’t scale and 33% believe the cost of scaling is too expensive. As a result, 57% of the organizations are looking for a more cost-effective platform.

Hadoop, designed for the cost effective storage and processing of large volumes of data, is proven to scale to 4000 servers and petabytes of data. Hadoop scales on clusters of commodity hardware of choice and provides an affordable alternative to expensive database servers for the storage of large data volumes.

However, implementation costs can be high for Hadoop because this open source framework is complex and offers few tools for data loading, analytics and reporting. The Datameer Analytics Solution (DAS) provides functionality across all business intelligence areas, including data loading and integration, data analytics and data visualization. DAS dramatically decreases IT resource requirements, implementation time and ongoing administrative costs.

2. Analysis of Highly Granular Data, Not Just Summaries

Granular data is not aggregated. Typically, it represents a single event such as a customer purchasing a product at a given time of day for a specific price. Summary data, on the other hand, might aggregate this data to total product sales by all customers for the day, the week, the month, the quarter or the year.
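The difference between granular and summary data can be illustrated with a few lines of Python. This is a minimal sketch: the purchase records, field names and the `daily_product_sales` function are invented for the example and do not come from any real data set.

```python
from collections import defaultdict
from datetime import datetime

# Granular data: one record per purchase event (hypothetical sample values).
PURCHASES = [
    {"customer": "c1", "product": "widget", "time": "2011-05-02T09:15:00", "price": 20.0},
    {"customer": "c2", "product": "widget", "time": "2011-05-02T17:40:00", "price": 20.0},
    {"customer": "c1", "product": "gadget", "time": "2011-05-02T18:05:00", "price": 5.0},
    {"customer": "c3", "product": "widget", "time": "2011-05-03T11:30:00", "price": 20.0},
]


def daily_product_sales(events):
    """Summarize purchase events into total sales per (day, product).

    The customer and the time of day are discarded by the aggregation,
    which is exactly the detail a summary table can no longer answer for.
    """
    totals = defaultdict(float)
    for event in events:
        day = datetime.fromisoformat(event["time"]).date().isoformat()
        totals[(day, event["product"])] += event["price"]
    return dict(totals)


if __name__ == "__main__":
    for (day, product), total in sorted(daily_product_sales(PURCHASES).items()):
        print(day, product, total)
```

A question such as "which customers bought widgets in the evening?" can be answered from `PURCHASES` but not from the summary, which is why keeping the granular records matters.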

Increasingly, organizations are demanding access to more granular data for applications such as data mining, predictive analytics, operational business intelligence, and historical transaction-level reporting. Yet, due to scale restrictions, organizations must either limit the time period for which they can store granular data or settle for some degree of summarization.

The Hadoop Distributed File System (HDFS) is designed for the storage of massive amounts of granular data. It provides high data bandwidth and scales to hundreds of nodes in a single cluster. In addition, it supports tens of millions of files in a single instance.

Unfortunately, even though HDFS offers unparalleled storage scalability, business users can still quickly become overwhelmed by the sheer size and scale of the data stored. Therefore, Hadoop analytics users need the pre-built functionality of an analytics solution such as DAS so they can concentrate on data integration, analysis and visualization rather than spending their time on low-level infrastructure concerns such as how to get data into Hadoop. Patent-pending data sampling techniques in DAS allow users to work interactively with representative subsets of the big data as they design their analytics. Once the user has designed the analytics, they can be run against the full data set.
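The general workflow of designing on a sample and then running on the full data set can be sketched as follows. This is not Datameer's sampling technique, which is not described here; it uses plain uniform random sampling from the Python standard library, and the record layout and function names are assumptions made for the example.

```python
import random


def average_order_value(records):
    """The analytic being designed: mean of the `amount` field."""
    if not records:
        return 0.0
    return sum(r["amount"] for r in records) / len(records)


def draw_sample(records, k, seed=0):
    """Return up to `k` records chosen uniformly at random without replacement.

    A fixed seed keeps the sample stable between interactive runs.
    """
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def make_demo_records(n=10_000):
    """Synthetic stand-in for a large data set (hypothetical values)."""
    return [{"amount": float(i % 100)} for i in range(n)]


if __name__ == "__main__":
    full = make_demo_records()
    sample = draw_sample(full, 500)
    # Iterate quickly on the small sample while designing the analytic ...
    print("preview on sample:", average_order_value(sample))
    # ... then run the same function unchanged against everything.
    print("result on full data:", average_order_value(full))
```

The point of the pattern is that the analytic function is written once; only the input changes between the interactive design phase and the full run.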

“Attributor’s selection of Datameer was driven by our need to quickly provide analytics to our clients. Datameer’s ease-of-use, seamless integration with Cloudera’s CDH, HBase and MySQL and ability to correlate structured and unstructured data on day one has already saved us both time and money in running thousands of analytics jobs for our users.”

Matt Robinson, President and COO, Attributor


3. Inclusion of Structured and Unstructured Data

In the context of data warehousing, unstructured data refers to information that either does not have a data model or has one that is not easily usable by data warehouse applications. Common examples include Word documents, video and audio files, call detail records, clickstream data, log files, email and social media data.

According to Gartner, enterprise data will grow 650% in the next five years — 80% will be unstructured. It is clear that enterprise analytics must address both structured and unstructured data as well as correlations between the two. However, traditional data warehouse technologies are optimized solely for structured data stored in relational database tables.

One of the distinct advantages of Hadoop is that it is designed for the storage and processing of both structured and unstructured data because it does not impose a data model on information. DAS extends this flexibility to its data loading, data analytics and data visualization reporting tools so that every aspect of the end-to-end solution accommodates unstructured content just as well as structured data.

4. Responding to Constantly Changing Data Requirements

Rigorous data warehouse design and integration requirements make it nearly impossible to accommodate requests for new data in a timely manner. Confronted with more and more users who want immediate access to an ever-growing amount of data from both inside and outside the organization, IT simply doesn’t have the resources and budget to keep up with demand.

Hadoop stores raw data and, therefore, does not require complex data and schema mappings that are time consuming to set up and difficult to change. Performance optimization is achieved through elastic scalability rather than hard-coded schemas.
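The idea of storing raw data and applying structure only when it is read can be sketched in a few lines of Python. This is an illustration of the principle only, not Hadoop or DAS code; the sample log lines, field names and the `read_with_schema` function are hypothetical.

```python
import json

# Raw lines stored exactly as they arrived; no schema was declared up front.
RAW_LINES = [
    '{"user": "alice", "action": "login", "ms": 120}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
    "not json at all",
]


def read_with_schema(raw_lines, fields):
    """Yield one dict per parseable line, projected onto `fields`.

    The "schema" is just the list of fields the current question needs.
    Fields absent from a record come back as None, and lines that are
    not JSON objects are skipped rather than rejected at load time.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


if __name__ == "__main__":
    # Two different questions, two different views of the same raw data.
    print(list(read_with_schema(RAW_LINES, ["user", "action"])))
    print(list(read_with_schema(RAW_LINES, ["user", "amount"])))
```

Because nothing about the stored lines changes, a new data requirement means passing a different field list rather than redesigning tables and reloading data.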

Although it requires less complex design and modeling than relational approaches, Hadoop lacks data loading and integration tools. DAS provides complete data loading functionality, including pre-built connectors for structured and unstructured data. With the DAS import functionality, users automatically load and store raw data in Hadoop and do translation on the fly.

5. Fast Integration of Multiple Data Sources

Enterprises are facing a proliferation of data and data sources. Business users need access to data regardless of its location, whether it comes from channels such as relational databases, call center logs and Web logs or from new social media sites such as Twitter, where the data sits outside a company’s systems and infrastructure.

Just as hard-coded data and schema mappings impede the agility to respond to constantly changing data requirements in a timely manner, these requirements also make bringing in new data sources very challenging. Built on top of Hadoop’s raw data storage, DAS automates the import of data from multiple data sources and exports data to data stores such as data warehouses as well as business intelligence tools. This makes it easier for users to extend their analytics and to share their data.


©2013 Datameer, Inc. All rights reserved. Datameer is a trademark of Datameer, Inc. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. Other names may be trademarks of their respective owners.

Datameer, Inc.
2040 Pioneer Court
San Mateo, CA 94403

@datameer

linkedin.com/company/datameer

T 650 286 9100
F 650 286 9103
www.datameer.com

Conclusion

Business users work under tight deadlines to provide answers to pressing business questions. Yet, in today’s dynamic business environment, data and analytic requirements are ever changing and users never seem to have access to what they need when they need it. Even when data is available, it can be weeks or months old or at the wrong level of detail for the needs of the analysis.

The overstretched IT department simply can’t keep up with user demand, and the backlog of requests keeps getting worse as the volume of data, and the variety of its locations and formats, grows both inside the enterprise and on new social media sites. While business users wait, companies miss out on opportunities that could have positively impacted their bottom line.

Don’t let your company fall behind in accessing and using information assets for strategic advantage. Consider the Datameer Analytics Solution on Hadoop to remove the technical barriers to:

• Cost effectively scaling for big data requirements
• Storing and analyzing granular data
• Inclusion of structured and unstructured data
• Agility in responding to dynamic business requirements
• Rapid data source integration

Ready to learn more? Simply contact Datameer at 650-286-9100 or email us at [email protected]. A number of resources are also available at www.datameer.com, including informative weekly webcasts and a free Trial Edition.

About Datameer

Datameer offers the first data analytics solution built on Apache Hadoop that helps business users access, analyze and use massive amounts of data. Founded by Hadoop veterans in 2009, the company’s breakthrough product, Datameer Analytics Solution (DAS), provides unparalleled access to data with minimal IT resources. DAS can scale to 4,000 servers and petabytes of data, while also delivering low TCO. Datameer is based in San Mateo, California. For more information, please visit us at www.datameer.com.