Accelerating big data ROI with Hadoop

MetaScale provides Hadoop big data solutions, training and support, in partnership with Dell, to help clients speed processing, improve decision support and realize major cost reductions.

Business need

MetaScale’s clients need much more cost-effective ways to derive intelligence from their enormous data stores while radically reducing costs.

Solution

One client achieved its ROI in just three months with MetaScale’s big data appliances, based on Dell servers, which manage several petabytes of data.

Benefits

• Reduces cost per million instructions per second (MIPS) by up to 160 times
• Cuts batch processing time from over 20 hours to less than an hour
• Improves decision support by providing better data faster
• Boosts data integrity throughout large enterprises
• Offers rapid and cost-effective scalability

Solutions at a glance

• Big Data

Customer profile

Company: MetaScale
Industry: Information Technology
Country: United States
Website: www.metascale.com

“Addressing exhausted enterprise data capacity can cost up to $800,000 per terabyte of data. But with Hadoop’s extreme scalability, adding terabytes can cost as little as $5,000 using MetaScale’s big data appliances based on Dell PowerEdge servers.” Ankur Gupta, General Manager, MetaScale




Products & Services

Hardware
Dell PowerEdge servers
http://www.dell.com/us/business/p/servers

Software
Apache™ Hadoop®
http://www.dell.com/learn/us/en/555/solutions/hadoop-big-data-solution

Large enterprises the world over are swamped with data. Transactional data. Operational data. As the Internet of Things grows, so grows data, exponentially. It’s called big data. And while it lacks a specific, industry-standard definition, experts generally agree it comprises massive amounts (terabytes, if not petabytes) of both structured and unstructured data.

In addition, its complexity exceeds the capacity and capability of traditional data management models like relational databases, which are falling far short in their ability to keep up with the world’s data explosion. High costs, scalability and data governance are all issues. Compounding those issues is the fact that data volumes may be growing, but IT budgets are not.

Mitigating the growing costs of data overload

One company at the forefront of helping large enterprises accelerate their development of scalable, highly cost-effective big data solutions is MetaScale, a global consulting firm with headquarters in Chicago, Illinois. As general manager of MetaScale, Ankur Gupta knows what’s driving big data. “With all the interconnectivity across all the different gadgets that are now part of our world, our lives are becoming more and more digital,” he says. “Thanks to that phenomenon, we’re collecting more data than ever before, to the point of it becoming a problem.”

It’s precisely this “problem” that is escalating the pains that companies face in dealing with big data. “One of many areas of concern,” Gupta says, “is increasing data complexity. It can delay data production schedules, while sending costs soaring. With more data, you have more copies of data, which leads to more complexity and time in dealing with it, along with higher costs of hardware, software and management.”

He adds that enterprise data users can replicate the same data sets, generating hundreds if not thousands of unnecessary duplications over time, all stored within expensive data warehouse platforms.

Millions of daily shoppers generate massive data challenges

MetaScale’s first client was one of North America’s largest retail groups, with many thousands of stores and tens of billions of dollars in annual sales. Its daily transactional and operational data volume is several terabytes, generated by millions of shoppers as well as many supply chains. In all, its current data volume exceeds 3 petabytes.

Gupta says that this client realized its data management, analysis and reporting capabilities were quickly falling behind the pace of growth of its data. As just one example, to help build customer relationships through special offers and discounts, the client wanted to contact its opt-in customers via email each week.

“Unfortunately, its legacy system was so bogged down with latency that the company could only run the report once every eight weeks,” Gupta explains. “The problems were threefold: getting data into its relational database; the complexity of its ETL [Extract-Transform-Load] processes; and latency in getting data out of the system.”

In another example, the client company’s pricing business unit needed daily summary reports using data from multiple platforms to measure the effectiveness of its pricing in stores. It was generating these reports from its data warehouse and predictive analytics tools, both hosted on mainframes. But the mainframes’ batch processing required for these reports was taking 10 to 15 hours, so the company could manage only weekly data warehouse loads. As a result, the pricing business unit wasn’t getting the decision support needed to fine-tune in-store pricing.

Harnessing cost-effective distributed processing with Apache Hadoop solutions

MetaScale’s approach to solving its client’s big data problems was to deploy an Apache™ Hadoop® solution: an open-source software framework running on MetaScale big data appliances based on Dell PowerEdge servers.

MetaScale worked closely with Dell to develop its bundled Hadoop solutions to meet its client’s growing needs for performance, scalability and support, and Gupta recognizes Dell as a key partner. “Powerful yet highly cost-effective hardware, like our big data appliances, which incorporate Dell PowerEdge servers, is what makes the Hadoop cluster work,” he says.

MetaScale’s big data appliances arrange the server nodes into logical clusters for handling large-scale data sets cost-effectively. Gupta says its client’s Hadoop solution has a cluster with more than 500 server nodes. And that doesn’t count its backup cluster.

A step-wise approach: a proven methodology

MetaScale approaches new client projects with a detailed methodology that emphasizes a measured, step-wise approach to Hadoop’s deployment across the enterprise. “A common misconception is that Hadoop is a turn-key solution that will solve all of a company’s data issues,” Gupta says. “In fact, Hadoop is much more intricate than that. You can’t just download it and expect it to work. Companies need to take a step back, identify major use cases within their business using their own data sets, validate those cases, and then develop a proof of concept before proceeding with deployment.”

Another misconception is that because Apache Hadoop is open-source software, it’s not enterprise-ready. Gupta points out that the platform’s broader ecosystem of commercial developers has spawned a wide range of powerful tools and supporting technologies, making it an extremely capable enterprise solution.

Extreme benefits, including 160-fold cost reductions and time savings

In this client’s Hadoop implementation, MetaScale helped the client achieve ROI just three months after becoming operational, with a cluster that at the time used 50 Dell server nodes to manage its growing data.

Gupta explains that by following MetaScale’s detailed project methodology, its client built a Hadoop cluster, initiated production in six months and achieved operational stability in 12 months. “Taking MetaScale’s measured, phased approach is the way to minimize risk and ensure success,” he says. “Once Hadoop is deployed — and done so correctly — paybacks are rapid.”

Cost savings are one source of those paybacks, according to Gupta. He points out that until Apache Hadoop debuted, enterprise IT’s only option for exhausted data warehouse and processing capacity was to deploy ever more computing power, measured in millions of instructions per second (MIPS), and storage.

“With processing power plus additional storage, software and support, addressing exhausted enterprise data capacity can cost up to $800,000 per terabyte of enterprise data,” he says. “But with Hadoop’s extreme scalability, adding terabytes can cost less than $5,000 per terabyte using MetaScale’s big data appliances based on Dell PowerEdge servers.”
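The 160-times figure follows directly from the two per-terabyte costs Gupta quotes. The short Python sketch below checks that arithmetic. The function names and the 10-terabyte expansion are illustrative assumptions, not figures from MetaScale, and both quoted costs are upper bounds, so real ratios will vary.

```python
# Per-terabyte costs quoted by Gupta. Both are upper bounds in the text.
LEGACY_COST_PER_TB = 800_000   # USD, "up to $800,000 per terabyte"
HADOOP_COST_PER_TB = 5_000     # USD, "less than $5,000 per terabyte"


def cost_ratio(legacy_per_tb: float, hadoop_per_tb: float) -> float:
    """Return how many times more expensive the legacy terabyte is."""
    return legacy_per_tb / hadoop_per_tb


def expansion_cost(terabytes: float, cost_per_tb: float) -> float:
    """Return the cost of adding `terabytes` of capacity at a flat rate."""
    return terabytes * cost_per_tb


ratio = cost_ratio(LEGACY_COST_PER_TB, HADOOP_COST_PER_TB)
print(f"Cost ratio: {ratio:.0f}x")  # 800,000 / 5,000 = 160

# Hypothetical 10 TB expansion, chosen only for illustration.
example_tb = 10
legacy_total = expansion_cost(example_tb, LEGACY_COST_PER_TB)
hadoop_total = expansion_cost(example_tb, HADOOP_COST_PER_TB)
print(f"{example_tb} TB on legacy capacity: ${legacy_total:,.0f}")
print(f"{example_tb} TB on Hadoop capacity: ${hadoop_total:,.0f}")
```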

“Time savings are the other source of ROI for Hadoop implementations,” Gupta says. MetaScale typically sees batch processing times cut from 20 hours or more, depending on the size of a data run, to less than an hour. That’s because Hadoop spreads data in blocks across the cluster in the Hadoop Distributed File System (HDFS) and dispatches processing to many data nodes simultaneously.

For example, in a 100-node cluster, up to 1,200 processors can be working at the same time. In addition, Hadoop can compress ETL time frames. “ETL time savings can vastly improve decision support by providing enterprise management with better data faster,” he says. “And our big data appliances using Dell’s PowerEdge servers provide the horsepower for that.”
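The idea of splitting a batch job across many workers and then merging the partial results can be illustrated with a small map-and-reduce sketch in plain Python. This is not Hadoop code and not MetaScale's implementation: the sales records, function names and block size are hypothetical, and a thread pool merely stands in for cluster nodes (Python threads do not provide the true parallel CPU execution a cluster does).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# From the text: a 100-node cluster with up to 1,200 processors.
NODES = 100
PROCESSORS = 1_200
processors_per_node = PROCESSORS // NODES
print(f"Processors per node: {processors_per_node}")


def split_into_blocks(records, block_size):
    """Split records into fixed-size blocks, loosely like HDFS blocks."""
    return [records[i:i + block_size]
            for i in range(0, len(records), block_size)]


def map_block(block):
    """Map step: total the sales per store within a single block."""
    totals = Counter()
    for store_id, amount in block:
        totals[store_id] += amount
    return totals


def reduce_totals(partials):
    """Reduce step: merge the per-block totals into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


def parallel_sales_by_store(records, block_size=2, workers=4):
    """Process each block on its own worker, then merge the results."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_totals(partials)


# Hypothetical (store_id, sale_amount) records.
sales = [("A", 10), ("B", 5), ("A", 7), ("C", 3), ("B", 8)]
result = parallel_sales_by_store(sales)
print(dict(result))
```

Because each block is summarized independently, adding workers shortens the map step without changing the merged result, which is the property that lets a cluster compress long batch runs.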

Clean data is valuable data: integrity with rapid, cost-effective scalability

A MetaScale Hadoop implementation can also help improve data integrity throughout large enterprises via better, enterprise-wide data governance that prevents duplication of data sets or proliferation of different versions as different workgroups and individuals perform ETL on data.

Gupta cites one large firm that found 300 people in accounting were using 450 operational Microsoft Access databases, many with ODBC connections to other data sources, so users could extract data and modify data to meet their needs. “This lack of governance contributes to costly data movements and to data disagreements as users modify their data, often moving the results into Excel spreadsheets, which feed into portals that produce differing results from the data,” he says.
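One simple way a governance process might surface the kind of duplication described here is to fingerprint each data set by its content and group the matches. The Python sketch below shows the idea under stated assumptions: the file names and rows are invented for illustration, data sets are small enough to hold in memory, and nothing here reflects MetaScale's actual tooling.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Hash a data set's rows so identical content gives the same digest.

    Rows are sorted first, so two copies that differ only in row order
    still count as duplicates.
    """
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def find_duplicates(datasets):
    """Return groups of data set names that share identical content."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical departmental extracts: (invoice_id, amount) rows.
datasets = {
    "accounting_q1.accdb": [("inv-1", 100), ("inv-2", 250)],
    "accounting_q1_copy.accdb": [("inv-2", 250), ("inv-1", 100)],
    "accounting_q1_edited.accdb": [("inv-1", 100), ("inv-2", 275)],
}

duplicates = find_duplicates(datasets)
print(duplicates)
```

The edited copy is not flagged as a duplicate because one amount differs, which is exactly the "different versions" case that leads to data disagreements.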

Once a MetaScale client has implemented an Apache Hadoop data infrastructure, Gupta says, it’s ready for a big data future, no matter how much its data grows or what mix of structured or unstructured data it has.

He explains that the key is to keep the data in its rawest, most basic form, using ETL as needed but leaving the source data intact. “Despite its deployment complexities,” he says, “Hadoop’s simplified data processing model supports extreme scalability, so adding capacity with MetaScale’s big data appliances, based on commodity servers like Dell PowerEdge models, can be tremendously cost-effective.”
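Keeping source data intact while applying ETL as needed can be pictured as read-only raw records plus derived views that each workgroup computes on demand. The following Python sketch illustrates that pattern only; the record fields, view names and values are hypothetical and are not drawn from the client's systems.

```python
from collections import Counter
from types import MappingProxyType

# Raw events are stored as read-only mappings so no transform can alter them.
# All fields and values are hypothetical.
RAW_EVENTS = tuple(
    MappingProxyType(event)
    for event in (
        {"store": "A", "sku": "123", "price": "4.99", "date": "2015-04-01"},
        {"store": "B", "sku": "123", "price": "5.49", "date": "2015-04-01"},
        {"store": "A", "sku": "456", "price": "1.25", "date": "2015-04-02"},
    )
)


def to_pricing_view(raw_events):
    """Derive typed pricing rows without modifying the raw records."""
    return [
        {"store": e["store"], "sku": e["sku"], "price": float(e["price"])}
        for e in raw_events
    ]


def to_daily_counts(raw_events):
    """Derive a second, independent view from the same raw records."""
    return Counter(e["date"] for e in raw_events)


pricing_view = to_pricing_view(RAW_EVENTS)
daily_counts = to_daily_counts(RAW_EVENTS)

print(pricing_view[0])
print(dict(daily_counts))
# The raw record still holds the original string value.
print(RAW_EVENTS[0]["price"])
```

Because every view is rebuilt from the same untouched source, two workgroups with different needs cannot drift into conflicting copies of the underlying data.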

Dell, the Dell logo and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims any proprietary interest in the marks and names of others. Availability and terms of Dell Software, Solutions and Services vary by region. This case study is for informational purposes only. Dell makes no warranties—express or implied—in this case study. Reference Number: 10013003 © April 2015, Dell Inc. All Rights Reserved

View all Dell case studies at Dell.com/CustomerStories
