Accelerating big data ROI with Hadoop


Post on 18-Aug-2015







Business need
MetaScale's clients need much more cost-effective ways to derive intelligence from their enormous data stores while radically reducing costs.

Solution
One client achieved its ROI in just three months with MetaScale's big data appliances, based on Dell servers, to manage several petabytes of data.

Benefits
- Reduces cost per million instructions per second (MIPS) by up to 160 times
- Cuts batch processing time from over 20 hours to less than an hour
- Improves decision support by providing better data faster
- Boosts data integrity throughout large enterprises
- Offers rapid and cost-effective scalability

Solutions at a glance
Big data

Accelerating big data ROI with Hadoop
MetaScale provides Hadoop big data solutions, training and support, partnering with Dell to help clients speed processing, improve decision support and realize major cost reductions.

Customer profile
Company: MetaScale
Industry: Information technology
Country: United States
Website:

"Addressing exhausted enterprise data capacity can cost up to $800,000 per terabyte of data. But with Hadoop's extreme scalability, adding terabytes can cost as little as $5,000 using MetaScale's big data appliances based on Dell PowerEdge servers."
Ankur Gupta, General Manager, MetaScale
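For what it's worth, the two per-terabyte figures in the quote above bracket the "160 times" reduction claimed in the benefits list. The pairing is my own arithmetic, not something the document states:

```python
# Back-of-envelope check on the per-terabyte costs quoted in this case
# study. Assumption (mine, not the document's): the 160x benefit figure
# is the ratio of these two quoted numbers.

LEGACY_COST_PER_TB = 800_000   # quoted ceiling for exhausted warehouse capacity
HADOOP_COST_PER_TB = 5_000     # quoted cost on MetaScale/Dell appliances

ratio = LEGACY_COST_PER_TB / HADOOP_COST_PER_TB
print(f"Legacy vs. Hadoop cost per terabyte: {ratio:.0f}x")
```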
"ETL time savings can vastly improve decision support by providing enterprise management with better data faster. And our big data appliances using Dell's PowerEdge servers provide the horsepower for that."
Ankur Gupta, General Manager, MetaScale

Large enterprises the world over are swamped with data. Transactional data. Operational data. As the Internet of Things grows, so grows data exponentially. It's called big data. And while it lacks a specific, industry-standard definition, experts generally agree it comprises massive amounts (terabytes, if not petabytes) of both structured and unstructured data.

Products & services
Hardware: Dell PowerEdge servers
Software: Apache Hadoop

In addition, its complexity exceeds the capacity and capability of traditional data management models such as relational databases, which are falling far short in their ability to keep up with the world's data explosion. High costs, scalability and data governance are all issues. Compounding those issues is the fact that data volumes may be growing, but IT budgets are not.

Mitigating the growing costs of data overload
One company at the forefront of helping large enterprises accelerate their development of scalable, highly cost-effective big data solutions is MetaScale, a global consulting firm with headquarters in Chicago, Illinois. As general manager of MetaScale, Ankur Gupta knows what's driving big data. "With all the interconnectivity across all the different gadgets that are now part of our world, our lives are becoming more and more digital," he says. "Thanks to that phenomenon, we're collecting more data than ever before, to the point of it becoming a problem."

It's precisely this problem that is escalating the pains companies face in dealing with big data. One of many areas of concern, Gupta says, is increasing data complexity, which can delay data production schedules while sending costs soaring. "With more data, you have more copies of data, which leads to more complexity and time in dealing with it, along with higher costs of hardware, software and management." He adds that enterprise data users can replicate the same data sets, generating hundreds if not thousands of unnecessary duplications over time, all stored on expensive data warehouse platforms.

Millions of daily shoppers generate massive data challenges
MetaScale's first client was one of North America's largest retail groups, with many thousands of stores and tens of billions of dollars in annual sales. Its daily transactional and operational data volume is several terabytes, generated by millions of shoppers as well as many supply chains. In all, its current data volume exceeds 3 petabytes.

Gupta says this client realized its data management, analysis and reporting capabilities were quickly falling behind the growth of its data. As just one example, to help build customer relationships through special offers and discounts, the client wanted to contact its opt-in customers via email each week. Unfortunately, its legacy system was so bogged down with latency that the company could only run the report "once every eight weeks," Gupta explains. "The problems were threefold: getting data into its relational database; the complexity of its ETL [Extract-Transform-Load] processes; and latency in getting data out of the system."

In another example, the client company's pricing business unit needed daily summary reports, using data from multiple platforms, to measure the effectiveness of its pricing in stores. It was generating these reports from its data warehouse and predictive analytics tools, both hosted on mainframes. But the mainframe batch processing required for these reports was taking 10 to 15 hours, so the company could manage only weekly data warehouse loads. As a result, the pricing business unit wasn't getting the decision support it needed to fine-tune in-store pricing.

Harnessing cost-effective distributed processing with Apache Hadoop solutions
MetaScale's approach to solving its client's big data problems was to deploy an Apache Hadoop solution (an open-source framework for distributed storage and processing) powered by MetaScale big data appliances based on Dell PowerEdge servers. MetaScale worked closely with Dell to develop its bundled Hadoop solutions to meet its clients' growing needs for performance, scalability and support, and Gupta recognizes Dell as a key partner. "Powerful yet highly cost-effective hardware, like our big data appliances, which incorporate Dell PowerEdge servers, is what makes the Hadoop cluster work," he says. MetaScale's big data appliances arrange the server nodes into logical clusters for handling large-scale data sets cost-effectively. Gupta says its client's Hadoop solution has a cluster with more than 500 server nodes, and that doesn't count its backup cluster.

A step-wise approach: a proven methodology
MetaScale approaches new client projects with a detailed methodology that emphasizes a measured, step-wise approach to Hadoop's deployment across the enterprise.
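The mainframe batch reports described above map naturally onto Hadoop's model. As an illustrative sketch (not MetaScale's actual code), suppose the daily pricing feed is tab-separated records of the hypothetical form store_id, SKU, price; with Hadoop Streaming, a mapper emits store-keyed pairs and a reducer folds the key-sorted stream into per-store summaries:

```python
"""Illustrative Hadoop Streaming-style job for a daily pricing summary.

Assumptions (mine, not the document's): tab-separated input records of
the form  store_id<TAB>sku<TAB>price, and a report of item count and
average price per store. map_line() would back the -mapper script and
reduce_sorted() the -reducer script; Hadoop sorts mapper output by key
in between and runs many copies of each in parallel on the data nodes.
"""

def map_line(line):
    """Parse one raw record into a (store_id, price) key/value pair."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None                      # drop malformed records
    store_id, _sku, price = parts
    try:
        return store_id, float(price)
    except ValueError:
        return None                      # drop non-numeric prices

def reduce_sorted(pairs):
    """Fold key-sorted (store_id, price) pairs into per-store summaries.

    Yields (store_id, item_count, average_price). Relies on Hadoop's
    shuffle having grouped identical keys together, as Streaming does.
    """
    current, total, count = None, 0.0, 0
    for store_id, price in pairs:
        if store_id != current:
            if current is not None:
                yield current, count, total / count
            current, total, count = store_id, 0.0, 0
        total += price
        count += 1
    if current is not None:
        yield current, count, total / count
```

On a cluster, this pair would be wrapped in two small stdin/stdout scripts and submitted via the Hadoop Streaming JAR's -input, -output, -mapper and -reducer options. Because each node processes only the data blocks it holds, a long serial mainframe batch run becomes many short tasks executing at once, which is the mechanism behind the batch-time reductions the article describes.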
"A common misconception is that Hadoop is a turn-key solution that will solve all of a company's data issues," Gupta says. "In fact, Hadoop is much more intricate than that. You can't just download it and expect it to work. Companies need to take a step back, identify major use cases within their business using their own data sets, validate those cases, and then develop a proof of concept before proceeding with deployment."

Another misconception is that because Apache Hadoop is open-source software, it's not enterprise-ready. Gupta points out that the platform's broader ecosystem of commercial developers has spawned a wide range of powerful tools and supporting technologies, making it an extremely capable enterprise solution.

Extreme benefits, including 160-times cost reductions and time savings
In this client's Hadoop implementation, MetaScale helped the client achieve ROI just three months after becoming operational, with a cluster that at the time harnessed 50 nodes of Dell servers to manage its growing data. Gupta explains that by following MetaScale's detailed project methodology, its client built a Hadoop cluster, initiated production in six months and achieved operational stability in 12 months. "Taking MetaScale's measured, phased approach is the way to minimize risk and ensure success," he says. "Once Hadoop is deployed, and done so correctly, paybacks are rapid."

Cost savings are one source of those paybacks, according to Gupta. He points out that until Apache Hadoop debuted,
enterprise IT's only option for exhausted data warehouse and processing capacity was to deploy ever more computing power (measured in millions of instructions per second, or MIPS) and storage. "With processing power plus additional storage, software and support, addressing exhausted enterprise data capacity can cost up to $800,000 per terabyte of enterprise data," he says. "But with Hadoop's extreme scalability, adding terabytes can cost less than $5,000 per terabyte using MetaScale's big data appliances based on Dell PowerEdge servers."

Time savings are the other source of ROI for Hadoop implementations, Gupta says. MetaScale typically sees batch processing times cut from 20 hours or more, depending on the size of a data run, to less than an hour. That's because data is spread across the Hadoop Distributed File System (HDFS), so processing can be dispatched to the data nodes simultaneously; in a 100-node cluster, for example, up to 1,200 processors can be working at the same time. In addition, Hadoop can compress ETL time frames. "ETL time savings can vastly improve decision support by providing enterprise management with better data faster," he says. "And our big data appliances using Dell's PowerEdge servers provide the horsepower for that."

Clean data is valuable data: integrity with rapid, cost-effective scalability
A MetaScale Hadoop implementation can also help improve data integrity throughout large enterprises through better, enterprise-wide data governance that prevents duplication of data sets and proliferation of different versions as different workgroups and individuals perform ETL on data. Gupta cites one large firm that found 300 people in accounting were using 450 operational Microsoft Access databases, many with ODBC connections to other data sources, so that users could extract and modify data to meet their needs.
"This lack of governance contributes to costly data movements and to data disagreements as users modify their data, often moving the results into Excel spreadsheets, which feed into portals that produce differing results from the data," he says. Once a MetaScale client has implemented an Apache Hadoop data infrastructure, Gupta sa