

Mistakes To Avoid When Implementing A Big Data Project

It’s no secret that Big Data comes with inherent challenges. We identified seven common mistakes made by executives and IT teams as they go through the planning and implementation process. We’ve addressed the mistakes in two parts:

1) tactical (for developers and engineers), and 2) strategic (for architects and executives).

FEBRUARY 20, 2017

Page 2: Mistakes to Avoid When Implementing Big Data Project - eBook...implementing Hadoop for the first time is believing you can do things with Hadoop the same way you do with relational

Table of Contents

Introduction

Common Tactical Mistakes

MISTAKE #1: Migrate everything before devising a plan

MISTAKE #2: Assume the same skillsets for managing a traditional relational database are transferable to Hadoop

MISTAKE #3: Treating a data lake on Hadoop like a regular database

MISTAKE #4: I can figure out security later

Common Strategic Mistakes

MISTAKE #1: The HiPPO knows best. No strategic inquiry necessary

MISTAKE #2: Bridge the skills gap with traditional ETL processes

MISTAKE #3: I can have a small budget and get enterprise-level value

Conclusion


Introduction

It's no secret that Hadoop comes with inherent challenges. Business needs, specialized skills, data integration and budget are just a few things that factor into planning and implementation. Yet, despite all the due diligence, a large percentage of Hadoop implementations fail. We'd like to turn that around.

With the goal of helping organizations achieve business value from Hadoop, we sat down with members of Hitachi Vantara's consulting services and enterprise support teams to discuss their experiences helping organizations develop, design and implement complex big data, business analytics and embedded analytics initiatives.

We identified seven common mistakes made by executives and IT teams as they go through the planning and implementation process. We’ve addressed the mistakes in two parts: 1) tactical (for developers and engineers), and 2) strategic (for architects and executives).



Common Tactical Mistakes

MISTAKE #1: MIGRATE EVERYTHING BEFORE DEVISING A PLAN

Let’s say you’ve determined that your current architecture is not equipped to process big data effectively, management is open to adopting Hadoop, and you’re excited to get started. Wonderful!

But don’t just dive in without a plan. Migrating everything without a clear strategy will only create long-term issues resulting in expensive ongoing maintenance. With first-time Hadoop implementations, you can expect a lot of error messages and a steep learning curve. Dysfunction, unfortunately, is a natural byproduct of the Hadoop ecosystem...unless you have expert guidance.

Successful implementation starts by identifying a business use case. Consider every phase of the process – from data ingestion to data transformation to analytics consumption, and even beyond to other applications and systems where analytics must be embedded. It also means clearly determining how Hadoop and big data will create value for your business. Taking an end-to-end, holistic view of the data pipeline, prior to implementation, will help promote project success and enhanced IT collaboration with the business.

See our e-book Hadoop and the Analytic Data Pipeline for more on creating a holistic approach.

OUR ADVICE: Maximize your learning in the least amount of time by taking a holistic approach and starting with smaller test cases. Like craft beer, good things come in small batches.

CHECK OUT OUR E-BOOK



MISTAKE #2: ASSUME THE SAME SKILLSETS FOR MANAGING A TRADITIONAL RELATIONAL DATABASE ARE TRANSFERABLE TO HADOOP

A common mistake made by people who are implementing Hadoop for the first time is believing you can do things with Hadoop the same way you do with relational databases. But Hadoop is a whole new world in which you must do things differently.

Hadoop is a distributed file system, not a traditional relational database (RDBMS). Hadoop allows IT teams to effectively store, process and distribute structured and unstructured data in sizes and types that relational databases can't typically handle. It offers massive scale in processing power and storage by using multiple nodes of commodity hardware to crunch data in parallel. Because Hadoop doesn't function in the same way as a relational database, you cannot expect to simply migrate all your data and manage it in the same way, nor can you expect skillsets to be easily transferable between the two.
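To make the contrast concrete, here is a minimal PySpark sketch of the Hadoop way of working. The HDFS paths and column names are hypothetical placeholders; the point is that the computation ships to the data and runs in parallel across the cluster's nodes, rather than inside a single database engine:

```python
# Minimal sketch, assuming a Hadoop cluster with Spark available.
# HDFS paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-rollup").getOrCreate()

# Read raw, semi-structured event files straight off HDFS; no schema
# needs to be declared up front, unlike loading into an RDBMS table.
events = spark.read.json("hdfs:///data/raw/clickstream/2017/02/*.json")

# The aggregation executes in parallel across the cluster's nodes.
daily_counts = events.groupBy("event_date", "event_type").count()

daily_counts.write.mode("overwrite").parquet("hdfs:///data/curated/clickstream_daily/")
spark.stop()
```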

On the other hand, if your current team happens to lack certain skills or familiarity with Hadoop, it doesn’t mean that hiring a new group of people is inevitable. Every situation is different, and there are several options to consider. For example, training existing employees in addition to augmenting staff might be a good option. Filling skills gaps with point solutions may suffice in some instances, but for growing organizations looking to scale, leveraging an end-to-end data platform that is accessible to a broad base of relevant users may be the best solution in the long run.

OUR ADVICE: While Hadoop does present IT organizations with skills and integration challenges, it's important to look for a solution with the right combination of people, agility and power to make you successful.



MISTAKE #3: TREATING A DATA LAKE ON HADOOP LIKE A REGULAR DATABASE

A major misconception is that you can treat a data lake on Hadoop just like a regular database. While Hadoop is powerful, it’s not structured the same way as an Oracle, HP Vertica, or a Teradata database, for example. Similarly, it was not designed to store anything you’d normally put on Dropbox or Google Drive. A good guideline for this scenario is: if it can fit on your desktop or laptop, it probably doesn’t belong on Hadoop.

In a data lake, the data is all there. But because it has not been partitioned or optimized, you only have the pieces of a data lake – it's not ready out of the box. Many assume the data in a lake will be clear and easy to find, but in reality it can be murky, and you may realize it's not what you thought you were building.

As your organization scales data onboarding from just a few sources going into Hadoop to hundreds or more, IT time and resources can be monopolized creating hundreds of hard-coded data movement procedures – and the process is often highly manual and error-prone. A properly developed data lake will:

■ Reduce IT time and cost spent building and maintaining repetitive big data ingestion jobs, allowing valuable staff to dedicate time to more strategic projects.

■ Minimize the risk of manual errors by decreasing dependence on hard-coded data ingestion.

■ Automate business processes for efficiency and speed, while maintaining data governance.

■ Enable more sophisticated analysis by business users with new and emerging data sources.

OUR ADVICE: Take the proper steps at the start to understand how best to ingest data to get a working data lake.

Learn how the Pentaho platform's Metadata Injection helps organizations accelerate productivity and reduce risk in complex data onboarding projects by dynamically scaling out from one template to hundreds of actual transformations.

DOWNLOAD THE DATASHEET
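To illustrate the template-driven idea in general terms (this is a generic sketch, not Pentaho's actual Metadata Injection feature; the source names and paths are hypothetical), a single parameterized routine can replace hundreds of hard-coded ingestion jobs, with the per-source details maintained as metadata rather than as code:

```python
# Minimal sketch of metadata-driven ingestion, assuming sources land as
# CSV and are written into an HDFS lake. Names and paths are hypothetical.
from pyspark.sql import SparkSession

# One metadata record per source replaces one hand-coded job per source.
SOURCES = [
    {"name": "orders",    "path": "/landing/orders/*.csv",    "keys": ["order_id"]},
    {"name": "customers", "path": "/landing/customers/*.csv", "keys": ["customer_id"]},
    # ...hundreds more entries, maintained as data, not as code
]

spark = SparkSession.builder.appName("templated-ingest").getOrCreate()

for src in SOURCES:
    df = spark.read.csv(src["path"], header=True, inferSchema=True)
    # Apply the same template to every source: de-duplicate on its keys,
    # then write a query-friendly copy into the lake.
    df.dropDuplicates(src["keys"]).write.mode("overwrite") \
      .parquet("hdfs:///lake/raw/{}".format(src["name"]))

spark.stop()
```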



MISTAKE #4: I CAN FIGURE OUT SECURITY LATER

For most enterprises, protecting sensitive data is top of mind, especially after recent headlines about high profile data breaches. And if you’re considering using any sort of big data solution in your enterprise, keep in mind that you’ll be processing data that’s sensitive to your business, your customers and your partners. You know security is important in the long run, but is it important to consider it before you deploy? Absolutely!

You should never, ever, expose the credit card and bank account information, social security numbers, proprietary corporate information and personally identifiable information of your clients, customers and employees. Protection starts with planning ahead, not after deployment.
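One planning-ahead tactic, shown below as a minimal, generic sketch, is to tokenize sensitive fields before the data ever lands in the cluster. The field names and key handling here are hypothetical, and this toy is no substitute for a vetted encryption or tokenization product:

```python
# Minimal sketch: replace sensitive identifiers with keyed, irreversible
# tokens before ingestion, so raw values never land in the cluster.
# Field names are hypothetical; a real deployment would use a vetted
# tokenization service with proper key management, not this toy.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # placeholder secret

def tokenize(value: str) -> str:
    """Replace a sensitive value with a keyed, irreversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"customer_id": "C-1001", "ssn": "123-45-6789", "card": "4111111111111111"}

for field in ("ssn", "card"):
    record[field] = tokenize(record[field])

print(record)  # sensitive fields are now opaque tokens
```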

OUR ADVICE: Address each of the following security areas before you deploy a big data project:

■ Authentication: Control who can access clusters and what they can do with the data.

■ Authorization: Control what actions users can take once they're in a cluster.

■ Audit and tracking: Track and log all actions by each user as a matter of record (see the sketch after this list).

■ Compliant data protection: Utilize industry-standard data encryption methods in compliance with applicable regulations.

■ Automation: Prepare, blend, report and send alerts based on a variety of data in Hadoop.

■ Predictive analytics: Integrate predictive analytics for near real-time behavioral analytics.

■ Best practices: Blend data from applications, networks and servers, as well as mobile, cloud and IoT data.
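As a small illustration of the audit-and-tracking item, here is a minimal sketch that records who did what, and when, for every data-access call. The function and user names are hypothetical, and a real cluster would lean on platform-level auditing rather than application decorators:

```python
# Minimal audit-trail sketch: log the acting user, the action, and a
# timestamp for every data-access call. Names are hypothetical.
import functools
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def audited(action):
    """Decorator that records the user, action, and time of each call."""
    def wrap(func):
        @functools.wraps(func)
        def inner(user, *args, **kwargs):
            audit_log.info("%s | user=%s | action=%s",
                           datetime.now(timezone.utc).isoformat(), user, action)
            return func(user, *args, **kwargs)
        return inner
    return wrap

@audited("read_table")
def read_table(user, table):
    return "rows from {}".format(table)  # placeholder for a real query

read_table("jdoe", "customers")  # emits an audit record before the read
```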

Stay ahead of the curve. Watch this cybersecurity video to learn more.

WATCH THIS VIDEO



Common Strategic Mistakes

MISTAKE #1: THE HiPPO KNOWS BEST. NO STRATEGIC INQUIRY NECESSARY

HiPPO is an acronym for the "highest paid person's opinion" or the "highest paid person in the office." The idea is that HiPPOs are so self-assured that they tend to dismiss any data, or any input from lower-paid employees, that contradicts their intuition. Trusting one's gut rather than data may work occasionally, but Hadoop is complex and requires strategic inquiry to fully understand the nuances of when, where and why to use it.

To start, it’s important to understand what you’re trying to achieve from a business perspective, who will benefit from this investment, and how the spend will be justified. In fact, most big data projects fail because the business value is not being achieved.

The true business value of Hadoop is determined by the nature of your data problem. Do you have a current or future need for big data? A desire for self-service data preparation? Or a need to embed analytics into your portals or applications? Are you spending most of your time preparing data, as opposed to visualizing it?

Once a data problem has been established, the next step is to determine whether or not your current architecture will help you achieve your big data goals. If exposure to open source or unsupported code is a concern, it may be time to explore commercial options with the support and security you need. The same can be said if you plan to embed software within your company for company-wide analytics that enable users to get what they need without having to learn and switch to another application or ask IT for help all the time.

OUR ADVICE: As Teddy Roosevelt once said, "The best executive is the one who has sense enough to pick good men to do what he wants done, and self-restraint enough to keep from meddling with them while they do it." You hired talented people for a reason. Listen to them. Once a business need for big data has been established, determine who will benefit from the investment, how it will impact your infrastructure, and how the spend will be justified. Also, try to avoid science projects – they tend to become technical exercises with limited business value.



MISTAKE #2: BRIDGE THE SKILLS GAP WITH TRADITIONAL ETL PROCESSES

Assessing the depth and potential challenges associated with the dreaded “skills gap” is a common stumbling block for many organizations tackling the extract, transform, load (ETL) challenges associated with big data. Implementing Hadoop requires highly specific expertise, but there just aren’t enough IT pros with Hadoop skills to go around. On the other hand, some programmers proficient in Java, Python and HiveQL, for example, may lack the experience to optimize performance on relational databases. This scenario may be problematic when Hadoop and MapReduce are used for large-scale traditional data management workloads such as ETL.

Some emerging point solutions are designed to address the skills gap, but they tend to support experienced users rather than elevate the skills of those who need it most. If you're dealing with smaller data sets, you may consider hiring people who've had the proper training on either end, or work with experts to train and guide your staff through implementation. But if you're dealing with extremely large amounts of data – hundreds of terabytes, for instance – then you'll likely need an enterprise-class ETL tool as part of a comprehensive business analytics platform, like Pentaho. Pentaho took a general-purpose ETL framework and made it run natively on a Hadoop grid or cluster from prominent distributors such as Cloudera, MapR, Hortonworks, Amazon and others.

OUR ADVICE: Technology only gets you so far. People, experience and best practices are the most important drivers for project success with Hadoop. When considering an expert or a team of experts as permanent hires or consultants, you'll want to consider their experience with "traditional" as well as big data integration, the size and complexity of the projects they've worked on, the companies they've worked with, and the number of successful implementations they've done. When dealing with very large volumes of data, it may be time to evaluate a comprehensive business analytics platform that's designed to operationalize and simplify Hadoop implementations.



MISTAKE #3: I CAN HAVE A SMALL BUDGET AND GET ENTERPRISE-LEVEL VALUE

The low-cost scalability of Hadoop is one reason why organizations decide to use it. But many organizations fail to factor in data replication, compression (storage space), skilled resources and the overall management of integrating big data with your existing ecosystem.

Remember, Hadoop was built to process a wide variety of enormous data files that continue to grow quickly. And once data is ingested, it gets replicated! For example, if you bring in 3TB of data, it will immediately require 9TB of storage space, because Hadoop's built-in replication stores three copies of every block by default (this redundancy is part of the parallel processing that makes Hadoop so powerful).

So, it's absolutely essential to do proper sizing up front. This includes having the skills on hand to leverage SQL and BI against data in Hadoop and to compress data at the most granular levels. While you can compress data, compression affects performance and must be balanced against expectations for reading and writing data. Skip this planning, and storing the data may cost three times more than you initially budgeted.
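As a back-of-the-envelope illustration, the replication factor of 3 below is Hadoop's default, while the compression ratio and growth rate are hypothetical assumptions; a few lines of arithmetic show how quickly raw capacity requirements diverge from the logical data size:

```python
# Back-of-the-envelope cluster sizing sketch. Replication factor 3 is the
# HDFS default; compression ratio and growth rate are hypothetical.
logical_tb = 3.0          # data you plan to ingest, in TB
replication = 3           # HDFS default: three copies of every block
compression_ratio = 2.0   # assumed 2:1 compression (workload-dependent)
annual_growth = 0.5       # assumed 50% data growth per year

raw_tb = logical_tb * replication / compression_ratio
print("Raw capacity needed today: %.1f TB" % raw_tb)  # 4.5 TB

# Project three years out, since the lake will not stay at 3TB for long.
for year in range(1, 4):
    logical_tb *= 1 + annual_growth
    print("Year %d: %.1f TB raw" % (year, logical_tb * replication / compression_ratio))
```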

Big data does offer big business advantages, but unexpected costs and complexity will present challenges if you don’t properly plan and prepare. You must figure out how you’re going to balance data growth rates with the cost of scale prior to implementation.

OUR ADVICE: Understand how storage, resources, growth rates and management of big data will factor into your existing ecosystem before you implement.



What's Next

Get serious about implementing Hadoop and don't forget to check out these key resources:

■ Our e-book, Hadoop and the Analytic Data Pipeline, has more on creating a holistic approach. Click here.

■ Our website to explore Hitachi Vantara's Pentaho Data Integration and Pentaho Business Analytics. Click here.

Enable users to ingest, blend, cleanse and prepare diverse data from any source. With visual tools to eliminate coding and complexity, Hitachi Vantara puts the best quality data at the fingertips of IT and the business. Get started now.

Hitachi Vantara at a Glance

Your data is the key to new revenue, better customer experiences and lower costs. With technology and expertise, Hitachi Vantara drives data to meaningful outcomes.

Corporate Headquarters
2845 Lafayette Street
Santa Clara, CA 95050-2639 USA
HitachiVantara.com | community.HitachiVantara.com

Contact Information
USA: 1-800-446-0744
GLOBAL: 1-858-547-4526
HitachiVantara.com/contact

HITACHI is a registered trademark of Hitachi, Ltd. VSP is a trademark or registered trademark of Hitachi Vantara Corporation. IBM, FICON, GDPS, HyperSwap, zHyperWrite and FlashCopy are trademarks or registered trademarks of International Business Machines Corporation. Microsoft, Azure and Windows are trademarks or registered trademarks of Microsoft Corporation. All other trademarks, service marks and company names are properties of their respective owners.

SP-xxx-x M345 July 2018