Transforming Hadoop into your best BI platform
An expert how-to guide

Bigger questions, faster answers.


Contents

Section 1: Marriage counsellors needed: Hadoop and BI
  Keeping touch with reality
  Keeping it interactive
  Keeping a lid on it

Section 2: Exploring the nature of the problem
  Moving data
  Slow analysis
  Lack of support

Section 3: Exploring the solution
  There’s no escaping it: SQL rocks
  Five core supporting reasons for SQL-on-Hadoop
  The best platform for analyzing billions of transactions
  But what if...

In summary...


Marriage counsellors needed: Hadoop and BI

Section 1

What we’ll cover

The promise of Hadoop for managing big data environments
The limitations to emerge that have impacted this reputation
The problem of querying data in Hadoop

This paper comes with a big, bold claim... and one we’re prepared to stand by.

Yes Hadoop, and yes for making it the very foundation for delivering the best business intelligence platform your organization has ever experienced.

But where to start? Well, logically it makes sense to begin by taking a look at Hadoop itself, to ask the question: “why isn’t it currently the best BI platform you have?”

Hadoop is of course synonymous with big data. Indeed, the growing need inside organizations to process and analyze structured/unstructured data – alongside the associated challenges of data quality, storage, and security – was instrumental in Hadoop’s development as a data processing platform.

Hence the massive hype surrounding the technology, and why organizations including Facebook – famous ‘crunchers’ of big data – have introduced Hadoop into their data analytics infrastructure.


Keeping touch with reality

Despite Hadoop being considered essential to the spread of big data, adoption across enterprises has not necessarily delivered on the promises surrounding it.

Why? Well, to put it simply, limitations have emerged. Not, it must be said, when handling terabytes and petabytes of data through batch processing and ETL (Hadoop’s sweet spot). Rather, the challenge lies in enabling your business users to query the data in the manner they’ve become used to with commercial, off-the-shelf analytics products.

Such a situation places the spotlight on the speed of interactive SQL. What does this mean?

Keeping it interactive

Hadoop was first conceived as a NoSQL platform. Despite still being used by over 50% of professional developers, SQL was seen as an overly complex, ‘old world’ standard for querying data.

Moving away from it was therefore seen as a good thing: positioning Hadoop as a virtual playground full of possibilities for engineers to reinvent a business’ relationship with its data resources.

All good in theory, yet in practice the absence of SQL has led to blockages in the system. Data Scientist-sized blockages. That’s because it took a seasoned Data Scientist to be able to extract value from Hadoop.

Business users were (and still are) excluded from most activities, and reliant on the experts to construct aggregated data sets in Hadoop before moving them into standard databases for exploitation. These data sets are by their very nature limited in size and scope, and start ageing from the moment they are created. Such challenges have caused businesses to reassess Hadoop.


Keeping a lid on it

“Great for serving as a cheap storage repository, and for processing ETL workloads” is a consensus view that’s now matched in part by more negative considerations: “the complexity for running in-memory, parallel workloads limits its ability to support interactive, user-facing apps”.

What does this mean for the world of BI? The issues here fall into three principal challenges:

1. An inability to perform BI tasks directly on Hadoop is causing organizations to aggregate individual data sets and push them out to business users

2. Running data analysis directly in Hadoop is too slow, and made worse with high-concurrency workloads

3. The SQL products on the market that are dedicated to solving these problems are themselves too slow to be effective

Equally, these problems aren’t occurring in isolation from wider business trends. Rather they’re happening against a backdrop of digital transformation agendas, and the need for real-time intelligence to be made available at the ‘sharp end’ – and placed at the fingertips of those users able to respond to the resulting insights. In other words, we’ve reached the age of pervasive BI, where organizations can have hundreds if not thousands of users demanding access to the same data resources at the same time.


Exploring the nature of the problem

Section 2

What we’ll cover

The challenge of ensuring everyone works with the same data set
The problem of speed – and high-concurrency workloads
The ineffective nature of most SQL-on-Hadoop engines

Let’s look at these problems in more detail...

...starting with issue one.


Problem one

An inability to perform BI tasks directly on data in Hadoop is causing organizations to aggregate individual data sets and push them out to business users

Hadoop has over time developed into a very useful platform for data storage and batch processing. That said, when it comes to providing interactive ad hoc data analysis we hit a problem – the data has to be moved onto another platform. This is partly due to the batch processing for which Hadoop was designed, which can lead to long response times for such impromptu analysis, and partly due to the complexities involved in taking serial SQL operations and ‘parallelizing’ them from scratch.

The tendency is therefore to rely on data aggregation, and the creation (by the aforementioned data scientists) of data subsets that are loaded directly into existing BI tool sets. This in turn creates a range of performance and governance issues, with users working on their own personal subsets, thereby reducing the potential for creating a ‘single version of the truth’. In addition, these data sets are by their very nature limited in terms of the actual data contained within them, and based on assumptions made by their creators about the type and spread of the data needed to complete a specific task.

As a result, with Hadoop approaching (on the Gartner hype cycle) the ‘trough of disillusionment’, the experience for many users is that it’s easy to put data into the platform, but inefficient and cumbersome to take insights out. The result is constant (and unmanageable) data movement, a lack of ROI and platform utilization, and the loss of the more subtle insights that come from interrogating one central data ‘lake’.
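
To make the limits of pre-aggregated subsets described above a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the rows, the column names, and the `build_extract` helper are invented for illustration and do not come from any real BI tool. It shows an extract built to answer one question (average amount per region) being unable to answer a later one (median loan amount), which needs the detail rows that stayed behind on the platform.

```python
from collections import defaultdict
from statistics import median

# Hypothetical detail rows held in the data lake: (region, product, amount)
transactions = [
    ("north", "cards", 120.0),
    ("north", "loans", 900.0),
    ("north", "cards", 80.0),
    ("south", "cards", 50.0),
    ("south", "loans", 700.0),
    ("south", "loans", 300.0),
]


def build_extract(rows):
    """Pre-aggregate to one row per region, as a hand-built extract might."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for region, _product, amount in rows:
        totals[region]["count"] += 1
        totals[region]["total"] += amount
    return {region: dict(values) for region, values in totals.items()}


extract = build_extract(transactions)

# The question the extract was designed for: average amount per region.
avg_by_region = {r: v["total"] / v["count"] for r, v in extract.items()}

# A new question: median loan amount. The extract has no product column
# and no individual amounts, so this has to go back to the detail rows.
loan_amounts = [amount for _r, product, amount in transactions if product == "loans"]
median_loan = median(loan_amounts)
```

The extract keeps only `count` and `total` per region, so any question about products, or about the distribution of individual amounts, falls outside the assumptions its creator made.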


Problem two

Running data analysis directly in Hadoop is too slow, and made worse with high-concurrency workloads.

In a way this problem speaks of a lack of depth in the maturity and capabilities on offer from SQL-on-Hadoop products and services. These tools have typically failed to address the challenges of query performance (even under just a single user’s workload), as well as concurrency, where the need to offer multiple users access to the same data sources at the same time leads to errors and snail-like interactivity.

Yet the problem also points the finger at the fact that many of the currently available SQL query engines are disk-based, and significantly slower in response than in-memory engines. This point is particularly pertinent to the speed equation, as the ability to process massive numbers of parallel queries (over billions of rows of data) will inevitably be compromised by hard disks. Even the latest SSDs on the market are way behind the performance on offer from RAM.

Problem three

The SQL products on the market that are dedicated to solving these problems are themselves too slow to be effective

Part of what gives Hadoop the wow factor is its ability to store vast sets of structured and unstructured data. Yet extracting meaningful value from these resources in a fast and effective manner, and working in real-time with advanced analytics, remains out of reach for many organizations.

But is it fair to lay the blame solely at the feet of Hadoop? Or should we instead charge the tools being deployed on top of it for failing to make the most of the platform? Tools elevated beyond their ability to deliver due to the hype surrounding Hadoop, and the expectation for a solution to the big data ‘crisis’ – alongside poor interfaces that add to the overall complexity.

Talk of failure to date only helps illustrate that many current adopters have implemented Hadoop without properly understanding it – and then failed to bring together the right tools, data, and expertise to get it working properly. The platform may stand accused of being a poor BI platform, but it can point to a variety of extenuating circumstances!


Exploring the solution

Section 3

What we’ll cover

The importance of SQL as THE data query tool
How ultra-fast, high-concurrency analysis is possible on Hadoop
The ability to say ‘no’ to any more data aggregation

There’s no escaping it: SQL rocks

What the above tells us, as far as a solution is concerned at least, is that SQL needs to be involved – and involved heavily.

It is after all the most widely used query language in the world. People are familiar with it, and it remains the language that popular data visualization tools use to ask questions of a relational DBMS.

The challenge now is to extend this into non-relational data stores, to use SQL as a common language that helps inspire simplified data access to assets located in multiple stores – without having to switch between different APIs to make it all work.


Step forward SQL-on-Hadoop.

That’s right: SQL on the ultimate NoSQL platform – SQL query engines for Hadoop in big data systems that can transform your experience of BI platforms. Tools that return IT to a state of ease and familiarity when it comes to programming analytics apps and integration tasks. Tools that enable developers to make use of their SQL skills and capabilities within extensive Hadoop data lakes, and overcome the ‘weak’ relational functions within the platform.

Capabilities that once and for all help Hadoop break free from the confines of the data scientists’ laboratories, and enter the BI mainstream.

Five core supporting reasons for SQL-on-Hadoop:

1. SQL is the language of data query, is proven to work in big data environments (eBay uses it to process 50 petabytes each day [1]), and is used by all modern data visualization tools for accessing data

2. SQL is the preferred language of the data management community, and sits naturally with their existing tool sets

3. SQL offers immediate returns – most businesses are familiar with it, and make use of it on a daily basis

4. Fast and efficient ad hoc exploration of Hadoop data enabled by SQL is a top priority and essential for justifying long-term investment in the platform

5. Self-service analytics is increasingly seen as business-critical, and without SQL tools this will be limited, thereby limiting the range of users able to extract value from Hadoop data

[1] Citation


The best platform for analyzing billions of transactions

The world’s largest payment card issuer has 10,000 active Tableau users analyzing data held in a nine-petabyte Hadoop cluster in near real-time.

Their aim is to identify spending patterns across a data set that covers 12 billion transactions. To do this, the company had two options:

1. Create and maintain 10,000 near real-time data extracts for individual clients – which operationally would most likely prove to be impossible

2. Use a query engine for Hadoop capable of handling hundreds of concurrent complex SQL queries over the entire data set – and return the results in near real-time.

Unsurprisingly, they went with option 2.


But what if…

So let’s talk what-ifs; let’s talk ultra-fast, high-concurrency SQL for Hadoop and data warehousing; and let’s talk about scale-out, in-memory software that enables modern BI and visualization tools to maintain their performance – even when the data volume is large and the user count high.

In other words, let’s talk Kognitio.

As a software layer sitting between the data stored in Hadoop and your business users/BI tools, Kognitio is:

Free to use, with no limits on scalability, capability, or duration of use
Proven and mature in terms of query optimization and functionality
Highly performant, particularly with concurrent SQL workloads
Usable both on-premises and in the cloud
Deployed and run as a YARN application on your Hadoop cluster

How does this relate to your day job? Well, every reader’s operational concerns are obviously unique, but by way of exploring generic benefits, let’s return to our three core problems:

Problem one: an inability to perform BI tasks directly on Hadoop is causing organizations to aggregate individual data sets and push them out to business users.

Now we can provide the answer: with Kognitio you don’t have to take data out of Hadoop. Simple. Instead, you access the data directly from the platform, with the speed and performance required for interactive BI. Better still, by removing the need to constantly create and share data subsets, you’ll have a super-charged BI experience that’s complete – and far better than anything you’re currently using. What’s more, in this situation you really will have a single version of the truth, and from it the opportunity to see new ways for delivering ROI, and for rationalizing your entire BI infrastructure.


Problem two: running data analysis directly in Hadoop is too slow, and made worse with high-concurrency workloads.

Kognitio has spent the last 25 years working out the best ways to parallelize complex SQL functionality. Due to this heritage, we can point to a history of empowering tools like Tableau and MicroStrategy, and delivering breath-taking SQL performance with concurrent workloads. To deliver the fastest SQL on Hadoop today, we’ve migrated our engine to the platform, and can help you to run thousands of complex queries per second to serve answers to thousands of concurrent users throughout your business.

Problem three: the SQL products on the market that are dedicated to solving these problems are themselves too slow to be effective.

SQL is a large, complex standard, and difficult to implement from scratch – but this is exactly what most SQL-on-Hadoop technologies are trying to do. To make matters worse, the challenge of developing SQL on Hadoop is compounded by the platform being a scale-out parallel platform – meaning all SQL operations themselves need to be efficiently executed in parallel. This is a very hard process to manage, and as a result, newer SQL solutions deliver poor performance when it comes to ad hoc analysis using modern, interactive visualization tools. Kognitio, on the other hand, started out working on complex parallel implementations. It’s what we do, and we’ve got rather good at it!
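
As a rough illustration of why SQL operations have to be rethought for a scale-out platform, the sketch below parallelizes the equivalent of `SELECT region, AVG(amount) ... GROUP BY region` in plain Python. It is not Kognitio’s implementation or that of any other engine: the partitions, the function names, and the use of threads as stand-ins for cluster nodes are all assumptions made for the example. Each partition computes a local SUM and COUNT, and the average is only derived once those partial results have been merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions, standing in for data blocks spread across nodes.
partitions = [
    [("north", 10.0), ("south", 20.0)],
    [("north", 30.0)],
    [("south", 40.0), ("south", 60.0), ("north", 50.0)],
]


def partial_aggregate(rows):
    """Per-partition step: local SUM and COUNT for each group."""
    sums, counts = Counter(), Counter()
    for region, amount in rows:
        sums[region] += amount
        counts[region] += 1
    return sums, counts


def parallel_group_by_avg(parts, workers=3):
    """Run the partial step on every partition, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, parts))
    total_sums, total_counts = Counter(), Counter()
    for sums, counts in partials:
        total_sums.update(sums)      # adds the partial sums per group
        total_counts.update(counts)  # adds the partial counts per group
    # AVG comes from the global SUM and COUNT, not from per-partition averages.
    return {region: total_sums[region] / total_counts[region] for region in total_sums}


result = parallel_group_by_avg(partitions)
```

Even this simple case has a trap: averaging the per-partition averages for `south` would give 35.0 rather than the correct 40.0, because the partitions hold different numbers of rows. Joins, sorts, and window functions each need their own parallel treatment, which is part of what makes implementing the full standard from scratch so demanding.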

The best platform for SQL on Hadoop

Want to know what your options are for getting SQL to work on Hadoop? To help, we’ve been doing some tests, and using the TPC-DS query set have measured the performance of Hive LLAP, Impala, SparkSQL and Presto – and compared their results to Kognitio.

What’s more, we haven’t just tested if the platforms can run the queries, but also whether they work under load and work with a mixed-concurrent workload – the latter to show whether they can handle interactive querying on Hadoop from BI tools.


[Figure: “Does the platform run the SQL?” Bar chart showing, for SparkSQL, Impala, Hive LLAP, Presto and Kognitio, the number of queries (0 to 99) that run out of the box, need minor changes, are long-running, or have no support.]

[Figure: query-by-query comparison panels at 1TB, for a single stream and for 10 concurrent streams, each annotated with the number of queries in which Kognitio was faster: SparkSQL compared with Kognitio 8.1.50; Impala compared with Kognitio 8.1.50; Hive LLAP compared with Kognitio 8.2.0; Presto compared with Kognitio 8.2.0.]

The graphs above are a summary of the full benchmark results, which can be found on the Kognitio website.


In summary…

So, can Hadoop be turned into the best BI platform within your business? Certainly it’s a worthwhile goal to pursue given the amount of data flowing across most companies today, and the resulting need to make distributed computing platforms and scale-out strategies ‘work’.

Yet the excessive complexities that come with running SQL queries on a parallel platform, and accommodating high-concurrency workloads, have for the majority proven to be an insurmountable barrier to realizing the benefits of Hadoop.

All of which brings us to the key takeaway from this paper: such complexity can be overcome, nullified, even simplified. Indeed, with Kognitio it already has been, and we can list a large number of organizations using our SQL engine to process data at ultra-fast speeds to gain the insights needed to drive their businesses forward. Just as importantly, we’re helping marry this speed with flexibility: to say no to aggregated data sets that only ever show part of the overall story; to help make the data accessible wherever it’s needed to support the ‘BI everywhere’ agenda; and to run analyses that are only limited by the user’s imagination – and not the available data.

This is Kognitio on Hadoop – a combination that’s helping redefine the art of the possible when bringing your BI capabilities to life in a big data environment.

For more information contact:

Kognitio Ltd, 10 Bracknell Beeches, Old Bracknell Lane, Bracknell RG12 7BW, United Kingdom

+44 (0) 1344 300770 | kognitio.com
