ETL: STILL RELEVANT IN THE AGE OF HADOOP Yaniv Mor, CEO and co-founder Xplenty

Posted on 27-Jan-2015

DESCRIPTION

Buzz about Big Data has been at fever pitch for over a year now. We hear a lot about how the insights we glean will propel businesses, about emerging technologies, and companies merging. But how often do we hear about the guts behind Big Data, what makes it actually work? Maybe I’m wrong, but from what I read, not often enough. So to buck that trend, let’s dive into one of the main building blocks of traditional data warehousing, ETL, and see how it fits in with current Big Data architecture.

TRANSCRIPT

Page 1

ETL: STILL RELEVANT IN THE AGE OF HADOOP

Yaniv Mor, CEO and co-founder, Xplenty

Page 2

THE NEW KID

Some people think that Hadoop means the end of ETL (extract, transform, load). Not surprising. Often the birth of a new technology spells death for the old one.

While ETL stems from data warehousing methodologies created in the 1960s, Hadoop, born in 2005, has only caught fire over the last few years.

Page 3

OLD HABITS

The data architecture world may be tired of using the same process for over 40 years, and Hadoop does revolutionize Big Data. Nevertheless, it doesn’t look like ETL is going anywhere.

Page 4

HIGH HOPES

Hadoop comes with great promise:

▪ Stores large amounts of data on fairly inexpensive distributed systems

▪ Keeps structured, semi-structured, and unstructured data on the same platform

▪ Scales out to practically unlimited storage space

Page 5

THE END OF ETL?!?

A company’s raw data can be dumped straight into Hadoop from several sources and later analyzed without having to change it even one bit.

Page 6

H(ADOOP)OUDINI?

But ETL didn’t disappear! In fact, it turns out that Hadoop is commonly used for three main cases:

▪ Cheap storage

▪ ETL

▪ Data exploration

Cheap storage makes complete sense - Hadoop is a great alternative to specialized servers and RAID technology since you can always add machines with more space to the Hadoop cluster. Still, using Hadoop for ETL? How come? Isn’t it supposed to replace that old horse?

Page 7

NOT FOR EVERYONE…

Technology is only a means to an end. Who is using it and what are their needs? Data scientists, for instance, might not need any ETL. They could really gain from access to huge amounts of raw data via Hadoop and take their time to find insights.

Page 8

…BUT GOOD FOR MOST

However, the majority of workers in an organization need something else. They need to do analysis and reporting quickly, using their existing BI and reporting tools.

No matter which source it comes from, this data has to be clean, well formed, and expressed in the common business terminology used across the organization.

This means the majority still need ETL.
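
To make "clean, well formed, and common business terminology" a little more concrete, here is a minimal Python sketch of a transform step. The field names, the alias table, and the `clean_record` helper are all hypothetical, chosen only for illustration; they are not taken from any particular tool.

```python
# Hypothetical mapping from source-specific field names to one shared vocabulary.
FIELD_ALIASES = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "amt": "revenue",
    "total": "revenue",
}


def clean_record(raw):
    """Return a cleaned copy of `raw`, or None if the record is unusable."""
    record = {}
    for key, value in raw.items():
        name = FIELD_ALIASES.get(key, key)  # rename to the common term
        if isinstance(value, str):
            value = value.strip()           # trim stray whitespace
        record[name] = value

    # A record without a customer identifier cannot be reported on.
    if not record.get("customer_id"):
        return None

    # Revenue must be numeric so BI tools can aggregate it.
    try:
        record["revenue"] = float(record.get("revenue", 0))
    except (TypeError, ValueError):
        return None
    return record


# Two sources describing the same customer with different field names.
crm_row = {"cust_id": " 42 ", "amt": "19.90"}
shop_row = {"customerId": "42", "total": 5}
print(clean_record(crm_row))
print(clean_record(shop_row))
```

After this step both rows use `customer_id` and `revenue`, which is the kind of consistency existing reporting tools tend to expect.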

Page 9

PROGRESS

The good news: Hadoop helps the ETL process. There are good reasons why ETL tools like Informatica, Talend, and Pentaho integrate with Hadoop, and handling huge volumes of data is one of them.

While traditional tools have a limit on the file sizes that they can process, Hadoop can handle petabytes - just ask Facebook, which stores 100 PB of data on Hadoop (and that was over a year ago).

Page 10

$PEED

Because Hadoop is a distributed system, it processes data in parallel on the cluster’s machines, so offloading heavy ETL tasks to Hadoop makes them run faster. Scalability and price are other well-known advantages.

Whereas traditional ETL usually needs expensive machines that scale vertically, Hadoop scales horizontally by adding off-the-shelf servers to the cluster.
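
The idea of splitting work across machines and merging the partial results can be sketched on a single machine. The Python example below is only an analogy, not Hadoop code: a thread pool stands in for cluster nodes, and the sample partitions and function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_events(partition):
    """'Map' step: summarize one partition independently of the others."""
    return Counter(line.split(",")[0] for line in partition)


def parallel_count(partitions, workers=4):
    """Process partitions concurrently, then merge the partial counts ('reduce')."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_events, partitions):
            total.update(partial)
    return total


# Each inner list plays the role of a data block stored on a different node.
partitions = [
    ["click,u1", "view,u2"],
    ["click,u3"],
    ["view,u1", "click,u2"],
]
print(dict(parallel_count(partitions)))
```

Because each partition is processed without reference to the others, adding more partitions and more workers is the single-machine counterpart of adding servers to a cluster.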

Page 11

IT’S ALIVE!!!

The fact of the matter is that ETL is far from dead, and with the help of its good friend Hadoop, it is alive and kicking. We see it too.

Our Data Integration-as-a-Service is used by our clients mainly as an ETL tool. They use Xplenty to extract data from cloud sources, process and transform the data on Hadoop, and load it back into the cloud or their data warehouse.
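
The extract, transform, load flow described above can be sketched end to end in a few lines of Python. This is a generic illustration under stated assumptions, not Xplenty's implementation: an inline CSV string stands in for a cloud source, an in-memory SQLite database stands in for the data warehouse, and the table and column names are invented.

```python
import csv
import io
import sqlite3

# Stand-in for data pulled from a cloud source (hypothetical sample).
SOURCE_CSV = "order_id,country,amount\n1,us,10.50\n2,DE,7.25\n3,us,bad\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: normalize values and skip rows that cannot be parsed."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop malformed rows instead of loading them
        yield int(row["order_id"]), row["country"].upper(), amount


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract(SOURCE_CSV)), conn)

report = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(report)  # [('DE', 7.25), ('US', 10.5)]
```

The final query is the sort of question a reporting tool would ask; it only works cleanly because the transform step already normalized country codes and dropped the unparseable row.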

Page 12

NOT GOING ANYWHERE SOON

Hadoop does provide a new option to skip ETL and process raw data in any format directly and in one location. Although this helps some professionals, the rest of us still need to process the data before we can use it, so we can’t get rid of ETL that easily.

Page 13

MULTI-FACETED

Luckily, Hadoop is proving to be an enterprise-class solution not just for Big Data in general, but also to take a load off that good old ETL and give its electric wheelchair a friendly boost.