Building a Big Data Data Warehouse - GOTO 2013
DESCRIPTION
Let's face it: data warehousing in the traditional sense has been tedious, slow, and lacking in agility. Often designed and built around a static set of up-front questions, a traditional warehouse is basically a big database tailored to the application of populating dashboards. The fact that the volume of data we need to deal with has recently exploded by several orders of magnitude isn't improving the situation. It's no surprise that we see a new class of data warehousing setups emerge, using big data technologies and NoSQL stores. A nice side effect is that these solutions are usually not only the domain of the BI crowd, but can also be developer friendly and allow development of more data-driven apps. In this talk I will present experiences using Hadoop and other tools from the Hadoop ecosystem, such as Hive, Pig, and bare MapReduce, to handle data that grows by tens of GBs per day. We create a system where data is captured, stored, and made available to different users and use cases, ranging from end users who write SQL queries to software developers who access the underlying data to create data-driven products. I will cover topics like ETL, querying, development and deployment, and reporting; using a fully open source stack, of course.

TRANSCRIPT
![Page 1: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/1.jpg)
GoDataDriven, proudly part of the Xebia Group
Building a Big Data DWH
Friso van Vollenhoven, CTO
Data Warehousing on Hadoop
![Page 2: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/2.jpg)
-- Wikipedia
“In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.”
![Page 3: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/3.jpg)
![Page 4: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/4.jpg)
![Page 5: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/5.jpg)
![Page 6: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/6.jpg)
ETL
![Page 7: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/7.jpg)
![Page 8: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/8.jpg)
How to:
• Add a column to the facts table?
• Change the granularity of dates from day to hour?
• Add a dimension based on some aggregation of facts?
![Page 9: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/9.jpg)
Schemas are designed with questions in mind.
Changing them requires redoing the ETL.
![Page 10: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/10.jpg)
Schemas are designed with questions in mind.
Changing them requires redoing the ETL.
Push things to the facts level.
Keep all source data available at all times.
![Page 11: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/11.jpg)
![Page 12: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/12.jpg)
And now?
• MPP databases?
• Faster / better / more SAN?
• (RAC?)
![Page 13: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/13.jpg)
![Page 14: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/14.jpg)
distributed storage
distributed processing
metadata + query engine
![Page 15: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/15.jpg)
![Page 16: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/16.jpg)
EXTRACT
TRANSFORM
LOAD
![Page 17: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/17.jpg)
![Page 18: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/18.jpg)
![Page 19: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/19.jpg)
![Page 20: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/20.jpg)
![Page 21: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/21.jpg)
![Page 22: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/22.jpg)
• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix Python standard library with any Java libs
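The deck doesn't show the actual scripts, but the combination described above (Python syntax, Java libs in one process) matches a Jython setup. A hedged sketch: the helper below is plain Python and runnable anywhere; the Java/Hadoop imports it would be combined with under Jython are shown only as comments, and all names here are illustrative, not from the talk.

```python
from collections import Counter

def top_browsers(lines, n=3):
    """Pure-Python helper: count browser ids (first tab-separated
    field) in log lines and return the n most common ones."""
    counts = Counter(line.split("\t")[0] for line in lines if line.strip())
    return counts.most_common(n)

# Under Jython, Java classes import like Python modules, e.g.:
#   from org.apache.hadoop.fs import FileSystem, Path   # (Jython only)
# so the same script could read files from HDFS in-process (no JVM
# startup per invocation) and reuse top_browsers() on the contents.
```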
![Page 23: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/23.jpg)
![Page 24: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/24.jpg)
• Flexible scheduling with dependencies
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
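"Flexible scheduling with dependencies" boils down to running jobs in topological order of their dependency graph. A minimal sketch of that core idea, using Python's standard-library `graphlib` (the job names are made up for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical ETL job graph: each key depends on the jobs in its set.
jobs = {
    "import_logs": set(),
    "parse_logs": {"import_logs"},
    "load_hive": {"parse_logs"},
    "report": {"load_hive"},
}

def run_order(graph):
    """Return one valid execution order respecting all dependencies."""
    return list(TopologicalSorter(graph).static_order())
```

A real scheduler adds the rest of the list above (retries, e-mail on error, a REST API), but the dependency handling is this.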
![Page 25: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/25.jpg)
![Page 26: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/26.jpg)
Deployment
git push jenkins master
![Page 27: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/27.jpg)
• Scheduling
• Simple deployment of ETL code
• Scalable
• Developer friendly
![Page 28: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/28.jpg)
![Page 29: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/29.jpg)
![Page 30: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/30.jpg)
![Page 31: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/31.jpg)
![Page 32: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/32.jpg)
![Page 33: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/33.jpg)
'februari-22 2013'
![Page 34: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/34.jpg)
![Page 35: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/35.jpg)
A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
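Dates like 'februari-22 2013' carry Dutch month names, so a default-locale parser will choke on them; and as the answer above says, malformed records do arrive. A defensive parser that returns None instead of raising keeps one bad record from failing a whole job. A hedged sketch (the function name, format, and month table are mine, not from the talk):

```python
import datetime
import re

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

def parse_dutch_date(s):
    """Parse 'maand-dd yyyy' (e.g. 'februari-22 2013') into a date.
    Returns None for malformed input instead of raising."""
    m = re.match(r"([a-z]+)-(\d{1,2}) (\d{4})$", s.strip().lower())
    if not m or m.group(1) not in DUTCH_MONTHS:
        return None
    try:
        return datetime.date(int(m.group(3)),
                             DUTCH_MONTHS[m.group(1)],
                             int(m.group(2)))
    except ValueError:  # e.g. day out of range for the month
        return None
```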
![Page 36: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/36.jpg)
![Page 37: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/37.jpg)
![Page 38: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/38.jpg)
þ
![Page 39: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/39.jpg)
þ
![Page 40: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/40.jpg)
TSV == thorn-separated values?
![Page 41: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/41.jpg)
þ == 0xFE
![Page 42: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/42.jpg)
or -2, in Hive:

```sql
CREATE TABLE browsers (
  browser_id STRING,
  browser    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
```
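(Hive reads the delimiter as a signed byte, so -2 is the same byte as 0xFE, i.e. þ.) Outside Hive, reading such a file just means splitting raw lines on that byte. A minimal sketch, assuming the fields are Latin-1 encoded (the function name is mine):

```python
# þ is the byte 0xFE in Latin-1, used here as the field delimiter.
THORN = b"\xfe"

def split_thorn_separated(raw_line):
    """Split a raw byte line on the 0xFE delimiter and decode
    each field as Latin-1."""
    return [f.decode("latin-1")
            for f in raw_line.rstrip(b"\n").split(THORN)]
```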
![Page 43: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/43.jpg)
![Page 44: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/44.jpg)
![Page 45: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/45.jpg)
![Page 46: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/46.jpg)
![Page 47: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/47.jpg)
![Page 48: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/48.jpg)
![Page 49: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/49.jpg)
• The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakenly produced (over-logging)
• Other people test in production too (and you get the data from it)
• Etc., etc.
![Page 50: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/50.jpg)
• Simple deployment of ETL code
• Scheduling
• Scalable
• Independent jobs
• Fixable data store
• Incremental where possible
• Metrics
![Page 51: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/51.jpg)
Independent jobs

source (external)
  ↓ HDFS upload + move in place
staging (HDFS)
  ↓ MapReduce + HDFS move
hive-staging (HDFS)
  ↓ Hive map external table + SELECT INTO
Hive
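The "upload + move in place" step matters because downstream jobs must never pick up half-written files: write somewhere else, then rename into the watched location in one step. A local-filesystem sketch of that pattern (HDFS offers the same trick via a rename; the function name is mine):

```python
import os
import tempfile

def upload_in_place(data, dest_path):
    """Write data to a temp file in the destination directory, then
    atomically move it in place, so readers of the directory never
    observe a partially written file."""
    directory = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # Atomic on POSIX as long as tmp and dest share a filesystem.
    os.replace(tmp_path, dest_path)
```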
![Page 52: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/52.jpg)
Out of order jobs

• At any point, you don't really know what 'made it' to Hive
• Will happen anyway, because some days the data delivery is going to be three hours late
• Or you get half in the morning and the other half later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...
![Page 53: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/53.jpg)
Fixable data store

• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives, drop the partition and re-insert
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything transactional, repair afterwards
• Use metrics to determine whether data is fit for purpose
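The drop-and-re-insert repair above can be sketched with a toy in-memory model of a partitioned table; class and method names are mine, and real Hive would do this with `ALTER TABLE ... DROP PARTITION` followed by a fresh insert:

```python
class PartitionedTable:
    """Toy model of a Hive-style partitioned table with per-partition
    metrics, repaired by drop + re-insert rather than transactions."""

    def __init__(self):
        self.partitions = {}  # e.g. {"2013-02-22": [records]}
        self.metrics = {}     # per-partition record counts

    def insert(self, day, records):
        self.partitions[day] = list(records)
        self.metrics[day] = len(records)

    def repair(self, day, corrected_records):
        # Drop the partition AND reset its metrics, then re-insert.
        self.partitions.pop(day, None)
        self.metrics.pop(day, None)
        self.insert(day, corrected_records)
```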
![Page 54: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/54.jpg)
Metrics
![Page 55: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/55.jpg)
Metrics service
• Job ran: so many units processed, took so much time
  • e.g. 10GB imported, took 1 hr
  • e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
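A metrics service of this kind only needs to record (job, units, duration) events and answer simple questions over them. A minimal sketch; the class and method names are illustrative, not from the talk:

```python
class Metrics:
    """Record 'job ran, N units, S seconds' events and derive
    throughput from them."""

    def __init__(self):
        self.events = []  # list of (job_name, units, seconds)

    def job_ran(self, name, units, seconds):
        self.events.append((name, units, seconds))

    def throughput(self, name):
        """Units per second over all recorded runs of a job."""
        runs = [(u, s) for n, u, s in self.events if n == name]
        total_units = sum(u for u, _ in runs)
        total_secs = sum(s for _, s in runs)
        return total_units / total_secs if total_secs else 0.0
```

Comparing today's throughput and record counts against history is what tells you a delivery was late, half-sized, or over-logged.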
![Page 56: Building a Big data data warehouse - goto 2013](https://reader035.vdocuments.us/reader035/viewer/2022081403/554c2802b4c9050d1b8b49d3/html5/thumbnails/56.jpg)
GoDataDriven
We’re hiring / Questions? / Thank you!
Friso van Vollenhoven, CTO