TRANSCRIPT
Exploiting MySQL 5.1 For Advanced Business Intelligence Applications
Matt Casters: Chief Architect, Data Integration and Kettle Project Founder
MySQL User Conference, Tuesday April 24, 2007
Agenda
Great news
Pentaho Data Integration introduction
Pentaho Data Integration MySQL support
Table partitioning
Database partitioning
Clustering
Metadata to the rescue
Pentaho company overview & platform demo
Questions & closing
Great news
The San Francisco Giants beat the Arizona Diamondbacks last Saturday
Barry Zito pitched a great game
The hot dogs were great
Pentaho news
Mondrian 2.3 was released a few weeks ago
Detailed cache control (clearing)
Sun Microsystems & Pentaho announced that Pentaho Reporting will be used to drive the Business Intelligence requirements in OpenOffice.org 2.3, to be released in the second half of this year.
Tomorrow we’ll make another announcement.
Pentaho Data Integration introduction
From source systems …
to the data warehouse …
to reports …
to analyses …
to dashboard reports …
to better information
Data Transformation and Integration Examples
Data filtering: is not null, greater than, less than, includes
Field manipulation: trimming, padding, upper- and lowercase conversion
Data calculations: + - × /, average, absolute value, arctangent, natural logarithm
Date manipulation: first day of month, last day of month, add months, week of year, day of year
Data type conversion: string to number, number to string, date to number
Merging fields & splitting fields
Looking up data: look up in a database, in a text file, an Excel sheet, …
Pentaho Data Integration (Kettle) Components
Spoon: connect to data sources; define transformation rules and design target schema(s); graphical job execution workflow engine for defining multi-stage and conditional transformation jobs
Pan: command-line execution of single, pre-defined transformation jobs
Kitchen: scheduler for multi-stage jobs
Carte: remote execution “slave” server with embedded web server
Pentaho BI Platform: integrated scheduling of transformations or jobs; ability to call real-time transformations and use output in reports and dashboards
Data Transformation and Integration Examples
Demo time: load 1M rows into a database table
Case Study: Pentaho Data Integration
Organization: Flemish Government Traffic Centre
Use case: Monitoring the state of the road network
Application requirement: Integrate minute-by-minute data from 570 highway locations for analysis (1550 lanes)
Technical challenges: Large volume of data, more than 2.5 billion rows
Business Usage: Users can now compare traffic speeds based on weather conditions, time of day, date, season
Best practices:
Clearly understand business user requirements first
There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way
Summary and Resources
Pentaho and MySQL can help you manage your data infrastructure
Extraction, Transformation and Loading for Data Warehousing and Data Migration
kettle.pentaho.org: Kettle project homepage
kettle.javaforge.com: Kettle community website (forum, source, documentation, tech tips, samples, …)
www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey
www.mysql.com
White paper: http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
Kettle webinar: http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
Roland Bouman blog on Pentaho Data Integration and MySQL: http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html
Pentaho Data Integration MySQL support
Reading: MySQL Result Streaming (cursor emulation) support
Writing: MySQL dialects for data types
Job entry: Bulk Loader of text files for MySQL
Job entry: Bulk writer to a text file for MySQL
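Roughly speaking, these job entries correspond to MySQL's own bulk statements. A minimal sketch of the equivalent raw SQL, with illustrative file paths and field separators (not taken from the slides):

-- bulk-load a delimited text file into the sales table
LOAD DATA INFILE '/tmp/sales.txt'
INTO TABLE sales
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n';

-- bulk-write the sales table back out to a text file on the server
SELECT *
INTO OUTFILE '/tmp/sales_export.txt'
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
FROM sales;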
Table partitioning
Challenges:
More data is being gathered all the time
Data is coming from more sources than ever
Faster access to stored information is becoming more important
More people require concurrent access to the data
The old solution: split up the data into different tables
SALES_2003, SALES_2004, … SALES_2006
Hardcode reports to use one or the other table
Lots of database management issues
High report maintenance costs
Nothing is automatic, everything is manual
Table partitioning
The new solution: MySQL 5.1 partitioned tables
Automatic split-up of data into smaller chunks
SQL engine that can automatically pull results back together
Table partitioning
[Diagram: a single Sales table transparently split into Year 2003, 2004, 2005 and 2006 partitions]
Table partitioning
Advantages:
Reduces query time by reducing the amount of data to “plough” through
Increases performance by:
“Pruning” the list of partitions to search automatically; this is done by the MySQL query optimizer based on the query being issued (a sketch follows below)
Massive reduction in I/O
Smaller partitioned indexes, leading to faster index tree traversal
Allowing parallel access to the different partitions
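A minimal sketch of how pruning can be observed, assuming the sales table created on the next slide; EXPLAIN PARTITIONS reports which partitions the optimizer will actually read:

EXPLAIN PARTITIONS
SELECT ordernr, discount_pct
FROM sales
WHERE sales_date >= '2001-01-01'
  AND sales_date <  '2002-01-01';
-- the "partitions" column of the output should list only P2001,
-- so the other partitions are never touched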
Table partitioning
How do I create a partitioned table?
CREATE TABLE sales
(
sales_date TIMESTAMP,
ordernr INT,
customernr INT,
productnr INT,
discount_pct TINYINT
)
PARTITION BY RANGE( YEAR(sales_date) )
(
PARTITION P2000 VALUES LESS THAN (2001),
PARTITION P2001 VALUES LESS THAN (2002)
)
Table partitioning
How do I add another partition to the table?
ALTER TABLE sales
ADD PARTITION
(
PARTITION P2002 VALUES LESS THAN (2003)
)
Table partitioning
How do I drop a partition from the table?
ALTER TABLE sales
DROP PARTITION P2002
;
Table partitioning
Retrieving partition information

mysql> desc information_schema.partitions;
+-------------------------------+--------------+------+-----+---------+-------+
| Field                         | Type         | Null | Key | Default | Extra |
+-------------------------------+--------------+------+-----+---------+-------+
| TABLE_CATALOG                 | varchar(512) | YES  |     | NULL    |       |
| TABLE_SCHEMA                  | varchar(64)  | NO   |     |         |       |
| TABLE_NAME                    | varchar(64)  | NO   |     |         |       |
| PARTITION_NAME                | varchar(64)  | YES  |     | NULL    |       |
| SUBPARTITION_NAME             | varchar(64)  | YES  |     | NULL    |       |
| PARTITION_ORDINAL_POSITION    | bigint(21)   | YES  |     | NULL    |       |
| SUBPARTITION_ORDINAL_POSITION | bigint(21)   | YES  |     | NULL    |       |
| PARTITION_METHOD              | varchar(12)  | YES  |     | NULL    |       |
| SUBPARTITION_METHOD           | varchar(12)  | YES  |     | NULL    |       |
| PARTITION_EXPRESSION          | longtext     | YES  |     | NULL    |       |
| SUBPARTITION_EXPRESSION       | longtext     | YES  |     | NULL    |       |
| PARTITION_DESCRIPTION         | longtext     | YES  |     | NULL    |       |
| TABLE_ROWS                    | bigint(21)   | NO   |     | 0       |       |
| AVG_ROW_LENGTH                | bigint(21)   | NO   |     | 0       |       |
| DATA_LENGTH                   | bigint(21)   | NO   |     | 0       |       |
| MAX_DATA_LENGTH               | bigint(21)   | YES  |     | NULL    |       |
| INDEX_LENGTH                  | bigint(21)   | NO   |     | 0       |       |
| DATA_FREE                     | bigint(21)   | NO   |     | 0       |       |
| CREATE_TIME                   | datetime     | YES  |     | NULL    |       |
| UPDATE_TIME                   | datetime     | YES  |     | NULL    |       |
| CHECK_TIME                    | datetime     | YES  |     | NULL    |       |
| CHECKSUM                      | bigint(21)   | YES  |     | NULL    |       |
| PARTITION_COMMENT             | varchar(80)  | NO   |     |         |       |
| NODEGROUP                     | varchar(12)  | NO   |     |         |       |
| TABLESPACE_NAME               | varchar(64)  | YES  |     | NULL    |       |
+-------------------------------+--------------+------+-----+---------+-------+
Table partitioning
Retrieving partition information

SELECT partition_name
FROM information_schema.partitions
WHERE table_name = 'sales';
+----------------+
| partition_name |
+----------------+
| P2000          |
| P2001          |
| P2002          |
| P2003          |
| P2004          |
| P2005          |
| P2006          |
| P2007          |
| P2008          |
| P2009          |
| P2010          |
| P2011          |
+----------------+
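As a small extension of the same idea (not on the slide), the row count per partition can be read from the same view:

SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_schema = DATABASE()
  AND table_name = 'sales';
-- TABLE_ROWS is an estimate for some storage engines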
Table partitioning
Use-case: Flemish Traffic Centre
Medium-sized data set: 514M rows
Partitioned by week (see the sketch below): PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
80GB of data in hundreds of partitions
1 row of data: speed, number of vehicles (5 classes), etc.
Per minute
Per lane of traffic
For 1550 detectors: 815M rows of data per year
A test query aggregating speed & counts per road position per minute:
Gave back 642,319 rows
9 seconds for MySQL to get the result
29 seconds to pass the data to the client over JDBC
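A minimal sketch of what that weekly range partitioning could look like; the table name and the columns other than mtime are illustrative assumptions, not the Traffic Centre's actual schema:

CREATE TABLE traffic_minute
(
  mtime       DATETIME NOT NULL,  -- measurement minute
  detector_id INT      NOT NULL,  -- one row per lane detector per minute
  speed       SMALLINT,           -- average speed in that minute
  vehicles    SMALLINT            -- vehicle count in that minute
)
PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
(
  PARTITION P200716 VALUES LESS THAN (200717),
  PARTITION P200717 VALUES LESS THAN (200718),
  PARTITION PMAX    VALUES LESS THAN MAXVALUE
);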
Table partitioning
Should we do table partitioning? Yes!
However…
Even though you can do sub-partitioning, you need to partition the data in one particular way: by range, by list, …
The entry point for optimization is always going to be that partitioning method
Pruning is always going to be based on that partitioning method
Queries across the whole data set don’t always benefit, though they don’t pay a penalty either (example below)
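For example, a hypothetical query against the sales table from earlier that filters on a column outside the partitioning expression cannot be pruned:

EXPLAIN PARTITIONS
SELECT *
FROM sales
WHERE customernr = 42;
-- the "partitions" column of the output lists every partition, so the
-- whole data set is searched (but the query pays no extra penalty)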
More detailed information:
Partitioning in MySQL 5.1 and Onwards
Mikael Ronström, Senior Software Architect, MySQL AB
Tomorrow 11:50am – 12:35pm, Ballroom A
How can we improve upon this?
Database partitioning
Challenge:
Dealing with even greater data sets
The need to present results even faster
The old solution:
Large boxes
Huge amounts of RAM
Very fast disk systems (Fibre-attached, SAN, …)
Multi-CPU monsters
$$$ Very expensive and complicated systems $$$
The new solution: 2 or more MySQL servers
Basic, off-the-shelf hardware
Standard software
Simple configurations
Pentaho Data Integration & MySQL 5.1
Database partitioning
[Diagram: the partitioned Sales table (Year 2003, 2004, 2005 and 2006 partitions) distributed across four MySQL servers, DB1 through DB4, each holding a share of the 2003–2006 sales data]
Database partitioning
Split the data over several databases
Pick a splitting mechanism in advance
Divide the load over the databases
Make the databases smaller
Increase the query performance
Increase the load performance
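A minimal sketch of the splitting mechanism from the list above, with hypothetical instance names; in practice the routing is configured in the Kettle partitioning schema rather than in SQL:

-- on each MySQL instance (DB1 .. DB4), create the same table
CREATE TABLE sales
(
  sales_date   TIMESTAMP,
  ordernr      INT,
  customernr   INT,
  productnr    INT,
  discount_pct TINYINT
);

-- splitting rule picked in advance (illustrative): a row goes to the
-- instance numbered MOD(customernr, 4); the same rule is applied again
-- when reading the data back and merging the result sets.
-- To verify the spread afterwards, run on every instance:
SELECT COUNT(*) AS rows_on_this_instance FROM sales;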
Database partitioning
How do you set it up?First create a partitioned / clustered database connection
Then create a Partitioning Schema
Database partitioning
Demo time:
Creating the partitions, partitioned
Loading data, partitioned
Reading back data, partitioned
Reading back data, partitioned and ordered
Database partitioning
Use-case: Flemish Traffic Centre
Medium-sized data set: 514M rows
Same partitioning as before
Same table layout, engine and configuration
Partitioned over 3 databases on detector (per lane)
A test query aggregating speed & counts per road position per minute:
Gave back 642,319 rows
3 seconds for MySQL to get the result
10 seconds to pass the data to the client over JDBC
Demonstrating almost linear scalability!
Database partitioning
Should we do database partitioning? Yes!
However…
The scalability is limited to the raw CPU power available on a single system
This poses a scalability problem when:
Dealing with larger numbers of partitions
Reading back large volumes of data
Enter Pentaho Data Integration: Clustering
Pentaho Data Integration : Clustering
[Diagram: each slave server (Server X) works against its own database partition (DBx) holding a share of the 2003–2006 Sales data]
Pentaho Data Integration : Clustering
The challenge is to eliminate CPU bottlenecks
Pentaho Data Integration “Carte” comes to the rescue
Can be started up on any system, turning it into a “slave server”
The combination of database partitioning and PDI clustering:
Will split the database partitions among the slave servers
Will split the total CPU load over the slave servers
Will split the total I/O load over the database partitions (MySQL instances)
Pentaho Data Integration : Clustering
Demo time:
Start up 2 slave servers
Run a step across the 2 servers
Monitor
Pentaho Metadata to the rescue
The problem:
Reporting becomes harder on a database-partitioned system
Instead of 1 database, you read from a bunch of them
This is where our new project, Pentaho Metadata, comes in
Due in a few months
Auto-generates SQL now
Will auto-generate PDI transformations in the near future
Will auto-deliver the data to your report in the platform
Will make it very easy for your users to create drag-and-drop reports
Pentaho Introduction
World’s most popular enterprise open source BI suite
2 million lifetime downloads, averaging 100K / month
Founded in 2004: pioneer in professional open source BI
Key projects:
JFreeReport (Reporting)
Kettle (Data Integration)
Mondrian (OLAP)
Pentaho BI Platform
Weka (Data Mining)
Management and board:
Proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle
Open source leaders: Larry Augustin, New Enterprise Associates, Index Ventures
MySQL Gold Partner
Pentaho Introduction
[Diagram: BI pyramid from operational data sources (Sales, Marketing, Inventory, Financial, Production) through departmental reports, aggregates and analysis up to strategic scorecards]
Pentaho Introduction
Demo and overview
Questions and Closing
?
Birds of a Feather
MySQL Data Warehousing and BI
Tonight at 7:30
Camino Real
With:
Chief Geek James Dixon: Senior Architect & CTO, Pentaho
Julian Hyde: OLAP Architect and Mondrian Project Founder, Pentaho
Brian Miezejewski: Principal Consultant, MySQL
Roland Bouman: Certification Developer, MySQL
MySELF™
A few bottles of beer