TRANSCRIPT
Exploiting MySQL 5.1 For Advanced Business Intelligence Applications
Matt Casters: Chief Architect, Data Integration and Kettle Project Founder
MySQL User Conference, Tuesday April 24, 2007
Agenda
Great news
Pentaho Data Integration introduction
Pentaho Data Integration MySQL support
Table partitioning
Database partitioning
Clustering
Metadata to the rescue
Pentaho company overview & platform demo
Questions & closing
Great news
The San Francisco Giants beat the Arizona Diamondbacks last Saturday
Barry Zito pitched a great game
The hot dogs were great
Pentaho news
Mondrian 2.3 was released a few weeks ago
Detailed cache control (clearing)
Sun Microsystems & Pentaho announced that Pentaho Reporting will be used to drive the Business Intelligence requirements in OpenOffice.org 2.3, to be released in the second half of this year.
Tomorrow we’ll make another announcement.
Pentaho Data Integration introduction
From source systems …
to the data warehouse …
to reports …
to analyses …
to dashboard reports …
to better information
Data Transformation and Integration Examples
Data filtering: is not null, greater than, less than, includes
Field manipulation: trimming, padding, upper- and lowercase conversion
Data calculations: + - × /, average, absolute value, arctangent, natural logarithm
Date manipulation: first day of month, last day of month, add months, week of year, day of year
Data type conversion: string to number, number to string, date to number
Merging fields & splitting fields
Looking up data: look up in a database, in a text file, an Excel sheet, …
Pentaho Data Integration (Kettle) Components
Spoon: connect to data sources; define transformation rules and design target schema(s); graphical job execution workflow engine for defining multi-stage and conditional transformation jobs
Pan: command-line execution of single, pre-defined transformation jobs
Kitchen: scheduler for multi-stage jobs
Carte: remote execution “slave” server with embedded web server
Pentaho BI Platform: integrated scheduling of transformations or jobs; ability to call real-time transformations and use output in reports and dashboards
Data Transformation and Integration Examples
Demo time: load 1M rows into a database table
Case Study: Pentaho Data Integration
Organization: Flemish Government Traffic Centre
Use case: Monitoring the state of the road network
Application requirement: Integrate minute-by-minute data from 570 highway locations for analysis (1550 lanes)
Technical challenges: Large volume of data, more than 2.5 billion rows
Business Usage: Users can now compare traffic speeds based on weather conditions, time of day, date, season
Best practices:
Clearly understand business user requirements first
There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way
Summary and Resources
Pentaho and MySQL can help you manage your data infrastructure
Extraction, Transformation and Loading for Data Warehousing and Data Migration
kettle.pentaho.org: Kettle project homepage
kettle.javaforge.com: Kettle community website (forum, source, documentation, tech tips, samples, …)
www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey
www.mysql.com
White paper: http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
Kettle webinar: http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
Roland Bouman blog on Pentaho Data Integration and MySQL: http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html
Pentaho Data Integration MySQL support
Reading: MySQL Result Streaming (cursor emulation) support
Writing: MySQL dialects for data types
Job entry: Bulk Loader of text files for MySQL
Job entry: Bulk writer to a text file for MySQL
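Roughly speaking, these job entries correspond to MySQL's own bulk statements. A minimal sketch of the equivalent raw SQL, with illustrative file paths and field separators (not taken from the slides):

-- bulk-load a delimited text file into the sales table
LOAD DATA INFILE '/tmp/sales.txt'
INTO TABLE sales
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n';

-- bulk-write the sales table back out to a text file on the server
SELECT *
INTO OUTFILE '/tmp/sales_export.txt'
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
FROM sales;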
Table partitioning
Challenges:
More data is being gathered all the time
Data is coming from more sources than ever
Faster access to stored information is becoming more important
More people require concurrent access to the data
The old solution: split up the data into different tables
SALES_2003, SALES_2004, … SALES_2006
Hardcode reports to use one or the other table
Lots of database management issues
High report maintenance costs
Nothing is automatic, everything is manual
Table partitioning
The new solution: MySQL 5.1 partitioned tables
Automatic split-up of data into smaller chunks
SQL engine that can automatically pull results back together
Table partitioning
[Diagram: a single Sales table transparently split into Year 2003, 2004, 2005 and 2006 partitions]
Table partitioning
Advantages:
Reduces query time by reducing the amount of data to “plough” through
Increases performance by:
“Pruning” the list of partitions to search automatically; this is done by the MySQL query optimizer based on the query being issued (a sketch follows below)
Massive reduction in I/O
Smaller partitioned indexes, leading to faster index tree traversal
Allowing parallel access to the different partitions
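A minimal sketch of how pruning can be observed, assuming the sales table created on the next slide; EXPLAIN PARTITIONS reports which partitions the optimizer will actually read:

EXPLAIN PARTITIONS
SELECT ordernr, discount_pct
FROM sales
WHERE sales_date >= '2001-01-01'
  AND sales_date <  '2002-01-01';
-- the "partitions" column of the output should list only P2001,
-- so the other partitions are never touched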
Table partitioning
How do I create a partitioned table?
CREATE TABLE sales
(
sales_date TIMESTAMP,
ordernr INT,
customernr INT,
productnr INT,
discount_pct TINYINT
)
PARTITION BY RANGE( YEAR(sales_date) )
(
PARTITION P2000 VALUES LESS THAN (2001),
PARTITION P2001 VALUES LESS THAN (2002)
)
Table partitioning
How do I add another partition to the table?
ALTER TABLE sales
ADD PARTITION
(
PARTITION P2002 VALUES LESS THAN (2003)
)
Table partitioning
How do I drop a partition from the table?
ALTER TABLE sales
DROP PARTITION P2002
;
Table partitioning
Retrieving partition information

mysql> desc information_schema.partitions;
+-------------------------------+--------------+------+-----+---------+-------+
| Field                         | Type         | Null | Key | Default | Extra |
+-------------------------------+--------------+------+-----+---------+-------+
| TABLE_CATALOG                 | varchar(512) | YES  |     | NULL    |       |
| TABLE_SCHEMA                  | varchar(64)  | NO   |     |         |       |
| TABLE_NAME                    | varchar(64)  | NO   |     |         |       |
| PARTITION_NAME                | varchar(64)  | YES  |     | NULL    |       |
| SUBPARTITION_NAME             | varchar(64)  | YES  |     | NULL    |       |
| PARTITION_ORDINAL_POSITION    | bigint(21)   | YES  |     | NULL    |       |
| SUBPARTITION_ORDINAL_POSITION | bigint(21)   | YES  |     | NULL    |       |
| PARTITION_METHOD              | varchar(12)  | YES  |     | NULL    |       |
| SUBPARTITION_METHOD           | varchar(12)  | YES  |     | NULL    |       |
| PARTITION_EXPRESSION          | longtext     | YES  |     | NULL    |       |
| SUBPARTITION_EXPRESSION       | longtext     | YES  |     | NULL    |       |
| PARTITION_DESCRIPTION         | longtext     | YES  |     | NULL    |       |
| TABLE_ROWS                    | bigint(21)   | NO   |     | 0       |       |
| AVG_ROW_LENGTH                | bigint(21)   | NO   |     | 0       |       |
| DATA_LENGTH                   | bigint(21)   | NO   |     | 0       |       |
| MAX_DATA_LENGTH               | bigint(21)   | YES  |     | NULL    |       |
| INDEX_LENGTH                  | bigint(21)   | NO   |     | 0       |       |
| DATA_FREE                     | bigint(21)   | NO   |     | 0       |       |
| CREATE_TIME                   | datetime     | YES  |     | NULL    |       |
| UPDATE_TIME                   | datetime     | YES  |     | NULL    |       |
| CHECK_TIME                    | datetime     | YES  |     | NULL    |       |
| CHECKSUM                      | bigint(21)   | YES  |     | NULL    |       |
| PARTITION_COMMENT             | varchar(80)  | NO   |     |         |       |
| NODEGROUP                     | varchar(12)  | NO   |     |         |       |
| TABLESPACE_NAME               | varchar(64)  | YES  |     | NULL    |       |
+-------------------------------+--------------+------+-----+---------+-------+
Table partitioning
Retrieving partition information

SELECT partition_name
FROM information_schema.partitions
WHERE table_name = 'sales';
+----------------+
| partition_name |
+----------------+
| P2000          |
| P2001          |
| P2002          |
| P2003          |
| P2004          |
| P2005          |
| P2006          |
| P2007          |
| P2008          |
| P2009          |
| P2010          |
| P2011          |
+----------------+
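As a small extension of the same idea (not on the slide), the row count per partition can be read from the same view:

SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_schema = DATABASE()
  AND table_name = 'sales';
-- TABLE_ROWS is an estimate for some storage engines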
Table partitioning
Use-case: Flemish Traffic Centre
Medium-sized data set: 514M rows
Partitioned by week (see the sketch below): PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
80GB of data in hundreds of partitions
1 row of data: speed, number of vehicles (5 classes), etc.
Per minute
Per lane of traffic
For 1550 detectors: 815M rows of data per year
A test query aggregating speed & counts per road position per minute:
Gave back 642,319 rows
9 seconds for MySQL to get the result
29 seconds to pass the data to the client over JDBC
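A minimal sketch of what that weekly range partitioning could look like; the table name and the columns other than mtime are illustrative assumptions, not the Traffic Centre's actual schema:

CREATE TABLE traffic_minute
(
  mtime       DATETIME NOT NULL,  -- measurement minute
  detector_id INT      NOT NULL,  -- one row per lane detector per minute
  speed       SMALLINT,           -- average speed in that minute
  vehicles    SMALLINT            -- vehicle count in that minute
)
PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
(
  PARTITION P200716 VALUES LESS THAN (200717),
  PARTITION P200717 VALUES LESS THAN (200718),
  PARTITION PMAX    VALUES LESS THAN MAXVALUE
);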
Table partitioning
Should we do table partitioning? Yes!
However…
Even though you can do sub-partitioning, you need to partition the data in one particular way: by range, by list, …
The entry point for optimization is always going to be that partitioning method
Pruning is always going to be based on that partitioning method
Queries across the whole data set don’t always benefit, though they don’t pay a penalty either (example below)
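For example, a hypothetical query against the sales table from earlier that filters on a column outside the partitioning expression cannot be pruned:

EXPLAIN PARTITIONS
SELECT *
FROM sales
WHERE customernr = 42;
-- the "partitions" column of the output lists every partition, so the
-- whole data set is searched (but the query pays no extra penalty)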
More detailed information:
Partitioning in MySQL 5.1 and Onwards
Mikael Ronström, Senior Software Architect, MySQL AB
Tomorrow 11:50am – 12:35pm, Ballroom A
How can we improve upon this?
Database partitioning
Challenge:
Dealing with even greater data sets
The need to present results even faster
The old solution:
Large boxes
Huge amounts of RAM
Very fast disk systems (Fibre-attached, SAN, …)
Multi-CPU monsters
$$$ Very expensive and complicated systems $$$
The new solution: 2 or more MySQL servers
Basic, off-the-shelf hardware
Standard software
Simple configurations
Pentaho Data Integration & MySQL 5.1
Database partitioning
[Diagram: the partitioned Sales table (Year 2003, 2004, 2005 and 2006 partitions) distributed across four MySQL servers, DB1 through DB4, each holding a share of the 2003–2006 sales data]
Database partitioning
Split the data over several databases
Pick a splitting mechanism in advance
Divide the load over the databases
Make the databases smaller
Increase the query performance
Increase the load performance
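A minimal sketch of the splitting mechanism from the list above, with hypothetical instance names; in practice the routing is configured in the Kettle partitioning schema rather than in SQL:

-- on each MySQL instance (DB1 .. DB4), create the same table
CREATE TABLE sales
(
  sales_date   TIMESTAMP,
  ordernr      INT,
  customernr   INT,
  productnr    INT,
  discount_pct TINYINT
);

-- splitting rule picked in advance (illustrative): a row goes to the
-- instance numbered MOD(customernr, 4); the same rule is applied again
-- when reading the data back and merging the result sets.
-- To verify the spread afterwards, run on every instance:
SELECT COUNT(*) AS rows_on_this_instance FROM sales;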
Database partitioning
How do you set it up?First create a partitioned / clustered database connection
Then create a Partitioning Schema
Database partitioning
Demo time:
Creating the partitions, partitioned
Loading data, partitioned
Reading back data, partitioned
Reading back data, partitioned and ordered
Database partitioning
Use-case: Flemish Traffic Centre
Medium-sized data set: 514M rows
Same partitioning as before
Same table layout, engine and configuration
Partitioned over 3 databases on detector (per lane)
A test query aggregating speed & counts per road position per minute:
Gave back 642,319 rows
3 seconds for MySQL to get the result
10 seconds to pass the data to the client over JDBC
Demonstrating almost linear scalability!
Database partitioning
Should we do database partitioning? Yes!
However…
The scalability is limited to the raw CPU power available on a single system
This poses a scalability problem when:
Dealing with larger numbers of partitions
Reading back large volumes of data
Enter Pentaho Data Integration: Clustering
Pentaho Data Integration : Clustering
[Diagram: each slave server (Server X) works against its own database partition (DBx) holding a share of the 2003–2006 Sales data]
Pentaho Data Integration : Clustering
The challenge is to eliminate CPU bottlenecks
Pentaho Data Integration “Carte” comes to the rescue
Can be started up on any system, turning it into a “slave server”
The combination of database partitioning and PDI clustering:
Will split the database partitions among the slave servers
Will split the total CPU load over the slave servers
Will split the total I/O load over the database partitions (MySQL instances)
Pentaho Data Integration : Clustering
Demo time:
Start up 2 slave servers
Run a step across the 2 servers
Monitor
Pentaho Metadata to the rescue
The problem:
Reporting becomes harder on a database-partitioned system
Instead of 1 database, you read from a bunch of them
This is where our new project, Pentaho Metadata, comes in
Due in a few months
Auto-generates SQL now
Will auto-generate PDI transformations in the near future
Will auto-deliver the data to your report in the platform
Will make it very easy for your users to create drag-and-drop reports
Pentaho Introduction
World’s most popular enterprise open source BI suite
2 million lifetime downloads, averaging 100K / month
Founded in 2004: pioneer in professional open source BI
Key projects:
JFreeReport (Reporting)
Kettle (Data Integration)
Mondrian (OLAP)
Pentaho BI Platform
Weka (Data Mining)
Management and board:
Proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle
Open source leaders: Larry Augustin, New Enterprise Associates, Index Ventures
MySQL Gold Partner
Pentaho Introduction
[Diagram: BI pyramid from operational data sources (Sales, Marketing, Inventory, Financial, Production) through departmental reports, aggregates and analysis up to strategic scorecards]
Pentaho Introduction
Demo and overview
Questions and Closing
?
Birds of a Feather
MySQL Data Warehousing and BI
Tonight at 7:30
Camino Real
With:
Chief Geek James Dixon: Senior Architect & CTO, Pentaho
Julian Hyde: OLAP Architect and Mondrian Project Founder, Pentaho
Brian Miezejewski: Principal Consultant, MySQL
Roland Bouman: Certification Developer, MySQL
MySELF™
A few bottles of beer