add real-time streaming sql to your mysql skill set julian hyde - chief architect steve herskovitz...
TRANSCRIPT
Add Real-time Streaming SQL to Your MySQL Skill Set
Julian Hyde - Chief Architect
Steve Herskovitz – Director of Professional Services
The Data Crunch
» Data volumes rising fast
» Human-originated data (e.g. e-commerce purchases) rising fast
» Machine-generated data (e.g. e-commerce events and network
packets) rising even faster
» Every business needs answers with lower latency
» Every significant problem is distributed:
» Geographically distributed organizations
» Multiple boxes for scale
» Exploit multiple cores
Data management is hard
» If you make a mistake, the system won’t be fast enough
» Can’t afford to lose data
» New technologies are very difficult to use
» MapReduce
» NoSQL
» Multi-threaded programming in Java, C++, Erlang, Scala, …
» Collaborate, interoperate, evolve
Today’s Computing Model
4
Database$$$$
TransactionProcessingApplication
InfrastructureApplication
Log file
Real-TimeApplication
BusinessEvents
BusinessTransactions
Batch LoadProcess
Polling
Stream Computing Model
5
Database$$
TransactionProcessingApplication
InfrastructureApplication
Real-TimeApplication
StreamProcesso
r
Business
Events
BusinessTransactions
Real-TimeAnswers
Traditional BI Application
Case study: Mozilla
Demo: Mozilla downloads
SQL – life in the old dinosaur yet
» Widely spoken
» Rich
» Orthogonal
» Declarative
» Tune your system without
changing your logical schema
» Apps don’t interfere with each
other
» Adaptive
» Route around failure
» Exploit available resources
» Make tradeoffs to meet QoS goals
Streaming SQL: example #1
Tweets about this conference:
» SELECT STREAM ROWTIME, author, text
FROM Tweets
WHERE text LIKE ‘%#MySQL%'
Demo: studio & simple query
Streaming SQL basics
» Streams:
» CREATE STREAM Tweets (
author VARCHAR(20),
text VARCHAR(140));
» Relational operators have streaming counterparts:
» Project (SELECT)
» Filter (WHERE)
» Union
» Join
» Aggregation (GROUP BY)
» Windowed aggregation (e.g. SUM(x) OVER window)
» Sort (ORDER BY)
Streaming SQL: example #2
» Each minute, return the number of clicks on each web
page:
» SELECT STREAM ROWTIME, uri, COUNT(*)
FROM PageRequests
GROUP BY FLOOR(ROWTIME TO MINUTE), uri
Streaming SQL: Time
» ROWTIME pseudo-column
» Provided by source application or generated by system
» WINDOW
» Present in regular SQL (e.g. SQL:2003) but more important in
streaming SQL
» Defines a ‘working set’ for streaming JOIN, GROUP BY, windowed
aggregation
» Monotonicity (“sortedness”)
» Prerequisite for certain streaming operations
Streaming SQL: example #3
Find all orders from New York that shipped within an hour:
» CREATE VIEW compliant_orders AS
SELECT STREAM *
FROM orders OVER sla
JOIN shipments
ON orders.id = shipments.orderid
WHERE city = 'New York'
WINDOW sla AS (RANGE INTERVAL '1' HOUR PRECEDING)
Streaming SQL: other stuff
» Schemas, views, tables
» Ability to nest queries
» User-defined functions and transforms
» Adapters make external systems look like read/write streams
Streaming SQL: example #4
Find all stock trades where the average price over the last ten trades is
two standard deviations higher than the average over the last hour:» SELECT STREAM *
FROM (
SELECT STREAM
ticker,
price,
volume,
AVG(price) OVER lastHour AS avgHr,
STDDEV(price) OVER lastHour AS stddevHr,
AVG(price) OVER lastTenTrades AS avg10
FROM Trades
WINDOW
lastTenTrades AS (PARTITION BY ticker ROWS 10 PRECEDING)),
lastHour AS (PARTITION BY ticker RANGE INTERVAL ’1’ HOUR PRECEDING)
WHERE avg10 > avgHr + 2 * stddevHr
Streaming SQL for business intelligence
Conventional BI:
» Star schema:
» Fact table
» Dimension tables
» Aggregate tables
» Data warehouse populated using an ETL process
» OLAP servers (e.g. Mondrian) provide a top-down view of data
Challenge:
» Keep all of these systems up to date in real time
» Alert when key metrics are outside acceptable range
ETL Process for OLAP
OLAP
Operationaldatabase
Datawarehouse
Conventional ETL
Aggregate tables populated from DW
OLAP cache flushed after load
Real-time OLAP: Challenges
OLAP imperatives
Highly aggregated data – e.g.
one number computed from
10M rows
Therefore:
1. Use a cache
2. Materialize results as
aggregates
Real-time imperatives
View latest version of the data
Maintaining N aggregates
requires ~N blocks of I/O per
incoming row
Therefore:
1. Don’t use a cache
2. Don’t maintain aggregates
Real-time OLAP: Solutions
1. Notify cache when underlying data has changed
1. Populate cache from data warehouse
2. Continuous ETL process
2. Build aggregates in memory
1. Flush to disk intermittently
2. OLAP engine looks for aggregates in memory first
Continuous ETL for Real-time OLAP
OLAP
Operationaldatabase
Datawarehouse
SQLstream
Aggregate tables populated incrementally
OLAP cache flushed
proactively
Summary
1. Data problems are getting harder
2. People are trying – and failing – to solve these problems
with SQL databases
3. Stream computing is a powerful new kind of platform
4. Streaming SQL is pragmatic and powerful
Any questions?
Thank you for attending!
Further reading:
» “Data in Flight” by Julian Hyde (Communications of the
ACM, Vol. 53 No. 1, Pages 48-52)
Blog: http://julianhyde.blogspot.com/
Twitter: @julianhyde