© 2014 IBM Corporation
z/OS and LUW
Big SQL 3.0 Fast and easy SQL on Hadoop
Wilfried Hoge, IT Architect Big Data, @wilfriedhoge
Hadoop Observations
Technology
• Rapid innovation
• Two sources of innovation
– Open source community
– Integration of existing technologies
• Tools and application vendors selecting partners and integrating
Customers
• High degree of interest
• Many experimental workstreams
• ROI establishment varies by use case
• Many customers want to offload data from EDW
Vendors
• Multiple business models
• OSS support vendors have mindshare lead
• Viability of the OSS support vendors' business model is unclear
• SW portfolio vendors integrating/adding
InfoSphere BigInsights provides Enterprise Grade Hadoop analytics
• Manages a wide variety and huge volume of data
• Augments open source Hadoop with enterprise capabilities
– Visualization & Exploration
– Development tools
– Advanced Engines
– Connectors
– Workload Optimization
– Enterprise integration
– Analytic Accelerators
– Application and industry accelerators
– Administration & Security
[Figure: IBM Big Data Platform. Components: Accelerators; Information Integration & Governance; Data Warehouse; Stream Computing; Hadoop System; Discovery; Application Development; Systems Management. Data types: Data, Media, Content, Machine, Social]
Key Differentiators for BigInsights
Enterprise Performance & Integration
• Workload / performance optimization
• GPFS
• Security
• Key integrations & Connectors with Enterprise Ecosystem
Analytics
• Text analytics
• Social Data Analytics Accelerators
• Machine Data Analytics Accelerators
• Execute R in an integrated application
Usability & Productivity
• Big SQL
• BigSheets
• Development Tools
• Web Console
Integrated Web Console
• Manage BigInsights
– Inspect / monitor system health
– Add / drop nodes
– Start / stop services
– Run / monitor jobs (applications)
– Explore / modify file system
– Create custom dashboards
• Launch applications
– Spreadsheet-like analysis tool
– Pre-built applications (IBM supplied or user developed)
• Publish applications
• Monitor cluster, applications, data
– Create / view event alerts
Distributed filesystem GPFS FPO gives additional flexibility, security and high availability
[Figure: software stack, top to bottom: Applications; High level languages (SQL, JAQL, PIG, …); Map/Reduce API; Hadoop DFS API; GPFS or HDFS; Distributed Filesystem]
• Optional file system alternative to HDFS
• More than 10 years experience with HPC
• Key features
– No single point of failure
– Built-in High Availability
– POSIX compliance
  • Standard applications cannot use HDFS but they can use GPFS-FPO
– Enhanced Security
– Higher performance
  • Allows concurrent read and write by multiple programs
– Recovery capabilities
  • Journaling filesystem
– Support for Storage Pools
– SnapShot capability
BigInsights has a simple but effective security system based on a gateway to Hadoop
• All Hadoop servers are connected over a private network
• Unrestricted communication between cluster servers on the private network
• BigInsights Web Console acts as a gateway into the cluster
• Authentication through PAM or LDAP
• Role based authorization
• Authorization will be enforced at 3 levels:
– UI level
– Data level
– Map-Reduce level
• Authorization also respected by services (e.g. SQL)
• Kerberos support
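As a rough illustration of role based authorization checked at the three levels listed above, the following Python sketch maps roles to the levels they may use. The role names (`admin`, `analyst`, `viewer`) and the grant table are hypothetical and are not taken from BigInsights; the slides do not describe how the product stores or evaluates its roles.

```python
from enum import Enum
from typing import Iterable


class Level(Enum):
    """The three enforcement levels named on the slide."""
    UI = "ui"
    DATA = "data"
    MAPREDUCE = "mapreduce"


# Hypothetical role-to-level grants, for illustration only.
ROLE_GRANTS = {
    "admin": {Level.UI, Level.DATA, Level.MAPREDUCE},
    "analyst": {Level.UI, Level.DATA},
    "viewer": {Level.UI},
}


def is_authorized(roles: Iterable[str], level: Level) -> bool:
    """Return True if any of the user's roles grants access at this level."""
    return any(level in ROLE_GRANTS.get(role, set()) for role in roles)
```

A gateway would call a check like `is_authorized(user_roles, Level.DATA)` before passing a request on to the cluster; unknown roles grant nothing.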
[Figure: Authentication Authority; Gateway / Web Console; External Sources; Users; Services; Data Nodes; Infrastructure Nodes; Distributed Filesystem]
BigSheets to analyze and visualize
• Model “big data” collected from various sources in spreadsheet-like structures
• Filter and enrich content with built-in functions
• Combine data in different workbooks
• Visualize results through spreadsheets, charts
• Export data into common formats (if desired)
No programming knowledge needed!
Centralized dashboard & data flows
A centralized dashboard to visualize analytic results:
• BigSheets collections
• Analytic application results
• Monitoring metrics
• Ability to view BigSheets data flows between and across data sets to quickly navigate and relate analysis and charts
• Visualize inner and outer joins, enhanced filters for BigSheets columns, column data-type mapping for collections, application of analytics to BigSheets columns, etc.
Tools for Developers
1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster
Editors
• A workflow editor that greatly simplifies the creation of complex Oozie workflows with a consumable interface
• A Pig/Jaql Editor with content assist and syntax highlighting that enables users to create and execute new applications using Pig or Jaql in local or cluster mode from the Eclipse IDE
Application development & deployment
• Enablement of BigSheets macro and BigSheets reader development
• Text Analytics development, including support for modular rule sets
• Publish new application: BigSheets Macro, BigSheets Reader, AQL module, Jaql module
Running Applications on Big Data
• Browse available applications
• Deploy published applications (administrators only)
• Launch (or schedule for launch) a deployed application
• Monitor job (application) execution status
• Predefined applications
– Import & Export Data
  • Database & Files
  • Web and Social
– Analyze and Query
  • Predictive Analytics
  • Text Analytics
  • SQL/Hive, Jaql, Pig, HBase
– Accelerators
Application linking and interfaces to build new apps
• Compose new applications from existing applications and BigSheets
• Invoke analytics applications from the web console, including integration within BigSheets
• REST data source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
• Sampling App that enables users to sample data for analysis
• Subsetting App that enables users to subset data for data analysis
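The slides do not say how the Sampling App draws its sample. One common way to take a fixed-size uniform sample from a data set too large to hold in memory is reservoir sampling, sketched below in Python. The function name and the `seed` parameter are illustrative assumptions, not part of BigInsights.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample
```

Because the input is only iterated once and at most `k` records are kept, the same function works on a file or stream of any length.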
Collaborative Big Data for many roles
• Business Users can get their hands on big data and use big data applications and BigSheets to get insights into their data
• Data scientists can perform deeper analysis and get richer insights
• Administrators are empowered to be more agile through better controls and views into key performance indicators
• Developers can leverage unified tooling in a Big Data Application Development Lifecycle and are able to create and deploy new types of applications, with enhancements that simplify even complex workflows
Big SQL 3.0 – Architected for Performance
• Leverage IBM's rich SQL heritage, expertise, and technology
– Modern SQL:2011 capabilities
– DB2 compatible SQL PL support
  • SQL bodied functions and stored procedures
  • Application logic/security encapsulation
• Architected from the ground up for performance
– Low latency and high throughput
• MapReduce replaced with a modern MPP architecture
– Compiler and runtime are native code (not Java)
– Big SQL worker daemons live directly on cluster
– Continuously running (no startup latency)
– Processing happens locally at the data
• Operations occur in memory with the ability to spill to disk
– Supports aggregations and sorts larger than available RAM
• Integration with BigSheets (source & target)
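To make the "spill to disk" idea above concrete, here is a minimal Python sketch of an external merge sort: rows are sorted in memory in bounded batches, each batch is spilled to a temporary file, and the sorted runs are merged as a stream. This is a generic textbook technique used for illustration; it is an assumption that Big SQL's native runtime works along similar lines, and the names `external_sort` and `max_rows_in_memory` are hypothetical.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _spill_run(sorted_rows: List[int]) -> str:
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as handle:
        for row in sorted_rows:
            handle.write(f"{row}\n")
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream a spilled run back from disk, one row at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(rows: Iterable[int], max_rows_in_memory: int) -> Iterator[int]:
    """Sort integers while holding at most max_rows_in_memory rows in a buffer."""
    if max_rows_in_memory < 1:
        raise ValueError("max_rows_in_memory must be at least 1")
    run_paths: List[str] = []
    buffer: List[int] = []
    try:
        for row in rows:
            buffer.append(row)
            if len(buffer) >= max_rows_in_memory:
                # Memory budget reached: sort the batch and spill it to disk.
                buffer.sort()
                run_paths.append(_spill_run(buffer))
                buffer = []
        buffer.sort()
        # Merge all spilled runs plus the final in-memory batch lazily.
        runs = [_read_run(path) for path in run_paths]
        runs.append(iter(buffer))
        yield from heapq.merge(*runs)
    finally:
        for path in run_paths:
            os.remove(path)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], max_rows_in_memory=3))` sorts seven rows while buffering only three at a time.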
[Figure: SQL-based Application → IBM Data Server Client → InfoSphere BigInsights Big SQL (SQL MPP Runtime) → Data Sources: Parquet, CSV, Seq, RC, Avro, ORC, JSON, Custom]
Big SQL 3.0 – Architecture cont.
• Head (coordinator / management) node
– Listens to the JDBC/ODBC connections and compiles / optimizes the query
– Coordinates the execution of the query
– Optionally stores user data in traditional RDBMS table (single node only)
• Big SQL worker processes reside on compute nodes (some or all)
• Worker nodes stream data between each other as needed
• Workers can spill large data sets to local disk if needed
– Allows Big SQL to work with data sets larger than available memory
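The split between a coordinating head node and workers that process data where it lives can be pictured with a small scatter-gather sketch in Python. Each "worker" computes a partial `GROUP BY ... COUNT(*)` over its local rows and the "coordinator" merges the partial results. Threads stand in for cluster nodes here, and all function names are hypothetical; the real Big SQL runtime is native code and is not shown in the slides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def worker_partial_count(local_rows: List[dict], column: str) -> Counter:
    """Worker side: count rows per group using only this node's local data."""
    return Counter(row[column] for row in local_rows)


def coordinator_group_count(node_partitions: List[List[dict]], column: str) -> Dict[str, int]:
    """Coordinator side: fan the work out, then merge the partial counts."""
    workers = max(1, len(node_partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda rows: worker_partial_count(rows, column), node_partitions)
        )
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)
```

With two partitions such as `[[{"country": "DE"}, {"country": "US"}], [{"country": "DE"}]]`, only the small per-group counts travel to the coordinator, not the rows themselves.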
[Figure: cluster layout. Management nodes: Big SQL, Hive Metastore, Name Node, Job Tracker. Compute nodes: each runs a Task Tracker, a Data Node and a Big SQL worker. Storage layer: GPFS/HDFS]
Big SQL 3.0 – Features
Application Portability & Integration
• Data shared with Hadoop ecosystem
• Comprehensive file format support
• Superior enablement of IBM software
• Enhanced by Third Party software
Performance
• Modern MPP runtime
• Powerful SQL query rewriter
• Cost based optimizer
• Optimized for concurrent user throughput
• Results not constrained by memory
Federation
• Distributed requests to multiple data sources within a single SQL statement
• Main data sources supported: DB2 LUW, DB2/z, Teradata, Oracle, Netezza
Enterprise Features
• Advanced security/auditing
• Resource and workload management
• Self tuning memory management
• Comprehensive monitoring
Rich SQL
• Comprehensive SQL Support
• IBM SQL PL compatibility
BigSQL Demo
Comparing Big SQL 3.0 and Hive 0.12 for Ad-Hoc Queries
[Figure: bar chart "BigSQL 3.0 Parquet vs Hive 0.12 ORC, 1TB Classic BI Workload". X-axis: query number (1 to 22). Y-axis: elapsed time (sec), 0 to 3500. Series: Hive 0.12, BigSQL 3.0]
Big SQL is up to 41x faster than Hive 0.12
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
IBM BigInsights brings efficient integration of R with Big R
• R as a big data query language
– Outside-in execution
• R as a statistical language for deep computing
– Inside-out execution
– Partitioning of large data (“divide”)
– Parallel cluster execution of pushed down R code (“conquer”)
– Almost any R package can run in this environment
• R as the gateway to scalable machine learning
– A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R
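The "divide" and "conquer" steps above follow a split-apply-combine pattern. The Python sketch below illustrates the pattern only: it partitions rows by a key, runs a user function on every partition in parallel, and collects one result per partition. It is not Big R's API (Big R is driven from R), and the helper names and the flight-delay sample rows are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List


def partition_by(rows: List[dict], key: str) -> Dict[str, List[dict]]:
    """'Divide': group rows into partitions that share the same key value."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def apply_per_partition(partitions: Dict[str, List[dict]],
                        fn: Callable[[List[dict]], float],
                        workers: int = 4) -> Dict[str, float]:
    """'Conquer': run fn on each partition in parallel, one result per key."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fn, partitions.values())
        return dict(zip(partitions.keys(), results))


def mean_delay(rows: List[dict]) -> float:
    """Example of a function that is pushed down to each partition."""
    return mean(row["delay"] for row in rows)


# Hypothetical sample data, for illustration only.
flights = [
    {"carrier": "AA", "delay": 10},
    {"carrier": "AA", "delay": 20},
    {"carrier": "UA", "delay": 5},
]
delays = apply_per_partition(partition_by(flights, "carrier"), mean_delay)
```

The client only receives the per-partition summaries in `delays`, which mirrors the "pull data (summaries) to R client" versus "push R functions right on the data" choice shown in the diagram.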
[Figure: R Clients (with R Packages) either pull data (summaries) to the R client or push R functions right on the data; on the cluster: Scalable ML Engine, Embedded R Execution (with R Packages), Data Sources]
Text Analytics in BigInsights
Distill structured information from unstructured data
– Rich annotator library supports multiple languages
– Declarative Information Extraction (IE) system based on an algebraic framework
– Richer, cleaner rule semantics
– Better performance through optimization
How it works
• Parses text and detects meaning with annotators
• Understands the context in which the text is analyzed
• Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Accuracy
• Highly accurate in deriving meaning from complex text
Performance
• AQL language optimized for MapReduce
[Figure: Unstructured text (document, email, etc.) → Classification and Insight]
Sample text: Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
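To give a feel for what an annotator does with the World Cup sample text, here is a small Python sketch that uses two regular-expression rules to turn free text into structured records. It is only an analogy: BigInsights Text Analytics rules are written in AQL, not Python, and the two patterns and the record layout below are assumptions made for this example. The input is an abridged excerpt of the slide's sample text.

```python
import re
from typing import Dict, List

# Abridged excerpt of the sample text from the slide.
SAMPLE = (
    "Football World Cup 2010, one team distinguished themselves well, "
    "losing to the eventual champions 1-0 in the Final. "
    "Winger Andres Iniesta scored for Spain for the win."
)

# Rule 1: a match score such as "1-0".
SCORE = re.compile(r"\b(\d+)-(\d+)\b")
# Rule 2: "<First> <Last> scored for <Team>".
GOAL = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) scored for ([A-Z][a-z]+)")


def extract_facts(text: str) -> List[Dict[str, str]]:
    """Apply both rules and return the matches as structured records."""
    facts: List[Dict[str, str]] = []
    for match in SCORE.finditer(text):
        facts.append({"type": "score", "value": match.group(0)})
    for match in GOAL.finditer(text):
        facts.append(
            {"type": "goal", "player": match.group(1), "team": match.group(2)}
        )
    return facts


facts = extract_facts(SAMPLE)
```

Hand-written patterns like these are brittle; the rule semantics and optimizer described on the slide are what a declarative system adds on top of this basic idea.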
BigInsights offers value beyond Open Source
[Figure: IBM-certified Apache Hadoop open source components, surrounded by Enterprise Capabilities: Administration & Security; Workload Optimization; Connectors; Advanced Engines; Visualization & Exploration; Development Tools]
Key differentiators
• Built-in analytics
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open source and other components
• Web Console for admin and application access
• Platform enrichment: additional security, performance features, . . .
• World-class support
• Full open source compatibility
Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with flexible analytical platform
• Leverages and complements existing software
InfoSphere BigInsights for Hadoop includes the latest Open Source components, enhanced by enterprise components
[Figure: IBM InfoSphere BigInsights for Hadoop component map. Legend: IBM, Open Source.
Layers: Runtime; File System; Data Store; Resource Management & Administration; Security; Data Access; Advanced Analytics; Visualization & Ad Hoc Analytics; Applications & Development; Governance; Stream Computing; Search.
Components shown: MapReduce; HBase; HDFS; Text Analytics; R; Big R; Kerberos; Audit & History; GPFS FPO; Adaptive MapReduce; Console; Monitoring; LDAP; Data Security for Hadoop; Data Privacy for Hadoop; Data Matching; Data Masking; Streams; Enterprise Search; Solr/Lucene; Jaql; Pig; Hive; ZooKeeper; Oozie; Big SQL; Flexible Scheduler; ETL; BigSheets; Dashboard; Charting; Eclipse Tooling (MapReduce, Hive, Jaql, Pig, Big SQL, AQL); BigSheets Reader and Macro; Text Analytics Extractors; Flume; Sqoop; HCatalog; YARN*]
* In Beta
From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs
[Figure: editions built on Apache Hadoop, plotted by breadth of capabilities and enterprise class]
Quick Start
• Free, non-production
• Same features as Standard Edition plus text analytics and Big R
Standard Edition
– Spreadsheet-style tool
– Web console
– Dashboards
– Pre-built applications
– Eclipse tooling
– RDBMS connectivity
– Big SQL
– Monitoring and alerts
– Platform enhancements
– . . .
Enterprise Edition
– Accelerators
– GPFS – FPO
– Adaptive MapReduce
– Text analytics
– Enterprise Integration
– Big R
– InfoSphere Streams*
– Watson Explorer*
– Cognos BI*
– Data Click*
– . . .
* Limited use license