1 © 2002 IBM Corporation IBM Research Impliance -- Information Management Appliance Impliance: an Information Management Appliance Bishwaranjan Bhattacharjee IBM Watson Research Center Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart Almaden Research Center


© 2002 IBM Corporation

IBM Research

Impliance -- Information Management Appliance

Impliance: an Information Management Appliance

Bishwaranjan Bhattacharjee
IBM Watson Research Center

Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart
Almaden Research Center


2 Impliance -- Information Management Appliance © 2007 IBM Corporation

Agenda

Motivation: Observations → Requirements

What is Impliance?

How is Impliance different from…?

Research opportunities

Conclusions


After all our successes (and last night’s revelry), it’s easy to become self-congratulatory.

Sorry, time for…


Some embarrassing questions:

Why is most (>80%) of the world’s data still not in databases?

Didn’t we “solve” this problem in the 1980s with object-relational systems?

Do you use a database to store your data on your laptop?

Why not? (You are a database bigot, aren’t you?)

Have you ever tried to query (with SQL) a database that:

– You didn’t create, and…

– Had more than 500 tables?

Just how easy is it to incrementally add DB capacity beyond 1 machine? 100 machines?

Have “self-managing” databases significantly simplified administration?


Observation → Requirements (1 of 5)

Observation #1: Information converging

Many types of data in today’s enterprise:
– Structured (traditional Data Base)
– Semi-structured (traditional Content Management, XML)
– Unstructured (text, multimedia)

Each needs a different search interface, today:
– SQL
– JSR-170
– Keyword search / Information Retrieval

Requirement #1: Store / Search / Analyze all data
– Need to rapidly relate information of different types
– With one unified interface!
– Real use cases in paper


Observation → Requirements (2 of 5)

Observation #2: Awash in data, but not information
– Typical complaint: “I can’t find what I’m looking for!”
– But just finding data isn’t enough!
– Today’s Business Intelligence is too human-intensive

Requirement #2: Pro-actively derive useful information
– Need to glean more business value from enterprise data
– What sort of analytics exploit unstructured data?
– Need to automatically extract the semantics of text
– A rebirth of data mining?


Observation → Requirements (3 of 5)

Obs. #3: Total Cost of Ownership (TCO) is paramount

People costs dominate TCO
– Hardware often less than 50% of TCO

Minimize Time To Value
– Databases take too long to set up!

Wizards & Advisors simply mask complexity, add brittleness

Reqmt. #3: System must be simple, robust, & secure

Sacrifice resource utilization for radical simplification of:
– Setup / Configuration / Deployment (e.g., Self-Organizing)
– Operation

KISS (you know this one)

KIWI – Kill It With Iron [Weikum]!

Example: “Good enough” plans exploiting massive parallelism
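
One way to picture a “good enough” plan is a toy sketch like the one below: instead of choosing indexes or a clever access path, every partition is scanned in full and the scans run side by side. The partition layout, the record shape, and the names `scan_partition` and `parallel_scan` are assumptions made for illustration; nothing here is taken from Impliance itself.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, predicate):
    """Full scan of one partition: no index, no per-query tuning."""
    return [row for row in partition if predicate(row)]

def parallel_scan(partitions, predicate, max_workers=4):
    """Run the same simple scan on every partition at once and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return [row for chunk in chunks for row in chunk]

# Hypothetical data: 4 partitions of 1,000 order rows each.
partitions = [
    [{"id": n * 1000 + i, "amount": i % 50} for i in range(1000)]
    for n in range(4)
]
hits = parallel_scan(partitions, lambda row: row["amount"] == 49)
```

In CPython the threads mostly illustrate the shape of the plan; on a real scale-out system each partition would sit on its own node, and adding nodes ("iron") would shorten the scan without any change to the plan.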


Observation → Requirements (4 of 5)

Observation #4: Data volumes growing fast
– Data is kept longer
– Lots of new kinds of data: RFID, email, photos, videos
– Disk densities improving, but not seek times!
  • 1 TB disk for $399 (Hitachi)

Requirement #4: Simple & massive scale-out
– 1000s of nodes
– With low management overhead
– No single point of failure


Observation → Requirements (5 of 5)

Obs. #5: Today’s Info. Mgmt. software based upon hardware of 30 yrs. ago
– Example: Update-in-place databases due to expensive disk
– Today: Cheap CPUs, large storage, fast networks

Requirement #5: Need new (software) architecture

Opportunity to radically rethink Info. Mgmt. software architecture (Stonebraker: “refactor”), based upon:

– Hardware economics
  • e.g., cheap (multi-core) CPUs, storage, memory, network

– Software:
  • Formats (e.g., XML, semi-structured data)
  • Functionality required (e.g., unstructured search, analytics)

– Specified in the right order:
  • Service requirements → Software → Hardware


What is Impliance?

Scalable: Massively parallel scale-out… to Petabytes!

Administrator-less:
– Low Time to Value by Self-Organizing
– Low Total Cost of Ownership

Manage and Search All Data: Structured, Semi-Structured, … Even Unstructured Text!

Pro-actively Mine Information: Glean business insight from data

Bundled: HW & SW
– Pre-configured
– Pre-tuned
– Limited APIs

[Figure labels: Text, XML, Structured Data (Tables)]


What Does Impliance Actually Do?

All enterprise information:

√ Stores & Retrieves (Search / Query)

√ Composes / Integrates / Mashups

√ Finds trends & exceptions (Business Intelligence)


Think of Impliance as…

Content Management on steroids (beyond JSR-170)

File System with all content searchable

Data Warehouse with all your enterprise’s data
– Not just structured information
– Excluding high-rate OLTP (web site)

A Jambalaya


Where does Impliance fit?

[Figure: Types of Data (Structured, Semi-Structured / XML, Unstructured) plotted against Lifetime of Data (Transaction, Ingestion, Warehousing/OLAP, Archiving). Regions labeled: OLTP, DBMS, Content Management, Archiving Products, Impliance.]


How is Impliance related to…

Google Base?
– Primary data store
– Appliance (product, i.e., sits at the customer site), not a Service
– Enterprise, not “the masses”

DataSpaces / Google “Pay as you go”?
– Primary data store (vs. lazy federation of existing data sources)
– Enterprise, not “the web”

Database “Appliances” (Netezza, DATAllegro, Greenplum, etc.)?
– Not just structured (relational) data
– Discovery of semantics
– More pro-active


Research Opportunities

Reducing TCO
– Make categories of administration just GO AWAY
– Self-Organizing to obviate database design
– Exploit appliance’s limited externalized interfaces

New HW & SW architectures using off-the-shelf components
– Achieving fine-grained scale-out
– Targeting robust, “good enough” designs
– Exploiting integration of components

Data and query models that
– Unify all data, yet are simple
– Tolerate “schema chaos”
– Combine best features of keyword search & SQL

Automated discovery of
– Data & query semantics for
  • Improving precision of queries
  • Organizing data adaptively
– Trends, exceptions, etc. (pro-active Business Intelligence)


Conclusions

We’ve come a long way towards

– the autonomic dream

– incorporating all data

But we can do much more!

Impliance provides an exciting opportunity for DB research

– To lower TCO for information management

– To exploit today’s hardware and software advances

– To rethink information management in a fundamentally new way

Join us!


Thank You

Merci (French), Grazie (Italian), Gracias (Spanish), Obrigado (Brazilian Portuguese), Danke (German)

[Slide also shows “thank you” in Japanese, Russian, Arabic, Traditional Chinese, Simplified Chinese, Hindi, Tamil, Thai, and Korean.]


Appendix


Redefining Information Systems -- Players

Web 2.0 oriented next generation systems (delivered through services or appliances): Google, Yahoo, MSN, (IBM)
– Google Base (a semi-structured/unstructured information base)
– Google OneBox

NextGen systems built by integration of successful open source (Greenplum)
– Data models: RSS/ATOM/Wiki/…
– Architecture: DB+Search+Content systems (e.g., MySQL+Lucene+Jackrabbit)

Entrenched HW/Storage/middleware companies
– Storage-driven:
  • EMC: Moving up the value chain, brought in a classic Content system
  • IBM IDS: synergy between classic CM (JCR) and storage
– Server-driven:
  • Netezza, DATAllegro (for BI)
  • Zantaz (for email compliance)
  • DataPower (XSLT filtering)
– Middleware-driven (IBM, Oracle, Microsoft)
  • Oracle Secure Enterprise Search


Research Focus 1: Reducing TCO

Make entire categories of administration JUST GO AWAY

Reducing time-to-value through new design principles
– Self-organization of “schema chaos” obviates lengthy logical & physical design, REORG
– Fine-grained scale-out (instead of scale-up) obviates need for load balancing, etc.

New software architecture
– Target robust, highly-predictable, “good enough” utilization (KIWI = Kill It With Iron)
– Componentization
  • Each component simple, robust, and adaptive
– Virtual service model
  • Service Broker optimizes resources and assigns the workload

Exploit integrated hardware and storage systems to provide
– Built-in redundancy and availability
– Automated backup and archiving (ILM)
– Easy cluster management
– Schema chaos support at storage level (semantic storage)
– Ability to use new types of grid elements (cell blade server) seamlessly


Research Focus 2: Scalability

True Grid Model
– Off-the-shelf, commodity hardware
– Dedicate blades to different tasks
  • Data: storage and simple filtering
  • Analytical: aggregation & mining
  • Transaction: search, transactional get/put

Supports Mixed Workloads
– Analytics, Search, Content, …

Fine-grained scale-out
– Different blade types scale independently
– From SMB to largest enterprises

Integrating modern HW & storage, e.g.
– BC3, intelligent bricks
– Logic pushdown into storage
  • Predicate application
  • Aggregation
  • Redundancy management
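
Logic pushdown can be sketched as follows: each storage node applies the predicate and a partial aggregation next to its own data, and the analytic node only merges the small partial results. The `DataBlade` class, the `analytic_sum` function, and the sample rows are hypothetical names and data chosen for this illustration; the slides do not specify an interface.

```python
from collections import Counter

class DataBlade:
    """Hypothetical storage node that can filter and pre-aggregate locally."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate, group_key, value_key):
        # Predicate application and partial aggregation happen next to the data,
        # so only one small dict per blade has to cross the interconnect.
        partial = Counter()
        for row in self.rows:
            if predicate(row):
                partial[row[group_key]] += row[value_key]
        return dict(partial)

def analytic_sum(blades, predicate, group_key, value_key):
    """Analytic node: merge the partial sums returned by each data blade."""
    total = Counter()
    for blade in blades:
        total.update(blade.scan(predicate, group_key, value_key))
    return dict(total)

blades = [
    DataBlade([
        {"region": "east", "year": 2006, "sales": 10},
        {"region": "west", "year": 2007, "sales": 5},
    ]),
    DataBlade([
        {"region": "east", "year": 2007, "sales": 7},
        {"region": "west", "year": 2007, "sales": 3},
    ]),
]
result = analytic_sum(blades, lambda r: r["year"] == 2007, "region", "sales")
```

The point of the design is that raw rows never leave a blade; whether a given aggregate can be split this way depends on the function (sums and counts merge cleanly, medians do not).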

[Figure: Data Arrays of Data Blades (each with data processing and RAID) joined by a Commodity Interconnect to an Analytic Grid of Analytic Blades and a Transactional Cluster of Transaction Blades. Labeled streams: Content Stream, Data Stream, Archive/ILM Stream, Xaction Stream. Overall label: Data+Content+Search+Digital Media.]


Parallel Run-time: Comparison of Plumbing

Platform | Application | Querying model | Parallelism | Fault tolerance | Resource Scheduling

WS XD | Transactional (composition; no search, no BI) | limited | moderate | yes | yes

DataStage (E2) | ETL (streaming) (cleansing, transformation, composition) | rich | high | yes | yes

GPFS | Storage | extremely limited | extremely high | yes | limited

DB2 ESE with DPF | Analytics for relational | rich | high | yes | yes

Google Map/Reduce | Analytics for anything (search, transformation, simplistic composition) | limited | extremely high | yes | yes

Impliance | Analytics for anything, Search, Composition | rich | extremely high | yes | yes


[Figure: Impliance software architecture. Labeled layers and their stated properties:]

– Data Analyzer, Discovery, Query: Large-scale computation
– Data/Query Modeler: Simple, generic
– SRRS (Scalable Reliable Runtime Support): Fault tolerant
– DDS (Distributed Data Store): Provide reliability
– VSCR (Virtual Storage and Computing Resource): Commodity HW

Other labeled elements: Applications, Objects, Resource Modeler, Security Control

Data types and interfaces: Relational data (SQL), content (JCR), XML (XSLT), Web page (HTTP), Video, Archive (ILM), …


Research Focus 3: Information Modeling and Querying

Simple, rich, unified information model & associated query languages, e.g.
– Google Base approach promising
  • Defined typed attributes for navigation
  • Defined label for keyword search
– Infosphere, MUSIC
– Open community (RSS / Atom / wiki)

Automatic schema discovery and integration – self-organizing!
– Integrating solutions from Infosphere, CLIO

Intelligence discovery
– Automatic discovery of semantics (UIMA, Web Fountain, Avatar)
– Pro-active, continuous mining (vs. passive BI model)
– Contextual information supply
– Including reporting and advanced analytics
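
The “typed attributes for navigation, labels for keyword search” idea can be sketched with a schema-free item type. This is only an illustration of the general pattern; the `Item` class, the helper functions, and the sample items are assumptions, not the Google Base data model or an Impliance API.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Item:
    """Hypothetical schema-free item: typed attributes plus free-form labels."""
    item_id: str
    attributes: dict = field(default_factory=dict)  # attribute name -> typed value
    labels: set = field(default_factory=set)        # lower-case keywords for search

def by_label(items, keyword):
    """Keyword search: match on labels only."""
    return [it for it in items if keyword.lower() in it.labels]

def by_attribute(items, name, test):
    """Navigation: filter on a typed attribute, skipping items that lack it."""
    return [it for it in items if name in it.attributes and test(it.attributes[name])]

items = [
    Item("inv-17", {"amount": 1200.0, "due": date(2007, 3, 1)}, {"invoice", "acme"}),
    Item("mail-3", {"sent": date(2007, 1, 15)}, {"email", "acme"}),
    Item("inv-18", {"amount": 80.0, "due": date(2007, 4, 1)}, {"invoice"}),
]
acme = by_label(items, "acme")
large = by_attribute(items, "amount", lambda v: v > 100)
```

Items of different kinds share one collection, and an item that lacks an attribute is simply skipped by a filter on it, which is one simple way to tolerate “schema chaos”.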


Eliminate Admin Tasks… …Rather than adding layers (1 of 3):

Special-purpose, turn-key appliances for basic services
– vs. today’s general-purpose SW (but still uses off-the-shelf hardware!)
– Bundled, Pre-installed, Pre-configured, Pre-tuned software!
– Examples:
  • Information Management appliance
  • Web Server appliance
– Minimizes interfaces user has to worry about
  • No need to externalize underlying operating system, storage details
– Eliminates need to install, configure, and tune

Self-organizing data systems
– Automatic discovery of data structure
– Obviates need to
  • Define logical and physical schema a priori, reducing time to value
  • Migrate schema when organization changes
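
A very small sketch of “automatic discovery of data structure”: scan whatever records arrive and report, per field, which value types occur and in what fraction of records the field appears. The function name `discover_schema` and the sample records are assumptions for illustration; real schema discovery (for example the Infosphere or CLIO work named in the slides) is far richer than this.

```python
from collections import defaultdict

def discover_schema(records):
    """Infer, per field, the observed value types and the share of records that have it."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for name, value in record.items():
            types[name].add(type(value).__name__)
            counts[name] += 1
    return {
        name: {"types": sorted(types[name]), "coverage": counts[name] / len(records)}
        for name in types
    }

# Hypothetical records loaded with no schema declared up front.
records = [
    {"name": "Ann", "age": 41},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Cy", "age": "unknown"},
]
schema = discover_schema(records)
```

Because the description is derived from the data, a new field or a changed type shows up on the next pass instead of requiring a schema migration.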


Eliminate Admin Tasks (2 of 3): Universal Data Management

Today:

Plethora of special-purpose data managers:
– Databases for structured data
– Content managers for semi-structured data
– File systems for unstructured data

For each, very different
– User interfaces (SQL, JSR 170, file interface)
– Degrees of semantic knowledge about the data’s contents
– Degrees of searchability
– Consistency semantics (e.g., transactions) when updated
– Management capabilities and interfaces

Tomorrow: Single mechanism for managing all data

Uniform interfaces for all types of data, for
– Searching
– Updating
– Management

Universal indexing (“Google model”) of all data – default search mechanism
– Plus more precise searching for auto-discovered (above) structured information

Obviates need to
– Impose naming conventions to find desired data
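
Universal indexing can be sketched as one inverted index that accepts rows, nested (XML-like) structures, and plain text alike, so keyword search works as the default over everything. The `UniversalIndex` class, the id scheme, and the sample objects are assumptions for illustration only.

```python
import re
from collections import defaultdict

def tokens(value):
    """Yield lower-cased word tokens from any value, nested or not."""
    if isinstance(value, dict):
        for key, inner in value.items():
            yield from tokens(key)
            yield from tokens(inner)
    elif isinstance(value, (list, tuple)):
        for inner in value:
            yield from tokens(inner)
    else:
        yield from re.findall(r"\w+", str(value).lower())

class UniversalIndex:
    """Hypothetical single keyword index over every kind of stored object."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of objects containing it

    def add(self, doc_id, content):
        for tok in tokens(content):
            self.postings[tok].add(doc_id)

    def search(self, query):
        """Return ids of objects that contain every word of the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        return set.intersection(*(self.postings.get(w, set()) for w in words))

index = UniversalIndex()
index.add("row:42", {"customer": "Acme Corp", "balance": 120})          # table row
index.add("doc:7", "Meeting notes: Acme renewal slipped to March")      # free text
index.add("xml:3", {"order": {"buyer": "Globex", "items": ["pump", "valve"]}})
```

A query such as `index.search("acme")` then returns both the table row and the text document, with no naming convention needed to know where either one lives.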


Eliminate Admin Tasks (3 of 3): Robust storage mechanisms to eliminate need for backups

Never throw out data – keep versions!
– Update-in-place
  • Is an anachronism from days of expensive disk
  • Increases complexity of transactions
  • Jeopardizes compliance requirements (Sarbanes-Oxley)
– Versions permit queries “as of” some time
– Exploits storage density increases (relative to number of disk arms)

RAID provides local reliability
– Widely accepted and deployed
– Weaver Codes extend to multiple simultaneous failures

How to provide universal reliability (i.e., against site disasters)?
– Selective, automated replication of new versions?
– Cross-site RAID?

Universal “Call Home” technology for remote management of
– Monitoring
– Problem determination
– Software maintenance & upgrades
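
The “keep versions” idea and “as of” queries can be sketched with an append-only store: a write never overwrites, it adds a timestamped version, and a read picks the latest version at or before the requested time. The `VersionedStore` class and the integer timestamps are assumptions for illustration; the slides do not describe a concrete storage format.

```python
import bisect

class VersionedStore:
    """Hypothetical append-only store: a put never overwrites, it adds a version."""

    def __init__(self):
        self._times = {}   # key -> sorted list of timestamps
        self._values = {}  # key -> values, parallel to _times

    def put(self, key, value, timestamp):
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        pos = bisect.bisect_right(times, timestamp)
        times.insert(pos, timestamp)
        values.insert(pos, value)

    def get_as_of(self, key, timestamp):
        """Return the value that was current at `timestamp`, or None if none existed."""
        times = self._times.get(key, [])
        pos = bisect.bisect_right(times, timestamp)
        return self._values[key][pos - 1] if pos else None

store = VersionedStore()
store.put("price", 10, timestamp=100)
store.put("price", 12, timestamp=200)
before_change = store.get_as_of("price", 150)
after_change = store.get_as_of("price", 250)
```

Since old versions are never destroyed, the earlier value stays queryable after the change, which is the property that makes audits and compliance checks straightforward.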


Observation / Requirements

Information converging: Store / Search / Analyze ALL data
– Structured (traditional Data Base)
– Semi-structured (traditional Content Management, XML, multi-media, call center records)
– Unstructured (text)
– Same advanced functionality required

Data volume growing fast: On Demand strategy requires massive scale-out
– Lots of new data: RFID, email, photos, videos (Deep Internet-scale systems being built)
– Data is kept longer, due to compliance requirements

Total Cost of Ownership (TCO) is paramount: System simple & robust (not smart & fragile)
– People costs dominate TCO: Hardware often less than 50% of TCO
– Hence, sacrifice resource utilization for radical simplification
– Delivered in services or appliances

Today’s IM software based upon hardware of 30 yrs. ago: Need new software architecture
– Cheap CPUs, large storage, fast network in hardware
– Opportunity to radically rethink IM software architecture, based upon:
  • Hardware economics (e.g., cheap CPUs, storage, memory, & network)
  • Data:
    Formats (e.g., XML, semi-structured data)
    Functionality required (e.g., unstructured search, analytics)


Total Cost of Ownership is the Driver

Cost of management and administration is outpacing spending on new systems

[Chart: Spending (US$B, $0 to $160) and Installed base (M Units, 5 to 35) by year, 1996 to 2008. Series: Cost of management and administration (10% CAGR); New server spending (US$M) (3% CAGR).]

Source: IDC, On-Demand Enterprises and Utility Computing: A Current Market Assessment and Outlook, IDC #31513, July 2004
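
To see why the gap between the two growth rates matters, the compounding can be worked through directly. The sketch below assumes, purely for illustration, that each rate holds uniformly across the whole 1996 to 2008 axis of the chart; the slide gives only the two CAGR figures, so the resulting multiples are indicative rather than values read from the data.

```python
def growth_factor(cagr, years):
    """Total growth multiple after compounding `cagr` for `years` years."""
    return (1 + cagr) ** years

years = 2008 - 1996  # span of the chart's time axis (assumed to apply to both series)
admin = growth_factor(0.10, years)    # cost of management and administration
servers = growth_factor(0.03, years)  # new server spending
ratio = admin / servers               # how much faster admin cost grows overall
```

Under that assumption administration cost grows to roughly 3.1 times its starting level while new server spending grows to roughly 1.4 times, so the relative weight of people costs a little more than doubles over the period.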


Changing Characteristics of Data

Categories shown:
– Transactions and structured data
– Text and other human data
– Machine-generated and unstructured data

Examples shown:
– Seat on an airplane: Easy to find, structured
– LifeScience data (protein folding, gene expression): Difficult to analyze but we know where to look
– Satellite and surveillance data: An infinite space of "patterns"

[Figure: each category is characterized along three dimensions: Heterogeneity, Actionability, Scale.]


Impliance

Impliance: A Highly-Scalable, Rich-Functional Information Management Appliance
– A box with software pre-installed
– How delivered to enterprise: appliance or service

What functions?
– Store and manage all information
  • Accept all types of enterprise data
– Deliver all intelligence
  • Integrate cross-silo information
  • Advanced analytics with richer semantics

What properties?
– Low TCO
  • Easy to deploy (“plug & play”)
  • Simple and stable
– Scalability
  • From SMB to Very Large (PetaBytes)
  • (Not for high-end OLTP!)

[Figure: Data+Content+Digital Media. Data types with their native retrieval and native update/load interfaces: Relational data (SQL), content (JCR), XML (XSLT), Web page (HTTP), Video, Archive (ILM).]