in5040 -advanced database systems for big data · database trends [gray 2004] the object/relation...

40
1 IN5040 - Advanced Database Systems for Big Data Vera Goebel Garth Theron(TA) Fall 2019

Upload: others

Post on 01-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

1

IN5040 - Advanced Database Systems for Big Data

Vera GoebelGarth Theron(TA)

Fall 2019

Page 2: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

2

Course Purpose

• The course gives an overview of newdevelopments within data management technology– Emphasis on usage and applicability– Concepts and design, not so much about

concrete systems

Page 3: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

3

Organization of the Course• Course mode:

– Lectures: Tuesday, 3 hours, 14:15-17:00, Lille Aud. (old Ifi)

– 10 weeks lecturing– Mandatory exercise (2 parts)

• Examination:– Oral– Date to be announced later

Page 4: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

4

Syllabus (Pensum)

• All slides and handouts• Elmasri & Navathe: Fundamentals of Database

Systems, 5th or 6th edition, Addison-Wesley• Garcia-Molina, Ullman & Widom: Database

Systems – The Complete Book, 2nd edition, Prentice Hall, 2009

• Selected articles (on the Web page)• PhD students: 45 min. Presentation (add.)

Page 5: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

5

Mandatory Exercise• Organization: 2 parts = 2 deliveries

each student works alone• Delivery date: Hard Deadlines!

Part 1: October 8, 2019, 14:00 hPart 2: October 29, 2019, 14:00 h

• Only students that have passed themandatory exercise (both parts!) areallowed to take the examination!

• Help (TA): Garth Theron, [email protected]

Page 6: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

6

Trends and Challengesin Modern Data Management

Vera Goebel

Page 7: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

The Beckman Report on Database Research

(October 2013)• 30 leading DBS researchers / professors• 8th meeting: 1989, 1990, 1995, 1996,

1998, 2003, 2008, 2013, …• Publication: Communications of the ACM,

Volume 59, Number 2, February 2016, pp. 92-99

7

Page 8: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Big Data – Defining Challenge• Distribution• Integration• Heterogeneity• In-memory processing• Large-scale systems• Data analysis• …

8

Page 9: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Big Data – Main Characteristics

9

Page 10: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Big Data – 5 Related Challenges

• Scalable big/fast data infrastructures• Coping with diversity in data management• End-to-end data-to-knowledge pipeline• Cloud services• Role of people in data life cycle

10

Page 11: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Scalable Big/Fast Data Infrastructures• Parallel and distributed processing• Query processing and optimization• New hardware• Cost-efficient storage• High-speed data streams• Late-bound schemas• Consistency• Metrics and benchmarks

11

Page 12: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Diversity in Data Management

• No one-size-fits-all• Cross-platform integration• Programming models• Data processing workflows

12

Page 13: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

End-to-End Processing of Data

• Data-to-knowledge pipeline• Tool diversity• Tool customizability• Open source• Understanding data• Knowledge bases

13

Page 14: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Cloud Services

• IaaS, PaaS, SaaS• Elasticity (SLA)• Data replication• System administration and tuning• Multitenancy (VMs)• Data sharing• Hybrid clouds

14

Page 15: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

Roles of Humans in the Data Life Cycle

• Data producers• Data curators (crowdsourcing)• Data consumers• Online communities

15

Page 16: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

16

Database Trends [Gray 2004]The object/relation battleWeb ServicesQueues and workflowsCubes and online analytic

processingData mining and machine

learningColumn storesApproximate answersSemi-structured data

The Semantic WebData Stream and Sensor

ProcessingSmart Objects: Databases

everywherePublish-SubscribeMassive memory, massive

latencySelf managing, always up

A selection of these subjects (and some additional ones)are discussed more thoroughly in the rest of the course.

Page 17: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

17

Web Services• Web service: SW system designed to support

interoperable machine-to-machine interaction over a network– Web servers and runtime (Apache, IIS, J2EE, .NET)

displaced TP monitors and ORBS– Web services (soap, wsdl, xml) are displacing current

brokers– DBMS listening to port 80

Publishing WSDL, DISCO, WS-SecurityServicing SOAP callsDBMS is a web service

– Basis for distributed systems– A consequence of ORDBMS

Page 18: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

18

Queues and Workflows• Applications loosely connected via queued

messages• Queues:

– Supported in all major database systems• defining queues• queueing and dequeueing messages• attaching triggers to queues

– Basis for publish-subscribe and workflow• Challenges: How to structure workflows and

notifications; characterize design patterns

Page 19: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

19

Data Cubes and OLAP(Online Analytical Processing)

• OLAP: Approach to quickly provide the answer to analytical queries that are dimensional in nature – sales– marketing– management reporting– ...

• Databases for OLAP contain data cubes– Data cubes now standard– MDX is very powerful

(Multi-Dimensional eXpressions)– Cube stores cohabit with row stores

ROLAP + MOLAP + xOLAP(relational + multidimensional ++ ...OnLine Analytical Processing)

– Very sophisticated algorithms• Challenge: Better ways to query and

visualize cubes

CHEVY

FORD 19901991

19921993

REDWHITEBLUE

SELECT <axis_spec> FROM <cube_spec>WHERE <slicer_spec>

Page 20: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

20

Data Mining and Machine Learning• Tasks: Classification, association, prediction• Tools: Decision trees, Bayesian networks, a

priori clustering, regression, neural nets, ...• Now unified with DBs

– Create table T(x,y,z,a,b,c)Learn ”a,b,c” from ”x,y,z” using <algorithm>

– Train T with data– Then can ask:

• Probability (?x,?y,?z,?a,?b,?c)• Probability (x,y,z,?a,?b,?c)

– Example: Learn height from age• Challenge: Better learning algorithms

Page 21: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

21

Data ....• Data mining

– Process of automatically searching large volumes of data for patterns– Applies computational techniques from statistics, information retrieval,

machine learning and pattern recognition• Data farming

– Process of using a high performance computer or computing grid to run a simulation thousands or millions of times across a large parameter and value space

– Result is a “landscape” of output that can be analyzed for trends, anomalies, and insights in multiple parameter dimensions.

• Data warehouse– Collection of computerised data organised to most optimally support

reporting and analysis activity• Data mart

– Specialized version of a data warehouse for specific user groups or needs

Page 22: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

22

Data Mining – Database Synergy• Create the model: Learn height from Gender + Age

CREATE MINING MODEL HeightFromAgeSex(ID long key,Gender text discrete,Age long continuous,Height long continuous PREDICT)

USING Decision_Trees• Train a data mining model: Database verbs to drive Modeler

INSERT INTO HeightSELECT ID, Gender, Age, HeightFROM People

• Predict height from model: Probabilistic reasoningSELECT height,

PredictProbability(height)FROM Height PREDICTION JOIN New

ON New.Gender = Height.GenderAND New.Age = Height.Age

Page 23: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

23

Column Stores• Universal relations: Users see fat base tables• Ex. LDAP

– 7 required, thousands of measured attributes• Conceptually simple, but use only some columns• To avoid reading useless data,

– do vertical partitions– define 10% popular columns index– make many skinny indices– query engines uses covering index– much faster read, slower insert/update

• Column stores automate this• Challenge: Automate design

Page 24: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

24

Approximate Answers

• ”Messy” data types: Text, time, space– Integrating programming languages with

DBMS allows adding data types and libraries for indexing and accessing such data

– Approximate answers– Probabilistic reasoning– No clear algebra

Page 25: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

25

Semi-Structured Data

• Not all data fits into the relational model– XML – eXtensible Markup Language

• File directories are becoming databases– Pivot on any attribute– Folders are standing queries– Freetext + schema search (better precision/

recall)

”Cyberspace is a giant XML document:xQuery for manipulation”

”Structure: YES!Semi-structured: NO!”

Page 26: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

26

Data Stream & Sensor Processing

• Data generated by instruments that monitor the environment– Need to process/analyze streams of data– Traditionally:

• Query large amounts of facts– Streams:

• Large amounts of queries on each new fact

Page 27: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

27

Data Streams

• Implications:– New aggregation operators– New programming style– Streams in products:

• Queries represented as records• New query optimizations

• Lots of challenges – Data structures, query operators, execution

environments are qualitatively different from classical DBMS architectures

Page 28: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

28

Sensor Network Characteristics

• Autonomous nodes– Small, low-cost, low-power, multifunctional– Sensing, data processing, and communicating

components• Sensor network is composed of large number of

sensor nodes (motes, smart dust)– Proximity to physical phenomena

• Deployed inside the phenomenon or very close to it• Monitoring and collecting physical data

– Streams of data• No human interaction for weeks at a time

– Long-term, low-power nature

Page 29: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

29

Sensor Data Harvesting

• Optimize wrt. power and bandwidth– Push queries out to sensors

• Moving intelligence to the perifery of the network • Every mote and smart dust a small database in

itself– Aggregate results during data collection

• Much more dynamic query optimization strategies needed

Page 30: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

30

Smart Objects: DBS Everywhere• Phones, PDAs, Cameras, ... have small DBs

– even motes– and smart dust?

• Disk drives have enough cpu, memory to run a full-blown DBMS

• All these devices want/need to share data• Need a simple-but-complete DBMS

– They need an ”Esperanto”:a data exchange language and paradigm

• Billions of clients, million of servers

Page 31: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

31

Publish-Subscribe• Data with many users

– Data warehouses collect vast data archives and publish subsets to special interest group data-marts

– Replicas for availability and/or performance– Mobile users do local updates, synchronize later

• Publish-subscribe model:– Custom subscriptions installed at the warehouse– Real-time notification

Page 32: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

32

Publish-Subscribe and Stream Processing

• Compare publish-subscribe & stream processing systems:– Millions of standing queries (subscriptions)

compiled into dataflow graph– At arrival of new data, incrementally evaluate

dataflow graph• Challenge:

– Support more sophisticated standing queries– Better optimization techniques

Page 33: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

33

Massive Memory, Massive Latency• 2005: RAM costs ~ 100k€ - 300k€ per TByte• Main-memory databases!• Latency a problem

– TByte ram memory scan ~ minutes

– TByte disk scan ~ hours• NUMA latency a problem• Challenge: Algorithms for

massive main memory– Database engines need to

overhaul their algorithms

Storage Price vs TimeMegabytes per kilo-Euro

1,E-4

1,E-3

1,E-2

1,E-11,E+0

1,E+1

1,E+2

1,E+3

1,E+4

1980 1990 2000 2010Year

GB

/kEU

R

Page 34: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

34

Self Managing, Always Up• No DBAs for cell phones or cameras

– nor for panel heaters, washing machines,...• Self* is the key

– Self-managing– Self-configuring– Self-organizing– Self-healing– Self-...

• Requires a modular software architecture– Clear and simple knobs on modules– Software manages these knobs

Page 35: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

35

But What Happened to the Classical Databases?

• Classical databases are alive and kicking!• Classical requirements are still valid!

– Persistent data management

– Concurrency control– Availability /

fault tolerance– Ad-hoc queries– Data integration– Logical and physical

data independence• Classical database application domains

and classical database functionality do NOT disappear

– Data consistency– Data security– Distribution– Performance– Extensibility– Cost effectiveness– Simple

manageability

DATABASE

APPLICATIONPROGRAM

DATABASEMANAGEMENT

SYSTEM

OPERATING SYSTEM

Page 36: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

36

Database System Implementation

• Column stores– Dealing with sparse data in an efficient way

• Stream and sensor processing– Dealing with severe power, bandwidth, and memory

restrictions– Dealing with evolving data

• Massive memory, massive latency– Dealing with new memory and disk technologies

More efficient DBMSs

Page 37: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

37

Database Model• The object/relation battle

– The object-relational model– ”Real” programming languages

• Semi-structured data– Liberation from the O/R model

• Smart objects: Databases everywhere– Common data model

• Approximate and probabilistic reasoning– Models for data that cannot be

expected to provide exact answers• Stream and sensor processing

– Model for evolving data

• Cubes and OLAP– Multidimensional data model– Data organized for fast complex

analytical and ad-hoc queries• Data warehouses and data marts

– Multidimensional model– Data organized for optimally

supporting reporting and analysis• Queues and workflows

– Workflows rather than RPC style interaction model

• Publish-subscribe– Data dispatched according to a

push model

Data model paradigm. Interaction model

Page 38: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

38

Data Accessibility

• Web services– Realizing federated heterogeneous systems

• Queues and workflows– Using queues to obtain more loosely coupled systems

• Data warehouses and data marts– Collections of derived data

• The Semantic Web– ”Explaining” data through metadata

Preparing data for access- making sure relevant data is near by

Page 39: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

39

Data Processing

• Cubes and OLAP– Fast data analysis– For better business

decisions• Data mining and machine

learning– Knowledge discovery in

databases

• Data farming– Simulations for better

analysis

• Approximate answers – Text retrieval, spatio-

temporal data analysis – Approximate and

probabilistic reasoning

• The Semantic Web– Obtaining higher precision

and quality on data retrival

Climbing the value chain- from data to information to knowledge to wisdom

Page 40: IN5040 -Advanced Database Systems for Big Data · Database Trends [Gray 2004] The object/relation battle Web Services Queues and workflows Cubes and online analytic processing Data

INF5100 - Overview• Data Stream Management Systems• Complex Event Processing• Distributed Database Systems• Heterogeneous Multi-Database Systems• Data Warehouses• Data Mining and XML Databases• Cloud Data Management• Big Data & NoSQL• Performance Analysis & Large DBS 40