miscellaneous topics in databases - week 12 - misc.pdfmiscellaneous topics in databases parallel...

1111

Topic for Thursday?

Miscellaneous Topicsin Databases

PARALLEL DBMS

4444

WHY PARALLEL ACCESS TO DATA?

1 Terabyte

10 MB/s

At 10 MB/s1.2 days to scan

1 Terabyte

1,000 x parallel1.5 minute to scan.

Parallelism:divide a big probleminto many smaller onesto be solved in parallel.

5555

PARALLEL DBMS: INTRO

Parallelism is natural to DBMS processing

Pipeline parallelism: many machines each doing onestep in a multi-step process.

Partition parallelism: many machines doing the samething to different pieces of data.

Both are natural in DBMS!

Pipeline

Partition

AnySequentialProgram


SequentialSequential SequentialSequentialAny

SequentialProgram


outputs split N ways, inputs merge

6666

SOME || TERMINOLOGY

Speed-Up

More resources meansproportionally less timefor given amount ofdata.

Scale-Up

If resources increasedin proportion toincrease in data size,time is constant.

Why Realistic <> Ideal?

degree of ||-ism

Xac

t/se

c.(t

hro

ug

hpu

t)

Ideal

degree of ||-ism

sec.

/Xac

t(r

esp

on

seti

me)

Ideal

Realistic

Realistic

7777

INTRODUCTION

Parallel machines are becoming quite commonand affordable Prices of microprocessors, memory and disks have

dropped sharply

Recent desktop computers feature multiple processorsand this trend is projected to accelerate

Databases are growing increasingly large large volumes of transaction data are collected and

stored for later analysis.

multimedia objects like images are increasingly storedin databases

Large-scale parallel database systemsincreasingly used for: storing large volumes of data

processing time-consuming decision-support queries

providing high throughput for transaction processing

8888

Google data centers around the world, as of 2008

9999

PARALLELISM IN DATABASES

Data can be partitioned across multiple disks forparallel I/O.

Individual relational operations (e.g., sort, join,aggregation) can be executed in parallel

data can be partitioned and each processor can workindependently on its own partition

Results merged when done

Different queries can be run in parallel with eachother.

Concurrency control takes care of conflicts.

Thus, databases naturally lend themselves toparallelism.

10101010

PARTITIONING

Horizontal partitioning (shard)

involves putting different rows into different tables

Ex: customers with ZIP codes less than 50000 arestored in CustomersEast, while customers with ZIPcodes greater than or equal to 50000 are stored inCustomersWest

Vertical partitioning

involves creating tables with fewer columns and usingadditional tables to store the remaining columns

partitions columns even when already normalized

called "row splitting" (the row is split by its columns)

Ex: split (slow to find) dynamic data from (fast to find)static data in a table where the dynamic data is notused as often as the static

11111111

COMPARISON OF PARTITIONING TECHNIQUES

Evaluate how well partitioning techniquessupport the following types of data access:

1.Scanning the entire relation.

2.Locating a tuple associatively – point queries.

E.g., r.A = 25.

3.Locating all tuples such that the value of agiven attribute lies within a specified range –range queries.

E.g., 10 r.A < 25.

12121212

HANDLING SKEW USING HISTOGRAMS

Balanced partitioning vector can be constructed fromhistogram in a relatively straightforward fashion

Assume uniform distribution within each range of the histogram

Histogram can be constructed by scanning relation, orsampling (blocks containing) tuples of the relation

13131313

INTERQUERY PARALLELISM

Queries/transactions execute in parallel with oneanother

concurrent processing

Increases transaction throughput; used primarilyto scale up a transaction processing system tosupport a larger number of transactions persecond.

Easiest form of parallelism to support

14141414

INTRAQUERY PARALLELISM

Execution of a single query in parallel on multipleprocessors/disks; important for speeding up long-running queries

Two complementary forms of intraqueryparallelism :

Intraoperation Parallelism – parallelize theexecution of each individual operation in the query

(each CPU runs on a subset of tuples)

Interoperation Parallelism – execute the differentoperations in a query expression in parallel.

(each CPU runs a subset of operations on the data)

15151515

PARALLEL JOIN

The join operation requires pairs of tuples to betested to see if they satisfy the join condition, andif they do, the pair is added to the join output.

Parallel join algorithms attempt to split the pairsto be tested over several processors. Eachprocessor then computes part of the join locally.

In a final step, the results from each processor canbe collected together to produce the final result.

16161616

QUERY OPTIMIZATION

Query optimization in parallel databases is more complexthan in sequential databases

Cost models are more complicated, since we must take intoaccount partitioning costs and issues such as skew andresource contention

When scheduling execution tree in parallel system, mustdecide:

How to parallelize each operation

how many processors to use for it

What operations to pipeline

what operations to execute independently in parallel

what operations to execute sequentially

Determining the amount of resources to allocate for eachoperation is a problem

E.g., allocating more processors than optimal can resultin high communication overhead

DEDUCTIVE DATABASES

18181818

OVERVIEW OF DEDUCTIVE DATABASES

Declarative Language

Language to specify rules

Inference Engine (Deduction Machine)

Can deduce new facts by interpreting the rules

Related to logic programming

Prolog language (Prolog => Programming in logic)

Uses backward chaining to evaluate

Top-down application of the rules

Consists of:

Facts

Similar to relation specification without the necessity ofincluding attribute names

Rules

Similar to relational views (virtual relations that are not stored)

19191919

PROLOG/DATALOG NOTATION

Facts are provided as predicates

Predicate has

a name

a fixed number of arguments

Convention:

Constants are numeric or character strings

Variables start with upper case letters

E.g., SUPERVISE(Supervisor, Supervisee)

States that Supervisor SUPERVISE(s) Supervisee

20202020


Rule

Is of the form head :- body

where :- is read as if and only iff

E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)

E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

21212121


Query

Involves a predicate symbol followed by some variablearguments to answer the question

where :- is read as if and only iff

E.g., SUPERIOR(james,Y)?

E.g., SUBORDINATE(james,X)?

22222222

Supervisory treeProlog notation

23232323

PROVING A NEW FACT

24242424

DATA MINING

26262626

DEFINITION

Data mining is the exploration and analysis of largequantities of data in order to discover valid, novel,potentially useful, and ultimately understandablepatterns in data.

Example pattern (Census Bureau Data):If (relationship = husband), then (gender = male). 99.6%

27272727

DEFINITION (CONT.)

Data mining is the exploration and analysis of large quantitiesof data in order to discover valid, novel, potentially useful,and ultimately understandable patterns in data.

Valid: The patterns hold in general.

Novel: We did not know the patternbeforehand.

Useful: We can devise actions from thepatterns.

Understandable: We can interpret andcomprehend the patterns.

28282828

WHY USE DATA MINING TODAY?

Human analysis skills are inadequate:

Volume and dimensionality of the data

High data growth rate

Availability of:

Data

Storage

Computational power

Off-the-shelf software

Expertise

29292929

THE KNOWLEDGE DISCOVERY PROCESS

Steps:

Identify business problem

Data mining

Action

Evaluation and measurement

Deployment and integration into businessesprocesses

30303030

PREPROCESSING AND MINING

Original Data

TargetData

PreprocessedData

PatternsKnowledge

DataIntegrationand Selection

Preprocessing

ModelConstruction

Interpretation

31313131

DATA MINING TECHNIQUES

Supervised learning

Classification and regression

Unsupervised learning

Clustering

Dependency modeling

Associations, summarization, causality

Outlier and deviation detection

Trend analysis and change detection

32323232

EXAMPLE APPLICATION: SKY SURVEY

Input data: 3 TB of image data with 2billion sky objects, took more than sixyears to complete

Goal: Generate a catalog with all objectsand their type

Method: Use decision trees as data miningmodel

Results: 94% accuracy in predicting sky object classes

Increased number of faint objects classified by300%

Helped team of astronomers to discover 16 newhigh red-shift quasars in one order ofmagnitude less observation time

33333333

CLASSIFICATION EXAMPLE

Example trainingdatabase

Two predictor attributes:Age and Car-type (Sport,Minivan and Truck)

Age is ordered, Car-type iscategorical attribute

Class label indicateswhether person boughtproduct

Dependent attribute iscategorical

Age Car Class

20 M Yes

30 M Yes

25 T No

30 S Yes

40 S Yes

20 T No

30 M Yes

25 M Yes

40 M Yes

20 S No

34343434

GOALS AND REQUIREMENTS

Goals:

To produce an accurate classifier/regression function

To understand the structure of the problem

Requirements on the model:

High accuracy

Understandable by humans, interpretable

Fast construction for very large training databases

35353535

WHAT ARE DECISION TREES?

Minivan

Age

Car Type

YES NO

YES

<30 >=30

Sports, Truck

0 30 60 Age

YES

YES

NO

Minivan

Sports,Truck

36363636

DENSITY-BASED CLUSTERING

A cluster is defined as a connected dense component.

Density is defined in terms of number of neighbors ofa point.

We can find clusters of arbitrary shape

37373737

MARKET BASKET ANALYSIS

Consider shopping cart filled with several items

Market basket analysis tries to answer thefollowing questions:

Who makes purchases?

What do customers buy together?

In what order do customers purchase items?

38383838

MARKET BASKET ANALYSIS (CONTD.)

Coocurrences

80% of all customers purchase items X, Y and Z together.

Association rules

60% of all customers who purchase X and Y also buy Z.

Sequential patterns

60% of customers who first buy X also purchase Y withinthree weeks.

SPATIAL DATA

40404040

41414141

WHAT IS A SPATIAL DATABASE?

Database that:

Stores spatial objects

Manipulates spatial objects just like other objects inthe database

42424242

WHAT IS SPATIAL DATA?

Data which describes either location or shape

e.g.House or Fire Hydrant locationRoads, Rivers, Pipelines, Power linesForests, Parks, Municipalities, Lakes

In the abstract, reductionist view of the computer,these entities are represented as Points, Lines,and Polygons.

43434343

Roads are represented as LinesMail Boxes are represented as Points

44444444

TOPIC THREE

Land Use Classifications arerepresented as Polygons

45454545

TOPIC THREE

Combination of all the previous data

46464646

SPATIAL RELATIONSHIPS

Not just interested in location, also interested in“Relationships” between objects that are veryhard to model outside the spatial domain.

The most common relationships are Proximity : distance

Adjacency : “touching” and “connectivity”

Containment : inside/overlapping

47474747


Distance between a toxic waste dump and a piece ofproperty you were considering buying.

48484848


Distance to various pubs

49494949


Adjacency: All the lots which share an edge

50505050

Connectivity: Tributary relationships in rivernetworks

51515151

MOST ORGANIZATIONS HAVE SPATIAL

DATA

Geocodable addresses

Customer location

Store locations

Transportationtracking

Statistical/Demographic

Cartography

Epidemiology

Crime patterns

Weather Information

Land holdings

Natural resources

City Planning

Environmentalplanning

InformationVisualization

Hazard detection

52525252

ADVANTAGES OF SPATIAL DATABASES

Able to treat your spatial data like anything elsein the DB

transactions

backups

integrity checks

less data redundancy

fundamental organization and operations handled bythe DB

multi-user support

security/access control

locking

53535353


Offset complicated tasks to the DB server

organization and indexing done for you

do not have to re-implement operators

do not have to re-implement functions

Significantly lowers the development time ofclient applications

54545454


Spatial querying using SQL

use simple SQL expressions to determine spatialrelationships

distance

adjacency

containment

use simple SQL expressions to perform spatialoperations

area

length

intersection

union

buffer

55555555

Original Polygons

Union Intersection

56565656

Original river network

Buffered rivers

57575757


… WHERE distance(<me>,pub_loc) < 1000

SELECT distance(<me>,pub_loc)*$0.01 + beer_cost…... WHERE touches(pub_loc, street)

… WHERE inside(pub_loc,city_area) and city_name= ...

58585858


Simple value of the proposed lot

Area(<my lot>) * <price per acre>+ area(intersect(<my log>,<forested area>) ) * <wood value peracre>- distance(<my lot>, <power lines>) * <cost of power line laying>

59595959

New Electoral Districts

• Changes in areas between 1996 and2001 election.

• Want to predict voting in 2001 bylooking at voting in 1996.

• Intersect the 2001 district polygon withthe voting areas polygons.

• Outside will have zero area• Inside will have 100% area• On the border will have partial area

• Multiply the % area by 1996 actualvoting and sum

• Result is a simple prediction of 2001voting

More advanced: also use demographicdata.

60606060

DISADVANTAGES OF SPATIAL DATABASES

Cost to implement can be high

Some inflexibility

Incompatibilities with some GIS software

Slower than local, specialized data structures

User/managerial inexperience and caution

61616161

PICTOGRAMS - SHAPES

Types: Basic Shapes, Multi-Shapes, DerivedShapes, Alternate Shapes, Any possible Shape,User-Defined Shapes

Basic Shapes Alternate Shapes

Multi-Shapes Any Possible Shape

Derived Shapes User Defined Shape

N 0, N *

!

62626262

SPATIAL DATA ENTITY CREATION

Form an entity to hold county names, states,populations, and geographies

CREATE TABLE County(

Name varchar(30),

State varchar(30),

Pop Integer,

Shape Polygon);

Form an entity to hold river names, sources,lengths, and geographies

CREATE TABLE River(

Name varchar(30),

Source varchar(30),

Distance Integer,

Shape LineString);

63636363

EXTENDING THE ER DIAGRAM

Standard ERDiagram

Spatial ERDiagram

64646464

miscellaneous topics in databases - week 12 - misc.pdfmiscellaneous topics in databases parallel...

Documents