miscellaneous topics in databases - week 12 - misc.pdfmiscellaneous topics in databases parallel...
TRANSCRIPT
1111
Topic for Thursday?
Miscellaneous Topicsin Databases
PARALLEL DBMS
4444
WHY PARALLEL ACCESS TO DATA?
1 Terabyte
10 MB/s
At 10 MB/s1.2 days to scan
1 Terabyte
1,000 x parallel1.5 minute to scan.
Parallelism:divide a big probleminto many smaller onesto be solved in parallel.
5555
PARALLEL DBMS: INTRO
Parallelism is natural to DBMS processing
Pipeline parallelism: many machines each doing onestep in a multi-step process.
Partition parallelism: many machines doing the samething to different pieces of data.
Both are natural in DBMS!
Pipeline
Partition
AnySequentialProgram
AnySequentialProgram
SequentialSequential SequentialSequentialAny
SequentialProgram
AnySequentialProgram
outputs split N ways, inputs merge
6666
SOME || TERMINOLOGY
Speed-Up
More resources meansproportionally less timefor given amount ofdata.
Scale-Up
If resources increasedin proportion toincrease in data size,time is constant.
Why Realistic <> Ideal?
degree of ||-ism
Xac
t/se
c.(t
hro
ug
hpu
t)
Ideal
degree of ||-ism
sec.
/Xac
t(r
esp
on
seti
me)
Ideal
Realistic
Realistic
7777
INTRODUCTION
Parallel machines are becoming quite commonand affordable Prices of microprocessors, memory and disks have
dropped sharply
Recent desktop computers feature multiple processorsand this trend is projected to accelerate
Databases are growing increasingly large large volumes of transaction data are collected and
stored for later analysis.
multimedia objects like images are increasingly storedin databases
Large-scale parallel database systemsincreasingly used for: storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
8888
Google data centers around the world, as of 2008
9999
PARALLELISM IN DATABASES
Data can be partitioned across multiple disks forparallel I/O.
Individual relational operations (e.g., sort, join,aggregation) can be executed in parallel
data can be partitioned and each processor can workindependently on its own partition
Results merged when done
Different queries can be run in parallel with eachother.
Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves toparallelism.
10101010
PARTITIONING
Horizontal partitioning (shard)
involves putting different rows into different tables
Ex: customers with ZIP codes less than 50000 arestored in CustomersEast, while customers with ZIPcodes greater than or equal to 50000 are stored inCustomersWest
Vertical partitioning
involves creating tables with fewer columns and usingadditional tables to store the remaining columns
partitions columns even when already normalized
called "row splitting" (the row is split by its columns)
Ex: split (slow to find) dynamic data from (fast to find)static data in a table where the dynamic data is notused as often as the static
11111111
COMPARISON OF PARTITIONING TECHNIQUES
Evaluate how well partitioning techniquessupport the following types of data access:
1.Scanning the entire relation.
2.Locating a tuple associatively – point queries.
E.g., r.A = 25.
3.Locating all tuples such that the value of agiven attribute lies within a specified range –range queries.
E.g., 10 r.A < 25.
12121212
HANDLING SKEW USING HISTOGRAMS
Balanced partitioning vector can be constructed fromhistogram in a relatively straightforward fashion
Assume uniform distribution within each range of the histogram
Histogram can be constructed by scanning relation, orsampling (blocks containing) tuples of the relation
13131313
INTERQUERY PARALLELISM
Queries/transactions execute in parallel with oneanother
concurrent processing
Increases transaction throughput; used primarilyto scale up a transaction processing system tosupport a larger number of transactions persecond.
Easiest form of parallelism to support
14141414
INTRAQUERY PARALLELISM
Execution of a single query in parallel on multipleprocessors/disks; important for speeding up long-running queries
Two complementary forms of intraqueryparallelism :
Intraoperation Parallelism – parallelize theexecution of each individual operation in the query
(each CPU runs on a subset of tuples)
Interoperation Parallelism – execute the differentoperations in a query expression in parallel.
(each CPU runs a subset of operations on the data)
15151515
PARALLEL JOIN
The join operation requires pairs of tuples to betested to see if they satisfy the join condition, andif they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairsto be tested over several processors. Eachprocessor then computes part of the join locally.
In a final step, the results from each processor canbe collected together to produce the final result.
16161616
QUERY OPTIMIZATION
Query optimization in parallel databases is more complexthan in sequential databases
Cost models are more complicated, since we must take intoaccount partitioning costs and issues such as skew andresource contention
When scheduling execution tree in parallel system, mustdecide:
How to parallelize each operation
how many processors to use for it
What operations to pipeline
what operations to execute independently in parallel
what operations to execute sequentially
Determining the amount of resources to allocate for eachoperation is a problem
E.g., allocating more processors than optimal can resultin high communication overhead
DEDUCTIVE DATABASES
18181818
OVERVIEW OF DEDUCTIVE DATABASES
Declarative Language
Language to specify rules
Inference Engine (Deduction Machine)
Can deduce new facts by interpreting the rules
Related to logic programming
Prolog language (Prolog => Programming in logic)
Uses backward chaining to evaluate
Top-down application of the rules
Consists of:
Facts
Similar to relation specification without the necessity ofincluding attribute names
Rules
Similar to relational views (virtual relations that are not stored)
19191919
PROLOG/DATALOG NOTATION
Facts are provided as predicates
Predicate has
a name
a fixed number of arguments
Convention:
Constants are numeric or character strings
Variables start with upper case letters
E.g., SUPERVISE(Supervisor, Supervisee)
States that Supervisor SUPERVISE(s) Supervisee
20202020
PROLOG/DATALOG NOTATION
Rule
Is of the form head :- body
where :- is read as if and only iff
E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
21212121
PROLOG/DATALOG NOTATION
Query
Involves a predicate symbol followed by some variablearguments to answer the question
where :- is read as if and only iff
E.g., SUPERIOR(james,Y)?
E.g., SUBORDINATE(james,X)?
22222222
Supervisory treeProlog notation
23232323
PROVING A NEW FACT
24242424
DATA MINING
26262626
DEFINITION
Data mining is the exploration and analysis of largequantities of data in order to discover valid, novel,potentially useful, and ultimately understandablepatterns in data.
Example pattern (Census Bureau Data):If (relationship = husband), then (gender = male). 99.6%
27272727
DEFINITION (CONT.)
Data mining is the exploration and analysis of large quantitiesof data in order to discover valid, novel, potentially useful,and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the patternbeforehand.
Useful: We can devise actions from thepatterns.
Understandable: We can interpret andcomprehend the patterns.
28282828
WHY USE DATA MINING TODAY?
Human analysis skills are inadequate:
Volume and dimensionality of the data
High data growth rate
Availability of:
Data
Storage
Computational power
Off-the-shelf software
Expertise
29292929
THE KNOWLEDGE DISCOVERY PROCESS
Steps:
Identify business problem
Data mining
Action
Evaluation and measurement
Deployment and integration into businessesprocesses
30303030
PREPROCESSING AND MINING
Original Data
TargetData
PreprocessedData
PatternsKnowledge
DataIntegrationand Selection
Preprocessing
ModelConstruction
Interpretation
31313131
DATA MINING TECHNIQUES
Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
32323232
EXAMPLE APPLICATION: SKY SURVEY
Input data: 3 TB of image data with 2billion sky objects, took more than sixyears to complete
Goal: Generate a catalog with all objectsand their type
Method: Use decision trees as data miningmodel
Results: 94% accuracy in predicting sky object classes
Increased number of faint objects classified by300%
Helped team of astronomers to discover 16 newhigh red-shift quasars in one order ofmagnitude less observation time
33333333
CLASSIFICATION EXAMPLE
Example trainingdatabase
Two predictor attributes:Age and Car-type (Sport,Minivan and Truck)
Age is ordered, Car-type iscategorical attribute
Class label indicateswhether person boughtproduct
Dependent attribute iscategorical
Age Car Class
20 M Yes
30 M Yes
25 T No
30 S Yes
40 S Yes
20 T No
30 M Yes
25 M Yes
40 M Yes
20 S No
34343434
GOALS AND REQUIREMENTS
Goals:
To produce an accurate classifier/regression function
To understand the structure of the problem
Requirements on the model:
High accuracy
Understandable by humans, interpretable
Fast construction for very large training databases
35353535
WHAT ARE DECISION TREES?
Minivan
Age
Car Type
YES NO
YES
<30 >=30
Sports, Truck
0 30 60 Age
YES
YES
NO
Minivan
Sports,Truck
36363636
DENSITY-BASED CLUSTERING
A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors ofa point.
We can find clusters of arbitrary shape
37373737
MARKET BASKET ANALYSIS
Consider shopping cart filled with several items
Market basket analysis tries to answer thefollowing questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?
38383838
MARKET BASKET ANALYSIS (CONTD.)
Coocurrences
80% of all customers purchase items X, Y and Z together.
Association rules
60% of all customers who purchase X and Y also buy Z.
Sequential patterns
60% of customers who first buy X also purchase Y withinthree weeks.
SPATIAL DATA
40404040
41414141
WHAT IS A SPATIAL DATABASE?
Database that:
Stores spatial objects
Manipulates spatial objects just like other objects inthe database
42424242
WHAT IS SPATIAL DATA?
Data which describes either location or shape
e.g.House or Fire Hydrant locationRoads, Rivers, Pipelines, Power linesForests, Parks, Municipalities, Lakes
In the abstract, reductionist view of the computer,these entities are represented as Points, Lines,and Polygons.
43434343
Roads are represented as LinesMail Boxes are represented as Points
44444444
TOPIC THREE
Land Use Classifications arerepresented as Polygons
45454545
TOPIC THREE
Combination of all the previous data
46464646
SPATIAL RELATIONSHIPS
Not just interested in location, also interested in“Relationships” between objects that are veryhard to model outside the spatial domain.
The most common relationships are Proximity : distance
Adjacency : “touching” and “connectivity”
Containment : inside/overlapping
47474747
SPATIAL RELATIONSHIPS
Distance between a toxic waste dump and a piece ofproperty you were considering buying.
48484848
SPATIAL RELATIONSHIPS
Distance to various pubs
49494949
SPATIAL RELATIONSHIPS
Adjacency: All the lots which share an edge
50505050
Connectivity: Tributary relationships in rivernetworks
51515151
MOST ORGANIZATIONS HAVE SPATIAL
DATA
Geocodable addresses
Customer location
Store locations
Transportationtracking
Statistical/Demographic
Cartography
Epidemiology
Crime patterns
Weather Information
Land holdings
Natural resources
City Planning
Environmentalplanning
InformationVisualization
Hazard detection
52525252
ADVANTAGES OF SPATIAL DATABASES
Able to treat your spatial data like anything elsein the DB
transactions
backups
integrity checks
less data redundancy
fundamental organization and operations handled bythe DB
multi-user support
security/access control
locking
53535353
ADVANTAGES OF SPATIAL DATABASES
Offset complicated tasks to the DB server
organization and indexing done for you
do not have to re-implement operators
do not have to re-implement functions
Significantly lowers the development time ofclient applications
54545454
ADVANTAGES OF SPATIAL DATABASES
Spatial querying using SQL
use simple SQL expressions to determine spatialrelationships
distance
adjacency
containment
use simple SQL expressions to perform spatialoperations
area
length
intersection
union
buffer
55555555
Original Polygons
Union Intersection
56565656
Original river network
Buffered rivers
57575757
ADVANTAGES OF SPATIAL DATABASES
… WHERE distance(<me>,pub_loc) < 1000
SELECT distance(<me>,pub_loc)*$0.01 + beer_cost…... WHERE touches(pub_loc, street)
… WHERE inside(pub_loc,city_area) and city_name= ...
58585858
ADVANTAGES OF SPATIAL DATABASES
Simple value of the proposed lot
Area(<my lot>) * <price per acre>+ area(intersect(<my log>,<forested area>) ) * <wood value peracre>- distance(<my lot>, <power lines>) * <cost of power line laying>
59595959
New Electoral Districts
• Changes in areas between 1996 and2001 election.
• Want to predict voting in 2001 bylooking at voting in 1996.
• Intersect the 2001 district polygon withthe voting areas polygons.
• Outside will have zero area• Inside will have 100% area• On the border will have partial area
• Multiply the % area by 1996 actualvoting and sum
• Result is a simple prediction of 2001voting
More advanced: also use demographicdata.
60606060
DISADVANTAGES OF SPATIAL DATABASES
Cost to implement can be high
Some inflexibility
Incompatibilities with some GIS software
Slower than local, specialized data structures
User/managerial inexperience and caution
61616161
PICTOGRAMS - SHAPES
Types: Basic Shapes, Multi-Shapes, DerivedShapes, Alternate Shapes, Any possible Shape,User-Defined Shapes
Basic Shapes Alternate Shapes
Multi-Shapes Any Possible Shape
Derived Shapes User Defined Shape
N 0, N *
!
62626262
SPATIAL DATA ENTITY CREATION
Form an entity to hold county names, states,populations, and geographies
CREATE TABLE County(
Name varchar(30),
State varchar(30),
Pop Integer,
Shape Polygon);
Form an entity to hold river names, sources,lengths, and geographies
CREATE TABLE River(
Name varchar(30),
Source varchar(30),
Distance Integer,
Shape LineString);
63636363
EXTENDING THE ER DIAGRAM
Standard ERDiagram
Spatial ERDiagram
64646464