data516/csed516 scalable data systems and algorithms
TRANSCRIPT
![Page 1: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/1.jpg)
DATA516/CSED516Scalable Data Systems and Algorithms
Lecture 1Design of a Relational DBMS
1DATA516/CSED516 - Fall 2020
![Page 2: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/2.jpg)
2
Course Staff• Instructor: Dan Suciu
• TA: Remy [email protected]
• TA: Zechariah Cheung [email protected]
DATA516/CSED516 - Fall 2020
![Page 3: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/3.jpg)
Coarse Aims
• Study design of big data systems– Historical perspective– Sample of modern systems– Breadth of designs (relational, streaming, graph, etc.)
• Study key scalable data processing algorithms• Gain hands-on experience with big data systems
– Demonstrations and tutorials in sections– Assignments and projects
DATA516/CSED516 - Fall 2020 3
![Page 4: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/4.jpg)
Coarse Content
• Query processing: single-sever, distributed• MapReduce, legacy, successors• Some important “Big data” algorithms• Misc: streaming, column stores, graph engines
DATA516/CSED516 - Fall 2020 4
![Page 5: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/5.jpg)
Course Format
• 5pm-7:50pm: Lectures– Discuss system architecture & algorithms
• 8pm-8:50pm: Hands-on tutorials– Learn how to use big data systems– Jump start your homeworks– Bring your laptop!
DATA516/CSED516 - Fall 2020 5
![Page 6: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/6.jpg)
6
Grading (subject to change!)• 15%: Reading assigned papers
– Write short statement/review• 60%: Homework assignments
– Redshift Spark, Snowflake, others• 25%: Final project
DATA516/CSED516 - Fall 2020
![Page 7: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/7.jpg)
Project
Choose a topic:• Don’t worry about novelty!• Highly recommended: Benchmark projects
– Analyze the performance of some features– Compare the performance of different systems– Try to implement an interesting workload
• I will post a few ideas, but you are strongly encouraged to come up with your own
7
![Page 8: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/8.jpg)
Project
1. Project proposal (1 page)2. Project milestone (2-3 pages)3. Project presentation (in class)4. Project final report (4-5 pages)
DATA516/CSED516 - Fall 2020 8
![Page 9: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/9.jpg)
Web Services
• HW1: Amazon Redshift – attend today’s section!
• HW2: Spark/AWS• HW3: Snowflake – see Remy’s post• HW4: mini-homeworks – stay tuned
Azure: optional, for the project9
![Page 10: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/10.jpg)
Communication
• Course webpage: all important stuff https://courses.cs.washington.edu/courses/csed516/20au/
• Discussion Board: ED. Say “hello”!
• Class email: only for important announcements
10
![Page 11: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/11.jpg)
How to Turn In
https://gitlab.cs.washington.edu/• Your own repository• Pull to get homework instructions, starter files• Push homework solutions, project reports
Reviews: we use google forms• Typically around ½ page• Goal is only for us to check that you have
read the paper 11
![Page 12: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/12.jpg)
Relational Database Management Systems
DATA516/CSED516 - Fall 2020 12
![Page 13: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/13.jpg)
Quick Review
• Database is a collection of files• Database management system (DBMS) is a
piece of software to help manage that data• History:
– Origins in the 1960’s– Relational model 1970– First relational DBMSs (Ingres and System R):
1970’s– Parallel DBMSs: 1980’s
DATA516/CSED516 - Fall 2020 13
![Page 14: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/14.jpg)
DBMS Functionality1. Describe real-world entities in terms of a data
model2. Create & persistently store large datasets3. Efficiently query & update
1. Must handle complex questions about data2. Must handle sophisticated updates3. Performance matters
4. Change structure (e.g., add attributes)5. Concurrency control: enable simultaneous updates6. Crash recovery7. Access control, security, integrity
DATA516/CSED516 - Fall 2020 14
![Page 15: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/15.jpg)
Relational Data Model• A Database is a collection of relations
• A Relation is a subset of Dom1 x Dom2 x … x Domn– Where Domi is the domain of attribute i– n is number of attributes of the relation– A relation R is a set of tuples
• A Tuple t is an element of Dom1 x Dom2 x … x Domn
DATA516/CSED516 - Fall 2020 15
Other names: relation = table; tuple = row
![Page 16: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/16.jpg)
Discussion• Rows in a relation:
– Ordering immaterial (a relation is a set)– All rows are distinct – set semantics– Query answers may have duplicates – bag semantics
• Columns in a tuple:– Ordering is significant– Applications refer to columns by their names
• Domain of each column is a primitive type
Data independence!
DATA516/CSED516 - Fall 2020 16
Or is it?
![Page 17: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/17.jpg)
DATA516/CSED516 - Fall 2020
Schema• Relation schema: describes column heads
– Relation name– Name of each field (or column, or attribute)– Domain of each field– The arity of the relation = # attributes
• Database schema: set of all relation schemas
17
![Page 18: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/18.jpg)
DATA516/CSED516 - Fall 2020
Instance• Relation instance: concrete table content
– Set of tuples (also called records) matching the schema– The cardinality of the relation = # tuples
(a.k.a. size)
• Database instance: set of all relation instances
18
![Page 19: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/19.jpg)
What is the schema?What is the instance?
DATA516/CSED516 - Fall 2020 19
sno sname scity sstate1 s1 city 1 WA2 s2 city 1 WA3 s3 city 2 MA4 s4 city 2 MA
Supplier
![Page 20: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/20.jpg)
What is the schema?What is the instance?
DATA516/CSED516 - Fall 2020 20
sno sname scity sstate1 s1 city 1 WA2 s2 city 1 WA3 s3 city 2 MA4 s4 city 2 MA
Relation schemaSupplier(sno: integer, sname: string, scity: string, sstate: string)Supplier
instance
![Page 21: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/21.jpg)
Relational Query Language
• Set-at-a-time:– Query inputs and outputs are relations
• Two variants of the query language:– Relational algebra: specifies order of operations– Relational calculus / SQL: declarative
DATA516/CSED516 - Fall 2020 21
![Page 22: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/22.jpg)
Note
• We will review Relational Algebra and SQL today
• In addition: please review at home:– Review material from DATA514/CSED514
DATA516/CSED516 - Fall 2020 22
![Page 23: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/23.jpg)
Structured Query Language: SQL
• Data definition language: DDL– Statements to create, modify tables and views– CREATE TABLE …,
CREATE VIEW …, ALTER TABLE…
• Data manipulation language: DML– Statements to issue queries, insert, delete data– SELECT-FROM-WHERE…,
INSERT…,UPDATE…, DELETE…
We focuson this
![Page 24: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/24.jpg)
DATA516/CSED516 - Fall 2020
SQL Query
SELECT <attributes>FROM <one or more relations>WHERE <conditions>
Basic form: (plus many many more bells and whistles)
24
![Page 25: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/25.jpg)
Quick Review of SQL
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
DATA516/CSED516 - Fall 2020 25
![Page 26: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/26.jpg)
Quick Review of SQL
What doesthis querycompute?
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
SELECT DISTINCT z.pno, z.pnameFROM Supplier x, Supply y, Part z WHERE x.sno = y.sno
and y.pno = z.pnoand x.scity = ‘Seattle’and y.price < 100
DATA516/CSED516 - Fall 2020 26
![Page 27: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/27.jpg)
Terminology
• Selection: return a subset of the rows:– SELECT * FROM Supplier
WHERE scity = ’Seattle’
• Projection: return subset of the columns:– SELECT DISTINCT scity FROM Supplier;
• Join: refers to combining two or more tables– SELECT * FROM Supplier, Supply, Part …
27
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
![Page 28: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/28.jpg)
Self-Joins
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
DATA516/CSED516 - Fall 2020 28
Find the Parts numbers available both fromsuppliers in Seattle, and suppliers in Portland
![Page 29: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/29.jpg)
Self-Joins
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
SELECT DISTINCT y1.pnoFROM Supplier x1, Supplier x2, Supply y1, Supply y2WHERE x1.scity = ‘Seattle’
and x1.sno = y1.snoand x2.scity = ‘Portland’and x2.sno = y2.snoand y1.pno = y2.pno
DATA516/CSED516 - Fall 2020 29
Find the Parts numbers available both fromsuppliers in Seattle, and suppliers in Portland
![Page 30: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/30.jpg)
Self-Joins
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
SELECT DISTINCT y1.pnoFROM Supplier x1, Supplier x2, Supply y1, Supply y2WHERE x1.scity = ‘Seattle’
and x1.sno = y1.snoand x2.scity = ‘Portland’and x2.sno = y2.snoand y1.pno = y2.pno
DATA516/CSED516 - Fall 2020 30
Find the Parts numbers available both fromsuppliers in Seattle, and suppliers in Portland
Self-join
![Page 31: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/31.jpg)
Simple Analytics
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
SELECT z.pno, z.pname, min(y.price) as p1, max(y.price) as p2FROM Supply y, Part z WHERE y.pno = z.pnoGROUP BY z.pno, z.pname
DATA516/CSED516 - Fall 2020 31
For each part, compute its minimum and maximum pricefrom all suppliers.
![Page 32: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/32.jpg)
Terminology
• Online Analytical Processing (OLAP)a.k.a. Data Analytics queries– GROUP-BY + aggregates– No updates– Touch most of, or all the data– Very important in data science!
• Online Transaction Processing (OLTP):– Point queries: return account 12345– Often have updates 32
Data Science
![Page 33: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/33.jpg)
Terminology
• Online Analytical Processing (OLAP)a.k.a. Data Analytics queries– GROUP-BY + aggregates– No updates– Touch most of, or all the data– Very important in data science!
• Online Transaction Processing (OLTP):– Point queries: return account 12345– Often have updates 33
Data Science
![Page 34: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/34.jpg)
Other use of Relational Data
• Sparse vectors, matrics
• Graph databases
DATA516/CSED516 - Fall 2020 34
![Page 35: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/35.jpg)
Sparse Matrics
DATA516/CSED516 - Fall 2020 35
𝐴 =5 0 −20 0 −10 7 0
How can we representit as a relation?
![Page 36: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/36.jpg)
Sparse Matrics
DATA516/CSED516 - Fall 2020 36
𝐴 =5 0 −20 0 −10 7 0
Row Col Val1 1 51 3 -22 3 -13 2 7
![Page 37: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/37.jpg)
Matrix Multiplication in SQL
DATA516/CSED516 - Fall 2020 37
𝐶 = 𝐴 ⋅ 𝐵
![Page 38: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/38.jpg)
Matrix Multiplication in SQL
DATA516/CSED516 - Fall 2020 38
𝐶 = 𝐴 ⋅ 𝐵 𝐶!" =&#
𝐴!# ⋅ 𝐵#"
![Page 39: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/39.jpg)
Matrix Multiplication in SQL
DATA516/CSED516 - Fall 2020 39
𝐶 = 𝐴 ⋅ 𝐵 𝐶!" =&#
𝐴!# ⋅ 𝐵#"
SELECT A.row, B.col, sum(A.val*B.val)FROM A, BWHERE A.col = B.rowGROUP BY A.row, B.col;
![Page 40: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/40.jpg)
Discussion
• Matrix multiplication = join + group-by• Many operations can be written in SQL• E.g. try at home: write in SQL
𝑇𝑟 𝐴 ⋅ 𝐵 ⋅ 𝐶where the trace is defined as:
𝑇𝑟 𝑋 = ∑! 𝑋!!• Surprisingly, 𝐴 + 𝐵 is a bit harder…
40
![Page 41: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/41.jpg)
Matrix Addition in SQL
DATA516/CSED516 - Fall 2020 41
𝐶 = 𝐴 + 𝐵
![Page 42: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/42.jpg)
Matrix Addition in SQL
DATA516/CSED516 - Fall 2020 42
𝐶 = 𝐴 + 𝐵
SELECT A.row, A.col, A.val + B.val as valFROM A, BWHERE A.row = B.row and A.col = B.col
![Page 43: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/43.jpg)
Matrix Addition in SQL
DATA516/CSED516 - Fall 2020 43
𝐶 = 𝐴 + 𝐵
SELECT A.row, A.col, A.val + B.val as valFROM A, BWHERE A.row = B.row and A.col = B.col
Why is this wrong?
![Page 44: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/44.jpg)
Solution 1: Outer Joins
DATA516/CSED516 - Fall 2020 44
𝐶 = 𝐴 + 𝐵
SELECT(CASE WHEN A.row is null THEN B.row ELSE A.row END) as row,(CASE WHEN A.col is null THEN B.col ELSE A.col END) as col,(CASE WHEN A.val is null THEN 0 ELSE A.val END) +(CASE WHEN B.val is null THEN 0 ELSE B.val END) as valFROM A full outer join B ON A.row = B.row and A.col = B.col;
![Page 45: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/45.jpg)
Solution 1: Outer Joins
DATA516/CSED516 - Fall 2020 45
𝐶 = 𝐴 + 𝐵
SELECT(CASE WHEN A.row is null THEN B.row ELSE A.row END) as row,(CASE WHEN A.col is null THEN B.col ELSE A.col END) as col,(CASE WHEN A.val is null THEN 0 ELSE A.val END) +(CASE WHEN B.val is null THEN 0 ELSE B.val END) as valFROM A full outer join B ON A.row = B.row and A.col = B.col;
![Page 46: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/46.jpg)
Solution 1: Outer Joins
DATA516/CSED516 - Fall 2020 46
𝐶 = 𝐴 + 𝐵
SELECT(CASE WHEN A.row is null THEN B.row ELSE A.row END) as row,(CASE WHEN A.col is null THEN B.col ELSE A.col END) as col,(CASE WHEN A.val is null THEN 0 ELSE A.val END) +(CASE WHEN B.val is null THEN 0 ELSE B.val END) as valFROM A full outer join B ON A.row = B.row and A.col = B.col;
![Page 47: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/47.jpg)
Solution 1: Outer Joins
DATA516/CSED516 - Fall 2020 47
𝐶 = 𝐴 + 𝐵
SELECT(CASE WHEN A.row is null THEN B.row ELSE A.row END) as row,(CASE WHEN A.col is null THEN B.col ELSE A.col END) as col,(CASE WHEN A.val is null THEN 0 ELSE A.val END) +(CASE WHEN B.val is null THEN 0 ELSE B.val END) as valFROM A full outer join B ON A.row = B.row and A.col = B.col;
![Page 48: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/48.jpg)
Discussion
• Outer joins: includes a tuple even if it doesn’t join with anything in the other table
• Left outer join, right outer join, full outer join – what do they mean?
• Note distinction between ON and WHERE
DATA516/CSED516 - Fall 2020 48
![Page 49: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/49.jpg)
WHERE v.s. ONSparse vectors:
1 2 3 4 5 6 7 8 9X= 50 0 -30 60 -80 0 -90 10 0
Y= -55 0 65 -15 0 35 -75 15 25
![Page 50: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/50.jpg)
WHERE v.s. ONSparse vectors:
1 2 3 4 5 6 7 8 9X= 50 0 -30 60 -80 0 -90 10 0
Y= -55 0 65 -15 0 35 -75 15 25
SELECT x.pos, x.val, y.valFROM x left outer join yON x.pos = y.pos and y.val > 0;
![Page 51: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/51.jpg)
WHERE v.s. ONSparse vectors:
1 2 3 4 5 6 7 8 9X= 50 0 -30 60 -80 0 -90 10 0
Y= -55 0 65 -15 0 35 -75 15 25
SELECT x.pos, x.val, y.valFROM x left outer join yON x.pos = y.pos and y.val > 0;
SELECT x.pos, x.val, y.valFROM x left outer join yON x.pos = y.pos
WHERE y.val > 0;
v.s.
![Page 52: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/52.jpg)
WHERE v.s. ONSparse vectors:
1 2 3 4 5 6 7 8 9X= 50 0 -30 60 -80 0 -90 10 0
Y= -55 0 65 -15 0 35 -75 15 25
SELECT x.pos, x.val, y.valFROM x left outer join yON x.pos = y.pos and y.val > 0;
SELECT x.pos, x.val, y.valFROM x left outer join yON x.pos = y.pos
WHERE y.val > 0;x.pos x.val y.val
1 50 Null
3 -30 65
4 60 Null
5 -80 Null
7 -90 Null
8 10 15
x.pos x.val y.val
3 -30 65
8 10 15
v.s.
![Page 53: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/53.jpg)
Solution 2: Group By
DATA516/CSED516 - Fall 2020 53
𝐶 = 𝐴 + 𝐵
SELECT m.row, m.col, sum(m.val)FROM (SELECT * FROM A
UNION ALLSELECT * FROM B) as m
GROUP BY m.row, m.col;
![Page 54: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/54.jpg)
Graph Databases
• Graph databases systems are a niche category of products specialized for processing large graphs
• E.g. Neo4J, TigerGraph• A graph is a special case of a relation,
and can be processed using SQL
DATA516/CSED516 - Fall 2020 54
![Page 55: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/55.jpg)
Graph DatabasesA graph:
1
2
4
3
5
![Page 56: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/56.jpg)
Graph Databases
src dst1 22 12 3
1 4
3 44 5
EdgeA graph: A relation:
1
2
4
3
5
![Page 57: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/57.jpg)
Graph Databases
1
2
4
3
src dst1 22 12 3
1 4
3 44 5
Edge
5
A graph:
Find nodes at distance 2: 𝑥, 𝑧 ∃𝑦 𝐸𝑑𝑔𝑒 𝑥, 𝑦 ∧ 𝐸𝑑𝑔𝑒(𝑦, 𝑧)}
A relation:
![Page 58: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/58.jpg)
Graph Databases
src dst1 22 12 3
1 4
3 44 5
EdgeA graph:
Find nodes at distance 2: 𝑥, 𝑧 ∃𝑦 𝐸𝑑𝑔𝑒 𝑥, 𝑦 ∧ 𝐸𝑑𝑔𝑒(𝑦, 𝑧)}
SELECT DISTINCT e1.src as X, e2.dst as ZFROM Edge e1, Edge e2WHERE e1.dst = e2.src;
A relation:
1
2
4
3
5
![Page 59: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/59.jpg)
Other Representation
src dstAlice BobBob AliceBob Chris
Alice David
Chris DavidDavid Eve
EdgeNode
Alice
Bob
David
Chris
EveFrank src
AliceBobChris
David
EveFrank
Representing nodes separately;needed for “isolated nodes” e.g. Frank
![Page 60: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/60.jpg)
Other Representation
src dst weightAlice Bob 3Bob Alice 1Bob Chris 2
Alice David 9
Chris David 5David Eve 1
EdgeNode
Alice
Bob
David
Chris
EveFrank src
AliceBobChris
David
EveFrank
Adding edge labelsAdding node labels…
2
53
1
91
![Page 61: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/61.jpg)
Limitations of SQL
• No recursion! Examples requiring recursion:– Gradient descent– Connected components in a graph
• Advanced systems do support recursion• Practical solution: use some external
driver, e.g. pyton
DATA516/CSED516 - Fall 2020 61
![Page 62: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/62.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
Data
![Page 63: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/63.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
Data𝑃 𝑌 = 0 𝑋 =
11 + 𝑒𝑥𝑝 𝑤! + ∑"#$,&𝑤"𝑋"
𝑃 𝑌 = 1 𝑋 =𝑒𝑥𝑝 𝑤! + ∑"#$,&𝑤"𝑋"
1 + 𝑒𝑥𝑝 𝑤! + ∑"#$,&𝑤"𝑋"
Switched(following Mitchell)
![Page 64: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/64.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
Data𝑃 𝑌 = 0 𝑋 =
11 + 𝑒𝑥𝑝 𝑤! + ∑"#$,&𝑤"𝑋"
𝑃 𝑌 = 1 𝑋 =𝑒𝑥𝑝 𝑤! + ∑"#$,&𝑤"𝑋"
1 + 𝑒𝑥𝑝 𝑤! + ∑"#$,&𝑤"𝑋"
Switched(following Mitchell)
Train weights 𝑤!, 𝑤$, 𝑤', 𝑤& to minimize loss:
𝐿 𝑤!, … , 𝑤& = :ℓ#$,)
𝑌ℓ ⋅ ln 𝑃 𝑌 = 1|𝑋ℓ + (1 − 𝑌ℓ) ⋅ ln 𝑃 𝑌 = 0|𝑋ℓ
![Page 65: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/65.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
![Page 66: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/66.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
![Page 67: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/67.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
SELECTW.w0+0.01*sum(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3))) as w0,W.w1+0.01*sum(d.X1*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w1,W.w2+0.01*sum(d.X2*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w2,W.w3+0.01*sum(d.X3*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w3
FROM data d, WWHERE W.k=1GROUP BY W.k, W.w0, W.w1, W.w2, W.w3;
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
![Page 68: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/68.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
SELECTW.w0+0.01*sum(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3))) as w0,W.w1+0.01*sum(d.X1*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w1,W.w2+0.01*sum(d.X2*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w2,W.w3+0.01*sum(d.X3*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w3
FROM data d, WWHERE W.k=1GROUP BY W.k, W.w0, W.w1, W.w2, W.w3;
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
![Page 69: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/69.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
SELECTW.w0+0.01*sum(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3))) as w0,W.w1+0.01*sum(d.X1*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w1,W.w2+0.01*sum(d.X2*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w2,W.w3+0.01*sum(d.X3*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w3
FROM data d, WWHERE W.k=1GROUP BY W.k, W.w0, W.w1, W.w2, W.w3;
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
![Page 70: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/70.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
SELECTW.w0+0.01*sum(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3))) as w0,W.w1+0.01*sum(d.X1*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w1,W.w2+0.01*sum(d.X2*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w2,W.w3+0.01*sum(d.X3*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w3
FROM data d, WWHERE W.k=1GROUP BY W.k, W.w0, W.w1, W.w2, W.w3;
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
![Page 71: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/71.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
SELECTW.w0+0.01*sum(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3))) as w0,W.w1+0.01*sum(d.X1*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w1,W.w2+0.01*sum(d.X2*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w2,W.w3+0.01*sum(d.X3*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w3
FROM data d, WWHERE W.k=1GROUP BY W.k, W.w0, W.w1, W.w2, W.w3;
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
![Page 72: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/72.jpg)
Example: Logistic RegressionTom Mitchell: Machine Learning
X1 X2 X3 Y
3 9 3 03 5 7 16 2 2 0
3 6 3 0
5 5 9 19 3 3 1… … …… … …
DataGradient Descent:
𝑤" ← 𝑤" + 𝜂 :ℓ#$,)
𝑋"ℓ 𝑌ℓ − 𝑃(𝑌 = 1|𝑋ℓ)
SELECTW.w0+0.01*sum(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3))) as w0,W.w1+0.01*sum(d.X1*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w1,W.w2+0.01*sum(d.X2*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w2,W.w3+0.01*sum(d.X3*(d.Y - 1 + 1/(1+exp(W.w0+W.w1*d.X1+W.w2*d.X2+W.w3*d.X3)))) as w3
FROM data d, WWHERE W.k=1GROUP BY W.k, W.w0, W.w1, W.w2, W.w3;
CREATE TABLE W (k int primary key, w0 real, w1 real, w2 real, w3 real);INSERT INTO W VALUES (1, 0, 0, 0, 0);
Update W, then repeat thise.g. using python
![Page 73: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/73.jpg)
Discussion
SQL in Data Science:• Used primarily to prepare the data
– ETL – Extract/Transform/Load– Join tables, process columns, filter rows
• Can also be used in training– Much less convenient than ML packages– But can be the best option if data is huge
DATA516/CSED516 - Fall 2020 73
![Page 74: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/74.jpg)
SQL – Summary
• Very complex: >1000 pages, – No vendor supports full standard; (in practice,
people use postgres as de facto standard)– Much more than DML
• It is a declarative language: – we say what we want– we don’t say how to get it
• Relational algebra says how to get it 74
![Page 75: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/75.jpg)
DATA516/CSED516 - Fall 2020
Relational Algebra• Queries specified in an operational manner
– A query gives a step-by-step procedure
• Relational operators– Take one or two relation instances as input– Return one relation instance as result– Easy to compose into relational algebra expressions
75
![Page 76: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/76.jpg)
Five Basic Relational Operators
• Selection: 𝜎condition(S)– Condition is Boolean combination (∧,∨)
of atomic predicates (<, <=, =, ≠, >=, >)• Projection: πlist-of-attributes(S)• Union (∪)• Set difference (–), • Cross-product/cartesian product (⨯),
Join: R ⋈𝛉S = 𝜎𝛉(R⨯S)Other operators: anti-semijoin, renaming
DATA516/CSED516 - Fall 2020 76
![Page 77: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/77.jpg)
DATA516/CSED516 - Fall 2020
Extended Operatorsof Relational Algebra
• Duplicate elimination (𝛿)– Since commercial DBMSs operate on multisets
not sets• Group-by/aggregate (ɣ)
– Min, max, sum, average, count– Partitions tuples of a relation into “groups”– Aggregates can then be applied to groups
• Sort operator (𝜏)
77
![Page 78: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/78.jpg)
Logical Query Plans
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
78
SELECT DISTINCT x.sname, x.scityFROM Supplier x, Supply y, Part zWHERE x.sno=y.sno
and y.pno=z.pnoand z.psize > 10;
![Page 79: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/79.jpg)
Logical Query Plans
Supplier Supply
pno=pno
Part
Π sname,scity
σ psize > 10sno=sno
Supplier(sno,sname,scity,sstate)Supply(sno,pno,qty,price)Part(pno,pname,psize,pcolor)
79
SELECT DISTINCT x.sname, x.scityFROM Supplier x, Supply y, Part zWHERE x.sno=y.sno
and y.pno=z.pnoand z.psize > 10;
𝝳
![Page 80: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/80.jpg)
Query Optimizer
• Rewrite one relational algebra expression to a better one
• Very brief review now, more details next lecture
DATA516/CSED516 - Fall 2020 80
![Page 81: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/81.jpg)
Optimization
Product Purchase
pid=pid
price>100 and city=‘Seattle’
x.name,z.name
δ
cid=cid
Customer
Π
σ
SELECT DISTINCT x.name, z.nameFROM Product x, Purchase y, Customer zWHERE x.pid = y.pid and y.cid = y.cid and
x.price > 100 and z.city = ‘Seattle’
Product(pid, name, price)Purchase(pid, cid, store)Customer(cid, name, city)
![Page 82: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/82.jpg)
Optimization
Product Purchase
pid=pid
price>100 and city=‘Seattle’
x.name,z.name
δ
cid=cid
Customer
Π
σ
SELECT DISTINCT x.name, z.nameFROM Product x, Purchase y, Customer zWHERE x.pid = y.pid and y.cid = y.cid and
x.price > 100 and z.city = ‘Seattle’
Product(pid, name, price)Purchase(pid, cid, store)Customer(cid, name, city)
Pushselections
down
![Page 83: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/83.jpg)
Optimization
Product Purchase
pid=pidcity=‘Seattle’
x.name,z.name
δ
cid=cid
Customer
Π
σprice>100
σ
SELECT DISTINCT x.name, z.nameFROM Product x, Purchase y, Customer zWHERE x.pid = y.pid and y.cid = y.cid and
x.price > 100 and z.city = ‘Seattle’
Product(pid, name, price)Purchase(pid, cid, store)Customer(cid, name, city)
More about thisnext lecture
Pushselections
down
![Page 84: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/84.jpg)
Relational Model -- Summary
• Schema v.s. Data
DATA516/CSED516 - Fall 2020 84
![Page 85: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/85.jpg)
Relational Model -- Summary
• Schema v.s. Data• Data is normalized (what is that?)
DATA516/CSED516 - Fall 2020 85
![Page 86: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/86.jpg)
Relational Model -- Summary
• Schema v.s. Data• Data is normalized (what is that?)
– 1st NF: relations are flat (also unordered)– BCNF (or 3rd or 4th NF…): split large table
into many small (why?), need to join back
DATA516/CSED516 - Fall 2020 86
![Page 87: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/87.jpg)
Relational Model -- Summary
• Schema v.s. Data• Data is normalized (what is that?)
– 1st NF: relations are flat (also unordered)– BCNF (or 3rd or 4th NF…): split large table
into many small (why?), need to join back– (Consequence: joins are really important)
DATA516/CSED516 - Fall 2020 87
![Page 88: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/88.jpg)
Relational Model -- Summary
• Schema v.s. Data• Data is normalized (what is that?)
– 1st NF: relations are flat (also unordered)– BCNF (or 3rd or 4th NF…): split large table
into many small (why?), need to join back– (Consequence: joins are really important)
• Query language is SQL, or something equivalent, like relational algebra
DATA516/CSED516 - Fall 2020 88
![Page 89: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/89.jpg)
Benefits of Relational Model
• Physical data independence– Can change how data is organized on disk without
affecting applications
• Logical data independence– Can change the logical schema without affecting
applications (not 100%... consider updates)
DATA516/CSED516 - Fall 2020 89
![Page 90: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/90.jpg)
Physical Data Independence
90
sno sname scity sstate1 s1 city 1 WA2 s2 city 1 WA3 s3 city 2 MA4 s4 city 2 MA
SupplierSELECT DISTINCT snameFROM SupplierWHERE scity = ‘Seattle’
How is the data stored on disk?(e.g. row-wise, column-wise)
Is there an index on scity?(e.g. no index, unclustered index, clustered index)
The SQL query worksthe same, regardless
of the answers tothese questions
![Page 91: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/91.jpg)
How to Implement a Relational DBMS?
DATA516/CSED516 - Fall 2020 91
DBMS
SQL
Data
![Page 92: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/92.jpg)
DBMS Architecture
92
Process Manager
Admission Control
Connection Mgr
Query Processor
Parser
Query Rewrite
Optimizer
Executor
Storage Manager
Access Methods
Lock Manager
Buffer Manager
Log ManagerShared Utilities
Memory Mgr
Disk Space Mgr
Replication Services
Admin Utilities
[Anatomy of a Db System. J. Hellerstein & M. Stonebraker. Red Book. 4ed.]
![Page 93: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/93.jpg)
Storage Manager
DATA516/CSED516 - Fall 2020 93
![Page 94: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/94.jpg)
Disks
• Data resides persistently on disks– Your local disk, or a Network Attached
Storage (NAS), or Amazon’s S3• For processing, data must reside in
main memory
DATA516/CSED516 - Fall 2020 94
![Page 95: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/95.jpg)
Disks v.s. Main MemoryDisk• Unit of data =1 block
4KB or 8KB or 16KB• Access time* = seek time +
rotational latency + transfer rate≈ 12ms + 5ms + 150MB/s
– Random access ≈ 17 ms– Sequential access ≈ 50𝜇s
• Organization:– Heap file– Index file
Main memory• Unit of data =1byte
• Access time** = 50ns
• Organization:– Lists, arrays, hash tables, …
* https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics** https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Memory_timing
![Page 96: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/96.jpg)
Disks v.s. Main MemoryDisk• Unit of data =1 block
4KB or 8KB or 16KB• Access time* = seek time +
rotational latency + transfer rate≈ 12ms + 5ms + 150MB/s
– Random access ≈ 17 ms– Sequential access ≈ 50𝜇s
• Organization:– Heap file– Index file
Main memory• Unit of data =1byte
• Access time** = 50ns
• Organization:– Lists, arrays, hash tables, …
* https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics** https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Memory_timing
Orders ofmagnitude
slower
![Page 97: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/97.jpg)
DATA516/CSED516 - Fall 2020
Heap File
30 18 …
70 21
20 20
40 19
80 19
60 18
10 21
50 22
Data on disk is stored in filesFiles consist of pages filled with recordsA heap file is not sorted on any attributeStudent(sid: int, age: int, …)
1 record
1 page
97
![Page 98: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/98.jpg)
DATA516/CSED516 - Fall 2020
Index
10
20
30
40
50
60
70
80
10
20
30
40
50
60
70
80
1 data entry
1 page
Index File Data File (Sequential file)
98
An index is a data structure stored on diskIt is stored in a file consisting of pages & recordsBut records are (search key value, record ID)
Pointer to a record on disk
![Page 99: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/99.jpg)
DATA516/CSED516 - Fall 2020
Clustered vs.Unclustered Index
Data entries(Index File)(Data file)
Data Records
Data entries
Data Records
CLUSTERED UNCLUSTEREDClustered = records close in index are close in data
99
![Page 100: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/100.jpg)
Buffer Manager
DATA516/CSED516 - Fall 2020 100
Disk
Mainmemory
Page requests from higher-level code
Buffer pool
Disk page
Free frame
1 page correspondsto 1 disk block
Disk is a collectionof blocks
Buffer pool managerAccess methods
![Page 101: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/101.jpg)
Discussion
• Disks are necessary both for persistent storage, and in order to process data larger than main memory
• They are slow! Buffer pool mitigates this• ”Cold v.s. Hot execution”: first time you
execute the query it is slow; if you repeat it, then it is faster (WHY?)
DATA516/CSED516 - Fall 2020 101
![Page 102: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/102.jpg)
Discussion
• Main idea for Distributed data processing: if we distribute the data to many servers, then the data will fit in main memory
• For that reason, they often do not implement “out of core” algorithm
• Our focus in this course is on distributed data processing, not on out-of-core.
DATA516/CSED516 - Fall 2020 102
![Page 103: DATA516/CSED516 Scalable Data Systems and Algorithms](https://reader031.vdocuments.us/reader031/viewer/2022012012/6157d446ce5a9d02d46f8d12/html5/thumbnails/103.jpg)
Summary
• RDMBS are complex systems• Need to know some of their basics inner
workings in order to understand query performance
Next week: we start review query processing, optimization, and start discussion distributed query processing
103