CS345: Advanced Databases. Chris Ré.
What this course is
Database fundamentals:
– Theory
– Old, Crusty, Good SQL stuff
– No/New/Not-Yet SQL
New stuff: Knowledge bases & Inference
Databases is a strange and beautiful area: Theory, Algorithms, Systems, &
Applications
It’s a bit scattered, and I love it.
Three Turing Award Winners
Charles Bachman
Edgar Codd
Jim Gray
Seminal contributions made in Industry
The Birth of the Relational Model (1971)
Database: a handful of relations (tables) with fixed schema.
WorksIn(Employee,Dept)
Query with a small # of operations: Selection (filter), Projection, Join, Union.
Basically, an operational finite model theory.
Data and Query Model
Data:
R(A,B) = { (a1,b1), …, (an,bn) }
S(B,C,D) = { (b’1,c1,d1), …, (b’m,cm,dm) }
Projection: π_A(R) = { a : exists b. (a,b) in R }
Selection: σ_F(R) = { t : t in R and F(t) }, where F : D(R) -> {True, False}
Join: Join(R,S) = { (a,b,c,d) : (a,b) in R & (b,c,d) in S }
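A minimal sketch of the operators defined above, assuming relations are represented as Python sets of tuples. The function names and the sample rows (employees, departments, buildings, head counts) are hypothetical and chosen only for illustration; they are not from the slides.

```python
# Relations are modelled as sets of tuples (an assumption for illustration).

def project_first(r):
    """Projection onto A: keep the first attribute of each tuple of R(A, B)."""
    return {a for (a, b) in r}

def select(r, f):
    """Selection: keep the tuples t of R for which the predicate f(t) is True."""
    return {t for t in r if f(t)}

def join(r, s):
    """Join R(A, B) with S(B, C, D) on the shared attribute B."""
    return {(a, b, c, d) for (a, b) in r for (b2, c, d) in s if b == b2}

def union(r1, r2):
    """Union of two relations with the same schema."""
    return r1 | r2

# Hypothetical sample data: WorksIn(Employee, Dept) and a made-up
# DeptInfo(Dept, Building, HeadCount) relation.
works_in = {("alice", "sales"), ("bob", "eng"), ("carol", "eng")}
dept_info = {("eng", "bldg1", 40), ("sales", "bldg2", 12)}

employees = project_first(works_in)                    # all employee names
engineers = select(works_in, lambda t: t[1] == "eng")  # rows with Dept = eng
located = join(works_in, dept_info)                    # employees with dept details
```

Each operator takes relations and returns a relation, which is why a small set of operations composes into larger queries.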
System R
In 1974, System R shows that it is possible to get good performance.
First implementation of SQL.
IBM didn’t push it, worried about IMS cannibalization, but…
Pat Selinger
Others Come on to the Scene…
Larry Ellison hears about IBM’s research prototype and founds a company…
Takeaways about Database Research
Started with mathematical elegance and with close ties to
industry.
Improve runtime performance as a proxy to increase programmer
productivity.
Independence
Declarative languages can improve productivity
– Different team members work independently
  • Backend, Storage, UI, BI, etc.
– Transactional model.
– Challenge: Support efficient concurrent access?
Performance
Parallel programming is hard; SQL is the most popular parallel programming language.
– How do you deal with the asymmetry of the memory hierarchy (Disk/MM/Cache)?
– How do you structure parallel optimization?
– Concurrency?
Manageability
Systems live over time, and the system should automate many routine tasks.
– Maintain derived data products (views)
– Self-monitoring systems (autonomic)
Topic 1: QP Fundamentals
Query Processing Fundamentals
1. Empirical join evaluation from the 70s!
2. System R: The Archetype (Cardinality)
3. Formal Query Languages
4. Acyclic Query Evaluation (Structure)
5. Worst-case Optimal Join Algorithms (S + C)
This will be the most formal part of the course.
Topic 2: OLAP-Style Analytics
Building new and old data systems:
1. Theory of Materialized Views
2. Gamma (Parallel DBs)
3. MapReduce & the Rise of NoSQL (2000s)
4. NewSQL & Optimizing Joins on MR (theory)
5. Fagin’s Algorithm (theory)
6. Statistical Analytic Systems
Topic 3: Next-Generation Systems
1. Information Extraction
2. Probabilistic Query Evaluation (Theory)
3. Scalable Inference
4. Knowledge Bases
Topic 4: OLTP Style
Transactional Systems
1. The rise of Key-Value Stores
2. The case for determinism
3. CALM & CAPs
4. The Return of Main Memory DBs
5. Spanner, F1, and Data Centers
Grading
• Course Project (more next)
– Do something interesting with data.
– Teams OK
– Form teams soon and email me by Jan 12.
• Midterm Exam
Projects in each topic
1. Knowledgebase Construction
– Pick a domain and build a KBC system for it with DeepDive
2. Join Algorithms
– Certificate versions (see me)
– MapReduce? GraphLab? Spark?
3. Analytics Systems
4. Transactional Systems
You are free to choose other projects.
Datasets
• Snapshot of the web marked up with NLP tools and structured data (KBP and KBA challenges)
• 500k+ docs used by paleobiologists, plus structured data.
• We can mark up even more stuff.
• Benchmark ML, graphs if you want to work on analytics or join evaluation.