class 8 indexing & sorting - harvard...

57
indexing & sorting prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ class 8

Upload: others

Post on 29-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • indexing & sortingprof. Stratos Idreos

    HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/

    class 8

    http://daslab.seas.harvard.edu/classes/cs165/

  • CS165, Fall 2016 Stratos Idreos /362

    first part done: basic concepts in modern systems

    coming up: indexing and fast scans

  • CS165, Fall 2016 Stratos Idreos /363

    column-storage row-storage

    A B C D A B C D

    it all starts with how we layout the data (bits)

  • CS165, Fall 2016 Stratos Idreos /364

    essential column-stores featuresvirtual ids late tuple reconstruction (if ever) vectorized execution compression fixed-width columns

    0"

    5"

    10"

    15"

    20"

    25"

    30"

    35"

    40"

    45"

    Column"Store" Row"Store"

    Run$

    me'(sec)'

    Performance'of'Column3Oriented'Op$miza$ons'

    –Late"Materializa:on"

    –Compression"

    –Join"Op:miza:on"

    –Tuple@at@a@:me"

    Baseline"

    Column-stores vs. row-stores: how different are they really? D. Abadi, S. Madden, and N. Hachem

    ACM SIGMOD Conference on Management of Data, 2008

  • CS165, Fall 2016 Stratos Idreos /365

    registers

    on chip cache2x

    on board cache10x

    memory100x

    disk100Kx

    Jim Gray, IBM, Tandem, DEC, Microsoft ACM Turing award ACM SIGMOD Edgar F. Codd Innovations award

    Pluto2 years

    New York1.5 hours

    this building10 min

    this room1 min

    my head~0

  • CS165, Fall 2016 Stratos Idreos /366

    random access & page-based access

    need to only read x… but have to read all of page 1

    page1 page2 page3

    data value x

    registers

    on chip cache

    on board cache

    memory

    disk

    CPU

    data

    mov

    e

  • CS165, Fall 2016 Stratos Idreos /367

    and then from ideas to practice: implementation, tuning, analysis

  • CS165, Fall 2016 Stratos Idreos /368

    as you are starting your projects, remembercome to office hours - read/post in piazza

    distributed API, DSL, code is supposed to help you start fast diverging is perfectly ok when there is a good reason to do so

    functionality goal: select max(R.a), min(S.a) from R, S where R.j=S.j and R.b10 and S.d

  • CS165, Fall 2016 Stratos Idreos /369

    vectorwised processing: how to

    select max(A) from R where B

  • CS165, Fall 2016 Stratos Idreos /3610

    midtermshow to prepare

    open book, notes, no laptop/discussion

    material from lectures only

    check all quizzes and questions

    quiz-like questions - no exact answer

    expectations: describe the design space - chose what you think is the best approach (>1 if we ask for it) and then analyze in detail all requests - if you made the wrong choice in the begging it is ok - but say so if you find out in the end and explain as much as possible

    explain all steps and tradeoffs

    Sunday before midterm: Stratos office hours 2pm+

    10/12

  • CS165, Fall 2016 Stratos Idreos /3611

    today+3data access made better

  • CS165, Fall 2016 Stratos Idreos /3612

    select

    data data data

    join

    aggr

    selectselect

    join

    it all starts with the select operator

    it touches all the data

  • CS165, Fall 2016 Stratos Idreos /3613

    filtering data: point/range queriesindex

    data

    index knows structure of the data

    an alternative data representation (data structure) of all or part of the data

  • CS165, Fall 2016 Stratos Idreos /3614

    why bother with creating/maintaining another data structure?

    but wait, why not just sort the data (array) +

    binary search?

  • CS165, Fall 2016 Stratos Idreos /3615

    ok let’s go with sorting for a while

    A B C

    initial state columns in

    insertion order

    sorted A B C

    select B+C from R where A

  • CS165, Fall 2016 Stratos Idreos /3616

    a1 a2 a3 a4 a5

    b1 b2 b3 b4 b5

    c1 c2 c3 c4 c5

    A B Ca5 a3 a2 a1 a4

    Ab1 b2 b3 b4 b5

    c1 c2 c3 c4 c5

    B Cvalues are out of order

  • CS165, Fall 2016 Stratos Idreos /3616

    a1 a2 a3 a4 a5

    b1 b2 b3 b4 b5

    c1 c2 c3 c4 c5

    A B Ca5 a3 a2 a1 a4

    Ab1 b2 b3 b4 b5

    c1 c2 c3 c4 c5

    B Cvalues are out of order

    5 3 2 1 4

  • CS165, Fall 2016 Stratos Idreos /3616

    a1 a2 a3 a4 a5

    b1 b2 b3 b4 b5

    c1 c2 c3 c4 c5

    A B Ca5 a3 a2 a1 a4

    Ab1 b2 b3 b4 b5

    c1 c2 c3 c4 c5

    B Cvalues are out of order

    5 3 2 1 4

    a5 a3 a2 a1 a4

    A5 3 2 1 4

    select2 1 4

    b1 b2 b3 b4 b5

    B

    intermediate out of order

  • CS165, Fall 2016 Stratos Idreos /36

    sort or cluster

    17

  • CS165, Fall 2016 Stratos Idreos /3618

    database kernel

    data data data

    algo

    rithm

    s/op

    erat

    ors

    applications

    sql

    disk

    memory

    cpu

  • CS165, Fall 2016 Stratos Idreos /3619

    A B C

    initial state columns in

    insertion order

    sorted A B C

    sorted A B C

    propagate order of A

  • CS165, Fall 2016 Stratos Idreos /3620

    pos1 pos2

    1 0 1 0

    A

    sort

    edselect max(D),min(E) from R where (A>10 and A20 and B20 and B

  • CS165, Fall 2016 Stratos Idreos /3621

    A

    sort

    edselect max(D),min(E) from R where (A>10 and A20 and B20 and B

  • CS165, Fall 2016 Stratos Idreos /3622

    A B C

    base data

    A B C

    sort

    ed

    B A C

    sort

    ed

    queries that filter on A benefit

    …C-Store: A Column-oriented DBMSMichael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, Stanley B. Zdonik In Proc. of the Very Large Databases Conference (VLDB), 2005

    queries that filter on B benefit

  • CS165, Fall 2016 Stratos Idreos /36

    initial state columns in

    insertion order

    23

    A B C

    base data

    A B C

    sort

    ed

    B A C

    sort

    ed

    space overhead - update overhead - which ones to build?

  • CS165, Fall 2016 Stratos Idreos /3624

    declarative interface ask what you want

    db system

    DBAindexes/views/tuning knobs

  • CS165, Fall 2016 Stratos Idreos /36

    online

    25

    initial state columns in

    insertion order

    A B C

    base datastorage budget

  • CS165, Fall 2016 Stratos Idreos /3626

    level 1

    level 2

    CPU

    (assume simplified memory hierarchy)

    cost to sort array Cs?cost to find a value once sorted Ca?optimized algorithm to minimize Cs & Ca

    data does not fit in level 1 memory CPU can read/write directly from/to level 1 only

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 unordered pages

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    quicksort in place

    initial state: 8 unordered pages

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 unordered pages

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    quicksort in place

    initial state: 8 unordered pages

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 unordered pages

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    quicksort in place

    initial state: 8 unordered pages

  • CS165, Fall 2016 Stratos Idreos /3627

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 unordered pages

    each page is now sorted we read and wrote every page once

    data movement cost is 2N pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    merge to new page

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    merge to new page

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    merge to new page

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    merge to new page

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    merge to new page

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 sorted pages

  • CS165, Fall 2016 Stratos Idreos /3628

    memory level L

    memory level L-1

    (size=3 pages)

    initial state: 8 sorted pages

    each pair of pages is now sorted we read and wrote every page once

    data movement cost is 2N pages (total 2N+2N)

  • CS165, Fall 2016 Stratos Idreos /3629

    1 pass to sort each page (2N pages)

    1 pass to merge into 2 sorted pages (2N pages)

    1 pass to merge into 4 sorted pages (2N pages)

    1 pass to merge into 8 sorted pages (2N pages)

  • CS165, Fall 2016 Stratos Idreos /3629

    1 pass to sort each page (2N pages)

    1 pass to merge into 2 sorted pages (2N pages)

    1 pass to merge into 4 sorted pages (2N pages)

    1 pass to merge into 8 sorted pages (2N pages)

  • CS165, Fall 2016 Stratos Idreos /3629

    1 pass to sort each page (2N pages)

    1 pass to merge into 2 sorted pages (2N pages)

    1 pass to merge into 4 sorted pages (2N pages)

    1 pass to merge into 8 sorted pages (2N pages)

    log2(N)

  • CS165, Fall 2016 Stratos Idreos /3629

    1 pass to sort each page (2N pages)

    1 pass to merge into 2 sorted pages (2N pages)

    1 pass to merge into 4 sorted pages (2N pages)

    1 pass to merge into 8 sorted pages (2N pages)

    log2(N)+1

  • CS165, Fall 2016 Stratos Idreos /3629

    1 pass to sort each page (2N pages)

    1 pass to merge into 2 sorted pages (2N pages)

    1 pass to merge into 4 sorted pages (2N pages)

    1 pass to merge into 8 sorted pages (2N pages)

    2N(log2(N)+1)

  • CS165, Fall 2016 Stratos Idreos /3629

    1 pass to sort each page (2N pages)

    1 pass to merge into 2 sorted pages (2N pages)

    1 pass to merge into 4 sorted pages (2N pages)

    1 pass to merge into 8 sorted pages (2N pages)

    2N(log2(N)+1) x bytesPerPage

  • CS165, Fall 2016 Stratos Idreos /3630

    in general, we have M pages in memory not just 3 so

    in our first pass we can immediately sort groups of M pages

    &

  • CS165, Fall 2016 Stratos Idreos /3630

    in general, we have M pages in memory not just 3 so

    2N(log2(N)+1) -> 2N(logM-1(N)+1)

    in our first pass we can immediately sort groups of M pages

    2N(logM-1(N)+1) -> 2N(logM-1(N/M)+1)

    &

  • CS165, Fall 2016 Stratos Idreos /3631

    data size: N pages memory size: M pages

    how much memory M do we need to sort N data in p passes only?

    or

    how much data can we sort in p passes if we have M memory?

    logM-1(N/M)+1

  • CS165, Fall 2016 Stratos Idreos /36

    previous discussion holds for all levels of memory hierarchy

    32

  • CS165, Fall 2016 Stratos Idreos /3633

    other usage of sorting, e.g.,:order by group by sort merge join remove duplicates sort positions when unordered to avoid random access

  • CS165, Fall 2016 Stratos Idreos /3634

    indexing helps navigate data faster than scan

    indexing is (some times) just another way to organize data

    we need to consider all levels of memory hierarchy

    when we design our algorithms

    and to optimally use all available bytes

    Notes to remember

  • CS165, Fall 2016 Stratos Idreos /3635

    textbook: Chapter 13

    Self-organizing tuple reconstruction in column-storesStratos Idreos, Martin Kersten, Stefan Manegold In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009

  • DATA SYSTEMSprof. Stratos Idreos

    class 8

    indexing & sorting