the tiledb array data storage manager - harvard...

36
The TileDB Array Data Storage Manager Shiyu Huang

Upload: others

Post on 25-Feb-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

The TileDB Array Data Storage Manager

Shiyu Huang

Page 2: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

What is the problem?

Page 3: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Storage manager for multi-dimensional arrays

Page 4: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Why is it important?

Page 5: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Large scientific data naturally represented as multi-dimensional arrays

Page 6: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Why is it hard?

Page 7: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Read & Write on large array

u Array representation

u Compression

u Parallel access

u performance

Page 8: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Why existing solutions do not work?

Page 9: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

HDF5Array-

oriented DBRelational databases

u Hard to identify and manage dense array

u In-place write

u Regular dimensional chunk as atomic unit

u Encoding the element indices as extra table columns

Page 10: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Core intuition for the solution?

Page 11: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Data model

u Dimensions

u Coordinates

u Cells

u Attributes

Page 12: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Global cell order

Co-located data according to the characteristics of the data

Page 13: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Data tiles

Dense arraysu Space tile

u equi-sized hyper-rectangles

Sparse arraysu Capacity

u Minimum bounding rectangle

u Each non-empty cell in one data tile

Page 14: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Fragmentsu Timestamped snapshot of a batch of updates

Page 15: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Physical Organization

Array -> Directory

Fragment -> Sub-directory Array schema

Values of Variable sized

attribute 1

Fixed-sized attribute 1

Starting offsets of variable –sized

attribute 1

Page 16: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Bookkeeping metadata:Minimum-bound rectangleBounding coordinates

Page 17: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Read operation

u Dense array: visit each space tile in global order

u Sparse array: visit each range that start before the minimum end bounding coordinate

Page 18: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Multiple fragments?

Algorithm!

Page 19: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Sort disjoint ranges on global cell order

u <start coordinate, end coordinate, fragment id>

u Each range appears contiguously on the disk

e.g. querying <2,5>:Get range <2,3>, <4,5>

Page 20: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Priority queue

<1,4,5>

<4,8,2>

<1,4,3> <8, 12, 3>

<8,12,4>

Compare on SC

When tied, give higher fid priority

Page 21: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Operations in the priority queue

Page 22: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

<8,12,3>

<8,12,4>

<8,12,4>

<8,12,3>

<4,8,2>

<4,8,2>

<8,12,4>

<8, 12, 3>

<1,4,5>

<1,4,3>

<4,8,2>

<8,12,4>

<8, 12, 3>

<1,4,5>

Priority queue

<1,4,5>

<4,8,2>

<1,4,3> <8, 12, 3>

<8,12,4>

Compare on SC

When tied, give higher fid priority

Page 23: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Write – dense fragment

u The user:

u populates one buffer per attribute, storing the cell values respecting the global cell order

u Write function:

u Append the values from the the buffers into the corresponding attribute file

Page 24: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Sparse fragment

u Mode:

u User provides sorted buffer

u User provides unsorted buffer, TileDB sorts it internally

u Buffer:

u Only non-empty cells

u Extra buffer with coordinates

u Extra write state information

u Deletion:

u Insertions of empty cells

Page 25: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Consolidation

Read operation

Write the retrieved cells to a new fragment Delete the old fragments

Page 26: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Parallel Programming

u Concurrent write: Each process/thread creates a separate fragment, no locking necessary

u Concurrent read:

u Multiple process: separate bookkeeping data and state, no locking

u Multiple thread: One bookkeeping data, only lock on it

u Mixed read and write: Special file indicates if the fragment is visible

u Background consolidation:

u Get the lock when delete the old fragment, release the lock after new fragment is visible

Page 27: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Does the paper prove its claims?

Page 28: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Does the paper prove its claims?

u Clear description of the physical layout and functions

u Logical justification of the design decision

u Comprehensive evaluations

Page 29: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Analysis/experiments?

Page 30: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Dense Array

LOAD

UPDATE

One Core Parallel

Page 31: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Read subarray

Page 32: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Fragments and consolidation

After consolidation

Page 33: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Sparse array

u Load

One core Parallel

Page 34: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Subarray read

One core

Parallel

Page 35: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

Gaps in the logic/proof?Possible next step

Page 36: The TileDB Array Data Storage Manager - Harvard Universitydaslab.seas.harvard.edu/classes/cs265/files/... · 2019. 5. 13. · Storage Manager ShiyuHuang. What is the problem? Storage

u Distributed version of database?

u Versioning?