how to run sql queries on tbs of data … using...
TRANSCRIPT
![Page 1: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/1.jpg)
1@sqreamtech SIL7138
How to run SQL queries on TBs of data using GPUs
Jake WheatLead Architect, SQream Technologies
![Page 2: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/2.jpg)
2@sqreamtech SIL7138
How to run SQL queries on TBs of data using GPUs1. A toy SQL query engine
2. Support wide range of SQL queries
3. Support larger data
4. Optimise
![Page 3: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/3.jpg)
3@sqreamtech SIL7138
Hypothetical set of physical operator primitives
Example SQL Physical Operator Implementationselect a+b, c * 5 from t select
(a.k.a project/extend/rename)thrust::transform
select a, count(*), sum(b), avg(b) from t group by a
stream aggregate thrust::reduce_by_key
select a, b from t where a > 0.5
filter thrust::remove_if
select distinct a from t stream distinct thrust::unique
select a, b, c, d from t order by a,b
sort thrust::sort
select * from t union allselect * from u
union all -
select * from t inner join u using (a)
sort merge join (smj) simple implementation:thrust::upper_bounds, lower_bounds, unnest, gather
![Page 4: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/4.jpg)
4@sqreamtech SIL7138
A more complete engine• non indexed nested loop join (ninlj)
• not matching
• outer join
• stream union distinct
• distinct aggregates
• grouping sets
• window functions
![Page 5: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/5.jpg)
5@sqreamtech SIL7138
Running on bigger data
![Page 6: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/6.jpg)
6@sqreamtech SIL7138
4 MB
at a time
GPU
4 MBResult
1 TB
Didn’t fit
1 TB
Result
Queries on data which doesn't fit in the GPU memoryChunk the data
![Page 7: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/7.jpg)
7@sqreamtech SIL7138
Queries on data which doesn't fit in the GPU memoryUse an external sorting algorithm
Extend the idea to other operators too, for instance:
• NINLJ
• SMJ
• window functions
• distinct aggregates (select a, count(distinct b), count(b) from t group by a)
• outer joins
![Page 8: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/8.jpg)
8@sqreamtech SIL7138
GPU
1 TB
GPU / CPU spool + external sort
Sort
4 MB at a time
Queries on data which doesn't fit in the GPU memoryUse external sorting algorithms and a variety of spools
![Page 9: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/9.jpg)
9@sqreamtech SIL7138
Optimising: some non-GPU specific ideas
• Use good benchmarking and profiling tools
• Take advantage of columnar - don't read columns which aren't needed
• Compressed data – helps with both disk I/O and PCI bus I/O
• Make sure we use good I/O patterns: big reads and writes, work with the Linux FS cache effectively
• Use large host memory buffers
• Use AST optimisations (e.g. push predicates closer to the tablescans)
• Use a cost based optimizer
![Page 10: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/10.jpg)
10@sqreamtech SIL7138
Some GPU-specific ideas
• Use CUDA specific profiling tools in addition to your regular C++ ones(or at least have a way to enable cudaDeviceSynchronize calls all over your code)
• GPU task/queue
• Combine tasks on the GPU
• Use large chunks
• Rechunking
• Reduce or avoid host ↔ device transfers
• Optimise GPU code
![Page 11: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/11.jpg)
11@sqreamtech SIL7138
GPU task/queue1. Allocate all the GPU memory we will need
2. Upload the data to the GPU
3. Run some kernels on this data
4. Download the results to host
5. Release the allocated GPU memory
![Page 12: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/12.jpg)
12@sqreamtech SIL7138
GPU task/queueBenefits
• Can help support concurrency in a single query (to hide PCI transfer latency)
• Can help support concurrent queries on a single GPU with some basic fairness property (long running query won't starve out a short running query)
• GPU memory usage is very predictable
• No inter-task co-ordination needed
• No chance of deadlock on GPU resources
• Can be simply extended to distribute tasks over multiple GPUs
![Page 13: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/13.jpg)
13@sqreamtech SIL7138
Host Worker
Host Worker
Host Worker
GPU task queue
GPU 1
GPU workerGPU worker….
GPU 2
GPU workerGPU worker….
Host Worker
Host Worker
Short GPU task concept, concurrency
![Page 14: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/14.jpg)
14@sqreamtech SIL7138
Combine tasks on the GPU where possibleselect a, b+c as d from t where b > 5 order by a
Logical
Direct
Combined
TableScana,b,c
Transformd:=b+c
Remove Ifb>5
Sort bya Sort Merge
TableScana,b,c To device Transform
d:=b+c To host To device Remove Ifb>5 To host To device Sort by
a To host Sort Merge
TableScana,b,c To device Transform
d:=b+cRemove If
b>5Sort by
a To host Sort Merge
![Page 15: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/15.jpg)
15@sqreamtech SIL7138
Use larger chunks of data1. transfer 10,000 rows to device2. transform d := b + c3. remove_if b > 54. sort by a5. transfer results to host
1. transfer 10,000,000 rows to device2. transform d := b + c3. remove_if b > 54. sort by a5. transfer results to host
![Page 16: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/16.jpg)
16@sqreamtech SIL7138
RechunkingTable scansOptimised disk IO sizes often much bigger than the GPU can handle
Small
Big
Big
IO reads GPU process
slow good
fast too big to fit
fast good
split
Chunk size
![Page 17: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/17.jpg)
17@sqreamtech SIL7138
RechunkingOutput of remove_if, join, reduce/reduce_by_key, etc.If the output of these is very small amounts of rows, collecting these rows back into big chunks can improve the performance of later operations (e.g. sort).
![Page 18: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/18.jpg)
18@sqreamtech SIL7138
High selectivity
Low selectivity
input remove if sort
fast
input remove if sort
slow
input remove if rechunk sort
fast
Rechunking
![Page 19: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/19.jpg)
19@sqreamtech SIL7138
RechunkingReducing PCI transfer amounts in NINLJfor each chunk in table A:for each chunk in table B:load A,B to the GPU, join them
Table A size: 10,000,000,000 rowsTable B size: 10,000,000 rows
At a chunk size of 1 million rows you will upload 110 billion rows to the device. If you increase it to 20 million rows, you will upload only 15 billion rows to the device
110B / 15B = Potentially more than 7x faster
![Page 20: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/20.jpg)
20@sqreamtech SIL7138
9 chunks Loop and load9 times
3 chunks Loop and load3 times
RechunkingReducing PCI transfer amounts in NINLJ
![Page 21: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/21.jpg)
21@sqreamtech SIL7138
Reduce / avoid HOST ↔ DEVICE transfers• NINLJ: keep the small table on the GPU for the whole join
• Chunk skipping -data is inserted to the database ordered by a record timestamp. The storage layer preserves this order (or inserts the data in sorted order)
Typical use-case is 24 months of data in the table, but queries only want to summarize a particular day, week or month.
![Page 22: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/22.jpg)
22@sqreamtech SIL7138
Optimising KernelsThree example bottlenecks from SQream:
• reduce_by_key
• multikey sort
• internals of the join
![Page 23: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/23.jpg)
23@sqreamtech SIL7138
Strings are very difficult to deal with• compression
• can be large size per record
non big data specific GPU issues:
• variable length data
• collations
• unicode, ICU doesn't run on GPUs
Good solutions: we're still looking
![Page 24: How to run SQL queries on TBs of data … using GPUson-demand.gputechconf.com/gtc-il/...how-to-run-sql-queries-on-tbs-of-data---using-gpus.pdfOutput of remove_if, join, reduce/reduce_by_key,](https://reader035.vdocuments.us/reader035/viewer/2022062918/5ede578dad6a402d6669aa0c/html5/thumbnails/24.jpg)
24@sqreamtech SIL7138
Find out more about SQream’s high performance GPU-driven database software
sqream.com