Chapter 5
External Sorting

TABLE OF CONTENTS
● Introduction
● External Sort/Merge Algorithms
● 2 Phase Multiway Merge Sort
● Optimization Strategy

1. Introduction
● Sorting in file processing
✔ Order-by, group-by, and join operations
✔ Efficient sequential update
– new master = old master + transaction
● Limits of internal sorting algorithms
✔ The data to be sorted must fit in main memory
✔ Files can be too large for that
– An array cannot hold the whole file → how can it be sorted?
– Solution: External Sorting

2. External Sort/Merge Algorithm
● Basic Idea
✔ Split the file to be sorted into pieces (runs)
– Run: a sorted segment of the file
✔ Sort each run with an internal sorting algorithm
✔ Write the sorted runs out to files
✔ Merge the runs until a single run remains
● Common sort/merge algorithms
✔ Binary Sort/Merge
✔ Balanced Binary Sort/Merge
✔ Balanced K-way Sort/Merge
✔ Polyphase Sort/Merge

Basic Idea of External Sorting
● Initial runs: run 1 to run 6, 750 records each
● After merge 1: run 1 to run 3, 1500 records each
● After merge 2: run 1, 3000 records
● After merge 3: run 1, 4500 records

Binary Sort/Merge
● Sorting Phase
✔ Sort each run, then distribute the runs across 2 files
● Merging Phase
✔ Merge one run from each of the two files at a time into a longer run on a third file
✔ Redistribute the merged runs across the 2 files and repeat
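
The core step of the merging phase is combining two sorted runs into one longer sorted run. The sketch below is a minimal illustration that assumes the runs are in-memory arrays standing in for files; the name `merge_two_runs` is hypothetical and not taken from the slides.

```c
#include <stddef.h>

/* Merge two sorted runs a[0..na) and b[0..nb) into out[0..na+nb).
 * Returns the length of the merged run. */
size_t merge_two_runs(const int *a, size_t na,
                      const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    /* Repeatedly copy the smaller of the two run heads. */
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];

    /* One run is exhausted: copy the rest of the other. */
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

Merging the runs (50 95 110) and (10 36 100) this way gives (10 36 50 95 100 110). A real external sort would read and write blocks instead of indexing arrays, but the comparison logic stays the same.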

Binary Sort/Merge: Example
● Input File (run size = 3)
✔ 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
● After the sorting phase
✔ File 1: (50 95 110) (40 120 153) (22 80 140)
✔ File 2: (10 36 100) (60 70 130)
● After the first merge
✔ File 3: (10 36 50 95 100 110) (40 60 70 120 130 153) (22 80 140)
● After redistribution
✔ File 1: (10 36 50 95 100 110) (22 80 140)
✔ File 2: (40 60 70 120 130 153)
✔ File 3: empty

Binary Sort/Merge: Example (continued)
● After the second merge
✔ File 3: (10 36 40 50 60 70 95 100 110 120 130 153) (22 80 140)
● After redistribution
✔ File 1: (10 36 40 50 60 70 95 100 110 120 130 153)
✔ File 2: (22 80 140)
✔ File 3: empty
● After the third merge
✔ File 3: (10 22 36 40 50 60 70 80 95 100 110 120 130 140 153)

Other Sort/Merge Algorithms
● Balanced Binary Sort/Merge
✔ Number of input files = number of output files = 2
✔ No redistribution step: the input and output files swap roles after each merge
● Balanced k-way Sort/Merge
✔ Balanced version of k-way Sort/Merge
✔ Merges one run from each of the k input files at a time
✔ How large should k be?
● Polyphase Sort/Merge
✔ Number of input files ≠ number of output files

Ex: Balanced Binary Sort/Merge
● Input File: 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
● After the sorting phase
✔ File 1: (50 95 110) (40 120 153) (22 80 140)
✔ File 2: (10 36 100) (60 70 130)
● After the first merge
✔ File 3: (10 36 50 95 100 110) (22 80 140)
✔ File 4: (40 60 70 120 130 153)
● After the second merge
✔ File 1: (10 36 40 50 60 70 95 100 110 120 130 153)
✔ File 2: (22 80 140)
● After the third merge
✔ File 3: (10 22 36 40 50 60 70 80 95 100 110 120 130 140 153)

Ex: Balanced k-way Sort/Merge
● Input File: 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
● After the sorting phase
✔ File 1: (50 95 110) (60 70 130)
✔ File 2: (10 36 100) (22 80 140)
✔ File 3: (40 120 153)
● After the first merge
✔ File 4: (10 36 40 50 95 100 110 120 153)
✔ File 5: (22 60 70 80 130 140)
✔ File 6: empty
● After the second merge
✔ File 1: (10 22 36 40 50 60 70 80 95 100 110 120 130 140 153)
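
A k-way merge step can be sketched as follows. This is an illustrative version that assumes the runs sit in memory and that the smallest head is found by a linear search over the k runs; the `struct run` type and the `kway_merge` name are hypothetical.

```c
#include <stddef.h>

/* One sorted run held in memory; pos is the index of the next unread record. */
struct run {
    const int *keys;
    size_t len;
    size_t pos;
};

/* Merge k sorted runs into out[] by linear search over the k run heads.
 * Returns the number of records written. */
size_t kway_merge(struct run *runs, size_t k, int *out)
{
    size_t n = 0;

    for (;;) {
        size_t best = k;    /* k means "no run has records left" */

        for (size_t r = 0; r < k; r++) {
            if (runs[r].pos == runs[r].len)
                continue;   /* this run is exhausted */
            if (best == k ||
                runs[r].keys[runs[r].pos] < runs[best].keys[runs[best].pos])
                best = r;
        }
        if (best == k)
            return n;       /* every run is exhausted */
        out[n++] = runs[best].keys[runs[best].pos++];
    }
}
```

With k = 3 and the first runs of the three files, (50 95 110), (10 36 100), and (40 120 153), the output is (10 36 40 50 95 100 110 120 153). Replacing the inner loop with a selection tree lowers the work per record from k – 1 comparisons to about log_2 k.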

Ex: Polyphase Sort/Merge
● After the sorting phase
✔ File 1: (50 95 110) (40 120 153) (22 80 140)
✔ File 2: (10 36 100) (60 70 130)
● After the first merge
✔ File 1: (22 80 140)
✔ File 2: empty
✔ File 3: (10 36 50 95 100 110) (40 60 70 120 130 153)
● After the second merge
✔ File 1: empty
✔ File 2: (10 22 36 50 80 95 100 110 140)
✔ File 3: (40 60 70 120 130 153)
● After the third merge
✔ File 1: (10 22 36 40 50 60 70 80 95 100 110 120 130 140 153)
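
The way runs move between files in a polyphase merge can be followed by tracking run counts only. The sketch below assumes three files and assumes that each phase merges runs pairwise from the two non-empty files onto the empty one until one input file runs out; the function name `polyphase_phases` is hypothetical.

```c
/* Simulate polyphase merging on three files, tracking only run counts.
 * runs[i] is the number of runs on file i.
 * Returns the number of merge phases until a single run remains,
 * or -1 if the distribution cannot make progress. */
int polyphase_phases(int runs[3])
{
    int phases = 0;

    while (runs[0] + runs[1] + runs[2] > 1) {
        int out = 0;                        /* the empty file receives output */
        while (out < 3 && runs[out] != 0)
            out++;
        if (out == 3)
            return -1;                      /* no empty file to write to */

        int a = (out + 1) % 3, b = (out + 2) % 3;
        int m = runs[a] < runs[b] ? runs[a] : runs[b];
        if (m == 0)
            return -1;                      /* only one file has runs: stuck */

        runs[a] -= m;                       /* m runs consumed from each input */
        runs[b] -= m;
        runs[out] += m;                     /* m merged runs produced */
        phases++;
    }
    return phases;
}
```

Starting from run counts (3, 2, 0), the counts become (1, 0, 2), then (0, 1, 1), then (1, 0, 0), so three phases are needed and the result ends up on File 1.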

Analysis of the Algorithms
● Binary Sort/Merge with R initial runs
✔ Number of levels = ⌈log_2 R⌉ + 1
✔ Number of merge passes = ⌈log_2 R⌉
● k-way Sort/Merge with R initial runs
✔ Number of merge passes = ⌈log_k R⌉
✔ Each output record requires finding the smallest key among the k runs (n = number of records)
– Linear search: number of comparisons = n * (k – 1) * ⌈log_k R⌉
– Selection tree: number of comparisons = n * log_2 k * ⌈log_k R⌉ ≈ n * ⌈log_2 R⌉
– Selection Tree? ⇒ See Section 5.8 of HSF
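
The pass and comparison counts above can be checked with a few lines of integer arithmetic. This is a small sketch, assuming R ≥ 1 and k ≥ 2; the function names are hypothetical.

```c
/* Number of merge passes for R initial runs merged k at a time:
 * the smallest p with k^p >= R, which equals ceil(log_k R).
 * Assumes R >= 1 and k >= 2. */
int merge_passes(long R, long k)
{
    int passes = 0;
    long reach = 1;     /* number of runs that 'passes' passes can combine */

    while (reach < R) {
        reach *= k;
        passes++;
    }
    return passes;
}

/* Comparisons with a linear search over k run heads:
 * n * (k - 1) comparisons in each merge pass. */
long linear_search_comparisons(long n, long R, long k)
{
    return n * (k - 1) * merge_passes(R, k);
}
```

For the 15-record example with R = 5 initial runs, `merge_passes(5, 2)` is 3 and `merge_passes(5, 3)` is 2, which matches the number of merges in the binary and 3-way examples.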

3. 2 Phase Multiway Merge/Sort
● Overview of 2PMM
✔ 2 Phase
– Sorting Phase + one Merging Phase
– Each phase reads and writes every record once
✔ Multiway
– Merges all the sorted runs at the same time
✔ The amount of data 2PMM can sort is limited by the memory size M

2PMM Algorithm
● Phase 1
✔ Fill main memory with records.
✔ Sort using favorite internal sort (e.g. Quick Sort).
✔ Write sorted sub-list to a specific file.
✔ Repeat until all records are put into one of the sorted lists.
● Phase 2
✔ Allocate one input buffer for each sorted-list, plus one output buffer.
✔ Repeatedly move the record with the smallest key among the input buffers to the output buffer.
✔ When the output buffer is full, write it to disk.
✔ When an input buffer is exhausted, read the next block of the same sorted-list into it.
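
Phase 1 can be sketched in a few lines if the input file is modeled as an array and "main memory" as a window of m records. Both of these are simplifying assumptions, and `make_sorted_sublists` is a hypothetical name; `qsort` is the standard C library routine.

```c
#include <stdlib.h>

/* Comparison callback for qsort: ascending int keys. */
static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Phase 1 sketch: treat recs[0..n) as the input file and sort it in
 * memory-sized pieces of at most m records (m must be > 0).
 * Each sorted piece is one sorted sub-list.
 * Returns the number of sub-lists produced. */
size_t make_sorted_sublists(int *recs, size_t n, size_t m)
{
    size_t lists = 0;

    for (size_t start = 0; start < n; start += m) {
        size_t len = (n - start < m) ? n - start : m;
        qsort(recs + start, len, sizeof recs[0], cmp_int);
        lists++;
    }
    return lists;
}
```

With the 15-key example input and m = 3, this yields five sub-lists: (50 95 110) (10 36 100) (40 120 153) (60 70 130) (22 80 140). In a real implementation each sub-list would be written to disk before the next memory load is read.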

Discussion of 2PMM
● Analysis of Naive Implementation
✔ Assume blocks are stored at random, so average access time is about 15 ms.
✔ The file is stored on 250,000 blocks, each read and written once in each phase.
✔ 250,000 blocks * 2 I/Os * 2 phases = 1,000,000 disk I/Os; 1,000,000 * 15 ms = 15,000 sec = 4+ hours.
● How many records can you sort with 2PMM?
✔ (M / R)((M / B) - 1), where M is the memory size, R the record size, and B the block size (R here is not the number of runs)
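
Both figures on this slide are plain arithmetic, which the following sketch spells out. It assumes all sizes are in bytes and divide evenly; the function names are hypothetical.

```c
/* Records sortable by 2PMM: (M / R) * ((M / B) - 1), with
 * M = memory size, R = record size, B = block size, all in bytes.
 * M / R records fit in one sorted sub-list, and M / B - 1 input
 * buffers are available in Phase 2 (one buffer is kept for output). */
long long twopmm_capacity(long long M, long long R, long long B)
{
    return (M / R) * ((M / B) - 1);
}

/* Total time in seconds when every block is read and written once in
 * each of the two phases, at ms_per_io milliseconds per block access. */
long long twopmm_io_seconds(long long blocks, long long ms_per_io)
{
    long long ios = blocks * 2 /* read + write */ * 2 /* phases */;
    return ios * ms_per_io / 1000;
}
```

`twopmm_io_seconds(250000, 15)` reproduces the 15,000 seconds quoted above. As a toy case with made-up sizes, M = 1000, R = 10, and B = 100 give 100 records per sub-list and 9 sub-lists, so 900 records.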

4. Optimization
● Parallel I/O in k-way Sort/Merge
✔ Idea
– Number of buffers: 2k + 2 (double buffering for I/O)
– Input, output, and merging proceed in parallel
✔ Fixed Buffer Allocation
– Assign 2 buffers to each run
✔ Dynamic Buffer Allocation
– Assign buffers to runs as they are needed (using a selection tree)
– Algorithm: see HSF Section 7.11.3
✔ Parallel I/O in 2PMM?

Optimization (continued)
● Run Generation
✔ Idea
– Generate runs longer than the memory size
– Reduces the number of merge passes
✔ Algorithm
– Double buffering: 2 buffers each for input and output
– The rest of memory holds a selection tree
– Selection tree: number of records = (M – 4) * rec_per_page
– When the tree is full, output the record with the smallest key
– If a newly read record is smaller than the record just output, increment its run number
– See HSF Section 7.11.4
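
The run-numbering rule can be illustrated with a small sketch of replacement selection. To keep it short, it swaps the selection tree for a linear scan over the records in memory and fixes the memory size at a hypothetical `MEM_SLOTS` records; it is an illustration of the idea, not the HSF algorithm itself.

```c
#include <stddef.h>

#define MEM_SLOTS 3   /* hypothetical memory capacity, in records */

/* Replacement selection over in[0..n). Writes keys in output order to
 * out[] and the run number (starting at 1) of each key to run_of[].
 * Returns the number of runs generated. */
int replacement_selection(const int *in, size_t n, int *out, int *run_of)
{
    int key[MEM_SLOTS], run[MEM_SLOTS];
    size_t used = 0, next = 0, written = 0;
    int last_run = 0;

    while (used < MEM_SLOTS && next < n) {      /* fill memory */
        key[used] = in[next++];
        run[used++] = 1;
    }
    while (used > 0) {
        size_t best = 0;                        /* smallest (run, key) pair */
        for (size_t s = 1; s < used; s++)
            if (run[s] < run[best] ||
                (run[s] == run[best] && key[s] < key[best]))
                best = s;

        out[written] = key[best];
        run_of[written++] = last_run = run[best];

        if (next < n) {                         /* refill the freed slot */
            int k = in[next++];
            /* A key smaller than the one just written must wait for the next run. */
            if (k < key[best])
                run[best]++;
            key[best] = k;
        } else {                                /* input exhausted: shrink */
            key[best] = key[used - 1];
            run[best] = run[used - 1];
            used--;
        }
    }
    return last_run;
}
```

On the 15-key example input this produces three runs, (50 95 100 110 153) (10 36 40 60 70 120 130 140) (22 80), compared with five runs of three records when each memory load is sorted separately.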

Optimization (continued)
● Optimal Merging of Runs
✔ Run generation produces runs of different lengths
✔ The order in which runs are merged determines the total merge cost
✔ Minimize the external path length of the merge tree
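
Optimal Merge Tree: Huffman Tree
#include <stdlib.h>   /* malloc */

/* tree_node has fields left_child, right_child, and weight; initialize(),
   least(), and insert() are the min heap routines. All are assumed to be
   defined elsewhere. */
void huffman(tree_pointer heap[], int n)
{
    /* heap is a list of n single node binary trees */
    int i;
    tree_pointer tree;

    initialize(heap, n);    /* initialize min heap */
    for (i = 1; i < n; i++) {
        /* combine the two trees of least weight into a new tree */
        tree = (tree_pointer) malloc(sizeof(tree_node));
        tree->left_child = least(heap, n-i+1);
        tree->right_child = least(heap, n-i);
        tree->weight = tree->left_child->weight + tree->right_child->weight;
        insert(heap, n-i-1, tree);
    }
}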

Construction of a Huffman Tree
● Runs: 2, 3, 5, 7, 9, 13
[Figure: step-by-step construction of the Huffman tree for these runs]
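
The cost of merging runs in Huffman order can be computed without building the tree. The sketch below assumes 2-way merges and counts one unit of cost per record moved; it uses a linear scan in place of the min heap for brevity, and `optimal_merge_cost` is a hypothetical name.

```c
#include <stddef.h>

/* Total 2-way merge cost (records moved) when the two shortest runs are
 * always merged first, which is the order a Huffman tree prescribes.
 * len[] holds the run lengths and is overwritten. */
long optimal_merge_cost(long *len, size_t n)
{
    long cost = 0;

    while (n > 1) {
        /* Move the two smallest lengths to positions n-1 and n-2. */
        for (int pass = 0; pass < 2; pass++) {
            size_t last = n - 1 - pass, min = 0;
            for (size_t i = 1; i <= last; i++)
                if (len[i] < len[min])
                    min = i;
            long t = len[min];
            len[min] = len[last];
            len[last] = t;
        }
        len[n - 2] += len[n - 1];   /* the merged run replaces the pair */
        cost += len[n - 2];
        n--;
    }
    return cost;
}
```

For runs of length 2, 3, 5, 7, 9, 13 the merges produce runs of length 5, 10, 16, 23, and 39, for a total cost of 93, which equals the weighted external path length of the resulting tree.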