sugi 28, paper 4 sugi 25 paper 129: paul m. dorfman private detectives in a data warehouse:...

66
SUGI 28, PAPER 4

Upload: destiny-pipp

Post on 16-Dec-2015

227 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

SUGI 28, PAPER 4

Page 2: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

SUGI 25

Paper 129: Paul M. DorfmanPrivate Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing

SUGI 26Paper 8: Paul M. DorfmanTable Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing

Paper 128: Paul M. DorfmanQuick Disk Table Look-up via Hybrid Indexing into a Directly Addressed SAS Data Set

SUGI 27Paper 12: Paul M. Dorfman and Gregg P. SnellHashing Rehashed

GENERATION I

Hand-Coded Direct-Addressing Routines Presented At:

Page 3: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

Associative Array

SUGI 27Presentation: Jason Secosky

The DATA step in Version 9: What’s New?

SAS® Version 9 DATA Step Object: Associative Array

Hash

GENERATION II

==

1

Page 4: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

2

Main Entry: pro·pae·deu·tic Pronunciation: "prO-pi-'dü-tik, -'dyü-Function: nounEtymology: Greek propaideuein to teach beforehand, from pro- before + paideuein to teach, from paid-, pais childDate: 1798

1: preparatory study or instruction

Merriam-Webster online http://www.m-w.com

PROPAEDEUTICS

1

Page 5: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

3

Main Entry: di·rect ad·dress·ing Pronunciation: "d&-'rekt, &-'dres-ingFunction: speedy table lookup

1: accessing key values “directly” by their location (address, node) in a table, as opposed to searching for them by comparing the search key to all or some table values

2: key-indexed search

Hashing Rehashed. SUGI 27, Orlando, FL, 2002.

DIRECT ADDRESSING

21

Page 6: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

DEFINE

KEY-INDEXING

43

Page 7: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

LOAD

DEFINE

KEY-INDEXING

2

Page 8: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

01

DEFINE

LOAD

KEY-INDEXING

2

Page 9: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

02

01

DEFINE

LOAD

KEY-INDEXING

2

Page 10: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

02

01

DEFINE

LOAD

KEY-INDEXING

2

Page 11: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

01

02

SEARCH

DEFINE

LOAD

KEY-INDEXING

1

Page 12: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

How can one possibly use key-indexing when SSN (or any other large or non-integer value) is the key?

LIMITATIONS

Key Must Be Integer

Key Range Limited By Memory

(9-Digit SSN Would Require 8 GB)

So you may be asking yourself...

KEY-INDEXING

321

Page 13: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

Main Entry: 1hash Pronunciation: 'hashFunction: transitive verbEtymology: French hacher, from Old French hachier, from hache battle-ax, of Germanic origin; akin to Old High German hAppa sickle; akin to Greek koptein to cutDate: 1590

1 a : to chop (as meat and potatoes) into small pieces

Merriam-Webster online http://www.m-w.com

HASHING

21

Page 14: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

Main Entry: 1hash·ing Pronunciation: 'hash-ingFunction: make BIG keys smaller

1: converting a long-range key (numeric or character) to a smaller-range integer number with a mathematical algorithm or function that must be:

• Rapidly computable

• Distribute the resulting keys uniformly

• Produce an integer in [0:HSIZE-1] range

2: H = MOD(KEY,HSIZE)

Private Detectives in a Data Warehouse. SUGI 25, Indianapolis, IN, 2000.

HASHING

21

Page 15: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

GENERATIONS

Which generation of hashing is the most efficient for subsetting LARGE based on the values of KEY in SMALL to produce a file MATCH?

data small ; input key s_sat ; cards ; 185 00 971 11 400 22 260 33 922 44 970 55 543 66 532 77 050 88 067 99 ; run ;

Our Mission:Match unsorted SMALL

(satellite variable S_SAT)To unsorted LARGE

LARGE cannot be sorted

Assume:

Memory will hold SMALL

(integer variable KEY)

7654321

Page 16: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

h = mod(key,&hsize);

%let load = 0.625;

data _null_;

do p=ceil(p/&load) by 1 until(j=up+1);

up = ceil(sqrt(p));

do j=2 to up until(not mod(p,j)); end;

end;

call symput('hsize',left(put(p,best.)));

stop;

set small nobs=p;

run;

%put hsize=&hsize;

hash_size=17

GENERATION I

Choose a proper function

Determine Load Factor

Calculate Optimal Array Size

KEY HASH_ADDR

185

971

400

260

922

970

543

532

050

067

Hash15

Hash

Hash

Hash

Hash

Hash

HashHash

Hash

Hash

02

09

05

04

01

16

05

16

16

Collision Resolution(Linear Probing)

54321

Page 17: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT

185 00

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=. (02)=.

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

DEFINELOADSEARCH

DEFINE• Load Factor• Array Size• Define

13121110

Page 18: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

KEY S_SAT

185 00

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

KEY S_SAT H(KEY)

185 00

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=. (02)=.

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

15

185 00

9

Page 19: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=. (02)=.

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=185 (15)=00

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

02

11971

9

Page 20: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

09

22400

9

Page 21: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

05

33260

9

Page 22: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=. (04)=.

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

04 44922

9

Page 23: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

01

55970

9

Page 24: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

16

66543

9

Page 25: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

77532

05

98

Page 26: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77 05

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

16

88050

765

Page 27: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77 05

050 88 16

067 99

DEFINE

HKEY HSAT

(00)=050 (00)=88

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

16

99067

4321

Page 28: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77 05

050 88 16

067 99 16

DEFINE

HKEY HSAT

(00)=050 (00)=88

(01)=970 (01)=55

(02)=971 (02)=11

(03)=067 (03)=99

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

SEARCH

...

do until ( eof2 ) ; set large end = eof2 ; do h=mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

Page 29: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

h = mod(key,&hsize);

%let load = 0.625;

data _null_;

do p=ceil(p/&load) by 1 until(j=up+1);

up = ceil(sqrt(p));

do j=2 to up until(not mod(p,j)); end;

end;

call symput('hsize',left(put(p,best.)));

stop;

set small nobs=p;

run;

%put hsize=&hsize;

hash_size=17

Choose a proper function

Determine Load Factor

Calculate Optimal Array Size

KEY HASH_ADDR

185

971

400

260

922

970

543

532

050

067

Hash15

Hash

Hash

Hash

Hash

Hash

HashHash

Hash

Hash

02

09

05

04

01

16

05

16

16

Collision Resolution(Linear Probing)

GENERATION II

54321

Page 30: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

GENERATION II

DEFINE

LOAD

SEARCH

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

LOAD

SEARCH

DEFINE

321

Page 31: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

return code = object.method;

GENERATION II

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

dcl hash hh ();

Parameter Type MatchingVerbose Syntax

DEFINE

7654321

set small(keep=key s_sat) point = _n_ ;

Page 32: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

GENERATION II

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

LOAD

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

DEFINE set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

1

Page 33: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

GENERATION II

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

LOAD

SEARCH

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

DEFINE set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

Page 34: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

DEFINE

GENERATIONS

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

32

Page 35: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

LOAD

GENERATIONS

1

Page 36: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

SEARCH

GENERATIONS

Page 37: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

Modify your hash code

• %let nodupes = 1 ; • if nodupes then leave ;

Three options !

• hh.add (); Keeps 1st occurrence

• hh.replace (); Keeps last

• create additional keys to discriminate further

DUPLICATE KEYS

CHALLENGES

4321

Page 38: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

Change your hash function

• rescale the key h = mod (key*1000, &hsize)

• use a numeric informat h=mod (input(key,16.), &hsize)

Change your array declaration

• array hkey(0:&hsize) $9 _temporary_

Already taken care of !

• set small(keep=key s_sat) point = _n_ ;

• hh.DefineKey ( 'key' );

CHALLENGES

21

NON-INTEGER KEYS• Fractional signed SAS numbers• Digit strings (char var of digits)• Arbitrary characters

Page 39: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

DYNAMIC TABLE PROCESSING

Simple coding

• DO LOOP through the array

do h=0 to &hsize – 1; if missing(hkey(h)) then continue; key = hkey(h); s_sat = hsat(h); < … further processing … >end;

GENERATION I GENERATION II

Use the HASH ITERATOR !

Dcl hash hh (dataset: ‘’sample’, ordered: 1);dcl hiter hi (‘hh’);< … hh.define stuff … >do rc=hi.first() by 0 while (rc=0); (key is automatically populated) rc=hi.next();end;

CHALLENGES

654321

Page 40: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

COMPOSITE KEYS

Change your hash function

• Possible but VERY difficult

• Complexity grows as the number of components to the key grows

GENERATION I GENERATION II

Just add the key(s) !

• set small(keep=k1 k2 k3 s_sat) point = _n_ ;

• hh.DefineKey ( 'k1', 'k2', 'k3' );

CHALLENGES

321

Page 41: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

STARFLEET COMMAND ENGINEERINGACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICS

Page 42: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACEHASH ITERATORADELSON-VELSKII & LANDIS TREES

Page 43: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

declare AssociativeArray hh

• instantiates or creates the object

( dcl hash myhash )

Page 44: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.DefineKey

• define a set of hash keys

Page 45: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.DefineData

• define a set of hash table satellites

Page 46: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.DefineDone

• tell SAS the definitions are done

Page 47: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Add

• insert the key and satellites

(if the key is not yet in the table)

Page 48: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Replace

• insert the key and satellites

(overwrites any existing data)

Page 49: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Find

• search for the key

• if found, extract the satellite

and update the host DATA STEP variables

Page 50: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Check

• search for the key

• if found, just return rc=0

Page 51: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Delete

• delete the hash table from memory

Page 52: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

HASH ITERATOR

declare hiter hi

dcl hiter myhashiterator

• instansiates or creates the object

(don’t forget to add “ordered: 1”

to the hash declaration)

Page 53: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

HASH ITERATOR

hi.First

• fetch the smallest key into the host variable

Page 54: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

HASH ITERATOR

hi.Next

• fetch the next key in ascending order

Page 55: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

HASH ITERATOR

hi.Last

• fetch the largest key into the host variable

Page 56: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

HASH ITERATOR

hi.Prev

• fetch the previous key in descending order

Page 57: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

PEEK UNDER THE HOOD

ADELSON-VELSKII & LANDIS TREES

http://www.seanet.com/users/arsen/avltree.html

Binary Trees populated such that they average O(log(N)) search behavior regardless of the distribution.

For example:

• Insert the value: 05• Insert the value: 02

AVL maintains balance by rotating the values and preserving the search structure

HASHEXP

• controls the number of trees to create

• 2**

• EXP=16 HSIZE=65,536

HASHEXP

54321

Page 58: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Simple numeric key falling in a limited range

• SAS date and time values are good examples

• This is the area where Generation I key-indexed search complete dominates the competition both in computer and programming efficiency

A short comparison of I vs. II

4321

Page 59: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Simple numeric key with the range up to 9 digits; no satellites needed

• Bitmapping is king

1

Page 60: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Simple numeric key or short (up to 10 bytes) character key

• Both generations do well

• If ultimate speed is the issue, Generation I (barely)

• Generation II has the advantage of coding simplicity

321

Page 61: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Composite keys

• Generation I is better if the keys can be rapidly combined in a short integer

• Otherwise, Generation II dominates

21

Page 62: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Retrieving data by key from a hash table in order

• Generation I can provide such functionality only through array sorting

• Generation II hash iterator object is designed for this purpose, works very fast

21

Page 63: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Storing and handeling duplicate key entries in a hash table

• Generation I is more flexible

• Generation II only lets you control which duplicate takes over

2121

Page 64: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Dynamic Data Step Processing

• Generation II is ideal

• Table grows at run-time as new entries are added• No need to allocate giant memories beforehand

321

Page 65: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

CONCLUSION

Dynamic DATA Step Structure

is the first ever…

Finally…

Although still experimental in V9…

The Generation II Hash Object

(AssociativeArray)

4321

Page 66: SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul

HASHING:

GENERATIONS

SUGI 28

GENERATIONS