sugi 28, paper 4 sugi 25 paper 129: paul m. dorfman private detectives in a data warehouse:...

Post on 16-Dec-2015

227 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

SUGI 28, PAPER 4

SUGI 25

Paper 129: Paul M. DorfmanPrivate Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing

SUGI 26Paper 8: Paul M. DorfmanTable Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing

Paper 128: Paul M. DorfmanQuick Disk Table Look-up via Hybrid Indexing into a Directly Addressed SAS Data Set

SUGI 27Paper 12: Paul M. Dorfman and Gregg P. SnellHashing Rehashed

GENERATION I

Hand-Coded Direct-Addressing Routines Presented At:

Associative Array

SUGI 27Presentation: Jason Secosky

The DATA step in Version 9: What’s New?

SAS® Version 9 DATA Step Object: Associative Array

Hash

GENERATION II

==

1

2

Main Entry: pro·pae·deu·tic Pronunciation: "prO-pi-'dü-tik, -'dyü-Function: nounEtymology: Greek propaideuein to teach beforehand, from pro- before + paideuein to teach, from paid-, pais childDate: 1798

1: preparatory study or instruction

Merriam-Webster online http://www.m-w.com

PROPAEDEUTICS

1

3

Main Entry: di·rect ad·dress·ing Pronunciation: "d&-'rekt, &-'dres-ingFunction: speedy table lookup

1: accessing key values “directly” by their location (address, node) in a table, as opposed to searching for them by comparing the search key to all or some table values

2: key-indexed search

Hashing Rehashed. SUGI 27, Orlando, FL, 2002.

DIRECT ADDRESSING

21

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

DEFINE

KEY-INDEXING

43

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

LOAD

DEFINE

KEY-INDEXING

2

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

01

DEFINE

LOAD

KEY-INDEXING

2

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

02

01

DEFINE

LOAD

KEY-INDEXING

2

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

02

01

DEFINE

LOAD

KEY-INDEXING

2

WORK.SMALL

KEY S_SAT

185 00

971 01

400 02

260 03

922 04

970 05

543 06

532 07

050 08

067 09

data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;

ARRAY HKEY(0:999)

HKEY(000)=.

HKEY(185)=.

HKEY(400)=.

HKEY(971)=.

HKEY(999)=.

00

01

02

SEARCH

DEFINE

LOAD

KEY-INDEXING

1

How can one possibly use key-indexing when SSN (or any other large or non-integer value) is the key?

LIMITATIONS

Key Must Be Integer

Key Range Limited By Memory

(9-Digit SSN Would Require 8 GB)

So you may be asking yourself...

KEY-INDEXING

321

Main Entry: 1hash Pronunciation: 'hashFunction: transitive verbEtymology: French hacher, from Old French hachier, from hache battle-ax, of Germanic origin; akin to Old High German hAppa sickle; akin to Greek koptein to cutDate: 1590

1 a : to chop (as meat and potatoes) into small pieces

Merriam-Webster online http://www.m-w.com

HASHING

21

Main Entry: 1hash·ing Pronunciation: 'hash-ingFunction: make BIG keys smaller

1: converting a long-range key (numeric or character) to a smaller-range integer number with a mathematical algorithm or function that must be:

• Rapidly computable

• Distribute the resulting keys uniformly

• Produce an integer in [0:HSIZE-1] range

2: H = MOD(KEY,HSIZE)

Private Detectives in a Data Warehouse. SUGI 25, Indianapolis, IN, 2000.

HASHING

21

GENERATIONS

Which generation of hashing is the most efficient for subsetting LARGE based on the values of KEY in SMALL to produce a file MATCH?

data small ; input key s_sat ; cards ; 185 00 971 11 400 22 260 33 922 44 970 55 543 66 532 77 050 88 067 99 ; run ;

Our Mission:Match unsorted SMALL

(satellite variable S_SAT)To unsorted LARGE

LARGE cannot be sorted

Assume:

Memory will hold SMALL

(integer variable KEY)

7654321

h = mod(key,&hsize);

%let load = 0.625;

data _null_;

do p=ceil(p/&load) by 1 until(j=up+1);

up = ceil(sqrt(p));

do j=2 to up until(not mod(p,j)); end;

end;

call symput('hsize',left(put(p,best.)));

stop;

set small nobs=p;

run;

%put hsize=&hsize;

hash_size=17

GENERATION I

Choose a proper function

Determine Load Factor

Calculate Optimal Array Size

KEY HASH_ADDR

185

971

400

260

922

970

543

532

050

067

Hash15

Hash

Hash

Hash

Hash

Hash

HashHash

Hash

Hash

02

09

05

04

01

16

05

16

16

Collision Resolution(Linear Probing)

54321

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT

185 00

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=. (02)=.

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

DEFINELOADSEARCH

DEFINE• Load Factor• Array Size• Define

13121110

KEY S_SAT

185 00

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

KEY S_SAT H(KEY)

185 00

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=. (02)=.

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

15

185 00

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=. (02)=.

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=185 (15)=00

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

02

11971

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22

260 33

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=. (09)=.

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

09

22400

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=. (04)=.

(05)=. (05)=.

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

05

33260

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=. (04)=.

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

04 44922

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=. (01)=.

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

01

55970

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=. (16)=.

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

16

66543

9

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

77532

05

98

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77 05

050 88

067 99

DEFINE

HKEY HSAT

(00)=. (00)=.

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

16

88050

765

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77 05

050 88 16

067 99

DEFINE

HKEY HSAT

(00)=050 (00)=88

(01)=970 (01)=55

(02)=971 (02)=11

(03)=. (03)=.

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

16

99067

4321

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...

GENERATION I

KEY S_SAT H(KEY)

185 00 15

971 11 02

400 22 09

260 33 05

922 44 04

970 55 01

543 66 16

532 77 05

050 88 16

067 99 16

DEFINE

HKEY HSAT

(00)=050 (00)=88

(01)=970 (01)=55

(02)=971 (02)=11

(03)=067 (03)=99

(04)=922 (04)=44

(05)=260 (05)=33

(06)=. (06)=.

(07)=. (07)=.

(08)=. (08)=.

(09)=400 (09)=22

(10)=. (10)=.

(11)=. (11)=.

(12)=. (12)=.

(13)=. (13)=.

(14)=. (14)=.

(15)=. (15)=.

(16)=543 (16)=66

(17)=. (17)=.

*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...

LOAD

185 00

SEARCH

...

do until ( eof2 ) ; set large end = eof2 ; do h=mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

h = mod(key,&hsize);

%let load = 0.625;

data _null_;

do p=ceil(p/&load) by 1 until(j=up+1);

up = ceil(sqrt(p));

do j=2 to up until(not mod(p,j)); end;

end;

call symput('hsize',left(put(p,best.)));

stop;

set small nobs=p;

run;

%put hsize=&hsize;

hash_size=17

Choose a proper function

Determine Load Factor

Calculate Optimal Array Size

KEY HASH_ADDR

185

971

400

260

922

970

543

532

050

067

Hash15

Hash

Hash

Hash

Hash

Hash

HashHash

Hash

Hash

02

09

05

04

01

16

05

16

16

Collision Resolution(Linear Probing)

GENERATION II

54321

GENERATION II

DEFINE

LOAD

SEARCH

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

LOAD

SEARCH

DEFINE

321

return code = object.method;

GENERATION II

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

dcl hash hh ();

Parameter Type MatchingVerbose Syntax

DEFINE

7654321

set small(keep=key s_sat) point = _n_ ;

GENERATION II

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

LOAD

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

DEFINE set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

1

GENERATION II

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;

LOAD

SEARCH

data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

DEFINE set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

DEFINE

GENERATIONS

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

32

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

LOAD

GENERATIONS

1

data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;

set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');

%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes

then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;

SEARCH

GENERATIONS

Modify your hash code

• %let nodupes = 1 ; • if nodupes then leave ;

Three options !

• hh.add (); Keeps 1st occurrence

• hh.replace (); Keeps last

• create additional keys to discriminate further

DUPLICATE KEYS

CHALLENGES

4321

Change your hash function

• rescale the key h = mod (key*1000, &hsize)

• use a numeric informat h=mod (input(key,16.), &hsize)

Change your array declaration

• array hkey(0:&hsize) $9 _temporary_

Already taken care of !

• set small(keep=key s_sat) point = _n_ ;

• hh.DefineKey ( 'key' );

CHALLENGES

21

NON-INTEGER KEYS• Fractional signed SAS numbers• Digit strings (char var of digits)• Arbitrary characters

DYNAMIC TABLE PROCESSING

Simple coding

• DO LOOP through the array

do h=0 to &hsize – 1; if missing(hkey(h)) then continue; key = hkey(h); s_sat = hsat(h); < … further processing … >end;

GENERATION I GENERATION II

Use the HASH ITERATOR !

Dcl hash hh (dataset: ‘’sample’, ordered: 1);dcl hiter hi (‘hh’);< … hh.define stuff … >do rc=hi.first() by 0 while (rc=0); (key is automatically populated) rc=hi.next();end;

CHALLENGES

654321

COMPOSITE KEYS

Change your hash function

• Possible but VERY difficult

• Complexity grows as the number of components to the key grows

GENERATION I GENERATION II

Just add the key(s) !

• set small(keep=k1 k2 k3 s_sat) point = _n_ ;

• hh.DefineKey ( 'k1', 'k2', 'k3' );

CHALLENGES

321

PEEK UNDER THE HOOD

STARFLEET COMMAND ENGINEERINGACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICS

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACEHASH ITERATORADELSON-VELSKII & LANDIS TREES

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

declare AssociativeArray hh

• instantiates or creates the object

( dcl hash myhash )

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.DefineKey

• define a set of hash keys

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.DefineData

• define a set of hash table satellites

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.DefineDone

• tell SAS the definitions are done

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Add

• insert the key and satellites

(if the key is not yet in the table)

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Replace

• insert the key and satellites

(overwrites any existing data)

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Find

• search for the key

• if found, extract the satellite

and update the host DATA STEP variables

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Check

• search for the key

• if found, just return rc=0

PEEK UNDER THE HOOD

DATA STEP COMPONENT INTERFACE

hh.Delete

• delete the hash table from memory

PEEK UNDER THE HOOD

HASH ITERATOR

declare hiter hi

dcl hiter myhashiterator

• instansiates or creates the object

(don’t forget to add “ordered: 1”

to the hash declaration)

PEEK UNDER THE HOOD

HASH ITERATOR

hi.First

• fetch the smallest key into the host variable

PEEK UNDER THE HOOD

HASH ITERATOR

hi.Next

• fetch the next key in ascending order

PEEK UNDER THE HOOD

HASH ITERATOR

hi.Last

• fetch the largest key into the host variable

PEEK UNDER THE HOOD

HASH ITERATOR

hi.Prev

• fetch the previous key in descending order

PEEK UNDER THE HOOD

ADELSON-VELSKII & LANDIS TREES

http://www.seanet.com/users/arsen/avltree.html

Binary Trees populated such that they average O(log(N)) search behavior regardless of the distribution.

For example:

• Insert the value: 05• Insert the value: 02

AVL maintains balance by rotating the values and preserving the search structure

HASHEXP

• controls the number of trees to create

• 2**

• EXP=16 HSIZE=65,536

HASHEXP

54321

CONCLUSION

Simple numeric key falling in a limited range

• SAS date and time values are good examples

• This is the area where Generation I key-indexed search complete dominates the competition both in computer and programming efficiency

A short comparison of I vs. II

4321

CONCLUSION

Simple numeric key with the range up to 9 digits; no satellites needed

• Bitmapping is king

1

CONCLUSION

Simple numeric key or short (up to 10 bytes) character key

• Both generations do well

• If ultimate speed is the issue, Generation I (barely)

• Generation II has the advantage of coding simplicity

321

CONCLUSION

Composite keys

• Generation I is better if the keys can be rapidly combined in a short integer

• Otherwise, Generation II dominates

21

CONCLUSION

Retrieving data by key from a hash table in order

• Generation I can provide such functionality only through array sorting

• Generation II hash iterator object is designed for this purpose, works very fast

21

CONCLUSION

Storing and handeling duplicate key entries in a hash table

• Generation I is more flexible

• Generation II only lets you control which duplicate takes over

2121

CONCLUSION

Dynamic Data Step Processing

• Generation II is ideal

• Table grows at run-time as new entries are added• No need to allocate giant memories beforehand

321

CONCLUSION

Dynamic DATA Step Structure

is the first ever…

Finally…

Although still experimental in V9…

The Generation II Hash Object

(AssociativeArray)

4321

HASHING:

GENERATIONS

SUGI 28

GENERATIONS

top related