CHAPTER 7
APPLICATION OF PCFA
7.1 INTRODUCTION
In this chapter, we discuss the application of our algorithm to clustering
large data sets. Experiments are conducted with two data sets to determine
how well the PCFA algorithm discovers clusters in large data sets. A data set
is considered large if the number of attributes is more than 35 or the number
of data points (records) is more than 50000. PCFA is able to discover clusters
in data sets of both categories.
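The size criterion above can be written as a small predicate. This is only a sketch of the definition given in the text; the function and constant names are illustrative and are not part of PCFA.

```python
# Thresholds taken from the definition of a large data set in the text.
MAX_SMALL_ATTRIBUTES = 35
MAX_SMALL_RECORDS = 50000


def is_large_data_set(n_attributes: int, n_records: int) -> bool:
    """Return True if the data set is large under the chapter's definition."""
    return n_attributes > MAX_SMALL_ATTRIBUTES or n_records > MAX_SMALL_RECORDS


# Data set 1: 38 attributes and 527 records (large by attribute count only).
print(is_large_data_set(38, 527))
# Data set 2: 61 attributes and 256932 records (large on both counts).
print(is_large_data_set(61, 256932))
```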
7.2 APPLICATION
7.2.1 Application of PCFA in Data Set 1
The first data set, the Water Treatment Plant data set, is taken from the UCI
Machine Learning Repository. It consists of 38 attributes and 527
instances (records).
Table 7.1 Water Treatment Plant Data Set
Data Set Characteristics:   Multivariate     Number of Instances:   527
Attribute Characteristics:  Integer, Real    Number of Attributes:  38
Associated Tasks:           Clustering       Missing Values?        N/A
Data Set Information:
This dataset comes from the daily measures of sensors in an urban
waste water treatment plant. The objective is to classify the operational state
of the plant in order to predict faults through the state variables of the plant at
each of the stages of the treatment process. This domain has been stated as an
ill-structured domain.
Attribute Information:
All attributes are numeric and continuous. Some of the attributes
contain missing values, which are replaced with 0.
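The replacement of missing values with 0 can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the experiments: it assumes a comma-separated input in which a missing entry appears as "?" or as an empty field, and the sample rows are made up.

```python
import csv
import io


def load_numeric_rows(text, missing_token="?"):
    """Parse comma-separated rows, replacing missing entries with 0.0."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        rows.append(
            [0.0 if f.strip() in ("", missing_token) else float(f) for f in fields]
        )
    return rows


# Made-up sample rows; "?" marks a missing value (an assumed convention).
sample = "44101,1.50,7.8,?\n39024,?,7.7,443\n"
rows = load_numeric_rows(sample)
print(rows)
```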
Numeric Attributes
1 Q-E (input flow to plant)
2 ZN-E (input Zinc to plant)
3 PH-E (input pH to plant)
4 DBO-E (input Biological demand of oxygen to plant)
5 DQO-E (input chemical demand of oxygen to plant)
6 SS-E (input suspended solids to plant)
7 SSV-E (input volatile suspended solids to plant)
8 SED-E (input sediments to plant)
9 COND-E (input conductivity to plant)
10 PH-P (input pH to primary settler)
11 DBO-P (input Biological demand of oxygen to primary
settler)
12 SS-P (input suspended solids to primary settler)
13 SSV-P (input volatile suspended solids to primary settler)
14 SED-P (input sediments to primary settler)
15 COND-P (input conductivity to primary settler)
16 PH-D (input pH to secondary settler)
17 DBO-D (input Biological demand of oxygen to secondary
settler)
18 DQO-D (input chemical demand of oxygen to secondary
settler)
19 SS-D (input suspended solids to secondary settler)
20 SSV-D (input volatile suspended solids to secondary settler)
21 SED-D (input sediments to secondary settler)
22 COND-D (input conductivity to secondary settler)
23 PH-S (output pH)
24 DBO-S (output Biological demand of oxygen)
25 DQO-S (output chemical demand of oxygen)
26 SS-S (output suspended solids)
27 SSV-S (output volatile suspended solids)
28 SED-S (output sediments)
29 COND-S (output conductivity)
30 RD-DBO-P (performance input Biological demand of oxygen
in primary settler)
31 RD-SS-P (performance input suspended solids to primary
settler)
32 RD-SED-P (performance input sediments to primary settler)
33 RD-DBO-S (performance input Biological demand of oxygen
to secondary settler)
34 RD-DQO-S (performance input chemical demand of oxygen
to secondary settler)
35 RD-DBO-G (global performance input Biological demand of
oxygen)
36 RD-DQO-G (global performance input chemical demand of
oxygen)
37 RD-SS-G (global performance input suspended solids)
38 RD-SED-G (global performance input sediments)
The data set documentation itself gives the minimum value, maximum value,
mean and standard deviation of each attribute. Those values are as
follows:
N. Attrib. min max mean st-dev
1 Q-E 10000 60081 37226.56 6571.46
2 ZN-E 0.1 33.5 2.36 2.74
3 PH-E 6.9 8.7 7.81 0.24
4 DBO-E 31 438 188.71 60.69
5 DQO-E 81 941 406.89 119.67
6 SS-E 98 2008 227.44 135.81
7 SSV-E 13.2 85.0 61.39 12.28
8 SED-E 0.4 36 4.59 2.67
9 COND-E 651 3230 1478.62 394.89
10 PH-P 7.3 8.5 7.83 0.22
11 DBO-P 32 517 206.20 71.92
12 SS-P 104 1692 253.95 147.45
13 SSV-P 7.1 93.5 60.37 12.26
14 SED-P 1.0 46.0 5.03 3.27
15 COND-P 646 3170 1496.03 402.58
16 PH-D 7.1 8.4 7.81 0.19
17 DBO-D 26 285 122.34 36.02
18 DQO-D 80 511 274.04 73.48
19 SS-D 49 244 94.22 23.94
20 SSV-D 20.2 100 72.96 10.34
21 SED-D 0.0 3.5 0.41 0.37
22 COND-D 85 3690 1490.56 399.99
23 PH-S 7.0 9.7 7.70 0.18
24 DBO-S 3 320 19.98 17.20
25 DQO-S 9 350 87.29 38.35
26 SS-S 6 238 22.23 16.25
27 SSV-S 29.2 100 80.15 9.00
28 SED-S 0.0 3.5 0.03 0.19
29 COND-S 683 3950 1494.81 387.53
30 RD-DBO-P 0.6 79.1 39.08 13.89
31 RD-SS-P 5.3 96.1 58.51 12.75
32 RD-SED-P 7.7 100 90.55 8.71
33 RD-DBO-S 8.2 94.7 83.44 8.4
34 RD-DQO-S 1.4 96.8 67.67 11.61
35 RD-DBO-G 9.6 97 89.01 6.78
36 RD-DQO-G 19.2 98.1 77.85 8.67
37 RD-SS-G 10.3 99.4 88.96 8.15
38 RD-SED-G 36.4 100 99.08 4.32
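Summary values of the kind listed above (min, max, mean, st-dev) can be computed per attribute as in the sketch below. It assumes the sample standard deviation (n - 1 in the denominator); the source does not say which form was used for the published figures, and the readings in the example are made up.

```python
import statistics


def summarize(values):
    """Return (min, max, mean, st-dev) for one attribute's values."""
    return (
        min(values),
        max(values),
        statistics.mean(values),
        statistics.stdev(values),  # sample standard deviation; an assumption
    )


# Made-up pH-like readings, used only to exercise the function.
ph_readings = [7.6, 7.8, 7.9, 8.1]
low, high, mean, st_dev = summarize(ph_readings)
print(f"min={low} max={high} mean={mean:.2f} st-dev={st_dev:.2f}")
```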
The input file may be provided to the system as a text file or as a CSV
(comma-separated values) file. The user has to supply input values such as
the number of clusters, the number of grids and the number of clusters per
grid. Given these values, our algorithm discovers the specified number of
clusters.
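The user-supplied inputs can be grouped into a single record, as in the following sketch. The class and field names are hypothetical and do not come from the PCFA implementation; the alpha field is included because the case studies below specify an alpha value, and the two instances use the settings reported there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcfaParameters:
    """Hypothetical container for the inputs the user supplies."""

    n_grids: int            # number of grids (partitions)
    clusters_per_grid: int  # number of clusters per grid
    n_clusters: int         # final number of resultant clusters
    alpha: float            # alpha value used in the case studies

    def __post_init__(self):
        # All counts must be positive for a run to make sense.
        for name in ("n_grids", "clusters_per_grid", "n_clusters"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")


case_study_1 = PcfaParameters(n_grids=3, clusters_per_grid=3, n_clusters=3, alpha=30)
case_study_2 = PcfaParameters(n_grids=4, clusters_per_grid=4, n_clusters=4, alpha=30)
print(case_study_1)
print(case_study_2)
```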
Outcomes and Findings
Case Study 1: Initially the number of grids is set to 3, the number of
clusters per grid to 3, the final number of resultant clusters to 3 and the
alpha value to 30. 188 records are placed in the first cluster with relevant
attributes, 265 records in the second cluster and 69 records in the third
cluster. 5 records are removed as outliers. In this way our algorithm groups
the data set into clusters. The execution time for this run is 0.249
seconds, the memory requirement is 515 KB and the accuracy is 51%.
Case Study 2: The number of grids is set to 4, the number of clusters per
grid to 4, the number of resultant clusters to 4 and the alpha value to 30.
191 records are placed in the first cluster, 84 records in the second
cluster, 214 records in the third cluster and 35 records in the fourth
cluster. The accuracy is 57.4%, the execution time is 0.329 seconds and the
memory requirement is 303 KB.
Findings: With the alpha value fixed at 30, accuracy is higher with four
resultant clusters (57.4%) than with three (51%). Execution time increases
when the number of partitions increases (from 0.249 to 0.329 seconds).
Memory requirement is lower when the number of partitions is higher (303 KB
against 515 KB).
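The record counts reported in the two case studies can be checked against the 527 instances of the data set with a few lines of arithmetic. The sketch below uses only the figures given in the text. No outlier count is reported for Case Study 2, so records not assigned to a cluster there are shown as unaccounted for rather than assumed to be outliers.

```python
TOTAL_RECORDS = 527

runs = {
    "Case Study 1": {"clusters": [188, 265, 69], "outliers": 5},
    # The text gives no outlier count for this run.
    "Case Study 2": {"clusters": [191, 84, 214, 35], "outliers": None},
}


def unaccounted(run):
    """Records that are neither in a cluster nor reported as outliers."""
    assigned = sum(run["clusters"])
    outliers = run["outliers"] or 0
    return TOTAL_RECORDS - assigned - outliers


for name, run in runs.items():
    print(name, "assigned:", sum(run["clusters"]), "unaccounted:", unaccounted(run))
```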
7.2.2 Application of PCFA in Data Set 2
The second data set, the IPUMS Census Database data set, is taken from the
UCI Machine Learning Repository. IPUMS stands for Integrated Public Use
Microdata Series; it is the world's largest population database. IPUMS
consists of microdata samples from United States (IPUMS-USA) and
international (IPUMS-International) census records. The data are available to
researchers through a web-based data dissemination system.
The original owner is Professor Steven Ruggles, director of IPUMS,
Historical Census Projects, University of Minnesota.
The IPUMS provides consistent variable names, coding schemes,
and documentation across all the samples.
IPUMS Census data includes data for countries from Africa, Asia,
Europe, and Latin America. The database currently includes 159 samples
from 55 countries around the world. IPUMS-International converts census
microdata for multiple countries into a consistent format, allowing for
comparisons across countries and time periods. Special efforts are made to
simplify use of the data while losing no meaningful information.
Comprehensive documentation is provided in a coherent form to facilitate
comparative analyses of social and economic change.
This data set contains un-weighted PUMS census data from the Los
Angeles and Long Beach areas for the years 1970, 1980, and 1990.
Table 7.2.2 IPUMS Census Database Data set
Data Set Characteristics:   Multivariate          Number of Instances:   256932
Attribute Characteristics:  Categorical, Integer  Number of Attributes:  61
Associated Tasks:           N/A                   Missing Values?        N/A
The original source for this data set is the IPUMS project
(Ruggles and Sobek, 1997). The IPUMS project is a large collection of federal
census data which has standardized coding schemes to make comparisons
across time easy.
The data is an unweighted 1 in 100 sample of responses from the
Los Angeles -- Long Beach area for the years 1970, 1980, and 1990. The
household and individual records were flattened into a single table and we
used all variables that were available for all three years. When there was more
than one version of a variable, such as for race, we used the most general.
It includes three data files, ipums.la.97.gz, ipums.la.98.gz and
ipums.la.99.gz, labelled 97, 98 and 99 respectively. To apply the PCFA
algorithm, the first file, ipums.la.97.gz, is used. In total it has 61
attributes and more than 200000 records.
The attribute names and their column positions in each record are as follows:
year 1- 2
gq 3- 3
gqtypeg 4- 4
farm 5- 5
ownershg 6- 6
value 7- 12
rent 13- 16
ftotinc 17- 22
nfams 23- 24
ncouples 25- 25
nmothers 26- 26
nfathers 27- 27
momloc 28- 29
stepmom 30- 30
momrule 31- 31
poploc 32- 33
steppop 34- 34
poprule 35- 35
sploc 36- 37
sprule 38- 38
famsize 39- 40
nchild 41- 41
nchlt5 42- 42
famunit 43- 44
eldch 45- 46
yngch 47- 48
nsibs 49- 49
relateg 50- 51
age 52- 54
sex 55- 55
raceg 56- 56
marst 57- 57
chborn 58- 59
bplg 60- 62
school 63- 63
educrec 64- 64
schltype 65- 65
empstatg 66- 66
labforce 67- 67
occ1950 68- 70
occscore 71- 72
sei 73- 74
ind1950 75- 77
classwkg 78- 78
wkswork2 79- 79
hrswork2 80- 80
yrlastwk 81- 82
workedyr 83- 83
inctot 84- 89
incwage 90- 95
incbus 96-101
incfarm 102-107
incss 108-112
incwelfr 113-117
incother 118-122
poverty 123-125
migrat5g 126-126
migplac5 127-129
movedin 130-130
vetstat 131-131
tranwork 132-133
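Reading the number ranges in the list above as 1-based, inclusive column positions in a fixed-width record (an interpretation, suggested by the contiguous ranges from 1 to 133), a record can be split into named fields as sketched below. Only the first six attributes are included, and the sample record is made up.

```python
# (start, end) column positions, 1-based and inclusive, copied from the list.
COLUMNS = {
    "year": (1, 2),
    "gq": (3, 3),
    "gqtypeg": (4, 4),
    "farm": (5, 5),
    "ownershg": (6, 6),
    "value": (7, 12),
}


def parse_record(line, columns=COLUMNS):
    """Slice one fixed-width line into a dict of raw string fields."""
    return {name: line[start - 1:end] for name, (start, end) in columns.items()}


# Made-up 12-character record covering the six columns above.
sample = "701011045000"
record = parse_record(sample)
print(record)
```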
The value labels for the coded attributes are also given in the data set
documentation, as follows:
variable labels gq 'Group quarters status'.
value labels gq 0 'Vacant unit'
1 'HH in 1970 definition, but sampled as larger unit'
2 'Additional hhs under 1990 definition'
3 'Institution'
4 'Other group quarters'
5 'Boarders and lodgers in 1900'
6 'Fragment'.
variable labels gqtypeg 'Group quarters type--general'.
value labels gqtypeg 0 'NA (household)' 1 'Institution (1990)'
2 'Correctional institutions'
3 'Mental institutions' 4 'Other institutions'
5 'Non-institutional GQ (1940, 1950, 1990)'
6 'Military' 7 'College dormitory' 8 'Rooming house'
9 'Other non-instit GQ and unknown'
variable labels farm 'Farm status'.
value labels farm 1 'Non-Farm' 2 'Farm'.
variable labels ownershg 'Ownership of dwelling--general'.
value labels ownershg 0 'N/A' 1 'Owned or being bought (loan)' 2 'Rented'.
variable labels relateg 'Relationship to hh head--general'.
value labels relateg 1 'Head/Householder' 2 'Spouse' 3 'Child'
4 'Child-in-law' 5 'Parent' 6 'Parent-in-Law' 7 'Sibling' 8 'Sibling-in-Law'
9 'Grandchild' 10 'Other relatives' 11 'Partner, friend, visitor'
12 'Other non-relatives' 13 'Institutional inmates'.
variable labels sex 'Sex'.
value labels sex 1 'Male' 2 'Female'.
variable labels raceg 'Race--general'.
value labels raceg 1 'White' 2 'Black/Negro' 3 'American Indian' 4 'Chinese'
5 'Japanese' 6 'Other Asian or Pacific' 7 'Other race, nec'.
variable labels marst 'Marital status'.
value labels marst 1 'Married, spouse present' 2 'Married, spouse absent'
3 'Separated' 4 'Divorced' 5 'Widowed' 6 'Never married/single'.
variable labels chborn 'Children ever born'.
value labels chborn 0 'N/A' 1 'No children' 2 '1 child' 3 '2 children' 4 '3'
5 '4' 6 '5' 7 '6' 8 '7' 9 '8' 10 '9' 11 '10 children' 12 '11'
13 '12 (12+ 1960-1990)' 14 '13' 15 '14' 16 '15' 17 '16' 18 '17' 19 '18'
20 '19' 21 '20' 22 '21' 23 '22' 24 '23' 25 '24' 26 '25 (25+ 1950)' 27 '26'
28 '27' 31 '30' 34 '33' 51 '50' 57 '56' 61 '60'.
variable labels bplg 'Birthplace--general'.
value labels bplg 001 'Alabama' 2 'Alaska' 4 'Arizona'
5 'Arkansas' 6 'California' 8 'Colorado' 9 'Connecticut'
10 'Delaware' 11 'District of Columbia' 12 'Florida' 13 'Georgia'
15 'Hawaii' 16 'Idaho' 17 'Illinois'
18 'Indiana' 19 'Iowa' 20 'Kansas' 21 'Kentucky'
22 'Louisiana' 23 'Maine' 24 'Maryland' 25 'Massachusetts'
26 'Michigan' 27 'Minnesota' 28 'Mississippi' 29 'Missouri'
30 'Montana' 31 'Nebraska' 32 'Nevada' 33 'New Hampshire'
34 'New Jersey' 35 'New Mexico'
36 'New York' 37 'North Carolina' 38 'North Dakota' 39 'Ohio'
40 'Oklahoma' 41 'Oregon' 42 'Pennsylvania'
44 'Rhode Island' 45 'South Carolina' 46 'South Dakota'
47 'Tennessee' 48 'Texas' 49 'Utah'
50 'Vermont' 51 'Virginia' 53 'Washington'
54 'West Virginia' 55 'Wisconsin' 56 'Wyoming'
90 'Native American' 99 'United States, ns'
100 'American Samoa' 105 'Guam'
110 'Puerto Rico' 115 'U.S. Virgin Islands'
120 'Other US Possessions' 150 'Canada' 160 'Atlantic Islands'
199 'North America, ns' 200 'Mexico' 210 'Central America'
250 'Cuba' 260 'West Indies' 300 'South America'
400 'Denmark' 401 'Finland' 402 'Iceland' 403 'Lapland' 404 'Norway'
405 'Sweden' 410 'England' 411 'Scotland' 412 'Wales'
413 'United Kingdom, ns' 414 'Ireland'
419 'Northern Europe, ns' 420 'Belgium' 421 'France'
422 'Liechtenstein' 423 'Luxembourg' 424 'Monaco'
425 'Netherlands' 426 'Switzerland' 429 'Western Europe, ns'
430 'Albania' 431 'Andorra' 432 'Gibraltar' 433 'Greece'
434 'Italy' 435 'Malta' 436 'Portugal'
437 'San Marino' 438 'Spain' 439 'Vatican City'
440 'Southern Europe, ns' 450 'Austria'
451 'Bulgaria' 452 'Czechoslovakia'
453 'Germany' 454 'Hungary' 455 'Poland' 456 'Romania'
457 'Yugoslavia' 458 'Central Europe, ns' 459 'Eastern Europe, ns'
460 'Estonia' 461 'Latvia'
462 'Lithuania' 463 'Baltic States, ns' 465 'USSR/"Russia"'
499 'Europe, ns' 500 'China' 501 'Japan' 502 'Korea'
509 'East Asia, ns' 510 'Brunei' 511 'Cambodia (Kampuchea)'
512 'Indonesia' 513 'Laos'
514 'Malaysia' 515 'Philippines' 516 'Singapore' 517 'Thailand'
518 'Vietnam' 519 'Southeast Asia, ns' 520 'Afghanistan' 521 'India'
522 'Iran' 523 'Maldives' 524 'Nepal' 530 'Bahrain' 531 'Cyprus'
532 'Iraq' 533 'Iraq/Saudi Arabia' 534 'Israel/Palestine'
535 'Jordan' 536 'Kuwait' 537 'Lebanon' 538 'Oman' 539 'Qatar'
540 'Saudi Arabia' 541 'Syria' 542 'Turkey' 543 'United Arab Emirates'
544 'Yemen Arab Republic (North)' 545 'Yemen, PDR (South)'
546 'Persian Gulf States, ns'
547 'Middle East, ns' 548 'Southwest Asia, nec/ns' 549 'Asia Minor, ns'
550 'South Asia, nec' 599 'Asia, nec/ns' 600 'Africa'
700 'Australia and New Zealand' 710 'Pacific Islands'
800 'Antarctica, ns/nec' 900 'Abroad (unknown) or at sea'
997 'Missing/Unknown' 999 'Unknown value'.
variable labels school 'School attendance'.
value labels school 0 'N/A' 1 'No, not in school' 2 'Yes, in school'.
variable labels educrec 'Educational attainment, recode'.
value labels educrec 0 'N/A (or none, 1980)' 1 'None or preschool'
2 'Grade 1-4' 3 'Grade 5-8' 4 'Grade 9' 5 'Grade 10'
6 'Grade 11' 7 'Grade 12' 8 '1 to 3 years of college' 9 '4+ years of college'.
variable labels schltype 'Public or private school'.
value labels schltype 0 'N/A' 1 'Not enrolled' 2 'Public school'
3 'Private school (1960,1990)' 4 'Church-related (1980)' 5 'Parochial (1970)'
6 'Other private, 1980' 7 'Other private, 1970'.
variable labels empstatg 'Employment status--general'.
value labels empstatg 0 'N/A' 1 'Employed' 2 'Unemployed'
3 'Not in labor force'.
variable labels labforce 'Labor force status'.
value labels labforce 0 'N/A' 1 'No, not in labor force'
2 'Yes, in labor force'.
variable labels occ1950 'Occupation, 1950 basis'.
value labels occ1950 0 'Accountants and auditors' 1 'Actors and actresses'
2 'Airplane pilots and navigators' 3 'Architects'
4 'Artists and art teachers' 5 'Athletes' 6 'Authors' 7 'Chemists'
8 'Chiropractors' 9 'Clergymen' 10 'College presidents and deans'
12 'Agricultural sciences-Professors'
13 'Biological sciences-Professors'
14 'Chemistry-Professors'
15 'Economics-Professors'
16 'Engineering-Professors'
17 'Geology and geophysics-Professors'
18 'Mathematics-Professors'
19 'Medical Sciences-Professors'
23 'Physics-Professors'
24 'Psychology-Professors'
25 'Statistics-Professors'
26 'Natural science (nec)-Professors'
27 'Social sciences (nec)-Professors'
28 'Nonscientific subjects-Professors'
29 'Subject not specified-Professors'
31 'Dancers and dancing teachers' 32 'Dentists' 33 'Designers'
34 'Dieticians and nutritionists' 35 'Draftsmen' 36 'Editors and reporters'
41 'Aeronautical-Engineers' 42 'Chemical-Engineers' 43 'Civil-Engineers'
44 'Electrical-Engineers' 45 'Industrial-Engineers' 46 'Mechanical-Engineers'
47 'Metallurgical, metallurgists-Engineers' 48 'Mining-Engineers'
49 'Engineers (nec)' 51 'Entertainers (nec)'
52 'Farm and home management advisors' 53 'Foresters and conservationists'
54 'Funeral directors and embalmers' 55 'Lawyers and judges' 56 'Librarians'
57 'Musicians and music teachers' 58 'Nurses, professional'
59 'Nurses, student professional' 61 'Agricultural scientists'
62 'Biological scientists' 63 'Geologists and geophysicists'
67 'Mathematicians' 68 'Physicists' 69 'Misc. natural scientists'
70 'Optometrists' 71 'Osteopaths' 72 'Personnel and labor relations workers'
73 'Pharmacists' 74 'Photographers' 75 'Physicians and surgeons'
76 'Radio operators' 77 'Recreation and group workers' 78 'Religious workers'
79 'Social and welfare workers, except group' 81 'Economists'
82 'Psychologists' 83 'Statisticians and actuaries'
84 'Misc social scientists' 91 'Sports instructors and officials'
92 'Surveyors' 93 'Teachers (n.e.c.)' 94 'Medical and dental-technicians'
95 'Testing-technicians' 96 'Technicians (nec)'
97 'Therapists and healers (nec)' 98 'Veterinarians'
99 'Professional, technical & kindred workers (nec)'
100 'Farmers (owners and tenants)' 123 'Farm managers'
200 'Buyers and dept heads, store' 201 'Buyers and shippers, farm products'
203 'Conductors, railroad' 204 'Credit men'
205 'Floormen and floor managers, store'
210 'Inspectors, public administration'
230 'Managers & superintendents, building'
240 'Officers, pilots, pursers and engineers, ship'
250 'Officials & administrators (nec), public administration'
260 'Officials, lodge, society, union, etc.' 270 'Postmasters'
280 'Purchasing agents and buyers (nec)'
290 'Managers, officials, and proprietors (nec)' 300 'Agents (nec)'
301 'Attendants and assistants, library'
302 'Attendants, physicians and dentists office'
304 'Baggagemen, transportation' 305 'Bank tellers' 310 'Bookkeepers'
320 'Cashiers' 321 'Collectors, bill and account'
322 'Dispatchers and starters, vehicle'
325 'Express messengers and railway mail clerks' 335 'Mail carriers'
340 'Messengers and office boys' 341 'Office machine operators'
342 'Shipping and receiving clerks'
350 'Stenographers, typists, and secretaries' 360 'Telegraph messengers'
365 'Telegraph operators' 370 'Telephone operators'
380 'Ticket, station, and express agents'
390 'Clerical and kindred workers (n.e.c.)'
400 'Advertising agents and salesmen' 410 'Auctioneers' 420 'Demonstrators'
430 'Hucksters and peddlers' 450 'Insurance agents and brokers'
460 'Newsboys' 470 'Real estate agents and brokers'
480 'Stock and bond salesmen' 490 'Salesmen and sales clerks (nec)'
500 'Bakers' 501 'Blacksmiths' 502 'Bookbinders' 503 'Boilermakers'
504 'Brickmasons,stonemasons, and tile setters' 505 'Cabinetmakers'
510 'Carpenters' 511 'Cement and concrete finishers'
512 'Compositors and typesetters' 513 'Cranemen,derrickmen, and hoistmen'
514 'Decorators and window dressers' 515 'Electricians'
520 'Electrotypers and stereotypers' 521 'Engravers, except photoengravers'
522 'Excavating, grading, and road machinery operators' 523 'Foremen (nec)'
524 'Forgemen and hammermen' 525 'Furriers' 530 'Glaziers'
531 'Heat treaters, annealers, temperers'
532 'Inspectors, scalers, and graders log and lumber' 533 'Inspectors (nec)'
534 'Jewelers, watchmakers, goldsmiths, and silversmiths'
535 'Job setters, metal'
540 'Linemen and servicemen, telegraph, telephone, & power'
541 'Locomotive engineers' 542 'Locomotive firemen' 543 'Loom fixers'
544 'Machinists' 545 'Airplane-mechanics and repairmen'
550 'Automobile-mechanics and repairmen'
551 'Office machine-mechanics and repairmen'
552 'Radio and television-mechanics and repairmen'
553 'Railroad and car shop-mechanics and repairmen'
554 'Mechanics and repairmen (nec)' 555 'Millers, grain, flour, feed, etc'
560 'Millwrights' 561 'Molders, metal' 562 'Motion picture projectionists'
563 'Opticians and lens grinders and polishers'
564 'Painters, construction and maintenance' 565 'Paperhangers'
570 'Pattern and model makers, except paper'
571 'Photoengravers & lithographers'
572 'Piano and organ tuners and repairmen' 573 'Plasterers'
574 'Plumbers and pipe fitters' 575 'Pressmen and plate printers, printing'
580 'Rollers and roll hands, metal' 581 'Roofers and slaters'
582 'Shoemakers and repairers, except factory' 583 'Stationary engineers'
584 'Stone cutters and stone carvers' 585 'Structural metal workers'
590 'Tailors and tailoresses'
591 'Tinsmiths, coppersmiths, and sheet metal workers'
592 'Tool makers, and die makers and setters' 593 'Upholsterers'
594 'Craftsmen and kindred workers (nec)'
595 'Members of the armed services'
600 'Auto mechanics apprentice' 601 'Bricklayers and masons apprentice'
602 'Carpenters apprentice' 603 'Electricians apprentice'
604 'Machinists and toolmakers apprentice'
605 'Mechanics, except auto apprentice'
610 'Plumbers and pipe fitters apprentice'
611 'Apprentices, building trades (nec)'
612 'Apprentices, metalworking trades (nec)'
613 'Apprentices, printing trades' 614 'Apprentices, other specified trades'
615 'Apprentices, trade not specified' 620 'Asbestos and insulation workers'
621 'Attendants, auto service and parking' 622 'Blasters and powdermen'
623 'Boatmen, canalmen, and lock keepers' 624 'Brakemen, railroad'
625 'Bus drivers' 630 'Chainmen, rodmen, and axmen, surveying'
631 'Conductors, bus & street railway' 632 'Deliverymen and routemen'
633 'Dressmakers and seamstresses except factory' 634 'Dyers'
635 'Filers, grinders, and polishers, metal'
640 'Fruit, nut, and vegetable graders, and packers, except facto'
641 'Furnacemen, smeltermen and pourers' 642 'Heaters, metal'
643 'Laundry and dry cleaning Operatives'
644 'Meat cutters, except slaughter and packing house' 645 'Milliners'
650 'Mine operatives and laborers'
660 'Motormen, mine, factory, logging camp, etc'
661 'Motormen, street, subway, and elevated railway'
662 'Oilers and greaser, except auto'
670 'Painters, except construction or maintenance'
671 'Photographic process workers' 672 'Power station operators'
673 'Sailors and deck hands' 674 'Sawyers' 675 'Spinners, textile'
680 'Stationary firemen' 681 'Switchmen, railroad'
682 'Taxicab drivers and chauffeurs' 683 'Truck and tractor drivers'
684 'Weavers, textile' 685 'Welders and flame cutters'
690 'Operative and kindred workers (nec)'
700 'Housekeepers, private household' 710 'Laundresses, private household'
720 'Private household workers (nec)'
730 'Attendants, hospital and other institution'
731 'Attendants, professional and personal service (nec)'
732 'Attendants, recreation and amusement'
740 'Barbers, beauticians, and manicurists' 750 'Bartenders' 751 'Bootblacks'
752 'Boarding and lodging house keepers' 753 'Charwomen and cleaners'
754 'Cooks, except private household' 760 'Counter and fountain workers'
761 'Elevator operators' 762 'Firemen, fire protection'
763 'Guards, watchmen, and doorkeepers'
764 'Housekeepers and stewards, except private household'
770 'Janitors and sextons' 771 'Marshals and constables' 772 'Midwives'
773 'Policemen and detectives' 780 'Porters' 781 'Practical nurses'
782 'Sheriffs and bailiffs' 783 'Ushers, recreation and amusement'
784 'Waiters and waitresses' 785 'Watchmen (crossing) and bridge tenders'
790 'Service workers, except private household (nec)' 810 'Farm foremen'
820 'Farm laborers, wage workers' 830 'Farm laborers, unpaid family workers'
840 'Farm service laborers, self-employed' 910 'Fishermen and oystermen'
920 'Garage laborers and car washers and greasers'
930 'Gardeners, except farm, and groundskeepers'
940 'Longshoremen and stevedores'
950 'Lumbermen, raftsmen, and woodchoppers'
960 'Teamsters' 970 'Laborers (nec)'
980 'Keeps house/house work/housewife'
981 'Imputed keeping house (1860-1880)' 982 'At home/ helps in home'
983 'At school' 984 'Retired' 985 'Unemployed/ without occ'
986 'Invalid/sick/disabled' 987 'Inmate/prisoner' 988 'Ration Indian'
990 'Landlord' 991 'Capitalist/gentleman' 992 'Child labor-farm'
993 'Child labor-domestic' 994 'Child labor-other'
995 'Other non-occupation'
997 'Occupation missing/unknown'
999 'N/A (blank)'.
variable labels ind1950 'Industry, 1950 basis'.
value labels ind1950 0 'N/A' 1 'Ration Indians' 2 'Prisoners'
3 'Identified housework' 105 'Agriculture' 116 'Forestry' 126 'Fisheries'
206 'Metal mining' 216 'Coal mining'
226 'Crude petroleum and natural gas extraction'
236 'Nonmetallic mining and quarrying, except fuel' 246 'Construction'
306 'Logging' 307 'Sawmills, planing mills, and mill work'
308 'Misc wood products' 309 'Furniture and fixtures'
316 'Glass and glass products'
317 'Cement, concrete, gypsum and plaster products'
318 'Structural clay products' 319 'Pottery and related prods'
326 'Misc nonmetallic mineral and stone products'
336 'Blast furnaces, steel works, & rolling mills'
337 'Other primary iron and steel industries'
338 'Primary nonferrous industries' 346 'Fabricated steel products'
347 'Fabricated nonferrous metal products'
348 'Not specified metal industries'
356 'Agricultural machinery and tractors' 357 'Office and store machines'
358 'Misc machinery' 367 'Electrical machinery, equipment and supplies'
376 'Motor vehicles and motor vehicle equipment' 377 'Aircraft and parts'
378 'Ship and boat building and repairing'
379 'Railroad and misc transportation equipment'
386 'Professional equipment'
387 'Photographic equipment and supplies'
388 'Watches, clocks, and clockwork-operated devices'
399 'Misc manufacturing industries' 406 'Meat products' 407 'Dairy products'
408 'Canning and preserving fruits, vegetables, and seafoods'
409 'Grain-mill products' 416 'Bakery products'
417 'Confectionary and related products' 418 'Beverage industries'
419 'Misc food preparations and kindred products'
426 'Not specified food industries' 429 'Tobacco manufactures'
436 'Knitting mills' 437 'Dyeing and finishing textiles, except knit goods'
438 'Carpets, rugs, and other floor coverings' 439 'Yarn, thread, and fabric'
446 'Misc textile mill products' 448 'Apparel and accessories'
449 'Misc fabricated textile products'
456 'Pulp, paper, and paper-board mills'
457 'Paperboard containers and boxes' 458 'Misc paper and pulp products'
459 'Printing, publishing, and allied industries' 466 'Synthetic fibers'
467 'Drugs and medicines' 468 'Paints, varnishes, and related products'
469 'Misc chemicals and allied products' 476 'Petroleum refining'
477 'Misc petroleum and coal products' 478 'Rubber products'
487 'Leather: tanned, curried, and finished' 488 'Footwear, except rubber'
489 'Leather products, except footwear'
499 'Not specified manufacturing industries' 506 'Railroads and railway'
516 'Street railways and bus lines' 526 'Trucking service'
527 'Warehousing and storage' 536 'Taxicab service'
546 'Water transportation' 556 'Air transportation'
567 'Petroleum and gasoline pipe lines'
568 'Services incidental to transportation' 578 'Telephone' 579 'Telegraph'
586 'Electric light and power' 587 'Gas and steam supply systems'
588 'Electric-gas utilities' 596 'Water supply' 597 'Sanitary services'
598 'Other and not specified utilities' 606 'Motor vehicles and equipment'
607 'Drugs, chemicals, and allied products' 608 'Dry goods apparel'
609 'Food and related products'
616 'Electrical goods, hardware, and plumbing equipment'
617 'Machinery, equipment, and supplies' 618 'Petroleum products'
619 'Farm prods--raw materials' 626 'Misc wholesale trade'
627 'Not specified wholesale trade' 636 'Food stores, except dairy'
637 'Dairy prods stores and milk retailing' 646 'General merchandise'
647 'Five and ten cent stores'
656 'Apparel and accessories stores, except shoe' 657 'Shoe stores'
658 'Furniture and house furnishings stores'
659 'Household appliance and radio stores'
667 'Motor vehicles and accessories retailing'
668 'Gasoline service stations' 669 'Drug stores'
679 'Eating and drinking places' 686 'Hardware and farm implement stores'
687 'Lumber and building material retailing' 688 'Liquor stores'
689 'Retail florists' 696 'Jewelry stores' 697 'Fuel and ice retailing'
698 'Misc retail stores' 699 'Not specified retail trade'
716 'Banking and credit'
726 'Security and commodity brokerage and invest companies' 736 'Insurance'
746 'Real estate' 756 'Real estate-insurance-law offices' 806 'Advertising'
807 'Accounting, auditing, and bookkeeping services'
808 'Misc business services' 816 'Auto repair services and garages'
817 'Misc repair services' 826 'Private households'
836 'Hotels and lodging places' 846 'Laundering, cleaning, and dyeing'
847 'Dressmaking shops' 848 'Shoe repair shops' 849 'Misc personal services'
856 'Radio broadcasting and television' 857 'Theaters and motion pictures'
858 'Bowling alleys, and billiard and pool parlors'
859 'Misc entertainment and recreation services'
868 'Medical and other health services, except hospitals' 869 'Hospitals'
879 'Legal services' 888 'Educational services'
896 'Welfare and religious services' 897 'Nonprofit membership organizs.'
898 'Engineering and architectural services'
899 'Misc professional and related' 906 'Postal service'
916 'Federal public administration' 926 'State public administration'
936 'Local public administration'
997 'Not classifiable' 998 'Industry not reported'.
variable labels classwkg 'Class of worker--general'.
value labels classwkg 0 'N/A' 1 'Self-employed' 2 'Works for wages/salary'
3 'New worker' 4 'Unemployed, last worked 5 years ago'.
variable labels wkswork2 'Weeks worked last year, intervalled'.
value labels wkswork2 0 'N/A' 1 '1-13 weeks' 2 '14-26 weeks' 3 '27-39 weeks'
4 '40-47 weeks' 5 '48-49 weeks' 6 '50-52 weeks'.
variable labels hrswork2 'Hours work last week, intervalled'.
value labels hrswork2 0 'N/A' 1 '1-14 hours' 2 '15-29 hours' 3 '30-34 hours'
4 '35-39 hours' 5 '40 hours' 6 '41-48 hours' 7 '49-59 hours' 8 '60+ hours'.
variable labels yrlastwk 'Year last worked'.
value labels yrlastwk 0 'N/A' 10 'Worked current yr' 20 'Worked previous yr'
31 'Worked 2 yrs prior' 32 'Worked 2-5 yrs ago' 33 'Worked 3-5 yrs ago'
34 'Worked 3-6 yrs ago' 35 'Worked 6-10 yrs ago' 36 'Worked 7-10 yrs ago'
40 'Worked more than 10 yrs ago' 50 'Never worked'.
variable labels workedyr 'Worked last year'.
value labels workedyr 0 'N/A' 1 'No' 2 'Yes'.
variable labels migrat5g 'Migration status, 5 yrs--general'.
value labels migrat5g 0 'N/A' 1 'Same house' 2 'Moved, place not reported'
3 'Same state/county,different house' 4 'Same state, different county'
5 'Different state' 6 'Abroad' 7 'Same state, place not reported'
9 'Unknown'.
variable labels migplac5 'State/country resid. 5 yrs ago'.
value labels migplac5 0 'N/A' 1 'Alabama' 2 'Alaska' 4 'Arizona' 5 'Arkansas'
6 'California' 8 'Colorado' 9 'Connecticut' 10 'Delaware'
11 'District of Columbia' 12 'Florida' 13 'Georgia' 15 'Hawaii' 16 'Idaho'
17 'Illinois' 18 'Indiana' 19 'Iowa' 20 'Kansas' 21 'Kentucky' 22 'Louisiana'
23 'Maine' 24 'Maryland' 25 'Massachusetts' 26 'Michigan' 27 'Minnesota'
28 'Mississippi' 29 'Missouri' 30 'Montana' 31 'Nebraska' 32 'Nevada'
33 'New Hampshire' 34 'New Jersey' 35 'New Mexico' 36 'New York'
37 'North Carolina' 38 'North Dakota' 39 'Ohio' 40 'Oklahoma' 41 'Oregon'
42 'Pennsylvania' 44 'Rhode Island' 45 'South Carolina' 46 'South Dakota'
47 'Tennessee' 48 'Texas' 49 'Utah' 50 'Vermont' 51 'Virginia'
53 'Washington' 54 'West Virginia' 55 'Wisconsin' 56 'Wyoming'
61 'State group (1980): Maine-New Hamp-Vermont'
62 'State group (1980): Massachusetts-Rhode Island'
63 'State group (1980): Minn-Iowa-Missouri-Kansas-Nebras'
64 'State group (1980): Maryland-Delaware'
65 'State group (1980): Montana-Idaho-Wyoming'
66 'State group (1980): Utah-Nevada'
67 'State group (1980): Arizona-New Mexico'
68 'State group (1980): Alaska-Hawaii'
99 'United States, n.s. or confidential'
100 'American Samoa' 105 'Guam' 110 'Puerto Rico' 115 'Virgin Islands'
119 'US outlying area, 1980' 150 'Canada' 151 'English Canada'
152 'French Canada' 160 'Atlantic Islands' 200 'Mexico'
211 'Belize/British Honduras' 212 'Costa Rica' 213 'El Salvador'
214 'Guatemala' 215 'Honduras' 216 'Nicaragua' 217 'Panama' 218 'Canal Zone'
219 'Central America, nec' 250 'Cuba' 261 'Dominican Republic' 262 'Haiti'
263 'Jamaica' 264 'British West Indies' 266 'Trinidad & Tobago'
267 'Other West Indies' 305 'Argentina' 310 'Bolivia' 315 'Brazil'
320 'Chile' 325 'Colombia' 330 'Ecuador' 345 'Paraguay' 350 'Peru'
360 'Uruguay' 365 'Venezuela' 390 'South America, nec' 400 'Denmark'
402 'Finland' 403 'Iceland' 404 'Ireland' 406 'Norway' 409 'Sweden'
410 'England' 413 'Northern Ireland' 415 'Scotland' 416 'Wales' 420 'Austria'
421 'Belgium' 422 'France' 424 'Luxembourg' 426 'Netherlands'
427 'Switzerland' 430 'Albania' 433 'Greece' 435 'Italy' 437 'Portugal'
438 'Azores' 441 'Spain' 450 'Czechoslovakia' 453 'Germany' 456 'Hungary'
457 'Poland' 460 'Yugoslavia' 470 'Bulgaria' 471 'Romania' 489 'Europe, nec'
490 'Estonia' 491 'Latvia' 492 'Lithuania' 495 'USSR' 496 'Byelorussia'
498 'Ukraine' 500 'China' 504 'Japan' 506 'Korea' 515 'Philippines'
518 'Vietnam' 521 'India' 525 'Pakistan' 527 'Iran' 535 'Israel/Palestine'
539 'Jordan' 541 'Lebanon' 545 'Syria' 546 'Turkey' 559 'Southwest Asia, nec'
590 'Asia, nec' 600 'Africa' 610 'Northern Africa'
612 'Egypt/United Arab Rep.' 670 'Central Africa' 690 'Southern Africa'
694 'South Africa (Union of)' 699 'Africa, nec' 701 'Australia'
702 'New Zealand' 710 'Pacific Islands' 715 'US Pacific Trust Terrs'
900 'Abroad (unknown) or at sea' 911 'Abroad, n.s.' 912 'Abroad, at sea'
990 'Same house' 997 'Undocumented value' 999 'Missing/unknown'.
variable labels vetstat 'Veteran status'.
value labels vetstat 0 'N/A' 1 'No Service' 2 'Yes' 9 'Not ascertained'.
variable labels tranwork 'Means of transport to work'.
value labels tranwork 00 'N/A (+ not reported 1960)' 10 'Auto, truck, or van'
11 'Auto' 12 'Driver' 13 'Passenger' 14 'Truck' 15 'Van' 20 'Motorcycle'
30 'Bus or streetcar' 31 'Bus or trolley bus' 32 'Streetcar or trolley car'
33 'Subway or elevated' 34 'Railroad' 35 'Taxicab' 36 'Ferryboat'
40 'Bicycle' 50 'Walked only' 60 'Other' 70 'Worked at home'.
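As an illustration of how such value labels are used when reading clustered records, the following minimal Python sketch decodes coded values of hrswork2 into their labels. The label table is copied from the listing above; the function name decode and the sample codes are assumptions made only for this example.

```python
# Value labels for hrswork2, copied from the SPSS listing above.
HRSWORK2_LABELS = {
    0: "N/A",
    1: "1-14 hours",
    2: "15-29 hours",
    3: "30-34 hours",
    4: "35-39 hours",
    5: "40 hours",
    6: "41-48 hours",
    7: "49-59 hours",
    8: "60+ hours",
}


def decode(value, labels):
    """Return the label for a coded value, or a marker if the code has no label."""
    return labels.get(value, f"unlabelled ({value})")


# Hypothetical coded values, not taken from the data set.
codes = [5, 0, 8, 12]
decoded = [decode(code, HRSWORK2_LABELS) for code in codes]
print(decoded)
```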
Outcomes and Findings
Case Study 1: The input values are 8 for the number of grids, 10 for clusters
per grid, 8 for the resultant clusters and 30 for alpha. The execution time is
342.69 seconds and the accuracy is 49.6%. Eight clusters are formed: 41248
records are placed in cluster1, 27254 in cluster2, 34325 in cluster3, 33845
in cluster4, 20117 in cluster5, 46234 in cluster6, 25829 in cluster7 and
26765 in cluster8. 3478 records are omitted as outliers.
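The cluster sizes reported for Case Study 1 can be summarised with a short calculation. The sketch below is only a reader's check, not part of PCFA; it assumes that every record is either placed in one of the eight clusters or omitted as an outlier, so the total is derived from the reported counts rather than taken from the data set description.

```python
# Cluster sizes reported for Case Study 1 (cluster1 .. cluster8).
cluster_sizes = [41248, 27254, 34325, 33845, 20117, 46234, 25829, 26765]
outliers = 3478

clustered = sum(cluster_sizes)
# Assumption: each record is either clustered or omitted as an outlier.
total = clustered + outliers

outlier_share = 100.0 * outliers / total
size_ratio = max(cluster_sizes) / min(cluster_sizes)

print(f"clustered records: {clustered}")
print(f"total records:     {total}")
print(f"outlier share:     {outlier_share:.2f}%")
print(f"largest/smallest:  {size_ratio:.2f}")
```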
Case Study 2: The input values are 10 for the number of grids, 10 for clusters
per grid, 10 for the resultant clusters and 30 for alpha. The execution time is
440 seconds and the accuracy is 51.8%. Ten clusters are formed with relevant
attributes.
Findings:
1. Accuracy increases as the number of partitions increases.
2. Execution time increases as the number of partitions
increases.
3. Accuracy is high when the alpha value is 30 and the number of
resultant clusters is high.
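The size of the trade-off between the two case studies can be worked out from the reported figures. The following sketch is a plain comparison of the two runs; the Run record and the relative_change helper are hypothetical names introduced for this example, and with only two runs the result indicates a direction rather than a general rule.

```python
from dataclasses import dataclass


@dataclass
class Run:
    grids: int
    clusters_per_grid: int
    target_clusters: int
    alpha: int
    seconds: float
    accuracy_pct: float


# Figures as reported in Case Study 1 and Case Study 2.
run1 = Run(8, 10, 8, 30, 342.69, 49.6)
run2 = Run(10, 10, 10, 30, 440.0, 51.8)


def relative_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


time_change = relative_change(run1.seconds, run2.seconds)
accuracy_gain = run2.accuracy_pct - run1.accuracy_pct

print(f"execution time change: {time_change:.1f}%")
print(f"accuracy gain:         {accuracy_gain:.1f} percentage points")
```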
7.2.3 Application of PCFA in Data Set 3
German Credit Data Set
Abstract: This dataset classifies people described by a set of attributes as
good or bad credit risks. It comes in two formats (one all numeric).
Table 7.2.3 German Credit Data Set
Data Set Characteristics: Multivariate
Number of Instances: 1000
Attribute Characteristics: Categorical, Integer
Number of Attributes: 20
Associated Tasks: Classification, Clustering
Missing Values? N/A
Source:
Professor Dr. Hans Hofmann
Institut für Statistik und Ökonometrie
Universität Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13
Data Set Information:
Two datasets are provided. The original dataset, in the form provided
by Prof. Hofmann, contains categorical/symbolic attributes and is in the file
"german.data".
For algorithms that need numerical attributes, Strathclyde
University produced the file "german.data-numeric". This file has been edited
and several indicator variables added to make it suitable for algorithms which
cannot cope with categorical variables. Several attributes that are ordered
categorical (such as attribute 17) have been coded as integer.
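The two kinds of conversion mentioned here, indicator variables for unordered categories and integer codes for ordered ones, can be sketched as follows. This is a minimal illustration and not the procedure actually used to produce "german.data-numeric": the function names, the choice of the housing attribute for the indicator example and the 1-based integer coding of the job attribute are all assumptions. The category codes are those of attributes 15 and 17 of this data set.

```python
# Category codes of attribute 15 (housing): rent, own, for free.
HOUSING_CODES = ["A151", "A152", "A153"]
# Category codes of attribute 17 (job), in their listed order.
JOB_ORDER = ["A171", "A172", "A173", "A174"]


def indicator_encode(value, codes):
    """Return one 0/1 indicator variable per known category code."""
    if value not in codes:
        raise ValueError(f"unknown code: {value}")
    return [1 if value == code else 0 for code in codes]


def ordinal_encode(value, ordered_codes):
    """Return the 1-based position of value in ordered_codes (assumed coding)."""
    return ordered_codes.index(value) + 1


# Illustrative record, not taken from the file.
record = {"housing": "A152", "job": "A173"}
features = indicator_encode(record["housing"], HOUSING_CODES)
features.append(ordinal_encode(record["job"], JOB_ORDER))
print(features)
```

Indicator variables avoid imposing an artificial order on categories such as housing, while an integer code keeps the order of categories such as job.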
Attribute Information:
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in months
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
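The attribute list above can be turned into a simple record parser that keeps qualitative codes as text and converts the numerical attributes (2, 5, 8, 11, 13, 16 and 18) to integers. The sketch below assumes that each record is a single line of whitespace-separated fields in attribute order; that layout, the function name parse_record and the sample line are assumptions for illustration. The sample line is built from codes in the list above and is not claimed to be a row of the file.

```python
# Positions of the attributes marked "(numerical)" in the attribute list.
NUMERIC_ATTRIBUTES = {2, 5, 8, 11, 13, 16, 18}
ATTRIBUTE_COUNT = 20


def parse_record(line):
    """Map attribute number (1..20) to an int or a category code string."""
    fields = line.split()
    if len(fields) < ATTRIBUTE_COUNT:
        raise ValueError(f"expected {ATTRIBUTE_COUNT} fields, got {len(fields)}")
    record = {}
    for position, raw in enumerate(fields[:ATTRIBUTE_COUNT], start=1):
        record[position] = int(raw) if position in NUMERIC_ATTRIBUTES else raw
    return record


# Illustrative line only (assumed layout).
sample = ("A11 6 A34 A43 1169 A65 A75 4 A93 A101 "
          "4 A121 67 A143 A152 2 A173 1 A192 A201")
parsed = parse_record(sample)
print(parsed[1], parsed[2], parsed[5], parsed[13])
```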
Case Study 1:
The German Credit data set is given as the input file to the PCFA
algorithm, which requires the number of grids, the clusters per grid and the
number of target clusters as inputs. With these inputs set to 5, 3 and 5 and
an alpha value of 40, 5 target clusters are formed. The credit data is grouped
according to job, property and credit amount. 220 data points are included in
cluster1, 185 in cluster2, 260 in cluster3, 140 in cluster4 and 173 in
cluster5. The remaining data points are omitted as outliers. The execution
time is 389 seconds, the accuracy is 50.8% and the memory requirement is
640 KB.
Case Study 2:
The German Credit data set is again given as the input file to the PCFA
algorithm. With the number of grids, the clusters per grid and the number of
target clusters set to 4, 4 and 4 and an alpha value of 20, 4 target clusters
are formed. 229 data points are included in cluster1, 177 in cluster2, 260 in
cluster3 and 309 in cluster4. The remaining data points are omitted as
outliers. The execution time is 323 seconds, the accuracy is 58.8% and the
memory requirement is 589 KB.
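The number of data points omitted as outliers in each case study is not stated directly, but it follows from the reported cluster sizes and the 1000 instances of the data set. The sketch below performs this subtraction; the function name outlier_count is introduced only for this example.

```python
TOTAL_RECORDS = 1000  # number of instances in the German Credit data set

# Cluster sizes as reported in the two case studies.
case_studies = {
    "Case Study 1": [220, 185, 260, 140, 173],
    "Case Study 2": [229, 177, 260, 309],
}


def outlier_count(sizes, total=TOTAL_RECORDS):
    """Records not assigned to any cluster."""
    return total - sum(sizes)


outlier_counts = {name: outlier_count(sizes) for name, sizes in case_studies.items()}
for name, count in outlier_counts.items():
    print(f"{name}: {count} outliers ({100.0 * count / TOTAL_RECORDS:.1f}%)")
```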
Findings
1. Accuracy is higher for smaller alpha values when the number of
resultant clusters is more than 3.
2. Memory requirement is low when the alpha value is low.
3. Execution time increases as the number of partitions increases.