
Finding the Solution to Data Mining:

A Map of the Features and Components of SAS® Enterprise Miner™ Software

John Brocklebank, SAS Institute Inc., Cary, NC
Mark Brown, SAS Institute Inc., Cary, NC

Abstract

Enterprise Miner™ software is the data mining solution from SAS Institute Inc. This paper discusses three main features of Enterprise Miner: the graphical user interface (GUI), the SEMMA methodology, and client/server enablement. It then maps the components of the solution to those features.

Introduction

Data mining is a process, not just a series of statistical analyses. Simply applying disparate software tools to a data mining project can take one only so far. Instead, what is needed to plan, implement, and successfully refine a data mining project is an integrated software solution, one that encompasses all steps of the process, beginning with the sampling of data, through sophisticated data analyses and modeling, to the dissemination of the resulting business-critical information. In addition, the ideal solution should be intuitive and flexible enough that users with different degrees of statistical expertise can understand and use it.

To accomplish all this, the data mining solution must provide

• advanced, yet easy-to-use, statistical analyses and reporting techniques

• a guiding, yet flexible, methodology

• client/server enablement.

SAS® Enterprise Miner™ software is that solution. It synthesizes the world-renowned statistical analysis and reporting system of SAS Institute with an easy-to-use GUI that can be understood and used by business analysts as well as quantitative experts.

The components of the GUI can be used to implement a data mining methodology developed by the Institute. However, the methodology does not dictate the steps to be taken in projects. Instead, the methodology enables users to move data mining projects from raw data to information by going step-by-step as needed from sampling of data, through exploration and modification, to modeling, assessment and scoring of new data, and then to the dissemination of the results.

The GUI also contains components that help administrators set up and maintain the client/server deployment of Enterprise Miner software.

In addition, as a software solution from SAS Institute, Enterprise Miner fully integrates with the rest of the SAS® System, including the award-winning SAS/Warehouse Administrator™ software, the SAS solution for online analytical processing (OLAP), and SAS/IntrNet™ software, which enables applications deployment via intranets and the World Wide Web.

The Graphical User Interface

Enterprise Miner employs a single, graphical user interface (GUI) to give users all the functionality needed to uncover valuable information hidden in their volumes of data. With one point-and-click interface, users can perform the entire data mining process, from inputting data from diverse sources, through preparing the data for modeling and assessing the value of the models, to scoring models for use in making business decisions.

The GUI is designed with two groups of users in mind: business analysts, who may have minimal statistical expertise, can quickly and easily navigate through the data mining process; and quantitative experts, who may want to explore the details, can access and fine-tune the underlying analytical processes. The GUI uses familiar desktop objects such as tool bars, menus, windows, and dialog pages to equip both groups with a full range of data mining tools.

Beginning Tutorials

The main components of the GUI are the

• Available Projects window
• Enterprise Miner Workspace window
• Node Types window
• Message window.

Available Projects Window

The Available Projects window opens when you start Enterprise Miner.

Figure 1: Projects Displayed in the Available Projects Window

The window displays any existing projects in a familiar hierarchical form. Along with the tool bar and menus, the Available Projects window enables you to create and manage data mining projects.

Enterprise Miner Workspace Window

When you open an existing project or create a new one by using the Available Projects window, the Enterprise Miner Workspace window is displayed. Using the Enterprise Miner Workspace window, the Node Types window, the tool bar, and menus, you can build, edit, and run process flow diagrams (PFDs).

Figure 2: A PFD Displayed in the Enterprise Miner Workspace Window

With these easy-to-use tools, you can map out your entire data mining project, launch individual functions, and modify PFDs simply by pointing and clicking.

Node Types Window

When you open a project, the Node Types window appears along with the Enterprise Miner Workspace window.

Figure 3: Node Types Window

The Node Types window functions as a palette, which displays the data mining nodes that are available for constructing PFDs. When you place the cursor over a node icon, the name of the node appears in a pop-up field. Using the mouse, you can drag and drop nodes from the Node Types window onto the Enterprise Miner Workspace window and connect the nodes in the desired process flow.

Utility Nodes

In addition to nodes that perform specific data mining steps, such as sampling or data partitioning, Enterprise Miner includes the following utility nodes:

• The SAS Code node enables you to submit SAS software programming statements.

• The Control Point node enables you to establish a control point in process flow diagrams. A Control Point node can be used to reduce the number of connections that are made.

• The Sub-diagram node enables you to group a portion of a PFD into sub-units, which are themselves diagrams. For complex PFDs, you may want to create sub-diagrams to display various levels of detail.

• The Data Mining Database node enables you to create a data mining database (DMDB) for batch processing. For non-batch processing, DMDBs are automatically created as they are needed.

Common Features Among Nodes

The nodes of Enterprise Miner software have a uniform look and feel. For example, the tabbed dialog pages enable you to quickly access the appropriate options; the common node functionality enables you to learn the usage of the nodes quickly; and the Results Browser, which is available in many of the nodes, enables you to view the results of running the process flow diagrams.



Message Window

The Message window displays messages generated by the creation or execution of a PFD. It is hidden by default, and it can be toggled on and off using the View pull-down menu.

The SEMMA Methodology

One of the keys to the effectiveness of Enterprise Miner is the fact that the GUI makes it easy for business analysts as well as quantitative experts to organize data mining projects into a logical framework. The visible representation of this framework is a PFD. It graphically illustrates the steps taken to complete an individual data mining project.

In addition, a larger, more general, framework for staging data mining projects exists. This larger framework for data mining is the SEMMA methodology as defined by SAS Institute. SEMMA is simply an acronym for “Sample, Explore, Modify, Model, and Assess.” However, this logical superstructure provides users with a comprehensive method in which individual data mining projects can be developed and maintained. Not all data mining projects will need to follow each step of the SEMMA methodology, because the tools in Enterprise Miner software give users the freedom to deviate from the process to meet their needs, but the methodology does give users a scientific, structured way of conceptualizing, creating, and evaluating data mining projects.

You can use the nodes of Enterprise Miner to follow the steps of the SEMMA methodology; the methodology is a convenient and productive superstructure in which to view the nodes logically and graphically. In addition, part of the strength of Enterprise Miner software comes from the fact that the relationship between the nodes and the methodology is flexible.

The relationship is flexible in that you can build PFDs to fit particular data mining requirements. You are not constrained by the GUI or the SEMMA methodology. For example, in many data mining projects, you may want to repeat parts of SEMMA by exploring data and plotting data at several points in the process. For other projects, you may want to fit models, assess those models, re-fit new data to the models, and then re-assess.

Sampling Nodes

In data-intensive applications such as data mining, using a sample of data as input rather than an entire database can greatly reduce the amount of time required for processing. If you can ensure the sample data are sufficiently representative of the whole, patterns that appear in the entire database also will be present in the sample. Although Enterprise Miner does not require the use of sample data, using the GUI and the sampling nodes, you can complete sophisticated data mining projects with a minimum amount of time and effort.

Warehouse Data

For obtaining samples of data from data warehouses and other types of data stores, Enterprise Miner provides a unique advantage: complete connectivity to the Institute’s award-winning SAS/Warehouse Administrator™ software.

Input Data Source Node

The Input Data Source node enables you to access and query SAS data sets and other types of data structures that will be used for data mining projects. You can use more than one Input Data Source node in a single project to define multiple sources of input.

The node includes a set of dialog pages and other windows that enable you to specify the details about the data source and the variables in the data source.

Dialog Pages

The interface to the Input Data Source node is a dialog box that includes the following pages:

Data

When you open an Input Data Source node, a Data dialog page is displayed in which you can enter the name of the input data set. After entering the name, you press the return key to begin pre-processing the data source. Pre-processing includes the automatic generation of meta data, such as model roles and summary statistics.

By default, the Input Data Source node uses a sample of 2000 observations to determine the meta data. Samples are stored on the local client when you run a remote project. They are used for several tasks, such as viewing a variable’s distribution in a histogram, determining variable hierarchies in the Variable Selection node, and as input to the Insight node.

You can control the size of the meta data sample as well as the purpose of the input data source. By default, the purpose of the sample is for use as raw data. Alternatively, you can define the purpose of the data source to be training, validating, testing, or scoring.

The meta data sample is not intended for use beyond the Input Data Source node. To obtain a sample to be used for modeling or other data mining analysis, you should use the Sampling node, which is specifically designed to give you many options for sampling data.

SQL Query Window

You can also use the SQL Query window to define a data source. By selecting the Query button, you open the SQL Query window, which enables you to build a SAS view in a point-and-click environment.

After the SQL view is completed, Enterprise Miner pre-processes the view for faster processing in subsequent mining activities.

When data pre-processing is completed, the Data dialog page re-opens and displays meta data including the name of the view, its description, its role or purpose in the project, and measurements of size. This automatic assignment of variable roles and measurement levels by Enterprise Miner saves you from the tedious task of defining this information for each variable.

Variables

The Variables dialog page contains a data table, which enables you to redefine model roles, measurement levels, formats, and labels for the variables in the input data. These settings are determined initially by Enterprise Miner during pre-processing of the data source. The settings are based on the meta data sample. Changing the model role or the measurement level for a variable is as easy as placing the cursor in the desired cell and single clicking the right mouse button. This action opens a pop-up menu from which you can make changes.

Other features of the Variables dialog page enable you to

• specify formats and labels for the variables by typing in the values directly

• sort a column in ascending order or subset it (if relevant) by its values

• display a histogram showing the distribution of any variable.

Interval Variables

The Interval Variables dialog page displays a sortable table of summary statistics for all interval variables from the input data set including minimum and maximum values, the mean, standard deviation, percentage of observations with missing values, and the skewness and kurtosis.
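The statistics listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the node's implementation: the function name `interval_summary` is hypothetical, missing values are represented by `None`, and skewness and kurtosis use simple moment-based formulas that may differ from the bias-adjusted estimates the software reports.

```python
import math

def interval_summary(values):
    """Summarize one interval variable; None marks a missing value.

    Assumes at least two non-missing values that are not all equal.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    # Central moments in population form (dividing by n).
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "std": math.sqrt(m2 * n / (n - 1)),   # sample standard deviation
        "pct_missing": 100.0 * (len(values) - n) / len(values),
        "skewness": m3 / m2 ** 1.5,           # moment-based, not bias-adjusted
        "kurtosis": m4 / m2 ** 2 - 3.0,       # excess kurtosis
    }

# Hypothetical variable with one missing observation.
summary = interval_summary([1.0, 2.0, 3.0, 4.0, None])
```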

Class Variables

The Class Variables dialog page displays a sortable table of information about all non-interval level variables from the input data set including the name of the variable, the number of unique values, the order in which the values are sorted in the data set, and the names of other variables from the input data set on which the variable depends. The Class Variables dialog page is where you set the target level for class variables.

Sampling Node

The Sampling node enables you to extract a sample of your input data source. Sampling is recommended for extremely large databases, because it can tremendously decrease model fitting time. The Sampling node performs simple random sampling, nth-observation sampling, stratified sampling, or first-n sampling of an input data set. For any type of sampling, you can specify either a number of observations or a percentage of the population to select for the sample. The Sampling node writes the sampled observations to an output data set. The Sampling node saves the seed values used to generate the random numbers for the samples so that you may replicate the samples.

The Sampling node must be preceded by a node that exports at least one raw data table, which is usually an Input Data Source node.



To partition the sample into training, validation, and test data sets, follow the Sampling node with a Data Partition node. Exploratory and modeling nodes also can follow a Sampling node.

After you run the Sampling node, you can use the Actions pull-down menu to browse the results. The Sampling Results window is a tabbed dialog interface that provides easy access to a table view of the sample data set, the Output window, the SAS Log, and the Notes window.

Dialog Pages

The interface to the Sampling node is a dialog box that includes the following pages:

General

In the General dialog page, you specify the sampling methods, the sampling size, and the random seed.

Sampling Methods – The Sampling node supports simple random sampling (the default), sampling every nth observation, stratified sampling, and sampling the first n observations.

• Every nth Observation – With nth observation sampling (also called systematic sampling), the Sampling node computes the percentage of the population that is required for the sample, or uses the percentage specified in the General tab.

• Stratified Sampling – In stratified sampling, you specify categorical variables from the input data set to form strata (or subsets) of the total population. Within each stratum, all observations have an equal probability of being selected for the sample. Across all strata, however, the observations in the input data set generally do not have equal probabilities of being selected for the sample. You perform stratified sampling to preserve the strata proportions of the population within the sample. This may improve the classification precision of fitted models.

• First n Observations – With first n sampling, the Sampling node selects the first n observations from the input data set for the sample. You can specify either a percentage or an absolute number of observations to sample in the General tab.

• Cluster – Cluster specifies that the sample data is to be based on a cluster variable. For values of the cluster variable that are selected by the Sampling node, all records associated with the selected cluster are included in the sample. Selecting cluster sampling on the General dialog page enables (ungrays) the Cluster dialog page, which is where you specify the cluster variable, the cluster sampling method, and the number of clusters.
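The sampling methods described above can be illustrated with a minimal Python sketch. It is not the Sampling node's code: all function names, the record layout, and the seed values are hypothetical, systematic sampling here simply starts at the first observation, and stratified sampling uses proportional allocation within each stratum.

```python
import random
from collections import defaultdict

def simple_random(rows, n, seed):
    # Every observation has the same chance of selection; the seed
    # makes the sample reproducible.
    return random.Random(seed).sample(rows, n)

def every_nth(rows, fraction):
    # Systematic sampling: keep every nth row, where n = 1 / fraction.
    step = round(1 / fraction)
    return rows[::step]

def first_n(rows, n):
    return rows[:n]

def stratified(rows, key, fraction, seed):
    # Draw the same fraction from each stratum so that the strata
    # proportions of the population are preserved in the sample.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

def cluster(rows, key, n_clusters, seed):
    # Select whole clusters; every record of a chosen cluster is kept.
    rng = random.Random(seed)
    ids = sorted({key(r) for r in rows})
    chosen = set(rng.sample(ids, n_clusters))
    return [r for r in rows if key(r) in chosen]

# Hypothetical input: 100 records, 25 in region "N" and 75 in region "S",
# spread evenly over 10 stores.
rows = [{"id": i, "region": "N" if i % 4 == 0 else "S", "store": i % 10}
        for i in range(100)]
strat = stratified(rows, lambda r: r["region"], 0.2, seed=42)
clus = cluster(rows, lambda r: r["store"], 3, seed=1)
```

Reusing the same seed reproduces the same sample, which is the behavior the saved seed value is meant to support.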

Sample Size – You can specify sample size as a percentage of the total population, or as an absolute number of observations to be sampled.

Random Seed – The Sampling node displays the seed value used in the random number function for each sample. The default seed value is a random number generated from a call to the system clock. You may type in a new seed directly, or select the Generate New Seed button to have the Sampling node generate a new seed automatically. The Sampling node saves the seed value used for each sample so that you may replicate samples exactly.

Stratification Variables

The Stratification dialog page contains two sub-pages that enable you to control the variables and options used for stratification.

Variables – The Variables sub-page contains a data table that lists the variables that are appropriate for use as stratification variables. Stratification variables must be categorical (binary, ordinal, or nominal). Continuous variables have too many levels with too few observations per level to form useful strata. For example, a variable with two levels of gender or a variable with five ordinal categories of income are appropriate stratification variables. A variable with interval-level measurements of income is not an appropriate stratification variable.

Options – In the Options sub-page you may specify the stratification criteria, a deviation variable, and the minimum stratum size for each stratum.

Data Partition Node

Most data mining projects have large volumes of data that can be sampled with the Sampling node. After sampling, the data can be partitioned with the Data Partition node before you begin constructing data models. The Data Partition node enables you to partition the input data source or sample into data sets for the following purposes:

• Training – used to fit initial models.



• Validation – used by default for model assessment. A validation data set also is used for fine-tuning the model. The Decision Tree and Neural Network nodes have the capacity to over-fit training data sets. To prevent these nodes from over-fitting, validation data sets are automatically used to retreat to a simpler fit than the fit based on the training data alone. Validation data sets also can be used by the Regression node as a stepwise selection criterion.

• Testing – an additional data set type that can be used for model assessment.

Partitioning provides mutually exclusive data sets for cross validation and model assessment. A mutually exclusive data set does not share common observations with another data set. Partitioning the input data also helps to speed preliminary model development.

Simple or stratified random sampling is performed to partition the input observations into training, validation, and test data sets. You specify the proportion of sampled observations to write to each partitioned data set. The Data Partition node saves the seed values used to generate the random numbers for the samples so that you can replicate the data sets.

Dialog Pages

The interface to the Data Partition node is a dialog box that includes the following pages:

Partition

In the Partition page, you can specify the

• sampling method used to write observations to the partition data sets

• random seed used to generate the random sample

• percentage of observations to allocate to the training, validation, and test data sets.

Sampling Methods

The Data Partition node supports simple random sampling, which is the default, and stratified sampling.

• Simple Random Sampling – In simple random sampling, every observation in the data set has the same probability of being written to the sample. For example, if you specify that 40 percent of the population should be selected for the training data set, then each observation in the input data set has a 40 percent chance to be selected.

• Stratified Sampling – In stratified sampling, you specify variables from the input data set to form strata (or subsets) of the total population. Within each stratum, all observations have an equal probability of being selected for the sample. Across all strata, however, the observations in the input data set generally do not have equal probabilities of being selected for the sample. You perform stratified sampling to preserve the strata proportions of the population within the sample. This may improve the classification precision of fitted models.
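A simple random partition into mutually exclusive training, validation, and test data sets can be sketched as follows. This is a minimal illustration, not the Data Partition node's algorithm: the function name `partition`, the 40/30/30 split, and the seed value are assumptions. The sketch shuffles once and slices, so each observation still has a 40 percent chance of landing in the training set while the set sizes come out exact.

```python
import random

def partition(rows, train=0.4, validation=0.3, seed=12345):
    """Split rows into disjoint training, validation, and test lists.

    Whatever is left after the training and validation shares goes to
    the test set. The seed makes the partition reproducible.
    """
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the input order is untouched
    rng.shuffle(shuffled)
    n_train = round(train * len(shuffled))
    n_valid = round(validation * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Hypothetical input: observation identifiers 0..999.
rows = list(range(1000))
train_set, valid_set, test_set = partition(rows)
```

Because the three lists are slices of one shuffled copy, no observation appears in more than one data set.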

Exploration and Modification Nodes

Data mining is a dynamic, iterative process through which you can gain insights at various stages of a project. One perspective on a problem can lead to another and to the need for further modification and exploration. Enterprise Miner gives you numerous tools and techniques to help you explore and modify your data including:

• Graphical Displays from simple graphs to multidimensional bar charts

• Outlier Filters to stabilize parameter estimates

• Transformations that enable you to normalize or linearize the data and to stabilize the variance

• Advanced Visualization Techniques including OLAP, which enable you to interact with multidimensional graphical displays to evaluate issues such as data structure and the applicability of variables.

Associations Node

The Associations node enables you to perform either association or sequence discovery. Association discovery enables you to identify items that occur together in a given event or record. The technique is also known as market basket analysis. Rules of the form “if A, then B” are discovered based on frequency counts of the number of times items occur alone and in combination in a database. The rules can be expressed in statements such as “if item A is part of an event, then item B is also part of the event X percent of the time.”

Such rules should not be interpreted as a direct causation, but as an association between two or more items. However, identifying credible associations can help the business technologist make business decisions such as when to distribute coupons, when to put a product on sale, or how to lay out items in a store.

Dialog Pages

The interface to the Associations node is a dialog box that includes the following pages:

General

The General dialog page enables you to set the following rule-forming parameters:

• Minimum Transaction Frequency – a measure of support that indicates the percentage of times the items occur together in the database.

• Maximum Number of Items in an Association – determines the maximum size of the item set to be considered. For example, the default of 4 items indicates that you wish to examine up to 4-way associations.

• Minimum Confidence – specifies the minimum confidence level to generate a rule. The default level is 10 percent.
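To make the support and confidence thresholds concrete, here is a minimal Python sketch that forms two-item rules from frequency counts. It is an illustration only, not the Associations node's algorithm: the function name `pair_rules`, the sample baskets, and the 5 percent default for minimum support are assumptions (the 10 percent confidence default comes from the text), and the sketch stops at 2-way associations rather than the 4-way default.

```python
from itertools import combinations

def pair_rules(transactions, min_support=0.05, min_confidence=0.10):
    """Return (lhs, rhs, support, confidence) for two-item rules.

    support    = share of transactions containing both items
    confidence = share of transactions containing lhs that also contain rhs
    """
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    count = {i: sum(i in b for b in baskets) for i in items}
    rules = []
    for a, b in combinations(items, 2):
        both = sum(a in bk and b in bk for bk in baskets)
        support = both / n
        if support < min_support:
            continue
        # Each qualifying pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = both / count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical market baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rules = pair_rules(transactions, min_support=0.5, min_confidence=0.6)
```

In this toy data, butter appears in two baskets and bread appears in both of them, so the rule "if butter, then bread" has 50 percent support and 100 percent confidence.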

Sequences

The Sequences page enables you to set parameters that are used to form sequence rules. For example, you may want to set a minimum support for sequences to filter out items from the sequence analysis that occur too infrequently in the database.

Advanced

The Advanced page enables you to set the following rule-forming parameters:

• Calculate Maximum Number of Associations Using 2**n – This option determines the maximum number of associations that are computed.

• Customize Sort – By default, if the Associations node creates 100,000 rules or fewer, then it sorts the support values in descending (highest to lowest) order within each relation in the Associations results table. If there are more than 100,000 rules, then the sort routine is not executed; the node runs faster but the support values are listed in ascending order within each relation when you view the results in the browser.

Sequence Discovery

The Associations node also enables you to perform sequence discovery. Sequence discovery goes a step further than association discovery by taking into account the ordering of the relationships (time sequence) among items. For example, sequence discovery rules may indicate relationships such as, “Of those customers who currently hold an equity index fund in their portfolio, 15 percent of them will open an international fund in the next year,” or, “Of those customers who purchase a new computer, 25 percent of them will purchase a laser printer in the next month.”

3-Dimensional Histogram

You can view a grid plot of the left side and right side items simply by selecting the “Graph” item from the View pull-down menu. The support for each rule determines the size of the square in the plot. The confidence level establishes the color of the square. A confidence density legend is annotated in the bottom left of the graph.

Subsetting

You can subset the table to see a more specific set of rules. For example, you can subset only those rules having a lift greater than one.
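The paper does not define lift; under the usual definition, the lift of a rule "if A, then B" is the support of A and B together divided by the product of the individual supports of A and B, so a value above one means the items occur together more often than independence would predict. The sketch below applies that definition to subset a rules table. The rule records and their support values are hypothetical.

```python
def lift(support_both, support_lhs, support_rhs):
    # Ratio of observed co-occurrence to the co-occurrence expected
    # if the two sides of the rule were independent.
    return support_both / (support_lhs * support_rhs)

# Hypothetical rules table with precomputed supports.
rules = [
    {"lhs": "bread", "rhs": "butter",
     "support": 0.50, "support_lhs": 0.75, "support_rhs": 0.50},
    {"lhs": "bread", "rhs": "milk",
     "support": 0.50, "support_lhs": 0.75, "support_rhs": 0.75},
]

# Keep only the rules whose lift is greater than one.
subset = [r for r in rules
          if lift(r["support"], r["support_lhs"], r["support_rhs"]) > 1.0]
```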

Bar Chart Node

The Bar Chart node is an advanced visualization tool that enables you to explore large volumes of data graphically. You can use the node to uncover patterns and trends, and to reveal extreme values in the database. You can generate multi-dimensional histograms for discrete or continuous variables. The node is fully interactive: you can rotate a chart to different angles and move it anywhere on the screen. You can also probe the data by positioning the cursor over a particular bar within the chart. A text window displays the values that correspond to that bar.

Interactive Tools

The Bar Chart node includes easy-to-use input fields, buttons, and sliders that you use to interact with the graphical display. Pull-down menus and a Histograms Toolbar provide additional functionality to explore the data.

Clustering Node

The Clustering node performs observation clustering, which can be used to segment databases. Clustering places objects into groups or clusters suggested by the data. The objects in each cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. If obvious clusters or groupings could be developed prior to the analysis, then the clustering analysis could be performed by simply sorting the data.

The clustering methods perform disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative variables and seeds that are generated and updated by the algorithm. You can specify the clustering criterion that is used to measure the distance between data observations and seeds. The observations are divided into clusters such that every observation belongs to at most one cluster.
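The idea of seeds that are updated while observations are assigned to the nearest seed can be sketched with plain k-means (Lloyd's algorithm) under the least-squares criterion. This is a stand-in for illustration, not the Clustering node's implementation: the function names, the choice of the first k observations as initial seeds, and the fixed iteration count are all assumptions.

```python
def nearest(point, seeds):
    # Index of the seed with the smallest squared Euclidean distance.
    distances = [sum((a - b) ** 2 for a, b in zip(point, s)) for s in seeds]
    return distances.index(min(distances))

def kmeans(points, k, iterations=20):
    # Assumption: the first k observations serve as the initial seeds.
    seeds = [list(p) for p in points[:k]]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest(p, seeds)].append(p)
        # Move each seed to the mean of the observations assigned to it.
        for i, group in enumerate(members):
            if group:
                seeds[i] = [sum(col) / len(group) for col in zip(*group)]
    labels = [nearest(p, seeds) for p in points]
    return seeds, labels

# Hypothetical observations forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
seeds, labels = kmeans(points, k=2)
```

Each observation receives exactly one label, which corresponds to the cluster identifier that can be passed on to other nodes.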

After clustering is performed, the characteristics of the clusters can be examined graphically using the results browser. Often of interest is the consistency of clusters across variables. The three-dimensional charts and plots enable you to graphically compare the clusters.

Lastly, the cluster identifier for each observation can be passed to other nodes for use as an input, id, group, or target variable. For example, you could form clusters based on different age groups you want to target. Then you could build predictive models for each age group by passing the cluster variable as a group variable to a modeling node.

Dialog Pages

The interface to the Clustering node is a dialog box that includes the following pages:

Clusters

In the Clusters dialog page, you specify options for the maximum number of clusters. The terms segment and profile segment both refer to a cluster of observations; therefore, for practical purposes the terms segment and cluster are equivalent.

Segment Identifier – The segment identifier consists of the variable name, its label, and its role in the model.

Maximum Number of Clusters – The default maximum number of clusters is 10. You may want to perform cluster analysis using different values for the maximum number of clusters. A preliminary cluster analysis may identify outlying observations. Severe outlier problems can distort the remaining clusters.

Seeds

The Seeds dialog page consists of three sub-pages.

General – The General sub-page of the Seeds dialog page enables you to specify the clustering criterion. The default clustering criterion is Least Squares (OLS), and clusters are constructed so that the sum of the squared distances of observations to the cluster means is minimized. Other criteria you can specify are as follows:

• Mean Absolute Deviation (Median)
• Modified Ekblom-Newton
• Root-Mean-Square Difference
• Least Squares (fast)
• Least Squares
• Newton
• Midrange.

Initial – The Initial sub-page of the Seeds dialog page specifies how the cluster seeds are to be updated or replaced.

Final – The Final sub-page of the Seeds dialog page controls the stopping criterion for generating cluster seeds.

Missing Values

In the Missing Values dialog page, you specify how data observations containing some missing values are to be handled. Observations that have missing values cannot be used as cluster seeds. You can choose to handle missing values by excluding the observations containing missing values, or by selecting one of the available methods for imputing values for the missing values.

The imputation methods include

• Seed of Nearest Cluster
• Mean of Nearest Cluster
• Conditional Mean
• Multiple Mean
• Multiple Stochastic.

When performing imputations, you can further specify that unequal variances be accounted for, by using the options for Unequal Variance Adjustment.



Output

The Output dialog page consists of the following sub-pages:

• Print Page – You use the Print sub-page of the Output dialog page to specify the output to be produced. The default output is the “Cluster Statistics.” Optionally, you can specify “No Output,” “Distance Between Cluster Mean,” and “Cluster Listing.”

• Statistics Data Sets – The Statistics Data Sets sub-page of the Output dialog page lists data sets that contain the cluster statistics and the seed statistics. The cluster statistic data set contains statistics about each cluster.

• Clustered Data – The Clustered Data sub-page of the Output dialog page lists the data libraries and output data sets for training, validation, testing, and scoring.

Viewing the Results

After you run the Clustering node, you can view the results by using the Results Browser.

• Partition Page – The Partition dialog page provides a graphical representation of key characteristics of the clusters. A three-dimensional pie chart displays the count (slice width), the variability (height), and the central tendency (color). The labels correspond to the segment number. The toolbox enables you to interact with the pie chart.

• Cluster Distances Page – The Cluster Distances dialog page provides a graphical representation of the size of each cluster and the relationship among clusters.

• Cluster Profiles Page – The Cluster Profiles dialog page provides a graphical representation of the input variables for each cluster.

• Statistics Page – The Statistics dialog page displays information about each cluster in a tabular format. To sort the table, single click on the appropriate column heading. To reverse the sort order, single click again on the column heading.

• Output Page – The Output dialog page displays the output from running the underlying SAS software programming statements.

Data Replacement Node

Data sources can contain records that have missingvalues for one or more variables, which can be theresult of any number of situations such as data entryerrors, incomplete customer responses, or transactionsystem and measurement failures.

By default, if an observation contains a missing value,then Enterprise Miner will not use that observation formodeling by the Variable Selection, Neural Network, orRegression nodes. As a remedy, you could discardincomplete observations, but that may mean throwingaway useful information from the variables that havenon-missing values. Discarding incompleteobservations also may bias the sample, becauseobservations that have missing values may have othercharacteristics in common as well. As an alternative todiscarding observations, you could use the DecisionTree node to define a decision alternative or surrogaterules to group missing values into a special category.You also could use the Clustering node to impute (fillin) missing values in the Missing Values dialog tab.

Another alternative is to use the Data Replacement node to replace missing values for interval and class variables. The Data Replacement node provides the following interval replacement statistics:

• mean
• median
• midrange.

Using the Data Replacement node, you can impute missing values for class variables with the variable’s mode or leave the value as missing. You can customize the default replacement statistics by specifying your own replacement values for missing and non-missing data. Missing values for the training, validation, test, and score data sets are replaced using replacement statistics that are calculated from the training predecessor data set or from the meta data sample file of the Input Data Source node.

If you created more than one predecessor data set that has the same model role, then the Data Replacement node automatically chooses one of the data sources. If a valid predecessor data set exists, then you can assign a different data source to a role.


Default Method

You set the default interval and class imputation statistics in the Default Method dialog page. Interval variables contain continuous values, such as AGE and INCOME. Class variables have discrete levels, such as DEPT or ITEM. You also specify whether to calculate the imputation statistic based on the sample or the entire training data set.

The default imputation statistic is used to impute missing values for all variables of the same type (interval or class). You can assign different imputation statistics to different variables, and specify your own imputation criteria in the Interval Variables or Class Variables tabs.

The Default Method dialog page includes radio buttons that enable you to set the default interval imputation statistic as one of the following:

• Mean – the arithmetic average, which is calculated as the sum of all values divided by the number of observations. The mean is the most common measure of a variable’s central tendency; it is an unbiased estimate of the population mean.

• Median – the 50th percentile, which is either the middle value or the arithmetic mean of the two middle values for a set of numbers arranged in ascending order.

• Midrange – the midpoint of the range, which is calculated as the average of the maximum and minimum values.

• None – do not replace the missing values.

The mean may be preferable for data replacement when the variable values have, in general, a normal distribution. The mean and median are equal for a normal distribution. The median is less sensitive to extreme values than is the mean or midrange. Therefore, the median is preferable when you want to replace missing values for variables that have skewed distributions. The median is also useful for ordinal data. The midrange is a rough measure of central tendency that is easy to calculate.
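The three interval replacement statistics can be illustrated with a short Python sketch. This is not Enterprise Miner code; the function name `impute_interval`, the use of `None` for a missing value, and the sample INCOME values are assumptions made for illustration.

```python
from statistics import mean, median

def impute_interval(values, method="mean"):
    """Return a copy of `values` with None replaced by the chosen statistic.

    Hypothetical helper: the statistic is computed from the non-missing values.
    """
    if method == "none":
        return list(values)          # leave missing values as missing
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    elif method == "midrange":
        fill = (min(observed) + max(observed)) / 2
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

# Illustrative, right-skewed INCOME values with one missing entry.
income = [20, 30, None, 40, 210]
for method in ("mean", "median", "midrange"):
    print(method, impute_interval(income, method))
```

With one large value in the data, the mean (75) and midrange (115) are pulled toward the extreme, while the median (35) stays near the bulk of the observations.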

By selecting a radio button, you can specify how to replace missing values for class variables. The two options are

• Most Frequent – replaces missing class values with the variable’s mode, which is the value that occurs with the greatest frequency. For example, if MEDIUM is the most common value for SIZE, then all missing values for SIZE are replaced with a value of MEDIUM.

• None – missing values are left as missing.
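A minimal Python sketch of the two class options follows. The function name `impute_class` and the sample SIZE values are hypothetical, and ties for the mode are broken by first occurrence, which is an assumption rather than documented behavior.

```python
from collections import Counter

def impute_class(values, method="most_frequent"):
    """Replace None with the mode of the non-missing values, or leave as-is."""
    if method == "none":
        return list(values)
    observed = [v for v in values if v is not None]
    mode_value, _count = Counter(observed).most_common(1)[0]
    return [mode_value if v is None else v for v in values]

size = ["SMALL", "MEDIUM", None, "MEDIUM", "LARGE", None]
print(impute_class(size))
```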

You also can specify whether the data source used to calculate the interval and class imputation statistics is the meta data sample (created when you run the Input Data Source node), or the entire training data set. The meta data sample is used by other nodes to perform many tasks, such as setting variable roles and calculating summary statistics, viewing a variable’s distribution in a histogram, and determining variable hierarchies. You may want to use the meta data sample file for data replacement if the entire training data set is very large.

Weights

The Weights dialog page enables you to specify a weight variable that contains relative weights for each observation in the data source. The observation weights are used to calculate a weighted mean. Weight variables are not used to calculate the median or midrange.
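As a sketch of how a weighted mean could serve as the replacement value, the following Python fragment weights each non-missing observation by its weight variable. The names `weighted_mean`, `age`, and `weight` are illustrative assumptions, as is the choice to ignore the weights of observations whose value is missing.

```python
def weighted_mean(values, weights):
    """Weighted mean of the non-missing values (None marks a missing value)."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight

age = [30, None, 50, 40]
weight = [1.0, 2.0, 3.0, 1.0]
fill = weighted_mean(age, weight)   # (30*1 + 50*3 + 40*1) / (1 + 3 + 1)
print(fill)
```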

Interval Variables

The Interval Variables dialog page enables you to customize the default interval method that you set in the Default Method tab. The Interval Variables tab lists the name, model role, status, imputation statistic, replacement values for non-missing data, format, and label.

The Interval Variables dialog page displays the default imputation statistic. You can assign different imputation statistics to different variables or specify your own numeric replacement values.

You also can use one of these methods to replace non-missing values that are less than (or greater than) a particular value with a new value. These data replacement methods enable you to replace extreme values on either side of the variable’s distribution with a more centrally located data value.

Before non-missing values are replaced with a new value, missing values are imputed with the method displayed in the Imputation Method column.
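The order of operations described above (impute first, then replace extreme non-missing values) can be sketched in Python as follows. The function `impute_then_cap`, its parameters, and the sample SALARY values are hypothetical and only illustrate the sequence.

```python
def impute_then_cap(values, fill, low, low_value, high, high_value):
    """Impute missing values, then replace values below `low` or above `high`."""
    imputed = [fill if v is None else v for v in values]   # step 1: imputation
    capped = []
    for v in imputed:                                      # step 2: replacement
        if v < low:
            capped.append(low_value)
        elif v > high:
            capped.append(high_value)
        else:
            capped.append(v)
    return capped

salary = [12, None, 35, 40, 900]
print(impute_then_cap(salary, fill=38, low=20, low_value=20,
                      high=100, high_value=100))
```

Because imputation runs first, an imputed value is itself subject to the replacement rule if it falls outside the chosen bounds.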

Class Variables

The Class Variables dialog page enables you to customize the default class imputation statistic that is set to either “Most Frequent” or “None” in the Default Method tab. The Class Variables dialog page lists the name, model role, status, imputation statistic, replacement values, measurement level, type, format, and label. You can assign different imputation statistics to different variables or specify your own replacement value.

Managing Data Replacement Graphs

The tool box at the top of the Enterprise Miner GUI enables you to manage the bar charts and histograms that are displayed in the Select Value window, including functions to do the following:

• print the graph
• paste it to the clipboard
• select points on the graph
• display a text box that lists the frequency and variable value for a bar
• move the graph
• zoom in and out.

The File pull-down menu also contains items that are useful for managing the graph, including functions for saving the graph as a bmp, gif, tif, or ps file; printing the graph to your default printer; and e-mailing the graphical image to others.

Data Set Attributes Node

The Data Set Attributes node enables you to modify data set attributes, such as data set names, descriptions, and roles. You also can use this node to modify the meta data sample that is associated with a data set. For example, you could generate a data set in the SAS Code node and then modify its meta data sample with the Data Set Attributes node.

Filter Outliers Node

The Filter Outliers node enables you to apply a filter to your data to exclude observations, such as outliers or other observations that you do not want to include in further data mining analysis. Filtering extreme values from the data tends to produce better models because the parameter estimates are more stable.

Automatic Filter Options Window

The Automatic Filter Options window enables you to

• eliminate rare values for classification variables with fewer than 25 unique values

• eliminate extreme values for interval variables.

Classification Variable with Fewer than 25 Unique Values

This automatic filter option enables you to eliminate rare values of classification variables, such as REGION, that have n or fewer unique levels. You eliminate rare class values because of the lack of precision in their parameter estimates.

Eliminate Extreme Values for Interval Variables

This automatic filter option enables you to eliminate extreme values for interval variables, such as SALARY, using one of the following methods:

• Standard Deviations from the Mean – eliminates values that are more than n standard deviations from the mean.

• Extreme Percentiles – eliminates values that are in the top and bottom pth percentile. For measures arranged in order of magnitude, the bottom pth percentile is the value such that p percent of the interval measurements are less than or equal to that value.

• Modal Centroid – eliminates values more than n spacings from the modal center.

• Median Absolute Deviations (MAD) – eliminates values that are more than n deviations from the median.
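Two of the methods in the list above, Standard Deviations from the Mean and Median Absolute Deviations, can be sketched in Python. The function names and sample SALARY values are assumptions, the population standard deviation is used for simplicity, and the percentile and modal centroid methods are omitted because the text does not define them precisely enough to reproduce.

```python
from statistics import mean, median, pstdev

def keep_within_std(values, n):
    """Keep values no more than n standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) <= n * s]

def keep_within_mad(values, n):
    """Keep values no more than n median absolute deviations from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) <= n * mad]

salary = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]
print(keep_within_std(salary, 2))   # drops 400
print(keep_within_mad(salary, 3))   # drops 400
print(keep_within_std(salary, 3))   # keeps 400
```

In this sample the single extreme value inflates the standard deviation enough that a 3-standard-deviation rule retains it, whereas the MAD rule, which is built on medians, still removes it.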

Apply Filter Window

The Apply Filter window enables you to examine and adjust the automatic filtering options that apply to class variables and interval variables.

Class Variables

The Class Variables dialog page is a data table that lists the names and labels of all the classification variables in the input data set. This page also lists the minimum frequency cutoff and excluded values. Pop-up menus are provided that enable you to adjust the minimum frequency cutoff value, view histograms of the density of each level of a class variable, and adjust settings interactively.
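The idea of a minimum frequency cutoff can be sketched in Python as follows. The function `filter_rare_levels`, the row layout (one dictionary per observation), and the sample REGION values are illustrative assumptions, not the node's actual implementation.

```python
from collections import Counter

def filter_rare_levels(rows, variable, min_count):
    """Drop observations whose level of `variable` occurs fewer than min_count times.

    Returns the retained rows and the set of excluded levels.
    """
    counts = Counter(row[variable] for row in rows)
    excluded = {level for level, count in counts.items() if count < min_count}
    kept = [row for row in rows if row[variable] not in excluded]
    return kept, excluded

rows = [{"REGION": r} for r in ["EAST"] * 5 + ["WEST"] * 4 + ["NORTH"]]
kept, excluded = filter_rare_levels(rows, "REGION", min_count=2)
print(len(kept), excluded)
```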

Interval Variables

The Interval Variables dialog page is a data table that lists the name, label, and range of each interval variable in the input data set. The range for each interval variable is determined by the method you selected in the Automatic Filter Options window. Observations outside of the range are excluded from the output data set. Pop-up menus and tool boxes enable you to make adjustments to the range of any interval variable and display the results graphically.

Group Processing Node

The Group Processing node enables you to define group variables, such as GENDER, to obtain separate analyses for each level of the grouping variable(s). If you defined more than one target variable in the Input Data Source node, then a separate analysis is also done for each target. You can have one Group Processing node per process flow diagram. By default, group processing occurs automatically when you run a process flow diagram that contains a Group Processing node.

Dialog Pages

The interface to the Group Processing node is a dialog box that includes the following pages:

General

The General dialog page enables you to define whether or not you want to perform group processing when the process flow diagram is run. By default, the node loops through each level of the group variable(s) when you run the process flow diagram. When training preliminary models, you can improve runtime performance by selecting an option to suppress the automatic looping. The General dialog page also displays the number of targets used, number of groups, and total number of loops.

Group Variables

The Group Variables dialog page contains a data table that enables you to specify various characteristics of variables used in group processing such as:

• Whether group variables are used as input or for grouping.

• The levels contained in each group variable. By default, group processing is performed on all levels of the group variable. If a group variable has several levels, you may want to perform group processing on only a few levels.

• The sort sequence of the variables.

Target Variables

The Target Variables dialog page contains a table that enables you to define the target variables that you want to use for group processing.

By default, all variables that you defined as targets in the Input Data Source node are used as targets during group processing (each target is analyzed separately when the process flow diagram is run). Processing only the desired targets reduces group processing time.
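The looping behavior can be pictured with a small Python sketch that runs one analysis per combination of group level and target. The generator `group_loops`, the sample rows, and the use of a simple mean as the stand-in "analysis" are all assumptions; the sketch also assumes that the total number of loops is the number of groups multiplied by the number of targets, which the text implies but does not state.

```python
from itertools import product
from statistics import mean

def group_loops(rows, group_vars, targets):
    """Yield (group level, target, subset of rows) for every loop iteration."""
    levels = sorted({tuple(row[g] for g in group_vars) for row in rows})
    for level, target in product(levels, targets):
        subset = [row for row in rows
                  if tuple(row[g] for g in group_vars) == level]
        yield level, target, subset

rows = [
    {"GENDER": "F", "SPEND": 120, "VISITS": 4},
    {"GENDER": "M", "SPEND": 80, "VISITS": 2},
    {"GENDER": "F", "SPEND": 100, "VISITS": 6},
    {"GENDER": "M", "SPEND": 60, "VISITS": 4},
]

# Stand-in analysis: the mean of the target within each group.
results = {
    (level, target): mean([row[target] for row in subset])
    for level, target, subset in group_loops(rows, ["GENDER"], ["SPEND", "VISITS"])
}
print(len(results), results)
```

Restricting `targets` or the group levels shortens the loop, which mirrors the point that processing only the desired targets reduces processing time.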

Insight Node

The Insight node enables you to interactively explore and analyze your data through multiple graphs and analyses that are linked across multiple windows. For example, you can analyze univariate distributions, investigate multivariate distributions, create scatter and box plots, display mosaic charts, and examine correlations. In addition, you can fit explanatory models using analysis of variance, regression, and the generalized linear model.

Input

If the Insight node follows a node that exports a data set in process flow, then the Insight node can use as input either the meta data sample file or the entire data set. The default is to use the meta data sample file as input. For representative random samples, patterns found in the meta data sample file should generalize to the entire data set. An option is provided that enables you to load the entire data set into the Insight node.

Transform Variables Node

The Transform Variables node enables you to create new variables that are transformations of existing variables in your data. Transformations are useful when you want to improve the fit of a model to the data. For example, transformations can be used to stabilize variances, remove non-linearity, and correct non-normality in variables. You can choose from the following types of transformations:

• log
• square root
• inverse
• square
• exponential
• standardize
• bucket
• quantile.

You also can create your own transformation or modify a transformation by defining a formula in the Computed Column window. The Computed Column window is a graphical interface that contains a column list box, a number pad, an operator pad, and a functions list box to build an expression.

Transform Variables Table

The interface to the Transform Variables node is a table editor in which each row represents an original variable or a transformed variable. The Transform Variables data table enables you to specify interactively the keep status, delete variables, transform variables, modify transformed variables, and change formats and labels. The interface includes columns for the following:

• Name - the name of the original or transformed variable.

• Keep - whether the variable is to be output to a subsequent node.

• Mean - the mean value.

• Std Dev - the standard deviation, which is a measure of dispersion about the mean.

• Skew - the skewness value, which is a measure of the tendency for the distribution of values to be more spread out on one side than the other. Positive skewness indicates that values located to the right of the mean are more spread out than are values located to the left of the mean. Negative skewness indicates the opposite.

• Kurtosis - the kurtosis statistic, which is a measure of the shape of the distribution of values. Large values of kurtosis indicate that the data contain some values that are very distant from the mean, as compared to most of the other values in the data set.

• C.V. - the coefficient of variation, which is a measure of spread relative to the mean. It is calculated as the standard deviation divided by the mean.

• Formula - the formula for the transformation.

• Format - the format for the variable.

• Label - the variable label.
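The summary columns in the list above can be approximated with a few lines of Python. The function `describe` is hypothetical, and the skewness and kurtosis here are the plain moment-based versions (with 3 subtracted from kurtosis so that a normal distribution scores near 0); the software may use bias-corrected formulas that give slightly different numbers.

```python
from statistics import mean, stdev

def describe(values):
    """Mean, sample standard deviation, moment skewness, excess kurtosis, C.V."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n   # fourth central moment
    sd = stdev(values)
    return {
        "mean": m,
        "std_dev": sd,
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,
        "cv": sd / m,                            # standard deviation / mean
    }

print(describe([1, 2, 3, 4, 5]))    # symmetric: skew is 0
print(describe([1, 1, 1, 1, 10]))   # long right tail: skew is positive
```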

Binning Transformations

Binning transformations enable you to collapse an interval variable, such as debt-to-income ratio, into an ordinal grouping variable. There are two types of binning transformations: quantile and bucket.

A quantile is any of the values that divide the frequency distribution of a variable into classes that contain equal numbers of observations; for example, the three quartiles divide the values into four classes. You have the flexibility to divide the data values into n classes of approximately equal size. The quantile transformation is useful when you want to create uniform groups. For example, you may want to divide the variable values into 10 uniform groups (10th, 20th, 30th, ... percentiles).

Buckets are created by dividing the data values into n equally spaced intervals based on the difference between the minimum and maximum values. Unlike a quantile transformation, the number of observations in each bucket is typically unequal.
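The difference between the two binning transformations is easiest to see side by side. The Python sketch below uses hypothetical function names and sample values; the quantile version assigns bins by rank and therefore splits tied values arbitrarily, which is a simplification.

```python
def bucket_bins(values, n):
    """Assign each value to one of n equally wide intervals of [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / n
    if width == 0:                      # all values identical: one bucket
        return [0] * len(values)
    # The maximum value would land in bin n, so clamp it into the last bin.
    return [min(int((v - low) / width), n - 1) for v in values]

def quantile_bins(values, n):
    """Assign each value to one of n bins holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n // len(values)
    return bins

ratio = [1, 2, 3, 4, 5, 6, 7, 100]
print(bucket_bins(ratio, 4))     # seven values share the first bucket
print(quantile_bins(ratio, 4))   # two values in each bin
```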

Creating New Columns

You can create a new column by selecting the Create Column item from the Actions pull-down menu or by selecting the Add Computed Column tool icon on the tool bar.

Building Equations

The Customize window contains a column list box, a number pad, an operator pad, and a functions list box to build an equation, which is displayed in the equation box at the bottom of the window and verified when you close the window.

Modifying a Variable’s Definition

To change a transformed variable’s definition, right click anywhere in the variable row of the transformed variable and select the Modify Definition pop-up menu. This action opens the Computed Column window, which enables you to modify the variable’s definition.

Variable Selection Node

Many data mining databases have hundreds of potential model inputs (independent variables). The Variable Selection node can assist you in reducing the number of inputs by dropping those that are unrelated to the target. The Variable Selection node quickly identifies input variables that are useful for predicting the target variable(s) based on a linear models framework. Then, the remaining information-rich inputs can be passed to one of the modeling nodes, such as the Regression node, for more detailed evaluation.

The Variable Selection node facilitates ordinary least squares or logistic regression methods.

Parameters

You can use the Variable Selection node to pre-select variables using an R-square or Chi-square criterion method. The following parameters are available:

• Remove Variables Unrelated to the Target – This method provides a fast preliminary variable assessment and facilitates the rapid development of predictive models with large volumes of data. You can quickly identify input variables that are useful for predicting the target variable(s) based on a linear models framework.

• Remove Variables in Hierarchies – This variable selection method searches for input variables with a hierarchical relationship. For example, if there is a relationship between state and zip code, then you may not want to use both of these inputs because the information in the two variables may be redundant. The Variable Selection node finds these hierarchical relationships and gives you the option of keeping the input that has the most detail or the variable that has the least detail.

• Remove Variables by Percentage of Missing – This method removes variables with large percentages of missing values. By default, variables having more than 50 percent missing values will be removed. You can type another value in the Percent Missing field.

• Remove Class Variables with Greater Than or Equal to n Values – This method removes class variables if they have only a single value or if they have greater than or equal to n unique values. The default number of unique values for class variable removal is 80.

You select and deselect these methods interactively in the Variable Selection Parameter window. All four methods are used by default for variable selection, and they work together to determine which variables to eliminate. Class variables that have many levels, such as zip code, can be related to the target, but eliminating these variables tends to speed the processing time of the modeling nodes, often without a great loss of information.
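Two of the four parameters, the missing-percentage rule and the class-level rule, reduce to simple counting and can be sketched in Python. The function `prefilter`, the column layout, and the sample data are assumptions for illustration; the defaults of 50 percent and 80 levels come from the text, and the example lowers the level threshold only so that a small sample can trigger it.

```python
def prefilter(columns, class_vars, max_pct_missing=50, max_levels=80):
    """Return {variable: reason} for variables that the two rules would drop.

    columns: dict mapping variable name to a list of values (None = missing).
    class_vars: names of the variables to treat as class variables.
    """
    dropped = {}
    for name, values in columns.items():
        pct_missing = 100 * sum(v is None for v in values) / len(values)
        if pct_missing > max_pct_missing:
            dropped[name] = "too many missing values"
            continue
        if name in class_vars:
            levels = {v for v in values if v is not None}
            if len(levels) <= 1 or len(levels) >= max_levels:
                dropped[name] = "single value or too many levels"
    return dropped

columns = {
    "AGE": [25, None, 40, 31],
    "FAX": [None, None, None, "Y"],
    "COUNTRY": ["US", "US", "US", "US"],
    "ZIP": ["27513", "10001", "94105", "60601"],
    "REGION": ["E", "W", "E", "W"],
}
dropped = prefilter(columns, class_vars={"FAX", "COUNTRY", "ZIP", "REGION"},
                    max_levels=4)   # threshold lowered for this tiny sample
print(dropped)
```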

Viewing the Results

When you run the Variable Selection node, the results are generated and displayed in a results browser. The browser automatically opens when the node finishes running. The Results browser is a tabbed dialog composed of the following tabs:

• Variables – summarizes the decisions generated by the variable selection algorithms.

• Log – displays the log generated by the DMINE procedure.

• Output – displays the output from the DMINE procedure.

• Code – shows the code generated by the DMINE procedure.

• R-square – displays a horizontal bar chart of the simple R-square value for each model term.

• Effects – displays a horizontal bar chart showing the incremental increase in the model R-square value for selected inputs.
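For a single interval input, a simple R-square is the squared correlation between the input and the target. The Python sketch below ranks inputs on that basis; it is only an illustration of the idea behind the R-square tab, not a reproduction of the DMINE procedure, and the function name and data are hypothetical.

```python
def r_square(x, y):
    """Squared Pearson correlation between one input x and the target y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

target = [1, 2, 3, 4, 5]
inputs = {
    "X1": [2, 4, 6, 8, 10],    # perfectly linear in the target
    "X2": [1, -1, 0, -1, 1],   # no linear relationship with the target
}
ranking = sorted(inputs, key=lambda name: r_square(inputs[name], target),
                 reverse=True)
for name in ranking:
    print(name, round(r_square(inputs[name], target), 3))
```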

Modeling Nodes

The three main tools in Enterprise Miner software for performing statistical modeling are

• Decision Trees – for classification trees
• Regression – for linear and logistic regression
• Neural Networks – for nonlinear or linear modeling.

Decision Tree Node

The Decision Tree node enables you to create decision trees that either

• classify observations based on the values of nominal or binary targets,
• predict outcomes for interval targets, or
• predict the appropriate decision when you specify decision alternatives.

Decision trees produce a set of rules that can be used to generate predictions for a new data set. This information can then be used to drive business decisions. For example, in database marketing, decision trees can be used to develop customer profiles that can help target promotional mailings in order to generate a higher response rate.

The Decision Tree node finds multi-way splits based on nominal, ordinal, and interval inputs. You choose the splitting criteria that you would like to use to create the tree. The available options represent a hybrid of the options from the CHAID (Chi-squared automatic interaction detection), CART (classification and regression trees), and C4.5 algorithms. You also can set the options to simulate traditional CHAID, CART, or C4.5.

The Decision Tree node supports both automatic and interactive training. When you run the node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the target. This ranking may be used to select variables for use in subsequent modeling. In addition, dummy variables that represent important "interactions" between variables can be automatically generated for use in subsequent modeling. You may override any automatic step with the option to define a splitting rule and prune explicit nodes or sub-trees. Interactive training enables you to explore and evaluate a large set of trees that you develop heuristically.

In addition, the Decision Tree Node enables you to

• Use prior probabilities and frequencies to train on data in proportions that are different from those of the populations on which predictions are made. For example, if fraud occurs in one percent of transactions, then one tenth of the non-fraud data is often adequate to develop the model using prior probabilities that adjust the 10-to-1 ratio in the training data to the 100-to-1 ratio in the general population.

• Base the criterion for evaluating a splitting rule on either a statistical significance test, namely an F-test or a Chi-square test, or on the reduction in variance, entropy, or Gini impurity measure. The F-test and Chi-square test accept a p-value input as a stopping rule. All criteria allow the creation of a sequence of sub-trees. You can use validation to select the best sub-tree.

• Evaluate a tree (or sub-tree) by incorporating a profit matrix that is associated with a particular decision alternative. In the special situation in which you predict the value of a categorical variable, the profit matrix implements misclassification costs. For example, the incorrect prediction of a transaction as fraudulent might cost less than the incorrect prediction of that transaction as non-fraudulent.
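The role of a profit matrix can be sketched as an expected-profit calculation: for each decision alternative, weight the profit of every possible target value by its predicted probability and choose the decision with the largest total. The Python below is an illustration only; the function name, the decision labels, and the profit figures are invented for the fraud example.

```python
def expected_profit(posterior, profit_matrix):
    """Expected profit of each decision, given predicted target probabilities.

    posterior: {target value: probability}
    profit_matrix: {target value: {decision: profit}}
    """
    decisions = next(iter(profit_matrix.values())).keys()
    return {
        d: sum(posterior[t] * profit_matrix[t][d] for t in posterior)
        for d in decisions
    }

# Hypothetical profits: approving a fraudulent transaction is far more costly
# than investigating a legitimate one.
profit_matrix = {
    "fraud":     {"investigate": 0,  "approve": -100},
    "non-fraud": {"investigate": -5, "approve": 0},
}
posterior = {"fraud": 0.2, "non-fraud": 0.8}

expected = expected_profit(posterior, profit_matrix)
best = max(expected, key=expected.get)
print(expected, best)
```

Even though the transaction is predicted to be legitimate with probability 0.8, the asymmetric costs make investigation the more profitable decision in this example.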

Dialog Pages

The interface to the Decision Tree node is a dialog box that includes the following pages:

General

In the General dialog page, you can specify the splitting criterion and values related to the size of the tree. For nominal or binary targets, you have a choice of three splitting criteria:

• Chi-Square Test – the Pearson Chi-Square measure of the target vs. the branch node, with a default significance level of 0.20.

• Entropy Reduction – the entropy measure of node impurity.

• Gini Reduction – the Gini measure of node impurity.
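The two impurity-based criteria can be written down compactly. The following Python sketch uses the standard definitions of entropy and Gini impurity and measures the reduction as the parent impurity minus the size-weighted impurity of the branches; the function names and the toy target values are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of the target values in a node."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of the target values in a node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def reduction(impurity, parent, branches):
    """Parent impurity minus the size-weighted impurity of its branches."""
    n = len(parent)
    return impurity(parent) - sum(len(b) / n * impurity(b) for b in branches)

parent = [1, 1, 1, 1, 0, 0, 0, 0]
weak_split = [[1, 1, 1, 0], [1, 0, 0, 0]]
pure_split = [[1, 1, 1, 1], [0, 0, 0, 0]]

for name, split in (("weak", weak_split), ("pure", pure_split)):
    print(name,
          round(reduction(entropy, parent, split), 4),
          round(reduction(gini, parent, split), 4))
```

Both measures give the pure split a larger reduction than the weak split, which is the property a splitting criterion relies on.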

For ordinal or interval targets, you have a choice of two splitting criteria:

• F Test – with a default significance level of 0.20.
• Variance Reduction.
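Variance reduction for an interval target follows the same pattern as the impurity measures: compare the variance of the target in the parent node with the size-weighted variance in the branches. This Python sketch uses the population variance and hypothetical names and data.

```python
def variance(values):
    """Population variance of the target values in a node."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(parent, branches):
    """Parent variance minus the size-weighted variance of its branches."""
    n = len(parent)
    return variance(parent) - sum(len(b) / n * variance(b) for b in branches)

parent = [1, 2, 3, 11, 12, 13]
good_split = [[1, 2, 3], [11, 12, 13]]
poor_split = [[1, 11, 3], [2, 12, 13]]

print(variance_reduction(parent, good_split))
print(variance_reduction(parent, poor_split))
```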

Advanced

In the Advanced page, you can select from the following sub-tree options:

• Best Assessment Value
• Distinct Distributions in Leaves
• Most Leaves
• At Most Indicated Number of Leaves.

If you have selected either the Chi-Square Test or F Test in the Options page, you also can specify a method to adjust the p-values, either

• KASS – multiplies the p-value by a factor that depends on the number of branches and number of distinct values of the inputs

• DEPTH – adjusts the final p-value for a partition to simultaneously accept all previous partitions used to create the current subset being partitioned.

Priors

In the Priors dialog page, you can select one of the following prior probabilities options to implement prior class probabilities for nominal targets:

• Proportional to the Data – implements prior class probabilities for the target based on the distribution of the target values in the training data set. For example, if 20 percent of the observations have a target value of 1 and 80 percent have a target value of 0, then the prior probabilities for the target would be .2 and .8.

• Equal probability – applies equal prior probabilities to the target values, for example .5 and .5 for a binary target.

• Explicit – enables you to apply explicit prior probabilities to the target values.

When you select the Explicit option, a table opens that enables you to enter explicit prior probabilities for each target value.
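The three prior options, and the effect of changing priors on a predicted probability, can be sketched in Python. The re-weighting in `adjust_posterior` is the standard adjustment of multiplying each class probability by the ratio of new prior to training prior and renormalizing; whether Enterprise Miner applies priors in exactly this way is an assumption, and all names and numbers are illustrative.

```python
from collections import Counter

def proportional_priors(targets):
    """Priors equal to the observed target proportions in the training data."""
    counts = Counter(targets)
    return {k: c / len(targets) for k, c in counts.items()}

def equal_priors(targets):
    """The same prior for every target value."""
    levels = set(targets)
    return {k: 1 / len(levels) for k in levels}

def adjust_posterior(posterior, train_prior, new_prior):
    """Re-weight predicted probabilities from training priors to new priors."""
    weighted = {k: posterior[k] * new_prior[k] / train_prior[k] for k in posterior}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}

train_targets = [1] * 2 + [0] * 8
train_prior = proportional_priors(train_targets)    # {1: 0.2, 0: 0.8}
explicit_prior = {1: 0.01, 0: 0.99}                 # entered by the analyst

adjusted = adjust_posterior({1: 0.5, 0: 0.5}, train_prior, explicit_prior)
print(train_prior, equal_priors(train_targets), adjusted)
```

A leaf that looks like an even split in the oversampled training data corresponds to a much lower event probability once the rarer population prior is applied.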

Assessment

In the Assessment dialog page, you can define how to perform overall assessment of the target and specify decision alternatives and threshold values.

Output

The Output page consists of two sub-pages: the Data Sub-page, which enables you to score the model and lists output data set details; and the Variables Sub-page, which enables you to select output variables to be used by subsequent modeling nodes.

Tree Diagram Pop-up Menu

If you right click on the background of the Tree Diagram window, a pop-up menu opens that enables you to specify tree customizations, save the tree, and print the tree.

Input

The Decision Tree node requires one target variable and at least one input variable. The target variable can be nominal, binary, or interval. The input variables can be nominal, binary, ordinal, or interval. Optionally, you can specify bonus variables, a frequency variable, and a weight variable. The bonus, frequency, and weight variables must be interval. The target, input, bonus, frequency, and weight variables are mutually exclusive.

Viewing the Results

Output of the Decision Tree node includes the following:

• Summary Table – provides summary statistics for the currently selected tree. For nominal target variables, the Summary Table presents n x m tables for the training data and the validation data.

• Tree Ring Navigator – presents a graphical display of possible data segments from which to form a tree. The Tree Ring Navigator also enables you to view specific segments in the Tree Diagram. You can use the tool box at the top of the application to control the Tree Ring. Tool tips are displayed when you place your cursor over a tool icon.

• Assessment Table – provides a measure of how well the tree describes the data. For a nominal target, the default measure is the proportion of observations correctly classified. For an interval target, the default measure is the average sum of squared differences of an observation from its predicted value. The table displays the assessment for several candidate partitions of the data. If a validation data set is used, the assessment based on the validation data will be more reliable than that based on the training data.

• Assessment Graph – plots the assessment values from the Assessment Table.
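The two default assessment measures named in the Assessment Table item are straightforward to compute. The Python sketch below uses hypothetical function names and made-up validation values.

```python
def proportion_correct(actual, predicted):
    """Share of observations whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def average_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Nominal target: three of four validation observations classified correctly.
print(proportion_correct(["Y", "N", "N", "Y"], ["Y", "N", "Y", "Y"]))

# Interval target: squared errors are 1, 0, and 4.
print(average_squared_error([10, 20, 30], [11, 20, 28]))
```

Computing these measures on validation data for each candidate sub-tree, and keeping the sub-tree with the best value, is one way to read the "Best Assessment Value" option.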

Tree Diagram

The Tree diagram displays node (segment) statistics, the names of variables used to split the data into nodes, and the variable values for several levels of nodes in the tree.

Output Data Sets

The Decision Tree node includes an Output page that is part of the Tree Browser, which enables you to specify a scoring output data set, and to select the variables you want to output for subsequent modeling.

Neural Network Node

An artificial neural network is a computer application that attempts to mimic the neurophysiology of the human brain in the sense that the network learns to find patterns in data from a representative data sample. More specifically, it is a class of flexible nonlinear regression models, discriminant models, and data reduction models, which are interconnected in a nonlinear dynamic system. By detecting complex non-linear relationships in data, neural networks can help you make predictions about real-world problems.

An important feature of the Neural Network node is its built-in intelligence about neural network architecture. The node surfaces this intelligence to the user by making functions available or unavailable in the GUI according to what is mathematically compatible within a neural network. Unavailable functions are grayed out, which simplifies the building process for the user and ensures that all available functions are compatible with neural network architecture.

The following neural network architectures are available in Enterprise Miner:

• generalized linear model (GLIM)
• multi-layer perceptron (MLP), which is often the best architecture for prediction problems
• radial basis function (RBF), which is often best for clustering problems
• equal-width RBF
• normalized RBF
• normalized equal-width RBF.

Dialog Pages

The interface to the Neural Network node is a dialog box that includes the following pages:

Initialization

In the Initialization dialog page, you can accomplish the following tasks:

• generate a random seed, by selecting "Generate New Seed." The random seed affects the starting point for training the network. If the starting point is close to the final settings, then the training time can be dramatically reduced. Conversely, if the starting point is not close to the final settings, then training time tends to increase. You may want to first accept the default random seed setting, and then in later runs, specify other random seeds.

• select a distribution, by clicking the down arrow and selecting from the resulting menu. Choices are uniform, normal, and Cauchy.

• select a scale.
• select a location.
• select initial estimates.

Preliminary Optimization

In the Preliminary Optimization dialog page, you can specify

• the number of preliminary runs
• the training technique
• the maximum iterations
• whether model defaults are allowed
• the maximum CPU time.

Training

In the Training dialog page, you can:

• specify the training technique
• specify the maximum iterations
• allow model defaults to be provided
• specify the maximum CPU time
• request a plot of error history
• specify whether to always retrain the network.

Output

In the Output dialog page, you can view properties of output data sets. By clicking Properties, you can view the administrative details about the data set and view the data set in a table. These data sets can also be viewed in the Data Sets tab of the Results Browser. Output includes the following:

• estimates data sets – for preliminary optimization and training

• output data sets – for training, validation, testing, and scoring

• fit statistics data sets – for training, validation, and testing.

Advanced

In the Advanced dialog page, you can specify the

• objective function
• maximum number of function calls
• default layer size
• convergence criteria.

Regression Node

The Regression node enables you to fit both linear and logistic regression models to a predecessor data set in an Enterprise Miner process flow. Linear regression attempts to predict the value of a continuous target as a linear function of one or more independent inputs. Logistic regression attempts to predict the probability that a binary or ordinal target will acquire the event of interest as a function of one or more independent inputs.

The node includes a point-and-click "Interaction Builder" to assist you in creating higher-order modeling terms. The Regression node, like the Decision Tree and Neural Network nodes, also provides you with a directory table facility, called the Model Manager, in which you can store and access models on demand. The node supports forward, backward, and stepwise selection methods. Data sets that have a role of score are automatically scored when you train the model.

In addition, Enterprise Miner enables you to build regression models in a batch environment.

Data sets used as input to the Regression node can include cross-sectional data, which are data collected across multiple customers, products, geographic regions, and so on, but typically, not across multiple time periods.

Dialog PagesThe interface to the Regression node is a dialog boxthat includes the following pages:

Variables

The Variables dialog page contains a data table, which enables you to specify the status for the variables in the input data set and to sort the variables, and an Interaction Builder, which enables you to add interaction terms to the model.

For example, if the effect of one input on the target depends on the level of one or more other inputs, you may want to use the Interaction Builder to add interaction terms to your model. An interaction term, or multiplicative effect, is a product of existing explanatory inputs. For example, the interaction of SALARY and DEPT is SALARY*DEPT.

Another example involves a polynomial model, which includes powers of existing explanatory inputs. In such a situation, you may want to use the Variables dialog page to include polynomial terms if you suspect a nonlinear relationship exists between the input(s) and the target.
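The two kinds of higher-order terms described above can be illustrated outside the GUI. The following is a minimal Python sketch, not Enterprise Miner code; the input names `salary` and `tenure` are hypothetical, and both are assumed to be numeric so that the product is a single column.

```python
def add_terms(rows):
    """Return copies of the rows extended with an interaction and a squared term.

    Each row is a dict with numeric 'salary' and 'tenure' inputs
    (hypothetical names chosen for this sketch).
    """
    extended = []
    for row in rows:
        new_row = dict(row)
        # Interaction (multiplicative effect): product of two existing inputs.
        new_row["salary*tenure"] = row["salary"] * row["tenure"]
        # Polynomial term: a power of an existing input.
        new_row["salary**2"] = row["salary"] ** 2
        extended.append(new_row)
    return extended


rows = [{"salary": 3.0, "tenure": 2.0}, {"salary": 5.0, "tenure": 4.0}]
extended = add_terms(rows)
```

For a class input such as DEPT, the interaction would instead be the product of the numeric input with each coded column of the class input.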

Model Options

The Model Options dialog page provides details about, and enables you to specify options for, the target variable and the regression process. The Model Options dialog page includes sub-pages for the Target Definition and the type of Regression.

• The Target Definition sub-page lists the name and measurement level of the target variable.

• The Regression sub-page enables you to specify whether the regression type is linear or logistic and what type of link functions to use. For binary or ordinal targets, the default regression type is logistic. For interval targets, the default regression type is linear. For a linear regression, the identity link function is used. For a logistic regression, you can select either logit, cloglog (complementary log-log), or probit as the link function.
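To make the three logistic link choices concrete, the sketch below shows the standard inverse of each link, which maps a linear predictor to a probability. It is an illustration in Python using the textbook definitions; the function names are assumptions of this sketch and are not part of Enterprise Miner.

```python
import math


def inv_logit(eta):
    """Inverse logit link: p = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_cloglog(eta):
    """Inverse complementary log-log link: p = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def inv_probit(eta):
    """Inverse probit link: p = standard normal CDF evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


INVERSE_LINKS = {"logit": inv_logit, "cloglog": inv_cloglog, "probit": inv_probit}


def predicted_probability(eta, link="logit"):
    """Map the linear predictor eta to an event probability under the chosen link."""
    return INVERSE_LINKS[link](eta)
```

The logit and probit links are symmetric around a probability of 0.5, whereas the complementary log-log link is not, which is one reason the choice of link can matter for rare events.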

The Model Options dialog page also enables you to specify the input coding as either deviation or GLM as well as suppress or not suppress the intercept.
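The difference between the two input codings can be sketched for a single class input. The Python below is illustrative only: it assumes the conventional definitions, in which GLM coding creates one indicator column per level and deviation (effect) coding creates one fewer column, with the last level as the reference coded -1 in every column. The paper does not state which level the software treats as the reference.

```python
def glm_coding(value, levels):
    """GLM coding: one 0/1 indicator column per level."""
    return [1 if value == level else 0 for level in levels]


def deviation_coding(value, levels):
    """Deviation coding: len(levels) - 1 columns.

    The last level is assumed to be the reference and is coded -1 everywhere.
    """
    reference = levels[-1]
    if value == reference:
        return [-1] * (len(levels) - 1)
    return [1 if value == level else 0 for level in levels[:-1]]


levels = ["A", "B", "C"]
glm_row = glm_coding("B", levels)
dev_row = deviation_coding("C", levels)
```

With deviation coding, each parameter estimate measures the difference between a level and the average over all levels, rather than the difference from one baseline level.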

Selection Method

The Selection Method dialog page enables you to specify details about model selection. You can choose from the following selection methods:

• backward – begins with all inputs in the model and then systematically removes inputs that are not related to the target.

• forward – begins with no inputs in the model and then systematically adds inputs that are related to the target.

• stepwise – systematically adds and deletes inputs from the model. Stepwise selection is similar to forward selection except that stepwise may remove an input once it has entered the model and replace it with another input.

• none – all inputs are used to fit the model.

If you choose the Forward, Backward, or Stepwise selection method, then you can specify the selection criteria as either AIC (Akaike's Information Criterion), SBC (Schwarz's Bayesian Criterion), Validate, Cross Validate, or None.
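As an illustration of how AIC and SBC trade off fit against model size, the sketch below uses one common least-squares form of each criterion, where n is the number of observations, p is the number of parameters, and SSE is the error sum of squares. These formulas and the candidate values are assumptions made for the example; the paper does not state how Enterprise Miner computes the criteria.

```python
import math


def aic(sse, n, p):
    """Akaike's Information Criterion (least-squares form): n*ln(SSE/n) + 2p."""
    return n * math.log(sse / n) + 2 * p


def sbc(sse, n, p):
    """Schwarz's Bayesian Criterion (least-squares form): n*ln(SSE/n) + p*ln(n)."""
    return n * math.log(sse / n) + p * math.log(n)


def best_model(candidates, n, criterion):
    """Return the candidate with the smallest criterion value."""
    return min(candidates, key=lambda m: criterion(m["sse"], n, m["p"]))


# Hypothetical candidate models visited during a selection run.
candidates = [
    {"name": "x1", "sse": 120.0, "p": 2},
    {"name": "x1+x2", "sse": 80.0, "p": 3},
    {"name": "x1+x2+x3", "sse": 79.5, "p": 4},
]
chosen = best_model(candidates, n=100, criterion=sbc)
```

In this example the third input reduces the error sum of squares only slightly, so the penalty for the extra parameter outweighs the gain and both criteria prefer the two-input model.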

Advanced

You set the optimization method, iteration controls, and convergence criteria in the Advanced dialog page. Nonlinear optimization methods include the following:

• gradient

• double dogleg

• Newton-Raphson with line search

• Newton-Raphson with ridging

• quasi-Newton

• trust-region.

Viewing the Results

The interface to the Regression results is a dialog box that includes the following dialog pages:

Estimates – displays a bar chart of the standardized or non-standardized parameter estimates from the regression analysis. A standardized parameter estimate is obtained by standardizing all the inputs to zero mean and unit variance prior to running the regression.



Statistics – lists fit statistics, in alphabetical order, for the training data, validation data, and test data analyzed with the regression model.

Output – lists the standard SAS output from linear or logistic regression analysis, depending on what type of regression analysis you specified in the Regression node. For linear regression, the standard output lists the following information about the model:

• R-square

• adjusted R-square

• AIC (Akaike’s Information Criterion)

• SBC (Schwarz’s Bayesian Criterion)

• BIC (Bayesian Information Criterion)

• CP (Mallows’ CP statistic).

For logistic regression, the standard output lists the following information about the target and input variables:

• Response Profile – For each level of the target, it lists the ordered value of the response variable, and the count or frequency.

• Class Level Information – For each class input variable, it lists the values of the design matrix.

Properties – lists the following information about the model:

• name you specified for the model settings

• description you specified for the model

• date that you created the model

• last date that the model was modified

• type of regression (linear or logistic)

• name of the target variable.
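The standardization behind the standardized parameter estimates on the Estimates page can be sketched as follows. This is a minimal Python illustration that rescales one input to zero mean and unit variance; the use of the sample standard deviation (n - 1 divisor) is an assumption, since the paper does not say which divisor the software uses.

```python
import statistics


def standardize(values):
    """Rescale an input to zero mean and unit variance.

    Uses the sample standard deviation (an assumption of this sketch).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
```

Because every standardized input is on the same scale, the magnitudes of the resulting estimates can be compared directly in the bar chart.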

User-Defined Model Node

The User-Defined Model node enables you to generate assessment statistics using predicted values from a model built with the SAS Code node (such as a logistic model using the SAS/STAT™ LOGISTIC procedure) or from the Variable Selection node. Also, the predicted values can be saved to a data set and then imported into the process flow with the Input Data Source node.

Model Manager

The Regression node, the Decision Tree node, and the Neural Network node include a directory table facility, called the Model Manager, in which you can store and access models on demand.

Dialog Pages

The interface to the Model Manager is a dialog box that includes the following pages:

Models

The Model Manager opens with the Models dialog page, which lists the trained models. For each model, information about when and how the model was created is listed along with fit statistics generated from the training, validation, and/or test data sets.

Profit Matrix

You use the Profit Matrix dialog page to define a table of expected revenues and costs for each decision alternative for each level of the target variable.

Assessment Options

In the Assessment Options dialog page, you set the partitioned data set that is used for model assessment and for determining if an exact model is created or not.

Assessment Reports

You use the Assessment Reports dialog page to select the assessment charts you want to create for a model.

Assessment Nodes

Assessment provides a common framework to compare models and predictions from any analytical tool in Enterprise Miner. The common criteria for all modeling and predictive tools are the expected and actual profits for a project that uses the model results. These are the criteria that enable you to make cross-model comparisons and assessments, independent of all other factors such as sample size or the type of modeling tool used.

Assessment Node

The Assessment node provides a common framework to compare models and predictions from the Decision Tree, Neural Network, and Regression nodes. The common criteria for all modeling and predictive tools are the expected and actual profits obtained from model results. These are the criteria that enable the user to make cross-model comparisons and assessments, independent of all other factors such as sample size and modeling node.

Assessment statistics are automatically computed when you train a model with a modeling node. You can compare models with either the Assessment node or the Model Manager of a modeling node. The Assessment node and the Model Manager provide the same assessment reports.

An advantage of the Assessment node is that it enables you to compare models created by multiple modeling nodes. The Model Manager is restricted to comparing models trained by the respective modeling node. An advantage of the Model Manager is that it enables you to re-define the cost function (Profit Matrix) for a model. The Assessment node uses the Profit Matrix defined in the Model Manager as input. Essentially, the Assessment node serves as a browser. Therefore, you cannot re-define the Profit Matrix in the Assessment node.

Initial assessment report options are defined in the Enterprise Miner Administrator. You can re-define these options in the Model Manager of the respective modeling node. Assessment options defined in the Administrator or the Model Manager cannot be changed in the Assessment node.

Input

The Assessment node requires the following two inputs:

• scored data set – consists of a set of posterior probabilities for each level of a binary-level, nominal-level, or ordinal-level target variable. The Regression, Neural Network, and Decision Tree nodes automatically produce a scored data set as output. If the target is interval-level, then the scored data set does not contain posterior probabilities. Instead, it contains the predicted values for the target. You produce a scored data set when you train a model with a modeling node.

• cost function – is a table of expected revenues and expected costs for each decision alternative for each level of the target variable. It is also known as the profit matrix. An optional parameter in the cost function is a value for unit cost. The costs of the decision alternatives may be the same for each level of the target, but they may differ, depending on the actions required by the business decision. The expected profit depends on the level of the target variable. The quality of the assessment depends upon how accurately the users can estimate the cost function. Cost functions should be regularly examined and updated as necessary. You incorporate a cost function into the model assessment by defining a profit matrix in the Model Manager.

Expected Profits

The Assessment node combines the posterior probabilities from the scored data set with the cost function to produce expected profits.
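One way to read this combination is as a probability-weighted sum: for each decision alternative, multiply the profit at each target level by the posterior probability of that level and add the results. The Python sketch below illustrates that reading; the target levels, decisions, and profit values are hypothetical, and the paper does not give the exact computation used by the node.

```python
def expected_profits(posteriors, profit_matrix):
    """Return the expected profit of each decision alternative.

    posteriors:    {target_level: posterior probability}
    profit_matrix: {target_level: {decision: profit}}
    """
    decisions = next(iter(profit_matrix.values())).keys()
    return {
        decision: sum(
            posteriors[level] * profit_matrix[level][decision]
            for level in posteriors
        )
        for decision in decisions
    }


# Hypothetical profit matrix for a mailing campaign.
profit_matrix = {
    "responder": {"mail": 10.0, "no_mail": 0.0},
    "non_responder": {"mail": -1.0, "no_mail": 0.0},
}
posteriors = {"responder": 0.2, "non_responder": 0.8}

profits = expected_profits(posteriors, profit_matrix)
best = max(profits, key=profits.get)
```

Here a 20 percent chance of a response is enough to make mailing the more profitable decision, because the revenue from a responder is much larger than the cost of a wasted mailing.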

Standard Charts for Assessment

The Assessment node uses the expected profits and actual profits to produce standard charts and tables that describe the usefulness of the model that was used to create the scored data set.

You view the results of the Assessment node by selecting one or more assessment charts. The Tools pull-down menu enables you to specify which charts to create and to browse the data sources. The data sources can be training, validation, or test data sets. Assessment charts include the following:

• lift charts (or gains charts)

• profit charts

• return on investment (ROI) charts

• diagnostic classification charts

• statistical receiver operating characteristic (ROC) charts

• business ROC charts

• top-bottom charts (or top 10 marginal impact variables charts)

• mosaic charts

• MDDB charts

• threshold-based charts

• interactive profit/loss assessment charts.

Lift Charts

In a lift chart (also known as a gains chart) all observations from the scored data set are sorted from highest expected profit to lowest expected profit. Then the observations are grouped into cumulative deciles. If a profit matrix was not defined in the Model Manager, then a default profit matrix is used, which has the expected profit equal to the posterior probability.

Lift charts show the percent captured positive response, percent positive response, or the lift value on the vertical axis. These statistics can be displayed as either cumulative or non-cumulative values for each decile.
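The cumulative statistics described above can be sketched in a few lines of Python. The example assumes a binary target coded 1 for the event, sorts by the posterior probability (the default expected profit when no profit matrix is defined), and uses the usual definition of lift as the response rate in the selected group divided by the overall response rate. The function name and data are hypothetical.

```python
def cumulative_lift(scores, actuals, n_bins=10):
    """Return (bin, % captured response, % response, lift) for cumulative bins.

    scores:  posterior probabilities of the event
    actuals: 1 for an actual event, 0 otherwise
    Assumes at least n_bins observations and at least one event.
    """
    # Sort the actual outcomes from highest score to lowest.
    ranked = [a for _, a in sorted(zip(scores, actuals),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    total_events = sum(ranked)
    base_rate = total_events / n
    rows = []
    for b in range(1, n_bins + 1):
        top = ranked[:round(b * n / n_bins)]
        captured = sum(top)
        pct_captured = 100.0 * captured / total_events
        pct_response = 100.0 * captured / len(top)
        lift = (captured / len(top)) / base_rate
        rows.append((b, pct_captured, pct_response, lift))
    return rows


scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
actuals = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
rows = cumulative_lift(scores, actuals)
```

A cumulative lift above 1 in the early deciles indicates that the model concentrates events among its highest-scored observations; by the last decile the cumulative lift always returns to 1.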

An index value, called a lift index, scaled from -100 to 100, represents this area of gain in the lift chart. Useful models have index values closer to 100, while weaker models have index values closer to zero. In rare situations, the index may have negative values.

Profit Chart

In a profit chart, the cumulative or non-cumulative profits within each decile of expected profits are computed. For a useful predictive model, the chart will reach its maximum fairly quickly. For a model that has low predictive power, the chart will rise slowly and not reach its maximum until a high cumulative decile.

Return on Investment Chart

The return on investment (ROI) chart displays the cumulative or non-cumulative ROI for each decile of observations in the scored data set. The return on investment is the ratio of actual profits to costs, expressed as a percentage.

Diagnostic Classification Charts

Diagnostic classification charts provide information on how well the scored data set predicts the actual data. The type of chart you can produce depends on whether the target is a non-interval or interval target.

Classification Charts for Non-Interval Targets

Classification charts display the agreement between the predicted and actual target variable values for non-interval-level target variables.

Classification Plots for Interval-Level Targets

For interval-level target variables, the Assessment node plots the actual values against the values predicted by the model.

The Assessment node also plots the residuals (actual values - predicted values) for the target against the predicted target values.

Receiver Operating Characteristic (ROC) Charts

ROC charts display the sensitivity (true positive / total actual positive) and specificity (true negative / total actual negative) of a classifier for a range of cutoffs. ROC charts require a binary target.

Statistical ROC Charts

Each point on the curve represents a cutoff probability. Points closer to the upper-right corner correspond to low cutoff probabilities. Points in the lower left correspond to higher cutoff probabilities. The extreme points (1,1) and (0,0) represent no-data rules where all cases are classified into class 1 or class 0, respectively.
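The relationship between cutoffs and points on the curve can be checked with a small Python sketch. It assumes the standard definitions: sensitivity is the share of actual positives classified as positive, specificity is the share of actual negatives classified as negative, and the curve plots sensitivity against 1 - specificity. The data and function name are hypothetical.

```python
def roc_points(probs, actuals, cutoffs):
    """Return (cutoff, 1 - specificity, sensitivity) for each cutoff.

    An observation is classified into class 1 when its posterior
    probability is greater than or equal to the cutoff.
    """
    positives = sum(actuals)
    negatives = len(actuals) - positives
    points = []
    for cutoff in cutoffs:
        predicted = [1 if p >= cutoff else 0 for p in probs]
        true_pos = sum(1 for y, yhat in zip(actuals, predicted)
                       if y == 1 and yhat == 1)
        true_neg = sum(1 for y, yhat in zip(actuals, predicted)
                       if y == 0 and yhat == 0)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        points.append((cutoff, 1.0 - specificity, sensitivity))
    return points


probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actuals = [1, 1, 0, 1, 0, 0]
# A cutoff of 0.0 classifies every case into class 1; a cutoff above 1.0
# classifies every case into class 0.
points = roc_points(probs, actuals, cutoffs=[0.0, 0.5, 1.01])
```

The first and last cutoffs reproduce the two extreme no-data rules, and the middle cutoff gives an interior point on the curve.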

Business ROC Charts

Business ROC charts display the prediction accuracy of the target across a range of decision threshold values. An advantage of a business ROC chart over a statistical ROC chart is that you have an indication of model performance across a range of threshold levels.

Top-Bottom Charts

Top-bottom charts (also known as top 10 marginal impact charts) compare observations in the scored data set that are ranked the highest in expected profits with those that are ranked the lowest in expected profits to show which input variables are important in making that distinction.

For nominal-level or ordinal-level input variables in the top-bottom (TB) chart, the chart displays the total frequency for each level and the percentage of that total in the top and bottom categories. For interval-level input variables, the TB chart displays the total frequency and the average value in the top and bottom categories.

The Assessment node can produce the following top-bottom charts:

• TB10 – compares the top 10 percent of the data to the bottom 10 percent of the data

• TB25 – compares the top 25 percent of the data to the bottom 25 percent of the data

• TB50 – compares the top 50 percent of the data to the bottom 50 percent of the data.
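For an interval-level input, the top-bottom comparison amounts to ranking observations by expected profit and averaging the input within the top and bottom groups. The following Python sketch illustrates that idea; the function name, the rounding of the group size, and the data are assumptions of the example.

```python
def top_bottom_means(expected_profit, input_values, percent):
    """Average an interval input in the top and bottom `percent` of the data.

    Observations are ranked from highest to lowest expected profit.
    """
    ranked = [v for _, v in sorted(zip(expected_profit, input_values),
                                   key=lambda pair: pair[0], reverse=True)]
    # Group size; at least one observation per group.
    k = max(1, round(len(ranked) * percent / 100))
    top, bottom = ranked[:k], ranked[-k:]
    return sum(top) / k, sum(bottom) / k


expected_profit = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
income = [90, 80, 70, 60, 50, 50, 40, 30, 20, 10]

tb10 = top_bottom_means(expected_profit, income, percent=10)
tb50 = top_bottom_means(expected_profit, income, percent=50)
```

A large gap between the two averages suggests that the input helps separate the most profitable observations from the least profitable ones.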

Mosaic Charts

Mosaic charts provide a graphical representation of the discriminative power of the input variables listed in the top-bottom charts.

MDDB Charts

MDDB charts display the top ten input variables, as determined by the modeling tool.



Threshold-based Charts

Threshold-based charts enable you to display the agreement between the predicted and actual target values across a range of threshold levels (the cutoff that is used to classify an observation based on the event level posterior probabilities).

Interactive Profit/Loss Assessment Charts

You can also create an interactive profit/loss assessment chart that enables you to interactively assess how a profit/loss matrix impacts the total return over a range of threshold levels.

Score Node

The Score node enables you to manage, edit, export, and execute scoring code that is generated from trained models. Scoring is the generation of predicted values for a new data set that may not contain a target. Scoring a new data set is the end result of most data mining problems. For example,

• A marketing analyst may want to score a database to create a mailing list of customers most likely to make a purchase.

• A financial analyst may want to score credit union members to identify probable fraudulent transactions.

The Score node generates and manages scoring formulas in the form of a single SAS data step, which can be used in most SAS environments even without the presence of Enterprise Miner software.

Any node that modifies the observations of the input variables or creates a scoring formula generates components of scoring code. The following nodes generate components of scoring code:

• Transformation Node – creates new variables from data set variables

• Data Replacement Node – imputes missing values

• Clustering Node – creates a new segment ID column and imputes missing values

• Group Processing Node – subsets the data using IF statements

• Regression Node – creates predicted values

• Neural Network Node – creates predicted values

• Decision Tree Node – creates predicted values

• SAS Code Node – avenue for customized scoring code.
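The idea that each node contributes one component to a single piece of scoring code can be sketched as a chain of small functions applied in order to a new observation. The Python below is only an analogy for the generated SAS data step: the step functions, the replacement value, and the coefficients are all hypothetical.

```python
import math


def impute(row):
    """Data Replacement component: fill a missing input with a fixed value."""
    out = dict(row)
    if out["income"] is None:
        out["income"] = 40000.0  # hypothetical replacement value
    return out


def transform(row):
    """Transformation component: create a new variable from an existing one."""
    out = dict(row)
    out["log_income"] = math.log(out["income"])
    return out


def predict(row):
    """Regression component: logistic formula with hypothetical coefficients."""
    out = dict(row)
    eta = -10.0 + 1.0 * out["log_income"]
    out["p_event"] = 1.0 / (1.0 + math.exp(-eta))
    return out


# Components accumulate in the order in which the nodes appear in the flow.
SCORE_STEPS = [impute, transform, predict]


def score(row):
    """Apply every scoring component, in order, to one new observation."""
    for step in SCORE_STEPS:
        row = step(row)
    return row


scored = score({"income": None})
```

Because the new observation carries no target value, the chain uses only inputs, which mirrors the point that scoring can be run on data that do not contain a target.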

Dialog Pages

The interface to the Score node is a dialog box that includes the following pages:

Score Code Page

The Score Code dialog page provides access to the score code management functions. Management functions include current imports, which lists the scoring code currently imported from node predecessors, and accumulated runs, which lists scoring code exported by the node's predecessors during the most recent path run (training action).

The Score Code dialog page also includes a pop-up menu that provides access to the saved management functions, which enable you to display the score code in the viewer and save currently imported or accumulated run entries to a file.

Saving Code

If you decide that a model provides good scoring code, you may want to save it for future scoring. Saved code can be edited, deleted, exported, or submitted using the pop-up menu.

Run Action

The Run Action dialog page contains the following sub-pages:

• General Sub-page – enables you to select the action that will be performed when the Score node is run within a diagram path (default)

• Merge Options Sub-page – enables you to select variables that you want to keep in the merged data set.

Client/Server Enablement

The GUI includes dialog pages and other graphical tools such as radio buttons and menus that make establishing connections between servers and clients fast, efficient, and easy to understand. The client/server functionality of Enterprise Miner software provides advantages because the solution

• distributes data-intensive processing to the most appropriate machine

• minimizes network traffic by processing the data on the source machine

• minimizes data redundancy by maintaining one central data source

• distributes server profiles to multiple clients

• can regulate access to data sources

• can toggle between remote and local processing.

With Enterprise Miner, you can access diverse data sources from database management systems on servers to use with your local SAS Enterprise Miner session.

Administrator and Users

Due to the complexities that can be encountered in configuring client/server connections, Enterprise Miner relies on a person to act as an administrator, someone who will perform all system setup functions. The administrator role is established when you first invoke the Administration Dialog Box and enter a password.

A user is the beneficiary of server profiles that are defined and distributed by the administrator. A user can establish a remote connection, but does not have the authority to define new server profiles.

Administration Window

The interface to the client/server Administration functions is a tabbed dialog box as shown in Figure 4.

Figure 4: Administration Window

Dialog Pages

The interface to the Administration window is a dialog box that includes the following pages:

General

In the General dialog page of the Administration window, you set the privilege mode as either administrator or user. An administrator can define and modify server profiles; a user cannot.

Servers

The Server dialog page enables you to define server and query profiles. A server profile contains all the configuration information necessary to establish a connection on a remote server. You must define a server profile before you can establish a remote connection or define a query profile.

The Server Setup Window

To add a server profile, begin by simply selecting “Add” from the Administration window. This action opens the Server Setup window, which is shown in Figure 5.

Figure 5: Server Setup Window

The Server Setup window enables you to enter the server profile including information such as a description, the network address of the server in either name or number format, and the default data library pathname. By selecting a setting of the radio button at the bottom of the Server Setup window, you can quickly specify whether you want processing to take place on the remote server or download data samples from the server to be processed locally.

The Server Setup window also includes options that enable you to configure aspects of the remote session such as start code, which will run when the server is initialized.
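The information collected in the Server Setup window can be pictured as a small record. The Python sketch below is a hypothetical representation for illustration only; the class, field names, and example values are assumptions and do not reflect how Enterprise Miner stores server profiles.

```python
from dataclasses import dataclass


@dataclass
class ServerProfile:
    """Hypothetical record of the fields entered in the Server Setup window."""
    description: str
    address: str                   # network address, in name or number format
    default_library_path: str      # default data library pathname
    process_remotely: bool = True  # radio button: remote vs. local processing
    start_code: str = ""           # code run when the server is initialized

    def processing_mode(self):
        """Describe where processing takes place for this profile."""
        if self.process_remotely:
            return "remote"
        return "local (downloaded sample)"


# Example values are placeholders, not real servers or paths.
profile = ServerProfile(
    description="Marketing warehouse",
    address="server.example.com",
    default_library_path="/data/mining",
)
```

Keeping the remote or local choice as part of the profile reflects the point that a user can toggle between the two modes without redefining the connection.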

Defining Query Profiles

The Server dialog page also provides an option for adding an SQL query profile. Query profiles are used by the Input Data Source node to automatically load your query profile preferences.

Modifying and Removing a Server Profile

A Modify button and a Remove button are also part of the Server dialog page. They provide convenient ways for the administrator to change or delete server profiles as needed.



Projects

The Projects dialog page of the Administration window enables you to define the project libraries that will appear in the Available Projects window. Adding, modifying, and removing project libraries are all completed using simple data entry fields, pop-up menus, and radio buttons.

Assessment

The Assessment dialog page enables you to define global assessment chart options for model assessment in subsequent diagrams. Assessment charts help you describe the usefulness of the model that was used to create a scored data set. For example, you may want to globally limit the display of assessment charts to one type such as lift charts, or you may want to enable the display of mosaic charts, which, by default, are not displayed.

Options

By default, results of the log and output are sent to the respective nodes. The Options dialog page enables you to redirect log and output from the node to the SAS System log and output windows of SAS Display Manager.

Defaults

The Defaults dialog page enables you to customize the default node options.

Conclusion

Enterprise Miner software provides all the functionality one needs to plan, implement, and successfully refine data mining projects. The software fully integrates all steps of the data mining process beginning with the sampling of data, through sophisticated data analyses and modeling, to the dissemination of the resulting information. The functionality of Enterprise Miner is surfaced through an intuitive and flexible GUI, which enables users, who may have different degrees of statistical expertise, to mine volumes of data for valuable information.

References and Further Reading

SAS Institute Inc. (1995), SAS Institute White Paper: Building a SAS® Data Warehouse, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1996), SAS Institute White Paper: OLAP Tools and Techniques within the SAS® System, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1996), SAS Institute White Paper: The SAS® System and Web Integration, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1997), SAS Institute White Paper: Business Intelligence Systems and Data Mining, Cary, NC: SAS Institute Inc.

Acknowledgments

Contributors to Finding the Solution to Data Mining: A Map of the Features and Components of SAS® Enterprise Miner™ Software include the following SAS Institute employees: Brent L. Cohen, James D. Seabolt, R. Wayne Thompson, and John S. Williams.

Contact

Mark Brown
Program Manager, Data Mining
SAS Institute Inc.
SAS Campus Drive
Cary, NC, 27513

Voice: 919-677-8000, ext. 7165
E-mail: [email protected]

SAS and all other SAS Institute Inc. product or service names are the registered trademarks or trademarks of SAS Institute Inc. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective owners. Copyright © 1998 by SAS Institute Inc., Cary, NC. All rights reserved. Credit must be given to the publisher. Otherwise, no part of this publication may be reproduced without prior permission of the publisher.

