the surveyreg procedure

Chapter 62The SURVEYREG Procedure

Chapter Table of Contents

OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3217

GETTING STARTED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3217Simple Random Sampling. . . . . . . . . . . . . . . . . . . . . . . . . . .3217Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3220Create Output Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3224

SYNTAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3225PROC SURVEYREG Statement . . . . . . . . . . . . . . . . . . . . . . . .3225BY Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3227CLASS Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3227CLUSTER Statement . . .. . . . . . . . . . . . . . . . . . . . . . . . . . .3228CONTRAST Statement . .. . . . . . . . . . . . . . . . . . . . . . . . . . .3228ESTIMATE Statement . .. . . . . . . . . . . . . . . . . . . . . . . . . . .3230MODEL Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3232STRATA Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3233WEIGHT Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3234

DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3234Specification of Population Totals and Sampling Rates .. . . . . . . . . . .3234Primary Sampling Units (PSUs) . . . . . . . . . . . . . . . . . . . . . . . .3235Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3235Stratum Collapse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3236Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3236Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3237Computational Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3237Output Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3241Displayed Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3241ODS Table Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3245

EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3246Example 62.1 Simple Random Sampling. . . . . . . . . . . . . . . . . . . .3246Example 62.2 Simple Random Cluster Sampling. . . . . . . . . . . . . . .3248Example 62.3 Regresson Estimator for Simple Random Sample . .. . . . . 3251Example 62.4 Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . .3252Example 62.5 Regression Estimator for Stratified Sample. . . . . . . . . . .3260

3216 � Chapter 62. The SURVEYREG Procedure

Example 62.6 Stratum Collapse . . . . . . . . . . . . . . . . . . . . . . . .3264Example 62.7 Domain Analysis . . . . . . . . . . . . . . . . . . . . . . . .3269

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3271

SAS OnlineDoc: Version 8

Chapter 62The SURVEYREG Procedure

Overview

The SURVEYREG procedure performs regression analysis for sample survey data.This procedure can handle complex survey sample designs, including designs withstratification, clustering, and unequal weighting. The procedure fits linear models forsurvey data and computes regression coefficients and their variance-covariance ma-trix. The procedure also provides significance tests for the model effects and for anyspecified estimable linear functions of the model parameters. Using the regressionmodel, the procedure can compute predicted values for the sample survey data.

PROC SURVEYREG computes the regression coefficient estimators by generalizedleast squares estimation using element-wise regression. The procedure assumes thatthe regression coefficients are the same across strata and primary sampling units(PSUs). To estimate the variance-covariance matrix for the regression coefficients,PROC SURVEYREG uses the Taylor expansion theory for estimating sampling er-rors of estimators based on complex sample designs (Woodruff 1971; Fuller 1975;S�arndal, Swenson, and Wretman 1992, Chapter 5 and Chapter 13). This method ob-tains a linear approximation for the estimator and then uses the variance estimator forthis approximation to estimate the variance of the estimator itself.

PROC SURVEYREG uses ODS (Output Delivery System) to place results in outputdata sets. This is a departure from older SAS procedures that provide OUTPUTstatements for similar functionality.

Getting Started

This section demonstrates how you can use PROC SURVEYREG to perform a re-gression analysis for sample survey data. For a complete description of the usageof PROC SURVEYREG, see the section “Syntax” on page 3225. The “Examples”section on page 3246 provides more detailed examples that illustrate the applicationsof PROC SURVEYREG.

Simple Random Sampling

Suppose that, in a junior high school, there are a total of 4,000 students in grades 7,8, and 9. You want to know how household income and the number of children in ahousehold affect students’ average weekly spending for ice cream.


In order to answer this question, you draw a sample using simple random samplingfrom the student population in the junior high school. You randomly select 40 stu-dents and ask them their average weekly expenditure for ice cream, their householdincome, and the number of children in their household. The answers from the 40students are saved as a SAS data set.

data IceCream;input Grade Spending Income Kids @@;datalines;

7 7 39 2 7 7 38 1 8 12 47 19 10 47 4 7 1 34 4 7 10 43 27 3 44 4 8 20 60 3 8 19 57 47 2 35 2 7 2 36 1 9 15 51 18 16 53 1 7 6 37 4 7 6 41 27 6 39 2 9 15 50 4 8 17 57 38 14 46 2 9 8 41 2 9 8 41 19 7 47 3 7 3 39 3 7 12 50 27 4 43 4 9 14 46 3 8 18 58 49 9 44 3 7 2 37 1 7 1 37 27 4 44 2 7 11 42 2 9 8 41 28 10 42 2 8 13 46 1 7 2 40 39 6 45 1 9 11 45 4 7 2 36 17 9 46 1;

In the data setIceCream, the variableGrade indicates a student’s grade. The vari-ableSpending contains the dollar amount of each student’s average weekly spendingfor ice cream. The variableIncome specifies the household income, in thousands ofdollars. The variableKids indicates how many children are in a student’s family.

The following PROC SURVEYREG statements request a regression analysis.

title1 ’Ice Cream Spending Analysis’;title2 ’Simple Random Sampling Design’;proc surveyreg data=IceCream total=4000;

class Kids;model Spending = Income Kids / solution;

run;

The PROC SURVEYREG statement invokes the procedure. The TOTAL=4000 op-tion specifies the total in the population from which the sample is drawn. The CLASSstatement requests that the procedure use the variableKids as a classification variablein the analysis. The MODEL statement describes the linear model that you want tofit, with Spending as the dependent variable andIncome andKids as the indepen-dent variables. The SOLUTION option in the MODEL statement requests that theprocedure output the regression coefficient estimates.


Simple Random Sampling � 3219

Ice Cream Spending AnalysisSimple Random Sampling Design

The SURVEYREG Procedure

Regression Analysis for Dependent Variable Spending

Data Summary

Number of Observations 40Mean of Spending 8.75000Sum of Spending 350.00000

Fit Statistics

R-square 0.8132Root MSE 2.4506Denominator DF 39

Class Level Information

ClassVariable Levels Values

Kids 4 1 2 3 4

Figure 62.1. Summary of Data

Figure 62.1 displays the summary of the data, the summary of the fit, and the levels ofthe classification variableKids. The “Fit Summary” table displays the denominatordegrees of freedom, which are used inF tests andt tests in the regression analysis.




ANOVA for Dependent Variable Spending

Sum of MeanSource DF Squares Square F Value Pr > F

Model 4 915.310 228.8274 38.10 <.0001Error 35 210.190 6.0054Corrected Total 39 1125.500

Tests of Model Effects

Effect Num DF F Value Pr > F

Model 4 119.15 <.0001Intercept 1 153.32 <.0001Income 1 324.45 <.0001Kids 3 0.92 0.4385

NOTE: The denominator degrees of freedom for the F tests is 39.

Figure 62.2. Testing Effects in the Regression



Figure 62.2 displays the ANOVA table for the regression and the tests for modeleffects. The effectIncome is significant in the linear regression model, while theeffectKids is not significant at the 5% level.




Estimated Regression Coefficients

StandardParameter Estimate Error t Value Pr > |t|

Intercept -26.084677 2.46720403 -10.57 <.0001Income 0.775330 0.04304415 18.01 <.0001Kids 1 0.897655 1.12352876 0.80 0.4292Kids 2 1.494032 1.24705263 1.20 0.2381Kids 3 -0.513181 1.33454891 -0.38 0.7027Kids 4 0.000000 0.00000000 . .

NOTE: The denominator degrees of freedom for the t tests is 39.Matrix X’X is singular and a generalized inverse was used to solve thenormal equations. Estimates are not unique.

Figure 62.3. Regression Coefficients

The regression coefficient estimates and their standard errors and associatedt testsare displayed in Figure 62.3.

Stratified Sampling

Suppose that the previous student sample is actually drawn from a stratified sampling.The strata are grades in the junior high school: the 7th grade, the 8th grade, and the9th grade. Within strata, simple random samples are selected. Table 62.1 providesthe number of students in each grade.

Table 62.1. Students in Grades

Grade Number of Students7 1,8248 1,0253 1,151

Total 4,000

In order to analyze this sample using PROC SURVEYREG, you need to input thestratification information by creating a SAS data set for Table 62.1. The followingSAS statements create a data set calledStudentTotal.


Stratified Sampling � 3221

data StudentTotal;input Grade _TOTAL_;datalines;

7 18248 10259 1151;

The variableGrade is the stratification variable, and the variable–TOTAL– containsthe total numbers of students in the strata in the survey population. PROC SUR-VEYREG requires you to use the keyword–TOTAL– as the name of the variablethat contains the population total information.

The following statements demonstrate how you can fit the linear model while incor-porating the sample design information (stratification).

title1 ’Ice Cream Spending Analysis’;title2 ’Stratified Simple Random Sampling Design’;proc surveyreg data=IceCream total=StudentTotal;

strata Grade /list;class Kids;model Spending = Income Kids / solution;run;

By comparing these statements to those in the section “Simple Random Sampling”on page 3217, the TOTAL=StudentTotal option replaces the previous TOTAL=4000option. When the population totals and sample sizes differ among strata, the popula-tion totals must be provided by a data set.

The STRATA statement specifies the stratification variableGrade. The LIST optionin the STRATA statement requests that the stratification information be included inthe output.



Ice Cream Spending AnalysisStratified Simple Random Sampling Design



Data Summary

Number of Observations 40Mean of Spending 8.75000Sum of Spending 350.00000

Design Summary

Number of Strata 3

Fit Statistics


Figure 62.4. Summary of the Regression

Figure 62.4 summarizes the data information, the sample design information, andthe fit information. Note that, due to the stratification, the denominator degrees offreedom forF tests andt tests is 37, which is different from the analysis in Figure62.1.




Stratum Information

Stratum Population SamplingIndex Grade N Obs Total Rate

1 7 20 1824 0.012 8 9 1025 0.013 9 11 1151 0.01

Class Level Information

ClassVariable Levels Values

Kids 4 1 2 3 4

Figure 62.5. Stratification and Classification Information


Stratified Sampling � 3223

Figure 62.5 displays the identifications of strata, numbers of observations or samplesizes in strata, total numbers of students in strata, and calculated sampling rates orsampling fractions in strata.




ANOVA for Dependent Variable Spending


Model 4 915.310 228.8274 38.10 <.0001Error 35 210.190 6.0054Corrected Total 39 1125.500



Model 4 114.60 <.0001Intercept 1 150.05 <.0001Income 1 317.63 <.0001Kids 3 0.93 0.4355


Figure 62.6. Testing Effects

Figure 62.6 displays the ANOVA table for the regression and the tests for the sig-nificance of model effects under the stratified sample design. Theincome effect issignificant, while thekids effect is not significant at the 5% level.






Intercept -26.084677 2.48241893 -10.51 <.0001Income 0.775330 0.04350401 17.82 <.0001Kids 1 0.897655 1.11778377 0.80 0.4271Kids 2 1.494032 1.25209199 1.19 0.2404Kids 3 -0.513181 1.36853454 -0.37 0.7098Kids 4 0.000000 0.00000000 . .

NOTE: The denominator degrees of freedom for the t tests is 37.Matrix X’X is singular and a generalized inverse was used to solve thenormal equations. Estimates are not unique.

Figure 62.7. Regression Coefficients



The regression coefficient estimates for the stratified sample are displayed in Figure62.7. The standard errors of the estimates and associatedt tests are also shown in thistable.

You can request other statistics and tests using PROC SURVEYREG. You can alsoanalyze data from a more complex sample design. The remainder of this chapterprovides more detailed information.

Create Output Data Set

PROC SURVEYREG uses the Output Delivery System (ODS) to create output datasets. This is a departure from older SAS procedures that provide OUTPUT statementsfor similar functionality. For more information on ODS, see Chapter 15, “Using theOutput Delivery System.”

For example, to place the “ParameterEstimates” table (Figure 62.7) in the previoussection in an output data set, you use the ODS OUTPUT statement as follows.


strata Grade /list;class Kids;model Spending = Income Kids / solution;ods output ParameterEstimates = MyParmEst;run;

The statement

ods output ParameterEstimates = MyParmEst;

requests that the “ParameterEstimates” table that appears in Figure 62.7 to be placedin a SAS data set namedMyParmEst.

The PRINT procedure displays observations of the data setMyParmEst.

proc print data=MyParmEst;run;

Figure 62.8 displays the observations in the data setMyParmEst.


Obs Parameter Estimate StdErr DenDF tValue Probt

1 Intercept -26.084677 2.48241893 37 -10.51 <.00012 Income 0.775330 0.04350401 37 17.82 <.00013 Kids 1 0.897655 1.11778377 37 0.80 0.42714 Kids 2 1.494032 1.25209199 37 1.19 0.24045 Kids 3 -0.513181 1.36853454 37 -0.37 0.70986 Kids 4 0.000000 0.00000000 37 . .

Figure 62.8. The Data Set MyParmEst


PROC SURVEYREG Statement � 3225

Syntax

The following statements are available in PROC SURVEYREG.

PROC SURVEYREG < options > ;BY variables ;CLASS variables ;CLUSTER variables ;CONTRAST ’label’ effect values

< : : : effect values > < / options > ;ESTIMATE ’label’ effect values

< : : : effect values > < / options > ;MODEL dependent = < effects > < / options > ;STRATA variables < / options > ;WEIGHT variable ;

The PROC SURVEYREG and MODEL statements are required. If your model con-tains classification effects, you must list the classification variables in a CLASS state-ment, and the CLASS statement must precede the MODEL statement. If you use aCONTRAST statement or an ESTIMATE statement, the MODEL statement mustprecede the CONTRAST or ESTIMATE statement.

The CONTRAST and ESTIMATE statements can appear multiple times; all otherstatements can appear only once.

PROC SURVEYREG Statement

PROC SURVEYREG < options >;

The PROC SURVEYREG statement invokes the procedure. You can specify thefollowing options in the PROC SURVEYREG statement.

ALPHA=�sets the confidence level for confidence limits. The value of the ALPHA= option mustbe between 0.0001 and 0.9999, and the default value is 0.05. A confidence level of� produces100(1 � �)% confidence limits. The default of ALPHA=0.05 produces95% confidence limits. If� is between 0 and 1 but outside the range of 0.0001 to0.9999, the procedure uses the closest range endpoint. For example, if you specifyALPHA=0.000001, the procedure uses 0.0001 to determine confidence limits.

DATA=SAS-data-setspecifies the SAS data set to be analyzed by PROC SURVEYREG. If you omit theDATA= option, the procedure uses the most recently created SAS data set.



NOPRINTsuppresses the display of all output. Note that this option temporarily disables theOutput Delivery System (ODS). For more information, see Chapter 15, “Using theOutput Delivery System.”

RATE=value jSAS-data-setR=value jSAS-data-set

specifies the sampling rate as a positivevalue, or specifies an input data set that con-tains the stratum sampling rates. The procedure uses this information to compute afinite population correction for variance estimation. If your sample design has multi-ple stages, you should specify thefirst-stage sampling rate, which is the ratio of thenumber of PSUs selected to the total number of PSUs in the population.

For a nonstratified sample design, or for a stratified sample design with the samesampling rate in all strata, you should specify a positivevalue for the RATE= op-tion. If your design is stratified with different sampling rates in the strata, then youshould name a SAS data set that contains the stratification variables and the samplingrates. See the section “Specification of Population Totals and Sampling Rates” onpage 3234 for details.

Thevalue in the RATE= option or the values of–RATE– in the secondary data setmust be positive numbers. You can specifyvalue as a number between 0 and 1. Oryou can specifyvalue in percentage form as a number between 1 and 100, and PROCSURVEYREG will convert that number to a proportion. The procedure treats thevalue 1 as 100%, and not the percentage form 1%.

If you do not specify the TOTAL= option or the RATE= option, then the varianceestimation does not include a finite population correction. You cannot specify boththe TOTAL= option and the RATE= option.

TOTAL=value jSAS-data-setN=value jSAS-data-set

specifies the total number of primary sampling units in the study population as apositivevalue, or specifies an input data set that contains the stratum population totals.The procedure uses this information to compute a finite population correction forvariance estimation.

For a nonstratified sample design, or for a stratified sample design with the samepopulation total in all strata, you should specify a positivevalue for the TOTAL=option. If your sample design is stratified with different population totals in the strata,then you should name a SAS data set that contains the stratification variables and thepopulation totals. See the section “Specification of Population Totals and SamplingRates” on page 3234 for details.

If you do not specify the TOTAL= option or the RATE= option, then the varianceestimation does not include a finite population correction. You cannot specify boththe TOTAL= option and the RATE= option.


CLASS Statement � 3227

BY Statement

BY variables ;

You can specify a BY statement with PROC SURVEYREG to obtain separate analy-ses on observations in groups defined by the BY variables.

Note that using a BY statement provides completely separate analyses of the BYgroups. It does not provide a statistically valid subpopulation or domain analysis,where the total number of units in the subpopulation is not known with certainty. Formore information on subpopulation analysis for sample survey data, refer to Cochran(1977).

When a BY statement appears, the procedure expects the input data sets to be sorted inorder of the BY variables. If you specify more than one BY statement, the procedureuses only the latest BY statement and ignores any previous ones.

If your input data set is not sorted in ascending order, use one of the following alter-natives:

� Sort the data using the SORT procedure with a similar BY statement.

� Specify the BY statement option NOTSORTED or DESCENDING in the BYstatement for the SURVEYREG procedure. The NOTSORTED option doesnot mean that the data are unsorted but rather that the data are arranged ingroups (according to values of the BY variables) and that these groups are notnecessarily in alphabetical or increasing numeric order.

� Create an index on the BY variables using the DATASETS procedure.

For more information on the BY statement, refer to the discussion inSAS LanguageReference: Concepts. For more information on the DATASETS procedure, refer tothe discussion in theSAS Procedures Guide.

CLASS Statement

CLASS jCLASSES variables ;

The CLASS statement specifies the classification variables to be used in themodel. Typical class variables are TREATMENT, GENDER, RACE, GROUP, andREPLICATION. If you specify the CLASS statement, it must appear before theMODEL statement.

Classification variables can be either character or numeric. The procedure uses onlythe first 16 characters of a character variable. Class levels are determined from theformatted values of the CLASS variables. Thus, you can use formats to group valuesinto levels. Refer to the discussion of the FORMAT procedure in theSAS Procedures



Guide and to the discussions of the FORMAT statement and SAS formats inSASLanguage Reference: Concepts.

You can use multiple CLASS statements to specify classification variables.

CLUSTER Statement

CLUSTER jCLUSTERS variables ;

The CLUSTER statement specifies variables that identify clusters in a clustered sam-ple design. The combinations of categories of CLUSTER variables define the clustersin the sample. If there is a STRATA statement, clusters are nested within strata.

If your sample design has clustering at multiple stages, you should identify only thefirst-stage clusters, or primary sampling units (PSUs), in the CLUSTER statement.

The CLUSTERvariablesare one or more variables in the DATA= input data set.These variables can be either character or numeric. The formatted values of theCLUSTER variables determine the CLUSTER variable levels. Thus, you can useformats to group values into levels. Refer to the discussion of the FORMAT proce-dure in theSAS Procedures Guideand to the discussions of the FORMAT statementand SAS formats inSAS Language Reference: Dictionary.

You can use multiple CLUSTER statements to specify cluster variables. The proce-dure uses variables from all CLUSTER statements to create clusters.

CONTRAST Statement

CONTRAST ’label’ effect values < / options > ;CONTRAST ’label’ effect values < : : : effect values > < / options > ;

The CONTRAST statement provides custom hypothesis tests for linear combinationsof the regression parametersH0:L� = 0, whereL is the vector or matrix you specifyand� is the vector of regression parameters. Thus, to use this feature, you must befamiliar with the details of the model parameterization used by PROC SURVEYREG.For information on the parameterization, see the section “Parameterization of PROCGLM Models” on page 1521 in Chapter 30, “The GLM Procedure.”.

Each term in the MODEL statement, called aneffect, is a variable or a combinationof variables. You can specify an effect with a variable name or a special notationusing variable names and operators. For more details on how to specify an effect,see the section “Specification of Effects” on page 1517 in Chapter 30, “The GLMProcedure.”.

For each CONTRAST statement, PROC SURVEYREG computes Wald’sF test. Theprocedure displays this value with the degrees of freedom, and identifies it with thecontrast label. The numerator degrees of freedom for Wald’sF test equals rank(L).


CONTRAST Statement � 3229

The denominator degrees of freedom equals the number of clusters (or the num-ber of observations if there is no CLUSTER statement) minus the number of strata.Alternatively, you can use the DF= option in the MODEL statement to specify thedenominator degrees of freedom.

You can specify any number of CONTRAST statements, but they must appear afterthe MODEL statement.

In the CONTRAST statement,

label identifies the contrast in the output. A label is required forevery contrast specified. Labels must be enclosed in singlequotes.

effect identifies an effect that appears in the MODEL statement. Youcan use the INTERCEPT keyword as an effect when an inter-cept is fitted in the model. You do not need to include alleffects that are in the MODEL statement.

values are constants that are elements ofL associated with the effect.

You can specify the following options in the CONTRAST statement after a slash (/).

Edisplays the entire coefficientL vector or matrix.

NOFILLrequests no filling in higher-order effects. When you specify only certain portions ofL, by default PROC SURVEYREG constructs the remaining elements from the con-text (for more information, the section “Specification of ESTIMATE Expressions” onpage 1536 in Chapter 30, “The GLM Procedure.”).

When you specify the NOFILL option, PROC SURVEYREG does not construct theremaining portions and treats the vector or matrixL as it is defined in the CON-TRAST statement.

SINGULAR=valuespecifies the sensitivity for checking estimability. Ifv is a vector, define ABS(v)to be the largest absolute value of the elements ofv. SayH is the (X0X)�X0X

matrix, andC is ABS(L) except for elements ofL that equal 0, and thenC is1. If ABS(L� LH) > C�value, thenL is declared nonestimable. The SINGU-LAR=value must be between 0 and 1, and the default is10�4.

As stated previously, the CONTRAST statement enables you to perform hypothesistestsH0:L� = 0.



If theLmatrix contains more than one contrast, then you can separate the rows of theL matrix with commas. For example, for the model

proc surveyreg;class A B;model Y=A B;

run;

with A at 5 levels andB at 2 levels, the parameter vector is

(� �1 �2 �3 �4 �5 �1 �2)

To test the hypothesis that the pooledA linear andA quadratic effect is zero, you canuse the followingL matrix:

L =

�0 �2 �1 0 1 2 0 0

0 2 �1 �2 �1 2 0 0

�The corresponding CONTRAST statement is

contrast ’A Linear & Quadratic’a -2 -1 0 1 2,a 2 -1 -2 -1 2;

ESTIMATE Statement

ESTIMATE ’label’ effect values < / options > ;ESTIMATE ’label’ effect values < : : : effect values > < / options > ;

You can use an ESTIMATE statement to estimate a linear function of the regressionparameters by multiplying a row vectorL by the parameter estimate vectorb�.

Each term in the MODEL statement, called aneffect, is a variable or a combinationof variables. You can specify an effect with a variable name or with a special notationusing variable names and operators. For more details on how to specify an effect,see the section “Specification of Effects” on page 1517 in Chapter 30, “The GLMProcedure.”.

PROC SURVEYREG checks the linear function for estimability. (See the SINGU-LAR= option described on page 3231).

The procedure displays the estimateLb� along with its standard error andt test. Ifyou specify the CLPARM option in the MODEL statement, PROC SURVEYREGalso displays confidence limits for the linear function. By default, the degrees offreedom for thet test equals the number of clusters (or the number of observations ifthere is no CLUSTER statement) minus the number of strata. Alternatively, you canspecify the degrees of freedom with the DF= option in the MODEL statement.


ESTIMATE Statement � 3231

You can specify any number of ESTIMATE statements, but they must appear afterthe MODEL statement.

In the ESTIMATE statement,

label identifies the linear functionL in the output. A label is re-quired for every function specified. Labels must be enclosedin single quotes.

effect identifies an effect that appears in the MODEL statement. Youcan use the INTERCEPT keyword as an effect when an inter-cept is fitted in the model. You do not need to include alleffects that are in the MODEL statement.

values values are constants that are elements of the vectorL associ-ated with the effect. For example, the following code formsan estimate that is the difference between the parameters es-timated for the first and second levels of the CLASS variableA.

estimate ’A1 vs A2’ A 1 -1;

You can specify the following options in the ESTIMATE statement after a slash (/).

DIVISOR=valuespecifies a value by which to divide all coefficients so that fractional coefficients canbe entered as integers. For example, note the difference between the following twoESTIMATE statements.

estimate ’1/3(A1+A2) - 2/3A3’ a 1 1 -2 / divisor=3;estimate ’1/3(A1+A2) - 2/3A3’ a .33333 .33333 -.66667;

Edisplays the entire coefficient vectorL.

NOFILLrequests no filling in higher-order effects. When you specify only certain portionsof the vectorL, by default PROC SURVEYREG constructs the remaining elementsfrom the context. (See the section “Specification of ESTIMATE Expressions” onpage 1536 in Chapter 30, “The GLM Procedure.”) When you specify the NOFILLoption, PROC SURVEYREG does not construct the remaining portions and treats thevectorL as it is defined in the ESTIMATE statement.

SINGULAR=valuespecifies the sensitivity for checking estimability. Ifv is a vector, define ABS(v)to be the largest absolute value of the elements ofv. SayH is the (X0X)�X0X

matrix, andC is ABS(L) except for elements ofL that equal 0, and thenC is 1.If ABS(L� LH) > C�value, thenL is declared nonestimable. The SINGULAR=valuemust be between 0 and 1, and the default is10�4.



MODEL Statement

MODEL dependent = < effects >< / options >;

The MODEL statement specifies the dependent (response) variable and the indepen-dent (regressor) variables or effects. Each term in a MODEL statement, called aneffect, is a variable or a combination of variables. You can specify an effect with avariable name or with a special notation using variable names and operators. For moreinformation on how to specify an effect, see the section “Specification of Effects” onpage 1517 in Chapter 30, “The GLM Procedure.” The dependent variable must benumeric. Only one MODEL statement is allowed for each PROC SURVEYREGstatement. If you specify more than one MODEL statement, the procedure uses thefirst model and ignores the rest.

You can specify the following options in the MODEL statement after a slash (/).

COVBdisplays the estimated covariance matrix of the estimated regression estimates.

DF=valuespecifies the denominator degrees of freedom for theF tests and the degrees of free-dom for thet tests. The default is the number of clusters (or the number of obser-vations if there is no CLUSTER statement) minus the number of actual strata. Thenumber of actual strata equals the number of strata in the data before collapsing minusthe number of strata collapsed plus 1.

CLPARMrequests confidence limits for the parameter estimates. The SURVEYREG proce-dure determines the confidence coefficient using the ALPHA= option described onpage 3225, which by default equals 0.05 and produces 95% confidence bounds. TheCLPARM option also requests confidence limits for all the estimable linear functionsof regression parameters in the ESTIMATE statements.

Note that when there is a CLASS statement, you need to use the SOLUTION op-tion with the CLPARM option to obtain the parameter estimates and their confidencelimits. See the SOLUTION option on page 3233.

DEFFdisplays design effects for the regression coefficient estimates.

NOINTomits the intercept from the model.

IINVERSE

displays the inverse or the generalized inverse of theX0X matrix. When there isa WEIGHT variable, the procedure displays the inverse or the generalized inverseof theX0WX matrix, whereW is the diagonal matrix constructed from WEIGHTvariable values.


STRATA Statement � 3233

XXPX

displays theX0X matrix, or theX0WX matrix when there is a WEIGHT variable,whereW is the diagonal matrix constructed from WEIGHT variable values. The Xoption also displays the crossproducts vectorX0y, orX0Wy.

SOLUTIONdisplays a solution to the normal equations, which are the parameter estimates. TheSOLUTION option is useful only when you use a CLASS statement. If you do notspecify a CLASS statement, PROC SURVEYREG displays parameter estimates bydefault. But if you specify a CLASS statement, PROC SURVEYREG does not dis-play parameter estimates unless you also specify the SOLUTION option.

STRATA Statement

STRATA jSTRATUM variables < / options > ;

The STRATA statement specifies variables that form the strata in a stratified sampledesign. The combinations of categories of STRATA variables define the strata in thesample.

If your sample design has stratification at multiple stages, you should identify onlythe first-stage strata in the STRATA statement. See the section “Specification ofPopulation Totals and Sampling Rates” on page 3234 for more information.

The STRATAvariablesare one or more variables in the DATA= input data set. Thesevariables can be either character or numeric. The procedure uses only the first 16characters of the value of a character variable. The formatted values of the STRATAvariables determine the levels. Thus, you can use formats to group values into levels.Refer to the discussion of the FORMAT procedure in theSAS Procedures Guide.

You can use multiple STRATA statements to specify stratum variables.

You can specify the following options in the STRATA statement after a slash (/).

LISTdisplays a “Stratum Information” table, which includes values of the STRATA vari-ables, and the number of observations, number of clusters, population total, and sam-pling rate for each stratum. This table also displays stratum collapse information.

NOCOLLAPSEprevents the procedure from collapsing, or combining, strata that have only one sam-pling unit. By default, the procedure collapses strata that contain only one samplingunit. See the section “Stratum Collapse” on page 3236 for details.



WEIGHT Statement

WEIGHT jWGT variable ;

The WEIGHT statement specifies the variable that contains the sampling weights.This variable must be numeric. If you do not specify a WEIGHT statement, PROCSURVEYREG assigns all observations a weight of 1. Sampling weights must bepositive numbers. If an observation has a weight that is nonpositive or missing, thenthe procedure omits that observation from the analysis. If you specify more thanone WEIGHT statement, the procedure uses only the first WEIGHT statement andignores the rest.

Details

Specification of Population Totals and Sampling Rates

If your analysis should include a finite population correction (fpc), you can input ei-ther the sampling rate or the population total using the RATE= option or the TOTAL=option. You cannot specify both of these options in the same PROC SURVEYREGstatement. If you do not specify one of these options, the procedure does not usethe fpc when computing variance estimates. For fairly small sampling fractions, it isappropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

If your design has multiple stages of selection and you are specifying the RATE=option, you should input the first-stage sampling rate, which is the ratio of the numberof PSUs in the sample to the total number of PSUs in the study population. If youare specifying the TOTAL= option for a multistage design, you should input the totalnumber of PSUs in the study population.

For a nonstratified sample design, or for a stratified sample design with the same sam-pling rate or the same population total in all strata, you should use the RATE=valueoption or the TOTAL=value option. If your sample design is stratified with dif-ferent sampling rates or population totals in the strata, then you can use theRATE=SAS-data-setoption or the TOTAL=SAS-data-setoption to name a SAS dataset that contains the stratum sampling rates or totals. This data set is called asec-ondary data set, as opposed to theprimary data setthat you specify with the DATA=option.

The secondary data set must contain all the stratification variables listed in theSTRATA statement and all the variables in the BY statement. If there are for-mats associated with the STRATA variables and the BY variables, then the for-mats must be consistent in the primary and the secondary data sets. If you spec-ify the TOTAL=SAS-data-setoption, the secondary data set must have a variablenamed–TOTAL– that contains the stratum population totals. Or if you specifythe RATE=SAS-data-setoption, the secondary data set must have a variable named

–RATE– that contains the stratum sampling rates.


Missing Values � 3235

The secondary data set must contain all BY and STRATA groups that occur in theprimary data set. If the secondary data set contains more than one observation forany one stratum, then the procedure uses the first value of–TOTAL– or –RATE– forthat stratum and ignores the rest.

Thevalue in the RATE= option, or the values of–RATE– in the secondary data set,must be positive numbers. You can specify a sampling rate as a number between 0and 1. Or you can specify a sampling rate in percentage form as a number between1 and 100, and PROC SURVEYREG will convert that number to a proportion. Theprocedure treats the value 1 as 100%, and not the percentage form 1%.

If you specify the TOTAL=valueoption,valuemust not be less than the sample size.If you provide stratum population totals in a secondary data set, these values must notbe less than the corresponding stratum sample sizes.

Primary Sampling Units (PSUs)

When you have clusters, or primary sampling units (PSUs), in your sample design,the procedure estimates variance from the variation among PSUs. For more informa-tion, see the section “Variance Estimation” on page 3239. You can use the CLUSTERstatement to identify the first stage clusters in your design. PROC SURVEYREG as-sumes that each cluster represents a PSU in the sample and that each observation isan element of a PSU. If you do not specify a CLUSTER statement, the proceduretreats each observation as a PSU.

Missing Values

If an observation has a missing value or a nonpositive value for the WEIGHT variable,then PROC SURVEYREG excludes that observation from the analysis. An observa-tion is also excluded if it has a missing value for any STRATA variable, CLUSTERvariable, dependent variable, or any variable used in the independent effects. Theanalysis includes all observations in the data set that have nonmissing values for allthese design and analysis variables.

If you have missing values in your survey data for any reason (such as nonresponse),this can compromise the quality of your survey results. If the respondents are differentfrom the nonrespondents with regard to a survey effect or outcome, then survey esti-mates will be biased and will not accurately represent the survey population. Thereare a variety of techniques in sample design and survey operations that can reducenonresponse. Once data collection is complete, you can use imputation to replacemissing values with acceptable values, and you can use sampling weight adjustmentsto compensate for nonresponse. You should complete this data preparation and ad-justment before you analyze your data with PROC SURVEYREG. Refer to Cochran(1977) for more details.



Stratum Collapse

If there is only one sampling unit in a stratum, then PROC SURVEYREG cannot es-timate the variance for this stratum. To estimate stratum variances, by default the pro-cedure collapses, or combines, those strata that contain only one sampling unit. If youspecify the NOCOLLAPSE option in the STRATA statement, PROC SURVEYREGdoes not collapse strata and uses a variance estimate of 0 for any stratum that containsonly one sampling unit.

If you do not specify the NOCOLLAPSE option, PROC SURVEYREG collapsesstrata according to the following rules. If there are multiple strata that each containonly one sampling unit, then the procedure collapses, or combines, all these stratainto a new pooled stratum. If there is only one stratum with a single sampling unit,then PROC SURVEYREG collapses that stratum with the preceding stratum, wherestrata are ordered by the STRATA variable values. If the stratum with one samplingunit is the first stratum, then the procedure combines it with the following stratum.

If you specify stratum sampling rates using the RATE=SAS-data-setoption, PROCSURVEYREG computes the sampling rate for the new pooled stratum as theweighted average of the sampling rates for the collapsed strata. See the section“Computational Method” on page 3237 for details. If the specified sampling rateequals 0 for any of the collapsed strata, then the pooled stratum is assigned a sam-pling rate of 0. If you specify stratum totals using the TOTAL=SAS-data-setoption,PROC SURVEYREG combines the totals for the collapsed strata to compute thesampling rate for the new pooled stratum.

Analysis of Variance

PROC SURVEYREG produces an analysis of variance table for the model specifiedin the MODEL statement. This table is identical to the one produced by the GLMprocedure for the model. PROC SURVEYREG computes ANOVA table entries usingthe sampling weights, but not the sample design information on stratification andclustering.

The degrees of freedom (DF) displayed in the ANOVA table are the same as thosein the ANOVA table produced by PROC GLM. The Total DF is the total degrees offreedom used to obtain the regression coefficient estimates. The Total DF equals thetotal number of observations minus 1 if the model includes an intercept. If the modeldoes not include an intercept, the Total DF equals the total number of observations.The Model DF equals the degrees of freedom for the effects in the MODEL statement,not including the intercept. The Error DF equals the total DF minus the model DF.


Computational Method � 3237

Degrees of Freedom

PROC SURVEYREG produces tests for the significance of model effects, regressionparameters, estimable functions specified in the ESTIMATE statement, and contrastsspecified in the CONTRAST statement. It computes all these tests taking into accountthe sample design. The degrees of freedom for these tests differ from the degrees offreedom for the ANOVA table, which does not consider the sample design.

Denominator Degrees of FreedomThe denominator DF refers to the denominator degrees of freedom forF tests andto the degrees of freedom fort tests in the analysis. By default, the denominatorDF equals the number of clusters minus the actual number of strata. If there areno clusters, the denominator DF equals the number of observations minus the actualnumber of strata. Theactual number of strataequals

� one, if there is no STRATA statement

� the number of strata in the sample, if there is a STRATA statement but theprocedure does not collapse any strata

� the number of strata in the sample after collapsing, if there is a STRATA state-ment and the procedure collapses strata that have only one sampling unit

Alternatively, you can specify the denominator DF using the DF= option in theMODEL statement.

Numerator Degrees of FreedomThe numerator DF refers to the numerator degrees of freedom for the WaldF statisticassociated with an effect or with a contrast. The procedure computes the WaldFstatistic for an effect as a Type III test; that is, the test has the following properties:

� The hypothesis for an effect does not involve parameters of other effects exceptfor containing effects (which it must involve to be estimable).

� The hypotheses to be tested are invariant to the ordering of effects in the model.

See the section “Testing Effects” on page 3239 for more information. The numeratorDF for the WaldF statistic for a contrast is the rank of theL matrix that defines thecontrast.

Computational Method

For a stratified clustered sample design, observations are represented by ann�(p+2)

matrix(w;y;X) = (whij ; yhij;xhij)

where

� w denotes the sampling weight vector

� y denotes the dependent variable



� X denotes the design matrix. (When an effect contains only classification vari-ables, the columns ofX corresponding to this effect contain only 0s and 1s; noreparameterization is made.)

� h = 1; 2; : : : ;H is the stratum number with a total ofH strata

� i = 1; 2; : : : ; nh is the cluster number within stratumh, with a total ofnhclusters

� j = 1; 2; : : : ;mhi is the unit number within clusteri of stratumh, with a totalof mhi units

� p is the total number of parameters (including an intercept if the INTERCEPTeffect is included in the MODEL statement)

� n =PH

h=1

Pnh

i=1mhi is the total number of observations in the sample

Also, fh denotes the sampling rate for stratumh. You can use the TOTAL= optionor the RATE= option to input population totals or sampling rates. See the section“Specification of Population Totals and Sampling Rates” on page 3234 for details.If you input stratum totals, PROC SURVEYREG computesfh as the ratio of thestratum sample size to the stratum total. If you input stratum sampling rates, PROCSURVEYREG uses these values directly forfh. If you do not specify the TOTAL=option or the RATE= option, then the procedure assumes that the stratum samplingratesfh are negligible, and a finite population correction is not used when computingvariances.

Regression CoefficientsPROC SURVEYREG solves the normal equationsX0WX� = X0Wy using a modi-fied sweep routine that produces a generalized (g2) inverse(X0WX)� and a solution(Pringle and Raynor 1971)

b� = (X0WX)�X0WY

whereW is the diagonal matrix constructed from WEIGHT variable values.

For models with class variables, there are more design matrix columns than there aredegrees of freedom (DF) for the effect. Thus, there are linear dependencies among thecolumns. In this case, the parameters are not estimable; there is an infinite number ofleast-squares solutions. PROC SURVEYREG uses a generalized (g2) inverse to ob-tain values for the estimates. The solution values are not displayed unless you specifythe SOLUTION option in the MODEL statement. The solution has the characteristicthat estimates are 0 whenever the design column for that parameter is a linear combi-nation of previous columns. (Strictly termed, the solution values should not be calledestimates.) With this full parameterization, hypothesis tests are constructed to testlinear functions of the parameters that are estimable.


Computational Method � 3239

Variance EstimationPROC SURVEYREG uses the Taylor series expansion theory to estimate thecovariance-variance matrix of the estimated regression coefficients (Fuller 1975). Let

r = y �Xb�where the(h; i; j)th element isrhij. Compute1� p row vectors

ehij = whijrhijxhij

ehi� =

mhiXj=1

ehij

�eh�� =1

nh

nhXi=1

ehi�

and calculate thep� p matrix

G =n� 1

n� p

HXh=1

nh(1� fh)

nh � 1

nhXi=1

(ehi� � �eh��)0(ehi� � �eh��)

PROC SURVEYREG computes the covariance matrix of� as

bV = (X0WX)�G(X0WX)�

Testing EffectsFor each effect in the model, PROC SURVEYREG computes anL matrix such thatevery element ofL� is estimable; theL matrix has the maximum possible rank as-sociated with the effect. To test the effect, the procedure uses the WaldF statistic forthe hypothesisH0:L� = 0. The WaldF statistic equals

FWald =(Lb�)0(L0 bVL)�1(Lb�)

rank(L)

with numerator degrees of freedom equal torank(L) and denominator degrees offreedom equal to the number of clusters minus the number of strata (unless you havespecified the denominator degrees of freedom with the DF= option in the MODELstatement; see the section “Denominator Degrees of Freedom” on page 3237). It ispossible that theL matrix cannot be constructed for an effect, in which case thateffect is not testable. For more information on how the matrixL is constructed, seethe discussion in Chapter 12, “The Four Types of Estimable Functions.”

Multiple R-squaredPROC SURVEYREG computes a multiple R-squared for the weighted regression as

R2 = 1� SSerrorSStotal

whereSSerror is the error sum of squares in the ANOVA table

SSerror = r0Wr



andSStotal is the total sum of squares

SStotal =

8>><>>:y0Wy if no intercept

y0Wy �0@ HX

h=1

nhXi=1

mhiXj=1

whijyhij

1A2

= w�� if there is an intercept

wherew�� is the sum of the sampling weights over all observations.

Root Mean Square ErrorsPROC SURVEYREG computes the square root of mean square errors as

pMSE =

pn SSerror = (n� p) w��

wherew�� is the sum of the sampling weights over all observations.

Design EffectIf you specify the DEFF option in the MODEL statement, PROC SURVEYREG cal-culates the design effects for the regression coefficients. The design effect of an esti-mate is the ratio of the actual variance to the variance computed under the assumptionof simple random sampling.

DEFF =Variance under the Sample Design

Variance under Simple Random Sampling

Refer to Kish (1965, p.258). PROC SURVEYREG computes the numerator as de-scribed in the section “Variance Estimation” on page 3239. And the denominator iscomputed under the assumption that the sample design is simple random sampling,with no stratification and no clustering.

To compute the variance under the assumption of simple random sampling, PROCSURVEYREG calculates the sampling rate as follows. If you specify both samplingweights and sampling rates (or population totals) for the analysis, then the samplingrate under simple random sampling is calculated as

f SRS= n = w��

wheren is the sample size andw�� (the sum of the weights over all observations)estimates the population size. If the sum of the weights is less than the sample size,fSRS is set to zero. If you specify sampling rates for the analysis but not samplingweights, then PROC SURVEYREG computes the sampling rate under simple randomsampling as the average of the stratum sampling rates.

f SRS=1

H

HXh=1

fh

If you do not specify sampling rates (or population totals) for the analysis, then thesampling rate under simple random sampling is assumed to be zero.

f SRS= 0


Displayed Output � 3241

Sampling Rate of the Pooled Stratum from CollapseAssuming that PROC SURVEYREG collapses single-unit stratah1; h2; : : : ; hc intothe pooled stratum, the procedure calculates the sampling rate for the pooled stratumas

fPooled Stratum=

8><>:0 if any of fhl = 0 wherel = 1; 2; : : : ; c

cXl=1

nhlf�1

hl

!�1 cX

l=1

nhl otherwise

ContrastsYou can use the CONTRAST statement to perform custom hypothesis tests. If thehypothesis is testable in the univariate case, the WaldF statistic forH0 : L� = 0 iscomputed as

FWald =(LFullb�)0(LFull 0 bVLFull)�1(LFullb�)

rank(L)

whereL is the contrast vector or matrix you specify,� is the vector of regressionparameters,b� = (X0WX)�X0WY, bV is the estimated covariance matrix ofb�,rank(L) is the rank ofL , andLFull is a matrix such that

� LFull has the same number of columns asL

� LFull has full row rank

� the rank ofLFull equals the rank of theL matrix

� all rows ofLFull are estimable functions

� the WaldF statistic computed using theLFull matrix is equivalent to the WaldF statistic computed using theL matrix with any row deleted that is a linearcombination of previous rows

If L is a full-rank matrix, and all rows ofL are estimable functions, thenLFull is thesame asL . It is possible thatLFull matrix cannot be constructed for contrasts in aCONTRAST statement, in which case the contrasts are not testable.

Output Data Sets

Output data sets from PROC SURVEYREG are produced using the ODS (Output De-livery System). ODS encompasses more than just the production of output data sets.For more information on using ODS, see Chapter 15, “Using the Output DeliverySystem.”

Displayed Output

The SURVEYREG procedure produces the following output.

Data SummaryBy default, PROC SURVEYREG displays the following information in the “DataSummary” table.



� Number of Observations, which is the total number of observations used in theanalysis, excluding observations with missing values

� Sum of Weights, if you specify a WEIGHT statement

� Mean of the dependent variable in the MODEL statement, or Weighted Meanif you specify a WEIGHT statement

� Sum of the dependent variable in the MODEL statement, or Weighted Sum ifyou specify a WEIGHT statement

Design SummaryWhen you specify a CLUSTER statement or a STRATA statement, the proceduredisplays a “Design Summary” table, which provides the following sample designinformation.

� Number of Strata, if you specify a STRATA statement

� Number of Strata Collapsed, if the procedure collapses strata

� Number of Clusters, if you specify a CLUSTER statement

� Overall Sampling Rate used to calculate the design effect, if you specify theDEFF option in the MODEL statement

Fit SummaryBy default, PROC SURVEYREG displays the following regression statistics in the“Fit Summary” table.

� R-square for the regression

� Root MSE, which is the square root of the mean square error

� Denominator DF, which is the denominator degrees of freedom for theF testsand also the degrees of freedom for thet tests produced by the procedure

Stratum InformationWhen you specify the LIST option in the STRATA statement, PROC SURVEYREGdisplays a “Stratum Information” table, which provides the following information foreach stratum.

� Stratum Index, which is a sequential stratum identification number

� STRATA variable(s), which lists the levels of STRATA variables for the stra-tum

� Population Total, if you specify the TOTAL= option

� Sampling Rate, if you specify the TOTAL= option or the RATE= option. Ifyou specify the TOTAL= option, the sampling rate is based on the number ofnonmissing observations in the stratum.

� N Obs, which is the number of observations

� number of Clusters, if you specify a CLUSTER statement

� Collapsed, which has the value ’Yes’ if the stratum is collapsed with anotherstratum before analysis


Displayed Output � 3243

If PROC SURVEYREG collapses strata, the “Stratum Information” table also dis-plays stratum information for the new, collapsed stratum. The new stratum has aStratum Index of 0 and is labeled ‘Pooled’.

Class Level InformationIf you use a CLASS statement to name classification variables, PROC SURVEYREGdisplays a “Class Level Information” table. This table contains the following infor-mation for each classification variable:

� Class Variable, which lists each CLASS variable name

� Levels, which is the number of values or levels of the classification variable

� Values, which lists the values of the classification variable. The values areseparated by a white space character; therefore, to avoid confusion, you shouldnot include a white space character within a classification variable value.

X0X MatrixIf you specify the XPX option in the MODEL statement, PROC SURVEYREG dis-plays theX0Xmatrix, or theX0WXmatrix when there is a WEIGHT variable. Thisoption also displays the crossproducts vectorX0y orX0Wy, wherey is the responsevector (dependent variable).

Inverse Matrix of X0X

If you specify the INV option in the MODEL statement, PROC SURVEYREG dis-plays the inverse or the generalized inverse of theX0X matrix. When there is aWEIGHT variable, the procedure displays the inverse or the generalized inverse oftheX0WX matrix.

ANOVA for Dependent VariableBy default, PROC SURVEYREG displays an analysis of variance table for the de-pendent variable. This table is identical to the ANOVA table displayed by the GLMprocedure.

Tests of Model EffectsBy default, PROC SURVEYREG displays a “Tests of Model Effects” table, whichprovides Wald’sF test for each effect in the model. The table contains the followinginformation for each effect:

� Effect, which is the effect name

� Num DF, which is the numerator degrees of freedom for Wald’sF test

� F Value, which is Wald’sF statistic

� Pr > F, which is the significance probability corresponding to the F Value

A footnote displays the denominator degrees of freedom, which is the same for alleffects.

Estimated Regression CoefficientsPROC SURVEYREG displays the “Estimated Regression Coefficients” table by de-fault when there is no CLASS statement. Also, the procedure displays this tablewhen you specify a CLASS statement and also specify the SOLUTIONS option in



the MODEL statement. This table contains the following information for each re-gression parameter:

� Parameter, which identifies the effect or regressor variable

� Estimate, which is the estimate of the regression coefficient

� Standard Error, which is the standard error of the estimate

� t Value, which is thet statistic for testingH0:Parameter= 0

� Pr > | t |, which is the two-sided significance probability corresponding to the tValue

Covariance of Estimated Regression CoefficientsWhen you specify the COVB option in the MODEL statement, PROC SURVEYREGdisplays the “Covariance of Estimated Regression Coefficients” matrix.

Coefficients of ContrastWhen you specify the E option in a CONTRAST statement, PROC SURVEYREGdisplays a “Coefficients of Contrast” table for the contrast. You can use this table tocheck the coefficients you specified in the CONTRAST statement. Also, this tablegives a note for a nonestimable contrast.

Analysis of ContrastsIf you specify a CONTRAST statement, PROC SURVEYREG produces an “Analysisof Contrasts” table, which displays Wald’sF test for the contrast. If you use more thanone CONTRAST statement, the procedure displays all results in the same table. The“Analysis of Contrasts” table contains the following information for each contrast:

� Contrast, which is the label of the contrast

� Num DF, which is the numerator degrees of freedom for Wald’sF test

� F Value, which is Wald’sF statistic for testingH0:Contrast= 0

� Pr > F, which is the significance probability corresponding to the F Value

Coefficients of EstimateWhen you specify the E option in an ESTIMATE statement, PROC SURVEYREGdisplays a “Coefficients of Estimate” table for the linear function of the regressionparameters in the ESTIMATE statement. You can use this table to check the coeffi-cients you specified in the ESTIMATE statement. Also, this table gives a note for anonestimable function.

Analysis of Estimable FunctionsIf you specify an ESTIMATE statement, PROC SURVEYREG checks the functionfor estimability. If the function is estimable, PROC SURVEYREG produces an“Analysis of Estimable Functions” table, which displays the estimate and the cor-respondingt test. If you use more than one ESTIMATE statement, the proceduredisplays all results in the same table. The table contains the following informationfor each estimable function.


ODS Table Names � 3245

� Parameter, which is the label of the function

� Estimate, which is the estimate of the estimable liner function

� Standard Error, which is the standard error of the estimate

� t Value, which is thet statistic for testingH0:Estimable Function= 0

� Pr > | t |, which is the two-sided significance probability corresponding to the tValue

ODS Table Names

PROC SURVEYREG assigns a name to each table it creates. You can use thesenames to reference the table when using the Output Delivery System (ODS) to selecttables and create output data sets. These names are listed in the following table. Formore information on ODS, see Chapter 15, “Using the Output Delivery System.”

Table 62.2. ODS Tables Produced in PROC SURVEYREG

ODS Table Name Description Statement OptionANOVA ANOVA for dependent variable MODEL default

ClassVarInfo Class level information CLASS default

ContrastCoef Coefficients of contrast CONTRAST E

Contrasts Analysis of contrasts CONTRAST default

CovB Covariance of estimatedregression coefficients

MODEL COVB

DataSummary Data summary MODEL default

DesignSummary Design summary STRATA | CLUSTER default

Effects Tests of model effects MODEL

EstimateCoef Coefficients of estimate ESTIMATE E

Estimates Analysis of estimable functions ESTIMATE default

FitStatistics Fit summary MODEL default

InvXPX Inverse matrix ofX0X MODEL INV

ParameterEstimates Estimated regressioncoefficients

MODEL default

StrataInfo Stratum information STRATA LIST

XPX X0X matrix MODEL XPX

By referring to the names of such tables, you can use the ODS OUTPUT statementto place one or more of these tables in output data sets.

For example, the following statements create an output data set namedMyStrata,which contains the “StrataInfo” table, an output data set namedMyParmEst, whichcontains the “ParameterEstimates” table, and an output data set namedCov, whichcontains the “CovB” table for the ice cream study discussed in the section “StratifiedSampling” on page 3220.




strata Grade /list;class Kids;model Spending = Income Kids / solution covb;ods output StrataInfo = MyStrata

ParameterEstimates = MyParmEstCovB = Cov;

run;

Note that the option CovB is specified in the MODEL statement in order to producethe covariance matrix table.

Examples

Example 62.1. Simple Random Sampling

This example investigates the relationship between the labor force participation rate(LFPR) of women in 1968 and 1972 in large cities in the United States. A simplerandom sample of 19 cities is drawn from a total of 200 cities. For each selected city,the LFPRs are recorded and saved in a SAS data set namedLabor. The LFPR in1972 is contained in the variableLFPR1972, and the LFPR in 1968 is identified bythe variableLFPR1968.

data Labor;input City $ 1-16 LFPR1972 LFPR1968;datalines;

New York .45 .42Los Angeles .50 .50Chicago .52 .52Philadelphia .45 .45Detroit .46 .43San Francisco .55 .55Boston .60 .45Pittsburgh .49 .34St. Louis .35 .45Connecticut .55 .54Washington D.C. .52 .42Cincinnati .53 .51Baltimore .57 .49Newark .53 .54Minn/St. Paul .59 .50Buffalo .64 .58Houston .50 .49Patterson .57 .56Dallas .64 .63;


Example 62.1. Simple Random Sampling � 3247

Assume that the LFPRs in 1968 and 1972 have a linear relationship, as shown in thefollowing model.

LFPR1972 = �0 + �1 � LFPR1968+ error

You can use PROC SURVEYREG to obtain the estimated regression coefficients andestimated standard errors of the regression coefficients. The following statementsperform the regression analysis.

title ’Study of Labor Force Participation Rates of Women’;proc surveyreg data=Labor total=200;

model LFPR1972 = LFPR1968;run;

Here, the TOTAL=200 option specifies the finite population total from which thesimple random sample of 19 cities is drawn. You can specify the same informationby using the sampling rate option RATE=0.095 (19/200=.095).

Output 62.1.1. Summary of Regression Using Simple Random Sampling

Study of Labor Force Participation Rates of Women


Regression Analysis for Dependent Variable LFPR1972

Data Summary

Number of Observations 19Mean of LFPR1972 0.52684Sum of LFPR1972 10.01000

Fit Statistics


ANOVA for Dependent Variable LFPR1972


Model 1 0.035813 0.035813 11.19 0.0038Error 17 0.054398 0.003200Corrected Total 18 0.090211

Output 62.1.1 summarizes the data information, the fit information, and the ANOVAtable.



Output 62.1.2. Regression Coefficient Estimates

Study of Labor Force Participation Rates of Women


Regression Analysis for Dependent Variable LFPR1972



Model 1 13.84 0.0016Intercept 1 4.63 0.0452LFPR1968 1 13.84 0.0016




Intercept 0.20331056 0.09444296 2.15 0.0452LFPR1968 0.65604048 0.17635810 3.72 0.0016

NOTE: The denominator degrees of freedom for the t tests is 18.

Output 62.1.2 presents the significance tests for the model effects and estimated re-gression coefficients. TheF tests andt tests for the effects in the model are alsopresented in these tables.

From the regression performed by PROC SURVEYREG, you obtain a positive esti-mated slope for the linear relationship between the LFPR in 1968 and the LFPR in1972. The regression coefficients are all significant at the 5% level. EffectsInterceptandLFPR1968 are significant in the model at the 5% level. In this example, theFtest for the overall model without intercept is the same as the effectLFPR1968.

Example 62.2. Simple Random Cluster Sampling

This example illustrates the use of regression analysis in a simple random clustersampling design. The data are from S�arndal, Swenson, and Wretman (1992, p. 652).

A total of 284 Swedish municipalities are grouped into 50 clusters of neighboringmunicipalities. Five clusters with a total of 32 municipalities are randomly selected.The results from the regression analysis in which clusters are used in the sampledesign are compared to the results of a regression analysis that ignores the clusters.The linear relationship between the population in 1975 and in 1985 is investigated.

The 32 selected municipalities in the sample are saved in the data setMunicipalities.

data Municipalities;input Municipality Cluster Population85 Population75;datalines;205 37 5 5206 37 11 11207 37 13 13


Example 62.2. Simple Random Cluster Sampling � 3249

208 37 8 8209 37 17 19

6 2 16 157 2 70 628 2 66 549 2 12 12

10 2 60 5094 17 7 795 17 16 1696 17 13 1197 17 12 1198 17 70 6799 17 20 20

100 17 31 28101 17 49 48276 50 6 7277 50 9 10278 50 24 26279 50 10 9280 50 67 64281 50 39 35282 50 29 27283 50 10 9284 50 27 31

52 10 7 653 10 9 854 10 28 2755 10 12 1156 10 107 108

;

The variableMunicipality identifies the municipalities in the sample; the variableCluster indicates the cluster to which a municipality belongs; and the variablesPop-ulation85 and Population75 contain the municipality populations in 1985 and in1975 (in thousands), respectively. A regression analysis is performed by PROC SUR-VEYREG with a CLUSTER statement.

title1 ’Regression Analysis for Swedish Municipalities’;title2 ’Cluster Simple Random Sampling’;proc surveyreg data=Municipalities total=50;

cluster Cluster;model Population85=Population75;

run;

The TOTAL=50 option specifies the total number of clusters in the sampling frame.



Output 62.2.1. Regression Analysis for Simple Random Cluster Sampling

Regression Analysis for Swedish MunicipalitiesCluster Simple Random Sampling


Regression Analysis for Dependent Variable Population85

Data Summary

Number of Observations 32Mean of Population85 27.50000Sum of Population85 880.00000

Design Summary

Number of Clusters 5

Fit Statistics




Intercept -0.0191292 0.89204053 -0.02 0.9839Population75 1.0546253 0.05167565 20.41 <.0001


Output 62.2.1 displays the data summary, design summary, fit summary, and regres-sion coefficient estimates. Since the sample design includes clusters, the proceduredisplays the total number of clusters in the sample in the “Design Summary” table.In the “Estimated Regression Coefficients” table, the estimated slope for the linearrelationship is 1.05, which is significant at the 5% level; but the intercept is not sig-nificant. This suggests that a regression line crossing the original can be establishedbetween populations in 1975 and in 1985.

The CLUSTER statement is necessary in PROC SURVEYREG in order to incorpo-rate the sample design. If you do not specify a CLUSTER statement in the regressionanalysis, the standard deviation of the regression coefficients will be incorrectly esti-mated.

title1 ’Regression Analysis for Swedish Municipalities’;title2 ’Simple Random Sampling’;proc surveyreg data=Municipalities total=284;

model Population85=Population75;run;


Example 62.3. Regresson Estimator for Simple Random Sample � 3251

The analysis ignores the clusters in the sample, assuming that the sample design is asimple random sampling. Therefore, the TOTAL= option specifies the total numberof municipalities, which is 284.

Output 62.2.2. Regression Analysis for Simple Random Sampling

Regression Analysis for Swedish MunicipalitiesSimple Random Sampling



Data Summary

Number of Observations 32Mean of Population85 27.50000Sum of Population85 880.00000

Fit Statistics




Intercept -0.0191292 0.67417606 -0.03 0.9775Population75 1.0546253 0.03668414 28.75 <.0001


Output 62.2.2 displays the regression results ignoring the clusters. Compared to theresults in Output 62.2.1 on page 3250, the regression coefficient estimates are thesame. However, without using clusters, the regression coefficients have a smallervariance estimate in Output 62.2.2. Using clusters in the analysis, the estimated re-gression coeffiecient for effectPopulation75 is 1.05, with the estimated standarderror 0.05, as displayed in Output 62.2.1; without using the clusters, the estimate is1.05, but with the estimated standard error 0.04, as displayed in Output 62.2.2. Toestimated the variance of the regression coefficients correctly, you should include theclustering information in the regression analysis.

Example 62.3. Regresson Estimator for Simple RandomSample

Using auxilary information, you can construct the regression estimators to providemore accurate estimates of the population characteristics that are of interest. WithESTIMATE statements in PROC SURVEYREG, you can speciofy a regression es-timator as a linear function of the regression parameters to estimate the populationtotal. This example illustrates this application, using the data in the previous example.



In this sample, a linear model between the Swedish populations in 1975 and in 1985is established.

Population85 = �+ � � Population75 + error

Assuming that the total population in 1975 is known to be 8200 (in thousands), youcan use the ESTIMATE statement to predict the 1985 total population using the fol-lowing statements.

title1 ’Regression Analysis for Swedish Municipalities’;title2 ’Estimate Total Population’;proc surveyreg data=Municipalities total=50;

cluster Cluster;model Population85=Population75;estimate ’1985 population’ Intercept 284 Population75 8200;

run;

Since each observation in the sample is a municipality, and there is a total of 284 mu-nicipalities in Sweden, the coefficient forIntercept (�) in the ESTIMATE statementis 284, and the coefficient forPopulation75 (�) is the total population in 1975 (8.2million).

Output 62.3.1. Use the Regression Estimator to Estimate the Population Total

Regression Analysis for Swedish MunicipalitiesEstimate Total Population



Analysis of Estimable Functions


1985 population 8642.49485 258.558613 33.43 <.0001


Output 62.3.1 displays the regression results and the estimation of the total popula-tion. Using the linear model, you can predict the total population in 1985 to be 8.64million, with a standard error of 0.26 million.

Example 62.4. Stratified Sampling

This example illustrates the SURVEYREG procedure to perform a regression in astratified sample design. Consider a population of 235 farms producing corn in thestates of Nebraska and Iowa. You are interested in the relationship between corn yield(CornYield) and the total farm size (FarmArea).

Each state is divided into several regions, and each region is used as a stratum. Withineach stratum, a simple random sample with replacement is drawn. A total of 19 farms


Example 62.4. Stratified Sampling � 3253

is selected to the stratified simple random sample. The sample size and populationsize within each stratum are displayed in Table 62.3.

Table 62.3. Number of Farms in Each Stratum

Number of Farms inStratum State Region Population Sample

1 Iowa 1 100 32 2 50 53 3 15 34 Nebraska 1 30 65 2 40 2

Total 235 19

Three models are considered to represent the data:

� Model I — Common intercept and slope:

Corn Yield= �+ � � Farm Area

� Model II — Common intercept, different slope:

Corn Yield=

��+ �Iowa � Farm Area if the farm is from Iowa�+ �Nebraska� Farm Area if the farm is from Nebraska

� Model III — Different intercept and slope:

Corn Yield=

��Iowa+ �Iowa � Farm Area if the farm is from Iowa�Nebraska+ �Nebraska� Farm Area if the farm is from Nebraska

Data from the stratified sample are saved in the SAS data setFarms.

data Farms;input State $ Region FarmArea CornYield Weight;datalines;

Iowa 1 100 54 33.333Iowa 1 83 25 33.333Iowa 1 25 10 33.333Iowa 2 120 83 10.000Iowa 2 50 35 10.000Iowa 2 110 65 10.000Iowa 2 60 35 10.000Iowa 2 45 20 10.000Iowa 3 23 5 5.000Iowa 3 10 8 5.000Iowa 3 350 125 5.000Nebraska 1 130 20 5.000Nebraska 1 245 25 5.000Nebraska 1 150 33 5.000Nebraska 1 263 50 5.000Nebraska 1 320 47 5.000



Nebraska 1 204 25 5.000Nebraska 2 80 11 20.000Nebraska 2 48 8 20.000;

In the data setFarms, the variableWeight represents the sampling weight. In thisexample, the sampling weight is proportional to the reciprocal of the sampling ratewithin each stratum from which a farm is selected. The information on populationsize in each stratum is saved in the SAS data setTotalInStrata.

data TotalInStrata;input State $ Region _TOTAL_;datalines;

Iowa 1 100Iowa 2 50Iowa 3 15Nebraska 1 30Nebraska 2 40;

Using the sample data from the data setFarms and the control information data fromthe data setTotalInStrata, you can fit Model I using PROC SURVEYREG.

title1 ’Analysis of Farm Area and Corn Yield’;title2 ’Model I: Same Intercept and Slope’;proc surveyreg data=Farms total=TotalInStrata;

strata State Region / list;model CornYield = FarmArea / covb;weight Weight;

run;



Output 62.4.1. Data Summary and Stratum Information Fitting Model I

Analysis of Farm Area and Corn YieldModel I: Same Intercept and Slope


Regression Analysis for Dependent Variable CornYield

Data Summary

Number of Observations 19Sum of Weights 234.99900Weighted Mean of CornYield 31.56029Weighted Sum of CornYield 7416.6

Design Summary

Number of Strata 5

Fit Statistics


Stratum Information

Stratum Population SamplingIndex State Region N Obs Total Rate

1 Iowa 1 3 100 0.032 2 5 50 0.103 3 3 15 0.204 Nebraska 1 6 30 0.205 2 2 40 0.05

Output 62.4.1 displays the data summary and stratification information fitting ModelI. The sampling rates are automatically computed by the procedure based on thesample sizes and the population totals in strata.



Output 62.4.2. Estimated Regression Coefficients and the Estimated CovarianceMatrix

Analysis of Farm Area and Corn YieldModel I: Same Intercept and Slope





Model 1 21.74 0.0004Intercept 1 4.93 0.0433FarmArea 1 21.74 0.0004




Intercept 11.8162978 5.31981027 2.22 0.0433FarmArea 0.2126576 0.04560949 4.66 0.0004


Covariance of EstimatedRegression Coefficients

Intercept FarmArea

Intercept 28.300381277 -0.146471538FarmArea -0.146471538 0.0020802259

Output 62.4.2 displays tests of model effects and the estimated regression coefficientsand their covariance matrix.

Alternatively, you can assume that the linear relationship between corn yield(CornYield) and farm area (FarmArea) is different among the states. Therefore,you consider fitting Model II.

In order to analyze the data using Model II, you create auxiliary variablesFarmAre-aNE andFarmAreaIA to represent farm area in different states:

FarmAreaNE =

�0 if the farm is from IowaFarmArea if the farm is from Nebraska

FarmAreaIA =

�FarmArea if the farm is from Iowa0 if the farm is from Nebraska



The following statements create these variables in a new data set calledFarmsByState and use PROC SURVEYREG to fit Model II.

title1 ’Analysis of Farm Area and Corn Yield’;title2 ’Model II: Same Intercept, Different Slopes’;data FarmsByState; set Farms;

if State=’Iowa’ then do;FarmAreaIA=FarmArea ; FarmAreaNE=0 ;

end;else do;

FarmAreaIA=0 ; FarmAreaNE=FarmArea;end;

run;

The following statements perform the regression using the new data setFarms-ByState. The analysis uses the auxilary variablesFarmAreaIA andFarmAreaNEas the regressors.

proc SURVEYREG data=FarmsByState total=TotalInStrata;strata State Region;model CornYield = FarmAreaIA FarmAreaNE / covb;weight Weight;

run;



Output 62.4.3. Regression Results from Fitting Model II

Analysis of Farm Area and Corn YieldModel II: Same Intercept, Different Slopes



Data Summary


Design Summary

Number of Strata 5

Fit Statistics




Intercept 4.04234816 3.80934848 1.06 0.3066FarmAreaIA 0.41696069 0.05971129 6.98 <.0001FarmAreaNE 0.12851012 0.02495495 5.15 0.0001


Covariance of Estimated Regression Coefficients

Intercept FarmAreaIA FarmAreaNE

Intercept 14.511135861 -0.118001232 -0.079908772FarmAreaIA -0.118001232 0.0035654381 0.0006501109FarmAreaNE -0.079908772 0.0006501109 0.0006227496

Output 62.4.3 displays the data summary, design information, fit summary, and pa-rameter estimates and their covariance matrix. The estimated slope parameters foreach state are quite different from the estimated slope in Model I. The results fromthe regression show that Model II fits these data better than Model I.

For Model III, different intercepts are used for the linear relationship in two states.The following statements illustrate the use of the NOINT option in the MODEL state-ment associated with the CLASS statement to fit Model III.

title1 ’Analysis of Farm Area and Corn Yield’;title2 ’Model III: Different Intercepts and Slopes’;proc SURVEYREG data=FarmsByState total=TotalInStrata;

strata State Region;



class State;model CornYield = State FarmAreaIA FarmAreaNE

/ noint covb solution;weight Weight;

run;

The model statement includes the classification effectState as a regressor. Therefore,the parameter estimates for effectState will presents the intercepts in two states.

Output 62.4.4. Regression Results for Fitting Model III

Analysis of Farm Area and Corn YieldModel III: Different Intercepts and Slopes



Data Summary


Design Summary

Number of Strata 5

Fit Statistics




State Iowa 5.27797099 5.27170400 1.00 0.3337State Nebraska 0.65275201 1.70031616 0.38 0.7068FarmAreaIA 0.40680971 0.06458426 6.30 <.0001FarmAreaNE 0.14630563 0.01997085 7.33 <.0001


Covariance of Estimated Regression Coefficients

StateState Iowa Nebraska FarmAreaIA FarmAreaNE

State Iowa 27.790863033 0 -0.205517205 0State Nebraska 0 2.8910750385 0 -0.027354011FarmAreaIA -0.205517205 0 0.0041711265 0FarmAreaNE 0 -0.027354011 0 0.0003988349



Output 62.4.4 displays the regression results for fitting Model III, including the datasummary, parameter estimates, and covariance matrix of the regression coefficients.The estimated covariance matrix shows a lack of correlation between the regressioncoefficients from different states. This suggests that Model III might be the bestchoice for building a model for farm area and corn yield in these two states.

However, some statistics remain the same under different regression models, for ex-ample, Weighted Mean of CornYield. These estimators do not rely on the particularmodel you use.

Example 62.5. Regression Estimator for Stratified Sample

This example uses the corn yield data from the previous example to illustrate how toconstruct a regression estimator for a stratified sample design.

Similar to Example 62.3 on page 3251, by incorporating auxilary information intoa regression estimator, the procedure can produce more accurate estimates of thepopulation characteristics that are of interest. In this example, the sample design is astratified sampling design. The auxilary information is the total farm areas in regionsof each state, as displayed in Table 62.4. You want to estimate the total corn yieldusing this information under the three linear models given in Example 62.4.

Table 62.4. Information for Each Stratum

Number of Farms inStratum State Region Population Sample Total Farm Area

1 Iowa 1 100 32 2 50 5 13,2003 3 15 34 Nebraska 1 30 6 8,7505 2 40 2

Total 235 19 21,950

The regression estimator to estimate the total corn yield under Model I can be ob-tained by using PROC SURVEYREG with an ESTIMATE statement.

title1 ’Estimate Corn Yield from Farm Size’;title2 ’Model I: Same Intercept and Slope’;proc surveyreg data=Farms total=TotalInStrata;

strata State Region / list;class State Region;model CornYield = FarmArea State*Region /solution;weight Weight;estimate ’Estimate of CornYield under Model I’

INTERCEPT 235 FarmArea 21950State*Region 100 50 15 30 40 /e;

run;

To apply the contraint in each stratum that the weighted total number of farms equalsto the total number of farms in the stratum, you can include the strata as an effect inthe MODEL statement, effectState*Region. Thus, the CLASS statement must list


Example 62.5. Regression Estimator for Stratified Sample � 3261

the STRATA variables,State andRegion, as classification variables. The followingESTIMATE statement specifies the regression estimator, which is a linear function ofthe regression parameters.

estimate ’Estimate of CornYield under Model I’INTERCEPT 235 FarmArea 21950State*Region 100 50 15 30 40 /e;

This linear function contains the total for each explanatory variable in the model.Because the sampling units are farms in this example, the coefficient forInterceptin the ESTIMATE statement is the total number of farms (235); the coefficient forFarmArea is the total farm area listed in Table 62.4 (21950); and the coefficientsfor effectState*Region are the total number of farms in each strata (as displayed inTable 62.4).

Output 62.5.1. Regression Estimator for the Total of CornYield under Model I

Estimate Corn Yield from Farm SizeModel I: Same Intercept and Slope





Estimate of CornYield under Model I 7463.52329 926.841541 8.05 <.0001


Output 62.5.1 displays the results of the ESTIMATE statement. The regression esti-mator for the total ofCornYield in Iowa and Nebraska is 7464 under Model I, witha standard error of 927.

Under Model II, a regression estimator for totals can be obtained using the followingstatements.

title1 ’Estimate Corn Yield from Farm Size’;title2 ’Model II: Same Intercept, Different Slopes’;proc surveyreg data=FarmsByState total=TotalInStrata;

strata State Region;class State Region;model CornYield = FarmAreaIA FarmAreaNE

state*region /solution;weight Weight;estimate ’Total of CornYield under Model II’

INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750State*Region 100 50 15 30 40 /e;

run;

In this model, you also need to include strata as a fixed effect in the MODEL state-ment. Other regressors are the auxiliary variablesFarmAreaIA andFarmAreaNE



(defined in Example 62.4). In the following ESTIMATE statement, the coefficientfor Intercept is still the total number of farms; and the coefficients forFarmAreaIAandFarmAreaNE are the total farm area in Iowa and Nebraska, respectively, as dis-played in Table 62.4. The total number of farms in each strata are the coefficients forthe strata effect.

estimate ’Total of CornYield under Model II’INTERCEPT 235 FarmAreaIA 13200 FarmAreaNE 8750State*Region 100 50 15 30 40 /e;

Output 62.5.2. Regression Estimator for the Total of CornYield under Model II

Estimate Corn Yield from Farm SizeModel II: Same Intercept, Different Slopes





Total of CornYield under Model II 7580.48657 859.180439 8.82 <.0001


Output 62.5.2 displays that the results of the regression estimator for the total of cornyield in two states under Model II is 7580 with a standard error of 859. The regressionestimator under Model II has a slightly smaller standard error than under Model I.

Finally, you can apply Model III to the data and estimate the total corn yield. UnderModel III, you can also obtain the regression estimators for the total corn yield foreach state. Three ESTIMATE statements are used in the following statements tocreate the three regression estimators.

title1 ’Estimate Corn Yield from Farm Size’;title2 ’Model III: Different Intercepts and Slopes’;proc SURVEYREG data=FarmsByState total=TotalInStrata;

strata State Region;class State Region;model CornYield = state FarmAreaIA FarmAreaNE

State*Region /noint solution;weight Weight;estimate ’Total CornYield in Iowa under Model III’

State 165 0 FarmAreaIA 13200 FarmAreaNE 0State*region 100 50 15 0 0 /e;

estimate ’Total CornYield in Nebraska under Model III’State 0 70 FarmAreaIA 0 FarmAreaNE 8750State*Region 0 0 0 30 40 /e;

estimate ’Total CornYield in both states under Model III’State 165 70 FarmAreaIA 13200 FarmAreaNE 8750State*Region 100 50 15 30 40 /e;

run;


Example 62.5. Regression Estimator for Stratified Sample � 3263

The fixed effectState is added to the MODEL statement to obtain different interceptsin different states, using the NOINT option. Among the ESTIMATE statements, thecoefficients for explanatory variables are different depending on which regressionestimator is estimated. For example, in the ESTIMATE statement

estimate ’Total CornYield in Iowa under Model III’State 165 0 FarmAreaIA 13200 FarmAreaNE 0State*region 100 50 15 0 0 /e;

the coefficients for the effectState are 165 and 0, respectively. This indicates thatthe total number of farms in Iowa is 165 and the total number of farms in Nebraskais 0, because the estimation is the total corn yield in Iowa only. Similarly, the totalnumbers of farms in three regions in Iowa are used for the coefficients of the strataeffectState*Region, as displayed in Table 62.4.

Output 62.5.3. Regression Estimator for the Total of CornYield under Model III

Estimate Corn Yield from Farm SizeModel III: Different Intercepts and Slopes




StandardParameter Estimate Error t Value

Total CornYield in Iowa under Model III 6246.10697 851.272372 7.34Total CornYield in Nebraska under Model III 1334.37961 116.302948 11.47Total CornYield in both states under Model III 7580.48657 859.180439 8.82


Parameter Pr > |t|

Total CornYield in Iowa under Model III <.0001Total CornYield in Nebraska under Model III <.0001Total CornYield in both states under Model III <.0001


Output 62.5.3 displays the results from the three regression estimators using ModelIII. Since the estimations are independent in each state, the total corn yield from bothstates is equal to the sum of the estimated total of corn yield in Iowa and Nebraska,6246 + 1334 = 7580. This regression estimator is the same as the one under ModelII. The variance of regression estimator of the total corn yield in both states is the sumof variances of regression estimators for total corn yield in each state. Therefore, itis not necessary to use Model III to obtain the regression estimator for the total cornyield unless you need to estimate the total corn yield for each individual state.



Example 62.6. Stratum Collapse

In a stratified sample, it is possible that some strata will have only one samplingunit. When this happens, PROC SURVEYREG collapses these strata that containsingle sampling unit into a pooled stratum. For more detailed information on stratumcollapse, see the section “Stratum Collapse” on page 3236.

Suppose that you have the following data.

data Sample;input Stratum X Y;datalines;

10 0 010 1 111 1 111 1 212 3 333 4 414 6 712 3 4;

The variableStratum is the stratification variable, the variableX is the independentvariable, and the variableY is the dependent variable. You want to regressY on X.In the data setSample, bothStratum=33 andStratum=14 contain one observation.By default, PROC SURVEYREG collapses these strata into one pooled stratum inthe regression analysis.

To input the finite population correction information, you create the SAS data setStratumTotal.

data StratumTotal;input Stratum _TOTAL_;datalines;

10 1011 2012 3233 4033 4514 5015 .66 70;

The variableStratum is the stratification variable, and the variable–TOTAL– con-tains the stratum totals. The data setStratumTotal contains more strata than the datasetSample. Also in the data setStratumTotal, more than one observation containsthe stratum totals forStratum=33.

33 4033 45


Example 62.6. Stratum Collapse � 3265

PROC SURVEYREG allows this type of input. The procedure simply ignores thestrata that are not present in the data setSample; for the multiple entries of a stratum,the procedure uses the first observation. In this example,Stratum=33 has the stratumtotal –TOTAL–=40.

The following SAS statements perform the regression analysis.

title1 ’Stratified Sample with Single Sampling Unit in Strata’;title2 ’With Stratum Collapse’;proc SURVEYREG data=Sample total=StratumTotal;

strata Stratum/list;model Y=X;

run;

Output 62.6.1. Summary of Data and Regression

Stratified Sample with Single Sampling Unit in StrataWith Stratum Collapse


Regression Analysis for Dependent Variable Y

Data Summary

Number of Observations 8Mean of Y 2.75000Sum of Y 22.00000

Design Summary

Number of Strata 5Number of Strata Collapsed 2

Fit Statistics


Output 62.6.1 displays that there are a total of 5 strata in the input data set, and 2 strataare collapsed into a pooled stratum. The denominator degrees of freedom is 4, due tothe collapse (see the section “Denominator Degrees of Freedom” on page 3237).



Output 62.6.2. Stratification Information




Stratum Information

Stratum Population SamplingIndex Collapsed Stratum N Obs Total Rate

1 10 2 10 0.202 11 2 20 0.103 12 2 32 0.064 Yes 14 1 50 0.025 Yes 33 1 40 0.03

0 Pooled 2 90 0.02

NOTE: Strata with only one observation are collapsed into the stratum withStratum Index "0".

Output 62.6.2 displays the stratification information, including stratum collapse. Un-der the column Collapsed, the fourth (Stratum Index=4) stratum and the fifth (StratumIndex=5) stratum are marked as “Yes,” which indicates that these two strata are col-lapsed into the pooled stratum (Stratum Index=0). The sampling rate for the pooledstratum is 2%, which combined from the 4th stratum and the 5th stratum (see thesection “Sampling Rate of the Pooled Stratum from Collapse” on page 3241).

Output 62.6.3. Parameter Estimates and Effect Tests






Model 1 155.62 0.0002Intercept 1 0.24 0.6503X 1 155.62 0.0002




Intercept 0.13004484 0.26578532 0.49 0.6503X 1.10313901 0.08842825 12.47 0.0002



Example 62.6. Stratum Collapse � 3267

Output 62.6.3 displays the parameter estimates and the tests of the significance of themodel effects.

Alternatively, if you prefer not to collapse the strata that have single sampling unit,you can specify the NOCOLLAPSE option in the STRATA statement.

title1 ’Stratified Sample with Single Sampling Unit in Strata’;title2 ’Without Stratum Collapse’;proc SURVEYREG data=Sample total=StratumTotal;

strata Stratum/list nocollapse;model Y = X;run;

Output 62.6.4. Summary of Data and Regression

Stratified Sample with Single Sampling Unit in StrataWithout Stratum Collapse



Data Summary

Number of Observations 8Mean of Y 2.75000Sum of Y 22.00000

Design Summary

Number of Strata 5

Fit Statistics


Output 62.6.4 does not contain stratum collapse information as compared to Out-put 62.6.1. The denominator degrees of freedom is 3 instead of 4 as in Output 62.6.1.



Output 62.6.5. Stratification Information




Stratum Information

Stratum Population SamplingIndex Stratum N Obs Total Rate

1 10 2 10 0.202 11 2 20 0.103 12 2 32 0.064 14 1 50 0.025 33 1 40 0.03

In Output 62.6.5, although the fourth stratum and the fifth stratum contain only oneobservation, no stratum collapse occurs as in Output 62.6.2.

Output 62.6.6. Parameter Estimates and Effect Tests






Model 1 391.94 0.0003Intercept 1 0.25 0.6508X 1 391.94 0.0003




Intercept 0.13004484 0.25957741 0.50 0.6508X 1.10313901 0.05572135 19.80 0.0003


As a result of not collapsing strata, the standard error estimates of the parameters aredifferent from those in Output 62.6.3, the tests of the significance of model effectsare different as well.


Example 62.7. Domain Analysis � 3269

Example 62.7. Domain Analysis

Recall that in the section “Getting Started” on page 3217, you collected a stratifiedsimple random sample from a junior high school to examine how household incomeand the number of children in a household affect students’ average weekly spendingfor ice cream. You can also use the same sample to estimate the average weeklyspending among male and female students, respectively. This is often called domainanalysis (subgroup analysis). You can use PROC SURVEYREG to perform domainanalysis as in the following example.

data IceCreamData;input Grade Spending Income Gender$ @@;if Gender=’M’ then Male=1; else Male=0;if Gender=’F’ then Female=1; else Female=0;datalines;

7 7 39 M 7 7 38 F 8 12 47 F9 10 47 M 7 1 34 M 7 10 43 M7 3 44 M 8 20 60 F 8 19 57 M7 2 35 M 7 2 36 F 9 15 51 F8 16 53 F 7 6 37 F 7 6 41 M7 6 39 M 9 15 50 M 8 17 57 F8 14 46 M 9 8 41 M 9 8 41 F9 7 47 F 7 3 39 F 7 12 50 M7 4 43 M 9 14 46 F 8 18 58 M9 9 44 F 7 2 37 F 7 1 37 M7 4 44 M 7 11 42 M 9 8 41 M8 10 42 M 8 13 46 F 7 2 40 F9 6 45 F 9 11 45 M 7 2 36 F7 9 46 F;

In the data setIceCreamData, the variableGrade indicates a student’s grade, whichis the stratification variable. The variableSpending contains the dollar amount ofeach student’s average weekly spending for ice cream. The variableIncome specifiesthe household income, in thousands of dollars. The variableGender indicates astudent’s gender.Male and Female are two indicator variables that identify thesubgroups of male and female students, respectively.

data StudentTotal;input Grade _TOTAL_;datalines;

7 18248 10259 1151;

In the data setStudentTotal, the variableGrade is the stratification variable, and thevariable–TOTAL– contains the total numbers of students in the strata in the surveypopulation.



The following statements demonstrate how you can estimate the average spending inthe subgroup of male students.

title1 ’Ice Cream Spending Analysis’;title2 ’Domain Analysis for Subgroup: Male Students’;proc surveyreg data=IceCreamData total=StudentTotal;

strata Grade;model Spending = Male / noint;ods select ParameterEstimates;

run;

Output 62.7.1. Domain Analysis for Male Students

Ice Cream Spending AnalysisDomain Analysis for Subgroup: Male Students





Male 8.57142857 0.97971846 8.75 <.0001


Output 62.7.1 shows that average spending for the subgroup of male students is $8.57with a standard error of $.99.

Similarly, you can obtain a domain analysis for the subgroup of female students withthe following statements.

title1 ’Ice Cream Spending Analysis’;title2 ’Domain Analysis for Subgroup: Female Students’;proc surveyreg data=IceCreamData total=StudentTotal;

strata Grade /list;model Spending = Female / noint;

run;


References � 3271

Output 62.7.2. Domain Analysis for Female Students

Ice Cream Spending AnalysisDomain Analysis for Subgroup: Female Students





Female 8.94736842 1.06370643 8.41 <.0001


Output 62.7.2 shows that average spending for the subgroup of female students is$8.95 with a standard error of $1.06.

Note that you would not obtain the same results by using a subset of your sample, forexample, by restricting the analysis to male students using a WHERE clause or a BYstatement. This is because the domain sample size is not fixed in the original sampledesign, but is actually a random variable. The variance estimation for the domainmean must include this variability of the sample size. Refer to Cochran (1977) formore details.

References

Cochran, W. G. (1977),Sampling Techniques, Third Edition, New York: John Wiley& Sons, Inc.

Foreman, E. K. (1991),Survey Sampling Principles, New York: Marcel Dekker, Inc.

Fuller, W. A. (1975), “Regression Analysis for Sample Survey,”Sankhy�a, 37 (3),Series C, 117–132.

Kish, L. (1965),Survey Sampling, New York: John Wiley & Sons, Inc.

Pringle, R. M. and Raynor, A. A. (1971),Generalized Inverse Matrices with Appli-cations to Statistics, New York: Hafner Publishing Co.

S�arndal, C.E., Swenson, B., and Wretman, J. (1992),Model Assisted Survey Sam-pling, New York: Springer-Verlag Inc.

Statistical Laboratory (1989),PC CARP, Ames, IA: Statistical Laboratory, Iowa StateUniversity.

Woodruff, R. S. (1971), “A Simple Method for Approximating the Variance ofa Complicated Estimate,”Journal of the American Statistical Association, 66,411–414.


The correct bibliographic citation for this manual is as follows: SAS Institute Inc.,SAS/STAT ® User’s Guide, Version 8, Cary, NC: SAS Institute Inc., 1999.

SAS/STAT® User’s Guide, Version 8Copyright © 1999 by SAS Institute Inc., Cary, NC, USA.ISBN 1–58025–494–2All rights reserved. Produced in the United States of America. No part of this publicationmay be reproduced, stored in a retrieval system, or transmitted, in any form or by anymeans, electronic, mechanical, photocopying, or otherwise, without the prior writtenpermission of the publisher, SAS Institute Inc.U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of thesoftware and related documentation by the U.S. government is subject to the Agreementwith SAS Institute and the restrictions set forth in FAR 52.227–19 Commercial ComputerSoftware-Restricted Rights (June 1987).SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.1st printing, October 1999SAS® and all other SAS Institute Inc. product or service names are registered trademarksor trademarks of SAS Institute Inc. in the USA and other countries.® indicates USAregistration.Other brand and product names are registered trademarks or trademarks of theirrespective companies.The Institute is a private company devoted to the support and further development of itssoftware and related services.

the surveyreg procedure

Documents