2014 independent study on the census bureau's survey of business owners public use microdata sample

8/12/2019 2014 Independent Study on the Census Bureau's Survey of Business Owners Public Use Microdata Sample


Analysis of the 2007 Survey of Business Owners

Public Use Microdata Sample

Arthur Wu

Supervisor: Professor Amber Tomas

STAT 4993 Independent Study

5/9/2013



Table of Contents

Introduction
Overview of the Survey of Business Owners Public Microdata Sample
    Primary Methodology
    Missing Data
    Nonresponse
    Inherent Differences between SBO and PUMS Information
Data Manipulation and Cleaning
    Subsetting to Small Businesses and One Owner Data
    Tabulation Weights
Fitting a Regression
    Regression One: All Variables Against Receipts
    Model Selection Techniques
    Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
    Regression Three: PROC GLMSELECT with Akaike's Information Criterion
    Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
    Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
Analysis of Language
    Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
    Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?
    Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
Analysis of Capital Sources in California versus the United States
    Dataset Overview
    Research Question 2a: What sources of capital have the most positive relationship with receipts?
    Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
Conclusion
Lessons Learned
Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
Appendix B: Regression of All Available Variables
Appendix C: Full Tables of Capital Regression



Introduction

In selecting my topic for independent study, I wanted to combine the skills I had learned for statistical computing software with my main passion and primary fields of study: business and entrepreneurship. As a result, I originally intended to build a predictive model for new or small business venture success. According to the U.S. Small Business Administration (SBA), small businesses represent 99.7 percent of all employer firms. Since 1995, small businesses have generated 64 percent of new jobs and paid 44 percent of the total United States private payroll, according to the SBA. However, I quickly realized that there was a dearth of accurate, thorough, and easily accessible data with a large enough sample size to satisfy the normality assumption across all sectors and states. Moreover, attempting to answer this question would further require significant longitudinal data at the individual business level. Ultimately, I relied on the U.S. Census Bureau's Survey of Business Owners (SBO) Public Use Microdata Sample (PUMS) to examine entrepreneurial activity and the relationships between business characteristics such as access to capital, firm size, employer-paid benefits, minority ownership, and firm age. In this report, I detail how I conducted data cleaning on the 2007 SBO PUMS, developed a regression model, and carried out more in-depth analyses of the relationships between specific variables.

Overview of the Survey of Business Owners Public Microdata Sample

Primary Methodology

The 2007 Survey of Business Owners (SBO) questionnaire, Form SBO-1, was mailed to a random sample of 2.3 million businesses selected from a list of 27 million firms operating during 2007 with receipts of $1,000 or more. The list of all firms (the sampling universe) was derived from both official business tax returns and data collected on other economic census reports. The Census Bureau obtained electronic files from the Internal Revenue Service (IRS) for all companies reporting any business activity on 2007 IRS Tax Forms such as Form 1040 and 1065.

With regard to the background of the SBO, this survey is part of the Economic Census program, which the Census Bureau is required by law to conduct every 5 years for years ending in "2" and "7." The Census Bureau combines and crosschecks data from the SBO with data from other economic surveys, economic censuses, and administrative records. The published data include number of firms (both firms with paid employees and firms with no paid employees), sales and receipts, number of paid employees, and annual payroll; they are presented by kind of business, geographic area, and size of firm (employment and receipts). The results also contain summary statistics on the composition of businesses in the United States by gender, ethnicity, race, and veteran status. Additional demographic and economic characteristics of business owners and their businesses are included, such as: owner's age, education level, hours worked, and primary function in the business; family- and home-based businesses; types of customers and workers; sources of financing for start-up, expansion, or capital improvements; outsourcing; use of Internet and e-commerce; and employer-paid benefits.

The IRS provided certain identification, classification, and measurement data for businesses filing those forms. For most firms with paid employees, the Census Bureau also collected employment, payroll, receipts, and kind of business for each plant, store, or physical location during the 2007 Economic Census.

For the 2007 SBO, firms could either report electronically by using Census Taker, the Census Bureau's secure online interactive application, or return their completed form by mail. Three report form re-mails to employer firms and two report form re-mails to nonemployer firms were conducted at one-month intervals to all delinquent respondents. The returned forms underwent extensive review and computer processing. All reports were geographically coded, data-keyed, and edited.

This wealth of data provides a resource to many parties, from government officials to industry organization leaders. For example, this data allows agencies such as the Small Business Administration to identify and address the needs of small businesses in the United States. In the private sector, consultants and researchers can analyze long-term economic and demographic shifts, as well as differences in ownership and performance among geographic areas.

Survey Overview:

Form SBO-1, given to every sampled business, primarily asked basic information about the business in general while focusing on the demographics and level of ownership for each listed owner. There are only 9 numeric variables (tabulation weight, total revenues with injected noise, payroll with injected noise, employment with injected noise, and general ownership percentages for up to four owners) while the other hundreds are character variables (age, education, startup capital type, race, ethnicity, etc.). These character variables usually adopt the format of a binary yes/no answer to most questions, except for variables with multiple levels (education, age, and race).


General Owner and Business Characteristics
    Sector
    Employer status
    Random group (for variance estimation)
    Tabulation weight
    Measures of size (noise-infused for disclosure avoidance):
        Employment
        Payroll
        Receipts
    Individual owner information (for up to four owners):
        Percentage ownership
        Gender
        Ethnicity
        Race
        Veteran status

Additional Owner Characteristics
    How the owner initially acquired the business
    When the owner acquired the business
    Owner's primary function in the business
    Owner's average number of hours per week spent working in the business
    Whether the business provided the owner's primary source of personal income
    Whether the owner previously owned a business or had been self-employed
    Owner's educational background
    Owner's age
    Whether the owner was born in the United States
    If the owner was a veteran, whether the owner was disabled as the result of injury incurred during active military service

Additional Business Characteristics
    Year business was established
    Source(s) of start-up or acquisition capital
    Amount of start-up or acquisition capital
    Home-based business
    Operated as a franchise
    Owned by a franchise
    Source(s) of capital used to expand business
    Types of customers
    Percent of total sales exported
    Operations established outside the United States
    Outsourced any business function outside the United States
    Language(s) used in transactions
    Types of workers employed
    Employer-paid benefits offered
    Whether the company had a website
    Whether the company had e-commerce sales
    E-commerce as a percentage of total sales
    Whether the company made online purchases
    Business activity (e.g., seasonal or part-time)
    Whether the business currently operates
    Reasons for ceasing operations
    Joint ownership by husband and wife
    Family-owned business
    Number of owners

Missing Data

For the numeric variables of the cleaned data set (receipts, payroll, employment, and percent of ownership), there were no missing values. However, for most character variables, the percentage of missing data ranged from 20-40%. There were also several variables in which over 90% of the observations had missing values, such as whether or not the owner is retired, the owner is deceased, or the business had low or inadequate sales.

Nonresponse

Approximately 62 percent of the 2.3 million businesses in the SBO sample responded to the survey, compared to 75 percent for the 2002 survey. For the 2007 survey, 72 percent of the companies in the SBO sample returned a questionnaire, but 10 percent of the returns did not contain enough information to be considered a response for the estimates by race, gender, ethnicity, or veteran status. Many of these respondents were sole proprietors that answered "No" to Item 8, "In 2007, did any individual own 10% or more of the rights, claims, interests, or stock in this business?" Another identified issue was the duality between ethnicity (Hispanic vs. non-Hispanic) and race (White, Black, Asian, American Indian). Every Hispanic business owner also had to identify at least one race, which may lead to an indication of mixed race when an owner is solely Hispanic in heritage. This led to consequent variable manipulation for correction.

According to the U.S. Census, about 4 percent of the 2007 nonrespondents were selected for and responded to the 2002 SBO. For these firms, data from the 2002 survey were used in place of the missing 2007 responses. For the remaining nonrespondents, gender, ethnicity, race, and veteran status were imputed from donor respondents in the same sampling frame with similar characteristics (state, industry, employment status, size). Because the assignment of businesses to sampling frames relies heavily on administrative data, and there is a high level of agreement between sampling frame assignment and tabulated race or ethnicity for responding firms, the donor imputations are considered to be reliable. Estimates of sampling variability are adjusted to account for nonresponse. Estimates with high error (relative standard error for sales or receipts of 50 percent or more) are suppressed. Overall, imputed data accounted for approximately 47 percent of the firm count estimates by gender, ethnicity, race, and veteran status and approximately 20 percent of the estimates of sales.

Inherent Differences between SBO and PUMS Information

The Public Use Microdata Sample (PUMS) is a large dataset available to the public, derived from the original SBO dataset of responses. According to the U.S. Census Bureau, measures were taken in constructing the PUMS file to protect the confidentiality of the SBO data so that it can be used freely by the public. In the PUMS file, each record corresponds to a business, but deliberate measures were taken to ensure the anonymity of each business. For businesses operating in multiple states and/or industry sectors, one record exists for each state and sector combination in which the firm conducts business. Identifiers to link the component records of a business are not included. Additionally, businesses classified in the SBO as publicly owned or not classifiable by gender, ethnicity, race, or veteran status are not included in the PUMS file because many publicly owned firms are easily identifiable. Since the primary focus of my research is on small businesses, the exclusion of possibly larger public corporations does not significantly impact the integrity of the data.

Finally, the U.S. Census Bureau infused noise into the PUMS data for disclosure avoidance and confidentiality protection. Values are perturbed prior to tabulation by applying a random noise multiplier to the magnitude data, such as the sales and receipts for all firms. This introduced variation perturbed data points by no more than a few percentage points.



Data Manipulation and Cleaning

Subsetting to Small Businesses and One Owner Data

According to the Small Business Administration, small businesses are generally considered to have fewer than 500 employees and under $7 million in receipts, depending on the industry. By subsetting the original dataset according to these parameters, the total observation count decreased from 2,165,680 to 2,025,530.
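The subsetting step can be illustrated with a minimal Python sketch. It is not the SAS code used for this study: the three records are invented, and it assumes that RECEIPTS_NOISY is stored in thousands of dollars, so that $7 million corresponds to 7,000.

```python
# Illustrative cutoffs taken from the SBA guideline quoted above.
# Assumption: receipts are recorded in thousands of dollars.
MAX_EMPLOYEES = 500
MAX_RECEIPTS_THOUSANDS = 7_000


def is_small_business(record):
    """Return True if a record falls under both small-business cutoffs."""
    return (record["EMPLOYMENT_NOISY"] < MAX_EMPLOYEES
            and record["RECEIPTS_NOISY"] < MAX_RECEIPTS_THOUSANDS)


# Hypothetical records; field names mirror the PUMS variables named in this report.
firms = [
    {"EMPLOYMENT_NOISY": 3, "RECEIPTS_NOISY": 120},
    {"EMPLOYMENT_NOISY": 650, "RECEIPTS_NOISY": 5_000},   # too many employees
    {"EMPLOYMENT_NOISY": 40, "RECEIPTS_NOISY": 12_000},   # receipts too high
]

small_firms = [f for f in firms if is_small_business(f)]
print(len(small_firms), "of", len(firms), "records kept")
```

Both conditions must hold for a record to be kept, which matches the "and" in the definition above.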

Furthermore, according to the 2007 SBO PUMS Guide, a response of 0 for most of the qualitative questions indicated that the data for these variables were missing. As a result, each 0 was converted to a missing value to more accurately reflect the nature of this data and to correctly set up regression procedures later on.

Another major issue for data analysis was the inclusion of data points for up to three additional business owners. As a result, there exist three additional sets of demographic variables. However, if a business only has one owner, these three additional sets of variables would all have missing values. Given the fact that PROC REG and other regression commands exclude observations that have even one missing value, allowing the sixty extra variables for additional owners would lead to the exclusion of a significant number of observations. Moreover, I noticed that virtually every business had responses to all variables describing the first owner, and that the first owner almost always owned as much of the business as (if not more than) his or her other 1-3 business partners. This justified my decision to remove all variables affiliated with the second, third, and fourth owners.

One future recommendation would be to keep these deleted variables and find other options to analyze the overall dataset despite the necessary inclusion of extra missing values. Observing businesses with the intent to analyze the relationship between multiple business owners (when also factoring in age, experience, and education) could be incredibly valuable for the studies of organizational behavior and business.

Additional deleted variables included those that had over a 90% missing rate, since they severely diminished the number of total observations used in regression. For the purposes of my research questions, having more observations to interpret is preferable, but including these variables in another analysis of businesses that have generally ceased activity would also be worthwhile. These variables include:

CEASEOTHER: Ceased operations for another reason

SOLDBUS: Sold this business

STARTANOTHER: Started another business

NOPERSCRED: Lack of personal loans/credit

NOBUSCRED: Lack of business loans/credit

LOWSALES: Inadequate cash flow or low sales

ONETIME: Operated for one-time event

DECEASED: Owner died

RETIRE: Owner retired



Lastly, I consolidated all language variables and the race and ethnicity variables in order to improve overall adjusted fit and reduce noise. After analyzing the spread of the many language variables in the dataset (English, Arabic, Chinese, French, German, Greek, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tagalog, and Vietnamese), I found that the most frequent languages spoken during business transactions are naturally English and Spanish. The other minor languages collectively do not constitute more than 3% of the total number of observations, so I created a new language variable with five levels: Only English, Only Spanish, Only English and Spanish, Only English and Other, and Only Other Language. This effectively consolidated sixteen binary variables into one variable with five levels. Additionally, the race variable had at least twenty levels due to the combinations of different races, which encouraged consolidation. Both race and ethnicity were combined to create a new variable, which followed this algorithm:

1. If ethnicity is Hispanic then Race/Ethnicity is Hispanic

2. If ethnicity is not Hispanic and the owner is White AND another minority, then the owner is considered part of the other minority

3. If ethnicity is not Hispanic and the owner is only White or another minority, then the race stays as listed

4. If ethnicity is not Hispanic and the owner is at least two types of race (both of which are not White), then the owner is considered Mixed.

As a result, the final levels for the new race/ethnicity variable are: W for White, B for Black, A for Asian, H for Hispanic, I for American Indian, P for NHOPI (Native Hawaiian or Other Pacific Islander), and Mixed (for any combination of two non-White and non-Hispanic races, such as Black/Asian).
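The four rules above can be sketched as a small Python function. This is an illustration rather than the SAS code used for the study; the function name and the set-of-letters input are hypothetical, and two choices are assumptions not stated in the rules: an owner with no race listed is treated as missing, and White combined with two or more other races is treated as Mixed.

```python
def consolidate_race_ethnicity(is_hispanic, races):
    """Collapse an ethnicity flag and a set of race codes into one level.

    races is a set of single-letter codes such as {"W", "B"}.
    Returns None when no race is listed (assumption: treated as missing).
    """
    if is_hispanic:                  # rule 1: Hispanic ethnicity takes priority
        return "H"
    races = set(races)
    if not races:
        return None
    non_white = races - {"W"}
    if not non_white:                # rule 3: White only
        return "W"
    if len(non_white) == 1:          # rule 2 (White plus one) or rule 3 (one race)
        return next(iter(non_white))
    return "Mixed"                   # rule 4: two or more non-White races


print(consolidate_race_ethnicity(True, {"W"}))         # Hispanic owner
print(consolidate_race_ethnicity(False, {"W", "A"}))   # White and Asian
print(consolidate_race_ethnicity(False, {"B", "A"}))   # Black and Asian
```

Checking the Hispanic flag first is what resolves the duality described in the Nonresponse section, because the race that a Hispanic owner was required to list never reaches the later rules.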

Tabulation Weights

In most surveys, some groups are over-represented in the raw data and others under-represented. To address this, weights are assigned to each observation to compensate for the over- or under-representation. While the exact method of determining these weights for the PUMS is unknown, the values of the tabulation weight range from 1.0 to 35.0. In this sense, a single observation with a weight of 35.0 would functionally be the same as thirty-five individual observations with identical values and a weight of 1.0 each. Analysis of the data therefore requires the weights to be properly factored into averages, percentages, and regressions through the proper SAS procedures.
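A small Python sketch, using made-up receipts values, shows why a weight of 35.0 behaves like thirty-five identical observations when computing an average. The function name is hypothetical, and this is only an illustration of the idea, not a substitute for the SAS survey procedures.

```python
def weighted_mean(values, weights):
    """Mean of values in which each observation counts `weight` times."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two hypothetical firms: one with tabulation weight 35.0 and one with weight 1.0.
receipts = [100.0, 20.0]
weights = [35.0, 1.0]
weighted = weighted_mean(receipts, weights)

# The same data written out as 36 unweighted observations.
expanded = [100.0] * 35 + [20.0]
unweighted = sum(expanded) / len(expanded)

print(round(weighted, 4), round(unweighted, 4))
```

Ignoring the weights would instead give a plain average of 60.0 for the two firms, which understates the contribution of the heavily weighted record.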

Fitting a Regression

Intuitively, I first wanted to fit a regression for all variables against receipts to observe the overall coefficient of determination and the comparative significance of each individual variable in determining receipts. This would immediately confirm or deny several of my possible research questions about the variables and would then be a starting point for further research ideas. I gave further consideration to other possible response variables besides receipts, such as employment, payroll, and certain categorical variables such as whether or not the business was still operating. Ultimately, I decided to focus primarily on receipts as the response variable, since in general business theory an increase in either payroll or employment usually comes after an increase in overall revenues.

In order to begin, I had to identify the proper statistical procedure to regress several quantitative and multi-level categorical variables against receipts. I also wanted to use a procedure that would automatically create dummy variables and evaluate correlations while controlling for other variables. Ultimately, I selected the generalized linear model procedure because it satisfied the aforementioned criteria. Its limitations include the fact that responses must be independent of one another (which is extremely difficult to prove given game theory and the general economics of competition), and the fact that the effects of the predictors are still assumed to be linear.

Regression One: All Variables Against Receipts

Data Summary:
Number of Observations: 874,182

Included Variables: RECEIPTS_NOISY EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks OPERATING LANGUAGE

Further analysis shows that these are the top five variables based on both the magnitude of the t-value and the estimate, as these parameters indicate both statistical and practical significance. Since the vast majority of variables are significant in the model at a 0.05 significance level, the magnitude of the estimate is the most determinant factor in establishing importance.


Estimated Regression Coefficients

Parameter                         Estimate    Standard Error   t Value   Pr > |t|
SECTOR 11 Agriculture, Fishing    273.5380    19.767836        13.84
HOLIDAYS 1                        168.2109    5.420379         31.03



sets the significance level at which variables can be removed from the model, which is once again usually 0.05.

Stepwise Regression

Stepwise regression is a fusion of forward and backward selection. At each stage, four options are considered: add a term, delete a term, swap a term in the model for one not in the model, or stop. This algorithm is the one most often used in practice. Despite its widespread use, it has little theoretical basis. A more theoretically grounded tool like Akaike's Information Criterion (AIC) can also be used as a good metric to assess models. Limitations O

Fit Statistics

Adjusted Coefficient of Determination / Adjusted R-Square

The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it usually is not. It is never higher than the R-squared.
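As an illustration, the standard adjustment is 1 - (1 - R-squared)(n - 1)/(n - p - 1) for n observations and p predictors. The Python sketch below applies it to two hypothetical fits (the numbers are not taken from the regressions in this report) to show both the downward adjustment and a case where the result turns negative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


# Hypothetical fits, not results from this study.
moderate = adjusted_r_squared(0.50, n=101, p=10)   # slightly below 0.50
weak = adjusted_r_squared(0.05, n=21, p=5)         # negative: weak fit, many predictors

print(round(moderate, 4), round(weak, 4))
```

With a very large sample relative to the number of predictors, the ratio (n - 1)/(n - p - 1) is close to 1, so the adjusted and unadjusted values are nearly identical.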

Akaike Information Criterion

The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the estimated Kullback-Leibler divergence between the model and the truth. It is based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that fits the truth well but has few parameters. It is defined as:

$$\mathrm{AIC} = -2 \ln(L) + 2K$$

where the likelihood $L$ is the probability of the data given a model and $K$ is the number of free parameters in the model. For a least-squares regression with normally distributed errors, this reduces (up to an additive constant) to

$$\mathrm{AIC} = n \ln(\mathrm{SSE}/n) + 2K$$

where SSE is the error sum of squares and $n$ is the number of observations. AIC scores are often shown as ΔAIC scores, or the difference between the best model (smallest AIC) and each model (so the best model has a ΔAIC of zero). Used in stepwise regression, AIC can be used instead of the p-value as the main criterion for model selection. The AIC of each iterative model should be calculated and compared to that of the previous model, and the current model should only be preferred if its AIC is smaller than the AIC of the prior model. This process continues until the best model is selected.
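The comparison of candidate models by AIC can be sketched in Python. The sketch assumes the common least-squares form AIC = n ln(SSE/n) + 2K; the three candidate models, their SSE values, and their parameter counts are invented for illustration.

```python
import math


def aic_least_squares(sse, n, k):
    """AIC for a least-squares fit with error sum of squares sse and k parameters."""
    return n * math.log(sse / n) + 2 * k


n = 50
# Hypothetical candidates, given as name: (SSE, number of free parameters K).
candidates = {"A": (120.0, 3), "B": (100.0, 5), "C": (99.5, 9)}

aic = {name: aic_least_squares(sse, n, k) for name, (sse, k) in candidates.items()}
best = min(aic, key=aic.get)
delta_aic = {name: value - aic[best] for name, value in aic.items()}

for name in candidates:
    print(name, round(aic[name], 2), round(delta_aic[name], 2))
```

Model C has the smallest SSE but pays for four extra parameters that barely improve the fit, so it is not the model with the smallest AIC.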

Bayesian Information Criterion

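$$\mathrm{BIC} = n \log(\mathrm{SSE}_p) - n \log(n) + p \log(n)$$

The BIC acts essentially the same as the AIC but imposes a more severe penalty per parameter once n is 8 or larger, since log(n) then exceeds the AIC's constant of 2.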



Schwarz Bayesian Criterion

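$$\mathrm{SBC} = n \ln(\mathrm{SSE}/n) + k \ln(n)$$

This is essentially like the AIC equation but multiplies the number of parameters by ln(n), a penalty that grows with sample size, rather than by a constant of 2. It is the same criterion as the BIC above, written with the two logarithm terms combined. By default, PROC GLMSELECT uses stepwise selection based on the Schwarz Bayesian Criterion.[1]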
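The difference between the two penalties can be checked numerically. The Python sketch below assumes the least-squares forms n ln(SSE/n) + 2k for AIC and n ln(SSE/n) + k ln(n) for SBC; the SSE, n, and k values are hypothetical.

```python
import math


def aic(sse, n, k):
    """Least-squares AIC: fit term plus a penalty of 2 per parameter."""
    return n * math.log(sse / n) + 2 * k


def sbc(sse, n, k):
    """Schwarz Bayesian Criterion: same fit term, penalty of ln(n) per parameter."""
    return n * math.log(sse / n) + k * math.log(n)


# The per-parameter penalty ln(n) overtakes the AIC constant of 2 between n = 7 and n = 8.
for n in (7, 8, 100, 874182):
    print(n, round(math.log(n), 3))

# For the same hypothetical fit, the gap between the criteria is k * (ln(n) - 2).
gap = sbc(100.0, 50, 5) - aic(100.0, 50, 5)
print(round(gap, 3))
```

With hundreds of thousands of observations, the SBC penalty per parameter is several times the AIC penalty, which is consistent with SBC-based selection retaining fewer variables than AIC-based selection.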

Mallows' Cp

$$C_p = \frac{(1 - R_p^2)(n - T)}{1 - R_T^2} - [n - 2(p + 1)]$$

The AIC has been shown to be equivalent to Mallows' Cp, which is used to assess the fit of a regression model that has been estimated using ordinary least squares. Cp measures the bias in the reduced regression model relative to the full model having all T candidate predictors. If Cp is roughly equal to p, then the reduced model predicts as well as the full model. If Cp < p, then the reduced model is estimated to predict better than the full model. In practice, the selected model should have the smallest Cp.

Mean Squared Error (MSE)

The Mean Squared Error in regression refers to the residual sum of squares divided by the number of degrees of freedom. Minimizing MSE is important to ensure that the maximum amount of variation of the regression can be explained by the independent variables, thus establishing the robustness of a model. It is one of the most important and fundamental criteria that can be used to evaluate models.
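Both the MSE and Mallows' Cp can be computed from a handful of summary numbers. The Python sketch below uses the form Cp = (1 - Rp²)(n - T)/(1 - RT²) - [n - 2(p + 1)]. Reading T as the parameter count of the full model, intercept included, is an assumption about the notation, and all input values are hypothetical.

```python
def mean_squared_error(sse, n, num_params):
    """Residual sum of squares divided by the residual degrees of freedom."""
    return sse / (n - num_params)


def mallows_cp(r2_reduced, r2_full, n, p, t):
    """Mallows' Cp for a reduced model with p predictors.

    t is taken to be the number of parameters in the full model
    (an assumption about the notation in the formula above).
    """
    return (1.0 - r2_reduced) * (n - t) / (1.0 - r2_full) - (n - 2 * (p + 1))


mse = mean_squared_error(450.0, n=100, num_params=10)

# Sanity check: when the reduced model is the full model, Cp equals p + 1.
cp_full = mallows_cp(0.80, 0.80, n=100, p=4, t=5)

# A reduced model that explains slightly less variation with two fewer predictors.
cp_reduced = mallows_cp(0.79, 0.80, n=100, p=2, t=5)

print(mse, round(cp_full, 4), round(cp_reduced, 4))
```

In this made-up case the reduced model's Cp is larger than its parameter count, which suggests the two dropped predictors were carrying real information.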

General Criteria:

General diagnostics should be calculated for each model to help determine which model is best. These model diagnostics include the mean squared error (MSE), the adjusted coefficient of determination (adjusted R²), and Mallows' Cp. A good linear model will have a small MSE and Cp and a high adjusted R², close to 1. With these criteria in mind, in addition to stepwise regression with tools such as AIC, BIC, and SBC, I can develop a more robust model than the original. However, it is also important to note that use of these criteria and selection procedures will not definitively yield the best model, due to the sheer number of potential models and the inherent limitations of these tools.

Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion

Using PROC GLMSELECT, I used the aforementioned steps to select a model using stepwise regression alone.


Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 SECTOR N07_EMPLOYER SEX1 VET1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 EDUC1 BORNUS1 ESTABLISHED SCSAVINGS SCEQUITY SCCREDIT SCGOVTGUAR SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE ECSAVINGS ECASSETS ECCREDIT ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT LT40HOURS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING HUSBWIFE NUMOWNERS REGION

Fit Statistics:
R-square: 0.5768
Adjusted R-square: 0.5768
Root MSE: 1516.49340

[1] http://www2.sas.com/proceedings/sugi31/207-31.pdf

Regression Three: PROC GLMSELECT with Akaike's Information Criterion

Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC.


Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE

Fit Statistics:
R-square: 0.5771
Adjusted R-square: 0.5770
Root MSE: 1516.10130

Since the AIC model is on par with the highest overall adjusted R-squared, has the fewest model variables, and has a relatively low MSE compared to the second stepwise model, it should be considered the most robust. It is noted that the first linear regression of all variables has a considerably lower MSE than the stepwise-selected models, despite the inclusion of far more variables. However, given the prevalence of missing data and nonresponse in these data sets, future data collection may benefit from relying on fewer variables in the generalized linear model. Therefore, the AIC model ranks as the most effective.



The most significant variables are also those aforementioned in Regression One: sector, health insurance, holidays, and employer status for tabulation. Sector is arguably the most intuitively significant, while both health insurance and holidays are usually indicative of a business performing well enough to provide extended luxuries for employees. Surprisingly, both payroll and employment had far smaller magnitudes in their estimates, which indicates insignificance on a practical level.

There were several limitations to the regression. As stated earlier, the main drawbacks of using the generalized linear model are that it can over-fit the data and that it assumes the relationship between the explanatory and response variables is linear, which may not be the case. Ideally, tests for nonlinearity could be conducted on each variable by plotting the residuals versus the predicted values. Furthermore, multicollinearity should be assessed by calculating the variance inflation factor for each variable, which remains to be done considering there is no built-in functionality for this purpose for survey data in SAS. Due to the large number of variables, testing for multicollinearity is especially important. From a cursory analysis, most of the t-ratios for the individual coefficients are statistically significant, which could indicate that multicollinearity is not severe. Another possible option was to create a correlation matrix, which proved too cumbersome considering the matrix would have dimensions greater than 150 x 150. Finally, further work could investigate the addition of extra terms, such as interactions between the original terms, that better reflect the lack of complete independence between explanatory variables.

Proper model selection requires conscientious consideration of the tradeoff between two competing objectives: adherence to the data and model simplicity. Good models conform to the data with a strong goodness of fit, but are also easily generalizable in their interpretation. Finally, good models should neither under-fit (leaving out key variables in an attempt to be generalizable) nor over-fit the data (including extraneous or unrealistic variable effects in an attempt to achieve the best goodness of fit), because in either scenario the conclusion loses value.



QQ Plot of Residuals versus a Normal Distribution

This quantile-quantile plot of the residuals versus a normal distribution shows that the residuals appear to be normally distributed through the inner quartiles, but have long tails on both sides, particularly the left side. Since this QQ plot indicates significant skew in the residuals, the model cannot be used to draw strong conclusions.

Plot of Fitted Values versus Residuals



This plot of fitted values versus residuals shows a biased and heteroscedastic spread, with an interesting pattern of residuals steadily decreasing in variation at higher values of the response variable. Due to the extreme density of points, attributable to the large amount of overall variation and the sheer number of data points, more specific patterns cannot be analyzed at this time.

The plot also points to the issue of the model drastically over-estimating predicted values, given the sheer magnitude of the negative residuals, which calls for further analysis of the observed businesses that deviate the most from the model. Given that over half of the observed businesses have receipts less than $16,000 according to the five-number summary, running a regression on the subset of the PUMS that includes only businesses earning more than the median may produce a better-fitting, less biased, and more homoscedastic model.

The five-number summary for the residual values clearly shows greater negative skew on the whole, but a greater density of positive values over smaller intervals, thus confirming the residuals versus fitted plot. Since numerical variables typically have more leverage over the fit and spread of the residuals, and receipts is extremely right-skewed, an analysis of both payroll and employment is warranted to see if similar effects are in place. The following two tables give the five-number summaries for payroll and employment.

These results confirm that both employment and payroll are heavily right-skewed, which would explain the potential for the model to drastically overestimate certain values. Since the actual magnitude (from a dollar or labor force standpoint) causes this extreme skew, creating a new binary variable for each of these variables, such that 0 indicates employment or payroll of 0 and 1 indicates employment or payroll greater than 0, could result in a better fit. Additionally, taking the logarithm of RECEIPTS_NOISY may also improve the fit because of the skew.
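The two transformations described above can be sketched in a few lines of Python (an illustration only; the original work was done in SAS). The indicator names mirror the ifpayroll and ifemployment variables listed for Regression Five. Because the minimum of RECEIPTS_NOISY is 0, the sketch uses log(1 + receipts); that offset is an assumption, since the text does not state how zero receipts were handled, and the sample records are hypothetical.

```python
import math

def to_indicator(value):
    """Return 1 when the amount is greater than 0, otherwise 0."""
    return 1 if value > 0 else 0

def log_receipts(receipts):
    """log(1 + receipts); the +1 offset is an assumption to handle zero receipts."""
    return math.log1p(receipts)

# Hypothetical records: (receipts, payroll, employment)
records = [
    (0.0, 0.0, 0.0),
    (15.2, 0.0, 0.0),
    (81.8, 12.5, 2.0),
    (6900.0, 2800.0, 45.0),
]

transformed = [
    {
        "log_receipts": log_receipts(r),
        "ifpayroll": to_indicator(p),
        "ifemployment": to_indicator(e),
    }
    for r, p, e in records
]

for row in transformed:
    print(row)
```

The indicators discard magnitude entirely, so they trade information about business size for robustness to the extreme right tail.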

In order to establish a control for the following regressions, I first run PROC GLMSELECT with AIC, with the only change being the logarithm of the receipt values.

Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform

Data Summary: Number of Observations - 784,208

Five Number Summary of RECEIPTS_NOISY

Min    Q1         Median      Q3          Max
0      1.72393    15.23411    81.77834    6900

Five Number Summary of Residuals

Min          Q1         Median    Q3        Max
-93534.00    -222.63    -36.71    101.67    8271.42

Five Number Summary of EMPLOYMENT_NOISY

Min    Q1    Median    Q3    Max
0      0     0         0     4890.00

Five Number Summary of PAYROLL_NOISY

Min    Q1    Median    Q3    Max
0      0     0         0     280000.00



Included Variables: PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE

Fit Statistics:
R-Square: 0.6689
Adjusted R-Square: 0.6688
Root MSE: 3.07507

While the fit has improved by most standards, the overall Q-Q plot and the fitted versus residual values plot still indicate major issues with skew and bias. As stated earlier, creating binary variables out of the existing numeric payroll and employment variables could control some of the drastic skewness observed in both plots, which leads to the next regression.



Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts

Using PROC GLMSELECT, I followed the aforementioned steps to select a model using the AIC, with the modified binary payroll and employment variables.

Variables Used: PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE ifpayroll ifemployment

Fit Statistics:
R-Square: 0.6550
Adjusted R-Square: 0.6549
Root MSE: 3.13900

Although the fit statistics seem to indicate a worse fit than the previous model, given the lower coefficients of determination and a higher root MSE, the Q-Q plot indicates a better overall fit and far less skew, especially left skew.



The correction of skew also allows me to observe the bulk of the fitted versus residuals plot more closely, which is starting to show very peculiar patterns that require further investigation. One possible reason for this is that virtually all of the regressed variables are categorical, with the exception of the ownership percentage of the first business owner.

Although the previous model had better fit statistics and the current model may be susceptible to overfitting the sample, this model iteration still ultimately has better results according to the Q-Q plot and the fitted versus residuals plot due to reduced skewness.



Analysis of Language

Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?

The United States is often seen as the nation most conducive to immigrant entrepreneurship, especially in an increasingly globalized society. To start, I investigated the relationship between different languages and overall business receipts. Running a general linear model procedure for only the language variable against receipts results in the following:

Estimates of Language Variable Levels Against Receipts

One interesting phenomenon is that a business that conducts transactions only in English and Spanish is associated with higher gross receipts than a business that conducts transactions in any other language or language combination. Most notably, the estimate for a business that uses only English and Spanish is nearly 40% greater than the estimate for a business that uses only English for transactions.

It's possible that Hispanics are employed by the agriculture, manufacturing, and construction industries more so than by others. The following graph, based on 2008 research by the Center for Construction Research and Training, depicts Hispanic employees as a percentage of each industry.

These data point to a possible underlying non-uniform distribution of Hispanics across business industries. It's important to understand the difference between businesses that conduct transactions in a certain language and businesses that operate internally using the language. For example, a business with a large proportion of Hispanic employees may not necessarily conduct business transactions in Spanish. Therefore, an additional research question could be the relationship between the language and the new race/ethnicity variable. The percentage of Hispanics by industry in this instance simply points to industries to investigate more closely, such as construction and agriculture.

Using this information, I analyzed the breakdown of language by NAICS sector code, paying special attention to the industries that had the greatest percentage of Only English and Spanish businesses.



Industries with the Greatest Proportion of Only Spanish and English Businesses:

Sector 55: 13.63%
The Management of Companies and Enterprises sector comprises (1) establishments that hold the securities of (or other equity interests in) companies and enterprises for the purpose of owning a controlling interest or influencing management decisions or (2) establishments (except government establishments) that administer, oversee, and manage establishments of the company or enterprise and that normally undertake the strategic or organizational planning and decision making role of the company or enterprise.

Sector 62: 10.74%
The Health Care and Social Assistance sector comprises establishments providing health care and social assistance for individuals.

Industries with the Greatest Proportion of Only English Businesses:

Sector 21: 95.71%
The Mining sector comprises establishments that extract naturally occurring mineral solids, such as coal and ores; liquid minerals, such as crude petroleum; and gases, such as natural gas.

Sector 11: 92.90%
The Agriculture, Forestry, Fishing and Hunting sector comprises establishments primarily engaged in growing crops, raising animals, harvesting timber, and harvesting fish and other animals from a farm, ranch, or their natural habitats.

Given these interpretations, looking at the mean receipts for each sector could then explain why businesses that conducted transactions in English and Spanish have statistically higher receipts on average. The following table shows each sector and its mean receipts:

Statistics

Variable          Mean          Std Error of Mean
RECEIPTS_NOISY    174.631446    0.353243

Domain Analysis: SECTOR

SECTOR                                             Mean          Std Error of Mean
11 Agriculture, Fishing                            96.067138     2.182337
21 Mining, Quarrying, Oil Extraction               233.433129    6.192045
22 Trade, Transportation, Utilities                113.609696    6.397033
23 Construction                                    218.497973    1.149555
31 Manufacturing                                   527.070573    4.920525
42 Wholesale Trade                                 628.044453    5.295762
44 Retail Trade                                    267.633023    1.575551
48 Transportation and Warehousing                  144.352095    1.225643
51 Information                                     156.241621    2.441339
52 Finance and Insurance                           170.082058    1.558410
53 Real Estate and Rental                          120.396544    0.699686
54 Professional, Scientific, Technical Services    139.530715    0.707921
55 Mgmt. of Companies and Enterprises              490.728375    19.475524
56 Admin. and Support and Waste Management         100.129900    0.835629
61 Education                                       45.306452     0.868810
62 Healthcare                                      175.218046    1.177138
71 Arts and Etnmt.                                 58.393615     0.740265
72 Accommodation and Food Services                 356.426518    2.657923
81 Other Services                                  69.516196     0.417748
99 Unclassifiable                                  102.268452    8.707951

According to this PROC SURVEYMEANS output, which gives the mean gross receipts of businesses in each industry alongside the mean gross receipts of all businesses across industries (174.63), both sectors 55 and 62 earn above average. While this preliminarily explains why the overall estimate for Only English and Spanish businesses is higher than the estimates for other language levels, the following question should be investigated:



Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?

Table of LANGUAGE by SECTOR

LANGUAGE                    SECTOR    Frequency    Weighted Frequency    Percent
Only English and Spanish    11        390          4076                  0.4269
                            21        263          1907                  0.1998
                            22        123          533.41200             0.0559
                            23        7472         101173                10.5960
                            31        3480         17790                 1.8632
                            42        4346         27406                 2.8702
                            44        13181        113815                11.9199
                            48        3979         44151                 4.6239
                            51        1460         9222                  0.9658
                            52        5665         46651                 4.8858
                            53        6760         85715                 8.9770
                            54        11573        117115                12.2656
                            55        1159         1581                  0.1656
                            56        5933         74071                 7.7575
                            61        1325         16665                 1.7453
                            62        11269        127184                13.3201
                            71        1917         25298                 2.6495
                            72        4017         35816                 3.7511
                            81        7699         104522                10.9467
                            99        22           135.88500             0.0142
Total                                 92033        954827                100.000

The five highest concentrations of Only English and Spanish businesses are in sectors 62 (13.32% and receipts of 175.22), 54 (12.27% and receipts of 139.53), 44 (11.92% and receipts of 267.63), 81 (10.95% and receipts of 69.52), and 23 (10.60% and receipts of 218.50). Although Only English and Spanish businesses made up a large share of businesses within sector 55, that sector actually had one of the smallest frequencies, with just 1159 businesses in total. Revising my initial conclusion, I argue that the higher estimate is most likely derived from consistently above-average performance in sectors where Only English and Spanish businesses are prevalent.
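The percentages and the ranking above follow directly from the weighted frequencies in the LANGUAGE by SECTOR table. As a small Python check (an illustration, not the original SAS PROC SURVEYFREQ run), each sector's share is its weighted frequency divided by the weighted total; the variable names below are my own.

```python
# Weighted frequencies of Only English and Spanish businesses by NAICS sector,
# copied from the LANGUAGE by SECTOR table.
weighted_freq = {
    11: 4076, 21: 1907, 22: 533.412, 23: 101173, 31: 17790,
    42: 27406, 44: 113815, 48: 44151, 51: 9222, 52: 46651,
    53: 85715, 54: 117115, 55: 1581, 56: 74071, 61: 16665,
    62: 127184, 71: 25298, 72: 35816, 81: 104522, 99: 135.885,
}

total = sum(weighted_freq.values())

# Percentage share of each sector among Only English and Spanish businesses.
shares = {sector: 100.0 * w / total for sector, w in weighted_freq.items()}

# Sectors ordered from largest to smallest share; keep the first five.
top_five = sorted(shares, key=shares.get, reverse=True)[:5]

for sector in top_five:
    print(f"Sector {sector}: {shares[sector]:.2f}%")
```

Sector 55 sits near the bottom of this ranking, which is why its high mean receipts cannot be the main driver of the overall estimate.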



Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States

As noted earlier, there is a distinction between businesses run by Hispanic owners and businesses that conduct transactions in English and Spanish. I approached this issue in a different fashion, by plotting the percentage of both (for each state) on a map of the United States. I pre-defined set intervals after looking at the spread of percentages for each state. The following graphs generally depict the same pattern, with higher concentrations of all Hispanic/Spanish-speaking subjects located in the Southwest and Florida. I obtained 2007 data on the number of Hispanics living in each state from the Pew Research Hispanic Trends Project.



Analysis of Capital Sources in California versus the United States

California's $2 trillion economy would be the ninth biggest in the world if it were a country. The state represents 13% of the U.S. economy. However, California has been ranked as one of the worst states in which to do business in recent years, according to business executives and publications.[2] The state has been under duress from the dramatic fall in home prices and reduced tax revenues. Moreover, California consistently has one of the highest costs of living and operation. Interestingly, California also ranks among the best for technology and innovation. Another plus is the $36 billion in venture capital invested in California companies over the past three years, which is four times the total of any other state.[3] California is also home to Silicon Valley. According to a 2006 study by the American Electronics Association, Silicon Valley and the Bay Area as a whole ranked first in the number of high-tech jobs in the United States. While the PUMS does not have location data more granular than the state level, the existence of Silicon Valley could point to interesting statistical characteristics of California that no other state may share.

Because California is a state of extremes, I found it interesting to investigate sources of startup and expansion capital and their relationship to revenues in California, especially compared to the same relationship in the United States as a whole.

Dataset Overview

The observed dataset is simply a subset of the cleaned PUMS, obtained by keeping only observations whose FIPS code is 06 (California). This dataset has 182,932 observations, and most categorical variables (except location) seem to have a missing percentage of 30-50%, which is slightly higher than the typical missing pattern observed in the U.S. dataset. In this section, both startup and expansion capital will be analyzed.

Startup capital refers to the initial investment needed to fully bring a product or service to market. It can be used for everything from business operating expenses to research and development to payroll. It is typically used to fund businesses still in their infancy, and can be repaid once the business reaches a level of maturity at which it earns revenues on its own.

Companies that seek expansion capital, on the other hand, will often do so in order to finance a transformational event in their business. These companies are likely to be more mature (in terms of operating time) than venture capital-funded companies, able to generate revenue and operating profits but unable to generate sufficient cash to fund major opportunities, acquisitions, or other investments. Because of this lack of scale, these companies generally can find few alternative conduits to secure capital for growth, so access to growth equity can be critical to pursue necessary facility expansion, sales and marketing initiatives, equipment purchases, and new product development.

[2] http://www.cnbc.com/id/100843287
[3] http://www.forbes.com/pictures/mli45kikd/41-california/




Research Question 2a: What sources of capital have the most positive relationship with receipts?

Regressing both startup capital and expansion capital variables against receipts in both the California and United States datasets reveals interesting statistics on the percentage of usage of each type of capital, as well as the practical significance of each capital source as represented by its estimate. The following tables summarize the full SAS outputs contained in Appendix C.

Summarized Table for Startup Capital

                 California          U.S.                Estimates           True Estimates (added to Intercept)
Variable         Yes       No        Yes       No        CA        USA       CA        USA       Difference
SCSAVINGS        60.97%    39.03%    58.01%    41.99%    11.70     -15.78    113.71    154.40    -40.69
SCASSETS         6.23%     93.77%    7.05%     92.95%    66.98     48.85     168.98    219.03    -50.04
SCEQUITY         6.12%     93.88%    5.07%     94.93%    71.58     49.37     173.59    219.55    -45.96
SCCREDIT         11.22%    88.78%    10.31%    89.69%    -53.88    -90.21    48.13     79.96     -31.83
SCGOVTLOAN       0.40%     99.60%    0.56%     99.44%    155.74    91.93     257.74    262.10    -4.36
SCGOVTGUAR       0.45%     99.55%    0.59%     99.41%    128.86    195.60    230.87    365.78    -134.92
SCBANKLOAN       4.47%     95.53%    9.18%     90.82%    247.36    305.75    349.37    475.93    -126.56
SCFAMLOAN        2.23%     97.77%    2.28%     97.72%    36.52     180.69    138.53    350.86    -212.34
SCVENTURE        0.30%     99.70%    0.25%     99.75%    68.87     364.61    170.88    534.78    -363.91
SCGRANT          0.18%     99.82%    0.20%     99.80%    -78.63    -103.56   23.37     66.62     -43.24
SCOTHER          1.74%     98.26%    1.70%     98.30%    203.10    170.52    305.10    340.69    -35.59
SCDONTKNOW       4.47%     95.53%    4.58%     95.42%    150.63    183.79    252.64    353.96    -101.32
SCNONENEEDED     23.59%    76.41%    24.35%    75.65%    -53.49    111.12    48.52     281.30    -232.78
SCNOTREPORTED    5.33%     94.67%    5.32%     94.68%    0.00      0.00      102.01    170.18    -68.17
INTERCEPT                                                102.01    170.18
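The "True Estimates" columns in the table above are simply each regression's intercept plus the capital source's estimate, and the "Difference" column is the California value minus the U.S. value. A short Python sketch of that arithmetic follows (the helper name is hypothetical; the numbers are copied from the table, and small discrepancies of about 0.01 elsewhere in the table come from rounding of the underlying SAS output).

```python
CA_INTERCEPT = 102.01   # California intercept from the table
US_INTERCEPT = 170.18   # United States intercept from the table

def true_estimate(intercept, estimate):
    """Predicted receipts for a business using the capital source."""
    return intercept + estimate

# (California estimate, U.S. estimate) for a few startup capital sources.
estimates = {
    "SCSAVINGS": (11.70, -15.78),
    "SCBANKLOAN": (247.36, 305.75),
    "SCVENTURE": (68.87, 364.61),
}

rows = {}
for name, (ca_est, us_est) in estimates.items():
    ca_true = true_estimate(CA_INTERCEPT, ca_est)
    us_true = true_estimate(US_INTERCEPT, us_est)
    rows[name] = (ca_true, us_true, ca_true - us_true)

for name, (ca_true, us_true, diff) in rows.items():
    print(f"{name}: CA {ca_true:.2f}, USA {us_true:.2f}, difference {diff:.2f}")
```

Because the two intercepts differ by roughly 68, a source can have a larger raw estimate in California and still show a negative difference once the intercepts are added back.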

Glossary of Capital Sources

Startup Capital (SC)

SCSAVINGS: Personal Savings
SCASSETS: Other Personal Assets
SCEQUITY: Home Equity
SCCREDIT: Credit Cards
SCGOVTLOAN: Government Loan
SCGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
SCBANKLOAN: Loan from Bank
SCFAMLOAN: Loan from family and friends
SCVENTURE: Venture Capitalist
SCGRANT: Grant
SCOTHER: Other

Expansion Capital (EC)

ECSAVINGS: Personal Savings
ECASSETS: Other Personal Assets
ECEQUITY: Home Equity
ECCREDIT: Credit Cards
ECGOVTLOAN: Government Loan
ECGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
ECBANKLOAN: Loan from Bank
ECFAMLOAN: Loan from family and friends
ECVENTURE: Venture Capitalist
ECPROFITS: Business Profits
ECGRANT: Grant
ECOTHER: Other
ECNOEXPAND: Did not expand
ECNOACCESS: No Access to Expansion Capital

In terms of usage frequency, the categories in which the United States and California differ by more than 1% are personal savings, home equity, and bank loans. Regardless, it's important to note that most of these differences (even those smaller than 1%) are statistically significant due to the large sample size. Businesses in the United States as a whole are more than twice as likely to use bank loans as a source of startup capital as businesses in California, while Californian business owners are more likely to use home equity and their own savings to start ventures. Overall, the three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans. For California, these categories are bank loans, other sources of capital (unspecified), and government loans.

Consistent with its low ranking, California as a whole seems to perform worse than the United States across the board according to the difference in estimates. Most interestingly, California performs comparatively worst in the venture capital category, despite being the state with the greatest amount of venture capital invested in business venture formation. Other comparatively poor categories are loans from family members and businesses that do not need startup capital. These results warrant further analysis of the spread, rather than the average, of certain categories, as it is possible for Californian businesses to have greater extremes than American businesses in general.

Summarized Table for Expansion Capital

                 California          U.S.                Estimates           True Estimates (added to Intercept)
Variable         Yes       No        Yes       No        CA        USA       CA        USA       Difference
ECSAVINGS        32.24%    67.76%    29.00%    71.00%    -33.74    -112.84   76.38     118.78    -42.39
ECASSETS         3.90%     96.10%    4.00%     96.00%    21.89     5.40      132.03    237.03    -105.00
ECEQUITY         5.84%     94.16%    4.33%     95.67%    135.96    90.04     246.09    321.67    -75.57
ECCREDIT         13.68%    86.32%    12.09%    87.91%    -9.45     -88.74    100.67    142.87    -42.19
ECGOVTLOAN       0.33%     99.67%    0.39%     99.61%    419.09    99.43     529.23    331.06    198.17
ECGOVTGUAR       0.26%     99.74%    0.29%     99.71%    63.47     167.61    173.61    399.24    -225.63
ECBANKLOAN       4.60%     95.40%    7.54%     92.46%    461.09    446.34    571.22    677.97    -106.74
ECFAMLOAN        1.06%     98.94%    0.93%     99.07%    69.17     113.62    179.31    345.25    -165.93
ECVENTURE        0.18%     99.82%    0.13%     99.87%    514.95    521.87    625.09    753.49    -128.40
ECPROFITS        8.95%     91.05%    9.23%     90.77%    124.95    162.90    235.09    394.53    -159.43
ECGRANT          0.20%     99.80%    0.19%     99.81%    -98.08    -131.52   12.05     100.10    -88.04
ECOTHER          0.83%     99.17%    0.76%     99.24%    127.44    132.62    237.57    364.24    -126.67
ECDONTKNOW       6.76%     93.24%    6.57%     93.43%    -9.43     -43.75    100.70    187.86    -87.16
ECNOACCESS       1.93%     98.07%    1.80%     98.20%    -68.32    -161.82   41.81     69.80     -27.98
ECNOEXPAND       45.80%    54.20%    48.29%    51.71%    -26.99    -108.06   83.14     123.56    -40.41
ECNOTREPORTED    7.38%     92.62%    7.61%     92.39%    0         0         110.13    231.63    -121.49
INTERCEPT        N/A       N/A       N/A       N/A       110.137   231.627

As with startup capital, several usage frequencies differ by more than 1% between the United States and California. Businesses in the United States as a whole are more likely to use bank loans for expansion capital, or not require it at all. Businesses in California are more likely to use their own savings, home equity, or credit card debt to fund expansion.

Overall, the three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans (identical to the top three for startup capital). For California, these categories are venture capital, bank loans, and government loans. Once again, businesses in the United States tend to benefit more from virtually all sources of expansion capital than businesses in California, with the exception of government loans. These interesting characteristics further warrant an analysis of the spread, rather than just the average, for certain categories.



Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?

SCVENTURE and ECVENTURE Analysis

Even though California performed worse on average according to its estimates, I initially hypothesized that California had an overall larger spread, with a maximum and upper quartile likely exceeding the maximum and upper quartile of receipts in the United States. According to the following five-number summary, California's quantiles for startup venture capital exceed those of the United States except for the maximum value, which indicates that Californian businesses generally perform better than businesses in the United States as a whole. The previous estimates from the regression are influenced by the heavier right skew of the United States data.

Venture Capital - Five Number Summaries

Quantile    California - SC    US - SC    California - EC    US - EC
Min         0                  0          0                  0
Q1          10.57              7.71       9.34               12.32
Median      97.28              52.54      91.73              92.12
Q3          774.90             454.80     608.45             725.24
Max         6600.00            6900.00    6900.00            6800.00
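The contrast between quantiles and regression estimates can be illustrated with a small Python sketch. The two groups below are hypothetical values chosen only to show the mechanism, not PUMS data: one group has higher quartiles throughout, yet the other has the higher mean because of a single extreme value in its right tail. The quartiles use the standard library's inclusive method, which may differ slightly from the SAS default.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical receipts for two groups of businesses (illustrative only).
group_a = [5, 20, 90, 300, 800, 1500]
group_b = [2, 10, 50, 200, 500, 6900]

summary_a = five_number_summary(group_a)
summary_b = five_number_summary(group_b)
mean_a = statistics.mean(group_a)
mean_b = statistics.mean(group_b)

print("Group A:", summary_a, "mean", mean_a)
print("Group B:", summary_b, "mean", mean_b)
```

In this toy example every quartile of group A exceeds that of group B, while group B's mean is higher, which is the same pattern that makes an average-based estimate misleading under heavy right skew.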

ECGOVTGUAR Analysis

Since ECGOVTGUAR was the only source of expansion capital for which California had a greater estimate than the United States, I also wanted to conduct a spread analysis. According to PROC SURVEYFREQ, there are 440 businesses in California that used a government guaranteed loan for expansion capital.

Expansion Capital from Gov't Guar. Loan

Quantile    California - SC    US - SC
Min         0.00               0.00
Q1          83.57              22.29
Median      301.42             174.09
Q3          993.67             621.90
Max         6900.00            6900.00

The spread confirms the general interpretation from the estimate: Californian businesses as a whole tend to benefit more from government guaranteed loans.



Conclusion

My analysis of the Public Use Microdata Sample of the 2007 Survey of Business Owners involved cleaning and manipulating raw data; selecting a generalized linear model of all variables against receipts through Akaike's Information Criterion; analyzing business transaction language, industry of business, and ethnicity of the owner in the context of state location; and investigating the differences in capital sources between businesses in California and the United States in general.

From the data cleaning, I was able to incorporate my knowledge of the background survey methodology to consolidate and remove variables and prepare the data for proper analysis in this context. From the model fitting, I was able to learn and apply different model selection techniques and selection criteria (AIC, BIC, SBC, Mallows' Cp, MSE, and adjusted R-square) to ultimately choose the best-fitting model, which attained a moderately strong adjusted coefficient of determination after several model selection iterations that involved a logarithmic transformation to reduce skewness and improve overall fit.

The language analysis revealed that businesses that conducted transactions in only English and Spanish had higher estimates than businesses that only used English. Investigating these businesses within the context of sector ultimately showed that businesses that only used English and Spanish were represented at a higher percentage in certain sectors (such as sector 23, Construction) that earned more on average than other sectors where Only English businesses had a higher percentage. Finally, the analysis of sources of capital revealed that California overwhelmingly performed worse than the United States as a whole based on averages and regression estimates, but closer analysis of the spread of receipts given startup venture capital in California shows that estimates can be misleading, and heavy right skew undermines conclusions based on the average alone.

Given more time and access to the actual dataset (without noise and other confidentiality-preserving measures), I would be able to develop a more powerful and accurate model, along with other analyses. Other important next steps would be creating a correlation matrix to observe collinearity between variables, examining the relationship between receipts and more demographic information such as age, gender, and education level, and conducting statistical analysis with different response variables, such as employment and payroll.

Lessons Learned

I've come to believe that doing independent research in the context of statistics is incredibly important. Through this study, I've been able to apply everything that I've learned in all of my statistics courses, from learning how to handle a very large data set in SAS to making the proper assumptions and conclusions from my analyses. I argue that this is the highest form of learning, as it is completely experiential, based on existing data, and set entirely in real-world scenarios. I've also been fortunate enough to study the fusion of my two academic fields, business and statistics. The flexibility of independent research has allowed me to learn about existing literature in the vast field of business statistics and entrepreneurship, as well as to explore other possible research ideas such as social network analysis and survival analysis.



Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment

The GLMSELECT Procedure

Selected Model

The selected model is the model at the last step (Step 81).

Effects: Intercept PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE

Analysis of Variance

Source             DF        Sum of Squares    Mean Square    F Value
Model              207       14974589          72341          7650.22
Error              784001    7413568           9.45607
Corrected Total    784208    22388157

Root MSE 3.07507

Dependent Mean 4.23116

R-Square 0.6689

Adj R-Sq 0.6688

AIC 2546267

AICC 2546268

SBC 1764464
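The fit statistics above can be reproduced from the Analysis of Variance table using the standard definitions. The following Python sketch (an arithmetic check, not SAS output) recomputes the mean squares, F value, root MSE, R-square, and adjusted R-square; the variable names are my own.

```python
import math

# Values copied from the Analysis of Variance table.
ss_model, df_model = 14974589, 207
ss_error, df_error = 7413568, 784001
ss_total = ss_model + ss_error      # corrected total sum of squares
df_total = df_model + df_error      # corrected total degrees of freedom

ms_model = ss_model / df_model      # mean square for the model
ms_error = ss_error / df_error      # mean square error (MSE)
f_value = ms_model / ms_error
root_mse = math.sqrt(ms_error)

r_square = ss_model / ss_total
# Adjusted R-square penalizes the number of model degrees of freedom.
adj_r_square = 1 - (1 - r_square) * df_total / df_error

print(f"F = {f_value:.2f}, Root MSE = {root_mse:.5f}")
print(f"R-Square = {r_square:.4f}, Adj R-Sq = {adj_r_square:.4f}")
```

With nearly 800,000 observations and only 207 model degrees of freedom, the adjustment barely moves the R-square, which is why the two values differ only in the fourth decimal place.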



Parameter Estimates

Parameter DF Estimate Standard Error t Value

Intercept 1 1.889667 0.094597 19.98

PAYROLL_NOISY 1 0.001098 0.000006900 159.06

EMPLOYMENT_NOISY 1 0.013737 0.000194 70.72

PCT1 1 0.000336 0.000107 3.13

FIPST 01 1 0.069510 0.012035 5.78

FIPST 04 1 0.098799 0.010865 9.09

FIPST 05 1 0.007242 0.013682 0.53

FIPST 06 1 0.190992 0.008333 22.92

FIPST 08 1 0.042436 0.010223 4.15

FIPST 09 1 0.176561 0.011957 14.77

FIPST 12 1 0.042020 0.008756 4.80

FIPST 13 1 0.066957 0.009863 6.79

FIPST 15 1 0.128502 0.017731 7.25

FIPST 16 1 -0.020266 0.015023 -1.35

FIPST 17 1 0.053238 0.009259 5.75

FIPST 18 1 0.012449 0.010725 1.16

FIPST 19 1 -0.066745 0.012767 -5.23

FIPST 20 1 -0.015773 0.013107 -1.20

FIPST 21 1 -0.007839 0.012161 -0.64

FIPST 22 1 0.102305 0.012320 8.30

FIPST 23 1 0.000535 0.015441 0.03

FIPST 24 1 0.120146 0.010882 11.04

FIPST 25 1 0.144996 0.010390 13.96

FIPST 26 1 0.006431 0.009752 0.66

FIPST 27 1 0.006867 0.010502 0.65

FIPST 28 1 0.052754 0.014819 3.56

FIPST 29 1 -0.014865 0.010791 -1.38



FIPST 30 1 -0.065514 0.016294 -4.02

FIPST 31 1 -0.083451 0.015061 -5.54

FIPST 32 1 0.171132 0.014256 12.00

FIPST 33 1 0.122197 0.015510 7.88

FIPST 34 1 0.184363 0.009862 18.69

FIPST 35 1 0.046100 0.015776 2.92

FIPST 36 1 0.132543 0.008847 14.98

FIPST 37 1 0.057997 0.009811 5.91

FIPST 39 1 0.021252 0.009527 2.23

FIPST 40 1 -0.000955 0.012247 -0.08

FIPST 41 1 0.055542 0.011366 4.89

FIPST 42 1 0.069120 0.009322 7.41

FIPST 45 1 0.045966 0.011929 3.85

FIPST 47 1 0.070336 0.010845 6.49

FIPST 48 1 0.107862 0.008739 12.34

FIPST 49 1 0.083133 0.012981 6.40

FIPST 51 1 0.079476 0.010169 7.82

FIPST 53 1 0.103092 0.010198 10.11

FIPST 54 1 -0.057228 0.017523 -3.27

FIPST 55 0 0 . .

SECTOR 11 1 1.034824 0.086277 11.99

SECTOR 21 1 1.021291 0.086922 11.75

SECTOR 22 1 0.990292 0.096556 10.26

SECTOR 23 1 1.275192 0.085487 14.92

SECTOR 31 1 1.101583 0.085674 12.86

SECTOR 42 1 1.555542 0.085640 18.16

SECTOR 44 1 1.279568 0.085497 14.97

SECTOR 48 1 1.200337 0.085620 14.02



SECTOR 51 1 0.863523 0.085909 10.05

SECTOR 52 1 0.940771 0.085589 10.99

SECTOR 53 1 0.995807 0.085505 11.65

SECTOR 54 1 0.908610 0.085467 10.63

SECTOR 55 1 -0.452943 0.100108 -4.52

SECTOR 56 1 0.809018 0.085534 9.46

SECTOR 61 1 0.734224 0.085848 8.55

SECTOR 62 1 0.966935 0.085523 11.31

SECTOR 71 1 0.742947 0.085621 8.68

SECTOR 72 1 1.069889 0.085657 12.49

SECTOR 81 1 0.856258 0.085507 10.01

SECTOR 99 0 0 . .

N07_EMPLOYER E 1 1.137851 0.003439 330.84

N07_EMPLOYER N 0 0 . .

SEX1 F 1 -0.152693 0.002767 -55.18

SEX1 M 0 0 . .

VET1 1 1 0.994431 0.478293 2.08

VET1 2 0 0 . .

FOUNDED1 1 1 -0.079178 0.028948 -2.74

FOUNDED1 2 0 0 . .

PURCHASED1 1 1 -0.087677 0.028811 -3.04

PURCHASED1 2 0 0 . .

INHERITED1 1 1 -0.091757 0.028593 -3.21

INHERITED1 2 0 0 . .

RECEIVED1 1 1 -0.060479 0.028701 -2.11

RECEIVED1 2 0 0 . .

ACQYR1 1 1 0.123718 0.010722 11.54

ACQYR1 2 1 0.114924 0.010468 10.98



ACQYR1 3 1 0.122968 0.009717 12.66

ACQYR1 4 1 0.112593 0.009256 12.16

ACQYR1 5 1 0.084345 0.011292 7.47

ACQYR1 6 1 0.055247 0.011273 4.90

ACQYR1 7 1 -0.069739 0.011162 -6.25

ACQYR1 8 0 0 . .

PROVIDE1 1 1 -0.225547 0.003055 -73.82

PROVIDE1 2 0 0 . .

MANAGE1 1 1 -0.065064 0.002819 -23.08

MANAGE1 2 0 0 . .

FINANCIAL1 1 1 0.104690 0.002771 37.78

FINANCIAL1 2 0 0 . .

FNCTNABV1 1 1 -0.117711 0.006090 -19.33

FNCTNABV1 2 0 0 . .

HOURS1 1 1 -0.135239 0.008345 -16.21

HOURS1 2 1 -0.336065 0.004656 -72.19

HOURS1 3 1 -0.181169 0.004282 -42.31

HOURS1 4 1 -0.150164 0.003995 -37.59

HOURS1 5 1 -0.082393 0.003423 -24.07

HOURS1 6 0 0 . .

PRMINC1 1 1 0.242443 0.002948 82.23

PRMINC1 2 0 0 . .

SELFEMP1 1 1 0.047307 0.002353 20.10

SELFEMP1 2 0 0 . .

EDUC1 1 1 -0.166470 0.005955 -27.95

EDUC1 2 1 -0.108420 0.003966 -27.34

EDUC1 3 1 -0.181064 0.005205 -34.78

EDUC1 4 1 -0.124506 0.003872 -32.15



EDUC1 5 1 -0.146356 0.005298 -27.62

EDUC1 6 1 -0.073524 0.003406 -21.58

EDUC1 7 0 0 . .

AGE1 1 1 -0.184340 0.009761 -18.88

AGE1 2 1 0.007893 0.005418 1.46

AGE1 3 1 0.071539 0.004581 15.62

AGE1 4 1 0.071199 0.004198 16.96

AGE1 5 1 0.036784 0.003984 9.23

AGE1 6 0 0 . .

BORNUS1 1 1 -0.037608 0.004052 -9.28

BORNUS1 2 0 0 . .

DISVET1 1 1 -1.093928 0.478399 -2.29

DISVET1 2 1 -1.034112 0.478302 -2.16

DISVET1 3 0 0 . .

ESTABLISHED 1 1 0.133473 0.009513 14.03

ESTABLISHED 2 1 0.135956 0.009681 14.04

ESTABLISHED 3 1 0.132354 0.008972 14.75

ESTABLISHED 4 1 0.111280 0.008763 12.70

ESTABLISHED 5 1 0.095458 0.009609 9.93

ESTABLISHED 6 1 0.094962 0.009231 10.29

ESTABLISHED 7 1 0.071583 0.010832 6.61

ESTABLISHED 8 1 0.051642 0.010923 4.73

ESTABLISHED 9 1 -0.016931 0.010863 -1.56

ESTABLISHED A 0 0 . .

SCSAVINGS 1 1 -0.017371 0.003576 -4.86

SCSAVINGS 2 0 0 . .

SCASSETS 1 1 -0.056578 0.004313 -13.12

SCASSETS 2 0 0 . .



SCEQUITY 1 1 -0.039681 0.005002 -7.93

SCEQUITY 2 0 0 . .

SCCREDIT 1 1 -0.051551 0.003860 -13.36

SCCREDIT 2 0 0 . .

SCGOVTLOAN 1 1 -0.028818 0.013311 -2.16

SCGOVTLOAN 2 0 0 . .

SCGOVTGUAR 1 1 0.022142 0.012371 1.79

SCGOVTGUAR 2 0 0 . .

SCBANKLOAN 1 1 0.055183 0.004020 13.73

SCBANKLOAN 2 0 0 . .

SCFAMLOAN 1 1 -0.018966 0.006370 -2.98

SCFAMLOAN 2 0 0 . .

SCVENTURE 1 1 -0.096289 0.022065 -4.36

SCVENTURE 2 0 0 . .

SCGRANT 1 1 -0.145460 0.028804 -5.05

SCGRANT 2 0 0 . .

SCOTHER 1 1 -0.035163 0.008224 -4.28

SCOTHER 2 0 0 . .

SCDONTKNOW 1 1 -0.052654 0.008604 -6.12

SCDONTKNOW 2 0 0 . .

SCAMOUNT 1 1 -0.002310 0.004671 -0.49

SCAMOUNT 2 1 0.062655 0.005589 11.21

SCAMOUNT 3 1 0.087967 0.005551 15.85

SCAMOUNT 4 1 0.131828 0.006121 21.54

SCAMOUNT 5 1 0.182133 0.006361 28.63

SCAMOUNT 6 1 0.247689 0.006671 37.13

SCAMOUNT 7 1 0.385266 0.007672 50.22

SCAMOUNT 8 1 0.587252 0.011676 50.29



SCAMOUNT 9 1 0.186741 0.006260 29.83

SCAMOUNT A 0 0 . .

HOMEBASED 1 1 -0.193209 0.002631 -73.42

HOMEBASED 2 0 0 . .

FRANCHISE 1 1 0.095547 0.008042 11.88

FRANCHISE 2 0 0 . .

FRANCHISER50 1 1 -0.022814 0.012799 -1.78

FRANCHISER50 2 0 0 . .

ECSAVINGS 1 1 -0.071333 0.003629 -19.66

ECSAVINGS 2 0 0 . .

ECASSETS 1 1 -0.050133 0.005692 -8.81

ECASSETS 2 0 0 . .

ECEQUITY 1 1 0.032494 0.005389 6.03

ECEQUITY 2 0 0 . .

ECCREDIT 1 1 -0.024643 0.003773 -6.53

ECCREDIT 2 0 0 . .

ECGOVTLOAN 1 1 0.038285 0.015650 2.45

ECGOVTLOAN 2 0 0 . .

ECBANKLOAN 1 1 0.103370 0.004219 24.50

ECBANKLOAN 2 0 0 . .

ECVENTURE 1 1 -0.246374 0.031677 -7.78

ECVENTURE 2 0 0 . .

ECPROFITS 1 1 0.031553 0.003818 8.27

ECPROFITS 2 0 0 . .

ECGRANT 1 1 -0.158893 0.028705 -5.54

ECGRANT 2 0 0 . .

ECOTHER 1 1 -0.032423 0.012595 -2.57

ECOTHER 2 0 0 . .



ECDONTKNOW 1 1 -0.088126 0.007204 -12.23

ECDONTKNOW 2 0 0 . .

ECNOACCESS 1 1 -0.095309 0.010197 -9.35

ECNOACCESS 2 0 0 . .

ECNOEXPAND 1 1 -0.043675 0.003966 -11.01

ECNOEXPAND 2 0 0 . .

FEDERAL 1 1 0.041492 0.007992 5.19

FEDERAL 2 0 0 . .

OTHERBUS 1 1 0.045852 0.003214 14.27

OTHERBUS 2 0 0 . .

INDIVIDUALS 1 1 -0.149726 0.003468 -43.17

INDIVIDUALS 2 0 0 . .

EXPORTS 1 1 0.073348 0.007377 9.94

EXPORTS 2 1 0.138000 0.010933 12.62

EXPORTS 3 1 0.177615 0.013409 13.25

EXPORTS 4 1 0.214486 0.016534 12.97

EXPORTS 5 1 0.166936 0.016124 10.35

EXPORTS 6 1 0.178288 0.016120 11.06

EXPORTS 7 1 0.293766 0.017115 17.16

EXPORTS 8 1 0.320932 0.021770 14.74

EXPORTS 9 0 0 . .

FULLTIME 1 1 0.179631 0.003769 47.67

FULLTIME 2 0 0 . .

PARTTIME 1 1 -0.020924 0.003068 -6.82

PARTTIME 2 0 0 . .

LEASED 1 1 0.334245 0.011579 28.87

LEASED 2 0 0 . .

CONTRACTORS 1 1 0.246384 0.002523 97.67



CONTRACTORS 2 0 0 . .

HEALTHINS 1 1 0.078051 0.004257 18.33

HEALTHINS 2 0 0 . .

RETIREMENT 1 1 0.170154 0.004144 41.06

RETIREMENT 2 0 0 . .

PROFITSHARE 1 1 -0.020966 0.006971 -3.01

PROFITSHARE 2 0 0 . .

HOLIDAYS 1 1 0.120158 0.004618 26.02

HOLIDAYS 2 0 0 . .

BENENABV 1 1 -0.074976 0.005111 -14.67

BENENABV 2 0 0 . .

WEBSITE 1 1 0.019704 0.002904 6.78

WEBSITE 2 0 0 . .

ECOMMPCT 1 1 -0.033776 0.009894 -3.41

ECOMMPCT 2 1 -0.076199 0.010547 -7.22

ECOMMPCT 3 1 -0.065350 0.013496 -4.84

ECOMMPCT 4 1 -0.107320 0.012424 -8.64

ECOMMPCT 5 1 -0.053331 0.012200 -4.37

ECOMMPCT 6 1 -0.060335 0.010540 -5.72

ECOMMPCT 7 1 -0.093930 0.013852 -6.78

ECOMMPCT 8 1 -0.158731 0.015935 -9.96

ECOMMPCT 9 0 0 . .

ONLINEPURCH 1 1 0.003940 0.002547 1.55

ONLINEPURCH 2 0 0 . .

LT40HOURS 1 1 -0.060922 0.004991 -12.21

LT40HOURS 2 0 0 . .

LT12MONTHS 1 1 -0.050281 0.004365 -11.52

LT12MONTHS 2 0 0 . .



SEASONAL 1 1 -0.038413 0.005919 -6.49

SEASONAL 2 0 0 . .

OCCASIONALLY 1 1 -0.102215 0.005775 -17.70

OCCASIONALLY 2 0 0 . .

ACTIVITYNABV 1 1 0.189169 0.005265 35.93

ACTIVITYNABV 2 0 0 . .

OPERATING 1 1 0.247984 0.003370 73.58

OPERATING 2 0 0 . .

CEASENR 1 1 0.062078 0.022437 2.77

CEASENR 2 0 0 . .

HUSBWIFE 1 1 -0.064987 0.004525 -14.36

HUSBWIFE 2 1 -0.068549 0.004097 -16.73

HUSBWIFE 3 1 -0.126576 0.006033 -20.98

HUSBWIFE 4 0 0 . .

NUMOWNERS 1 1 0.300708 0.014590 20.61

NUMOWNERS 2 1 0.419767 0.015134 27.74

NUMOWNERS 3 1 0.486225 0.016239 29.94

NUMOWNERS 4 1 0.489487 0.017344 28.22

NUMOWNERS 5 1 0.505033 0.017709 28.52

NUMOWNERS 6 1 0.444110 0.023362 19.01

NUMOWNERS 7 1 0.051559 0.046352 1.11

NUMOWNERS 8 0 0 . .

race1noblanks A 1 -0.020036 0.005832 -3.44

race1noblanks B 1 -0.185359 0.006773 -27.37

race1noblanks H 1 -0.067546 0.005681 -11.89

race1noblanks I 1 -0.072423 0.014317 -5.06

race1noblanks Mixed 1 0.004210 0.029596 0.14

race1noblanks P 1 -0.131819 0.039328 -3.35



race1noblanks S 1 0.002974 0.026112 0.11

race1noblanks W 0 0 . .

LANGUAGE Only English 1 0.216239 0.017187 12.58

LANGUAGE Only English and Other 1 0.143442 0.017956 7.99

LANGUAGE Only English and Spanish 1 0.219765 0.017202 12.78

LANGUAGE Only Other Language 1 0.007221 0.023920 0.30

LANGUAGE Only Spanish 0 0 . .
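
Because the dependent variable in this model is log(Receipts), each class-level estimate can be read as a multiplicative effect on receipts relative to that variable's reference level (the level whose estimate is 0), holding the other effects fixed. The Python sketch below converts a few estimates from the table into approximate percentage differences using exp(estimate) - 1. It is an illustration rather than part of the original analysis: the helper name percent_change is hypothetical, and it assumes the transform was a natural logarithm with no constant added to receipts beforehand.

```python
import math

# Estimates copied from the Parameter Estimates table above.
estimates = {
    "N07_EMPLOYER E": 1.137851,   # reference level: N07_EMPLOYER N
    "SEX1 F": -0.152693,          # reference level: SEX1 M
    "HOMEBASED 1": -0.193209,     # reference level: HOMEBASED 2
}

def percent_change(estimate):
    """Approximate percent difference in receipts implied by a
    coefficient on the natural-log scale (hypothetical helper)."""
    return (math.exp(estimate) - 1) * 100

pct = {name: percent_change(b) for name, b in estimates.items()}

for name, value in pct.items():
    print(f"{name}: {value:+.2f}% relative to the reference level")
```

For small coefficients the percentage is close to 100 times the estimate, but for large ones such as N07_EMPLOYER the exponential form matters: an estimate of about 1.14 corresponds to receipts roughly three times those of the reference level, not 114 percent higher.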



Appendix B: Regression of All Available Variables

The SAS System

The SURVEYREG Procedure

Regression Analysis for Dependent Variable RECEIPTS_NOISY

Data Summary

Number of Observations 874182

Sum of Weights 10068531

Weighted Mean of RECEIPTS_NOISY 251.05405

Weighted Sum of RECEIPTS_NOISY 2527745467

Fit Statistics

R-square 0.5771

Root MSE 446.73

Denominator DF 874181
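
The entries in the Data Summary are tied together by simple arithmetic: the weighted mean is the weighted sum divided by the sum of weights, and in this output the denominator degrees of freedom equal the number of observations minus one. The Python sketch below checks both relationships using the values printed above. It is only a sanity check with hypothetical variable names, and the degrees-of-freedom rule is stated as it appears here, not as a general property of every survey design.

```python
# Values copied from the Data Summary and Fit Statistics above.
n_obs = 874182
sum_weights = 10068531
weighted_sum = 2527745467

# Weighted mean of RECEIPTS_NOISY: weighted sum over sum of weights.
weighted_mean = weighted_sum / sum_weights

# In this output the denominator DF is one less than the observation count.
denominator_df = n_obs - 1

print(f"Weighted mean:  {weighted_mean:.5f}")
print(f"Denominator DF: {denominator_df}")
```

The sum of weights (about 10.07 million) is much larger than the number of observations because each sampled record is weighted by its tabulation weight.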

Class Level Information

Class Variable Levels Values

FIPST 43 01 04 05 06 08 09 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 39 40 41 42 45 47 48 49 51 53 54 55

SECTOR 20 11 21 22 23 31 42 44 48 51 52 53 54 55 56 61 62 71 72 81 99

N07_EMPLOYER 2 E N

SEX1 2 F M

VET1 2 1 2

FOUNDED1 2 1 2

PURCHASED1 2 1 2

INHERITED1 2 1 2

RECEIVED1 2 1 2

ACQUIRENR1 1 2

ACQYR1 8 1 2 3 4 5 6 7 8

PROVIDE1 2 1 2



MANAGE1 2 1 2

FINANCIAL1 2 1 2

FNCTNABV1 2 1 2

FNCTNR1 1 2

HOURS1 6 1 2 3 4 5 6

PRMINC1 2 1 2

SELFEMP1 2 1 2

EDUC1 7 1 2 3 4 5 6 7

AGE1 6 1 2 3 4 5 6

BORNUS1 2 1 2

DISVET1 3 1 2 3

ESTABLISHED 10 1 2 3 4 5 6 7 8 9 A

SCSAVINGS 2 1 2

SCASSETS 2 1 2

SCEQUITY 2 1 2

SCCREDIT 2 1 2

SCGOVTLOAN 2 1 2

SCGOVTGUAR 2 1 2

SCBANKLOAN 2 1 2

SCFAMLOAN 2 1 2

SCVENTURE 2 1 2

SCGRANT 2 1 2

SCOTHER 2 1 2

SCDONTKNOW 2 1 2

SCNONENEEDED 2 1 2

SCNOTREPORTED 1 2

SCAMOUNT 10 1 2 3 4 5 6 7 8 9 A

HOMEBASED 2 1 2



FRANCHISE 2 1 2

FRANCHISER50 2 1 2

ECSAVINGS 2 1 2

ECASSETS 2 1 2

ECEQUITY 2 1 2

ECCREDIT 2 1 2

ECGOVTLOAN 2 1 2

ECGOVTGUAR 2 1 2

ECBANKLOAN 2 1 2

ECFAMLOAN 2 1 2

ECVENTURE 2 1 2

ECPROFITS 2 1 2

ECGRANT 2 1 2

ECOTHER 2 1 2

ECDONTKNOW 2 1 2

ECNOACCESS 2 1 2

ECNOEXPAND 2 1 2

ECNOTREPORTED 1 2

FEDERAL 2 1 2

STATELOCAL 2 1 2

OTHERBUS 2 1 2

INDIVIDUALS 2 1 2

CUSTNR 1 2

EXPORTS 9 1 2 3 4 5 6 7 8 9

OPSOUTSIDE 2 1 2

OUTSOURCE 2 1 2

FULLTIME 2 1 2

PARTTIME 2 1 2



DAYLABOR 2 1 2

TEMPSTAFF 2 1 2

LEASED 2 1 2

CONTRACTORS 2 1 2

EMPNR 1 2

HEALTHINS 2 1 2

RETIREMENT 2 1 2

PROFITSHARE 2 1 2

HOLIDAYS 2 1 2

BENENABV 2 1 2

BENENR 1 2

WEBSITE 2 1 2

ECOMMERCE 2 1 2

ECOMMPCT 9 1 2 3 4 5 6 7 8 9

ONLINEPURCH 2 1 2

LT40HOURS 2 1 2

LT12MONTHS 2 1 2

SEASONAL 2 1 2

OCCASIONALLY 2 1 2

ACTIVITYNABV 2 1 2

ACTIVITYNR 1 2

OPERATING 2 1 2

CEASENR 2 1 2

CEASENA 2 1 2

HUSBWIFE 4 1 2 3 4

FAMILYBUS 2 1 2

NUMOWNERS 8 1 2 3 4 5 6 7 8

race1noblanks 8 A B H I Mixed P S W



LANGUAGE 5 Only English, Only English and Other, Only English and Spanish, Only Other Language, Only Spanish

region 9 East Sout, Mid-Atlan, Midwest, Mountain, Northeast, Pacific, South Atl, West Nort, West Sout

Tests of Model Effects

Effect Num DF F Value Pr > F

Model 215 1538.56



PRMINC1 1 592.68



ECCREDIT 1 477.03



HEALTHINS 1 1027.62



Appendix C: Full Tables of Capital Regression

Regressing Startup Capital Variables Against Receipts in California

Parameter Estimate Standard Error t Value Pr > |t|

Intercept 102.007317 21.726311 4.70



SCFAMLOAN 1 180.68722 20.0165018 9.03



Intercept 231.62717 10.454859 22.15
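
Each t Value in these tables is the estimate divided by its standard error, which gives a quick way to check the rows that survive in this appendix. The Python sketch below recomputes the ratio for the three rows shown, taking the final intercept estimate to be 231.62717. The row labels and variable names are hypothetical, and the sketch is a consistency check only, not part of the original output.

```python
# (label, estimate, standard error, reported t value) for the rows above.
rows = [
    ("Intercept (first table)", 102.007317, 21.726311, 4.70),
    ("SCFAMLOAN 1", 180.68722, 20.0165018, 9.03),
    ("Intercept (last table)", 231.62717, 10.454859, 22.15),
]

# t value = estimate / standard error, compared with the reported figure.
t_values = {label: est / se for label, est, se, _ in rows}

for label, est, se, reported in rows:
    print(f"{label}: computed t = {t_values[label]:.2f}, reported t = {reported:.2f}")
```

All three computed ratios match the reported t values after rounding to two decimals, which supports reading the last estimate with a decimal point after 231.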
