2014 Independent Study on the Census Bureau's Survey of Business Owners Public Use Microdata Sample
8/12/2019 2014 Independent Study on the Census Bureau's Survey of Business Owners Public Use Microdata Sample
Analysis of the 2007 Survey of Business Owners
Public Use Microdata Sample
Arthur Wu
Supervisor: Professor Amber Tomas
STAT 4993 Independent Study
5/9/2013
Table of Contents
Introduction
Overview of the Survey of Business Owners Public Microdata Sample
    Primary Methodology
    Missing Data
    Nonresponse
    Inherent Differences between SBO and PUMS Information
Data Manipulation and Cleaning
    Subsetting to Small Businesses and One Owner Data
    Tabulation Weights
Fitting a Regression
    Regression One: All Variables Against Receipts
    Model Selection Techniques
    Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
    Regression Three: PROC GLMSELECT with Akaike's Information Criterion
    Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
    Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
Analysis of Language
    Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
    Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?
    Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
Analysis of Capital Sources in California versus the United States
    Dataset Overview
    Research Question 2a: What sources of capital have the most positive relationship with receipts?
    Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
Conclusion
Lessons Learned
Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
Appendix B: Regression of All Available Variables
Appendix C: Full Tables of Capital Regression
Introduction
In selecting my topic for independent study, I wanted to combine the skills I had learned for statistical computing software with my main passion and primary fields of study: business and entrepreneurship. As a result, I originally intended to build a predictive model for new or small business venture success. According to the U.S. Small Business Administration (SBA), small businesses represent 99.7 percent of all employer firms. Since 1995, small businesses have generated 64 percent of new jobs, and paid 44 percent of the total United States private payroll, according to the SBA. However, I quickly realized that there was a dearth of accurate, thorough, and easily accessible data that had a large enough sample size to satisfy the normality assumption across all sectors and states. Moreover, attempting to answer this question would further require significant longitudinal data on the individual business level. Ultimately, I relied on the U.S. Census Bureau's Survey of Business Owners (SBO) Public Use Microdata Sample (PUMS), in which I examine entrepreneurial activity and the relationships between business characteristics such as access to capital, firm size, employer-paid benefits, minority ownership, and firm age. In this report, I detail how I cleaned the 2007 SBO PUMS data, developed a regression model, and carried out more in-depth analyses of the relationships between specific variables.
Overview of the Survey of Business Owners Public Microdata Sample
Primary Methodology
The 2007 Survey of Business Owners (SBO) questionnaire, Form SBO-1, was mailed to a random sample of 2.3 million businesses selected from a list of 27 million firms operating during 2007 with receipts of $1,000 or more. The list of all firms (the sampling universe) was derived from both official business tax returns and data collected on other economic census reports. The Census Bureau obtained electronic files from the Internal Revenue Service (IRS) for all companies reporting any business activity on 2007 IRS Tax Forms such as Form 1040 and 1065.
With regard to the background of the SBO, this survey is part of the Economic Census program, which the Census Bureau is required by law to conduct every 5 years for years ending in "2" and "7." The Census Bureau combines and crosschecks data from the SBO with data from other economic surveys, economic censuses, and administrative records. The published data include number of firms (both firms with paid employees and firms with no paid employees), sales and receipts, number of paid employees, and annual payroll; they are presented by kind of business, geographic area, and size of firm (employment and receipts). These results will also contain summary statistics on the composition of businesses in the United States by gender, ethnicity, race, and veteran status. Additional demographic and economic characteristics of business owners and their businesses are included, such as: owner's age, education level, hours worked, and primary function in the business; family- and home-based businesses; types of customers and workers; sources of financing for start-up, expansion, or capital improvements; outsourcing; use of Internet and e-commerce; and employer-paid benefits.
The IRS provided certain identification, classification, and measurement data for businesses filing those forms. For most firms with paid employees, the Census Bureau also collected employment, payroll, receipts, and kind of business for each plant, store, or physical location during the 2007 Economic Census.
For the 2007 SBO, firms could either report electronically by using Census Taker, the Census Bureau's secure online interactive application, or return their completed form by mail. Three report form re-mails to employer firms and two report form re-mails to nonemployer firms were conducted at one-month intervals to all delinquent respondents. The returned forms underwent extensive review and computer processing. All reports were geographically coded, data-keyed, and edited.
This wealth of data provides a resource to many parties, from government officials to industry organization leaders. For example, this data allows agencies such as the Small Business Administration to identify and address the needs of small businesses in the United States. In the private sector, consultants and researchers can analyze long-term economic and demographic shifts, and differences in ownership and performance among geographic areas.
Survey Overview:
Form SBO-1, given to every sampled business, primarily asked basic information about the business in general while focusing on the demographics and level of ownership for each listed owner. There are only 9 numeric variables (tabulation weight, total revenues with injected noise, payroll with injected noise, employment injected with noise, and general ownership percentages for up to four owners) while the other hundreds are character variables (age, education, startup capital type, race, ethnicity, etc.). These character variables usually adopt the format of a binary yes/no answer to most questions, except for variables with multiple levels (education, age, and race).
Missing Data
For the numeric variables of the cleaned data set (receipts, payroll, employment, and percent of ownership), there were no missing values. However, for most character variables, the percentage of missing data ranged from 20-40%. There were also several variables in which over 90% of the observations had missing values, such as whether or not the owner is retired, the owner is deceased, or the business had low or inadequate sales.

General Owner and Business Characteristics
    Sector
    Employer status
    Random group (for variance estimation)
    Tabulation weight
    Measures of size (noise-infused for disclosure avoidance): Employment, Payroll, Receipts
    Individual owner information (for up to four owners): Percentage ownership, Gender, Ethnicity, Race, Veteran status

Additional Owner Characteristics
    How the owner initially acquired the business
    When the owner acquired the business
    Owner's primary function in the business
    Owner's average number of hours per week spent working in the business
    Whether the business provided the owner's primary source of personal income
    Whether the owner previously owned a business or had been self-employed
    Owner's educational background
    Owner's age
    Whether the owner was born in the United States
    If the owner was a veteran, whether the owner was disabled as the result of injury incurred during active military service

Additional Business Characteristics
    Year business was established
    Source(s) of start-up or acquisition capital
    Amount of start-up or acquisition capital
    Home-based business
    Operated as a franchise
    Owned by a franchise
    Source(s) of capital used to expand business
    Types of customers
    Percent of total sales exported
    Operations established outside the United States
    Outsourced any business function outside the United States
    Language(s) used in transactions
    Types of workers employed
    Employer-paid benefits offered
    Whether the company had a website
    Whether the company had e-commerce sales
    E-commerce as a percentage of total sales
    Whether the company made online purchases
    Business activity (e.g., seasonal or part-time)
    Whether the business currently operates
    Reasons for ceasing operations
    Joint ownership by husband and wife
    Family-owned business
    Number of owners
Nonresponse
Approximately 62 percent of the 2.3 million businesses in the SBO sample responded to the survey, compared to 75 percent for the 2002 survey. For the 2007 survey, 72 percent of the companies in the SBO sample returned a questionnaire, but 10 percent of the returns did not contain enough information to be considered a response for the estimates by race, gender, ethnicity or veteran status. Many of these respondents were sole proprietors that answered "No" to Item 8, "In 2007, did any individual own 10% or more of the rights, claims, interests, or stock in this business?" Another identified issue was the duality between ethnicity (Hispanic vs non-Hispanic) and race (White, Black, Asian, American Indian). Every Hispanic business owner also had to identify at least one race, which may lead to an indication of mixed race when an owner is solely Hispanic in heritage. This required the variable manipulation described later to correct for it.
According to the U.S. Census, about 4 percent of the 2007 nonrespondents were selected for and responded to the 2002 SBO. For these firms, data from the 2002 survey were used in place of the missing 2007 responses. For the remaining nonrespondents, gender, ethnicity, race and veteran status were imputed from donor respondents in the same sampling frame with similar characteristics (state, industry, employment status, size). Because the assignment of businesses to sampling frames relies heavily on administrative data, and there is a high level of agreement between sampling frame assignment and tabulated race or ethnicity for responding firms, the donor imputations are considered to be reliable. Estimates of sampling variability are adjusted to account for nonresponse. Estimates with high error (relative standard error for sales or receipts of 50 percent or more) are suppressed. Overall, imputed data accounted for approximately 47 percent of the firm count estimates by gender, ethnicity, race, and veteran status and approximately 20 percent of the estimates of sales.
Inherent Differences between SBO and PUMS Information
The Public Use Microdata Sample (PUMS) is a large dataset available to the public derived from the original SBO dataset of responses. According to the U.S. Census Bureau, measures were taken in constructing the PUMS file to protect the confidentiality of the SBO data in order for it to be used freely among the public. In the PUMS file, each record corresponds to a business, but deliberate measures were taken to ensure the anonymity of each business. For businesses operating in multiple states and/or industry sectors, one record exists for each state and sector combination in which the firm conducts business. Identifiers to link the component records of a business are not included. Additionally, businesses classified in the SBO as publicly owned or not classifiable by gender, ethnicity, race, or veteran status are not included in the PUMS file because many publicly owned firms are easily identifiable. Since the primary focus of my research is on small businesses, exclusion of possibly larger public corporations does not significantly impact the integrity of the data.
Finally, the U.S. Census Bureau infused noise into the PUMS data for disclosure avoidance and confidentiality protection. Values are perturbed prior to tabulation by applying a random noise multiplier to the magnitude data, such as the sales and receipts for all firms. This introduced variation perturbed data points by no more than a few percentage points.
Data Manipulation and Cleaning
Subsetting to Small Businesses and One Owner Data
According to the Small Business Administration, small businesses are generally considered to have fewer than 500 employees and under $7 million in receipts, depending on the industry. Subsetting the original dataset to these limits reduced the total observation count from 2,165,680 to 2,025,530.
Furthermore, according to the 2007 SBO PUMS Guide, a response of 0 for most of the qualitative questions indicated that the data for these variables were missing. As a result, each 0 was converted to a missing value to more accurately reflect the nature of this data and to correctly set up regression procedures later on.
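The subsetting and recoding steps above can be sketched in a few lines of Python. This is only an illustration, not the SAS code used for the study: the four records are made up, the receipts threshold assumes that receipts are stored in thousands of dollars, and the qualitative codes are assumed to be stored as strings.

```python
# Hypothetical mini-extract; real PUMS records carry many more fields.
records = [
    {"EMPLOYMENT_NOISY": 12, "RECEIPTS_NOISY": 850, "HOMEBASED": "1"},
    {"EMPLOYMENT_NOISY": 640, "RECEIPTS_NOISY": 5200, "HOMEBASED": "2"},
    {"EMPLOYMENT_NOISY": 3, "RECEIPTS_NOISY": 9100, "HOMEBASED": "0"},
    {"EMPLOYMENT_NOISY": 0, "RECEIPTS_NOISY": 45, "HOMEBASED": "0"},
]

MAX_EMPLOYEES = 500   # "fewer than 500 employees"
MAX_RECEIPTS = 7000   # $7 million, ASSUMING receipts are recorded in $1,000s


def is_small_business(rec):
    """Apply both size limits to one record."""
    return (rec["EMPLOYMENT_NOISY"] < MAX_EMPLOYEES
            and rec["RECEIPTS_NOISY"] < MAX_RECEIPTS)


def recode_missing(rec, qualitative_fields):
    """Return a copy in which a "0" code on a qualitative field becomes None."""
    out = dict(rec)
    for field in qualitative_fields:
        if out[field] == "0":
            out[field] = None
    return out


small = [recode_missing(r, ["HOMEBASED"]) for r in records if is_small_business(r)]
```

Only the qualitative field is recoded; a numeric zero such as an employment count of 0 is a real value and is left untouched.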
Another major issue for data analysis was the inclusion of data points for up to three additional business owners. As a result, there exist three additional sets of demographic variables. However, if a business only has one owner, these three additional sets of variables would all have missing values. Given the fact that PROC REG and other regression commands exclude observations that have even one missing value, allowing the sixty extra variables for additional owners would lead to the exclusion of a significant number of observations. Moreover, I noticed that virtually every business had responses to all variables describing the first owner, and that the first owner almost always owned as much of the business as (if not more than) his or her other 1-3 business partners. This justified my decision to remove all variables affiliated with the second, third, and fourth owners.
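The effect of this complete-case rule can be illustrated with a small sketch. The three businesses below are invented, and EDUC2 is used here as a hypothetical name for a second-owner variable; the point is only that one missing second-owner value removes the whole observation.

```python
def complete_cases(rows, columns):
    """Keep only rows with no missing (None) value in any listed column."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


# Hypothetical records: two single-owner businesses and one with two owners.
businesses = [
    {"RECEIPTS_NOISY": 120, "EDUC1": "4", "EDUC2": None},
    {"RECEIPTS_NOISY": 300, "EDUC1": "6", "EDUC2": "5"},
    {"RECEIPTS_NOISY": 75, "EDUC1": "3", "EDUC2": None},
]

# Including the second-owner variable discards every single-owner business.
with_all_owners = complete_cases(businesses, ["RECEIPTS_NOISY", "EDUC1", "EDUC2"])

# Restricting the model to first-owner variables keeps all three.
first_owner_only = complete_cases(businesses, ["RECEIPTS_NOISY", "EDUC1"])
```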
One future recommendation would be to keep these deleted variables and find other options to analyze the overall dataset despite the necessary inclusion of extra missing values. Observing businesses with the intent to analyze the relationship between multiple business owners (when also factoring in age, experience, and education) could be incredibly valuable for the studies of organizational behavior and business.
Additional deleted variables included those that had over a 90% missing rate, since they severely diminished the number of total observations used in regression. For the purposes of my research questions, having more observations to interpret is preferable, but including these variables in another analysis of businesses that have generally ceased activity would also be worthwhile. These variables include:
CEASEOTHER: Ceased operations for another reason
SOLDBUS: Sold this business
STARTANOTHER: Started another business
NOPERSCRED: Lack of personal loans/credit
NOBUSCRED: Lack of business loans/credit
LOWSALES: Inadequate cash flow or low sales
ONETIME: Operated for one-time event
DECEASED: Owner died
RETIRE: Owner retired
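A minimal sketch of the 90% rule follows. The twenty rows are fabricated for illustration and the helper names are hypothetical; only the threshold comes from the text above.

```python
def missing_rate(rows, column):
    """Share of rows in which the column is missing (None)."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def drop_sparse_columns(rows, threshold=0.90):
    """Drop every column whose missing rate is above the threshold."""
    keep = [c for c in rows[0] if missing_rate(rows, c) <= threshold]
    return [{c: r[c] for c in keep} for r in rows], keep


# Hypothetical data: RETIRE is answered by 1 business in 20 (95% missing),
# HOMEBASED by half of them, SECTOR by all of them.
rows = [
    {"SECTOR": "44", "RETIRE": None, "HOMEBASED": "1" if i % 2 else None}
    for i in range(20)
]
rows[0]["RETIRE"] = "1"

cleaned, kept = drop_sparse_columns(rows)
```

Here RETIRE is removed while SECTOR and HOMEBASED survive, so later complete-case regressions lose far fewer observations.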
Lastly, I consolidated all language variables and the race and ethnicity variables in order to improve overall adjusted fit and reduce noise. After analyzing the spread of the many language variables in the dataset (English, Arabic, Chinese, French, German, Greek, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tagalog, and Vietnamese), I found that the most frequent languages spoken during business transactions are naturally English and Spanish. The other, minor languages collectively do not constitute more than 3% of the total number of observations, so I created a new language variable with five levels: Only English, Only Spanish, Only English and Spanish, Only English and Other, and Only Other Language, which effectively consolidated sixteen binary variables into one variable with five levels. Additionally, the race variable had at least twenty levels due to the combinations of different races, which encouraged consolidation. Both race and ethnicity were combined to create a new variable, which followed this algorithm:
1. If ethnicity is Hispanic, then Race/Ethnicity is Hispanic.
2. If ethnicity is not Hispanic and the owner is White AND another minority, then the owner is considered part of the other minority.
3. If ethnicity is not Hispanic and the owner is only White or only another minority, then the race stays as listed.
4. If ethnicity is not Hispanic and the owner is at least two types of race (both of which are not White), then the owner is considered Mixed.
As a result, the final levels for the new race/ethnicity variable are: W for White, B for Black, A for Asian, H for Hispanic, I for American Indian, P for NHOPI (Native Hawaiian or Other Pacific Islander), and Mixed (for any combination of two non-White and non-Hispanic races, such as Black/Asian).
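The four rules above can be written as a single function. This is a sketch of the stated logic rather than the original SAS code: the function name, the boolean Hispanic flag, and the representation of races as a set of single-letter codes are all assumptions made for illustration.

```python
def consolidate_race_ethnicity(is_hispanic, races):
    """Collapse ethnicity and reported races into one level.

    races: iterable of single-letter codes from W, B, A, I, P (assumed input format).
    """
    if is_hispanic:                    # rule 1: Hispanic overrides any race
        return "H"
    reported = set(races)
    if not reported:                   # nothing reported: leave as missing
        return None
    minorities = reported - {"W"}
    if len(minorities) >= 2:           # rule 4: two or more non-White races
        return "Mixed"
    if len(minorities) == 1:           # rules 2 and 3: a single minority race wins
        return next(iter(minorities))
    return "W"                         # rule 3: only White was reported
```

For example, `consolidate_race_ethnicity(False, {"W", "A"})` returns `"A"`, and `consolidate_race_ethnicity(False, {"B", "A"})` returns `"Mixed"`.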
Tabulation Weights
In most surveys, it will be the case that some groups are over-represented in the raw data and others under-represented. In order to address this, weights are assigned to each observation to compensate for the over/under-representation of data. While the exact method of determining these weights for the PUMS is unknown, the values of the tabulation weight range from 1.0 to 35.0. In this sense, a single observation with a weight of 35.0 would functionally be the same as thirty-five individual observations with the same parameters except for a weight of 1.0. Analysis of the data therefore requires the weights to be properly factored into averages, percentages, and regressions through the proper SAS procedures.
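The equivalence between one observation of weight 35.0 and thirty-five observations of weight 1.0 can be checked directly for a weighted mean. The receipts values below are invented for illustration.

```python
def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its tabulation weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Two hypothetical businesses: one with weight 35.0, one with weight 1.0.
receipts = [100.0, 400.0]
weights = [35.0, 1.0]
weighted = weighted_mean(receipts, weights)

# The same data with the first business written out 35 times at weight 1.0.
expanded = [100.0] * 35 + [400.0]
unweighted_expanded = sum(expanded) / len(expanded)
```

Both calculations give the same mean, whereas the naive unweighted mean of the two raw records (250.0) would overstate the influence of the second business.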
Fitting a Regression
Intuitively, I first wanted to fit a regression for all variables against receipts to observe the overall coefficient of determination and the comparative significance of each individual variable in determining receipts. This would immediately confirm or deny several of my possible research questions about the variables and would then be a starting point for other further research ideas. I gave further consideration to other possible response variables besides receipts, such as employment, payroll, and certain categorical variables such as whether or not the business was still operating. Ultimately, I decided to primarily focus on receipts as the response variable, since in general business theory an increase in either payroll or employment usually comes after an increase in overall revenues.
In order to begin, I had to identify the proper statistical procedure to regress several quantitative and multi-level categorical variables against receipts. I also wanted to use a procedure that would automatically create dummy variables and evaluate correlations while controlling for other variables. Ultimately, I selected the generalized linear model procedure because it satisfied the aforementioned criteria. Its limitations include the requirement that responses be independent of one another (which is extremely difficult to establish given game theory and the general economics of competition) and the assumption that the relationship between the predictors and the response is linear.
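To show what automatic dummy-variable creation amounts to, the sketch below expands one multi-level categorical variable into 0/1 indicator columns. It is not the procedure's internal code; treating the last sorted level as the reference level is an assumption made here for illustration.

```python
def dummy_code(values, reference=None):
    """Expand a categorical variable into indicator columns.

    One level (the reference) gets no column, so its effect is absorbed
    by the intercept. By assumption the last sorted level is the reference.
    """
    levels = sorted(set(values))
    ref = reference if reference is not None else levels[-1]
    columns = [level for level in levels if level != ref]
    rows = [[1 if v == level else 0 for level in columns] for v in values]
    return columns, rows


# Hypothetical race/ethnicity codes for four owners.
columns, rows = dummy_code(["A", "B", "W", "B"])
```

With three levels, two indicator columns are produced, and an owner in the reference level is coded as all zeros.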
Regression One: All Variables Against Receipts
Data Summary:
Number of Observations: 874,182
Included Variables: RECEIPTS_NOISY EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks OPERATING LANGUAGE
Further analysis shows that these are the top five variables based on both the magnitude of the t-value and the estimate, as these parameters indicate both statistical and practical significance. Since the vast majority of variables are significant in the model on a 0.05 significance level, the magnitude of the estimate is the most determinant factor in establishing importance.
Estimated Regression Coefficients

Parameter                         Estimate    Standard Error    t Value    Pr > |t|
SECTOR 11 Agriculture, Fishing    273.5380    19.767836         13.84
HOLIDAYS 1                        168.2109    5.420379          31.03
sets the significance level at which variables can be removed from the model, which is once again usually 0.05.
Stepwise Regression
As a fusion of forward and backward selection, stepwise regression considers four options at each stage: add a term, delete a term, swap a term in the model for one not in the model, or stop. This algorithm is the one most often used in practice. Despite its widespread use, it has little theoretical basis. A more theoretically robust tool like Akaike's Information Criterion (AIC) can also be used as a good metric to assess models.
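The idea of letting an information criterion drive term selection can be sketched with a simplified forward-only search: at each stage it only adds the term that lowers the AIC the most, and stops when no addition helps. It does not delete or swap terms, ignores survey weights, and is not PROC GLMSELECT's implementation. The data are fabricated, and the AIC is taken in its least-squares form n ln(SSE/n) + 2K.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def sse(columns, y):
    """Error sum of squares of an OLS fit of y on an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(X, y))


def aic(sse_value, n, k):
    """Least-squares AIC, up to an additive constant."""
    return n * math.log(sse_value / n) + 2 * k


def forward_select(candidates, y):
    """Add, one at a time, the term that lowers AIC most; stop when none does."""
    n = len(y)
    chosen, best = [], aic(sse([], y), n, 1)
    while True:
        trials = [
            (aic(sse([candidates[c] for c in chosen + [name]], y), n, len(chosen) + 2), name)
            for name in candidates if name not in chosen
        ]
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        chosen.append(name)
        best = score
    return chosen, best


# Fabricated data: y depends on x1 plus small noise; x2 and x3 are unrelated patterns.
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "x3": [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],
}
noise = [0.5, -0.3, 0.1, -0.4, 0.2, 0.3, -0.5, 0.1]
y = [3.0 * x + e for x, e in zip(candidates["x1"], noise)]

chosen, final_aic = forward_select(candidates, y)
baseline_aic = aic(sse([], y), len(y), 1)
```

The first term selected is x1, the only real driver of y, and the final AIC is well below that of the intercept-only model.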
Fit Statistics
Adjusted Coefficient of Determination / Adjusted R-Square
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it usually is not. It is never higher than the R-squared.
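These properties follow from the standard formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A small sketch with made-up inputs:

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical small-sample cases: many predictors relative to n.
moderate_fit = adjusted_r_squared(0.5, n=11, p=5)   # penalty wipes out the fit
weak_fit = adjusted_r_squared(0.1, n=11, p=5)       # goes negative

# With a very large n, as in this study, the adjustment is tiny.
large_sample = adjusted_r_squared(0.5771, n=874182, p=100)
```

The predictor count of 100 in the last line is a placeholder, not the actual number of parameters in the fitted models.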
Akaike Information Criterion
The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the estimated Kullback-Leibler divergence between the model and the truth. It is based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as:

$\mathrm{AIC} = -2 \ln(\text{likelihood}) + 2K$

where likelihood is the probability of the data given a model and K is the number of free parameters in the model. For a least-squares regression with normally distributed errors, this reduces, up to an additive constant, to

$\mathrm{AIC} = n \ln(\mathrm{SSE} / n) + 2K$

where SSE is the error sum of squares and n is the number of observations. AIC scores are often shown as ΔAIC scores, or the difference between the best model (smallest AIC) and each model (so the best model has a ΔAIC of zero). Used in stepwise regression, AIC can be used instead of the p-value as the main criterion for model selection. Each iterative model's AIC should be calculated and compared to the previous one, and the current model should only be preferred if its AIC is smaller than the AIC of the prior model. This process continues until the best model is selected.
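As a small worked example, the sketch below computes the least-squares form of the criterion, n ln(SSE/n) + 2K, for three candidate models and converts the scores to differences from the best model. The SSE values and parameter counts are invented for illustration.

```python
import math


def aic_least_squares(sse, n, k):
    """AIC for a least-squares fit, up to an additive constant."""
    return n * math.log(sse / n) + 2 * k


def delta_aic(scores):
    """Difference between each model's AIC and the smallest AIC."""
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


n = 50
# Hypothetical (SSE, number of parameters) for three nested models.
models = {"small": (120.0, 3), "medium": (100.0, 5), "large": (99.0, 9)}
scores = {name: aic_least_squares(s, n, k) for name, (s, k) in models.items()}
deltas = delta_aic(scores)
```

The medium model wins: the large model fits slightly better but pays for four extra parameters, while the small model saves parameters at too great a cost in fit.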
Bayesian Information Criterion
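$\mathrm{BIC} = n \log(\mathrm{SSE}_p) - n \log(n) + p \log(n)$

The BIC acts essentially the same as AIC but penalizes each additional parameter more severely once n is 8 or more, because log(n) then exceeds the AIC's constant of 2.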
Schwarz Bayesian Criterion
$\mathrm{SBC} = n \ln(\mathrm{SSE} / n) + k \ln(n)$

This is essentially like the AIC equation but multiplies the number of parameters by a penalty of ln(n), which grows with the sample size, rather than by a constant of 2. By default, PROC GLMSELECT uses stepwise selection based on the Schwarz Bayesian Criterion.[1]
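Two points about this penalty can be checked numerically. First, assuming the BIC expression given earlier is read as n log(SSE_p) - n log(n) + p log(n) with natural logarithms, it is the same quantity as the SBC. Second, ln(n) first exceeds the AIC's constant of 2 at n = 8. The inputs below are arbitrary.

```python
import math


def sbc(sse, n, k):
    """Schwarz Bayesian Criterion in its least-squares form."""
    return n * math.log(sse / n) + k * math.log(n)


def bic_expanded(sse, n, p):
    """The BIC written with the logarithm of SSE/n expanded (assumed reading)."""
    return n * math.log(sse) - n * math.log(n) + p * math.log(n)


def per_parameter_penalty(n):
    """Cost added to SBC by one more parameter; AIC always charges 2."""
    return math.log(n)


same_value = abs(sbc(100.0, 50, 5) - bic_expanded(100.0, 50, 5)) < 1e-9
penalty_at_study_size = per_parameter_penalty(874182)
```

At the 874,182 observations used in the regressions here, each extra parameter costs roughly 13.7 under SBC versus 2 under AIC, which is why SBC tends to retain fewer terms.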
Mallows' Cp

$C_p = \frac{(1 - R_p^2)(n - T)}{1 - R_T^2} - [n - 2(p + 1)]$

The AIC has been shown to be equivalent to Mallows' Cp, which is used to assess the fit of a regression model that has been estimated using ordinary least squares. Cp measures the bias in the reduced regression model relative to the full model having all T candidate predictors. If Cp is roughly equal to p, then the reduced model predicts as well as the full model. If Cp < p, then the reduced model is estimated to predict better than the full model. In practice, the selected model should have the smallest Cp.

Mean Squared Error (MSE)
The Mean Squared Error in regression refers to the residual sum of squares divided by the number of degrees of freedom. Minimizing MSE is important to ensure that the maximum amount of variation of the regression can be explained by the independent variables, thus establishing the robustness of a model. It is one of the most important and fundamental criteria that can be used to evaluate models.
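Both fit statistics are simple to compute once the sums of squares are known. The sketch below assumes that the Cp expression is read as (1 - R_p^2)(n - T)/(1 - R_T^2) - [n - 2(p + 1)], that T counts the parameters of the full model including the intercept, and that p counts the predictors of the reduced model; all numeric inputs are invented.

```python
def mean_squared_error(sse, n, n_params):
    """Residual sum of squares divided by the residual degrees of freedom."""
    return sse / (n - n_params)


def mallows_cp(r2_reduced, r2_full, n, p, t):
    """Mallows' Cp for a reduced model with p predictors.

    t: number of parameters in the full model, intercept included (assumption).
    """
    return (1.0 - r2_reduced) * (n - t) / (1.0 - r2_full) - (n - 2 * (p + 1))


# Hypothetical fit: 100 observations, full model with 5 parameters.
mse_full = mean_squared_error(sse=200.0, n=100, n_params=5)

# Sanity check: when the "reduced" model is the full model (p + 1 = t),
# Cp equals the number of parameters.
cp_full = mallows_cp(r2_reduced=0.6, r2_full=0.6, n=100, p=4, t=5)

# Dropping two predictors at a small cost in R-squared gives a Cp near p + 1.
cp_reduced = mallows_cp(r2_reduced=0.598, r2_full=0.6, n=100, p=2, t=5)
```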
General Criteria:
General diagnostics should be calculated for each model to help determine which model is best. These model diagnostics include the mean square error (MSE), the adjusted coefficient of determination (adjusted R-squared), and Mallows' Cp. A good linear model will have small MSE and Cp and a high adjusted R-squared close to 1. With these criteria in mind in addition to stepwise regression with tools such as AIC, BIC, and SBC, I can develop a more robust model than the original. However, it is also important to note that use of these criteria and selection procedures will not definitively yield the best model due to the sheer number of potential models and inherent limitations of these tools.
Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
Using PROC GLMSELECT, I used the aforementioned steps to select a model using only stepwise regression.
Data Summary:
Number of Observations: 874,182
Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 SECTOR N07_EMPLOYER SEX1 VET1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 EDUC1 BORNUS1 ESTABLISHED SCSAVINGS SCEQUITY SCCREDIT SCGOVTGUAR SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE ECSAVINGS ECASSETS ECCREDIT ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT LT40HOURS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING HUSBWIFE NUMOWNERS REGION

[1] http://www2.sas.com/proceedings/sugi31/207-31.pdf
Fit Statistics:
R-square: 0.5768
Adjusted R-square: 0.5768
Root MSE: 1516.49340
Regression Three: PROC GLMSELECT with Akaike's Information Criterion
Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC.
Data Summary:
Number of Observations: 874,182
Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE
Fit Statistics:
R-square: 0.5771
Adjusted R-square: 0.5770
Root MSE: 1516.10130
Since the AIC model is on par with the highest overall adjusted R-squared, has the fewest model variables, and has a relatively low MSE compared to the second stepwise model, it should be considered the most robust. It is noted that the first linear regression of all variables has a considerably lower MSE than the stepwise-selected models, despite the inclusion of far more variables. However, given the prevalence of missing data and nonresponse in these data sets, future data collection may benefit from relying on fewer variables in the generalized linear model. Therefore, the AIC model ranks as the most effective.
The most significant variables are also those noted in Regression One: sector, health insurance, holidays, and employer status for tabulation. Sector is arguably the most intuitively significant, while both health insurance and holidays are usually indicative of a business performing well enough to provide extended luxuries for employees. Surprisingly, both payroll and employment had far smaller magnitudes in their estimates, which indicates insignificance on a practical level.
There were several limitations to the regression. As stated earlier, the main drawbacks of using the generalized linear model are that it can over-fit the data and that it assumes the relationships between the explanatory variables and the response variable are linear, which may not be the case. Ideally, tests for nonlinearity could be conducted on each variable by plotting the residuals versus predicted values. Furthermore, multicollinearity should be assessed, for example by calculating the variance inflation factor for each variable; this remains to be done because there is no built-in functionality for this purpose for survey data in SAS. Due to the large number of variables, testing for multicollinearity is especially important. From a cursory analysis, most of the t-ratios for the individual coefficients are statistically significant, which could indicate that multicollinearity is not severe. Another possible option was to create a correlation matrix, which proved to be too cumbersome considering the matrix would have dimensions greater than 150 x 150. Finally, further work could be done to investigate the addition of extra terms, such as interactions between the original terms that better reflect the lack of complete independence between explanatory variables.
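The variance inflation factor check described above can be sketched outside SAS. The following is a minimal, unweighted Python illustration, not the survey-weighted procedure used in this study: it relies on the identity that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. The three columns and their values are hypothetical toy data, and categorical variables would first need to be dummy-coded.

```python
import math

def pearson(x, y):
    """Plain (unweighted) Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """Return {name: VIF}; VIFs are the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Hypothetical toy data: payroll tracks employment closely, hours does not.
toy = {
    "employment": [0, 1, 2, 3, 4, 5, 6, 7],
    "payroll": [1, 19, 39, 61, 81, 99, 119, 141],
    "hours": [40, 40, 20, 20, 20, 20, 40, 40],
}
result = vif(toy)
for name, value in result.items():
    print(f"{name}: VIF = {value:.1f}")
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic collinearity; in this toy example the two nearly proportional columns are flagged while the unrelated one is not.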
Proper model selection requires conscientious consideration of the tradeoff between the competing objectives of adherence to the data and model simplicity. Good models conform to the data with a strong goodness of fit, but are also easily generalizable in their interpretation. Finally, good models should neither under-fit (leaving out key variables in an attempt to be generalizable) nor over-fit the data (including extraneous or unrealistic variable effects in an attempt to have the best goodness of fit) because in each scenario, the conclusion loses value.
QQ Plot of Residuals versus a Normal Distribution
This quantile-quantile plot of the residuals versus a normal distribution shows that the residuals seem to be normally distributed through the inner quartiles, but are heavily skewed with long tails on both sides, particularly the left side. Since this Q-Q plot indicates significant skew in this model, the model cannot be used to draw strong conclusions.
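To make the reading of the Q-Q plot concrete, the sketch below pairs each sorted residual with the quantile a normal distribution with the same mean and standard deviation would place at that rank. It is a plain-Python illustration under stated assumptions: the residual values are made up, and the plotting position (i + 0.5)/n is one common convention rather than necessarily the one SAS uses.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical normal quantile, observed residual) pairs."""
    ordered = sorted(residuals)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting position (i + 0.5) / n keeps probabilities strictly inside (0, 1).
    return [(reference.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(ordered)]

# Hypothetical residuals with one extreme negative value (a long left tail).
toy_residuals = [-900, -40, -20, -10, 0, 5, 10, 20, 30, 60]
points = qq_points(toy_residuals)
for theoretical, observed in points:
    print(f"{theoretical:10.2f} {observed:10.2f}")
```

In a long left tail, the smallest observed residuals fall well below their theoretical quantiles, which is the pattern described for the left side of the plot.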
Plot of Fitted Values versus Residuals
This plot of fitted values versus residuals shows a biased and heteroscedastic spread, with the variation of the residuals steadily decreasing at higher values of the response variable. Because the large overall variation and sheer number of data points make the plot extremely dense, more specific phenomena cannot be analyzed at this point.
The plot also points to the issue of the model drastically over-estimating predicted values given the sheer magnitude of the negative residuals, which calls for further analysis among observed businesses that deviate the most from the model. Given that over half of the observed businesses have receipts of less than $16,000 according to the five-number summary, running a regression on the subset of the PUMS that only includes businesses earning more than the median may produce a better-fitting, unbiased, and homoscedastic model.
The five-number summary for residual values clearly shows greater negative skew on the whole, but a greater density of positive values over smaller intervals, thus confirming the residuals versus fitted plot. Since numerical variables typically have more leverage over the fit and spread of the residuals, and receipts is extremely skewed right, an analysis of both payroll and employment is warranted to see if similar effects are in place. The following two tables describe the five-number summaries for both payroll and employment.
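As an illustration of how a five-number summary is obtained, here is a minimal, unweighted Python sketch. It is an assumption-laden stand-in for the SAS output: it ignores the tabulation weights, and its quartile interpolation (the "inclusive" method of `statistics.quantiles`) may differ slightly from the quantile definition SAS applies. The employment-like values are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical employment-like data: mostly zeros with a few large values.
toy_employment = [0, 0, 0, 0, 0, 0, 3, 12]
print(five_number_summary(toy_employment))
```

When most observations are zero and a few are large, the minimum, first quartile, and median all collapse to zero, which is the signature of heavy right skew.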
These results confirm that both employment and payroll are heavily skewed right, which would explain the potential for the model to drastically overestimate certain values. Since the actual magnitude (from a dollar or labor force standpoint) causes this extreme skew, creating a new binary variable for each of these variables, such that 0 indicates employment or payroll of 0 and 1 indicates employment or payroll greater than 0, could result in better fit. Additionally, taking the logarithm of RECEIPTS_NOISY may improve fit as well, given the skew.
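The two transformations proposed above can be sketched as follows. This is a Python illustration rather than the SAS data step used in the study. The names `ifpayroll` and `ifemployment` match the variables listed for Regression Five; `log_receipts`, the use of the natural logarithm, and the choice to treat zero receipts as missing are assumptions made for this sketch, and the records are made up.

```python
import math

def add_transformed_fields(record):
    """Return a copy of one business record with the derived fields added."""
    out = dict(record)
    # Binary indicators: 1 if the business reports any payroll or employment.
    out["ifpayroll"] = 1 if record["PAYROLL_NOISY"] > 0 else 0
    out["ifemployment"] = 1 if record["EMPLOYMENT_NOISY"] > 0 else 0
    # Assumption: the log is undefined for zero receipts, so mark it missing.
    receipts = record["RECEIPTS_NOISY"]
    out["log_receipts"] = math.log(receipts) if receipts > 0 else None
    return out

# Hypothetical records, not taken from the PUMS.
sample = [
    {"RECEIPTS_NOISY": 15.2, "PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 0.0},
    {"RECEIPTS_NOISY": 820.0, "PAYROLL_NOISY": 210.5, "EMPLOYMENT_NOISY": 6.0},
    {"RECEIPTS_NOISY": 0.0, "PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 0.0},
]
transformed = [add_transformed_fields(r) for r in sample]
for row in transformed:
    print(row["ifpayroll"], row["ifemployment"], row["log_receipts"])
```

Observations whose log is missing would drop out of a regression on the transformed response, so the number of usable observations can shrink after the transform.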
In order to establish a control for the following regressions, I first ran PROC GLMSELECT with AIC, changing only the response to the logarithm of receipt values.
Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
Data Summary: Number of Observations 784,208
Five Number Summary of RECEIPTS_NOISY
Min          Q1           Median       Q3           Max
0            1.72393      15.23411     81.77834     6900

Five Number Summary of Residuals
Min          Q1           Median       Q3           Max
-93534.00    -222.63      -36.71       101.67       8271.42

Five Number Summary of EMPLOYMENT_NOISY
Min          Q1           Median       Q3           Max
0            0            0            0            4890.00

Five Number Summary of PAYROLL_NOISY
Min          Q1           Median       Q3           Max
0            0            0            0            280000.00
Included Variables: PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE
Fit Statistics: R-Square 0.6689, Adjusted R-Square 0.6688, Root MSE 3.07507
While the fit has improved by most standards, the overall Q-Q plot and the fitted versus residual values plot still indicate major issues in skew and bias. As stated earlier, creating binary variables out of the existing numeric payroll and employment variables could control some of the drastic skewness observed in both plots, which leads to the next regression.
Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC with the modified variables of binary payroll and employment.
Variables Used: PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1
PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1
FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1
ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN
SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCAMOUNT
HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY
ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECVENTURE ECPROFITS
ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL
STATELOCAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE
HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS
LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR
HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE ifpayroll
ifemployment
Fit Statistics: R-square 0.6550, Adjusted R-square 0.6549, Root MSE 3.13900
Although the fit statistics seem to indicate worse fit than the previous model, given the lower coefficients of determination and the higher root MSE, the Q-Q plot clearly indicates better overall fit and far less skew, especially left skew.
The correction of skew also allows me to observe the actual bulk of the fitted versus residuals plot more closely, which is starting to show very peculiar patterns that require further investigation. One possible reason for these patterns is that virtually all of the regressed variables are categorical, with the exception of the ownership percentage of the first business owner.
Although the previous model had better fit values and the current model may be susceptible to overfitting to the sample, this model iteration still ultimately has better results according to the Q-Q plot and the fitted versus residuals plot due to reduced skewness.
Analysis of Language
Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
The United States is often seen as the nation most conducive to immigrant entrepreneurship, especially in an increasingly globalized society. To start, I investigated the relationship between different languages and overall business receipts. Running a general linear model procedure for only the language variable against receipts results in the following:
Estimates of Language Variable Levels Against Receipts
One interesting phenomenon is that a business that conducts transactions through only English and Spanish is associated with higher gross receipts than a business that conducts business transactions through any other language or language combination. Most notably, the estimate for a business that only speaks English and Spanish is nearly 40% greater than the estimate for a business that only speaks English for transactions.
It's possible that Hispanics are employed in the agriculture, manufacturing, and construction industries more so than in others. The following graph, based on 2008 research by the Center for Construction Research and Training, depicts Hispanic employees as a percentage of each industry.
These data point to a possible underlying non-uniform distribution of Hispanics across business industries. It's important to understand the difference between businesses that conduct transactions in a certain language and businesses that operate internally using the language. For example, a business with a predominantly Hispanic workforce may not necessarily conduct business transactions in Spanish. Therefore, an additional research question could be the relationship between the language and the new race/ethnicity variable. The percentage of Hispanics by industry in this instance simply points to industries to investigate more closely, such as construction and agriculture.
Using this information, I analyzed the breakdown of language by NAICS sector code, paying special attention to the industries that had the greatest percentage of "Only English and Spanish" businesses.
Industries with the Greatest Proportion of "Only Spanish and English" Businesses:
Sector 55: 13.63%
The Management of Companies and Enterprises sector comprises (1) establishments that hold the securities of (or other equity interests in) companies and enterprises for the purpose of owning a controlling interest or influencing management decisions or (2) establishments (except government establishments) that administer, oversee, and manage establishments of the company or enterprise and that normally undertake the strategic or organizational planning and decision making role of the company or enterprise.
Sector 62: 10.74%
The Health Care and Social Assistance sector comprises establishments providing health care and social assistance for individuals.

Industries with the Greatest Proportion of "Only English" Businesses:
Sector 21: 95.71%
The Mining sector comprises establishments that extract naturally occurring mineral solids, such as coal and ores; liquid minerals, such as crude petroleum; and gases, such as natural gas.
Sector 11: 92.90%
The Agriculture, Forestry, Fishing and Hunting sector comprises establishments primarily engaged in growing crops, raising animals, harvesting timber, and harvesting fish and other animals from a farm, ranch, or their natural habitats.
Given these interpretations, looking at the mean receipts for each sector could then explain why businesses that conducted business transactions in English and Spanish have statistically higher receipts on average. The following table shows each sector and its mean receipts:
Statistics
Variable                                      Mean          Std Error of Mean
RECEIPTS_NOISY                                174.631446    0.353243

Domain Analysis: SECTOR
SECTOR                                        Mean          Std Error of Mean
11 Agriculture, Fishing                       96.067138     2.182337
21 Mining, Quarrying, Oil Extraction          233.433129    6.192045
22 Trade, Transportation, Utilities           113.609696    6.397033
23 Construction                               218.497973    1.149555
31 Manufacturing                              527.070573    4.920525
42 Wholesale Trade                            628.044453    5.295762
44 Retail Trade                               267.633023    1.575551
48 Transportation and Warehousing             144.352095    1.225643
51 Information                                156.241621    2.441339
52 Finance and Insurance                      170.082058    1.558410
53 Real Estate and Rental                     120.396544    0.699686
54 Professional, Scientific, Technical Services  139.530715    0.707921
55 Mgmt. of Companies and Enterprises         490.728375    19.475524
56 Admin. and Support and Waste Management    100.129900    0.835629
61 Education                                  45.306452     0.868810
62 Healthcare                                 175.218046    1.177138
71 Arts and Etnmt.                            58.393615     0.740265
72 Accommodation and Food Services            356.426518    2.657923
81 Other Services                             69.516196     0.417748
99 Unclassifiable                             102.268452    8.707951
According to this PROC SURVEYMEANS output, which reports the mean gross receipts of the average business in each industry alongside the mean gross receipts of all businesses across industries (174.63), both sectors 55 and 62 earn above average. While this preliminarily explains why the overall estimate for "Only English and Spanish" businesses is higher than the estimate of other language levels, the following question should be investigated:
Research Question 1b: Out of only existing "Only English and Spanish" businesses, which are the most popular industries for business?
Table of LANGUAGE by SECTOR
LANGUAGE                    SECTOR    Frequency    Weighted Frequency    Percent
Only English and Spanish    11        390          4076                  0.4269
                            21        263          1907                  0.1998
                            22        123          533.41200             0.0559
                            23        7472         101173                10.5960
                            31        3480         17790                 1.8632
                            42        4346         27406                 2.8702
                            44        13181        113815                11.9199
                            48        3979         44151                 4.6239
                            51        1460         9222                  0.9658
                            52        5665         46651                 4.8858
                            53        6760         85715                 8.9770
                            54        11573        117115                12.2656
                            55        1159         1581                  0.1656
                            56        5933         74071                 7.7575
                            61        1325         16665                 1.7453
                            62        11269        127184                13.3201
                            71        1917         25298                 2.6495
                            72        4017         35816                 3.7511
                            81        7699         104522                10.9467
                            99        22           135.88500             0.0142
Total                                 92033        954827                100.000
The five highest concentrations of "Only English and Spanish" businesses are in sectors 62 (13.32% and receipts of 175.22), 54 (12.27% and receipts of 139.53), 44 (11.92% and receipts of 267.63), 81 (10.95% and receipts of 69.52), and 23 (10.60% and receipts of 218.50). Although "Only English and Spanish" businesses made up a large share of the businesses within sector 55, that sector actually had one of the smallest actual frequencies, with just 1159 such businesses in total. Revising my initial conclusion, I argue that the higher estimate is most likely derived from consistently above average performance in sectors where "Only English and Spanish" businesses are prevalent.
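The composition argument above can be checked with a back-of-the-envelope calculation. The Python sketch below weights each sector's overall mean receipts by the percentage of "Only English and Spanish" businesses in that sector, using the figures from the two tables above. It rests on a strong simplifying assumption that is not made in the study itself: that these businesses earn their sector's overall mean. It therefore isolates the effect of sector mix only and is not a substitute for the regression estimate.

```python
# Mean receipts by sector (PROC SURVEYMEANS table above).
SECTOR_MEAN_RECEIPTS = {
    11: 96.067138, 21: 233.433129, 22: 113.609696, 23: 218.497973,
    31: 527.070573, 42: 628.044453, 44: 267.633023, 48: 144.352095,
    51: 156.241621, 52: 170.082058, 53: 120.396544, 54: 139.530715,
    55: 490.728375, 56: 100.129900, 61: 45.306452, 62: 175.218046,
    71: 58.393615, 72: 356.426518, 81: 69.516196, 99: 102.268452,
}
# Percent of "Only English and Spanish" businesses by sector (table above).
ENGLISH_SPANISH_SHARE = {
    11: 0.4269, 21: 0.1998, 22: 0.0559, 23: 10.5960, 31: 1.8632,
    42: 2.8702, 44: 11.9199, 48: 4.6239, 51: 0.9658, 52: 4.8858,
    53: 8.9770, 54: 12.2656, 55: 0.1656, 56: 7.7575, 61: 1.7453,
    62: 13.3201, 71: 2.6495, 72: 3.7511, 81: 10.9467, 99: 0.0142,
}
OVERALL_MEAN = 174.631446

def composition_weighted_mean(shares, means):
    """Average of sector means, weighted by a group's share in each sector."""
    total = sum(shares.values())
    return sum(shares[s] * means[s] for s in shares) / total

expected = composition_weighted_mean(ENGLISH_SPANISH_SHARE, SECTOR_MEAN_RECEIPTS)
print(f"composition-only mean: {expected:.2f} vs overall mean: {OVERALL_MEAN:.2f}")
```

Under that assumption, a composition-only mean above the overall mean would be consistent with the sector mix contributing to the higher estimate, although it would not by itself account for the full gap.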
Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
As noted earlier, there is a distinction between businesses run by Hispanic owners and businesses that conduct transactions in English and Spanish. I approached this issue in a different fashion, by plotting the percentage of both (for each state) on a map of the United States. I pre-defined set intervals after looking at the spread of percentages for each state. The following graphs generally depict the same patterns of higher concentrations of all Hispanic/Spanish-speaking subjects located in the Southwest and Florida. I obtained 2007 data on the number of Hispanics living in each state from the Pew Research Hispanic Trend Project.
Analysis of Capital Sources in California versus the United States
California's $2 trillion economy would be the ninth biggest in the world if it were a country. The state represents 13% of the U.S. economy. However, California has been ranked as one of the worst states to do business in recent years according to business executives and publications.[2] The state has been under duress from the dramatic fall in home prices and the reduced tax revenues for the state. Moreover, California consistently has one of the highest costs of living and operation. Interestingly, California also ranks among the best for technology and innovation. Another plus is the $36 billion in venture capital money invested in California companies in the past three years, which is four times the total of any other state.[3] California is also the home of Silicon Valley. According to a 2006 study done by the American Electronics Association, Silicon Valley and the Bay Area as a whole ranked first in terms of the number of high-tech jobs in the United States. While the PUMS does not have location data more granular than the state level, the existence of Silicon Valley itself could point to interesting statistical characteristics of California that no other state may share.
Because California is a state of extremes, I found it interesting to investigate sources of startup and expansion capital and their relationship to revenues in California, especially compared to this relationship between capital and revenues in the United States.

Dataset Overview
The observed dataset is simply a subset of the cleaned PUMS, obtained by keeping only observations with a FIPS code of 06 (California). This dataset has 182,932 observations, and most categorical variables (except location) seem to have a missing percentage of 30-50%, which is slightly higher than the typical missing pattern observed in the U.S. dataset. In this section, both startup and expansion capital will be analyzed.
Startup capital refers to the initial cost of investment to fully bring a product or service to market. It can be used for everything from business operation expenses to research and development to payroll. It is typically used to fund businesses still in their infancy, and can be repaid once the business reaches a level of maturity to earn revenues on its own.
Companies that seek expansion capital, on the other hand, will often do so in order to finance a transformational event in their business. These companies are likely to be more mature (in terms of operating time) than venture capital-funded companies, able to generate revenue and operating profits but unable to generate sufficient cash to fund major opportunities, acquisitions, or other investments. Because of this lack of scale, these companies generally can find few alternative conduits to secure capital for growth, so access to growth equity can be critical to pursue necessary facility expansion, sales and marketing initiatives, equipment purchases, and new product development.
[2] http://www.cnbc.com/id/100843287
[3] http://www.forbes.com/pictures/mli45kikd/41-california/
Research Question 2a: What sources of capital have the most positive relationship with receipts?
Regressing both startup capital and expansion capital variables against receipts in both the California and United States datasets reveals interesting statistics on the percentage of usage of each type of capital, as well as the practical significance of each capital source as represented by its estimate. The following data tables derive content from the full list of SAS outputs contained in Appendix C.
Summarized Table for Startup Capital
                California          U.S.                Estimates           True Estimates (added to Intercept)
Source          Yes       No        Yes       No        CA        USA       CA        USA       Difference
SCSAVINGS       60.97%    39.03%    58.01%    41.99%    11.70     -15.78    113.71    154.40    -40.69
SCASSETS        6.23%     93.77%    7.05%     92.95%    66.98     48.85     168.98    219.03    -50.04
SCEQUITY        6.12%     93.88%    5.07%     94.93%    71.58     49.37     173.59    219.55    -45.96
SCCREDIT        11.22%    88.78%    10.31%    89.69%    -53.88    -90.21    48.13     79.96     -31.83
SCGOVTLOAN      0.40%     99.60%    0.56%     99.44%    155.74    91.93     257.74    262.10    -4.36
SCGOVTGUAR      0.45%     99.55%    0.59%     99.41%    128.86    195.60    230.87    365.78    -134.92
SCBANKLOAN      4.47%     95.53%    9.18%     90.82%    247.36    305.75    349.37    475.93    -126.56
SCFAMLOAN       2.23%     97.77%    2.28%     97.72%    36.52     180.69    138.53    350.86    -212.34
SCVENTURE       0.30%     99.70%    0.25%     99.75%    68.87     364.61    170.88    534.78    -363.91
SCGRANT         0.18%     99.82%    0.20%     99.80%    -78.63    -103.56   23.37     66.62     -43.24
SCOTHER         1.74%     98.26%    1.70%     98.30%    203.10    170.52    305.10    340.69    -35.59
SCDONTKNOW      4.47%     95.53%    4.58%     95.42%    150.63    183.79    252.64    353.96    -101.32
SCNONENEEDED    23.59%    76.41%    24.35%    75.65%    -53.49    111.12    48.52     281.30    -232.78
SCNOTREPORTED   5.33%     94.67%    5.32%     94.68%    0.00      0.00      102.01    170.18    -68.17
INTERCEPT                                               102.01    170.18
Glossary of Capital Sources

Startup Capital (SC)
SCSAVINGS: Personal Savings
SCASSETS: Other Personal Assets
SCEQUITY: Home Equity
SCCREDIT: Credit Cards
SCGOVTLOAN: Government Loan
SCGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
SCBANKLOAN: Loan from Bank
SCFAMLOAN: Loan from family and friends
SCVENTURE: Venture Capitalist
SCGRANT: Grant
SCOTHER: Other

Expansion Capital (EC)
ECSAVINGS: Personal Savings
ECASSETS: Other Personal Assets
ECEQUITY: Home Equity
ECCREDIT: Credit Cards
ECGOVTLOAN: Government Loan
ECGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
ECBANKLOAN: Loan from Bank
ECFAMLOAN: Loan from family and friends
ECVENTURE: Venture Capitalist
ECPROFITS: Business Profits
ECGRANT: Grant
ECOTHER: Other
ECNOEXPAND: Did not expand
ECNOACCESS: No Access to Expansion Capital

In terms of usage frequency, differences greater than 1% between the United States and California are bolded. Regardless, it's important to note that most of these differences (even differences of less than 1%) are statistically significant due to the large sample size. Businesses in the United States as a whole are more than twice as likely to use bank loans as a source of startup capital when compared to businesses in California, while Californian business owners are more likely to use home equity and their own savings to start ventures. Overall, the top three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans. For California, these categories are bank loans, other sources of capital (unspecified), and government loans.
Consistent with its low ranking, California as a whole seems to perform worse than the United States in every category according to the difference in estimates. Most interestingly, California performs comparatively worst in the venture capital category, despite being the state with the greatest amount of venture capital invested in business venture formation. Other comparatively poor categories are loans from family members and businesses that did not need startup capital. These results warrant further analysis of the spread, rather than the average, of certain categories, as it is possible for Californian businesses to have greater extremes than American businesses in general.
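The "True Estimates" and "Difference" columns in the startup capital table follow from simple arithmetic: each regression estimate is added to its dataset's intercept, and the U.S. value is subtracted from the California value. The Python sketch below reproduces that calculation for three rows taken from the table; the function name is an illustrative choice, and results can differ from the published columns in the last digit because the inputs are already rounded.

```python
# Intercepts and selected estimates copied from the startup capital table.
INTERCEPT = {"CA": 102.01, "USA": 170.18}
ESTIMATES = {
    "SCSAVINGS": {"CA": 11.70, "USA": -15.78},
    "SCBANKLOAN": {"CA": 247.36, "USA": 305.75},
    "SCVENTURE": {"CA": 68.87, "USA": 364.61},
}

def true_estimates(source):
    """Return (CA true estimate, USA true estimate, CA minus USA difference)."""
    ca = ESTIMATES[source]["CA"] + INTERCEPT["CA"]
    usa = ESTIMATES[source]["USA"] + INTERCEPT["USA"]
    return round(ca, 2), round(usa, 2), round(ca - usa, 2)

for name in ESTIMATES:
    print(name, true_estimates(name))
```

Because the intercept differs between the two datasets, the sign of a raw estimate alone does not determine which dataset has the higher fitted receipts for a given source; the comparison has to be made on the summed values.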
Summarized Table for Expansion Capital
                California          U.S.                Estimates           True Estimates (added to Intercept)
Source          Yes       No        Yes       No        CA        USA       CA        USA       Difference
ECSAVINGS       32.24%    67.76%    29.00%    71.00%    -33.74    -112.84   76.38     118.78    -42.39
ECASSETS        3.90%     96.10%    4.00%     96.00%    21.89     5.40      132.03    237.03    -105.00
ECEQUITY        5.84%     94.16%    4.33%     95.67%    135.96    90.04     246.09    321.67    -75.57
ECCREDIT        13.68%    86.32%    12.09%    87.91%    -9.45     -88.74    100.67    142.87    -42.19
ECGOVTLOAN      0.33%     99.67%    0.39%     99.61%    419.09    99.43     529.23    331.06    198.17
ECGOVTGUAR      0.26%     99.74%    0.29%     99.71%    63.47     167.61    173.61    399.24    -225.63
ECBANKLOAN      4.60%     95.40%    7.54%     92.46%    461.09    446.34    571.22    677.97    -106.74
ECFAMLOAN       1.06%     98.94%    0.93%     99.07%    69.17     113.62    179.31    345.25    -165.93
ECVENTURE       0.18%     99.82%    0.13%     99.87%    514.95    521.87    625.09    753.49    -128.40
ECPROFITS       8.95%     91.05%    9.23%     90.77%    124.95    162.90    235.09    394.53    -159.43
ECGRANT         0.20%     99.80%    0.19%     99.81%    -98.08    -131.52   12.05     100.10    -88.04
ECOTHER         0.83%     99.17%    0.76%     99.24%    127.44    132.62    237.57    364.24    -126.67
ECDONTKNOW      6.76%     93.24%    6.57%     93.43%    -9.43     -43.75    100.70    187.86    -87.16
ECNOACCESS      1.93%     98.07%    1.80%     98.20%    -68.32    -161.82   41.81     69.80     -27.98
ECNOEXPAND      45.80%    54.20%    48.29%    51.71%    -26.99    -108.06   83.14     123.56    -40.41
ECNOTREPORTED   7.38%     92.62%    7.61%     92.39%    0         0         110.13    231.63    -121.49
INTERCEPT       N/A       N/A       N/A       N/A       110.137   231.627
As done previously, differences greater than 1% between the United States and California are bolded. Businesses in the United States as a whole are more likely to use bank loans for expansion capital, or not require it at all. Businesses in California are more likely to use their own savings, home equity, or credit card debt to fund expansion.
Overall, the top three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans (which is identical to the top three for startup capital). For California, these categories are venture capital, bank loans, and government loans. Once again, businesses in the United States tend to benefit more from virtually all sources of expansion capital than businesses in California, with the exception of having a government loan. These interesting characteristics further warrant an analysis of spread, rather than just the average, for certain categories.
Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
SCVENTURE and ECVENTURE Analysis
Even though California performed worse on average according to its estimates, I initially hypothesized that California had an overall larger spread, with a maximum and upper quartile point likely exceeding the maximum and upper quartile of receipts in the United States. According to the following five-number summary, California's quantiles for startup venture capital exceed those of the United States except for the maximum value, which indicates that Californian businesses holistically perform better than businesses in the United States in general. The previous estimates from the regression are influenced by the heavier right skewness of the United States data.
Venture Capital - Five Number Summaries
Quantile    California - SC    US - SC    California - EC    US - EC
Min         0                  0          0                  0
Q1          10.57              7.71       9.34               12.32
Median      97.28              52.54      91.73              92.12
Q3          774.90             454.80     608.45             725.24
Max         6600.00            6900.00    6900.00            6800.00
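The quantile-by-quantile comparison for startup venture capital can be made explicit with a short Python sketch. It only restates the SC columns of the table above and adds the interquartile range (Q3 minus Q1) as one simple measure of spread; the variable names are illustrative.

```python
# Startup venture capital quantiles from the table above: (California, US).
VENTURE_SC = {
    "Q1": (10.57, 7.71),
    "Median": (97.28, 52.54),
    "Q3": (774.90, 454.80),
    "Max": (6600.00, 6900.00),
}

# Quantiles at which California's receipts exceed those of the United States.
california_higher = [q for q, (ca, us) in VENTURE_SC.items() if ca > us]

# Interquartile range as a simple measure of spread for each group.
iqr_ca = VENTURE_SC["Q3"][0] - VENTURE_SC["Q1"][0]
iqr_us = VENTURE_SC["Q3"][1] - VENTURE_SC["Q1"][1]

print("California higher at:", california_higher)
print(f"IQR California: {iqr_ca:.2f}, IQR US: {iqr_us:.2f}")
```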
ECGOVTGUAR Analysis
Since ECGOVTGUAR was the only source of expansion capital for which California had a greater estimate than the United States, I also wanted to conduct a spread analysis. According to the PROC SURVEYFREQ procedure, there are 440 businesses in California that used a government guaranteed loan for expansion capital.
Expansion Capital from Gov't Guar. Loan
Quantile California - SC US - SC
Min 0.00 0.00
Q1 83.57 22.29
Median 301.42 174.09
Q3 993.67 621.90
Max 6900.00 6900.00
The spread confirms the general interpretation from the estimate: Californian businesses as a whole tend to benefit more from government guaranteed loans.
Conclusion
My analysis of the Public Microdata Sample of the 2007 Survey of Business Owners involved cleaning and manipulating raw data, selecting a generalized linear model of all variables against receipts through Akaike's Information Criterion, analyzing business transaction language, industry of business, and ethnicity of the owner in the context of state location, and investigating the differences in capital sources between businesses in California and the United States in general.
From the data cleaning, I was able to incorporate my knowledge of the background survey methodology to consolidate and remove variables and prepare the data for proper analysis in this context. From the model fitting, I was able to learn and apply different model selection techniques and selection criteria (AIC, BIC, SBC, Mallows' Cp, MSE, and adjusted R-square) to ultimately choose the best fitting model, which attained a moderately strong adjusted coefficient of determination after several model selection manipulations that involved a logarithm transformation to reduce skewness and improve overall fit.
The language analysis revealed that businesses that conducted transactions in only English and Spanish had higher estimates than businesses that only used English. Investigating these businesses within the context of sector ultimately showed that businesses that only used English and Spanish were represented at a higher percentage in certain sectors (such as sector 23, Construction) that earned more on average than other sectors where "Only English" businesses had a higher percentage. Finally, the analysis of sources of capital revealed that California overwhelmingly performed worse than businesses in the United States based on averages and estimates in regression, but closer analysis of the spread of receipts given startup venture capital in California shows that estimates can be misleading, and heavy right skewness invalidates conclusions based on the average alone.
Given more time and access to the actual dataset (without noise and other confidentiality-preserving measures), I would be able to develop a more powerful and accurate model, along with other analyses. Other important research directions include creating a correlation matrix to examine collinearity between variables, examining the relationship between receipts and further demographic information such as age, gender, and education level, and conducting statistical analysis with different response variables, such as employment and payroll.
Lessons Learned
I've come to believe that doing independent research in statistics is incredibly important. Through this study, I've been able to apply everything that I've learned in all of my statistics courses, from handling a very large data set in SAS to making the proper assumptions and conclusions from my analyses. I argue that this is the highest form of learning, as it is completely experiential, based on existing data, and set entirely in real-world scenarios. I've also been fortunate enough to study the fusion of my two academic fields, business and statistics. The flexibility of independent research has allowed me to learn about existing literature in the vast field of business statistics and entrepreneurship, as well as to explore other possible research ideas such as social network analysis and survival analysis.
Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
The GLMSELECT Procedure
Selected Model
The selected model is the model at the last step (Step 81).
Effects: Intercept PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE
Analysis of Variance
Source DF Sum of Squares Mean Square F Value
Model 207 14974589 72341 7650.22
Error 784001 7413568 9.45607
Corrected Total 784208 22388157
Root MSE 3.07507
Dependent Mean 4.23116
R-Square 0.6689
Adj R-Sq 0.6688
AIC 2546267
AICC 2546268
SBC 1764464
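As a consistency check on the output above, the derived quantities can be recomputed from the sums of squares and degrees of freedom. This is a minimal sketch using the standard ANOVA identities; the variable names are illustrative and do not come from the SAS program.

```python
import math

# Values copied from the Analysis of Variance table above.
model_df, model_ss = 207, 14974589
error_df, error_ss = 784001, 7413568
total_df, total_ss = 784208, 22388157

mean_sq_model = model_ss / model_df          # Mean Square (Model)
mean_sq_error = error_ss / error_df          # Mean Square (Error)
f_value = mean_sq_model / mean_sq_error      # F Value
root_mse = math.sqrt(mean_sq_error)          # Root MSE
r_square = model_ss / total_ss               # R-Square
# Adjusted R-square compares the error variance with the total variance.
adj_r_square = 1 - mean_sq_error / (total_ss / total_df)

print(f"MS model = {mean_sq_model:.0f}, MS error = {mean_sq_error:.5f}")
print(f"F = {f_value:.2f}, Root MSE = {root_mse:.5f}")
print(f"R-Square = {r_square:.4f}, Adj R-Sq = {adj_r_square:.4f}")
```

The recomputed values agree with the reported table to the printed precision, so the flattened ANOVA rows above appear to have been extracted intact.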
Parameter Estimates
Parameter DF Estimate Standard Error t Value
Intercept 1 1.889667 0.094597 19.98
PAYROLL_NOISY 1 0.001098 0.000006900 159.06
EMPLOYMENT_NOISY 1 0.013737 0.000194 70.72
PCT1 1 0.000336 0.000107 3.13
FIPST 01 1 0.069510 0.012035 5.78
FIPST 04 1 0.098799 0.010865 9.09
FIPST 05 1 0.007242 0.013682 0.53
FIPST 06 1 0.190992 0.008333 22.92
FIPST 08 1 0.042436 0.010223 4.15
FIPST 09 1 0.176561 0.011957 14.77
FIPST 12 1 0.042020 0.008756 4.80
FIPST 13 1 0.066957 0.009863 6.79
FIPST 15 1 0.128502 0.017731 7.25
FIPST 16 1 -0.020266 0.015023 -1.35
FIPST 17 1 0.053238 0.009259 5.75
FIPST 18 1 0.012449 0.010725 1.16
FIPST 19 1 -0.066745 0.012767 -5.23
FIPST 20 1 -0.015773 0.013107 -1.20
FIPST 21 1 -0.007839 0.012161 -0.64
FIPST 22 1 0.102305 0.012320 8.30
FIPST 23 1 0.000535 0.015441 0.03
FIPST 24 1 0.120146 0.010882 11.04
FIPST 25 1 0.144996 0.010390 13.96
FIPST 26 1 0.006431 0.009752 0.66
FIPST 27 1 0.006867 0.010502 0.65
FIPST 28 1 0.052754 0.014819 3.56
FIPST 29 1 -0.014865 0.010791 -1.38
FIPST 30 1 -0.065514 0.016294 -4.02
FIPST 31 1 -0.083451 0.015061 -5.54
FIPST 32 1 0.171132 0.014256 12.00
FIPST 33 1 0.122197 0.015510 7.88
FIPST 34 1 0.184363 0.009862 18.69
FIPST 35 1 0.046100 0.015776 2.92
FIPST 36 1 0.132543 0.008847 14.98
FIPST 37 1 0.057997 0.009811 5.91
FIPST 39 1 0.021252 0.009527 2.23
FIPST 40 1 -0.000955 0.012247 -0.08
FIPST 41 1 0.055542 0.011366 4.89
FIPST 42 1 0.069120 0.009322 7.41
FIPST 45 1 0.045966 0.011929 3.85
FIPST 47 1 0.070336 0.010845 6.49
FIPST 48 1 0.107862 0.008739 12.34
FIPST 49 1 0.083133 0.012981 6.40
FIPST 51 1 0.079476 0.010169 7.82
FIPST 53 1 0.103092 0.010198 10.11
FIPST 54 1 -0.057228 0.017523 -3.27
FIPST 55 0 0 . .
SECTOR 11 1 1.034824 0.086277 11.99
SECTOR 21 1 1.021291 0.086922 11.75
SECTOR 22 1 0.990292 0.096556 10.26
SECTOR 23 1 1.275192 0.085487 14.92
SECTOR 31 1 1.101583 0.085674 12.86
SECTOR 42 1 1.555542 0.085640 18.16
SECTOR 44 1 1.279568 0.085497 14.97
SECTOR 48 1 1.200337 0.085620 14.02
SECTOR 51 1 0.863523 0.085909 10.05
SECTOR 52 1 0.940771 0.085589 10.99
SECTOR 53 1 0.995807 0.085505 11.65
SECTOR 54 1 0.908610 0.085467 10.63
SECTOR 55 1 -0.452943 0.100108 -4.52
SECTOR 56 1 0.809018 0.085534 9.46
SECTOR 61 1 0.734224 0.085848 8.55
SECTOR 62 1 0.966935 0.085523 11.31
SECTOR 71 1 0.742947 0.085621 8.68
SECTOR 72 1 1.069889 0.085657 12.49
SECTOR 81 1 0.856258 0.085507 10.01
SECTOR 99 0 0 . .
N07_EMPLOYER E 1 1.137851 0.003439 330.84
N07_EMPLOYER N 0 0 . .
SEX1 F 1 -0.152693 0.002767 -55.18
SEX1 M 0 0 . .
VET1 1 1 0.994431 0.478293 2.08
VET1 2 0 0 . .
FOUNDED1 1 1 -0.079178 0.028948 -2.74
FOUNDED1 2 0 0 . .
PURCHASED1 1 1 -0.087677 0.028811 -3.04
PURCHASED1 2 0 0 . .
INHERITED1 1 1 -0.091757 0.028593 -3.21
INHERITED1 2 0 0 . .
RECEIVED1 1 1 -0.060479 0.028701 -2.11
RECEIVED1 2 0 0 . .
ACQYR1 1 1 0.123718 0.010722 11.54
ACQYR1 2 1 0.114924 0.010468 10.98
ACQYR1 3 1 0.122968 0.009717 12.66
ACQYR1 4 1 0.112593 0.009256 12.16
ACQYR1 5 1 0.084345 0.011292 7.47
ACQYR1 6 1 0.055247 0.011273 4.90
ACQYR1 7 1 -0.069739 0.011162 -6.25
ACQYR1 8 0 0 . .
PROVIDE1 1 1 -0.225547 0.003055 -73.82
PROVIDE1 2 0 0 . .
MANAGE1 1 1 -0.065064 0.002819 -23.08
MANAGE1 2 0 0 . .
FINANCIAL1 1 1 0.104690 0.002771 37.78
FINANCIAL1 2 0 0 . .
FNCTNABV1 1 1 -0.117711 0.006090 -19.33
FNCTNABV1 2 0 0 . .
HOURS1 1 1 -0.135239 0.008345 -16.21
HOURS1 2 1 -0.336065 0.004656 -72.19
HOURS1 3 1 -0.181169 0.004282 -42.31
HOURS1 4 1 -0.150164 0.003995 -37.59
HOURS1 5 1 -0.082393 0.003423 -24.07
HOURS1 6 0 0 . .
PRMINC1 1 1 0.242443 0.002948 82.23
PRMINC1 2 0 0 . .
SELFEMP1 1 1 0.047307 0.002353 20.10
SELFEMP1 2 0 0 . .
EDUC1 1 1 -0.166470 0.005955 -27.95
EDUC1 2 1 -0.108420 0.003966 -27.34
EDUC1 3 1 -0.181064 0.005205 -34.78
EDUC1 4 1 -0.124506 0.003872 -32.15
EDUC1 5 1 -0.146356 0.005298 -27.62
EDUC1 6 1 -0.073524 0.003406 -21.58
EDUC1 7 0 0 . .
AGE1 1 1 -0.184340 0.009761 -18.88
AGE1 2 1 0.007893 0.005418 1.46
AGE1 3 1 0.071539 0.004581 15.62
AGE1 4 1 0.071199 0.004198 16.96
AGE1 5 1 0.036784 0.003984 9.23
AGE1 6 0 0 . .
BORNUS1 1 1 -0.037608 0.004052 -9.28
BORNUS1 2 0 0 . .
DISVET1 1 1 -1.093928 0.478399 -2.29
DISVET1 2 1 -1.034112 0.478302 -2.16
DISVET1 3 0 0 . .
ESTABLISHED 1 1 0.133473 0.009513 14.03
ESTABLISHED 2 1 0.135956 0.009681 14.04
ESTABLISHED 3 1 0.132354 0.008972 14.75
ESTABLISHED 4 1 0.111280 0.008763 12.70
ESTABLISHED 5 1 0.095458 0.009609 9.93
ESTABLISHED 6 1 0.094962 0.009231 10.29
ESTABLISHED 7 1 0.071583 0.010832 6.61
ESTABLISHED 8 1 0.051642 0.010923 4.73
ESTABLISHED 9 1 -0.016931 0.010863 -1.56
ESTABLISHED A 0 0 . .
SCSAVINGS 1 1 -0.017371 0.003576 -4.86
SCSAVINGS 2 0 0 . .
SCASSETS 1 1 -0.056578 0.004313 -13.12
SCASSETS 2 0 0 . .
SCEQUITY 1 1 -0.039681 0.005002 -7.93
SCEQUITY 2 0 0 . .
SCCREDIT 1 1 -0.051551 0.003860 -13.36
SCCREDIT 2 0 0 . .
SCGOVTLOAN 1 1 -0.028818 0.013311 -2.16
SCGOVTLOAN 2 0 0 . .
SCGOVTGUAR 1 1 0.022142 0.012371 1.79
SCGOVTGUAR 2 0 0 . .
SCBANKLOAN 1 1 0.055183 0.004020 13.73
SCBANKLOAN 2 0 0 . .
SCFAMLOAN 1 1 -0.018966 0.006370 -2.98
SCFAMLOAN 2 0 0 . .
SCVENTURE 1 1 -0.096289 0.022065 -4.36
SCVENTURE 2 0 0 . .
SCGRANT 1 1 -0.145460 0.028804 -5.05
SCGRANT 2 0 0 . .
SCOTHER 1 1 -0.035163 0.008224 -4.28
SCOTHER 2 0 0 . .
SCDONTKNOW 1 1 -0.052654 0.008604 -6.12
SCDONTKNOW 2 0 0 . .
SCAMOUNT 1 1 -0.002310 0.004671 -0.49
SCAMOUNT 2 1 0.062655 0.005589 11.21
SCAMOUNT 3 1 0.087967 0.005551 15.85
SCAMOUNT 4 1 0.131828 0.006121 21.54
SCAMOUNT 5 1 0.182133 0.006361 28.63
SCAMOUNT 6 1 0.247689 0.006671 37.13
SCAMOUNT 7 1 0.385266 0.007672 50.22
SCAMOUNT 8 1 0.587252 0.011676 50.29
SCAMOUNT 9 1 0.186741 0.006260 29.83
SCAMOUNT A 0 0 . .
HOMEBASED 1 1 -0.193209 0.002631 -73.42
HOMEBASED 2 0 0 . .
FRANCHISE 1 1 0.095547 0.008042 11.88
FRANCHISE 2 0 0 . .
FRANCHISER50 1 1 -0.022814 0.012799 -1.78
FRANCHISER50 2 0 0 . .
ECSAVINGS 1 1 -0.071333 0.003629 -19.66
ECSAVINGS 2 0 0 . .
ECASSETS 1 1 -0.050133 0.005692 -8.81
ECASSETS 2 0 0 . .
ECEQUITY 1 1 0.032494 0.005389 6.03
ECEQUITY 2 0 0 . .
ECCREDIT 1 1 -0.024643 0.003773 -6.53
ECCREDIT 2 0 0 . .
ECGOVTLOAN 1 1 0.038285 0.015650 2.45
ECGOVTLOAN 2 0 0 . .
ECBANKLOAN 1 1 0.103370 0.004219 24.50
ECBANKLOAN 2 0 0 . .
ECVENTURE 1 1 -0.246374 0.031677 -7.78
ECVENTURE 2 0 0 . .
ECPROFITS 1 1 0.031553 0.003818 8.27
ECPROFITS 2 0 0 . .
ECGRANT 1 1 -0.158893 0.028705 -5.54
ECGRANT 2 0 0 . .
ECOTHER 1 1 -0.032423 0.012595 -2.57
ECOTHER 2 0 0 . .
ECDONTKNOW 1 1 -0.088126 0.007204 -12.23
ECDONTKNOW 2 0 0 . .
ECNOACCESS 1 1 -0.095309 0.010197 -9.35
ECNOACCESS 2 0 0 . .
ECNOEXPAND 1 1 -0.043675 0.003966 -11.01
ECNOEXPAND 2 0 0 . .
FEDERAL 1 1 0.041492 0.007992 5.19
FEDERAL 2 0 0 . .
OTHERBUS 1 1 0.045852 0.003214 14.27
OTHERBUS 2 0 0 . .
INDIVIDUALS 1 1 -0.149726 0.003468 -43.17
INDIVIDUALS 2 0 0 . .
EXPORTS 1 1 0.073348 0.007377 9.94
EXPORTS 2 1 0.138000 0.010933 12.62
EXPORTS 3 1 0.177615 0.013409 13.25
EXPORTS 4 1 0.214486 0.016534 12.97
EXPORTS 5 1 0.166936 0.016124 10.35
EXPORTS 6 1 0.178288 0.016120 11.06
EXPORTS 7 1 0.293766 0.017115 17.16
EXPORTS 8 1 0.320932 0.021770 14.74
EXPORTS 9 0 0 . .
FULLTIME 1 1 0.179631 0.003769 47.67
FULLTIME 2 0 0 . .
PARTTIME 1 1 -0.020924 0.003068 -6.82
PARTTIME 2 0 0 . .
LEASED 1 1 0.334245 0.011579 28.87
LEASED 2 0 0 . .
CONTRACTORS 1 1 0.246384 0.002523 97.67
CONTRACTORS 2 0 0 . .
HEALTHINS 1 1 0.078051 0.004257 18.33
HEALTHINS 2 0 0 . .
RETIREMENT 1 1 0.170154 0.004144 41.06
RETIREMENT 2 0 0 . .
PROFITSHARE 1 1 -0.020966 0.006971 -3.01
PROFITSHARE 2 0 0 . .
HOLIDAYS 1 1 0.120158 0.004618 26.02
HOLIDAYS 2 0 0 . .
BENENABV 1 1 -0.074976 0.005111 -14.67
BENENABV 2 0 0 . .
WEBSITE 1 1 0.019704 0.002904 6.78
WEBSITE 2 0 0 . .
ECOMMPCT 1 1 -0.033776 0.009894 -3.41
ECOMMPCT 2 1 -0.076199 0.010547 -7.22
ECOMMPCT 3 1 -0.065350 0.013496 -4.84
ECOMMPCT 4 1 -0.107320 0.012424 -8.64
ECOMMPCT 5 1 -0.053331 0.012200 -4.37
ECOMMPCT 6 1 -0.060335 0.010540 -5.72
ECOMMPCT 7 1 -0.093930 0.013852 -6.78
ECOMMPCT 8 1 -0.158731 0.015935 -9.96
ECOMMPCT 9 0 0 . .
ONLINEPURCH 1 1 0.003940 0.002547 1.55
ONLINEPURCH 2 0 0 . .
LT40HOURS 1 1 -0.060922 0.004991 -12.21
LT40HOURS 2 0 0 . .
LT12MONTHS 1 1 -0.050281 0.004365 -11.52
LT12MONTHS 2 0 0 . .
SEASONAL 1 1 -0.038413 0.005919 -6.49
SEASONAL 2 0 0 . .
OCCASIONALLY 1 1 -0.102215 0.005775 -17.70
OCCASIONALLY 2 0 0 . .
ACTIVITYNABV 1 1 0.189169 0.005265 35.93
ACTIVITYNABV 2 0 0 . .
OPERATING 1 1 0.247984 0.003370 73.58
OPERATING 2 0 0 . .
CEASENR 1 1 0.062078 0.022437 2.77
CEASENR 2 0 0 . .
HUSBWIFE 1 1 -0.064987 0.004525 -14.36
HUSBWIFE 2 1 -0.068549 0.004097 -16.73
HUSBWIFE 3 1 -0.126576 0.006033 -20.98
HUSBWIFE 4 0 0 . .
NUMOWNERS 1 1 0.300708 0.014590 20.61
NUMOWNERS 2 1 0.419767 0.015134 27.74
NUMOWNERS 3 1 0.486225 0.016239 29.94
NUMOWNERS 4 1 0.489487 0.017344 28.22
NUMOWNERS 5 1 0.505033 0.017709 28.52
NUMOWNERS 6 1 0.444110 0.023362 19.01
NUMOWNERS 7 1 0.051559 0.046352 1.11
NUMOWNERS 8 0 0 . .
race1noblanks A 1 -0.020036 0.005832 -3.44
race1noblanks B 1 -0.185359 0.006773 -27.37
race1noblanks H 1 -0.067546 0.005681 -11.89
race1noblanks I 1 -0.072423 0.014317 -5.06
race1noblanks Mixed 1 0.004210 0.029596 0.14
race1noblanks P 1 -0.131819 0.039328 -3.35
race1noblanks S 1 0.002974 0.026112 0.11
race1noblanks W 0 0 . .
LANGUAGE Only English 1 0.216239 0.017187 12.58
LANGUAGE Only English and Other 1 0.143442 0.017956 7.99
LANGUAGE Only English and Spanish 1 0.219765 0.017202 12.78
LANGUAGE Only Other Language 1 0.007221 0.023920 0.30
LANGUAGE Only Spanish 0 0 . .
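Because the response in this model is log(Receipts), each estimate is on the log scale. The sketch below converts a few estimates from the table into an approximate percent difference in receipts relative to the reference level (the row whose estimate is 0). It assumes the transform was the natural logarithm, which the output does not state explicitly; `percent_change` is a hypothetical helper name, not part of the original analysis.

```python
import math


def percent_change(coefficient: float) -> float:
    """Percent change in the response implied by a coefficient on the
    natural-log scale, holding the other effects fixed."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Estimates copied from the Parameter Estimates table above.
estimates = {
    "SEX1 F (vs. M)": -0.152693,
    "HOMEBASED 1 (vs. 2)": -0.193209,
    "N07_EMPLOYER E (vs. N)": 1.137851,
}

for name, beta in estimates.items():
    print(f"{name}: {percent_change(beta):+.1f}%")
```

For small coefficients the percent change is close to 100 times the coefficient; for larger ones, such as the employer indicator, the exponential form matters considerably.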
Appendix B: Regression of All Available Variables
The SAS System
The SURVEYREG Procedure
Regression Analysis for Dependent Variable RECEIPTS_NOISY
Data Summary
Number of Observations 874182
Sum of Weights 10068531
Weighted Mean of RECEIPTS_NOISY 251.05405
Weighted Sum of RECEIPTS_NOISY 2527745467
Fit Statistics
R-square 0.5771
Root MSE 446.73
Denominator DF 874181
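The summary figures above are related by simple arithmetic, which can be checked directly. This is a sketch under the assumption that no strata or clusters were specified, in which case the denominator degrees of freedom equal the number of observations minus one; the variable names are illustrative.

```python
# Values copied from the Data Summary table above.
n_obs = 874182
sum_of_weights = 10068531
weighted_sum = 2527745467

# The weighted mean is the weighted sum divided by the sum of weights.
weighted_mean = weighted_sum / sum_of_weights

# Assumption: no STRATA or CLUSTER statements, so DF = n - 1.
denominator_df = n_obs - 1

print(f"Weighted mean of RECEIPTS_NOISY = {weighted_mean:.5f}")
print(f"Denominator DF = {denominator_df}")
```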
Class Level Information
Class Variable Levels Values
FIPST 43 01 04 05 06 08 09 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 39 40 41 42 45 47 48 49 51 53 54 55
SECTOR 20 11 21 22 23 31 42 44 48 51 52 53 54 55 56 61 62 71 72 81 99
N07_EMPLOYER 2 E N
SEX1 2 F M
VET1 2 1 2
FOUNDED1 2 1 2
PURCHASED1 2 1 2
INHERITED1 2 1 2
RECEIVED1 2 1 2
ACQUIRENR1 1 2
ACQYR1 8 1 2 3 4 5 6 7 8
PROVIDE1 2 1 2
MANAGE1 2 1 2
FINANCIAL1 2 1 2
FNCTNABV1 2 1 2
FNCTNR1 1 2
HOURS1 6 1 2 3 4 5 6
PRMINC1 2 1 2
SELFEMP1 2 1 2
EDUC1 7 1 2 3 4 5 6 7
AGE1 6 1 2 3 4 5 6
BORNUS1 2 1 2
DISVET1 3 1 2 3
ESTABLISHED 10 1 2 3 4 5 6 7 8 9 A
SCSAVINGS 2 1 2
SCASSETS 2 1 2
SCEQUITY 2 1 2
SCCREDIT 2 1 2
SCGOVTLOAN 2 1 2
SCGOVTGUAR 2 1 2
SCBANKLOAN 2 1 2
SCFAMLOAN 2 1 2
SCVENTURE 2 1 2
SCGRANT 2 1 2
SCOTHER 2 1 2
SCDONTKNOW 2 1 2
SCNONENEEDED 2 1 2
SCNOTREPORTED 1 2
SCAMOUNT 10 1 2 3 4 5 6 7 8 9 A
HOMEBASED 2 1 2
FRANCHISE 2 1 2
FRANCHISER50 2 1 2
ECSAVINGS 2 1 2
ECASSETS 2 1 2
ECEQUITY 2 1 2
ECCREDIT 2 1 2
ECGOVTLOAN 2 1 2
ECGOVTGUAR 2 1 2
ECBANKLOAN 2 1 2
ECFAMLOAN 2 1 2
ECVENTURE 2 1 2
ECPROFITS 2 1 2
ECGRANT 2 1 2
ECOTHER 2 1 2
ECDONTKNOW 2 1 2
ECNOACCESS 2 1 2
ECNOEXPAND 2 1 2
ECNOTREPORTED 1 2
FEDERAL 2 1 2
STATELOCAL 2 1 2
OTHERBUS 2 1 2
INDIVIDUALS 2 1 2
CUSTNR 1 2
EXPORTS 9 1 2 3 4 5 6 7 8 9
OPSOUTSIDE 2 1 2
OUTSOURCE 2 1 2
FULLTIME 2 1 2
PARTTIME 2 1 2
DAYLABOR 2 1 2
TEMPSTAFF 2 1 2
LEASED 2 1 2
CONTRACTORS 2 1 2
EMPNR 1 2
HEALTHINS 2 1 2
RETIREMENT 2 1 2
PROFITSHARE 2 1 2
HOLIDAYS 2 1 2
BENENABV 2 1 2
BENENR 1 2
WEBSITE 2 1 2
ECOMMERCE 2 1 2
ECOMMPCT 9 1 2 3 4 5 6 7 8 9
ONLINEPURCH 2 1 2
LT40HOURS 2 1 2
LT12MONTHS 2 1 2
SEASONAL 2 1 2
OCCASIONALLY 2 1 2
ACTIVITYNABV 2 1 2
ACTIVITYNR 1 2
OPERATING 2 1 2
CEASENR 2 1 2
CEASENA 2 1 2
HUSBWIFE 4 1 2 3 4
FAMILYBUS 2 1 2
NUMOWNERS 8 1 2 3 4 5 6 7 8
race1noblanks 8 A B H I Mixed P S W
LANGUAGE 5 Only English Only English and Other Only English and Spanish Only Other Language Only Spanish
region 9 East Sout Mid-Atlan Midwest Mountain Northeast Pacific South Atl West Nort West Sout
Tests of Model Effects
Effect Num DF F Value Pr > F
Model 215 1538.56
PRMINC1 1 592.68
ECCREDIT 1 477.03
HEALTHINS 1 1027.62
Appendix C: Full Tables of Capital Regression
Regressing Startup Capital Variables Against Receipts in California
Parameter Estimate Standard Error t Value Pr > |t|
Intercept 102.007317 21.726311 4.70
SCFAMLOAN 1 180.68722 20.0165018 9.03
Intercept 231.62717 10.454859 22.15