2014 independent study on the census bureau's survey of business owners public use microdata sample

Upload: arthur-wu

Post on 03-Jun-2018


  • 8/12/2019 2014 Independent Study on the Census Bureau's Survey of Business Owners Public Use Microdata Sample

    1/56

    Analysis of the 2007 Survey of Business Owners

    Public Use Microdata Sample

    Arthur Wu

    Supervisor: Professor Amber Tomas

    STAT 4993 Independent Study

    5/9/2013


    Table of Contents

    Introduction
    Overview of the Survey of Business Owners Public Microdata Sample
        Primary Methodology
        Missing Data
        Nonresponse
        Inherent Differences between SBO and PUMS Information
    Data Manipulation and Cleaning
        Subsetting to Small Businesses and One Owner Data
        Tabulation Weights
    Fitting a Regression
        Regression One: All Variables Against Receipts
        Model Selection Techniques
        Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion
        Regression Three: PROC GLMSELECT with Akaike's Information Criterion
        Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform
        Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts
    Analysis of Language
        Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?
        Research Question 1b: Out of only existing Only English and Spanish businesses, which are the most popular industries for business?
        Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States
    Analysis of Capital Sources in California versus the United States
        Dataset Overview
        Research Question 2a: What sources of capital have the most positive relationship with receipts?
        Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?
    Conclusion
    Lessons Learned
    Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment
    Appendix B: Regression of All Available Variables
    Appendix C: Full Tables of Capital Regression


    Introduction

    In selecting my topic for independent study, I wanted to combine the skills I had learned for statistical computing software with my main passion and primary fields of study: business and entrepreneurship. As a result, I originally intended to build a predictive model for new or small business venture success. According to the U.S. Small Business Administration (SBA), small businesses represent 99.7 percent of all employer firms. Since 1995, small businesses have generated 64 percent of new jobs and paid 44 percent of the total United States private payroll, according to the SBA. However, I quickly realized that there was a dearth of accurate, thorough, and easily accessible data with a large enough sample size to satisfy the normality assumption across all sectors and states. Moreover, attempting to answer this question would require significant longitudinal data at the individual business level. Ultimately, I relied on the U.S. Census Bureau's Survey of Business Owners (SBO) Public Use Microdata Sample (PUMS), with which I examine entrepreneurial activity and the relationships between business characteristics such as access to capital, firm size, employer-paid benefits, minority ownership, and firm age. In this report, I detail how I cleaned the 2007 SBO PUMS data, developed a regression model, and conducted more in-depth analyses of the relationships between specific variables.

    Overview of the Survey of Business Owners Public Microdata Sample

    Primary Methodology

    The 2007 Survey of Business Owners (SBO) questionnaire, Form SBO-1, was mailed to a random sample of 2.3 million businesses selected from a list of 27 million firms operating during 2007 with receipts of $1,000 or more. The list of all firms (the sampling universe) was derived from both official business tax returns and data collected on other economic census reports. The Census Bureau obtained electronic files from the Internal Revenue Service (IRS) for all companies reporting any business activity on 2007 IRS tax forms such as Form 1040 and Form 1065.

    The SBO is part of the Economic Census program, which the Census Bureau is required by law to conduct every 5 years, for years ending in "2" and "7." The Census Bureau combines and crosschecks data from the SBO with data from other economic surveys, economic censuses, and administrative records. The published data include number of firms (both firms with paid employees and firms with no paid employees), sales and receipts, number of paid employees, and annual payroll; they are presented by kind of business, geographic area, and size of firm (employment and receipts). The results also contain summary statistics on the composition of businesses in the United States by gender, ethnicity, race, and veteran status. Additional demographic and economic characteristics of business owners and their businesses are included, such as: owner's age, education level, hours worked, and primary function in the business; family- and home-based businesses; types of customers and workers; sources of financing for start-up, expansion, or capital improvements; outsourcing; use of Internet and e-commerce; and employer-paid benefits.

    The IRS provided certain identification, classification, and measurement data for businesses filing those forms. For most firms with paid employees, the Census Bureau also collected employment, payroll, receipts, and kind of business for each plant, store, or physical location during the 2007 Economic Census.

    For the 2007 SBO, firms could either report electronically by using Census Taker, the Census Bureau's secure online interactive application, or return their completed form by mail. Three report form re-mails to employer firms and two report form re-mails to nonemployer firms were conducted at one-month intervals to all delinquent respondents. The returned forms underwent extensive review and computer processing. All reports were geographically coded, data-keyed, and edited.

    This wealth of data provides a resource to many parties, from government officials to industry organization leaders. For example, the data allow agencies such as the Small Business Administration to identify and address the needs of small businesses in the United States. In the private sector, consultants and researchers can analyze long-term economic and demographic shifts, as well as differences in ownership and performance among geographic areas.

    Survey Overview:

    Form SBO-1, given to every sampled business, primarily asked for basic information about the business in general while focusing on the demographics and level of ownership for each listed owner. There are only 9 numeric variables (tabulation weight, total revenues with injected noise, payroll with injected noise, employment with injected noise, and general ownership percentages for up to four owners), while the hundreds of other variables are character variables (age, education, startup capital type, race, ethnicity, etc.). These character variables usually take the form of a binary yes/no answer, except for variables with multiple levels (education, age, and race).


    Missing Data

    For the numeric variables of the cleaned data set (receipts, payroll, employment, and percent of ownership), there were no missing values. However, for most character variables, the percentage of missing data ranged from 20% to 40%. There were also several variables for which over 90% of the observations had missing values, such as whether the owner is retired, whether the owner is deceased, or whether the business had low or inadequate sales.

    General Owner and Business Characteristics:
        Sector
        Employer status
        Random group (for variance estimation)
        Tabulation weight
        Measures of size (noise-infused for disclosure avoidance): Employment, Payroll, Receipts
        Individual owner information (for up to four owners): Percentage ownership, Gender, Ethnicity, Race, Veteran status

    Additional Owner Characteristics:
        How the owner initially acquired the business
        When the owner acquired the business
        Owner's primary function in the business
        Owner's average number of hours per week spent working in the business
        Whether the business provided the owner's primary source of personal income
        Whether the owner previously owned a business or had been self-employed
        Owner's educational background
        Owner's age
        Whether the owner was born in the United States
        If the owner was a veteran, whether the owner was disabled as the result of injury incurred during active military service

    Additional Business Characteristics:
        Year business was established
        Source(s) of start-up or acquisition capital
        Amount of start-up or acquisition capital
        Home-based business
        Operated as a franchise
        Owned by a franchise
        Source(s) of capital used to expand business
        Types of customers
        Percent of total sales exported
        Operations established outside the United States
        Outsourced any business function outside the United States
        Language(s) used in transactions
        Types of workers employed
        Employer-paid benefits offered
        Whether the company had a website
        Whether the company had e-commerce sales
        E-commerce as a percentage of total sales
        Whether the company made online purchases
        Business activity (e.g., seasonal or part-time)
        Whether the business currently operates
        Reasons for ceasing operations
        Joint ownership by husband and wife
        Family-owned business
        Number of owners

    Nonresponse

    Approximately 62 percent of the 2.3 million businesses in the SBO sample responded to the survey, compared to 75 percent for the 2002 survey. For the 2007 survey, 72 percent of the companies in the SBO sample returned a questionnaire, but 10 percent of the returns did not contain enough information to be considered a response for the estimates by race, gender, ethnicity, or veteran status. Many of these respondents were sole proprietors that answered "No" to Item 8, "In 2007, did any individual own 10% or more of the rights, claims, interests, or stock in this business?" Another identified issue was the duality between ethnicity (Hispanic vs. non-Hispanic) and race (White, Black, Asian, American Indian). Every Hispanic business owner also had to identify at least one race, which may lead to an indication of mixed race when an owner is solely Hispanic in heritage. This required variable manipulation later on to correct.

    According to the U.S. Census, about 4 percent of the 2007 nonrespondents were selected for and responded to the 2002 SBO. For these firms, data from the 2002 survey were used in place of the missing 2007 responses. For the remaining nonrespondents, gender, ethnicity, race, and veteran status were imputed from donor respondents in the same sampling frame with similar characteristics (state, industry, employment status, size). Because the assignment of businesses to sampling frames relies heavily on administrative data, and there is a high level of agreement between sampling frame assignment and tabulated race or ethnicity for responding firms, the donor imputations are considered to be reliable. Estimates of sampling variability are adjusted to account for nonresponse. Estimates with high error (relative standard error for sales or receipts of 50 percent or more) are suppressed. Overall, imputed data accounted for approximately 47 percent of the firm count estimates by gender, ethnicity, race, and veteran status and approximately 20 percent of the estimates of sales.

    Inherent Differences between SBO and PUMS Information

    The Public Use Microdata Sample (PUMS) is a large dataset available to the public, derived from the original SBO dataset of responses. According to the U.S. Census Bureau, measures were taken in constructing the PUMS file to protect the confidentiality of the SBO data so that it can be used freely by the public. In the PUMS file, each record corresponds to a business, but deliberate measures were taken to ensure the anonymity of each business. For businesses operating in multiple states and/or industry sectors, one record exists for each state combination in which the firm conducts business. Identifiers to link the component records of a business are not included. Additionally, businesses classified in the SBO as publicly owned or not classifiable by gender, ethnicity, race, or veteran status are not included in the PUMS file because many publicly owned firms are easily identifiable. Since the primary focus of my research is on small businesses, the exclusion of possibly larger public corporations does not significantly impact the integrity of the data.

    Finally, the U.S. Census Bureau infused noise into the PUMS data for disclosure avoidance and confidentiality protection. Values are perturbed prior to tabulation by applying a random noise multiplier to the magnitude data, such as the sales and receipts for all firms. This introduced variation perturbed data points by no more than a few percentage points.


    Data Manipulation and Cleaning

    Subsetting to Small Businesses and One Owner Data

    According to the Small Business Administration, small businesses are generally considered to have fewer than 500 employees and under $7 million in receipts, depending on the industry. Subsetting the original dataset according to these parameters decreased the total observation count from 2,165,680 to 2,025,530.
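    The subsetting rule above can be illustrated with a minimal sketch. It is written in Python rather than the SAS used for the analysis, the records are hypothetical, and it assumes that receipts are stored in thousands of dollars; the field names simply reuse the noise-infused PUMS variable names cited later in this report.

```python
# Hypothetical records; receipts are ASSUMED to be in thousands of dollars.
records = [
    {"EMPLOYMENT_NOISY": 12, "RECEIPTS_NOISY": 950},
    {"EMPLOYMENT_NOISY": 640, "RECEIPTS_NOISY": 4200},
    {"EMPLOYMENT_NOISY": 35, "RECEIPTS_NOISY": 8100},
    {"EMPLOYMENT_NOISY": 0, "RECEIPTS_NOISY": 60},
]

MAX_EMPLOYEES = 500             # fewer than 500 employees
MAX_RECEIPTS_THOUSANDS = 7000   # under $7 million in receipts (assumed units)


def is_small_business(rec):
    """Return True when a record meets both small-business thresholds."""
    return (rec["EMPLOYMENT_NOISY"] < MAX_EMPLOYEES
            and rec["RECEIPTS_NOISY"] < MAX_RECEIPTS_THOUSANDS)


small = [rec for rec in records if is_small_business(rec)]
print(len(records), "->", len(small))
```

    Only records that satisfy both conditions are kept, which mirrors the drop in observation count described above.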

    Furthermore, according to the 2007 SBO PUMS Guide, a response of 0 for most of the qualitative questions indicated that the data for these variables were missing. As a result, each 0 was converted to a missing value to more accurately reflect the nature of this data and to correctly set up regression procedures later on.
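    A small sketch of this recoding step, again in Python with hypothetical data: the helper name and the choice of which fields count as qualitative are assumptions made for illustration, not the procedure actually used in SAS.

```python
def zero_to_missing(record, qualitative_fields):
    """Return a copy of the record with 0 codes in qualitative fields set to None."""
    cleaned = dict(record)
    for field in qualitative_fields:
        if cleaned.get(field) in (0, "0"):
            cleaned[field] = None  # treat the 0 code as a missing value
    return cleaned


# Hypothetical record: EDUC1 was not answered, HEALTHINS was answered.
raw = {"EDUC1": "0", "HEALTHINS": "1", "EMPLOYMENT_NOISY": 0}
clean = zero_to_missing(raw, qualitative_fields=["EDUC1", "HEALTHINS"])
print(clean)
```

    Numeric measures such as employment are left untouched, since a 0 there is a genuine value rather than a missing-data code.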

    Another major issue for data analysis was the inclusion of data for up to three additional business owners. As a result, there exist three additional sets of demographic variables. However, if a business only has one owner, these three additional sets of variables would all have missing values. Given that PROC REG and other regression procedures exclude observations that have even one missing value, keeping the sixty extra variables for additional owners would lead to the exclusion of a significant number of observations. Moreover, I noticed that virtually every business had responses to all variables describing the first owner, and that the first owner almost always owned as much of the business as (if not more than) his or her other one to three business partners. This justified my decision to remove all variables affiliated with the second, third, and fourth owners.
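    The effect of this complete-case exclusion can be seen in a toy sketch. The rows below are invented, and SEX2 and EDUC2 are hypothetical stand-ins for the second-owner variables; the function imitates listwise deletion in general rather than any specific SAS procedure.

```python
def complete_cases(rows, fields):
    """Keep only rows with no missing (None) value in any of the given fields."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]


# Hypothetical rows: the first business has a single owner.
rows = [
    {"SEX1": "M", "EDUC1": "5", "SEX2": None, "EDUC2": None},
    {"SEX1": "F", "EDUC1": "4", "SEX2": "M", "EDUC2": "6"},
    {"SEX1": "F", "EDUC1": None, "SEX2": None, "EDUC2": None},
]

first_owner = ["SEX1", "EDUC1"]
all_owners = first_owner + ["SEX2", "EDUC2"]

kept_all = complete_cases(rows, all_owners)
kept_first = complete_cases(rows, first_owner)
print(len(kept_all), "rows with all owners;", len(kept_first), "rows with first owner only")
```

    Dropping the additional-owner fields lets the single-owner business stay in the analysis, which is the motivation given above.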

    One future recommendation would be to keep these deleted variables and find other ways to analyze the overall dataset despite the unavoidable extra missing values. Analyzing the relationships between multiple business owners (while also factoring in age, experience, and education) could be incredibly valuable for the study of organizational behavior and business.

    Additional deleted variables included those with a missing rate over 90%, since they severely diminished the total number of observations used in regression. For the purposes of my research questions, having more observations to interpret is preferable, but including these variables in a separate analysis of businesses that have ceased activity would also be worthwhile. These variables include:

    CEASEOTHER: Ceased operations for another reason
    SOLDBUS: Sold this business
    STARTANOTHER: Started another business
    NOPERSCRED: Lack of personal loans/credit
    NOBUSCRED: Lack of business loans/credit
    LOWSALES: Inadequate cash flow or low sales
    ONETIME: Operated for one-time event
    DECEASED: Owner died
    RETIRE: Owner retired


    Lastly, I consolidated all language variables and the race and ethnicity variables in order to improve overall adjusted fit and reduce noise. After analyzing the spread of the many language variables in the dataset (English, Arabic, Chinese, French, German, Greek, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tagalog, and Vietnamese), I found that the most frequent languages spoken during business transactions are, naturally, English and Spanish. The other languages collectively do not constitute more than 3% of the total number of observations, so I created a new language variable with five levels: Only English, Only Spanish, Only English and Spanish, Only English and Other, and Only Other Language. This effectively consolidated sixteen binary variables into one variable with five levels. Additionally, the race variable had at least twenty levels due to the combinations of different races, which encouraged consolidation. Race and ethnicity were combined to create a new variable, which followed this algorithm:

    1. If ethnicity is Hispanic, then Race/Ethnicity is Hispanic.
    2. If ethnicity is not Hispanic and the owner is White AND another minority, then the owner is considered part of the other minority.
    3. If ethnicity is not Hispanic and the owner is only White or only another minority, then the race stays as listed.
    4. If ethnicity is not Hispanic and the owner is at least two types of race (both of which are not White), then the owner is considered Mixed.

    As a result, the final levels for the new race/ethnicity variable are: W for White, B for Black, A for Asian, H for Hispanic, I for American Indian, P for NHOPI (Native Hawaiian and Other Pacific Islander), and Mixed (for any combination of two non-White and non-Hispanic races, such as Black/Asian).
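    The four rules above can be sketched as a single function. This is an illustrative Python rendering, not the SAS code used for the analysis; the function name and the set-based input are assumptions, and the handling of an owner who is White plus two or more other races (treated here as Mixed under rule 4) is my reading of the rules.

```python
def consolidate_race_ethnicity(is_hispanic, races):
    """Collapse ethnicity and a set of race codes into one level.

    Race codes follow the report's levels: W, B, A, I, P.
    """
    races = set(races)
    if is_hispanic:                 # Rule 1: Hispanic ethnicity takes priority
        return "H"
    non_white = races - {"W"}
    if len(non_white) >= 2:         # Rule 4: two or more non-White races
        return "Mixed"
    if len(non_white) == 1:         # Rules 2 and 3: a single minority race
        return next(iter(non_white))
    if "W" in races:                # Rule 3: only White
        return "W"
    return None                     # no race reported


examples = [
    (True, {"W"}),
    (False, {"W", "A"}),
    (False, {"B"}),
    (False, {"B", "A"}),
]
for hispanic, races in examples:
    print(hispanic, sorted(races), "->", consolidate_race_ethnicity(hispanic, races))
```

    Checking Hispanic ethnicity first avoids the spurious mixed-race classification that arises when a Hispanic owner is also required to report a race.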

    Tabulation Weights

    In most surveys, some groups are over-represented in the raw data and others are under-represented. To address this, weights are assigned to each observation to compensate for the over- or under-representation. While the exact method of determining these weights for the PUMS is unknown, the values of the tabulation weight range from 1.0 to 35.0. In this sense, a single observation with a weight of 35.0 would functionally be the same as thirty-five individual observations with the same values, each with a weight of 1.0. Analysis of the data therefore requires the weights to be properly factored into averages, percentages, and regressions through the proper SAS procedures.
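    To make the role of the weights concrete, here is a minimal Python sketch of a weighted average, using two invented observations. It also checks the equivalence described above: one observation with weight 35.0 behaves like thirty-five copies with weight 1.0. The variable names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average: sum of value * weight divided by the sum of weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical receipts and tabulation weights for two sampled firms.
receipts = [100.0, 200.0]
tabulation_weights = [35.0, 1.0]

weighted = weighted_mean(receipts, tabulation_weights)

# Same data written out as individual observations of weight 1.0.
expanded = [100.0] * 35 + [200.0]
unweighted_expanded = sum(expanded) / len(expanded)

print(weighted, unweighted_expanded)
```

    An unweighted mean of the two rows would be 150.0, which overstates the influence of the firm that represents only itself.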

    Fitting a Regression

    I first wanted to fit a regression of all variables against receipts to observe the overall coefficient of determination and the comparative significance of each individual variable in determining receipts. This would immediately confirm or deny several of my possible research questions about the variables and would then be a starting point for further research ideas. I also considered other possible response variables besides receipts, such as employment, payroll, and certain categorical variables such as whether or not the business was still operating. Ultimately, I decided to focus primarily on receipts as the response variable, since, in general business theory, an increase in either payroll or employment usually follows an increase in overall revenues.

    To begin, I had to identify the proper statistical procedure to regress several quantitative and multi-level categorical variables against receipts. I also wanted to use a procedure that would automatically create dummy variables and evaluate correlations while controlling for other variables. Ultimately, I selected the generalized linear model procedure because it satisfied these criteria. Its limitations include the requirement that responses be independent of one another (which is extremely difficult to establish given game theory and the general economics of competition) and the assumption that the relationship between the predictors and the response is linear.

    Regression One: All Variables Against Receipts

    Data Summary:
    Number of Observations    874,182

    Included Variables: RECEIPTS_NOISY EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks OPERATING LANGUAGE

    Further analysis shows that these are the top five variables based on both the magnitude of the t-value and the estimate, as these parameters indicate both statistical and practical significance. Since the vast majority of variables are significant in the model at the 0.05 significance level, the magnitude of the estimate is the most decisive factor in establishing importance.

    Estimated Regression Coefficients

    Parameter                         Estimate    Standard Error    t Value    Pr > |t|
    SECTOR 11 Agriculture, Fishing    273.5380    19.767836         13.84
    HOLIDAYS 1                        168.2109    5.420379          31.03

    sets the significance level at which variables can be removed from the model, which is once again usually 0.05.

    Stepwise Regression

    Stepwise regression is a fusion of forward and backward selection. At each stage, four options are considered: add a term, delete a term, swap a term in the model for one not in the model, or stop. This algorithm is the one most often used in practice. Despite its widespread use, it has little theoretical basis. A more theoretically robust tool such as Akaike's Information Criterion (AIC) can also be used as a metric to assess models. Limitations O

    Fit Statistics

    Adjusted Coefficient of Determination / Adjusted R-Square

    The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if a new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it usually is not. It is never higher than the R-squared.
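    As a brief illustration, the usual textbook adjustment is 1 - (1 - R-squared)(n - 1)/(n - p - 1) for n observations and p predictors with an intercept. The Python sketch below applies that standard formula to made-up inputs; it is not taken from the SAS output in this report.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n observations and p predictors (plus an intercept)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


# Hypothetical small-sample cases.
moderate = adjusted_r_squared(0.5, n=11, p=2)   # penalized below 0.5
weak = adjusted_r_squared(0.1, n=11, p=5)       # weak fit, many predictors

print(moderate, weak)
```

    The second case shows how a weak fit with many predictors relative to the sample size can push the adjusted value below zero.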

    Akaike Information Criterion

    The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the estimated Kullback-Leibler divergence between the model and the truth. It is based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that fits the truth well with few parameters. It is defined as:

    $$\mathrm{AIC} = -2 \ln(L) + 2K$$

    where the likelihood L is the probability of the data given a model and K is the number of free parameters in the model. For a least-squares regression with normally distributed errors, this reduces, up to an additive constant, to $n \ln(\mathrm{SSE}/n) + 2K$. AIC scores are often shown as ΔAIC scores, the difference between each model and the best model (smallest AIC), so the best model has a ΔAIC of zero. Used in stepwise regression, AIC can replace the p-value as the main criterion for model selection. Each iterative model's AIC should be calculated and compared to the previous one, and the new model should be preferred only if its AIC is smaller than the AIC of the prior model. This process continues until the best model is selected.
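    A short Python sketch of how AIC and ΔAIC scores could be compared across candidate models. It uses the least-squares form n ln(SSE/n) + 2K, which equals -2 ln(likelihood) + 2K up to an additive constant when errors are normal. The SSE values, parameter counts, and model names are invented for illustration.

```python
import math


def aic_least_squares(sse, n, k):
    """AIC for a least-squares fit, up to an additive constant."""
    return n * math.log(sse / n) + 2 * k


def delta_aic(aic_by_model):
    """Difference between each model's AIC and the smallest AIC."""
    best = min(aic_by_model.values())
    return {name: value - best for name, value in aic_by_model.items()}


# Hypothetical candidates: (SSE, number of free parameters).
candidates = {"full": (480.0, 12), "reduced": (500.0, 6)}
n_obs = 100

scores = {name: aic_least_squares(sse, n_obs, k)
          for name, (sse, k) in candidates.items()}
deltas = delta_aic(scores)
best_model = min(scores, key=scores.get)

print(best_model, deltas)
```

    In this toy case the reduced model wins: its slightly larger SSE is outweighed by the penalty saved on six parameters.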

    Bayesian Information Criterion

    $$\mathrm{BIC} = n \log(\mathrm{SSE}_p) - n \log(n) + p \log(n)$$

    The BIC acts essentially the same as the AIC but imposes a more severe penalty per parameter, log(n) rather than 2, whenever n ≥ 8.


    Schwarz Bayesian Criterion

    $$\mathrm{SBC} = n \ln(\mathrm{SSE}/n) + k \ln(n)$$

    This is essentially like the AIC equation, but the penalty multiplies the number of parameters k by ln(n), which grows with sample size, rather than by a constant of 2. By default, PROC GLMSELECT uses stepwise selection based on the Schwarz Bayesian Criterion.[1]
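    The difference between the two penalties can be checked numerically. The Python sketch below compares the SBC with the least-squares form of the AIC, n ln(SSE/n) + 2k, on invented inputs; the function names are hypothetical, and only the observation count of 874,182 comes from this report.

```python
import math


def sbc(sse, n, k):
    """Schwarz Bayesian Criterion for a least-squares fit."""
    return n * math.log(sse / n) + k * math.log(n)


def aic(sse, n, k):
    """AIC for a least-squares fit, up to an additive constant."""
    return n * math.log(sse / n) + 2 * k


def penalty_gap(n, k):
    """Extra penalty SBC applies beyond AIC for k parameters."""
    return k * (math.log(n) - 2.0)


# The SBC penalty overtakes the AIC penalty once ln(n) exceeds 2.
print(penalty_gap(7, 1), penalty_gap(8, 1))

# Per-parameter SBC penalty at the sample size used in this report.
print(math.log(874182))
```

    With several hundred thousand observations, each additional parameter costs far more under the SBC than under the AIC, so SBC-based selection tends to keep fewer terms.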

    Mallows' Cp

    $$C_p = \frac{(1 - R_p^2)(n - T)}{1 - R_T^2} - [n - 2(p + 1)]$$

    The AIC has been shown to be equivalent to Mallows' Cp, which is used to assess the fit of a regression model that has been estimated using ordinary least squares. Cp measures the bias in the reduced regression model with p predictors relative to the full model having all T candidate predictors. If Cp is roughly equal to p, then the reduced model predicts as well as the full model. If Cp < p, then the reduced model is estimated to predict better than the full model. In practice, the selected model should have the smallest Cp.

    Mean Squared Error (MSE)

    The Mean Squared Error in regression refers to the residual sum of squares divided by the number of degrees of freedom. Minimizing MSE is important to ensure that the maximum amount of variation in the response can be explained by the independent variables, thus establishing the robustness of a model. It is one of the most important and fundamental criteria that can be used to evaluate models.
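    Both Mallows' Cp and the MSE are simple to compute once a model has been fit. The Python sketch below evaluates the Cp expression from the R-squared values of a reduced model (p predictors) and the full model (T candidate predictors), and the MSE as SSE divided by the residual degrees of freedom, taken here as n - p - 1 on the assumption of an intercept. All inputs are invented.

```python
def mean_squared_error(sse, n, p):
    """Residual sum of squares divided by residual degrees of freedom (n - p - 1)."""
    return sse / (n - p - 1)


def mallows_cp(r2_reduced, r2_full, n, p, t):
    """Mallows' Cp from the R-squared of the reduced and full models."""
    return (1 - r2_reduced) * (n - t) / (1 - r2_full) - (n - 2 * (p + 1))


# Hypothetical values: the reduced model fits exactly as well as the full one.
cp_same_fit = mallows_cp(r2_reduced=0.5, r2_full=0.5, n=100, p=4, t=5)

# Hypothetical values: dropping predictors costs some fit.
cp_worse_fit = mallows_cp(r2_reduced=0.45, r2_full=0.5, n=100, p=2, t=5)

mse = mean_squared_error(sse=500.0, n=100, p=4)
print(cp_same_fit, cp_worse_fit, mse)
```

    When the reduced model loses fit, the first term grows and Cp rises well above the number of parameters, signalling bias relative to the full model.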

    General Criteria:

    General diagnostics should be calculated for each model to help determine which model is best. These model diagnostics include the mean squared error (MSE), the adjusted coefficient of determination (adjusted R-squared), and Mallows' Cp. A good linear model will have a small MSE and Cp and an adjusted R-squared close to 1. With these criteria in mind, in addition to stepwise regression with tools such as AIC, BIC, and SBC, I can develop a more robust model than the original. However, it is also important to note that these criteria and selection procedures will not definitively yield the best model, due to the sheer number of potential models and the inherent limitations of these tools.

    Regression Two: PROC GLMSELECT with the Schwarz Bayesian Criterion

    Using PROC GLMSELECT, I followed the aforementioned steps to select a model using stepwise regression alone.

    Data Summary:
    Number of Observations    874,182

    Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 SECTOR N07_EMPLOYER SEX1 VET1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 EDUC1 BORNUS1 ESTABLISHED SCSAVINGS SCEQUITY SCCREDIT SCGOVTGUAR SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE ECSAVINGS ECASSETS ECCREDIT ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT LT40HOURS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING HUSBWIFE NUMOWNERS REGION

    [1] http://www2.sas.com/proceedings/sugi31/207-31.pdf

    Fit Statistics:
    R-square             0.5768
    Adjusted R-square    0.5768
    Root MSE             1516.49340

    Regression Three: PROC GLMSELECT with Akaike's Information Criterion

    Using PROC GLMSELECT, I followed the aforementioned steps to select a model using the AIC.

    Data Summary:
    Number of Observations    874,182

    Included Variables: EMPLOYMENT_NOISY PAYROLL_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 FOUNDED1 PURCHASED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCGRANT SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECFAMLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL INDIVIDUALS EXPORTS OPSOUTSIDE FULLTIME PARTTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE

    Fit Statistics: R-square 0.5771, Adjusted R-square 0.5770, Root MSE 1516.10130

    Since the AIC model has an adjusted R-square on par with the highest overall, uses fewer variables than the full model of Regression One, and has a slightly lower root MSE than the SBC stepwise model, it should be considered the most robust. It is noted that the first linear regression of all variables has a considerably lower MSE than the stepwise-selected models, which comes with the inclusion of far more variables. However, given the prevalence of missing data and nonresponse in these data sets, future data collection may benefit from relying on fewer variables in the generalized linear model. Therefore, the AIC model ranks as the most effective.


    The most significant variables are also those noted in Regression One: sector, health insurance, holidays, and employer status for tabulation. Sector is arguably the most intuitively significant, while both health insurance and holidays usually indicate a business performing well enough to provide extended benefits for employees. Surprisingly, both payroll and employment had far smaller magnitudes in their estimates, which suggests little significance on a practical level.

    The regression had several limitations. As stated earlier, the main drawbacks of the generalized linear model are that it can over-fit the data and that it assumes the relationships between the explanatory variables and the response variable are linear, which may not be the case. Ideally, each variable could be checked for nonlinearity by plotting the residuals against the predicted values. Furthermore, multicollinearity should be assessed by calculating the variance inflation factor for each variable, which remains to be done because SAS has no built-in functionality for this purpose for survey data. Given the large number of variables, testing for multicollinearity is especially important. From a cursory analysis, most of the t-ratios for the individual coefficients are statistically significant, which could indicate that multicollinearity is not severe. Another option was to create a correlation matrix, which proved too cumbersome because the matrix would have dimensions greater than 150 x 150. Finally, further work could investigate the addition of extra terms, such as interactions between the original terms, that better reflect the lack of complete independence between explanatory variables.
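    A minimal sketch of the variance inflation factor calculation mentioned above, written in Python rather than SAS. It uses the standard identity that the VIF of each predictor is the corresponding diagonal element of the inverse correlation matrix. The sketch ignores survey weights and assumes numeric predictors (categorical variables would first need dummy coding); the toy data and all function names are hypothetical.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    devs = [[x - m for x in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(d * d for d in dev)) for dev in devs]
    return [
        [sum(a * b for a, b in zip(devs[i], devs[j])) / (norms[i] * norms[j])
         for j in range(len(columns))]
        for i in range(len(columns))
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    size = len(matrix)
    # Augment the matrix with the identity on the right.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(size):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[size:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical toy predictors (not PUMS data); their correlation is 0.8.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vifs = variance_inflation_factors([x1, x2])
```

    For two predictors the result reduces to 1 / (1 - r^2), where r is their correlation. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.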

    Proper model selection requires conscientious consideration of the tradeoff between two competing objectives: adherence to the data and model simplicity. Good models conform to the data with a strong goodness of fit, but can also be easily generalized in their interpretation. Finally, good models should neither under-fit (leaving out key variables in an attempt to be generalizable) nor over-fit the data (including extraneous or unrealistic variable effects in an attempt to have the best goodness of fit), because in each scenario the conclusion loses value.


    QQ Plot of Residuals versus a Normal Distribution

    This quantile-quantile plot of the residuals versus a normal distribution shows that the residuals seem to be normally distributed through the inner quartiles, but are heavily skewed with long tails on both sides, particularly the left side. Since this QQ plot indicates significant skew in this model, the model cannot be used to draw strong conclusions.

    Plot of Fitted Values versus Residuals


    This plot of fitted values versus residuals shows a biased and heteroscedastic spread, with the interesting phenomenon of residuals steadily decreasing in variation at higher values of the response variable. Because of the extreme density of points, attributable to the large amount of overall variation and the sheer number of data points, more specific phenomena cannot be analyzed at this time.

    The plot also points to the issue of the model drastically over-estimating predicted values, given the sheer magnitude of the negative residuals, which calls for further analysis of the observed businesses that deviate the most from the model. Given that over half of the observed businesses have receipts of less than $16,000 according to the five-number summary, running a regression on the subset of the PUMS that includes only businesses earning more than the median may produce a better-fitting, less biased, and more homoscedastic model.

    The five-number summary for residual values clearly shows greater negative skew on the whole, but a greater density of positive values over smaller intervals, thus confirming the residuals versus fitted plot. Since numerical variables typically have more leverage over the fit and spread of the residuals, and receipts are extremely skewed right, an analysis of both payroll and employment is warranted to see if similar effects are in place. The following two tables describe the five-number summaries for both payroll and employment.

    These results confirm that both employment and payroll are heavily skewed right, which would explain the potential for the model to drastically overestimate certain values. Since the actual magnitude (from a dollar or labor force standpoint) causes this extreme skew, creating a new binary variable for each of these variables, such that 0 indicates employment or payroll of 0 and 1 indicates employment or payroll greater than 0, could result in better fit. Additionally, taking the logarithm of RECEIPTS_NOISY may improve fit as well, given the skew.
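    The two transformations described above can be sketched in Python as follows. The indicator names ifpayroll and ifemployment follow the variable list of Regression Five; the log_receipts field, the sample records, and the choice to return None for non-positive receipts are assumptions, since the text does not state how zero receipts were handled.

```python
import math


def to_indicator(value):
    """Return 1 if the business reports any payroll or employment, else 0."""
    return 1 if value > 0 else 0


def log_receipts(receipts):
    """Natural log of receipts; None when receipts are not positive."""
    return math.log(receipts) if receipts > 0 else None


# Hypothetical records; only the field names follow the PUMS variables.
records = [
    {"PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 0.0, "RECEIPTS_NOISY": 15.2},
    {"PAYROLL_NOISY": 280.0, "EMPLOYMENT_NOISY": 12.0, "RECEIPTS_NOISY": 900.0},
    {"PAYROLL_NOISY": 0.0, "EMPLOYMENT_NOISY": 0.0, "RECEIPTS_NOISY": 0.0},
]
for rec in records:
    rec["ifpayroll"] = to_indicator(rec["PAYROLL_NOISY"])
    rec["ifemployment"] = to_indicator(rec["EMPLOYMENT_NOISY"])
    rec["log_receipts"] = log_receipts(rec["RECEIPTS_NOISY"])
```

    Because the logarithm is undefined at zero, any record with zero receipts has to be dropped or handled separately before a log-receipts model is fit.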

    In order to establish a control for the following regressions, I first run PROC GLMSELECT with AIC, changing only the response to the logarithm of receipts.

    Regression Four: PROC GLMSELECT with AIC: Receipts with Logarithm Transform

    Data Summary: Number of Observations: 784,208

    Five Number Summary of RECEIPTS_NOISY

    Min          Q1           Median       Q3           Max
    0            1.72393      15.23411     81.77834     6900

    Five Number Summary of Residuals

    Min          Q1           Median       Q3           Max
    -93534.00    -222.63      -36.71       101.67       8271.42

    Five Number Summary of EMPLOYMENT_NOISY

    Min          Q1           Median       Q3           Max
    0            0            0            0            4890.00

    Five Number Summary of PAYROLL_NOISY

    Min          Q1           Median       Q3           Max
    0            0            0            0            280000.00
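    Five-number summaries like those above can be computed along the following lines. This is an unweighted Python sketch using the "inclusive" quantile method of the standard library; SAS and survey-weighted procedures may use different quantile definitions, and the sample values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers, unweighted."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical right-skewed sample: mostly zeros with one very large value.
sample = [0, 0, 0, 0, 0, 0, 0, 2, 5, 4890]
summary = five_number_summary(sample)
```

    When most observations are zero, as with payroll and employment, the minimum, first quartile, and median all coincide at zero, and only the maximum reveals the long right tail.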


    Included Variables: PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE

    Fit Statistics: R-Square 0.6689, Adjusted R-Square 0.6688, Root MSE 3.07507

    While the fit has improved by most standards, the Q-Q plot and the fitted versus residual values plot still indicate major issues of skew and bias. As stated earlier, creating binary variables out of the existing numeric payroll and employment variables could control some of the drastic skewness observed in both plots, which leads to the next regression.


    Regression Five: PROC GLMSELECT with AIC: Modified Binary Payroll and Employment with Other Character Variables against the Logarithm of Receipts

    Using PROC GLMSELECT, I used the aforementioned steps to select a model using the AIC with the modified binary payroll and employment variables.

    Variables Used: PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECGOVTGUAR ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL STATELOCAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME DAYLABOR TEMPSTAFF LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE FAMILYBUS NUMOWNERS race1noblanks LANGUAGE ifpayroll ifemployment

    Fit Statistics: R-square 0.6550, Adjusted R-square 0.6549, Root MSE 3.13900

    Although the fit statistics seem to indicate worse fit than the previous model, given the lower coefficients of determination and the higher root MSE, the Q-Q plot indicates better overall fit and far less skew, especially left skew.


    The correction of skew also allows me to observe the bulk of the fitted versus residuals plot more closely, which shows peculiar patterns that require further investigation. One possible reason is that virtually all of the regressed variables are categorical, with the exception of the ownership percentage of the first business owner.

    Although the previous model had better fit statistics and the current model may be susceptible to overfitting the sample, this model iteration ultimately has better results according to the Q-Q plot and the fitted versus residuals plot, owing to reduced skewness.


    Analysis of Language

    Research Question 1a: Does the language spoken in transactions produce a difference in correlated receipts?

    The United States is often seen as the nation most conducive to immigrant entrepreneurship, especially in an increasingly globalized society. To start, I investigated the relationship between different languages and overall business receipts. Running a general linear model procedure for only the language variable against receipts results in the following:

    Estimates of Language Variable Levels Against Receipts

    One interesting phenomenon is that a business that conducts transactions only in English and Spanish is associated with higher gross receipts than a business that conducts transactions in any other language or language combination. Most notably, the estimate for a business that uses only English and Spanish is nearly 40% greater than the estimate for a business that uses only English for transactions.

    It is possible that Hispanics are employed in the agriculture, manufacturing, and construction industries more than in others. The following graph, based on 2008 research by the Center for Construction Research and Training, depicts Hispanic employees as a percentage of each industry.

    These data point to a possibly non-uniform distribution of Hispanics across business industries. It is important to understand the difference between businesses that conduct transactions in a certain language and businesses that operate internally using that language. For example, a business whose workforce is predominantly Hispanic may not necessarily conduct business transactions in Spanish. Therefore, an additional research question could be the relationship between the language variable and the new race/ethnicity variable. The percentage of Hispanics by industry in this instance simply points to industries to investigate more closely, such as construction and agriculture.

    Using this information, I analyzed the breakdown of language by NAICS sector code, paying special attention to the industries that had the greatest percentage of Only English and Spanish businesses.


    Industries with the Greatest Proportion of Only Spanish and English Businesses:

    Sector 55: 13.63%
    The Management of Companies and Enterprises sector comprises (1) establishments that hold the securities of (or other equity interests in) companies and enterprises for the purpose of owning a controlling interest or influencing management decisions or (2) establishments (except government establishments) that administer, oversee, and manage establishments of the company or enterprise and that normally undertake the strategic or organizational planning and decision making role of the company or enterprise.

    Sector 62: 10.74%
    The Health Care and Social Assistance sector comprises establishments providing health care and social assistance for individuals.

    Industries with the Greatest Proportion of Only English Businesses:

    Sector 21: 95.71%
    The Mining sector comprises establishments that extract naturally occurring mineral solids, such as coal and ores; liquid minerals, such as crude petroleum; and gases, such as natural gas.

    Sector 11: 92.90%
    The Agriculture, Forestry, Fishing and Hunting sector comprises establishments primarily engaged in growing crops, raising animals, harvesting timber, and harvesting fish and other animals from a farm, ranch, or their natural habitats.

    Given these interpretations, looking at the mean receipts for each sector could then explain why businesses that conducted transactions in English and Spanish have statistically higher receipts on average. The following table shows each sector and its mean receipts:

    Statistics

    Variable                                          Mean          Std Error of Mean
    RECEIPTS_NOISY                                    174.631446    0.353243

    Domain Analysis: SECTOR

    SECTOR                                            Mean          Std Error of Mean
    11 Agriculture, Fishing                           96.067138     2.182337
    21 Mining, Quarrying, Oil Extraction              233.433129    6.192045
    22 Trade, Transportation, Utilities               113.609696    6.397033
    23 Construction                                   218.497973    1.149555
    31 Manufacturing                                  527.070573    4.920525
    42 Wholesale Trade                                628.044453    5.295762
    44 Retail Trade                                   267.633023    1.575551
    48 Transportation and Warehousing                 144.352095    1.225643
    51 Information                                    156.241621    2.441339
    52 Finance and Insurance                          170.082058    1.558410
    53 Real Estate and Rental                         120.396544    0.699686
    54 Professional, Scientific, Technical Services   139.530715    0.707921
    55 Mgmt. of Companies and Enterprises             490.728375    19.475524
    56 Admin. and Support and Waste Management        100.129900    0.835629
    61 Education                                      45.306452     0.868810
    62 Healthcare                                     175.218046    1.177138
    71 Arts and Etnmt.                                58.393615     0.740265
    72 Accommodation and Food Services                356.426518    2.657923
    81 Other Services                                 69.516196     0.417748
    99 Unclassifiable                                 102.268452    8.707951

    According to this PROC SURVEYMEANS output of mean gross receipts for businesses in each industry, both sectors 55 and 62 earn more than the average gross receipts of all businesses across industries, which is 174.63. While this preliminarily explains why the overall estimate for Only English and Spanish businesses is higher than the estimates for other language levels, the following question should be investigated:


    Research Question 1b: Among existing Only English and Spanish businesses, which are the most popular industries for business?

    Table of LANGUAGE by SECTOR (LANGUAGE: Only English and Spanish)

    SECTOR    Frequency    Weighted Frequency    Percent
    11        390          4076                  0.4269
    21        263          1907                  0.1998
    22        123          533.41200             0.0559
    23        7472         101173                10.5960
    31        3480         17790                 1.8632
    42        4346         27406                 2.8702
    44        13181        113815                11.9199
    48        3979         44151                 4.6239
    51        1460         9222                  0.9658
    52        5665         46651                 4.8858
    53        6760         85715                 8.9770
    54        11573        117115                12.2656
    55        1159         1581                  0.1656
    56        5933         74071                 7.7575
    61        1325         16665                 1.7453
    62        11269        127184                13.3201
    71        1917         25298                 2.6495
    72        4017         35816                 3.7511
    81        7699         104522                10.9467
    99        22           135.88500             0.0142
    Total     92033        954827                100.000

    The five highest concentrations of Only English and Spanish businesses are in sectors 62 (13.32% and receipts of 175.22), 54 (12.27% and receipts of 139.53), 44 (11.92% and receipts of 267.63), 81 (10.95% and receipts of 69.52), and 23 (10.60% and receipts of 218.50). Although Only English and Spanish businesses made up a large share of all businesses in sector 55, that sector actually had one of the smallest frequencies, with just 1159 businesses in total. Revising my initial conclusion, I argue that the higher estimate is most likely derived from consistently above average performance in sectors where Only English and Spanish businesses are prevalent.
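    One way to check how much of the higher estimate could come from sector mix alone is to weight each sector's mean receipts by the share of Only English and Spanish businesses in that sector. The Python sketch below does this with the figures from the two tables above. It rests on a strong simplifying assumption, namely that these businesses earn the overall mean of their sector, and the function and variable names are illustrative only.

```python
# Share (percent) of Only English and Spanish businesses in each sector,
# taken from the weighted LANGUAGE by SECTOR table.
share_pct = {
    11: 0.4269, 21: 0.1998, 22: 0.0559, 23: 10.5960, 31: 1.8632,
    42: 2.8702, 44: 11.9199, 48: 4.6239, 51: 0.9658, 52: 4.8858,
    53: 8.9770, 54: 12.2656, 55: 0.1656, 56: 7.7575, 61: 1.7453,
    62: 13.3201, 71: 2.6495, 72: 3.7511, 81: 10.9467, 99: 0.0142,
}

# Mean receipts of all businesses in each sector, from the domain analysis.
sector_mean = {
    11: 96.067138, 21: 233.433129, 22: 113.609696, 23: 218.497973,
    31: 527.070573, 42: 628.044453, 44: 267.633023, 48: 144.352095,
    51: 156.241621, 52: 170.082058, 53: 120.396544, 54: 139.530715,
    55: 490.728375, 56: 100.129900, 61: 45.306452, 62: 175.218046,
    71: 58.393615, 72: 356.426518, 81: 69.516196, 99: 102.268452,
}

OVERALL_MEAN = 174.631446  # mean receipts across all businesses


def composition_weighted_mean(shares, means):
    """Average of sector means, weighted by a group's sector shares."""
    total_share = sum(shares.values())
    return sum(shares[s] * means[s] for s in shares) / total_share


implied_mean = composition_weighted_mean(share_pct, sector_mean)
above_average = implied_mean > OVERALL_MEAN
```

    If implied_mean exceeds OVERALL_MEAN, sector mix by itself would push the group's average above the overall figure. The comparison says nothing about how these businesses perform within each sector.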


    Visualizing the Percentage of Hispanic/Spanish-Speaking Businesses in the United States

    As noted earlier, there is a distinction between businesses run by Hispanic owners and businesses that conduct transactions in English and Spanish. I approached this issue in a different fashion, by plotting the percentage of both (for each state) on a map of the United States. I pre-defined set intervals after looking at the spread of percentages across states. The following maps generally depict the same pattern of higher concentrations of Hispanic and Spanish-speaking subjects in the Southwest and Florida. I obtained 2007 data on the number of Hispanics living in each state from the Pew Research Hispanic Trends Project.


    Analysis of Capital Sources in California versus the United States

    California's $2 trillion economy would be the ninth biggest in the world if it were a country. The state represents 13% of the U.S. economy. However, California has been ranked as one of the worst states in which to do business in recent years, according to business executives and publications.[2] The state has been under duress from the dramatic fall in home prices and the resulting reduction in state tax revenues. Moreover, California consistently has one of the highest costs of living and operation. Interestingly, California also ranks among the best for technology and innovation. Another plus is the $36 billion in venture capital invested in California companies over the past three years, which is four times the total of any other state.[3] California is also home to Silicon Valley. According to a 2006 study by the American Electronics Association, Silicon Valley and the Bay Area as a whole ranked first in the number of high-tech jobs in the United States. While the PUMS does not have location data more granular than the state level, the existence of Silicon Valley could point to interesting statistical characteristics of California that no other state shares.

    Because California is a state of extremes, I found it interesting to investigate sources of startup and expansion capital and their relationship to revenues in California, especially compared to the same relationship in the United States as a whole.

    Dataset Overview

    The observed dataset is a subset of the cleaned PUMS, obtained by keeping only observations whose FIPS code is 06 (California). This dataset has 182932 observations, and most categorical variables (except location) seem to have a missing percentage of 30-50%, which is slightly higher than the typical missing pattern observed in the U.S. dataset. In this section, both startup and expansion capital will be analyzed.
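    The subsetting step and the missing-percentage check can be sketched in Python as below. FIPST and SCSAVINGS are variable names from the PUMS; the sample rows, the function names, and the assumption that missing values appear as blanks are hypothetical.

```python
def subset_by_state(records, fips_code):
    """Keep only the records whose state FIPS code matches fips_code."""
    return [rec for rec in records if rec.get("FIPST") == fips_code]


def missing_share(records, variable):
    """Fraction of records in which a variable is blank or absent."""
    if not records:
        return 0.0
    blanks = sum(1 for rec in records if rec.get(variable) in (None, ""))
    return blanks / len(records)


# Hypothetical rows; FIPST and SCSAVINGS follow the PUMS variable names.
rows = [
    {"FIPST": "06", "SCSAVINGS": "1"},
    {"FIPST": "06", "SCSAVINGS": ""},
    {"FIPST": "36", "SCSAVINGS": "0"},
]
california = subset_by_state(rows, "06")
share_blank = missing_share(california, "SCSAVINGS")
```

    Keeping the FIPS code as a string preserves the leading zero in "06", which would be lost if the code were read as an integer.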

    Startup capital refers to the initial cost of investment to fully bring a product or service to market. It can be used for everything from business operation expenses to research and development to payroll. It is typically used to fund businesses still in their infancy, and can be repaid once the business reaches a level of maturity to earn revenues on its own.

    Companies that seek expansion capital, on the other hand, will often do so in order to finance a transformational event in their business. These companies are likely to be more mature (in terms of operating time) than venture capital-funded companies, able to generate revenue and operating profits but unable to generate sufficient cash to fund major opportunities, acquisitions, or other investments. Because of this lack of scale, these companies generally can find few alternative conduits to secure capital for growth, so access to growth equity can be critical to pursue necessary facility expansion, sales and marketing initiatives, equipment purchases, and new product development.

    [2] http://www.cnbc.com/id/100843287
    [3] http://www.forbes.com/pictures/mli45kikd/41-california/



    Research Question 2a: What sources of capital have the most positive relationship with receipts?

    Regressing receipts on the startup capital and expansion capital variables in both the California and United States datasets reveals interesting statistics on the percentage of businesses using each type of capital, as well as the practical significance of each capital source as represented by its estimate. The following tables are drawn from the full list of SAS outputs contained in Appendix C.

    Summarized Table for Startup Capital

    Columns: share of businesses using the source (Yes/No) in California and in the U.S.; regression estimates for California (CA) and the United States (USA); true estimates (estimate added to intercept); and the difference between the CA and USA true estimates. Rows whose usage differs by more than 1% between California and the U.S. (bolded in the original): SCSAVINGS, SCEQUITY, SCBANKLOAN.

                      California         U.S.               Estimates            True Estimates (added to Intercept)
    Source            Yes      No        Yes      No        CA        USA        CA        USA       Difference
    SCSAVINGS         60.97%   39.03%    58.01%   41.99%    11.70     -15.78     113.71    154.40    -40.69
    SCASSETS          6.23%    93.77%    7.05%    92.95%    66.98     48.85      168.98    219.03    -50.04
    SCEQUITY          6.12%    93.88%    5.07%    94.93%    71.58     49.37      173.59    219.55    -45.96
    SCCREDIT          11.22%   88.78%    10.31%   89.69%    -53.88    -90.21     48.13     79.96     -31.83
    SCGOVTLOAN        0.40%    99.60%    0.56%    99.44%    155.74    91.93      257.74    262.10    -4.36
    SCGOVTGUAR        0.45%    99.55%    0.59%    99.41%    128.86    195.60     230.87    365.78    -134.92
    SCBANKLOAN        4.47%    95.53%    9.18%    90.82%    247.36    305.75     349.37    475.93    -126.56
    SCFAMLOAN         2.23%    97.77%    2.28%    97.72%    36.52     180.69     138.53    350.86    -212.34
    SCVENTURE         0.30%    99.70%    0.25%    99.75%    68.87     364.61     170.88    534.78    -363.91
    SCGRANT           0.18%    99.82%    0.20%    99.80%    -78.63    -103.56    23.37     66.62     -43.24
    SCOTHER           1.74%    98.26%    1.70%    98.30%    203.10    170.52     305.10    340.69    -35.59
    SCDONTKNOW        4.47%    95.53%    4.58%    95.42%    150.63    183.79     252.64    353.96    -101.32
    SCNONENEEDED      23.59%   76.41%    24.35%   75.65%    -53.49    111.12     48.52     281.30    -232.78
    SCNOTREPORTED     5.33%    94.67%    5.32%    94.68%    0.00      0.00       102.01    170.18    -68.17
    INTERCEPT                                               102.01    170.18
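    As a worked example of how the "True Estimates" and "Difference" columns relate to the raw estimates, the Python sketch below reproduces the SCSAVINGS row from the intercepts and coefficients in the table. The function name is illustrative; the arithmetic (intercept plus coefficient, then California minus United States) is inferred from the column headings and checked against the reported values.

```python
def true_estimate(intercept, coefficient):
    """Intercept plus coefficient, rounded to two decimals as in the table."""
    return round(intercept + coefficient, 2)


# Values for SCSAVINGS from the startup capital table.
ca_true = true_estimate(102.01, 11.70)    # California
us_true = true_estimate(170.18, -15.78)   # United States
difference = round(ca_true - us_true, 2)  # California minus United States
```

    A negative difference therefore means that the California true estimate is lower than the U.S. one for that capital source.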

    Glossary of Capital Sources

    Startup Capital (SC)
    SCSAVINGS: Personal Savings
    SCASSETS: Other Personal Assets
    SCEQUITY: Home Equity
    SCCREDIT: Credit Cards
    SCGOVTLOAN: Government Loan
    SCGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
    SCBANKLOAN: Loan from Bank
    SCFAMLOAN: Loan from family and friends
    SCVENTURE: Venture Capitalist
    SCGRANT: Grant
    SCOTHER: Other

    Expansion Capital (EC)
    ECSAVINGS: Personal Savings
    ECASSETS: Other Personal Assets
    ECEQUITY: Home Equity
    ECCREDIT: Credit Cards
    ECGOVTLOAN: Government Loan
    ECGOVTGUAR: Government Guaranteed Loan (the United States government and the Small Business Administration provide loans to certain businesses depending on size and capital purposes)
    ECBANKLOAN: Loan from Bank
    ECFAMLOAN: Loan from family and friends
    ECVENTURE: Venture Capitalist
    ECPROFITS: Business Profits
    ECGRANT: Grant
    ECOTHER: Other
    ECNOEXPAND: Did not expand
    ECNOACCESS: No Access to Expansion Capital

    In terms of usage frequency, differences greater than 1% between the United States and California are bolded in the table. Regardless, it is important to note that most of these differences (even those smaller than 1%) are statistically significant due to the large sample size. Businesses in the United States as a whole are more than twice as likely as businesses in California to use bank loans as a source of startup capital, while Californian business owners are more likely to use home equity and their own savings to start ventures. Overall, the three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans. For California, these categories are bank loans, other (unspecified) sources of capital, and government loans.

    Consistent with the state's low ranking, businesses in California seem to perform worse in every category than businesses in the United States as a whole, according to the difference in estimates. Most interestingly, California fares comparatively worst in the venture capital category, despite being the state with the greatest amount of venture capital invested in business venture formation. Other comparatively poor categories are loans from family members and businesses that do not need startup capital. These results warrant further analysis of the spread, rather than the average, of certain categories, as it is possible for Californian businesses to have greater extremes than American businesses in general.

    Summarized Table for Expansion Capital

    Columns: share of businesses using the source (Yes/No) in California and in the U.S.; regression estimates for California (CA) and the United States (USA); true estimates (estimate added to intercept); and the difference between the CA and USA true estimates. Rows whose usage differs by more than 1% between California and the U.S. (bolded in the original): ECSAVINGS, ECEQUITY, ECCREDIT, ECBANKLOAN, ECNOEXPAND.

                      California         U.S.               Estimates            True Estimates (added to Intercept)
    Source            Yes      No        Yes      No        CA        USA        CA        USA       Difference
    ECSAVINGS         32.24%   67.76%    29.00%   71.00%    -33.74    -112.84    76.38     118.78    -42.39
    ECASSETS          3.90%    96.10%    4.00%    96.00%    21.89     5.40       132.03    237.03    -105.00
    ECEQUITY          5.84%    94.16%    4.33%    95.67%    135.96    90.04      246.09    321.67    -75.57
    ECCREDIT          13.68%   86.32%    12.09%   87.91%    -9.45     -88.74     100.67    142.87    -42.19
    ECGOVTLOAN        0.33%    99.67%    0.39%    99.61%    419.09    99.43      529.23    331.06    198.17
    ECGOVTGUAR        0.26%    99.74%    0.29%    99.71%    63.47     167.61     173.61    399.24    -225.63
    ECBANKLOAN        4.60%    95.40%    7.54%    92.46%    461.09    446.34     571.22    677.97    -106.74
    ECFAMLOAN         1.06%    98.94%    0.93%    99.07%    69.17     113.62     179.31    345.25    -165.93
    ECVENTURE         0.18%    99.82%    0.13%    99.87%    514.95    521.87     625.09    753.49    -128.40
    ECPROFITS         8.95%    91.05%    9.23%    90.77%    124.95    162.90     235.09    394.53    -159.43
    ECGRANT           0.20%    99.80%    0.19%    99.81%    -98.08    -131.52    12.05     100.10    -88.04
    ECOTHER           0.83%    99.17%    0.76%    99.24%    127.44    132.62     237.57    364.24    -126.67
    ECDONTKNOW        6.76%    93.24%    6.57%    93.43%    -9.43     -43.75     100.70    187.86    -87.16
    ECNOACCESS        1.93%    98.07%    1.80%    98.20%    -68.32    -161.82    41.81     69.80     -27.98
    ECNOEXPAND        45.80%   54.20%    48.29%   51.71%    -26.99    -108.06    83.14     123.56    -40.41
    ECNOTREPORTED     7.38%    92.62%    7.61%    92.39%    0         0          110.13    231.63    -121.49
    INTERCEPT         N/A      N/A       N/A      N/A       110.137   231.627

    As done previously, differences greater than 1% between the United States and California are bolded. Businesses in the United States as a whole are more likely to use bank loans for expansion capital, or not to require it at all. Businesses in California are more likely to use their own savings, home equity, or credit card debt to fund expansion.

    Overall, the three categories with the highest estimate magnitudes for the United States are venture capital, bank loans, and government guaranteed loans (identical to the top three for startup capital). For California, these categories are venture capital, bank loans, and government loans. Once again, businesses in the United States tend to benefit more than businesses in California from virtually all sources of expansion capital, with the exception of government loans. These interesting characteristics further warrant an analysis of spread, rather than just the average, for certain categories.


    Research Question 2b: How does the spread of receipts for certain capital sources compare between businesses in California and the general United States?

    SCVENTURE and ECVENTURE Analysis

    Even though California performed worse on average according to its estimates, I initially hypothesized that California had a larger overall spread, with a maximum and upper quartile likely exceeding those of receipts in the United States. According to the following five-number summaries, California's quantiles for startup venture capital exceed those of the United States except for the maximum value, which indicates that Californian businesses generally perform better than businesses in the United States as a whole. The previous regression estimates are influenced by the heavier right skew of the United States distribution.

    Venture Capital - Five Number Summaries

    Quantile    California - SC    US - SC     California - EC    US - EC
    Min         0                  0           0                  0
    Q1          10.57              7.71        9.34               12.32
    Median      97.28              52.54       91.73              92.12
    Q3          774.90             454.80      608.45             725.24
    Max         6600.00            6900.00     6900.00            6800.00
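    A small Python sketch with made-up numbers shows how a group can have higher quartiles and yet a lower mean, which is the pattern suggested by the startup venture capital columns above. The two groups are hypothetical and are not drawn from the PUMS.

```python
import statistics

# Hypothetical receipts for two groups (not PUMS data). Group A has the higher
# typical values; group B contains one very large business.
group_a = [10, 40, 100, 300, 800]
group_b = [5, 20, 50, 150, 6900]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
median_a, median_b = statistics.median(group_a), statistics.median(group_b)
```

    Group A has the larger median, yet group B has the larger mean because of its single extreme value. A regression estimate behaves like a mean, so it is sensitive to such values in the same way.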

    ECGOVTGUAR Analysis

    Since ECGOVTGUAR was the only source of expansion capital for which California had a greater estimate than the United States, I also wanted to conduct a spread analysis. According to PROC SURVEYFREQ, 440 businesses in California used a government guaranteed loan for expansion capital.

    Expansion Capital from Gov't Guar. Loan

    Quantile    California - EC    US - EC
    Min         0.00               0.00
    Q1          83.57              22.29
    Median      301.42             174.09
    Q3          993.67             621.90
    Max         6900.00            6900.00

    The spread confirms the interpretation of the regression estimate: Californian businesses as a whole tend to benefit more from government guaranteed loans.
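    One way to put a single number on "spread" is the interquartile range (IQR = Q3 - Q1). The study does not report IQRs, so the short Python sketch below is an added illustration that uses only the quartiles listed in the government guaranteed loan table; the helper name `iqr` is hypothetical.

```python
# Quartiles copied from the government guaranteed loan table in this section
reported_quartiles = {
    "California": {"q1": 83.57, "q3": 993.67},
    "US": {"q1": 22.29, "q3": 621.90},
}


def iqr(q):
    """Interquartile range, rounded to the two decimals used in the table."""
    return round(q["q3"] - q["q1"], 2)


iqr_by_region = {region: iqr(q) for region, q in reported_quartiles.items()}
print(iqr_by_region)
```

    Under this measure the middle half of Californian receipts covers a wider range than the middle half for the United States, and it also sits higher, since both Q1 and Q3 are larger for California.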

    Conclusion

    My analysis of the Public Use Microdata Sample of the 2007 Survey of Business Owners involved cleaning and manipulating raw data; selecting a generalized linear model of all variables against receipts using Akaike's Information Criterion; analyzing business transaction language, industry of business, and ethnicity of the owner in the context of state location; and investigating the differences in capital sources between businesses in California and the United States in general.

    From the data cleaning, I was able to apply my knowledge of the survey methodology to consolidate and remove variables and prepare the data for proper analysis in this context. From the model fitting, I was able to learn and apply different model selection techniques and selection criteria (AIC, BIC, SBC, Mallows' Cp, MSE, and adjusted R-square) and ultimately choose the best-fitting model, which attained a moderately strong adjusted coefficient of determination after several rounds of model selection that included a logarithm transformation to reduce skewness and improve overall fit.

    The language analysis revealed that businesses that conducted transactions only in English and Spanish had higher estimates than businesses that used only English. Investigating these businesses by sector showed that businesses using only English and Spanish made up a higher percentage of certain sectors (such as sector 23, Construction) that earned more on average than the sectors where English-only businesses had a higher percentage. Finally, the analysis of sources of capital revealed that California overwhelmingly performed worse than the United States based on averages and regression estimates, but closer analysis of the spread of receipts for businesses with startup venture capital in California shows that estimates can be misleading, and that heavy right skew undermines conclusions based on the average alone.

    Given more time and access to the actual dataset (without noise and other confidentiality-preserving measures), I would be able to develop a more powerful and accurate model, along with other analyses. Other important directions to investigate would be creating a correlation matrix to observe collinearity between variables, examining the relationship between receipts and more demographic information such as age, gender, and education level, and conducting statistical analysis with different response variables, such as employment and payroll.

    Lessons Learned

    I've come to believe that doing independent research in statistics is incredibly important. Through this study, I've been able to apply everything that I've learned in all of my statistics courses, from handling a very large data set in SAS to making the proper assumptions and drawing conclusions from my analyses. I argue that this is the highest form of learning, as it is completely experiential, based on existing data, and set entirely in real-world scenarios. I've also been fortunate enough to study the fusion of my two academic fields: business and statistics. The flexibility of independent research has allowed me to learn about existing literature in the vast field of business statistics and entrepreneurship, as well as field other possible research ideas such as social network analysis and survival analysis.

    Appendix A: Full Output of Regression with log(Receipts) and Modified Payroll and Employment

    The GLMSELECT Procedure

    Selected Model

    The selected model is the model at the last step (Step 81).

    Effects: Intercept PAYROLL_NOISY EMPLOYMENT_NOISY PCT1 FIPST SECTOR N07_EMPLOYER SEX1 VET1 FOUNDED1 PURCHASED1 INHERITED1 RECEIVED1 ACQYR1 PROVIDE1 MANAGE1 FINANCIAL1 FNCTNABV1 HOURS1 PRMINC1 SELFEMP1 EDUC1 AGE1 BORNUS1 DISVET1 ESTABLISHED SCSAVINGS SCASSETS SCEQUITY SCCREDIT SCGOVTLOAN SCGOVTGUAR SCBANKLOAN SCFAMLOAN SCVENTURE SCGRANT SCOTHER SCDONTKNOW SCAMOUNT HOMEBASED FRANCHISE FRANCHISER50 ECSAVINGS ECASSETS ECEQUITY ECCREDIT ECGOVTLOAN ECBANKLOAN ECVENTURE ECPROFITS ECGRANT ECOTHER ECDONTKNOW ECNOACCESS ECNOEXPAND FEDERAL OTHERBUS INDIVIDUALS EXPORTS FULLTIME PARTTIME LEASED CONTRACTORS HEALTHINS RETIREMENT PROFITSHARE HOLIDAYS BENENABV WEBSITE ECOMMPCT ONLINEPURCH LT40HOURS LT12MONTHS SEASONAL OCCASIONALLY ACTIVITYNABV OPERATING CEASENR HUSBWIFE NUMOWNERS race1noblanks LANGUAGE

    Analysis of Variance

    Source             DF        Sum of Squares    Mean Square    F Value
    Model              207       14974589          72341          7650.22
    Error              784001    7413568           9.45607
    Corrected Total    784208    22388157

    Root MSE 3.07507

    Dependent Mean 4.23116

    R-Square 0.6689

    Adj R-Sq 0.6688

    AIC 2546267

    AICC 2546268

    SBC 1764464
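    Several of the fit statistics above follow directly from the Analysis of Variance table through the standard least-squares identities. The Python sketch below is an added cross-check, not part of the original SAS output; the variable names are my own, and only the degrees of freedom and sums of squares are copied from the table.

```python
import math

# Degrees of freedom and sums of squares copied from the ANOVA table above
df_model, ss_model = 207, 14974589
df_error, ss_error = 784001, 7413568
df_total, ss_total = 784208, 22388157

ms_model = ss_model / df_model            # mean square for the model
ms_error = ss_error / df_error            # mean square error (MSE)
f_value = ms_model / ms_error             # overall F statistic
root_mse = math.sqrt(ms_error)            # Root MSE
r_square = 1 - ss_error / ss_total        # R-Square
adj_r_square = 1 - ms_error / (ss_total / df_total)  # Adj R-Sq

print(f"Mean Square (Model) = {ms_model:.0f}")
print(f"Mean Square (Error) = {ms_error:.5f}")
print(f"F Value             = {f_value:.2f}")
print(f"Root MSE            = {root_mse:.5f}")
print(f"R-Square            = {r_square:.4f}")
print(f"Adj R-Sq            = {adj_r_square:.4f}")
```

    The recomputed values agree with the reported Mean Square, F Value, Root MSE, R-Square, and Adj R-Sq to the printed precision. AIC, AICC, and SBC are left out because their exact definitions depend on the conventions of the procedure.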

    Parameter Estimates

    Parameter DF Estimate Standard Error t Value

    Intercept 1 1.889667 0.094597 19.98

    PAYROLL_NOISY 1 0.001098 0.000006900 159.06

    EMPLOYMENT_NOISY 1 0.013737 0.000194 70.72

    PCT1 1 0.000336 0.000107 3.13

    FIPST 01 1 0.069510 0.012035 5.78

    FIPST 04 1 0.098799 0.010865 9.09

    FIPST 05 1 0.007242 0.013682 0.53

    FIPST 06 1 0.190992 0.008333 22.92

    FIPST 08 1 0.042436 0.010223 4.15

    FIPST 09 1 0.176561 0.011957 14.77

    FIPST 12 1 0.042020 0.008756 4.80

    FIPST 13 1 0.066957 0.009863 6.79

    FIPST 15 1 0.128502 0.017731 7.25

    FIPST 16 1 -0.020266 0.015023 -1.35

    FIPST 17 1 0.053238 0.009259 5.75

    FIPST 18 1 0.012449 0.010725 1.16

    FIPST 19 1 -0.066745 0.012767 -5.23

    FIPST 20 1 -0.015773 0.013107 -1.20

    FIPST 21 1 -0.007839 0.012161 -0.64

    FIPST 22 1 0.102305 0.012320 8.30

    FIPST 23 1 0.000535 0.015441 0.03

    FIPST 24 1 0.120146 0.010882 11.04

    FIPST 25 1 0.144996 0.010390 13.96

    FIPST 26 1 0.006431 0.009752 0.66

    FIPST 27 1 0.006867 0.010502 0.65

    FIPST 28 1 0.052754 0.014819 3.56

    FIPST 29 1 -0.014865 0.010791 -1.38

    FIPST 30 1 -0.065514 0.016294 -4.02

    FIPST 31 1 -0.083451 0.015061 -5.54

    FIPST 32 1 0.171132 0.014256 12.00

    FIPST 33 1 0.122197 0.015510 7.88

    FIPST 34 1 0.184363 0.009862 18.69

    FIPST 35 1 0.046100 0.015776 2.92

    FIPST 36 1 0.132543 0.008847 14.98

    FIPST 37 1 0.057997 0.009811 5.91

    FIPST 39 1 0.021252 0.009527 2.23

    FIPST 40 1 -0.000955 0.012247 -0.08

    FIPST 41 1 0.055542 0.011366 4.89

    FIPST 42 1 0.069120 0.009322 7.41

    FIPST 45 1 0.045966 0.011929 3.85

    FIPST 47 1 0.070336 0.010845 6.49

    FIPST 48 1 0.107862 0.008739 12.34

    FIPST 49 1 0.083133 0.012981 6.40

    FIPST 51 1 0.079476 0.010169 7.82

    FIPST 53 1 0.103092 0.010198 10.11

    FIPST 54 1 -0.057228 0.017523 -3.27

    FIPST 55 0 0 . .

    SECTOR 11 1 1.034824 0.086277 11.99

    SECTOR 21 1 1.021291 0.086922 11.75

    SECTOR 22 1 0.990292 0.096556 10.26

    SECTOR 23 1 1.275192 0.085487 14.92

    SECTOR 31 1 1.101583 0.085674 12.86

    SECTOR 42 1 1.555542 0.085640 18.16

    SECTOR 44 1 1.279568 0.085497 14.97

    SECTOR 48 1 1.200337 0.085620 14.02

    SECTOR 51 1 0.863523 0.085909 10.05

    SECTOR 52 1 0.940771 0.085589 10.99

    SECTOR 53 1 0.995807 0.085505 11.65

    SECTOR 54 1 0.908610 0.085467 10.63

    SECTOR 55 1 -0.452943 0.100108 -4.52

    SECTOR 56 1 0.809018 0.085534 9.46

    SECTOR 61 1 0.734224 0.085848 8.55

    SECTOR 62 1 0.966935 0.085523 11.31

    SECTOR 71 1 0.742947 0.085621 8.68

    SECTOR 72 1 1.069889 0.085657 12.49

    SECTOR 81 1 0.856258 0.085507 10.01

    SECTOR 99 0 0 . .

    N07_EMPLOYER E 1 1.137851 0.003439 330.84

    N07_EMPLOYER N 0 0 . .

    SEX1 F 1 -0.152693 0.002767 -55.18

    SEX1 M 0 0 . .

    VET1 1 1 0.994431 0.478293 2.08

    VET1 2 0 0 . .

    FOUNDED1 1 1 -0.079178 0.028948 -2.74

    FOUNDED1 2 0 0 . .

    PURCHASED1 1 1 -0.087677 0.028811 -3.04

    PURCHASED1 2 0 0 . .

    INHERITED1 1 1 -0.091757 0.028593 -3.21

    INHERITED1 2 0 0 . .

    RECEIVED1 1 1 -0.060479 0.028701 -2.11

    RECEIVED1 2 0 0 . .

    ACQYR1 1 1 0.123718 0.010722 11.54

    ACQYR1 2 1 0.114924 0.010468 10.98

    ACQYR1 3 1 0.122968 0.009717 12.66

    ACQYR1 4 1 0.112593 0.009256 12.16

    ACQYR1 5 1 0.084345 0.011292 7.47

    ACQYR1 6 1 0.055247 0.011273 4.90

    ACQYR1 7 1 -0.069739 0.011162 -6.25

    ACQYR1 8 0 0 . .

    PROVIDE1 1 1 -0.225547 0.003055 -73.82

    PROVIDE1 2 0 0 . .

    MANAGE1 1 1 -0.065064 0.002819 -23.08

    MANAGE1 2 0 0 . .

    FINANCIAL1 1 1 0.104690 0.002771 37.78

    FINANCIAL1 2 0 0 . .

    FNCTNABV1 1 1 -0.117711 0.006090 -19.33

    FNCTNABV1 2 0 0 . .

    HOURS1 1 1 -0.135239 0.008345 -16.21

    HOURS1 2 1 -0.336065 0.004656 -72.19

    HOURS1 3 1 -0.181169 0.004282 -42.31

    HOURS1 4 1 -0.150164 0.003995 -37.59

    HOURS1 5 1 -0.082393 0.003423 -24.07

    HOURS1 6 0 0 . .

    PRMINC1 1 1 0.242443 0.002948 82.23

    PRMINC1 2 0 0 . .

    SELFEMP1 1 1 0.047307 0.002353 20.10

    SELFEMP1 2 0 0 . .

    EDUC1 1 1 -0.166470 0.005955 -27.95

    EDUC1 2 1 -0.108420 0.003966 -27.34

    EDUC1 3 1 -0.181064 0.005205 -34.78

    EDUC1 4 1 -0.124506 0.003872 -32.15

    EDUC1 5 1 -0.146356 0.005298 -27.62

    EDUC1 6 1 -0.073524 0.003406 -21.58

    EDUC1 7 0 0 . .

    AGE1 1 1 -0.184340 0.009761 -18.88

    AGE1 2 1 0.007893 0.005418 1.46

    AGE1 3 1 0.071539 0.004581 15.62

    AGE1 4 1 0.071199 0.004198 16.96

    AGE1 5 1 0.036784 0.003984 9.23

    AGE1 6 0 0 . .

    BORNUS1 1 1 -0.037608 0.004052 -9.28

    BORNUS1 2 0 0 . .

    DISVET1 1 1 -1.093928 0.478399 -2.29

    DISVET1 2 1 -1.034112 0.478302 -2.16

    DISVET1 3 0 0 . .

    ESTABLISHED 1 1 0.133473 0.009513 14.03

    ESTABLISHED 2 1 0.135956 0.009681 14.04

    ESTABLISHED 3 1 0.132354 0.008972 14.75

    ESTABLISHED 4 1 0.111280 0.008763 12.70

    ESTABLISHED 5 1 0.095458 0.009609 9.93

    ESTABLISHED 6 1 0.094962 0.009231 10.29

    ESTABLISHED 7 1 0.071583 0.010832 6.61

    ESTABLISHED 8 1 0.051642 0.010923 4.73

    ESTABLISHED 9 1 -0.016931 0.010863 -1.56

    ESTABLISHED A 0 0 . .

    SCSAVINGS 1 1 -0.017371 0.003576 -4.86

    SCSAVINGS 2 0 0 . .

    SCASSETS 1 1 -0.056578 0.004313 -13.12

    SCASSETS 2 0 0 . .

    SCEQUITY 1 1 -0.039681 0.005002 -7.93

    SCEQUITY 2 0 0 . .

    SCCREDIT 1 1 -0.051551 0.003860 -13.36

    SCCREDIT 2 0 0 . .

    SCGOVTLOAN 1 1 -0.028818 0.013311 -2.16

    SCGOVTLOAN 2 0 0 . .

    SCGOVTGUAR 1 1 0.022142 0.012371 1.79

    SCGOVTGUAR 2 0 0 . .

    SCBANKLOAN 1 1 0.055183 0.004020 13.73

    SCBANKLOAN 2 0 0 . .

    SCFAMLOAN 1 1 -0.018966 0.006370 -2.98

    SCFAMLOAN 2 0 0 . .

    SCVENTURE 1 1 -0.096289 0.022065 -4.36

    SCVENTURE 2 0 0 . .

    SCGRANT 1 1 -0.145460 0.028804 -5.05

    SCGRANT 2 0 0 . .

    SCOTHER 1 1 -0.035163 0.008224 -4.28

    SCOTHER 2 0 0 . .

    SCDONTKNOW 1 1 -0.052654 0.008604 -6.12

    SCDONTKNOW 2 0 0 . .

    SCAMOUNT 1 1 -0.002310 0.004671 -0.49

    SCAMOUNT 2 1 0.062655 0.005589 11.21

    SCAMOUNT 3 1 0.087967 0.005551 15.85

    SCAMOUNT 4 1 0.131828 0.006121 21.54

    SCAMOUNT 5 1 0.182133 0.006361 28.63

    SCAMOUNT 6 1 0.247689 0.006671 37.13

    SCAMOUNT 7 1 0.385266 0.007672 50.22

    SCAMOUNT 8 1 0.587252 0.011676 50.29

    SCAMOUNT 9 1 0.186741 0.006260 29.83

    SCAMOUNT A 0 0 . .

    HOMEBASED 1 1 -0.193209 0.002631 -73.42

    HOMEBASED 2 0 0 . .

    FRANCHISE 1 1 0.095547 0.008042 11.88

    FRANCHISE 2 0 0 . .

    FRANCHISER50 1 1 -0.022814 0.012799 -1.78

    FRANCHISER50 2 0 0 . .

    ECSAVINGS 1 1 -0.071333 0.003629 -19.66

    ECSAVINGS 2 0 0 . .

    ECASSETS 1 1 -0.050133 0.005692 -8.81

    ECASSETS 2 0 0 . .

    ECEQUITY 1 1 0.032494 0.005389 6.03

    ECEQUITY 2 0 0 . .

    ECCREDIT 1 1 -0.024643 0.003773 -6.53

    ECCREDIT 2 0 0 . .

    ECGOVTLOAN 1 1 0.038285 0.015650 2.45

    ECGOVTLOAN 2 0 0 . .

    ECBANKLOAN 1 1 0.103370 0.004219 24.50

    ECBANKLOAN 2 0 0 . .

    ECVENTURE 1 1 -0.246374 0.031677 -7.78

    ECVENTURE 2 0 0 . .

    ECPROFITS 1 1 0.031553 0.003818 8.27

    ECPROFITS 2 0 0 . .

    ECGRANT 1 1 -0.158893 0.028705 -5.54

    ECGRANT 2 0 0 . .

    ECOTHER 1 1 -0.032423 0.012595 -2.57

    ECOTHER 2 0 0 . .

    ECDONTKNOW 1 1 -0.088126 0.007204 -12.23

    ECDONTKNOW 2 0 0 . .

    ECNOACCESS 1 1 -0.095309 0.010197 -9.35

    ECNOACCESS 2 0 0 . .

    ECNOEXPAND 1 1 -0.043675 0.003966 -11.01

    ECNOEXPAND 2 0 0 . .

    FEDERAL 1 1 0.041492 0.007992 5.19

    FEDERAL 2 0 0 . .

    OTHERBUS 1 1 0.045852 0.003214 14.27

    OTHERBUS 2 0 0 . .

    INDIVIDUALS 1 1 -0.149726 0.003468 -43.17

    INDIVIDUALS 2 0 0 . .

    EXPORTS 1 1 0.073348 0.007377 9.94

    EXPORTS 2 1 0.138000 0.010933 12.62

    EXPORTS 3 1 0.177615 0.013409 13.25

    EXPORTS 4 1 0.214486 0.016534 12.97

    EXPORTS 5 1 0.166936 0.016124 10.35

    EXPORTS 6 1 0.178288 0.016120 11.06

    EXPORTS 7 1 0.293766 0.017115 17.16

    EXPORTS 8 1 0.320932 0.021770 14.74

    EXPORTS 9 0 0 . .

    FULLTIME 1 1 0.179631 0.003769 47.67

    FULLTIME 2 0 0 . .

    PARTTIME 1 1 -0.020924 0.003068 -6.82

    PARTTIME 2 0 0 . .

    LEASED 1 1 0.334245 0.011579 28.87

    LEASED 2 0 0 . .

    CONTRACTORS 1 1 0.246384 0.002523 97.67

    CONTRACTORS 2 0 0 . .

    HEALTHINS 1 1 0.078051 0.004257 18.33

    HEALTHINS 2 0 0 . .

    RETIREMENT 1 1 0.170154 0.004144 41.06

    RETIREMENT 2 0 0 . .

    PROFITSHARE 1 1 -0.020966 0.006971 -3.01

    PROFITSHARE 2 0 0 . .

    HOLIDAYS 1 1 0.120158 0.004618 26.02

    HOLIDAYS 2 0 0 . .

    BENENABV 1 1 -0.074976 0.005111 -14.67

    BENENABV 2 0 0 . .

    WEBSITE 1 1 0.019704 0.002904 6.78

    WEBSITE 2 0 0 . .

    ECOMMPCT 1 1 -0.033776 0.009894 -3.41

    ECOMMPCT 2 1 -0.076199 0.010547 -7.22

    ECOMMPCT 3 1 -0.065350 0.013496 -4.84

    ECOMMPCT 4 1 -0.107320 0.012424 -8.64

    ECOMMPCT 5 1 -0.053331 0.012200 -4.37

    ECOMMPCT 6 1 -0.060335 0.010540 -5.72

    ECOMMPCT 7 1 -0.093930 0.013852 -6.78

    ECOMMPCT 8 1 -0.158731 0.015935 -9.96

    ECOMMPCT 9 0 0 . .

    ONLINEPURCH 1 1 0.003940 0.002547 1.55

    ONLINEPURCH 2 0 0 . .

    LT40HOURS 1 1 -0.060922 0.004991 -12.21

    LT40HOURS 2 0 0 . .

    LT12MONTHS 1 1 -0.050281 0.004365 -11.52

    LT12MONTHS 2 0 0 . .

    SEASONAL 1 1 -0.038413 0.005919 -6.49

    SEASONAL 2 0 0 . .

    OCCASIONALLY 1 1 -0.102215 0.005775 -17.70

    OCCASIONALLY 2 0 0 . .

    ACTIVITYNABV 1 1 0.189169 0.005265 35.93

    ACTIVITYNABV 2 0 0 . .

    OPERATING 1 1 0.247984 0.003370 73.58

    OPERATING 2 0 0 . .

    CEASENR 1 1 0.062078 0.022437 2.77

    CEASENR 2 0 0 . .

    HUSBWIFE 1 1 -0.064987 0.004525 -14.36

    HUSBWIFE 2 1 -0.068549 0.004097 -16.73

    HUSBWIFE 3 1 -0.126576 0.006033 -20.98

    HUSBWIFE 4 0 0 . .

    NUMOWNERS 1 1 0.300708 0.014590 20.61

    NUMOWNERS 2 1 0.419767 0.015134 27.74

    NUMOWNERS 3 1 0.486225 0.016239 29.94

    NUMOWNERS 4 1 0.489487 0.017344 28.22

    NUMOWNERS 5 1 0.505033 0.017709 28.52

    NUMOWNERS 6 1 0.444110 0.023362 19.01

    NUMOWNERS 7 1 0.051559 0.046352 1.11

    NUMOWNERS 8 0 0 . .

    race1noblanks A 1 -0.020036 0.005832 -3.44

    race1noblanks B 1 -0.185359 0.006773 -27.37

    race1noblanks H 1 -0.067546 0.005681 -11.89

    race1noblanks I 1 -0.072423 0.014317 -5.06

    race1noblanks Mixed 1 0.004210 0.029596 0.14

    race1noblanks P 1 -0.131819 0.039328 -3.35

    race1noblanks S 1 0.002974 0.026112 0.11

    race1noblanks W 0 0 . .

    LANGUAGE Only English 1 0.216239 0.017187 12.58

    LANGUAGE Only English and Other 1 0.143442 0.017956 7.99

    LANGUAGE Only English and Spanish 1 0.219765 0.017202 12.78

    LANGUAGE Only Other Language 1 0.007221 0.023920 0.30

    LANGUAGE Only Spanish 0 0 . .
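    Because the response in this model is log(Receipts), each estimate for a class level is a difference on the log scale relative to that variable's reference level (the row with a zero estimate). The Python sketch below shows one common way to read such estimates as percent differences. It assumes the transform is a plain natural logarithm, which this appendix does not confirm, and the helper name `percent_effect` is hypothetical. The estimates are copied from the LANGUAGE rows above.

```python
import math


def percent_effect(coefficient):
    """Percent difference in the response implied by a natural-log-scale coefficient."""
    return (math.exp(coefficient) - 1) * 100


# Estimates copied from the LANGUAGE rows; the reference level is "Only Spanish"
language_estimates = {
    "Only English": 0.216239,
    "Only English and Other": 0.143442,
    "Only English and Spanish": 0.219765,
    "Only Other Language": 0.007221,
}

for level, estimate in language_estimates.items():
    print(f"{level}: {percent_effect(estimate):+.1f}% relative to Only Spanish")
```

    Read this way, each figure describes receipts for a level relative to Only Spanish businesses with all other effects in the model held fixed; it says nothing about raw group averages.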

    Appendix B: Regression of All Available Variables

    The SAS System

    The SURVEYREG Procedure

    Regression Analysis for Dependent Variable RECEIPTS_NOISY

    Data Summary

    Number of Observations 874182

    Sum of Weights 10068531

    Weighted Mean of RECEIPTS_NOISY 251.05405

    Weighted Sum of RECEIPTS_NOISY 2527745467

    Fit Statistics

    R-square 0.5771

    Root MSE 446.73

    Denominator DF 874181

    Class Level Information

    Class Variable Levels Values

    FIPST 43 01 04 05 06 08 09 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 39 40 41 42 45 47 48 49 51 53 54 55

    SECTOR 20 11 21 22 23 31 42 44 48 51 52 53 54 55 56 61 62 71 72 81 99

    N07_EMPLOYER 2 E N

    SEX1 2 F M

    VET1 2 1 2

    FOUNDED1 2 1 2

    PURCHASED1 2 1 2

    INHERITED1 2 1 2

    RECEIVED1 2 1 2

    ACQUIRENR1 1 2

    ACQYR1 8 1 2 3 4 5 6 7 8

    PROVIDE1 2 1 2

    MANAGE1 2 1 2

    FINANCIAL1 2 1 2

    FNCTNABV1 2 1 2

    FNCTNR1 1 2

    HOURS1 6 1 2 3 4 5 6

    PRMINC1 2 1 2

    SELFEMP1 2 1 2

    EDUC1 7 1 2 3 4 5 6 7

    AGE1 6 1 2 3 4 5 6

    BORNUS1 2 1 2

    DISVET1 3 1 2 3

    ESTABLISHED 10 1 2 3 4 5 6 7 8 9 A

    SCSAVINGS 2 1 2

    SCASSETS 2 1 2

    SCEQUITY 2 1 2

    SCCREDIT 2 1 2

    SCGOVTLOAN 2 1 2

    SCGOVTGUAR 2 1 2

    SCBANKLOAN 2 1 2

    SCFAMLOAN 2 1 2

    SCVENTURE 2 1 2

    SCGRANT 2 1 2

    SCOTHER 2 1 2

    SCDONTKNOW 2 1 2

    SCNONENEEDED 2 1 2

    SCNOTREPORTED 1 2

    SCAMOUNT 10 1 2 3 4 5 6 7 8 9 A

    HOMEBASED 2 1 2

    FRANCHISE 2 1 2

    FRANCHISER50 2 1 2

    ECSAVINGS 2 1 2

    ECASSETS 2 1 2

    ECEQUITY 2 1 2

    ECCREDIT 2 1 2

    ECGOVTLOAN 2 1 2

    ECGOVTGUAR 2 1 2

    ECBANKLOAN 2 1 2

    ECFAMLOAN 2 1 2

    ECVENTURE 2 1 2

    ECPROFITS 2 1 2

    ECGRANT 2 1 2

    ECOTHER 2 1 2

    ECDONTKNOW 2 1 2

    ECNOACCESS 2 1 2

    ECNOEXPAND 2 1 2

    ECNOTREPORTED 1 2

    FEDERAL 2 1 2

    STATELOCAL 2 1 2

    OTHERBUS 2 1 2

    INDIVIDUALS 2 1 2

    CUSTNR 1 2

    EXPORTS 9 1 2 3 4 5 6 7 8 9

    OPSOUTSIDE 2 1 2

    OUTSOURCE 2 1 2

    FULLTIME 2 1 2

    PARTTIME 2 1 2

    DAYLABOR 2 1 2

    TEMPSTAFF 2 1 2

    LEASED 2 1 2

    CONTRACTORS 2 1 2

    EMPNR 1 2

    HEALTHINS 2 1 2

    RETIREMENT 2 1 2

    PROFITSHARE 2 1 2

    HOLIDAYS 2 1 2

    BENENABV 2 1 2

    BENENR 1 2

    WEBSITE 2 1 2

    ECOMMERCE 2 1 2

    ECOMMPCT 9 1 2 3 4 5 6 7 8 9

    ONLINEPURCH 2 1 2

    LT40HOURS 2 1 2

    LT12MONTHS 2 1 2

    SEASONAL 2 1 2

    OCCASIONALLY 2 1 2

    ACTIVITYNABV 2 1 2

    ACTIVITYNR 1 2

    OPERATING 2 1 2

    CEASENR 2 1 2

    CEASENA 2 1 2

    HUSBWIFE 4 1 2 3 4

    FAMILYBUS 2 1 2

    NUMOWNERS 8 1 2 3 4 5 6 7 8

    race1noblanks 8 A B H I Mixed P S W

    LANGUAGE 5 Only English Only English and Other Only English and Spanish Only Other Language Only Spanish

    region 9 East Sout Mid-Atlan Midwest Mountain Northeast Pacific South Atl West Nort West Sout

    Tests of Model Effects

    Effect Num DF F Value Pr > F

    Model 215 1538.56

    PRMINC1 1 592.68

    ECCREDIT 1 477.03

    HEALTHINS 1 1027.62

    Appendix C: Full Tables of Capital Regression

    Regressing Startup Capital Variables Against Receipts in California

    Parameter Estimate Standard Error t Value Pr > |t|

    Intercept 102.007317 21.726311 4.70

    SCFAMLOAN 1 180.68722 20.0165018 9.03

    Intercept 231.62717 10.454859 22.15