
Page 1: Biomedical Statistics

Identifying Green Automotive Designs

In this demo, we will see how the MATLAB Statistics Toolbox™ can help us explore and analyze historical automotive performance data, and create a model that represents the data well.

Contents

Setup
Load Data
Exploration
Matrix Plot Visualization
Modeling Fuel Economy
Create model for other combinations
Simulation
Conclusion


Setup

Turn off warning (after saving current state)

warnState = warning('query', 'stats:dataset:setvarnames:ModifiedVarnames');
warning('off', 'stats:dataset:setvarnames:ModifiedVarnames');

Load Data

Load fuel economy data from year 2000 to 2007 into a dataset array

data = loadData();

Reading in 1 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 2 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 3 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 4 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 5 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 6 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 7 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Done


Exploring the Data

First, we would like to examine the distribution of the MPG data to get a better understanding of the variability. For this, we will use the Distribution Fitting Tool from the Statistics Toolbox.

Let us see how "dfittool" simplifies the task of visualizing and fitting distributions.

dfittool(data.mpg, [], [], 'MPG');

The histogram seems to have multiple peaks. Perhaps we need to break the data up into groups, "highway" and "city".


% close DFITTOOL (custom function)
closeDTool;

dfittool(data.mpg(data.C_H == 'H'), [], [], 'Highway');
dfittool(data.mpg(data.C_H == 'C'), [], [], 'City');

If we look at the "highway" data, the distribution still has multiple peaks. We can group it even further into "cars" and "trucks".


% close DFITTOOL (custom function)
closeDTool;

dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), [], [], 'Highway Car');
dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'T'), [], [], 'Highway Truck');

We will try different types of distributions. We can do this by clicking on "New Fit..." from dfittool.

"Normal" distribution doesn't seem to fit well.

"Logistic" seems to be a better fit. It's not perfect but good enough for our purpose.
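The same normal-versus-logistic comparison can also be made programmatically. Here is a sketch in Python with SciPy, using synthetic data in place of the MPG sample (which is not distributed with this transcript); the location and scale values are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for a highway-car MPG sample
mpg = rng.logistic(loc=30, scale=3, size=500)

# Fit both candidate distributions by maximum likelihood
norm_params = stats.norm.fit(mpg)
logi_params = stats.logistic.fit(mpg)

# Compare the fits via total log-likelihood: higher is better
ll_norm = np.sum(stats.norm.logpdf(mpg, *norm_params))
ll_logi = np.sum(stats.logistic.logpdf(mpg, *logi_params))
print(f"normal loglik: {ll_norm:.1f}, logistic loglik: {ll_logi:.1f}")
```

This mirrors what "New Fit..." does interactively: each candidate distribution is fit by maximum likelihood and the fits are then compared.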


myDistFit(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), ...
    data.mpg(data.C_H == 'H' & data.car_truck == 'T'))
xlabel('MPG');


Matrix Plot Visualization

We now know the overall variability in MPG, but what may be causing this? Which factors affect MPG, and how? To answer these questions, let's look at the relationship between MPG and a few of the other variables.

We will first look at three of these variables: engine displacement, horsepower, and weight. The command gplotmatrix creates a matrix plot grouped by categories. We will categorize by City/Highway and Car/Truck.

carType = {'C', 'MPG_{car}'; 'T', 'MPG_{truck}'};
figPos = [.2 .55 .6 .45; .2 .1 .6 .45];
predNames = {'Displacement', 'Horsepower', 'Weight'};
for id = 1:2
    dataID = data.car_truck == carType{id, 1};
    response = data.mpg(dataID);
    respNames = carType(id, 2);
    predictors = [data.cid(dataID), data.rhp(dataID), data.etw(dataID)];

    figure('units', 'normalized', 'outerposition', figPos(id, :));
    [h, ax] = gplotmatrix(predictors, response, data.C_H(dataID), ...
        [], [], 5, [], [], predNames, respNames);

    title('Emissions Test Results');
end


All 3 variables seem to be negatively correlated with MPG. As expected, highway driving yields better fuel economy than city driving.
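The direction and strength of such a relationship can be quantified with a correlation coefficient. A minimal sketch (in Python for illustration, with toy weight/MPG pairs rather than the demo's data):

```python
import numpy as np

# Toy data: heavier cars tend to get fewer miles per gallon
weight = np.array([1800, 2400, 3000, 3600, 4200, 4800])
mpg    = np.array([  35,   30,   24,   19,   15,   12])

# Pearson correlation coefficient; a value near -1 means a
# strong negative linear relationship
r = np.corrcoef(weight, mpg)[0, 1]
print(round(r, 3))
```

A matrix plot like gplotmatrix is essentially a visual version of the full pairwise correlation structure.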


Modeling Fuel Economy

We will now focus on one of the four groups, Highway-Cars, and create a model of MPG based on a few predictors.

We will look at the Highway Cars and test out 6 potential predictors: engine displacement, rated horsepower, estimated test weight, compression ratio, axle ratio, and engine/vehicle speed ratio. We will use ANOVA for the analysis.

The model we will check first is a linear model:

dat = data(data.C_H == 'H' & data.car_truck == 'C', :);

% List of potential predictor variables
predictors = {dat.cid, dat.rhp, dat.etw, dat.cmp, dat.axle, dat.n_v};
varNames = {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'};
p = anovan(dat.mpg, predictors, 'continuous', 1:length(predictors), ...
    'varnames', varNames);


The ANOVA table informs us of the sources of variability in the model. Looking at the p-value (last column) tells us whether or not a given term has a significant effect in the model. We will remove any term that is not significant. In this case, the axle ratio seems to be insignificant. We will rerun ANOVA with up to 3-way interaction terms included in the model. The new model is:

predictors(p > 0.05) = '';
varNames(p > 0.05) = '';

[p2, t, stats, terms] = anovan(dat.mpg, predictors, ...
    'continuous', 1:length(predictors), 'varnames', varNames, 'model', 3);

Finally, we will perform a regression with only the significant terms.

terms(p2 > 0.05, :) = [];

s = regstats(dat.mpg, cell2mat(predictors), ... [zeros(1, length(predictors)); terms], {'beta', 'yhat', 'r', 'rsquare'});

home;
r_square = s.rsquare


r_square = 0.8478

The R squared value denotes how much of the variation seen in MPG is explained by the model. Let's visually inspect the goodness of the regression.

myPlot(dat.mpg, s.yhat, s.rsquare, 'Highway-Car');
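The r_square value reported above is the coefficient of determination, 1 - SS_res/SS_tot. A small self-contained sketch of that computation (in Python for illustration; the y values are made up):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: fraction of variance explained."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

# Perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0
y = np.array([30.0, 32.0, 28.0, 35.0])
print(r_squared(y, y))
print(r_squared(y, np.full_like(y, y.mean())))
```

An R² of 0.8478 therefore means the model explains roughly 85% of the variance in Highway-Car MPG.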

Create model for other combinations

We will use the above method to create a model for the other three combinations: Highway-Truck, City-Car, City-Truck. The following function modelMPG is simply a compilation of the few cells above for creating a model for MPG.

Highway Truck

modelMPG(data, 'Highway', 'Truck');

r_square = 0.7390


City Car

modelMPG(data, 'City', 'Car');

r_square = 0.8735


City Truck

modelMPG(data, 'City', 'Truck');

r_square = 0.7936


Simulation

Now that we have a model, the final step is to simulate other scenarios. Let's look at the 2007 HONDA ACCORD.

home;
idx = (dat.yr == '2007' & dat.mfrName == 'HONDA' & dat.carline == 'ACCORD');
hondaAccord = dat(idx, {'yr', 'mfrName', 'carline', 'mpg'})

hondaAccord =
    yr      mfrName    carline    mpg
    2007    HONDA      ACCORD     43.6
    2007    HONDA      ACCORD     43.1
    2007    HONDA      ACCORD     43.3
    2007    HONDA      ACCORD     36.6
    2007    HONDA      ACCORD     38.5
    2007    HONDA      ACCORD     35.7
    2007    HONDA      ACCORD     38.1
    2007    HONDA      ACCORD     38.2
    2007    HONDA      ACCORD     37.4

Let's compare this to what the model gives us. We will call simMPG to simulate with the appropriate input arguments.

vars = dat(idx, {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'});
hondaAccord_model = simMPG(vars(:, p < 0.05), terms, s);


hondaMPG = [hondaAccord.mpg'; hondaAccord_model'];
fprintf('\n\nModel validation (mpg):\n\n');
fprintf('Actual Model Diff\n');
fprintf('%6.2f %6.2f %6.2f\n', [hondaMPG; diff(hondaMPG)]);

Model validation (mpg):

Actual    Model    Diff
 43.60    38.90   -4.70
 43.10    38.90   -4.20
 43.30    37.18   -6.12
 36.60    35.06   -1.54
 38.50    35.62   -2.88
 35.70    35.06   -0.64
 38.10    35.62   -2.48
 38.20    34.63   -3.57
 37.40    34.63   -2.77

The model gives similar values. Now, we will simulate the fuel economy for a design where the engine displacement is decreased by 20%.

% Decrease displacement by 20%
vars.cid = vars.cid * 0.8;

hondaAccord_model2 = simMPG(vars(:, p < 0.05), terms, s);

hondaMPG2 = [hondaAccord_model'; hondaAccord_model2'];
fprintf('\n\nModel data (mpg):\n\n');
fprintf('Current Smaller Disp Diff %%Increase\n');
fprintf('%6.2f %6.2f %6.2f %6.2f\n', ...
    [hondaMPG2; diff(hondaMPG2); diff(hondaMPG2)./hondaMPG2(1,:)*100]);

Model data (mpg):

Current    Smaller Disp    Diff    %Increase
 38.90     40.39           1.49    3.83
 38.90     40.39           1.49    3.83
 37.18     38.69           1.50    4.05
 35.06     36.27           1.22    3.47
 35.62     36.97           1.35    3.79
 35.06     36.27           1.22    3.47
 35.62     36.97           1.35    3.79
 34.63     36.00           1.37    3.95
 34.63     36.00           1.37    3.95

Compared to the current configuration, the design with smaller displacement would result in a slightly better fuel economy.

Conclusion

We can now use this model for simulating different scenarios to come up with recommendations for a new automobile design.


warning(warnState);

Copyright 2007 The MathWorks, Inc.
Published with MATLAB® 7.9


Efficacy of coadministration of Drug X with Statin on cholesterol reduction

Contents

Abstract
Data
Preliminary analysis
Pooled comparison: Is the combination therapy better than statin monotherapy?
Effect of Treatment, Statin Dose and Dose by Treatment interaction
Effect of Statin Dose on incremental increase in percentage LDL reduction
Regression analysis: Effect of statin dose on percent LDL C reduction
Secondary Analysis: Consistency of effect across subgroups, age and gender

Abstract

Statins are the most common class of drugs used for treating hyperlipidemia. However, studies have shown that even at their maximum dosage of 80 mg, many patients do not reach the LDL cholesterol goals recommended by the National Cholesterol Education Program Adult Treatment Panel. Combination therapy, in which a second cholesterol-reducing agent that acts via a complementary pathway is coadministered with statin, is one alternative for achieving higher efficacy at lower statin dosage.

In this example, we test the primary hypothesis that coadministering drug X with statin is more effective at reducing cholesterol levels than statin monotherapy.

NOTE The dataset used in this example is purely fictitious.

The analysis presented in this example is adapted from the following publication.

Reference Ballantyne CM, Houri J, Notarbartolo A, Melani L, Lipka LJ, Suresh R, Sun S, LeBeaut AP, Sager PT, Veltri EP; Ezetimibe Study Group. Effect of ezetimibe coadministered with atorvastatin in 628 patients with primary hypercholesterolemia: a prospective, randomized, double-blind trial. Circulation. 2003 May 20;107(19):2409-15.


Data

650 patients were randomly assigned to one of the following 10 treatment groups (65 subjects per group):

Placebo
Drug X (10 mg)
Statin (10, 20, 40 or 80 mg)
Drug X (10 mg) + Statin (10, 20, 40 or 80 mg)

Lipid profile (LDL cholesterol, HDL cholesterol and Triglycerides) was measured at baseline (BL) and at 12 weeks after the start of treatment. In addition to the lipid profile, each patient's age, gender and coronary heart disease (CHD) risk category were also logged at baseline.

The data from the study is stored in a Microsoft Excel (R) file. Note that the data could also be imported from other sources such as text files, any JDBC/ODBC compliant database, SAS transport files, etc.

The columns in the data are as follows:

ID - Patient ID
Group - Treatment group
Dose_S - Dosage of Statin (mg)
Dose_X - Dosage of Drug X (mg)
Age - Patient Age
Gender - Patient Gender
Risk - Patient CHD risk category (1 is high risk, and 3 is low risk)
LDL_BL, HDL_BL & TC_BL - Lipid levels at baseline
LDL_12wk, HDL_12wk & TC_12wk - Lipid levels after treatment

We will import the data into a dataset array, which affords better data management and organization.

% Import data from an Excel file
ds = dataset('xlsfile', 'Data.xls');


Preliminary analysis

Our primary efficacy endpoint is the level of LDL cholesterol. Let us compare the LDL C levels at baseline to the LDL C levels after treatment.

% Use custom scatter plot
LDLplot(ds.LDL_BL, ds.LDL_12wk, 50, 'g')

The mean LDL C level at baseline is around 4.2 mmol/L and the mean level after treatment is around 2.5 mmol/L. So, at least for the data pooled across all the treatment groups, the treatment appears to lower LDL cholesterol levels.

% Use a grouped scatter plot
figure
gscatter(ds.LDL_BL, ds.LDL_12wk, ds.Group)


The grouped plot shows that LDL C levels before the start of treatment have similar means. However, the LDL C levels after treatment differ across treatment groups. The Placebo group shows no improvement. Statin monotherapy seems to outperform the Drug X monotherapy. There is overlap between the Statin and Statin + X groups; however, the combination treatment does seem to perform better than the statin monotherapy. Remember that the "Statin" and "Statin + X" groups are further split based on Statin dose.

In this example, we will use percentage change of LDL C from the baseline level as the primary metric of efficacy.

% Calculate the percentage improvement over baseline level

ds.Change_LDL = ( ds.LDL_BL - ds.LDL_12wk ) ./ ds.LDL_BL * 100 ;
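The same element-wise computation can be sketched outside MATLAB as well; here in Python with made-up lipid values standing in for the fictitious study data:

```python
import numpy as np

ldl_bl   = np.array([4.2, 4.0, 3.5])   # baseline LDL C, toy values
ldl_12wk = np.array([2.5, 3.9, 1.75])  # LDL C after 12 weeks, toy values

# Percentage improvement over baseline, computed element-wise per patient
change_ldl = (ldl_bl - ldl_12wk) / ldl_bl * 100
print(change_ldl)
```

A patient whose LDL C halves (3.5 down to 1.75) scores exactly 50 on this metric, which makes the scale easy to read.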

In the following graph, we can see that

1. In the "Statin" and "Statin + X" group, there appears to be a positive linear correlation between percentage improvement and statin dose

2. Even at the smallest dose of 10 mg, monotherapy with statin seems to be better than the Drug X monotherapy group

% Visualize effect of treatment and statin dose on percentage LDL reduction
figure
gscatter(ds.ID, ds.Change_LDL, {ds.Group, ds.Dose_S})


legend('Location', 'Best')


Pooled comparison: Is the combination therapy better than statin monotherapy ?

First, we will extract the percent change in LDL C level for the Statin and the Statin + X groups only. Using the pooled data, we will test the null hypothesis that the mean percent change in LDL C is the same in the two groups, against the alternative that the "Statin + X" group shows the greater change. We use a 2-sample t-test to test this hypothesis.

% Convert Group into a categorical variable
ds.Group = nominal(ds.Group);

grp1 = ds.Change_LDL(ds.Group == 'Statin');
grp2 = ds.Change_LDL(ds.Group == 'Statin + X');

[h, p] = ttest2(grp1, grp2, .01, 'left')

h =

1

p =

7.6969e-050

We performed a one-tailed hypothesis test to see if the Statin + X group (grp2) is better than the Statin group (grp1). We test against the alternative that the mean LDL change of grp1 (Statin only) is less than the mean LDL change of grp2 (Statin + X).

The null hypothesis is rejected (p < 0.01), implying that the grp1 mean is less than the grp2 mean, i.e. the Statin group is less effective at lowering LDL C levels than the Statin + X group.

The pooled analysis shows that coadministering drug X with statin is more effective than statin monotherapy.
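The equivalent one-tailed two-sample t-test can be sketched in Python with SciPy. The group means, spreads, and sizes below are simulated stand-ins, since the study data is fictitious and not distributed with this example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated percent LDL reductions (toy parameters)
grp1 = rng.normal(loc=43, scale=8, size=260)  # Statin monotherapy
grp2 = rng.normal(loc=56, scale=8, size=260)  # Statin + Drug X

# Alternative hypothesis: mean of grp1 is LESS than mean of grp2,
# matching the 'left' tail used in ttest2 above
t, p = stats.ttest_ind(grp1, grp2, alternative='less')
print(p < 0.01)  # True: reject the null at the 1% level
```

Note the tail direction: a significantly negative t statistic is what supports the claim that the combination therapy reduces LDL C more.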


Effect of Treatment, Statin Dose and Dose by Treatment interaction

Our analysis so far was done on pooled data. We analysed the effect of treatment (statin alone (X = 0) vs. statin + 10 mg X) on the LDL C levels, ignoring the levels of statin dose within each treatment group.

Next, we will perform a 2-way ANOVA (analysis of variance) to simultaneously understand the effect of both factors - statin dose (4 levels - 10, 20, 40, 80 mg) and Treatment (2 levels - Statin only or Statin + 10 mg X) - on the percentage change of LDL C levels.

% First, we filter the data to include only the Statin and Statin + X groups
ds1 = ds(ds.Group == 'Statin' | ds.Group == 'Statin + X', :);

anovan(ds1.Change_LDL, {ds1.Dose_S, ds1.Group}, ...
    'varnames', {'Statin Dose', 'Treatment'});


Effect of Statin Dose on incremental increase in percentage LDL reduction

The ANOVA results indicate that statin dose is a significant factor, but the analysis doesn't compare means across individual dose-treatment combinations. Let's look at the individual cell means.

ds2 = grpstats(ds1, {'Dose_X', 'Dose_S'}, '', 'DataVars', 'Change_LDL')

ds2 =

             Dose_X    Dose_S    GroupCount    mean_Change_LDL
    0_10      0         10        65            34.467
    0_20      0         20        65            40.085
    0_40      0         40        65            47.453
    0_80      0         80        65            52.329
    10_10    10         10        65            50.656
    10_20    10         20        65            54.444
    10_40    10         40        65            58.075
    10_80    10         80        65            61.485

Convert to wide format

ds2 = unstack(ds2, 'mean_Change_LDL', 'Dose_X', ...
    'NewDataVarNames', {'Change_LDL_St', 'Change_LDL_St_X'})

ds2 =

            Dose_S    GroupCount    Change_LDL_St    Change_LDL_St_X
    0_10    10        65            34.467           50.656
    0_20    20        65            40.085           54.444
    0_40    40        65            47.453           58.075
    0_80    80        65            52.329           61.485

From the above table, we can clearly see that the average efficacy of the combination therapy is better than statin monotherapy at all statin dosages.

In the plot of the individual means, notice that the percentage reduction in LDL C levels achieved in the low-dose combination therapy group (~50.5%) is comparable to that achieved in the higher-dose Statin monotherapy group (~49.4%). Thus, combination therapy with Drug X could help patients who cannot tolerate high statin doses.

figure
bar([ds2.Change_LDL_St, ds2.Change_LDL_St_X])
set(gca, 'XTickLabel', [10, 20, 40, 80])
colormap summer
xlabel('Statin Dose Groups (mg)')
ylabel('Percentage reduction of LDL C from Baseline (%)')
legend('Statin', 'Statin + X')


Regression analysis: Effect of statin dose on percent LDL C reduction

In the above graph, there appears to be a linear improvement in the effectiveness metric for both treatment groups. In general, it seems that for every doubling of the statin dose, there is a 5-6 point improvement in the percentage LDL C reduction. Let's fit a linear regression line to the entire dataset, instead of to the mean levels.

x = ds1.Dose_S(ds1.Group == 'Statin');
y = ds1.Change_LDL(ds1.Group == 'Statin');

x1 = ds1.Dose_S(ds1.Group == 'Statin + X');
y1 = ds1.Change_LDL(ds1.Group == 'Statin + X');
cftool

The regression lines for the Statin and the Statin + X groups run almost parallel. This probably indicates that the mechanisms of action of drug X and statins are independent.

% Fit
[m1, m2] = createFit(x, y, x1, y1)

m1 =

Linear model Poly1:
     m1(x) = p1*x + p2
Coefficients (with 95% confidence bounds):
     p1 =      0.2412  (0.2064, 0.2759)
     p2 =       34.54  (32.94, 36.14)

m2 =

Linear model Poly1:
     m2(x) = p1*x + p2
Coefficients (with 95% confidence bounds):
     p1 =      0.1435  (0.116, 0.1709)
     p2 =       50.79  (49.52, 52.05)
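Each Poly1 model above is an ordinary least-squares line. The same computation can be sketched as a first-degree polynomial fit; here in Python for illustration, using the four statin-monotherapy cell means from the earlier table as toy inputs (the actual fits above use all individual patients, not the means):

```python
import numpy as np

# Statin dose (mg) and mean percentage LDL reduction per dose group
dose   = np.array([10, 20, 40, 80], dtype=float)
change = np.array([34.5, 40.1, 47.5, 52.3])

# Least-squares line: m(x) = p1*x + p2, highest-degree coefficient first
slope, intercept = np.polyfit(dose, change, 1)
print(slope > 0)  # True: reduction increases with dose
```

Even on just the four cell means, the slope and intercept land close to the p1 and p2 reported for the Statin group above, which is reassuring.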


Secondary Analysis: Consistency of effect across subgroups, age and gender

Finally, we will make a visual check to ensure that the efficacy of the Statin + X treatment at various statin doses is consistent across gender and age subgroups. We will perform this check for only the Statin + X treatment group.

idx = ds.Group == 'Statin + X';
boxplot(ds.Change_LDL(idx), {ds.Dose_S(idx), ds.Gender(idx)})

We will convert the continuous age variable into a categorical variable with 2 categories: Age < 65 and Age >= 65.

% Convert age into an ordinal array
ds.AgeGroup = ordinal(ds.Age, {'< 65', '>= 65'}, [], [0 65 100]);

% Plot
boxplot(ds.Change_LDL(idx), {ds.Dose_S(idx), ds.AgeGroup(idx)})


Solving Data Management and Analysis Challenges using MATLAB® and Statistics Toolbox™

Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).

In this demo, we will see how we can take a set of data describing performance and characteristics of various cars, and organize, extract, and visualize useful information for further analysis.

Contents

Automobile Data
Dataset Object
Categorical Arrays
Filtering
Concatenate and Join
Dealing with Missing Data
Clean Up
Our Model
Create a New Regression Model
Multivariate Analysis of Variance
Robust Regression
Perform Regression
Substitution

Automobile Data

Now let's begin. We'll work with this MAT file which contains some automobile data.

clear;
load carbig
whos

  Name                Size       Bytes  Class     Attributes

  Acceleration      406x1         3248  double
  Cylinders         406x1         3248  double
  Displacement      406x1         3248  double
  Horsepower        406x1         3248  double
  MPG               406x1         3248  double
  Model             406x36       29232  char
  Model_Year        406x1         3248  double
  Origin            406x7         5684  char
  Weight            406x1         3248  double
  cyl4              406x5         4060  char
  org               406x7         5684  char
  when              406x5         4060  char

This data set contains information regarding 406 different cars. Each variable holds one piece of information, and a given row across all variables corresponds to the same car.

Dataset Object

Dataset objects allow you to organize information in a tabular format, with a structure very much like that of a matrix. Each row represents an observation (a car, in this case), and each column represents a variable, with an appropriate header name.

clc;
cars = dataset(Acceleration, Cylinders, Displacement, Horsepower, ...
    MPG, Model, Model_Year, Origin, Weight)

cars =
    Acceleration    Cylinders    Displacement    Horsepower
    12              8            307             130
    11.5            8            350             165
    11              8            318             150
    12              8            304             150
    10.5            8            302             140
    ... (406 rows in all) ...

    MPG    Model          Model_Year    Origin    Weight
    18     [1x36 char]    70            USA       3504
    15     [1x36 char]    70            USA       3693
    18     [1x36 char]    70            USA       3436
    16     [1x36 char]    70            USA       3433
    17     [1x36 char]    70            USA       3449
    ... (406 rows in all) ...

Page 44: Biomedical Statistics

16.2 [1x36 char] 78 France 3410 31.5 [1x36 char] 78 Germany 1990 29.5 [1x36 char] 78 Japan 2135 21.5 [1x36 char] 79 USA 3245 19.8 [1x36 char] 79 USA 2990 22.3 [1x36 char] 79 USA 2890 20.2 [1x36 char] 79 USA 3265 20.6 [1x36 char] 79 USA 3360 17 [1x36 char] 79 USA 3840 17.6 [1x36 char] 79 USA 3725 16.5 [1x36 char] 79 USA 3955 18.2 [1x36 char] 79 USA 3830 16.9 [1x36 char] 79 USA 4360 15.5 [1x36 char] 79 USA 4054 19.2 [1x36 char] 79 USA 3605 18.5 [1x36 char] 79 USA 3940 31.9 [1x36 char] 79 Germany 1925 34.1 [1x36 char] 79 Japan 1975 35.7 [1x36 char] 79 USA 1915 27.4 [1x36 char] 79 USA 2670 25.4 [1x36 char] 79 Germany 3530 23 [1x36 char] 79 USA 3900 27.2 [1x36 char] 79 France 3190 23.9 [1x36 char] 79 USA 3420 34.2 [1x36 char] 79 USA 2200 34.5 [1x36 char] 79 USA 2150 31.8 [1x36 char] 79 Japan 2020 37.3 [1x36 char] 79 Italy 2130 28.4 [1x36 char] 79 USA 2670 28.8 [1x36 char] 79 USA 2595 26.8 [1x36 char] 79 USA 2700 33.5 [1x36 char] 79 USA 2556 41.5 [1x36 char] 80 Germany 2144 38.1 [1x36 char] 80 Japan 1968 32.1 [1x36 char] 80 USA 2120 37.2 [1x36 char] 80 Japan 2019 28 [1x36 char] 80 USA 2678 26.4 [1x36 char] 80 USA 2870 24.3 [1x36 char] 80 USA 3003 19.1 [1x36 char] 80 USA 3381 34.3 [1x36 char] 80 Germany 2188 29.8 [1x36 char] 80 Japan 2711 31.3 [1x36 char] 80 Japan 2542 37 [1x36 char] 80 Japan 2434 32.2 [1x36 char] 80 Japan 2265 46.6 [1x36 char] 80 Japan 2110 27.9 [1x36 char] 80 USA 2800 40.8 [1x36 char] 80 Japan 2110 44.3 [1x36 char] 80 Germany 2085 43.4 [1x36 char] 80 Germany 2335 36.4 [1x36 char] 80 Germany 2950 30 [1x36 char] 80 Germany 3250 44.6 [1x36 char] 80 Japan 1850 40.9 [1x36 char] 80 France 1835 33.8 [1x36 char] 80 Japan 2145 29.8 [1x36 char] 80 Germany 1845 32.7 [1x36 char] 80 Japan 2910

Page 45: Biomedical Statistics

23.7 [1x36 char] 80 Japan 2420 35 [1x36 char] 80 England 2500 23.6 [1x36 char] 80 USA 2905 32.4 [1x36 char] 80 Japan 2290 27.2 [1x36 char] 81 USA 2490 26.6 [1x36 char] 81 USA 2635 25.8 [1x36 char] 81 USA 2620 23.5 [1x36 char] 81 USA 2725 30 [1x36 char] 81 USA 2385 39.1 [1x36 char] 81 Japan 1755 39 [1x36 char] 81 USA 1875 35.1 [1x36 char] 81 Japan 1760 32.3 [1x36 char] 81 Japan 2065 37 [1x36 char] 81 Japan 1975 37.7 [1x36 char] 81 Japan 2050 34.1 [1x36 char] 81 Japan 1985 34.7 [1x36 char] 81 USA 2215 34.4 [1x36 char] 81 USA 2045 29.9 [1x36 char] 81 USA 2380 33 [1x36 char] 81 Germany 2190 34.5 [1x36 char] 81 France 2320 33.7 [1x36 char] 81 Japan 2210 32.4 [1x36 char] 81 Japan 2350 32.9 [1x36 char] 81 Japan 2615 31.6 [1x36 char] 81 Japan 2635 28.1 [1x36 char] 81 France 3230 NaN [1x36 char] 81 Sweden 2800 30.7 [1x36 char] 81 Sweden 3160 25.4 [1x36 char] 81 Japan 2900 24.2 [1x36 char] 81 Japan 2930 22.4 [1x36 char] 81 USA 3415 26.6 [1x36 char] 81 USA 3725 20.2 [1x36 char] 81 USA 3060 17.6 [1x36 char] 81 USA 3465 28 [1x36 char] 82 USA 2605 27 [1x36 char] 82 USA 2640 34 [1x36 char] 82 USA 2395 31 [1x36 char] 82 USA 2575 29 [1x36 char] 82 USA 2525 27 [1x36 char] 82 USA 2735 24 [1x36 char] 82 USA 2865 23 [1x36 char] 82 USA 3035 36 [1x36 char] 82 Germany 1980 37 [1x36 char] 82 Japan 2025 31 [1x36 char] 82 Japan 1970 38 [1x36 char] 82 USA 2125 36 [1x36 char] 82 USA 2125 36 [1x36 char] 82 Japan 2160 36 [1x36 char] 82 Japan 2205 34 [1x36 char] 82 Japan 2245 38 [1x36 char] 82 Japan 1965 32 [1x36 char] 82 Japan 1965 38 [1x36 char] 82 Japan 1995 25 [1x36 char] 82 USA 2945 38 [1x36 char] 82 USA 3015 26 [1x36 char] 82 USA 2585 22 [1x36 char] 82 USA 2835

Page 46: Biomedical Statistics

32 [1x36 char] 82 Japan 2665 36 [1x36 char] 82 USA 2370 27 [1x36 char] 82 USA 2950 27 [1x36 char] 82 USA 2790 44 [1x36 char] 82 Germany 2130 32 [1x36 char] 82 USA 2295 28 [1x36 char] 82 USA 2625 31 [1x36 char] 82 USA 2720

The summary function provides basic statistical information for each of the variables in the dataset object. Notice that there are some missing values for Horsepower and MPG, denoted by NaN.

clc
summary(cars)

Acceleration: [406x1 double]
    min    1st Q    median    3rd Q    max
    8      13.7     15.5      17.2     24.8
Cylinders: [406x1 double]
    min    1st Q    median    3rd Q    max
    3      4        4         8        8
Displacement: [406x1 double]
    min    1st Q    median    3rd Q    max
    68     105      151       302      455
Horsepower: [406x1 double]
    min    1st Q    median    3rd Q    max    NaNs
    46     75.5     95        130      230    6
MPG: [406x1 double]
    min    1st Q    median    3rd Q    max     NaNs
    9      17.5     23        29       46.6    8
Model: [406x36 char]
Model_Year: [406x1 double]
    min    1st Q    median    3rd Q    max
    70     73       76        79       82
Origin: [406x7 char]
Weight: [406x1 double]
    min     1st Q    median    3rd Q    max
    1613    2226     2822.5    3620     5140

We can index into a dataset object like a regular matrix.

clc
cars(1:10, :)

ans = 
    Acceleration    Cylinders    Displacement    Horsepower
    12              8            307             130
    11.5            8            350             165
    11              8            318             150
    12              8            304             150
    10.5            8            302             140
    10              8            429             198
    9               8            454             220
    8.5             8            440             215
    10              8            455             225
    8.5             8            390             190

    MPG    Model          Model_Year    Origin    Weight
    18     [1x36 char]    70            USA       3504
    15     [1x36 char]    70            USA       3693
    18     [1x36 char]    70            USA       3436
    16     [1x36 char]    70            USA       3433
    17     [1x36 char]    70            USA       3449
    15     [1x36 char]    70            USA       4341
    14     [1x36 char]    70            USA       4354
    14     [1x36 char]    70            USA       4312
    14     [1x36 char]    70            USA       4425
    15     [1x36 char]    70            USA       3850

We can access individual columns by referencing them by name.

clc
cars(1:10, {'Origin', 'MPG', 'Weight'})

ans = 
    Origin    MPG    Weight
    USA       18     3504
    USA       15     3693
    USA       18     3436
    USA       16     3433
    USA       17     3449
    USA       15     4341
    USA       14     4354
    USA       14     4312
    USA       14     4425
    USA       15     3850

The dot-notation allows you to extract the whole content of a variable.

clc
cars.Horsepower(1:10)

ans =
   130
   165
   150
   150
   140
   198
   220
   215
   225
   190

Dataset objects also store metadata.


get(cars)

       Description: ''
    VarDescription: {}
             Units: {}
          DimNames: {'Observations'  'Variables'}
          UserData: []
          ObsNames: {}
          VarNames: {1x9 cell}

We can add dataset descriptions as well as units for the variables.

clc
cars = set(cars, 'Description', 'Performance and structural information of automobiles');
cars = set(cars, 'Units', {'m/s^2', '', 'mm', 'hp', 'mpg', '', '', '', 'kg'});

summary(cars)

Performance and structural information of automobiles

Acceleration: [406x1 double, Units = m/s^2]
    min    1st Q    median    3rd Q    max
    8      13.7     15.5      17.2     24.8
Cylinders: [406x1 double]
    min    1st Q    median    3rd Q    max
    3      4        4         8        8
Displacement: [406x1 double, Units = mm]
    min    1st Q    median    3rd Q    max
    68     105      151       302      455
Horsepower: [406x1 double, Units = hp]
    min    1st Q    median    3rd Q    max    NaNs
    46     75.5     95        130      230    6
MPG: [406x1 double, Units = mpg]
    min    1st Q    median    3rd Q    max     NaNs
    9      17.5     23        29       46.6    8
Model: [406x36 char]
Model_Year: [406x1 double]
    min    1st Q    median    3rd Q    max
    70     73       76        79       82
Origin: [406x7 char]
Weight: [406x1 double, Units = kg]
    min     1st Q    median    3rd Q    max
    1613    2226     2822.5    3620     5140

Categorical Arrays


Notice that some of the variables take on discrete values. For instance, Cylinders and Origin each take on a small set of unique values:

clc
disp('Cylinders:');
unique(cars(:, 'Cylinders'))

disp('Origin:');
unique(cars(:, 'Origin'))

Cylinders:
ans = 
    Cylinders
    3
    4
    5
    6
    8
Origin:
ans = 
    Origin
    England
    France
    Germany
    Italy
    Japan
    Sweden
    USA

Categorical arrays provide significant memory savings. We will convert Cylinders to an ordinal array, which contains ordering information. The variable Origin will be converted to a nominal array, which does not store ordering.

clc
Cylinders_cat = ordinal(Cylinders);
Origin_cat = nominal(Origin);

whos Cylinders* Origin*

  Name             Size     Bytes    Class      Attributes
  Cylinders        406x1    3248     double
  Cylinders_cat    406x1    1178     ordinal
  Origin           406x7    5684     char
  Origin_cat       406x1    1366     nominal

Now, let's convert the variables of the dataset object.

cars.Cylinders = ordinal(cars.Cylinders);
cars.Origin = nominal(cars.Origin);

Filtering


Dataset objects can be easily filtered by criteria.

For example, we can create a logical array that contains ones where the origin is Germany and zeros where it is not.

germanyMask = cars.Origin == 'Germany'

germanyMask =
     0
     0
     0
     ...
     1
     ...
     0

(a 406-by-1 logical vector with ones marking the German cars; output truncated)

Use the mask to extract all of the German cars.

clc
cars(germanyMask, :)

ans = 
    Acceleration    Cylinders    Displacement    Horsepower
    20.5            4            97              46
    14.5            4            107             90
    12.5            4            121             113
    20              4            97              48
    14              4            116             90
    19              4            97              60
    ...

    MPG     Model          Model_Year    Origin     Weight
    26      [1x36 char]    70            Germany    1835
    24      [1x36 char]    70            Germany    2430
    26      [1x36 char]    70            Germany    2234
    NaN     [1x36 char]    71            Germany    1978
    28      [1x36 char]    71            Germany    2123
    27      [1x36 char]    71            Germany    1834
    ...

(all German cars are listed, through a 44-MPG 1982 model; output truncated)

Scatter plot grouped by the year of the make.

gscatter(cars.MPG, cars.Weight, cars.Model_Year, '', 'xos');
xlabel('Miles per Gallon')
ylabel('Weight')


We notice a general trend, but the sheer amount of data prevents us from extracting useful information.

We can use filtering to refine the visualization. Let's extract out only the cars made in 1970, 1976, or 1982.

index = cars.Model_Year == 70 | cars.Model_Year == 76 | cars.Model_Year == 82;
filtered = cars(index, :);

We have a more meaningful scatter plot for this smaller subset.

gscatter(filtered.MPG, filtered.Weight, filtered.Model_Year, '', 'xos');
xlabel('Miles per Gallon')
ylabel('Weight')


Add interactive case names to the plot

gname(filtered.Model);


Concatenate and Join

We can combine datasets by either concatenating or joining.

Concatenate

We have a different set of data that corresponds to small cars. Let's combine this with the original dataset. First, we'll create a dataset object from this data.

% load carsmall
cs = load('carsmall.mat');

% create dataset and convert variables to categorical arrays
cars_s = dataset(cs);
cars_s.Origin = nominal(cars_s.Origin);
cars_s.Cylinders = ordinal(cars_s.Cylinders);

% add additional levels to match the levels from the carbig dataset
cars_s.Cylinders = addlevels(cars_s.Cylinders, getlabels(cars.Cylinders));
cars_s.Cylinders = reorderlevels(cars_s.Cylinders, getlabels(cars.Cylinders));

Warning: Ignoring duplicate levels in NEWLEVELS.

Concatenate using the matrix concatenation notation.


cars_all = [cars; cars_s];
% alternatively,
% cars_all = vertcat(cars, cars_s);

Join

Joining allows you to take the data in one dataset array and assign it to the rows of another dataset array, based on matching values in a common key variable.

clc
tabulate(cars_all.Origin);

      Value    Count    Percent
    England        1      0.20%
     France       18      3.56%
    Germany       48      9.49%
      Italy        9      1.78%
      Japan       94     18.58%
     Sweden       13      2.57%
        USA      323     63.83%

Create a new dataset that maps countries to continents.

clc
Newdata = dataset( ...
    {nominal({'England';'France';'Germany';'Italy';'Japan';'Sweden';'USA'}), 'Origin'}, ...
    {nominal({'Europe';'Europe';'Europe';'Europe';'Asia';'Europe';'North America'}), 'Continent'})

Newdata = 
    Origin     Continent
    England    Europe
    France     Europe
    Germany    Europe
    Italy      Europe
    Japan      Asia
    Sweden     Europe
    USA        North America

Join the two datasets to include Continent as a new variable.

cars_all = join(cars_all, Newdata);

clc
cars_all(1:10:100, :)

ans = 
    Acceleration    Cylinders    Displacement    Horsepower
    12              8            307             130
    17.5            4            133             115
    15              4            113             95
    15              6            199             90
    13              6            232             100
    12              8            400             170
    19              4            71              65
    12              8            400             175
    14              8            307             130
    15              4            98              80

    MPG    Model          Model_Year    Origin    Weight
    18     [1x36 char]    70            USA       3504
    NaN    [1x36 char]    70            France    3090
    24     [1x36 char]    70            Japan     2372
    21     [1x36 char]    70            USA       2648
    19     [1x36 char]    71            USA       2634
    13     [1x36 char]    71            USA       4746
    31     [1x36 char]    71            Japan     1773
    14     [1x36 char]    72            USA       4385
    13     [1x36 char]    72            USA       4098
    28     [1x36 char]    72            USA       2164

    Continent
    North America
    Europe
    Asia
    North America
    North America
    North America
    Asia
    North America
    North America
    North America


Dealing with Missing Data

Notice that we have some missing values in our MPG variable.

clc
cars(5:20, 'MPG')

ans = 
    MPG
    17
    15
    14
    14
    14
    15
    NaN
    NaN
    NaN
    NaN
    NaN
    15
    14
    NaN
    15
    14

One way to deal with missing data is to substitute for the missing values. In this case, we will create a regression model to represent the performance measure (MPG) as a function of possible predictor variables (acceleration, cylinders, displacement, horsepower, model year, and weight).

X = [ones(length(cars.MPG),1), cars.Acceleration, double(cars.Cylinders), ...
     cars.Displacement, cars.Horsepower, cars.Model_Year, cars.Weight];
Y = cars.MPG;
[b, bint, r, rint, stats] = regress(Y, X);

Note that cars.Horsepower contains NaNs. The regress function performs listwise deletion, dropping any observation with a missing value among the independent variables.

cars.regress = X * b;
fprintf('R-squared: %f\n', stats(1));

R-squared: 0.814178

Examine the residuals.

residuals = cars.MPG - cars.regress;
stem(cars.regress, residuals);
xlabel('model'); ylabel('actual - model');


For cars with low or high MPG, the model seems to underestimate the MPG, while for cars in the middle, the model overestimates the true value.

gname(cars.Model)


Clean Up Our Model

We can potentially improve the model by adding dummy variables for diesels, automatic transmissions, and station wagons. In addition, we can filter out the 3- and 5-cylinder engines, which are rotary engines.

A dummy variable is a binary variable that is 1 where the observation satisfies the criterion and 0 everywhere else.

% Load in as a dataset object
ds = dataset('file', 'dummy.txt');

% Concatenate
carsall = [cars, ds];
carsall = set(carsall, 'Units', [get(cars, 'Units'), {'', '', '', ''}]);

% Filter out 3- and 5-cylinder engines
index = carsall.Cylinders == '4' | carsall.Cylinders == '6' | carsall.Cylinders == '8';
carsall = carsall(index, :);


Create a New Regression Model

Create a new regression model looking only at 4-, 6-, and 8-cylinder cars and taking into account the car type (station wagon, diesel, automatic).

X = [ones(length(carsall.MPG),1), carsall.Acceleration, ...
     double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ...
     carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ...
     carsall.Automatic];
Y = carsall.MPG;
[b, bint, r, rint, stats] = regress(Y, X);

carsall.regress = X * b;

residuals2 = carsall.MPG - carsall.regress;
stem(carsall.regress, residuals2)
xlabel('model'); ylabel('actual - model');
gname(carsall.Model)


Multivariate Analysis of Variance

We use multivariate analysis of variance to see how similar the cars from various countries are in terms of MPG, Acceleration, Weight, and Displacement.

X = [carsall.MPG, carsall.Acceleration, carsall.Weight, carsall.Displacement];
[d, p, stats] = manova1(X, carsall.Origin);
manovacluster(stats)

We see that Japanese and German cars are quite similar, and that they are very different from English and American cars.

Let's add another dummy variable that distinguishes Japanese and German cars, then redo the regression.

carsall.dummy = (carsall.Origin == 'Germany' | carsall.Origin == 'Japan');

X = [ones(length(carsall.MPG),1), carsall.Acceleration, ...
     double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ...
     carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ...
     carsall.Automatic, carsall.dummy];
Y = carsall.MPG;
[b, bint, r, rint, stats] = regress(Y, X);

carsall.regress = X * b;


% Inspect once again
residuals2 = carsall.MPG - carsall.regress;
stem(carsall.regress, residuals2)
xlabel('model'); ylabel('actual - model');
gname(carsall.Model)

Robust Regression

We can also perform robust regression to deal with the outliers that may exist in the dataset.

X2 = [carsall.Acceleration, double(carsall.Cylinders), ...
      carsall.Displacement, carsall.Horsepower, carsall.Model_Year, ...
      carsall.Weight, carsall.SW, carsall.Diesel, carsall.Automatic, carsall.dummy];
[robustbeta, stats] = robustfit(X2, Y)
X3 = [ones(length(carsall.MPG),1), X2];
carsall.regress2 = X3 * robustbeta;

robustbeta =
   -4.1872
   -0.2086
   -1.6006
    0.0126
   -0.0166
    0.6444
   -0.0048
    0.1167
   12.4719
   -2.6949
    1.7508

stats = 
        ols_s: 2.9910
     robust_s: 2.8012
        mad_s: 2.6344
            s: 2.8477
        resid: [399x1 double]
        rstud: [399x1 double]
           se: [11x1 double]
         covb: [11x11 double]
    coeffcorr: [11x11 double]
            t: [11x1 double]
            p: [11x1 double]
            w: [399x1 double]
            R: [11x11 double]
          dfe: 374
            h: [399x1 double]

Perform Regression Substitution

We have been looking at linear regressions so far, but we might be able to apply some nonlinear regressions to get a better predictive model.
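As a sketch of what a nonlinear fit might look like — the model form, variable choice, and starting values here are illustrative assumptions we are adding, not part of the original demo — we could regress MPG on weight alone with an exponential model using nlinfit:

```matlab
% Hypothetical sketch: exponential model MPG ~ b1 + b2*exp(-b3*Weight)
modelFun = @(b, w) b(1) + b(2) .* exp(-b(3) .* w);
ok = ~isnan(carsall.MPG) & ~isnan(carsall.Weight);   % nlinfit needs complete cases
beta0 = [20; 40; 1e-3];                              % rough starting guesses
betaHat = nlinfit(carsall.Weight(ok), carsall.MPG(ok), modelFun, beta0);
mpgHat = modelFun(betaHat, carsall.Weight);          % predictions for all cars
```

Comparing the residuals of such a fit against the linear model's residuals would show whether the extra flexibility is worthwhile.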

Once we have a regression model, we can go ahead and substitute the model's predictions for the missing values.

carsall.mask = isnan(carsall.MPG);
carsall.MPG(carsall.mask) = carsall.regress2(carsall.mask);

carsall(5:20, 'MPG')

ans = 
    MPG
    17
    15
    14
    14
    14
    15
    18.881
    12.256
    13.095
    12.597
    13.732
    15
    14
    16.498
    15
    14


Biomedical Statistics & Curve Fitting with MATLAB

Data analysis toolboxes: Signal Processing, Image Processing, Wavelets, Statistics, Curve Fitting, Neural Network.

Statistics

Probability distributions and fitting. Hypothesis testing. Multivariate analysis. Clustering (hierarchical, k-means).

Curve Fitting

Simple linear models (residuals). Robust fitting (outlier insensitivity). Automatic code generation.

disttool: a GUI for exploring PDFs and CDFs.

It starts by showing the CDF of the normal distribution, but we will switch to the more familiar PDF. There is a variety of distributions to look at; we will choose the gamma distribution. As we change the parameter values, we can see how the PDF changes. There is another GUI for looking at random number generation. We will choose the gamma distribution again and generate one hundred random values. Notice how the histogram mimics the shape of the PDF we saw, but it varies as we generate new samples.

randtool: a GUI for random number generation.

If we increase the sample size (from 100 to 1000), the shape is more consistent and smoother.

Now let us try fitting a univariate distribution to some real data. We have data collected on the life spans of 100 fruit flies, and we would like to model these data by fitting a univariate probability distribution. First, load the data into the MATLAB workspace.

clear all


load medflies.mat
whos

The workspace now contains a data vector called medflies with the measured life spans in days. Let's look at the data. A number of specialized graphics functions are available, such as probability plots and empirical CDF plots, but we will stick to a simple histogram.

hist(medflies, 0.25:0.5:6.25)

We can control the bin locations with the second argument of the hist function. We can change the bar color as well.

set(get(gca, 'Children'), 'FaceColor', [0.9 0.9 0.9])

From the histogram, the data are skewed to the right, so the normal distribution is probably not a good model. Instead, let us use a gamma distribution, which does have a right skew.

The toolbox has fitting functions for a variety of univariate distributions. For example, the function gamfit fits a gamma distribution using maximum likelihood estimation.

[paramEsts, confInts] = gamfit(medflies)

paramEsts =
    4.3672    0.40962

confInts =
    5.4652    0.50245

The gamfit function returns point estimates for the parameters as well as confidence intervals. For example, the maximum likelihood estimate of the shape parameter is about 4.4, and its 95% confidence interval goes from 3.3 to 5.5.

[fitMean, fitVar] = gamstat(paramEsts(1), paramEsts(2))

fitMean =
    1.7889
fitVar =
    0.73278

To judge how well the gamma model fits these data, we can overlay its density function on the histogram. Use gampdf to compute the density along the range of the lifespan data.

yFit = gampdf(0:0.1:7, paramEsts(1), paramEsts(2));

hold on, plot(0:0.1:7, yFit*100*0.5, 'r'), hold off

The toolbox also has functions to calculate CDFs and inverse CDFs. The plot indicates that the gamma distribution may not be entirely suitable for these data, but the apparent lack of fit may be an artifact of the small sample size.
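As a brief sketch of the CDF and inverse-CDF counterparts of gampdf (the specific values queried here are our own illustration, not from the demo), the fitted model's parameter estimates can be reused directly:

```matlab
% Evaluate the fitted gamma CDF and its inverse, assuming paramEsts from gamfit
x = 0:0.1:7;
Fx = gamcdf(x, paramEsts(1), paramEsts(2));      % P(lifespan <= x) under the model
q95 = gaminv(0.95, paramEsts(1), paramEsts(2));  % 95th percentile of fitted lifespans
```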

One way to judge would be to use a Monte Carlo simulation to investigate how good a fit we might expect.


We don't have time to do that here, but the general idea would be to simulate data from the model that we just fit. As we saw earlier, the toolbox has random number generators for a variety of distributions. From the command line, it is a simple matter to generate a random sample of a hundred values from our fitted model using the gamrnd function and the estimated parameters.

r = gamrnd(paramEsts(1), paramEsts(2), 1, 100)

(a long list of numbers is displayed)
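The Monte Carlo idea mentioned above could be sketched as follows — this loop is an illustration we are adding under the stated assumptions, not code from the original demo:

```matlab
% Sketch: repeatedly simulate samples of size 100 from the fitted gamma model
% and refit, to see how much the gamfit estimates vary by chance alone
nSim = 1000;
simEsts = zeros(nSim, 2);
for i = 1:nSim
    rSim = gamrnd(paramEsts(1), paramEsts(2), 100, 1);  % simulated lifespans
    simEsts(i, :) = gamfit(rSim);                       % refit shape and scale
end
hist(simEsts(:, 1), 30)   % sampling distribution of the shape estimate
```

If the observed discrepancy between histogram and fitted density is typical of these simulated samples, the lack of fit is plausibly just sampling variability.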

Another approach we can take is to fit a semiparametric model to the data.

ySmooth = ksdensity(medflies, 0:0.01:7);

The ksdensity function computes a kernel-smooth estimate of the PDF for our data. We overlay that on the histogram too.

hold on, plot(0:0.01:7, ySmooth*100*0.5, 'r'), hold off

legend('Parametric fit', 'Kernel smooth')

The kernel smooth captures the mode and the right tail of the data better than the parametric model.

This difference would probably be worth investigating, but we won't pursue it here.

We have estimated the parameters of our gamma fit using maximum likelihood, and computed confidence intervals that rely on the standard large-sample assumption of asymptotic normality.

The bootstrap is a useful tool when the large-sample approximation may or may not apply. For example, we can use the bootstrap to estimate standard errors for our parameter estimates instead of relying on the confidence intervals we saw earlier.

First, we create 500 bootstrap replicates.

b = bootstrp(500, 'gamfit', medflies);
b(1:10, :)

(a 10-by-2 array of replicate parameter estimates is displayed)

These are the first few of the 500 replicates; the two columns contain the shape and scale estimates. If we compute the standard deviations along these columns, we get a bootstrap estimate of the standard errors of the actual parameter estimates.

The toolbox has a variety of functions to calculate descriptive statistics; here we need the function std.


std(b)

ans =
    0.76446    0.077319

We can also look at the replicates with a QQ plot to help determine whether asymptotic normality applies.

qqplot(b(:, 2))

Here we plot the replicates for the second parameter. The bootstrap replicates are a little skewed relative to a normal distribution, indicating that the assumption of normality for the actual parameter estimates is probably not valid. So bootstrapping was a good idea in this case.

Comparing Multiple Samples

We use data collected in an experiment to determine how well subjects could distinguish a flickering light from a continuous one. The measured variable was the maximum flicker rate that could be distinguished by each subject.

We will investigate how a subject's eye color affects this ability, making use of the toolbox's command-line functions for graphing, descriptive statistics, and hypothesis testing. Load the data:

clear all
load flicker.mat
whos

(the workspace contains the variables Color, Flicker, and Group)

Use boxplot to show the flicker-rate data broken down by eye color.

boxplot(Flicker, Color);

Each box summarizes the data for one eye color. For brown-eyed subjects, for example, the red line across the box shows that the median rate is about 25 Hz. The top and bottom edges of the box indicate the upper and lower quartiles for brown-eyed subjects, and the whiskers above and below the box show how far the other values extend. One possible outlying value is plotted as a separate point. The other boxes give the same information for the other eye colors. Apparently the rate is larger for green and blue eyes, although there is enough variability within each eye color that the distributions overlap.


We would like to know whether the differences are significant or simply due to random chance. We could compare any two eye colors with a two-sample test, but instead we will use one-way ANOVA to perform an overall significance test of the differences between the three eye-color groups all at once.

[p, table, stats] = anova1(Flicker, Color)

The anova1 function returns values at the command line, but it also produces displays: a box plot like before and an ANOVA table, including the F-statistic (13.6), which corresponds to a very small p-value of 2.289e-05.

Thus the ANOVA confirms what we suspected from the box plot: there are significant differences between eye colors on average.

Now we can go on to determine which groups are different from each other. This process is known as multiple comparison.

We can use the multcompare function, supplying it with the output from the anova1 function.

[comparison, means] = multcompare(stats)

The result is an interactive display that we can use to see which groups differ from which others by comparing interval estimates of the group effects. By clicking on a group, we can see which other groups are significantly different (their intervals are disjoint) and which are not (their intervals overlap).

Brown-eyed subjects differ significantly on average from blue-eyed subjects. The green-eyed group has a wide interval, reflecting the small sample size for that eye color. Because we can't make a precise estimate for green, we are not able to claim that the effect for these subjects is significantly different from the effects of the other groups. If there had been more green-eyed subjects, we might have been able to detect the difference.

Multivariate Analysis

Multivariate analysis consists of methods that operate on multiple variables at one time. We will use the famous Fisher iris data set.

clear all
load fisheriris
whos

(the workspace contains meas, 150x4, and species, 150x1)

The data consist of four measurements taken on 150 specimens of iris flowers from three different species. Each flower has measurements of its sepal length and width and its petal length and width; the sepal is the lower, droopy part of the flower. We also know the species of each specimen. Let us look at a scatter plot of the sepal measurements, with points color-coded by species.


gscatter(meas(:,1), meas(:,2), species)
xlabel('sepal length'), ylabel('sepal width')

We see a well-separated cluster at the top left consisting of the points from the setosa species. Looking only at the sepal measurements, the points from the other two species overlap. However, we haven't yet used the petal measurements; with those, we may still be able to distinguish the species.

Let us look at all four measurements together in a scatter plot matrix.

gplotmatrix(meas, [], species, [], [], [], 'off')

This plot consists of an array of scatter plots, each showing a pair of measurements. Histograms along the diagonal show the distribution of each measurement. We can see that some pairs of variables give separation between species, but it would be best to use all four measurements together. The problem is that we cannot visualize four dimensions without using tricks.

Since the data have a natural grouping by species, let us see whether cluster analysis can recover it.

In hierarchical cluster analysis we compute the dissimilarities between our specimens and then create a linkage tree from those dissimilarities. A variety of dissimilarity measures and linkage types are available; here we will use Euclidean distance and average linkage.

dist = pdist(meas, 'euclidean');
tree = linkage(dist, 'average');
[h, nodeNum] = dendrogram(tree, 20);

The dendrogram function plots the linkage tree. This is a way to visualize the distances between the four-dimensional measurements and thereby look for clusters. Starting from the bottom of the tree, the specimens that are closest are joined together, then the next closest, and finally all specimens are joined. The y-axis is a measure of the distance between specimens as they are joined. Notice that the final joining takes place at a distance of about 4, much higher than the next-highest join. This is an indication that there are two well-separated groupings in the data.

The clarity of this plot comes from condensing the 150 flower specimens into only 20 leaf nodes of the dendrogram.

The second output of dendrogram indicates which specimens fall into which nodes. For example, the specimens in node 8 are all versicolor.

species(nodeNum == 8)

ans = 'versicolor' 'versicolor' 'versicolor'


species(ismember(nodeNum, [1 2 15 4]))

Columns 1 through 7: 'setosa' ...

We can confirm that the setosa specimens are the ones that make up the group of nodes on the right.

The ismember function here picks out the specimens whose node number is one of a given set of values.

K-means is another clustering method. Instead of producing a cluster hierarchy, k-means partitions the data into a specified number of clusters based on the distances between observations. For example, we can find the best partition of the data into three clusters.

[clustIdx, clustCtrs] = kmeans(meas, 3, 'dist', 'cos', 'rep', 5);

Here we used the cosine distance.

The kmeans function returns a list of cluster indices, one for each specimen, defining which cluster each specimen has been assigned to.

The second output contains the centroids, which in this case are four-dimensional values representing the average measurements of the specimens in each of the three clusters.

You can visualize the clusters with a parallel coordinates plot, in which each specimen is represented as a series of its four measurements. The plot in this case is highly specialized, but it can be put together with high- and low-level MATLAB graphics functions.
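One minimal way such a plot could be put together from basic graphics calls is sketched below. This is an illustrative sketch, not the demo's actual plotting code: it assumes the meas matrix and the clustIdx vector from the kmeans call above, and the color choices are arbitrary.

```matlab
% Sketch: a simple parallel coordinates plot of the four measurements,
% colored by k-means cluster. Assumes meas (150x4) and clustIdx exist.
colors = {'r', 'g', 'b'};
hold on
for k = 1:3
    % Each specimen becomes one line through its four measurements
    plot(1:4, meas(clustIdx == k, :)', colors{k})
end
hold off
set(gca, 'XTick', 1:4, 'XTickLabel', {'SL', 'SW', 'PL', 'PW'})
ylabel('measurement (cm)')
```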

The plot shows the different clusters in different colors. You can see how k-means partitioned the data based on where the sepal and petal measurements overlap. It turns out that this does a pretty good job of distinguishing the species: only five specimens end up in the wrong cluster. Those are the ones shown in black.
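One way to arrive at that count is sketched below: map each cluster to its majority species, then tally the specimens that disagree. This is a hedged sketch using the variable names assumed from earlier in the demo (clustIdx and species).

```matlab
% Sketch: count specimens whose k-means cluster disagrees with the
% majority species of that cluster. Assumes clustIdx and species exist.
nMisclassified = 0;
for k = 1:3
    inCluster = (clustIdx == k);
    % Find the most common species label within cluster k
    [names, ~, idx] = unique(species(inCluster));
    majority = names{mode(idx)};
    % Count specimens in the cluster that are not the majority species
    nMisclassified = nMisclassified + sum(~strcmp(species(inCluster), majority));
end
nMisclassified   % the demo reports 5 for this data
```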

Not bad, considering we didn't even use the species information, only the measurements.

To explicitly incorporate the species information into the analysis, we use a classification method known as decision trees.

The treefit function creates a classification tree, and the treedisp function displays it.

A tree is a sequence of binary decisions.


t = treefit(meas, species);
treedisp(t, 'names', {'SL' 'SW' 'PL' 'PW'})

We can use the tree to classify an observation as one of the three species. Each node of the tree defines a split of the data based on one of the four measurements.

For example, at the top node petal length (PL) is used to distinguish between the setosa species and the other two.

If we click on the two nodes at the second level, we see that the branch on the left contains only the setosa species, while the branch on the right contains the other two species.

A few of the observations are actually misclassified, but that is to be expected.

Controls on the tree display allow you to prune the tree to a desired size.

The tool can also choose an optimal tree size based on cross-validation, though we don't have time to cover that here.
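The cross-validation step skipped here could be sketched as follows. This is a hedged sketch using the older Statistics Toolbox functions treetest and treeprune that match the treefit/treedisp API used in this demo; in current MATLAB releases fitctree and its cvloss method replace these functions.

```matlab
% Sketch: choose a tree size by cross-validation (older toolbox API).
% Assumes t, meas, and species from earlier in the demo.
[cost, secost, ntnodes, bestlevel] = treetest(t, 'cross', meas, species);

% Prune the tree back to the best pruning level found by cross-validation
tPruned = treeprune(t, 'level', bestlevel);
treedisp(tPruned, 'names', {'SL' 'SW' 'PL' 'PW'})
```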