Title of Project: Applying a Linear Regression Model to a Real-World Problem
One of the most popular and frequently used techniques in statistics is linear regression where you predict a real-valued output based on an input value. Technically, linear regression is a statistical technique to analyze/predict the linear relationship between a dependent variable and one or more independent variables.
Let’s say you want to predict the price of a house. The price is the dependent variable, and factors like the size of the house, its locality, and the season of purchase act as independent variables, because the price depends on them.
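As a minimal sketch of the house-price example, the code below fits a one-variable linear model. The data frame `houses` and every value in it are invented purely for illustration; they do not come from any real data set.

```r
# Hypothetical house data, made up for this sketch only
houses <- data.frame(
  size_sqft = c(850, 1200, 1500, 1800, 2100, 2400),
  price     = c(95, 130, 160, 195, 220, 255)  # price in thousands (invented)
)

# price is the dependent variable, size_sqft the independent variable
fit <- lm(price ~ size_sqft, data = houses)
coef(fit)  # intercept and slope of the fitted line

# Predicted price for a hypothetical 2000 sq ft house
pred <- predict(fit, newdata = data.frame(size_sqft = 2000))
pred
```

More independent variables (locality, season) would simply be added to the right-hand side of the formula.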
R comes with many built-in data sets. Calling data() lists them, and loading the MASS library adds its data sets to the list.
Steps:
Open RStudio, then install and load the MASS package and list the data sets using the commands below:
> install.packages("MASS")
> library(MASS)
> data()
This will give you a list of available data sets, which you can use to get a clear idea of linear regression problems.
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different
diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock
Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson
Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure
Personal Expenditure Data
UScitiesD Distances Between European Cities and
Between US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe's Quartet of 'Identical' Simple
Linear Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of
Australian Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights by Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student's 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and
Between US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the UK
freeny Freeny's Revenue Data
freeny.x (freeny) Freeny's Revenue Data
freeny.y (freeny) Freeny's Revenue Data
infert Infertility after Spontaneous and Induced
Abortion
iris Edgar Anderson's Iris Data
iris3 Edgar Anderson's Iris Data
islands Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the UK
lh Luteinizing Hormone in Blood Samples
longley Longley's Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at Nottingham,
1920-1939
npk Classical N, P, K Factorial Experiment
occupationalStatus Occupational Status of Fathers and their
Sons
precip Annual Precipitation in US Cities
presidents Quarterly Approval Ratings of US Presidents
pressure Vapor Pressure of Mercury as a Function of
Temperature
quakes Locations of Earthquakes off Fiji
randu Random Numbers from Congruential Generator
RANDU
rivers Lengths of Major North American Rivers
rock Measurements on Petroleum Rock Samples
sleep Student's Sleep Data
stack.loss (stackloss)
Brownlee's Stack Loss Plant Data
stack.x (stackloss) Brownlee's Stack Loss Plant Data
stackloss Brownlee's Stack Loss Plant Data
state.abb (state) US State Facts and Figures
state.area (state) US State Facts and Figures
state.center (state)
US State Facts and Figures
state.division (state)
US State Facts and Figures
state.name (state) US State Facts and Figures
state.region (state)
US State Facts and Figures
state.x77 (state) US State Facts and Figures
sunspot.month Monthly Sunspot Data, from 1749 to "Present"
sunspot.year Yearly Sunspot Data, 1700-1988
sunspots Monthly Sunspot Numbers, 1749-1983
swiss Swiss Fertility and Socioeconomic Indicators
(1888) Data
treering Yearly Treering Data, -6000-1979
trees Girth, Height and Volume for Black Cherry
Trees
uspop Populations Recorded by the US Census
volcano Topographic Information on Auckland's Maunga
Whau Volcano
warpbreaks The Number of Breaks in Yarn during Weaving
women Average Heights and Weights for American
Women
Data sets in package ‘MASS’:
Aids2 Australian AIDS Survival Data
Animals Brain and Body Weights for 28 Species
Boston Housing Values in Suburbs of Boston
Cars93 Data from 93 Cars on Sale in the USA in 1993
Cushings Diagnostic Tests on Patients with Cushing's
Syndrome
DDT DDT in Kale
GAGurine Level of GAG in Urine of Children
Insurance Numbers of Car Insurance claims
Melanoma Survival from Malignant Melanoma
OME Tests of Auditory Perception in Children
with OME
Pima.te Diabetes in Pima Indian Women
Pima.tr Diabetes in Pima Indian Women
Pima.tr2 Diabetes in Pima Indian Women
Rabbit Blood Pressure in Rabbits
Rubber Accelerated Testing of Tyre Rubber
SP500 Returns of the Standard and Poors 500
Sitka Growth Curves for Sitka Spruce Trees in 1988
Sitka89 Growth Curves for Sitka Spruce Trees in 1989
Skye AFM Compositions of Aphyric Skye Lavas
Traffic Effect of Swedish Speed Limits on Accidents
UScereal Nutritional and Marketing Information on US
Cereals
UScrime The Effect of Punishment Regimes on Crime
Rates
VA Veteran's Administration Lung Cancer Trial
abbey Determinations of Nickel Content
accdeaths Accidental Deaths in the US 1973-1978
anorexia Anorexia Data on Weight Change
bacteria Presence of Bacteria after Drug Treatments
beav1 Body Temperature Series of Beaver 1
beav2 Body Temperature Series of Beaver 2
biopsy Biopsy Data on Breast Cancer Patients
birthwt Risk Factors Associated with Low Infant
Birth Weight
cabbages Data from a cabbage field trial
caith Colours of Eyes and Hair of People in
Caithness
cats Anatomical Data from Domestic Cats
cement Heat Evolved by Setting Cements
chem Copper in Wholemeal Flour
coop Co-operative Trial in Analytical Chemistry
cpus Performance of Computer CPUs
crabs Morphological Measurements on Leptograpsus
Crabs
deaths Monthly Deaths from Lung Diseases in the UK
drivers Deaths of Car Drivers in Great Britain
1969-84
eagles Foraging Ecology of Bald Eagles
epil Seizure Counts for Epileptics
farms Ecological Factors in Farm Management
fgl Measurements of Forensic Glass Fragments
forbes Forbes' Data on Boiling Points in the Alps
galaxies Velocities for 82 Galaxies
gehan Remission Times of Leukaemia Patients
genotype Rat Genotype Data
geyser Old Faithful Geyser Data
gilgais Line Transect of Soil in Gilgai Territory
hills Record Times in Scottish Hill Races
housing Frequency Table from a Copenhagen Housing
Conditions Survey
immer Yields from a Barley Field Trial
leuk Survival Times and White Blood Counts for
Leukaemia Patients
mammals Brain and Body Weights for 62 Species of
Land Mammals
mcycle Data from a Simulated Motorcycle Accident
menarche Age of Menarche in Warsaw
michelson Michelson's Speed of Light Data
minn38 Minnesota High School Graduates of 1938
motors Accelerated Life Testing of Motorettes
muscle Effect of Calcium Chloride on Muscle
Contraction in Rat Hearts
newcomb Newcomb's Measurements of the Passage Time
of Light
nlschools Eighth-Grade Pupils in the Netherlands
npk Classical N, P, K Factorial Experiment
npr1 US Naval Petroleum Reserve No. 1 data
oats Data from an Oats Field Trial
painters The Painter's Data of de Piles
petrol N. L. Prater's Petrol Refinery Data
phones Belgium Phone Calls 1950-1973
quine Absenteeism from School in Rural New South
Wales
road Road Accident Deaths in US States
rotifer Numbers of Rotifers by Fluid Density
ships Ships Damage Data
shoes Shoe wear data of Box, Hunter and Hunter
shrimp Percentage of Shrimp in Shrimp Cocktail
shuttle Space Shuttle Autolander Problem
snails Snail Mortality Data
steam The Saturated Steam Pressure Data
stormer The Stormer Viscometer Data
survey Student Survey Data
synth.te Synthetic Classification Problem
synth.tr Synthetic Classification Problem
topo Spatial Topographic Data
waders Counts of Waders at 15 Sites in South Africa
whiteside House Insulation: Whiteside's Data
wtloss Weight Loss Data from an Obese Patient
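The same listing can also be captured as an R object instead of being read from the console. The sketch below assumes only that the `datasets` package is available (it ships with R); the MASS part is guarded because MASS may not be installed on every machine.

```r
# data() returns an object whose $results matrix has one row per data set
base_sets <- data(package = "datasets")$results[, c("Item", "Title")]
head(base_sets)

# Check that the data set used later in this project is available
has_airquality <- "airquality" %in% base_sets[, "Item"]
has_airquality

# Same idea for MASS, only if the package is installed
if (requireNamespace("MASS", quietly = TRUE)) {
  mass_sets <- data(package = "MASS")$results[, c("Item", "Title")]
  print(head(mass_sets))
}
```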
Analysing a default data set in R
I will use a built-in data set called “airquality”. The data set contains daily air quality measurements in New York City.
These are the parameters in the data set:
Daily temperature from May to September
Solar radiation data
Ozone data
Wind data
Our goal is to predict the temperature for a particular month in New York using solar radiation, ozone and wind data. I am going to use Linear Regression (LR) to make the prediction.
To start using LR or any other algorithm, the first and foremost step is to state a hypothesis.
The hypothesis is: “Temperature depends on ozone, wind and solar radiation”. The model equation is Temp = a1*Solar.R + a2*Ozone + a3*Wind + error. The null hypothesis of linear regression says there is no relationship between the dependent and independent variables, i.e. all coefficients are zero.
On the other hand, the alternative hypothesis says there is at least one non-zero coefficient, and hence a relationship exists between the dependent and independent variables.
In mathematical notation it can be written as:
H0: a1 = a2 = a3 = 0
Ha: at least one of a1, a2, a3 ≠ 0
Let’s test the hypothesis using a linear regression model and draw a conclusion.
To test the hypothesis, we check the p-value of each variable. If the p-value is below the accepted significance level (generally 0.05, i.e. 95% confidence), we reject the null hypothesis and conclude that there is a relationship between the dependent and independent variables.
If the p-value is above the accepted level, we fail to reject the null hypothesis: the data do not provide evidence of a relationship between the dependent and independent variables.
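The decision rule above can be sketched in a few lines of R. This is only an illustration under stated assumptions: it drops incomplete rows with `na.omit` so the block is self-contained, and the names `aq`, `fit`, `alpha` and `overall_p` are my own.

```r
aq <- na.omit(datasets::airquality)  # complete rows only, for this sketch
fit <- lm(Temp ~ Solar.R + Ozone + Wind, data = aq)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]  # drop the intercept row

alpha <- 0.05       # significance level for 95% confidence
p_values < alpha    # TRUE where a coefficient differs significantly from zero

# Overall F-test of H0: a1 = a2 = a3 = 0
f <- summary(fit)$fstatistic
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
overall_p < alpha   # TRUE means H0 is rejected
```

The per-coefficient p-values test each variable separately, while the F-test addresses the joint null hypothesis H0 directly.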
Before that, let’s understand the data by exploring it in R.
> data("airquality")
> attach(airquality)
> head(airquality, 10)  # to see first 10 rows
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
> summary(airquality)
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7
     Month            Day
 Min.   :5.000   Min.   : 1.0
 1st Qu.:6.000   1st Qu.: 8.0
 Median :7.000   Median :16.0
 Mean   :6.993   Mean   :15.8
 3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :9.000   Max.   :31.0
The data() function is used to load the airquality data.
The attach() function adds the data set to the R search path, so its columns can be referenced by name.
The summary() function gives you the range, quartiles, median and mean for numerical variables, and a table of frequencies for categorical variables.
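One detail of attach() is worth a short sketch: it places a copy of the data on the search path, so later changes to the data frame are not seen through the attached names. The copy `aq` and the variable `na_in_attached` below are names I chose for illustration.

```r
aq <- datasets::airquality
attach(aq)                      # copies the columns onto the search path

# Change the data frame after attaching it
aq$Ozone[is.na(aq$Ozone)] <- 0

na_in_attached <- sum(is.na(Ozone))  # the attached copy: still 37 missing
na_in_frame <- sum(is.na(aq$Ozone))  # the data frame itself: 0 missing
c(attached = na_in_attached, frame = na_in_frame)

detach(aq)                      # clean up the search path
```

Referring to columns as `aq$Ozone`, or passing `data = aq` to modelling functions, avoids this kind of surprise.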
Data visualization
I use a boxplot to visualize the daily temperature for month 5, 6, 7, 8 and 9.
> month5 = subset(airquality, Month == 5)
> par(mfrow = c(1, 2))  # 1 row and 2 columns
> boxplot(month5$Temp ~ month5$Day, main = "month5", col = rainbow(3))
> month6 = subset(airquality, Month == 6)
> par(mfrow = c(1, 2))  # 1 row and 2 columns
> boxplot(month6$Temp ~ month6$Day, main = "month6", col = rainbow(3))
> month7 = subset(airquality, Month == 7)
> par(mfrow = c(1, 2))  # 1 row and 2 columns
> boxplot(month7$Temp ~ month7$Day, main = "month7", col = rainbow(3))
> month8 = subset(airquality, Month == 8)
> par(mfrow = c(1, 2))  # 1 row and 2 columns
> boxplot(month8$Temp ~ month8$Day, main = "month8", col = rainbow(3))
> month9 = subset(airquality, Month == 9)
> par(mfrow = c(1, 2))  # 1 row and 2 columns
> boxplot(month9$Temp ~ month9$Day, main = "month9", col = rainbow(3))
[Figure: box plot output for months 5 to 9]
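A more compact way to compare the months, offered here as an alternative sketch rather than the approach used above, is a single box plot of temperature grouped by month. The titles and axis labels are my own choices.

```r
aq <- datasets::airquality

# One box per month; boxplot() also returns the statistics it draws
bp <- boxplot(Temp ~ Month, data = aq,
              main = "Daily temperature by month",
              xlab = "Month", ylab = "Temperature (degrees F)",
              col = rainbow(5))

bp$n  # number of daily observations behind each box
```

Grouping by month puts about 30 observations behind each box, so medians and spreads can be compared across months at a glance.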
Now use a histogram to see the distribution of the temperature data.
> hist(airquality$Temp, col = rainbow(2))
Now use scatter plots to see if there is a linear pattern between temperature and the other variables.
> plot(airquality$Temp ~ airquality$Day + airquality$Solar.R + airquality$Wind + airquality$Ozone, col = blues9)
Hit <Return> to see next plot:
It seems that Solar.R, Ozone and Wind each have a roughly linear relationship with temperature. Solar.R and Ozone have a positive relationship and Wind has a negative one.
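The direction of each relationship can also be checked numerically with correlation coefficients. This is a small sketch; `predictors` and `temp_cor` are names I introduced, and rows with missing values are skipped per pair.

```r
aq <- datasets::airquality
predictors <- c("Solar.R", "Ozone", "Wind")

# Correlation of each predictor with Temp, ignoring rows with missing values
temp_cor <- sapply(predictors, function(v) {
  cor(aq$Temp, aq[[v]], use = "complete.obs")
})

round(temp_cor, 2)  # positive for Solar.R and Ozone, negative for Wind
```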
Now use a co-plot (conditioning plot) to see how Ozone varies with Solar.R at different levels of Wind.
> coplot(Ozone ~ Solar.R | Wind, data = airquality, panel = panel.smooth, col = "green")
Missing rows: 5, 6, 10, 11, 25, 26, 27, 32, 33, 34, 35, 36, 37, 39, 42, 43, 45, 46, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 65, 72, 75, 83, 84, 96, 97, 98, 102, 103, 107, 115, 119, 150
Now it’s time to run linear regression on our data set.
We use the lm function, which fits a linear model to the data. In the model formula, Temperature (the dependent variable) is on the left-hand side, separated by a ~ from the independent variables.
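A minimal sketch of the formula syntax follows. It uses the raw airquality data, so rows with missing values are dropped by lm's default behaviour; the object names are mine.

```r
aq <- datasets::airquality

# Dependent variable on the left of ~, independent variables on the right
fit <- lm(Temp ~ Solar.R + Ozone + Wind, data = aq)
coef(fit)   # intercept plus one coefficient per independent variable
nobs(fit)   # rows actually used after incomplete rows were dropped

# "Temp ~ ." is shorthand for "Temp against every other column"
fit_all <- lm(Temp ~ ., data = aq)
names(coef(fit_all))
```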
Data preparation:
The input data needs to be processed before we use it in our algorithm. This means deleting rows that have no values, and checking for correlation and outliers. While building the model, R takes care of missing values by default and drops the rows where data is missing. This results in data loss.
There are different methods to deal with missing data, like imputing the mean for numerical variables and the mode for categorical variables. Another method is to replace missing values with a value far outside the range of the valid values.
For example, we can replace a missing value with -1 when the variable is age. Since age cannot be negative, the -1 stands out as an obvious placeholder. Note that R itself treats it as an ordinary number, so such values must be handled before the model is built.
We can use the following command to find column wise count of null values in the data.
> sapply(airquality, function(x) {sum(is.na(x))})
  Ozone Solar.R    Wind    Temp   Month     Day
     37       7       0       0       0       0
You can see that there are missing values in Ozone and Solar.R. We could drop those rows, but that would result in data loss: there are just 153 rows in our data, so dropping the 37 rows with missing Ozone values alone would be a loss of about 24%.
Hence, we will replace the missing values with the mean (since both of the variables are numerical).
> airquality$Ozone[is.na(airquality$Ozone)] = mean(airquality$Ozone, na.rm = TRUE)
> airquality$Solar.R[is.na(airquality$Solar.R)] = mean(airquality$Solar.R, na.rm = TRUE)
> sapply(airquality, function(x) {sum(is.na(x))})
  Ozone Solar.R    Wind    Temp   Month     Day
      0       0       0       0       0       0
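The two replacement commands above follow the same pattern, so they can be wrapped in a small helper. `impute_mean` is a hypothetical function name of my own, and the sketch works on a copy so the original data stays untouched.

```r
# Replace the missing entries of a numeric vector with the mean of the rest
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

aq <- datasets::airquality
num_cols <- c("Ozone", "Solar.R")
aq[num_cols] <- lapply(aq[num_cols], impute_mean)

colSums(is.na(aq))  # count of missing values per column after imputation
```

Mean imputation keeps all 153 rows and leaves each column mean unchanged, at the cost of shrinking the variance of the imputed columns.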
Now, let’s check the correlation between independent variables.
We use the corrplot library to visualize the correlation between variables.
> library(corrplot)
> o = corrplot(cor(airquality), method = 'number')
# the method can be changed; try using method = 'circle'
Now drop one of any two variables that are highly correlated. Alternatively, if we have good knowledge of the data, we can form a new variable by taking the difference of the two.
For example, if ‘income’ and ‘expenditure’ are highly correlated variables, then we can create a new variable called ‘savings’ by taking the difference between ‘income’ and ‘expenditure’. We can do this only if we have domain knowledge.
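Highly correlated pairs can also be listed without a plot. The sketch below uses base R only; the cut-off of 0.5 is an arbitrary value I picked for illustration, and incomplete rows are dropped to keep the block self-contained.

```r
aq <- na.omit(datasets::airquality)
cm <- cor(aq[, c("Ozone", "Solar.R", "Wind", "Month", "Day")])

threshold <- 0.5                        # arbitrary cut-off for this sketch
cm[upper.tri(cm, diag = TRUE)] <- NA    # keep each pair only once

# Row/column positions of the remaining entries above the cut-off
high <- which(abs(cm) > threshold, arr.ind = TRUE)
pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                    var2 = colnames(cm)[high[, "col"]],
                    r    = cm[high])
pairs
```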
Let’s see the effect of multicollinearity (without dropping a parameter) on our model.
> Model_lm1 = lm(Temp ~ ., data = airquality)
> summary(Model_lm1)
Call:
lm(formula = Temp ~ ., data = airquality)
Residuals:
     Min       1Q   Median       3Q      Max
-19.6924  -4.3178   0.6003   3.8249  16.2020
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.349707   4.052346  14.646  < 2e-16 ***
Ozone        0.142977   0.023515   6.080 9.87e-09 ***
Solar.R      0.014273   0.006589   2.166   0.0319 *
Wind        -0.423795   0.182787  -2.319   0.0218 *
Month        2.252433   0.389039   5.790 4.11e-08 ***
Day         -0.106120   0.061414  -1.728   0.0861 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.628 on 147 degrees of freedom
Multiple R-squared:  0.5258, Adjusted R-squared:  0.5097
F-statistic:  32.6 on 5 and 147 DF,  p-value: < 2.2e-16
Before we interpret the results, I am going to tune the model for a low AIC value.
The Akaike Information Criterion (AIC) is a measure of the relative quality of statistical models for a given set of data.
Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
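As a rough sketch of how AIC compares models, the code below fits a small and a large model to the same rows and checks the definition AIC = 2k - 2*logLik by hand. The model names are mine, and incomplete rows are dropped so both models use identical data.

```r
aq <- na.omit(datasets::airquality)

m_small <- lm(Temp ~ Ozone, data = aq)
m_large <- lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = aq)

# Lower AIC suggests the better trade-off between fit and complexity
AIC(m_small, m_large)

# Manual check: k counts the coefficients plus the error variance
k <- attr(logLik(m_large), "df")
manual_aic <- 2 * k - 2 * as.numeric(logLik(m_large))
all.equal(manual_aic, AIC(m_large))
```

AIC values are only comparable between models fitted to the same observations, which is why the rows are fixed before fitting.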
You can tune the model for a low AIC in two ways:
1) By eliminating some less significant variables and re-running the model
2) Using the ‘step’ function in R. By default, the step function tries dropping one variable at a time, keeps the model with the lowest AIC, and stops when no removal lowers the AIC further.
I am going to use the second method here.
> Model_lm_best = step(Model_lm1)
Start:  AIC=584.62
Temp ~ Ozone + Solar.R + Wind + Month + Day
          Df Sum of Sq    RSS    AIC
<none>                 6457.8 584.62
- Day      1    131.17 6588.9 585.69
- Solar.R  1    206.13 6663.9 587.43
- Wind     1    236.15 6693.9 588.11
- Month    1   1472.59 7930.3 614.05
- Ozone    1   1624.07 8081.8 616.94
> summary(Model_lm_best)
Call:
lm(formula = Temp ~ Ozone + Solar.R + Wind + Month + Day, data = airquality)
Residuals:
     Min       1Q   Median       3Q      Max
-19.6924  -4.3178   0.6003   3.8249  16.2020
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.349707   4.052346  14.646  < 2e-16 ***
Ozone        0.142977   0.023515   6.080 9.87e-09 ***
Solar.R      0.014273   0.006589   2.166   0.0319 *
Wind        -0.423795   0.182787  -2.319   0.0218 *
Month        2.252433   0.389039   5.790 4.11e-08 ***
Day         -0.106120   0.061414  -1.728   0.0861 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.628 on 147 degrees of freedom
Multiple R-squared:  0.5258, Adjusted R-squared:  0.5097
F-statistic:  32.6 on 5 and 147 DF,  p-value: < 2.2e-16
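The starting AIC of 584.62 reported by step can be reproduced from the RSS in the same table. For linear models, step relies on extractAIC, which computes n*log(RSS/n) + 2*edf, where edf is the number of estimated coefficients. The sketch below first plugs in the numbers printed above, then checks the formula against extractAIC on a freshly fitted model (fitted on complete rows, so its value differs from the one above).

```r
# Numbers taken from the step() output above
n <- 153; rss <- 6457.8; edf <- 6
doc_aic <- n * log(rss / n) + 2 * edf
doc_aic  # about 584.62

# Same formula checked against extractAIC() on a self-contained model
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ ., data = aq)
manual <- nobs(fit) * log(deviance(fit) / nobs(fit)) + 2 * length(coef(fit))
all.equal(manual, extractAIC(fit)[2])
```

Since removing any variable raises the AIC above 584.62, step keeps all five variables, which is why the second summary is identical to the first.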
The model summary gives the coefficient estimates of the intercept and the independent variables, each with its significance level.
The ‘Signif. codes’ line in the result shows how to read the level of significance. Three asterisks mean that the p-value in the Pr(>|t|) column is below 0.001, i.e. the variable is significant at the 99.9% level.
The R-squared and adjusted R-squared values indicate how much of the variance of the dependent variable is explained by the model; the rest is left to the error term. Hence, the higher the R-squared or adjusted R-squared, the better the model fits.
Adjusted R-squared is a better indicator of explained variance because it penalizes the inclusion of extra variables that are added to the model only for the sake of a higher percentage of variance explained.
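Both quantities can be computed by hand, which makes the penalty visible. This is a sketch on complete rows with names of my own; `p` is the number of independent variables and `n` the number of observations.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind, data = aq)

rss <- sum(residuals(fit)^2)              # variance left unexplained
tss <- sum((aq$Temp - mean(aq$Temp))^2)   # total variance around the mean
n <- nobs(fit)
p <- length(coef(fit)) - 1                # independent variables, no intercept

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # shrinks as p grows

c(r2 = r2, adj_r2 = adj_r2)
```

With the values reported above (R-squared 0.5258, n = 153, p = 5) the same formula gives 1 - 0.4742 * 152 / 147, which is about 0.5097.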
VIF and Multicollinearity
The Variance Inflation Factor (VIF) is an important measure based on the coefficient of determination (R2).
If two independent variables are highly correlated, the variance of the estimated coefficients is inflated.
To deal with this, we can check the VIF of the model before and after dropping one of the two highly correlated variables.
Formula for VIF:
VIF(k) = 1 / (1 - Rk^2)
where Rk^2 is the R-squared value obtained by regressing the kth predictor on the remaining predictors.
So to calculate VIF, we build a model for each independent variable, using all the other independent variables as predictors, and then calculate the VIF from its R-squared. A high VIF means that the other variables are highly correlated with the selected variable.
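The calculation just described can be sketched in base R, without any extra package. The names `predictors` and `vif_manual` are mine, and incomplete rows are dropped so the block runs on its own.

```r
aq <- na.omit(datasets::airquality)
predictors <- c("Ozone", "Solar.R", "Month")

# For each predictor k: regress k on the others, then apply 1 / (1 - Rk^2)
vif_manual <- sapply(predictors, function(k) {
  others <- setdiff(predictors, k)
  aux <- lm(reformulate(others, response = k), data = aq)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 3)
```

Because R-squared lies between 0 and 1, a VIF can never fall below 1, and it grows without bound as Rk^2 approaches 1.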
We will use an R library called ‘fmsb’ to calculate VIF.
So we can check VIF for our final linear model.
> library(fmsb)
> Model_lm1 = lm(Temp ~ Ozone + Solar.R + Month, data = airquality)
> VIF(lm(Month ~ Ozone + Solar.R, data = airquality))
[1] 1.039042
> VIF(lm(Ozone ~ Solar.R + Month, data = airquality))
[1] 1.137975
> VIF(lm(Solar.R ~ Ozone + Month, data = airquality))
[1] 1.118629
As a general rule, VIF < 5 is acceptable (VIF = 1 means there is no multicollinearity), and VIF > 5 is not acceptable and means we need to re-check our model.
In our example, every VIF is below 5, and hence no additional verification is needed.
Interpretation of results
Basic assumptions of linear regression:
Linear relationship between variables
Normal distribution of residuals
No or little multi-collinearity: we have seen this using VIF
Homoscedasticity: Variance across the regression line should be uniform
R displays the summary of the model and gives the intercept and the coefficient estimates of all independent variables, along with their standard errors and a summary of the residuals.
The linear relationship between variables is supported by the significance (p-value) of the variables.
In the ‘Residuals vs Fitted’ plot it can be seen that the residuals are spread evenly around zero without a clear pattern, and hence the variance is uniform.
In the ‘Normal Q-Q’ plot it can be seen that the residuals are approximately normally distributed. This can also be seen by plotting a histogram of the residuals:
> hist(Model_lm_best$residuals)
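The two diagnostic plots mentioned above can be drawn directly from a fitted model. The sketch below refits the model on complete rows so it is self-contained, and adds a Shapiro-Wilk test as one possible numerical check of normality; that test is my own addition, not part of the original analysis.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = aq)

par(mfrow = c(1, 2))   # two plots side by side
plot(fit, which = 1)   # Residuals vs Fitted: look for an even band around 0
plot(fit, which = 2)   # Normal Q-Q: points close to the line suggest normality

# Numerical check: a large p-value gives no evidence against normality
sw <- shapiro.test(residuals(fit))
sw$p.value
```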
There are many ways to measure the quality of a model, but the residual sum of squares is the most common one.
Least-squares regression finds the ‘line of best fit’ through the scattered data points by minimizing the residual sum of squares, so that the line has the least error with respect to the actual data points.
If Y is the actual data point and Y’ is the value predicted by the equation, then the error is Y-Y’. Simply summing these errors is biased by sign: positive and negative errors cancel each other, so the total would understate the actual error. To overcome this, the usual method is to square each error, which serves two purposes:
1) Cancel out the effect of signs
2) Penalize large errors in prediction more heavily
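The cancellation effect is easy to see in code. In this sketch (complete rows only, names of my own) the plain sum of errors is essentially zero for a least-squares fit, while the sum of squared errors is not.

```r
aq <- na.omit(datasets::airquality)
fit <- lm(Temp ~ Ozone + Solar.R + Wind, data = aq)

err <- aq$Temp - fitted(fit)  # Y - Y' for every observation

sum_err <- sum(err)    # positive and negative errors cancel out
rss <- sum(err^2)      # residual sum of squares

c(sum_of_errors = sum_err, rss = rss)
all.equal(rss, deviance(fit))  # deviance() of an lm fit is its RSS
```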
Prediction
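To make a prediction, let’s build a data frame with values of Solar.R, Wind, Ozone and Month. Here the columns come from the data set attached earlier, which still contains the original missing values.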
> new_data = data.frame(Solar.R, Wind, Ozone, Month)
> new_data
    Solar.R Wind Ozone Month
1       190  7.4    41     5
2       118  8.0    36     5
3       149 12.6    12     5
4       313 11.5    18     5
5        NA 14.3    NA     5
6        NA 14.9    28     5
7       299  8.6    23     5
8        99 13.8    19     5
9        19 20.1     8     5
10      194  8.6    NA     5
11       NA  6.9     7     5
12      256  9.7    16     5
13      290  9.2    11     5
14      274 10.9    14     5
15       65 13.2    18     5
16      334 11.5    14     5
17      307 12.0    34     5
18       78 18.4     6     5
19      322 11.5    30     5
20       44  9.7    11     5
21        8  9.7     1     5
22      320 16.6    11     5
23       25  9.7     4     5
24       92 12.0    32     5
25       66 16.6    NA     5
26      266 14.9    NA     5
27       NA  8.0    NA     5
28       13 12.0    23     5
29      252 14.9    45     5
30      223  5.7   115     5
31      279  7.4    37     5
32      286  8.6    NA     6
33      287  9.7    NA     6
34      242 16.1    NA     6
35      186  9.2    NA     6
36      220  8.6    NA     6
37      264 14.3    NA     6
38      127  9.7    29     6
39      273  6.9    NA     6
40      291 13.8    71     6
41      323 11.5    39     6
42      259 10.9    NA     6
43      250  9.2    NA     6
44      148  8.0    23     6
45      332 13.8    NA     6
46      322 11.5    NA     6
47      191 14.9    21     6
48      284 20.7    37     6
49       37  9.2    20     6
50      120 11.5    12     6
51      137 10.3    13     6
52      150  6.3    NA     6
53       59  1.7    NA     6
54       91  4.6    NA     6
55      250  6.3    NA     6
56      135  8.0    NA     6
57      127  8.0    NA     6
58       47 10.3    NA     6
59       98 11.5    NA     6
60       31 14.9    NA     6
61      138  8.0    NA     6
62      269  4.1   135     7
63      248  9.2    49     7
64      236  9.2    32     7
65      101 10.9    NA     7
66      175  4.6    64     7
67      314 10.9    40     7
68      276  5.1    77     7
69      267  6.3    97     7
70      272  5.7    97     7
71      175  7.4    85     7
72      139  8.6    NA     7
73      264 14.3    10     7
74      175 14.9    27     7
75      291 14.9    NA     7
76       48 14.3     7     7
77      260  6.9    48     7
78      274 10.3    35     7
79      285  6.3    61     7
80      187  5.1    79     7
81      220 11.5    63     7
82        7  6.9    16     7
83      258  9.7    NA     7
84      295 11.5    NA     7
85      294  8.6    80     7
86      223  8.0   108     7
87       81  8.6    20     7
88       82 12.0    52     7
89      213  7.4    82     7
90      275  7.4    50     7
91      253  7.4    64     7
92      254  9.2    59     7
93       83  6.9    39     8
94       24 13.8     9     8
95       77  7.4    16     8
96       NA  6.9    78     8
97       NA  7.4    35     8
98       NA  4.6    66     8
99      255  4.0   122     8
100     229 10.3    89     8
101     207  8.0   110     8
102     222  8.6    NA     8
103     137 11.5    NA     8
104     192 11.5    44     8
105     273 11.5    28     8
106     157  9.7    65     8
107      64 11.5    NA     8
108      71 10.3    22     8
109      51  6.3    59     8
110     115  7.4    23     8
111     244 10.9    31     8
112     190 10.3    44     8
113     259 15.5    21     8
114      36 14.3     9     8
115     255 12.6    NA     8
116     212  9.7    45     8
117     238  3.4   168     8
118     215  8.0    73     8
119     153  5.7    NA     8
120     203  9.7    76     8
121     225  2.3   118     8
122     237  6.3    84     8
123     188  6.3    85     8
124     167  6.9    96     9
125     197  5.1    78     9
126     183  2.8    73     9
127     189  4.6    91     9
128      95  7.4    47     9
129      92 15.5    32     9
130     252 10.9    20     9
131     220 10.3    23     9
132     230 10.9    21     9
133     259  9.7    24     9
134     236 14.9    44     9
135     259 15.5    21     9
136     238  6.3    28     9
137      24 10.9     9     9
138     112 11.5    13     9
139     237  6.9    46     9
140     224 13.8    18     9
141      27 10.3    13     9
142     238 10.3    24     9
143     201  8.0    16     9
144     238 12.6    13     9
145      14  9.2    23     9
146     139 10.3    36     9
147      49 10.3     7     9
148      20 16.6    14     9
149     193  6.9    30     9
150     145 13.2    NA     9
151     191 14.3    14     9
152     131  8.0    18     9
153     223 11.5    20     9
> pred_temp = predict(Model_lm_best, newdata = new_data)
> pred_temp
        1         2         3         4         5         6         7
 75.94367  73.84070  68.79616  72.35493        NA        NA  73.78063
        8         9        10        11        12        13        14
 68.04417  62.55352        NA        NA  71.16926  71.04545  70.41943
       15        16        17        18        19        20        21
 66.92733  70.80933  72.96546  62.87507  72.60731  66.57943  64.52970
       22        23        24        25        26        27        28
 67.38249  64.98904  68.86786        NA        NA        NA  66.02899
       29        30        31        32        33        34        35
 71.25071  84.63794  73.45851        NA        NA        NA        NA
       36        37        38        39        40        41        42
       NA        NA  73.96970        NA  80.36578  77.11588        NA
       43        44        45        46        47        48        49
       NA  73.49532        NA        NA  70.58058  71.63151  70.44288
       50        51        52        53        54        55        56
 69.40292  70.19098        NA        NA        NA        NA        NA
       57        58        59        60        61        62        63
       NA        NA        NA        NA        NA  96.41447  81.55126
       64        65        66        67        68        69        70
 78.84326        NA  84.28504  80.06159  87.16122  89.27762  89.49714
       71        72        73        74        75        76        77
 85.57032        NA  72.98099  73.78086        NA  69.15063  81.06862
       78        79        80        81        82        83        84
 77.86272  83.32619  84.90340  80.26839  72.35157        NA        NA
       85        86        87        88        89        90        91
 84.55975  87.69784  72.72866  75.77116  83.77363  79.97721  81.55875
       92        93        94        95        96        97        98
 79.98919  81.09965  72.93791  77.30141        NA        NA        NA
       99       100       101       102       103       104       105
 96.01404  88.14867  91.70577        NA        NA  80.25357  79.01597
      106       107       108       109       110       111       112
 83.30709        NA  75.46506  82.05879  77.25284  78.64853  79.88461
      113       114       115       116       117       118       119
 75.27117  70.77489        NA  80.17140 100.69243  84.72578        NA
      120       121       122       123       124       125       126
 84.05074  93.39974  86.90851  86.24597  92.70072  91.21206  91.16596
      127       128       129       130       131       132       133
 92.95623  84.03080  78.30447  80.71585  80.83618  80.33257  81.57786
      134       135       136       137       138       139       140
 81.79925  78.47868  82.97257  75.14591  76.61348  84.95924  77.74003
      141       142       143       144       145       146       147
 75.59043  80.06876  79.26544  77.30905  76.87634  79.94693  74.40987
      148       149       150       151       152       153
 72.22074  80.98238        NA  75.31788  77.59717  77.60688
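Each prediction is just the fitted equation evaluated at one row. The sketch below first reproduces the first predicted value by hand from the coefficients printed in the model summary, then shows predict() for a single new day with a prediction interval. The values in `new_day` are hypothetical, and the second model is refitted on the raw data, so its numbers differ slightly from those above.

```r
# Row 1 (Ozone 41, Solar.R 190, Wind 7.4, Month 5, Day 1) by hand,
# using the coefficient estimates printed in the summary
manual_row1 <- 59.349707 + 0.142977 * 41 + 0.014273 * 190 -
  0.423795 * 7.4 + 2.252433 * 5 - 0.106120 * 1
manual_row1  # about 75.94, matching the first predicted value

# predict() for one hypothetical new day, with a 95% prediction interval
aq <- datasets::airquality
fit <- lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = aq)
new_day <- data.frame(Ozone = 40, Solar.R = 200, Wind = 10,
                      Month = 7, Day = 15)  # invented values
p <- predict(fit, newdata = new_day, interval = "prediction", level = 0.95)
p  # columns: fit (point prediction), lwr and upr (interval bounds)
```

A new data frame must contain every variable used in the model formula; any row with a missing input yields NA, as in the output above.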
Conclusion
Linear regression assumes that the residuals are normally distributed and that there is a linear relationship between the dependent and independent variables.
It is a good mathematical model for analyzing the relationship between variables and their significance.
Reference :- https://www.edvancer.in/step-step-guide-to-execute-linear-regression-r/