predictive modelling

OPTIMIZING IRRIGATION SETUP IN INDIA TEAM-O SONALI GULERIA INCHARA B. DIWAKA

Upload: inchara-diwakar

Post on 13-Apr-2017


Optimizing Irrigation Setup in India
Team-O: Sonali Guleria, Inchara B. Diwakar

Content

OBJECTIVE
DATASET
DATA PRE-PROCESSING
BASELINE MODEL
PREDICTION APPROACH 1
PREDICTION APPROACH 2
PREDICTION APPROACH 3
MODEL COMPARISON: CHOOSING THE MODEL
RECOMMENDATIONS

OBJECTIVE

Build a prediction engine that the government can use to forecast and budget the irrigation demand of a village, and that farmers can use to plan their irrigation methods.
The model predicts the Percentage of Agricultural Land Irrigated.
This will help reduce dependence on rainfall and strengthen farming decisions through a more empirical approach.

DataSet
Source: India Open data portal.
Picked states with diverse socio-economic conditions to give a broad representation of the data.
350+ dimensions at the village level, spread across domains such as population, education, irrigation sources and households.
Merged rainfall data to add more information to the data.

Data Pre-Processing

Indexed Data, by domain: Sanitation, AGRO, Education, Land Usage, Connectivity, Water

Feature Selection (Data Pre-Processing)
Exhaustive backward feature selection.
Reduced the features from 51 (indexed data) to the 17 most reflective features.

FEATURE
District Name
Total Geographical Area in Hectares
Total Population of Village
EDUCATION_GOVT
Water
Sanitation
AGRO_Rating
Power Supply For Agriculture Use Status (Active: 1 NA 2)
Power Supply For Commercial Use Summer April Sept per day in Hours
Power Supply For All Users Status (Active: 1 NA 2)
Agro_commodity
Manufacture
Area under Non Agricultural Uses in Hectares
Cultivable Waste Land Area in Hectares
Fallows Land other than Current Fallows Area in Hectares
Current Fallows Area in Hectares
Net Area Sown in Hectares
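Backward selection can be sketched in a few lines. The version below is a greedy variant (it drops one feature per round), not necessarily the exhaustive search the team ran. The error function `toy_error`, its weights, and the shortened feature names are hypothetical stand-ins for a real cross-validated RMSE on the indexed data.

```python
def backward_elimination(features, score, keep):
    """Greedy backward selection.

    Starting from all features, repeatedly drop the single feature whose
    removal gives the lowest error, until `keep` features remain.
    `score` maps a list of feature names to an error (lower is better).
    """
    selected = list(features)
    while len(selected) > keep:
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = min(candidates, key=score)
    return selected


# Hypothetical error function: in practice this would be a cross-validated
# RMSE of a model fitted on the given subset of features.
USEFUL = {"Net Area Sown": 5.0, "Power Supply": 3.0, "Water": 1.0}


def toy_error(subset):
    missing = sum(w for name, w in USEFUL.items() if name not in subset)
    return missing + 0.1 * len(subset)  # small penalty per retained feature


all_features = ["Net Area Sown", "Power Supply", "Water", "noise_a", "noise_b"]
chosen = backward_elimination(all_features, toy_error, keep=3)
```

With a real scoring function, each round costs one model fit per remaining feature, so going from 51 features to 17 remains tractable.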

BASELINE MODEL
Clustering: Random Forest on all 350+ raw predictors. Root Mean Square Error: 12.2
Regression: Default model, the mean of Percentage Irrigated. Root Mean Square Error: 29.16
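As a minimal sketch of how a mean-only baseline is scored: predict the training mean for every held-out village and compute the RMSE. The percentages below are made up for illustration and are not taken from the data set.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline_rmse(train_y, test_y):
    """Predict the training mean for every test row and score it."""
    mean_y = sum(train_y) / len(train_y)
    return rmse(test_y, [mean_y] * len(test_y))


# Made-up percentages of irrigated land, for illustration only.
train_pct = [10.0, 20.0, 30.0, 40.0]
test_pct = [20.0, 30.0]
baseline = mean_baseline_rmse(train_pct, test_pct)
```

Any model worth keeping should beat this number, because the baseline ignores every predictor.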

Prediction Part 1: Clustering, the Big Picture
State-wise clustering using Partitioning Around Medoids (PAM).
2 clusters, chosen for the highest silhouette width.
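The cluster-count choice can be illustrated with a small sketch. It assumes Euclidean distance and uses an exhaustive medoid search, which is only feasible for tiny inputs; PAM itself uses a greedy build-and-swap search toward the same objective. The points are invented toy data, not state-level features.

```python
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_medoids_exhaustive(points, k):
    """Try every set of k points as medoids and keep the cheapest one.

    Returns a cluster label per point. Assumes all points are distinct.
    """
    medoids = min(
        combinations(points, k),
        key=lambda meds: sum(min(dist(p, m) for m in meds) for p in points),
    )
    return [min(range(k), key=lambda j: dist(p, medoids[j])) for p in points]


def silhouette_width(points, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two well-separated groups.
toy = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
scores = {k: silhouette_width(toy, k_medoids_exhaustive(toy, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

The number of clusters with the highest average silhouette width is kept, which mirrors the selection rule stated on the slide.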

Prediction Part 1 (Contd): Bagging
10-fold cross-validation, tuning the mtry parameter.
Achieved an R-squared of approx. 0.95 at mtry = 12.

Boosting
Gradient Boosting algorithm tuned over n.trees and interaction depth, with the learning rate and the minimum number of observations in trees held constant.
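To make the tuning parameters concrete, here is a deliberately small sketch of gradient boosting for squared error on a single predictor. Each tree is a one-split stump, which corresponds to an interaction depth of 1. The function names and the toy data are assumptions for illustration, and the minimum-observations constraint is not enforced.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees, learning_rate):
    """Fit stumps to the current residuals, one after another."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Base prediction plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


# Toy step function: the fit tightens as n_trees grows.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [10.0, 10.0, 10.0, 30.0, 30.0, 30.0]
model = boost(toy_x, toy_y, n_trees=20, learning_rate=0.5)
```

A smaller learning rate needs more trees to reach the same fit, which is why the number of trees is usually tuned with the learning rate fixed.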

Linear Regression
10-fold cross-validation performed to calculate OLS estimates.
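A minimal sketch of 10-fold cross-validation around an OLS fit, assuming a single predictor so that the closed-form estimates fit in a few lines. The folds are interleaved rather than shuffled, and the data are synthetic; real village data would have many predictors and would normally be shuffled first.

```python
import math


def fit_ols(xs, ys):
    """Closed-form OLS for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cv_rmse(xs, ys, k=10):
    """RMSE pooled over all held-out predictions from k-fold CV."""
    squared_error = 0.0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        intercept, slope = fit_ols(train_x, train_y)
        squared_error += sum((ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold)
    return math.sqrt(squared_error / len(xs))


# Synthetic, exactly linear data: the held-out error should be about zero.
demo_x = [float(i) for i in range(30)]
demo_y = [2.0 + 0.5 * x for x in demo_x]
demo_cv_rmse = cv_rmse(demo_x, demo_y, k=10)
```

Each observation is predicted exactly once by a model that never saw it, which is what makes the CV RMSE comparable across the approaches in this deck.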

Prediction Part 2: Regularized Regression
Performed 10-fold cross-validation for regularized regression on both raw and indexed data.
Indexed data performed better.
Lasso (alpha = 1) outperformed ridge (alpha = 0) and the average of the two penalties (alpha = 0.5, "Mid").
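The alpha values quoted here match the elastic-net parametrisation used by glmnet-style solvers. Assuming that convention (the slides do not name the library), the fitted coefficients minimise the following objective, where N is the number of villages and lambda is the penalty strength chosen by cross-validation:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

Setting alpha = 1 leaves only the L1 term (lasso), alpha = 0 leaves only the squared L2 term (ridge), and alpha = 0.5 weights the two equally.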

Figure: Top Left: Lasso (MSE vs log(Lambda)). Top Right: Mid (MSE vs log(Lambda)). Bottom Left: Ridge (MSE vs log(Lambda)). Bottom Right: RMSE comparison of all three.

Prediction Part 3: Principal Component Analysis
Performed PCA on both raw and indexed data.
Indexed data performed better.
Performed 10-fold cross-validated regression using various numbers of top principal components.

Principal Component    Variance    R-squared    CV RMSE
PC5 (Knee)             52.00%      0.28         24
PC15                   81.22%      0.436        21.4
PC19                   89.78%      0.498        20.1
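The Variance column reads as the cumulative share of variance captured by the leading components. As a sketch of that calculation, assuming the eigenvalues of the covariance matrix are already available (the values below are invented, not the project's):

```python
def cumulative_variance(eigenvalues):
    """Cumulative share of total variance, components sorted by eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares, running = [], 0.0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


def components_for(eigenvalues, target):
    """Smallest number of leading components whose share reaches `target`."""
    for count, share in enumerate(cumulative_variance(eigenvalues), start=1):
        if share >= target:
            return count
    return len(eigenvalues)


# Invented eigenvalues for illustration only.
toy_eigenvalues = [4.0, 2.0, 1.0, 1.0]
toy_shares = cumulative_variance(toy_eigenvalues)
needed = components_for(toy_eigenvalues, 0.8)
```

The same bookkeeping shows the trade-off in the table: going from 5 to 19 components adds much more variance than it removes cross-validated error.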

Model Comparison
Bagging performs best, with the lowest RMSE and the highest lift.

DEFAULT MODEL    RMSE     MODEL                RMSE    LIFT
Random Forest    12.2     Bagging              5.12    0.580328
Random Forest    12.2     Boosting             7.62    0.37541
Mean             29.16    Linear Regression    13.8    0.526749
Mean             29.16    Lasso                18.2    0.375857
Mean             29.16    Ridge                18.3    0.372428
Mean             29.16    Mid                  18.3    0.372428
Mean             29.16    PC model             23.9    0.180384
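The lift values are consistent with the relative reduction in RMSE against each model's default model, that is (default RMSE - model RMSE) / default RMSE. The slides do not state the formula, so this is inferred from the numbers; the short check below recomputes the column from the RMSE values reported in the comparison.

```python
def lift(default_rmse, model_rmse):
    """Relative reduction in RMSE compared with the default model."""
    return (default_rmse - model_rmse) / default_rmse


# (model, default-model RMSE, model RMSE) as reported in the comparison.
reported = [
    ("Bagging", 12.2, 5.12),
    ("Boosting", 12.2, 7.62),
    ("Linear Regression", 29.16, 13.8),
    ("Lasso", 29.16, 18.2),
    ("Ridge", 29.16, 18.3),
    ("Mid", 29.16, 18.3),
    ("PC model", 29.16, 23.9),
]

lifts = {name: round(lift(default, model), 6) for name, default, model in reported}
best_model = max(lifts, key=lifts.get)
```

Because bagging and boosting are measured against a stronger default (Random Forest) than the regression models (the mean), lift values are only directly comparable within the same default model.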

RECOMMENDATIONS
The most important features in our data set are Power supply, Electricity and Education.
Various other features have more of a correlational relationship and present an empirical representation of the socio-economic condition of villages.
The government can use these features to drive the budget and other village-related policies.
The model can be improved further by incorporating other rich features such as temperature, soil composition and soil type.