ppt- group 5-final

October 11th, 2016 Group: 5 Big Data Combine Engineered By BattleFin

Upload: ratnam-dubey

Post on 20-Mar-2017


TRANSCRIPT

Page 1: PPT- Group 5-Final

October 11th, 2016

Group: 5

Big Data Combine Engineered By BattleFin

Page 2: PPT- Group 5-Final

The Team

• Partha S Satpathy (Team Lead)
• Jorge Trevino
• Mayuresh Indapurkar
• Ratnam Dubey

Page 3: PPT- Group 5-Final

Problem Statement

• Inputs: 244 variables. These represent sentiment data from several sources such as newspapers and Twitter.
• Outputs: 198 stocks.
• We have the change in their values for every 5 minutes from 9:00 AM to 1:30 PM.

[Figure: timeline from 9:00 AM to 1:30 PM]

Our goal: predict the value of each of the 198 outputs at 4:00 PM.

Page 4: PPT- Group 5-Final

Project Overview

• Correlation among inputs: first find the inputs which actually drive the change in the output.
• Linear regression model: find a relationship between the output and the independent inputs (predictors).
• Predict the inputs at 4 PM using a trend line: we need the inputs first to calculate the output at 4 PM.
• Check if the model is correct: we compare the predicted values with the training data.
• Predict future values: ta-da! We get the future change in stock values at 4 PM.

Page 5: PPT- Group 5-Final

Correlation – what is it?

• It is a single number (between -1 and 1) that describes the degree of relationship between two variables.
• With positive correlation, as one variable rises or falls, the other variable rises or falls as well. With negative correlation, the two move in opposite directions.

Positive Correlation: Variable1 increases, Variable2 increases.

Negative Correlation: Variable1 increases, Variable2 decreases.
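
For reference, and assuming the Pearson coefficient is the one meant here (it is what R's cor() computes by default), the correlation between two variables x and y observed n times can be written as:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1
```

Values near 1 indicate that the variables move together, values near -1 that they move in opposite directions, and values near 0 that there is little linear relationship.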

Page 6: PPT- Group 5-Final

Correlation – Correlation among Inputs?

[Figure: two sources, each shown driving the stock price change]

Page 7: PPT- Group 5-Final

Correlation - Removing Correlation among Inputs

• Large number of input variables (244)
• Discard redundant variables:
  – ones which do not affect the output
  – highly correlated variables

Page 8: PPT- Group 5-Final

Correlation - using R

• cor(<data matrix>)
• Returns a symmetric correlation matrix; one triangle of it is enough, since each pair of inputs appears twice
• Shows the correlation of every input variable with all other input variables
• Discard input variables whose correlation with another input is greater than 0.3
• Approximately 3-5 input variables remain
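
A minimal sketch of this filtering step, under stated assumptions: the data frame `inputs`, its column names, and the greedy rule (drop a column if it is correlated above the threshold with any earlier column) are illustrative choices, not necessarily what the team ran. Only the 0.3 threshold comes from the slide.

```r
# Hypothetical data: three independent inputs plus one near-duplicate.
set.seed(1)
n <- 200
inputs <- data.frame(I1 = rnorm(n), I2 = rnorm(n), I3 = rnorm(n))
inputs$I4 <- inputs$I1 + rnorm(n, sd = 0.1)  # I4 is almost a copy of I1

# Absolute correlations; cor() returns a full symmetric matrix.
cm <- abs(cor(inputs))

# Zero the lower triangle and diagonal so each pair is checked only once.
cm[lower.tri(cm, diag = TRUE)] <- 0

# Flag every column correlated above the threshold with an earlier column.
threshold <- 0.3
to_drop <- apply(cm, 2, function(column) any(column > threshold))
kept <- names(inputs)[!to_drop]
print(kept)
```

This greedy rule is a simplification: a column is dropped even if the earlier column it correlates with was itself dropped, so a more careful selection may keep slightly more inputs.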

Page 9: PPT- Group 5-Final

Linear Regression - Introduction

• Linear regression is a simple and useful tool for predicting a quantitative response.
• Here we assume that there is a linear relationship between the response and the predictor.

Linear regression types:

• Simple Linear Regression
• Multiple Linear Regression

Page 10: PPT- Group 5-Final

Simple Linear Regression

• Simple linear regression is a straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.
• It assumes that there is approximately a linear relationship between X and Y.
• Mathematically, we can write this linear relationship as: $Y \approx \beta_0 + \beta_1 X$

[Figure: plot of Y against X]
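
As a small illustration of fitting such a line in R, the sketch below generates hypothetical data from a known line (intercept 2, slope 3, both invented for the example) and recovers the two coefficients with lm().

```r
# Hypothetical data: Y = 2 + 3 * X plus a little noise.
set.seed(42)
x <- seq(1, 20)
y <- 2 + 3 * x + rnorm(length(x), sd = 0.5)

# Fit the simple linear regression of y on x.
fit <- lm(y ~ x)

# beta[1] estimates the intercept, beta[2] estimates the slope.
beta <- unname(coef(fit))
print(beta)
```

The estimates should land close to the values used to generate the data, which is a quick sanity check that the model is specified as intended.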

Page 11: PPT- Group 5-Final

Multiple Linear Regression

What if we have more than one predictor (X)?

• We use Multiple Linear Regression.
• It is the same as Simple Linear Regression, but the equation is extended to all the predictors.
• Mathematically speaking: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

Page 12: PPT- Group 5-Final

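Using Multiple Linear Regression in the Project

print(head(data.new, 5))
# Name the response by its column (O); data.new$O ~ . would keep O among the predictors.
lmO <- lm(O ~ ., data = data.new)
print(coef(lmO))

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, where Y is the output O and $X_1$, $X_2$, $X_3$ are the inputs I16, I242 and I244.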
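
The following sketch shows the same kind of fit end to end on made-up data. The column names O, I16, I242 and I244 follow the slide; the generated values, the true coefficients and the `new_inputs` row are assumptions made only so the example runs on its own.

```r
# Hypothetical data with the column names used on the slide.
set.seed(7)
n <- 200
data.new <- data.frame(I16 = rnorm(n), I242 = rnorm(n), I244 = rnorm(n))
data.new$O <- 0.5 + 1.0 * data.new$I16 - 2.0 * data.new$I242 +
  0.3 * data.new$I244 + rnorm(n, sd = 0.1)

# Regress the output O on all remaining columns.
lmO <- lm(O ~ ., data = data.new)
print(coef(lmO))

# Predict the output for one new set of input values.
new_inputs <- data.frame(I16 = 0.2, I242 = -0.1, I244 = 1.0)
y_hat <- unname(predict(lmO, newdata = new_inputs))
print(y_hat)
```

Using predict() with a data frame of new inputs is equivalent to plugging the values into the fitted equation by hand, and it avoids mistakes in matching coefficients to variables.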

Page 13: PPT- Group 5-Final

Predicting Future Values at 4 PM - I

Leonard: So, Sheldon:
- You got your equation.
- You got β0, β1, ... from the model.
- Now, to calculate Y at 4 PM, you need X at 4 PM too.
- What is your plan for that?

Sheldon: Do not worry, Leonard. I have that covered. We have a trend line that is going to give us the X values (inputs) at 4 PM.

Page 14: PPT- Group 5-Final

Trend line – Introduction

• A line indicating:
  – the direction of a process with respect to time
  – the tendency of data with respect to time
• Employed whenever time-dependent data is available.

Page 15: PPT- Group 5-Final

Trend line – Types

• Linear
• Non-linear
  – Polynomial
  – Exponential
  – Logarithmic

Page 16: PPT- Group 5-Final

Linear trend line using R

• No consideration for the probabilistic (or stochastic) nature of the process.
• A linear trend line is a linear regression with respect to time.
• Function which can be used in R: lsfit() ("Least Squares Fit")
• Returns the slope "m" and the constant "c"
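
For a series of values $x_1, \dots, x_n$ observed at times $t_1, \dots, t_n$, the least-squares slope and constant have the standard closed form below. The symbols $t_i$, $x_i$ and the hat notation are introduced here for illustration; "m" and "c" are the slide's names.

```latex
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(x_i - \bar{x})}{\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
c = \bar{x} - m\,\bar{t},
\qquad
\hat{x}(t) = m\,t + c
```

Evaluating $\hat{x}(t)$ at a time beyond the observed range is what turns the fitted line into a forecast.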

Page 17: PPT- Group 5-Final

Trend line – how we’ve used it - I

• 244 input variables
• Input variable values:
  – known for 310 days
  – 9 am to 1:30 pm (at 5-minute intervals)
  – 55 values per variable per day
  – 55th time interval -> 1:30 pm
  – 85th time interval -> 4 pm
• Estimate input variable value at 4 pm
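
The interval numbers above can be checked with a little arithmetic. The helper below is hypothetical (the name `time_to_index` and its parameters are not from the slides); it counts 5-minute steps from 9:00 am, with 9:00 am itself as interval 1.

```r
# Map a clock time to its 1-based 5-minute interval index, counting from 9:00 am.
time_to_index <- function(hour, minute, start_hour = 9, step = 5) {
  ((hour - start_hour) * 60 + minute) / step + 1
}

idx_last_obs <- time_to_index(13, 30)  # 1:30 pm -> 55
idx_target   <- time_to_index(16, 0)   # 4:00 pm -> 85
print(c(idx_last_obs, idx_target))
```

So the trend line has to be extended 30 intervals (2.5 hours) beyond the last observed value.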

Page 18: PPT- Group 5-Final

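Trend line – how we’ve used it - II

# Fit a least-squares line against the interval index (1 = 9:00 am, 55 = 1:30 pm)
mc <- lsfit(start:end, day[start:end, col])
print(mc$coefficients)
# Clock labels (H.MM) for the 55 intervals, used only to label the axis
x <- c(seq(9.0, 9.55, .05), seq(10.0, 10.55, .05),
       seq(11.0, 11.55, .05), seq(12.00, 12.55, .05), seq(13.0, 13.30, .05))
# Plot against the same index used in the fit so that abline(mc) lines up with the points
ticks <- seq(1, length(x), by = 6)
plot(seq_along(x), day[, col], xaxt = "n", xlab = "Hour", ylab = names(day)[col])
axis(1, at = ticks, labels = sprintf("%.2f", x[ticks]))
abline(mc)

Input at 4 PM: evaluate the fitted line at X = 85.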
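
A self-contained sketch of the extrapolation step follows. The series `values` is invented for the example; only lsfit(), the 55 observed intervals and the target interval 85 come from the slides.

```r
# Hypothetical input series observed at intervals 1..55 (9:00 am to 1:30 pm).
set.seed(3)
t_obs <- 1:55
values <- 0.01 * t_obs + rnorm(55, sd = 0.02)

# Least-squares fit: the first coefficient is the constant, the second the slope.
mc <- lsfit(t_obs, values)
c0 <- unname(mc$coefficients[1])
m  <- unname(mc$coefficients[2])

# Extend the line to interval 85, which corresponds to 4:00 pm.
x_4pm <- c0 + m * 85
print(x_4pm)
```

Because the forecast sits well outside the observed range, small errors in the slope are magnified, which is worth keeping in mind when judging the final predictions.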

Page 19: PPT- Group 5-Final

Predicting Future Values at 4 PM - II

Leonard: Wow, Sheldon!
- You got your equation.
- You got β0, β1, ... from the model.
- You got the X values at 4 PM too.
- Did you find the Y values at 4 PM?

Sheldon: Of course I did, Leonard. I put the values into the equation and found the future stock value change for 198 stocks for 310 days. Check it out.

Page 20: PPT- Group 5-Final

Testing the model

Leonard: I was wondering if we could test whether our model is estimating correct values or not.

Sheldon: Good thinking, Leonard. We have 200 days of actual outputs. I am going to compare our predicted values with the actual values for one stock. Here we go.
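
One way to make such a comparison concrete is to summarize the prediction errors with a couple of numbers. The sketch below is an assumption about how this could be done, not the team's method: the vectors `actual` and `predicted` are invented, and the choice of mean absolute error and root mean squared error is ours.

```r
# Hypothetical actual and predicted 4 PM changes for one stock over five days.
actual    <- c(0.12, -0.05, 0.30, 0.08, -0.20)
predicted <- c(0.10, -0.02, 0.25, 0.15, -0.18)

errors <- predicted - actual
mae  <- mean(abs(errors))     # mean absolute error
rmse <- sqrt(mean(errors^2))  # root mean squared error
print(c(MAE = mae, RMSE = rmse))
```

Lower values mean the predictions track the actual outputs more closely; comparing the two metrics also hints at whether a few large misses dominate the error.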

Page 21: PPT- Group 5-Final

Data Can Be Confusing?

I can interpret the data using graphs.

We use graphs to organize data.

Page 22: PPT- Group 5-Final

Approach to Choosing the Type of Graph

• Check the purpose of the graph and the type of data to be used.
• Type of data: numeric, stock-related data.
• Selected graph: line graph.

Line graphs are used to track changes over short and long periods of time. When the changes are small, line graphs are better to use than bar graphs.

Page 23: PPT- Group 5-Final

How to Plot the Graph?

What do these functions do?

• Gather and arrange the data according to the need.
• Compare two data series: the minimum-variance stock and the maximum-variance stock.
• Plot the data.
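
The functions from the original slide are not preserved in this transcript, so the following is only a sketch of how the three steps could look in R. The data frame `stocks`, its column names and the plotting choices are all assumptions.

```r
# Hypothetical 4 PM changes: one column per stock, one row per day.
set.seed(5)
days <- 50
stocks <- data.frame(O1 = rnorm(days, sd = 0.1),
                     O2 = rnorm(days, sd = 1.0),
                     O3 = rnorm(days, sd = 0.4))

# Gather: variance of each stock across days.
v <- sapply(stocks, var)

# Compare: pick the minimum-variance and maximum-variance stocks.
min_stock <- names(v)[which.min(v)]
max_stock <- names(v)[which.max(v)]
print(c(min_stock, max_stock))

# Plot: line graph of both series (only when run interactively).
if (interactive()) {
  matplot(as.matrix(stocks[, c(min_stock, max_stock)]), type = "l", lty = 1,
          col = 1:2, xlab = "Day", ylab = "Change at 4 PM")
  legend("topright", legend = c(min_stock, max_stock), col = 1:2, lty = 1)
}
```

Putting both series on one set of axes makes the difference in spread visible at a glance.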

Page 24: PPT- Group 5-Final

Minimum and Maximum Variance Stock

[Figure: two panels, "Minimum Variance Stocks" and "Maximum Variance Stocks"]

Page 25: PPT- Group 5-Final

Results?

Successfully predicted the values at 4 PM for 210 days.

After analyzing the stock variance, I would probably choose the stock with the minimum variance!

Page 26: PPT- Group 5-Final

Minimum Variance Stock or Maximum Variance Stock?

Page 27: PPT- Group 5-Final

THANK YOU!

Questions?