decisionstats.com data science virtual internship

INTERNSHIP REPORT at Decision Stats Consultancy, 18th June 2014 - Present. Chandan Kumar Routray, Second Year Undergraduate Student, IIT Kharagpur, West Bengal

Upload: ajay-ohri

Post on 27-Jan-2015


DESCRIPTION

An internship report by a Decisionstats.com intern and IIT student, Chandan R.

TRANSCRIPT

Page 1: Decisionstats.com Data Science Virtual Internship

INTERNSHIP REPORT

At Decision Stats Consultancy 18th June 2014 - Present

Chandan Kumar Routray

Second Year Undergraduate Student

IIT Kharagpur, West Bengal


Table of Contents

I. Summary

II. An Overview: Things that I learned

III. Blog Posts: During the Internship

IV. Appendix (Day wise Work)

Day 1
Day 2
Day 3
Day 4-5
Day 6-7
Day 8-9
Day 9-11
Day 12-13
Day 14-18
Day 19-20
Day 20-26
Day 40+


Summary

I have completed almost a month of this internship with Decision Stats Consultancy. The internship has been a roller coaster ride from the very beginning, and the past four weeks have been the most productive period of my life in terms of learning. It has helped me become a more disciplined and professional person, and it has helped me acquire a wide set of skills such as analytics, web development, and technical blog writing. Mr. Ajay Ohri, my guide for this internship, scheduled daily update calls at around 9 pm over Skype. In these calls he reviews my daily assignments, gives me useful tips on how to manage my work and make it more presentable, and then sets the assignment for the next day. After each call I have a new target to achieve in a given period of time, and I plan my next day around it.

A typical assignment consists of learning something new, such as coding or understanding a package in R or Python, and performing an exercise on it; writing an informative blog post on what I learned that day; and writing a life-experience blog post. In the course of the internship I learned many things: coding in Python, R, and JavaScript; using packages such as Shiny, rpy2, and ggvis in R and Python; writing database queries with MySQL; protecting a website from SQL injection; making my own website using Bootstrap; automated extraction of data from the web and from a network; working in the cloud; and many other things. Earlier I used to run away from writing, but now I have become a blogger and I enjoy it. I have also learned how to write an informative blog post. Readers have started asking questions about my posts, and a few of the posts have been re-blogged by other bloggers. So far I have written 14 tech blog posts about the things I have learned, and I made all of them reader-friendly.

I have been very excited about this internship from the very beginning, and Mr. Ajay Ohri has now offered to let me continue it for some more time, for which I am very grateful to him.


An Overview: Things that I learned

Programming
• Python (www.codecademy.com)
• JavaScript
• R (Swirl package and www.datacamp.com)

Web Development
• Bootstrap (www.jetstrap.com)
• SQL and SQL Injection
• JavaScript (D3.js)
• Hosting via Dropbox

Writing
• Technical Blog Writing (www.python4analytics.wordpress.com)
• Report Making

Software
• Virtualization Software: Oracle VirtualBox, VMware Player
• Database Management: MySQL Workbench

Analytics
• Data Extraction: Wireshark, iMacros
• Analysing Data: RStudio, Python (Pandas, rpy2)
• Result Presentation: RStudio (Shiny, ggvis, slidify), d3.js

Working on Cloud
• AWS EC2: Starting an instance, accessing the instance, installing RStudio Server and IPython on it, etc.

Big Data
• Apache Hadoop: on Hortonworks Sandbox
• Hue: Hive, Pig, HCatalog

Other
• Git
• Infographics: Infogr.am


Blog Posts: During the Internship

Topic: Link

Python: https://python4analytics.wordpress.com/2014/06/18/python-for-analytics-intro/
Installing IPython: https://python4analytics.wordpress.com/2014/06/19/installing-ipython-on-anaconda/
Introduction to R: https://python4analytics.wordpress.com/2014/06/20/introduction-to-r-language-installing-swirl/
Pandas Library: https://python4analytics.wordpress.com/2014/06/22/statistical-python-pandas-library/
D3.js (JavaScript): https://python4analytics.wordpress.com/2014/06/23/presenting-the-results-working-with-d3-js-a-javascript-library/
Datacamp vs. Swirl: https://python4analytics.wordpress.com/2014/06/24/learning-r-datacamp-com-vs-swirl-package/
Shiny: https://python4analytics.wordpress.com/2014/06/26/shiny-rstudio-web-application-framework-for-r/
Git: https://python4analytics.wordpress.com/2014/06/26/using-git-for-projects/
SAS: https://python4analytics.wordpress.com/2014/06/30/intro-to-sas-and-installation/
iMacros: https://python4analytics.wordpress.com/2014/07/07/web-scrapingdata-extraction-from-web-using-imacros/
SQL: http://python4analytics.wordpress.com/2014/07/08/sql-and-sql-injection-a-web-attack-technique/
EC2: https://python4analytics.wordpress.com/2014/07/18/setting-up-rstudio-server-on-aws-ec2-instance/
Infogr.am: https://python4analytics.wordpress.com/2014/07/25/infogram-infographics-made-easy/
Wireshark: https://python4analytics.wordpress.com/2014/07/25/infogram-infographics-made-easy/
Bootstrap: https://python4analytics.wordpress.com/2014/07/01/make-responsive-website-with-bootstrap/
Apache Hadoop: https://python4analytics.wordpress.com/2014/08/06/installing-hortonworks-sandbox-hadoop/


Appendix (Day wise Work)

Day 1

Task Given
1) Create a blog on http://blogger.com and http://wordpress.com
2) Start an account on Codecademy and send a screenshot of the initial page. You will be learning Python
3) Download and install R from www.r-project.org
4) Write a blog post on your experience on Day 1 of the internship

Work Update
1) Codecademy account started. Completed 23% of the beginner's course on the very first day. Check my progress at www.codecademy.com/imeckr
2) R downloaded and installed on my system.
3) Created a blog on Blogger.com and posted my first blog post on it: http://analyticsinternship.blogspot.com/2014/06/day-1.html

References used
None

Remarks
1) Edit the Day 1 blog properly and write a more information-oriented blog
2) The URL of a web page should be the answer to a question (SEO)
3) How to select a good theme for your blog


Day 2

Task Given

1) Create a new blog on Wordpress.com with a catchier name, the same content, and better editing; its title URL should be the answer to a question on Google Search. Please send me screenshots. What should be the reason for choosing an appropriate theme?
2) Tags should be used, and then you should share it on your Facebook, LinkedIn, Twitter and Google Plus profiles. Please send me screenshots of this.
3) Please earn at least 5 badges in Python on Codecademy for tomorrow's submission
4) Please read this page: http://pandas.pydata.org/. Download and install Pandas
5) Download and install IPython: http://ipython.org/
6) Blog on Day 2 (besides your existing edited and refined Day 1 blog)

Work Update
1) Created Wordpress blog: https://python4analytics.wordpress.com/
Link to Day 1 blog: https://python4analytics.wordpress.com/2014/06/18/python-for-analytics-intro/
Link to Day 2 blog: https://python4analytics.wordpress.com/2014/06/19/installing-ipython-on-anaconda/
2) Earned 6 badges on Day 2 on Codecademy: http://www.codecademy.com/imeckr
3) IPython and Pandas downloaded and installed on my system
4) Blog shared on my Facebook, Google+, and LinkedIn accounts.

References used
www.pandas.pydata.org, www.ipython.org, and their respective documentation.

Remarks
1) Maintain two different blogs: one on Blogger.com for work experience and one on Wordpress.com for tech blogging on the things I learn daily

Screenshots


Day 3

Task Given

1) Create accounts on Topcoder, Kaggle, and GitHub. Write a one-paragraph summary of what these websites are and what advantages an account on them can give you
2) Install the Swirl package in R (use Google to find out how). Do one exercise. Show a screenshot
3) Go to Datacamp.com and create an account. Do one exercise and show a screenshot.
4) Get 4 badges in Python and 2 badges in JavaScript on Codecademy
5) Blog on this. Show screenshots of the analytics of each blog. Answer this question: which metrics should I track for my blog if I want to make it better?

Work Update
1) Codecademy status: Python 50% completed, JavaScript 21%, with 23 badges, 187 points, and a 4-day streak.
2) Blogged about R on Wordpress: https://python4analytics.wordpress.com/2014/06/20/introduction-to-r-language-installing-swirl/
3) Swirl package installed on my system. Did a few exercises.
4) Created an account on Datacamp. Did a few exercises there too.
5) Creating an account on sites like Topcoder, Kaggle, and GitHub helps a user in many ways. Because these sites already have many registered users from across the world, they act as online communities of coders, designers, analysts, innovators, and others, where users can discuss their problems and ideas. They also help a user see where exactly he or she stands now and how to develop his or her talent in the respective field. A user can also take up various courses and projects, and even compete with other users.
6) Keeping track of the following will make one's blog better:
(i) Referrers: Where are my visitors being redirected from, and where should I share my blog more often?
(ii) Region of visitors: Which region do most of my visitors belong to?
(iii) Tags and categories: Shows which topics are trending on search engines.

References used
www.swirlstats.com and its documentation.

Screenshots


Day 4-5

Task Given

1) CODING: Get to 60% in Python and 40% in JavaScript on Codecademy
2) STATISTICAL PYTHON: Go to http://pandas.pydata.org/pandas-docs/stable/10min.html#min and blog on the experience
3) PRESENTATION OF RESULTS: Go to http://d3js.org/. Read it and blog on it. (Part 2 is the shiny package in R from http://shiny.rstudio.com/tutorial/; Part 3 will be the slidify package in R from http://slidify.org/)
4) CODING: Do one module in Swirl. Write a tech blog on what you have learnt
5) CODING: Go to Datacamp.com. Do one exercise and show screenshots.

Work Update
1) Completed 60% in Python and 40% in JavaScript
2) Blog on Statistical Python: https://python4analytics.wordpress.com/2014/06/22/statistical-python-pandas-library/
Blog on D3.js: https://python4analytics.wordpress.com/2014/06/23/presenting-the-results-working-with-d3-js-a-javascript-library/
3) One module completed in Swirl
4) Completed one exercise on Datacamp.com
5) Blog on experience:
Day 3: http://analyticsinternship.blogspot.in/2014/06/day-3.html
Day 4-5: http://analyticsinternship.blogspot.in/2014/06/day-4-5.html

References used
www.d3js.org and its documentation.

Screenshots


Day 6-7

Task Given

1) Do one more module in Swirl
2) Do one exercise on Datacamp
3) Write a technical blog post on how the two are different, including the pluses and minuses of both (Swirl vs. Datacamp)
4) Read about using JS within R here: http://timelyportfolio.blogspot.in/2013/04/d3-r-with-rcharts-and-slidify.html
5) Complete 4 badges each in Python and JS

Work Update
1) Codecademy status: Python 70% and JavaScript 50%
2) One module completed in the Swirl package.
3) One exercise completed on Datacamp
4) Read the article at the link that you had given
5) Blogged on Swirl vs. Datacamp: http://python4analytics.wordpress.com/2014/06/24/learning-r-datacamp-com-vs-swirl-package/

References used
http://timelyportfolio.blogspot.in/2013/04/d3-r-with-rcharts-and-slidify.html

Screenshots


Day 8-9

Task Given

1) Make a demo app on Shiny. How is the population of India and China changing over time? How is the per capita GDP changing over time? Google for datasets. Send me an initial draft.
2) Install and load SAS University Edition: http://www.sas.com/en_us/software/university-edition.html
3) Complete the exercises at https://try.github.io/
4) Write 3 tech blog posts on Shiny, Git and SAS

Work Update
1) Made a demo app on Shiny which can show one plot at a time. Made a data frame, which I have used in the app.
2) Completed the Git exercise.
3) Blog on Git: https://python4analytics.wordpress.com/2014/06/26/using-git-for-projects/
Blog on Shiny: https://python4analytics.wordpress.com/2014/06/26/shiny-rstudio-web-application-framework-for-r/

References used
www.ggvis.rstudio.com, shiny.rstudio.com, and their documentation.

Screenshots

Shiny App


Day 9-11

Task Given

1) Use http://shiny.rstudio.com/gallery/ for troubleshooting your Shiny app
2) Use the ggvis package somehow in your app (http://ggvis.rstudio.com/) and also use d3.js (hint: read this http://www.xavierdupre.fr/blog/2013-11-30_nojs.html)
3) Create one small demo showing data flow and calls between Python and R using rpy2. For example, load some JSON data using Python and then call an R package.
4) Complete all pending blog posts
5) Make a small demo website using https://jetstrap.com/
6) Create an infographic for the same dataset that you are using in the Shiny app, using http://infogr.am/
7) Try to download and install this; it will help check for VMware and also start off big data efforts: http://hortonworks.com/products/hortonworks-sandbox/

Work Update
1) Built the Shiny app. Used ggvis but not D3.js so far
2) Used rpy2 in Python to import a built-in dataset from R and plot a graph of it.
3) Tech blog post on SAS: http://python4analytics.wordpress.com/2014/06/30/intro-to-sas-and-installation/
4) Made an infographic
5) Demo website: I tried to re-create my blog: http://jetstrap.io/share/cfcd9bc36a

References used
www.shiny.rstudio.com, www.xavierdupre.fr/blog/2013-11-30_nojs.html

Screenshots

Shiny App


Day 12-13

Task Given

1) Create Demo Website: Read this and try to create a website for Decision Stats Consulting. Take content from the image in the post and from the http://decisionstats.com/about-decisionstats/ page.
2) For tomorrow, read about Bootstrap at http://getbootstrap.com/ and blog on it
3) Install MySQL on your system (full installation). Learn SQL. Create a table of all players of all the teams remaining in the round of 16. It should have player name, player surname, football club, the position he plays, and one more additional column at your discretion. Then answer the following questions programmatically using SQL queries: Which World Cup team is now the tallest? Which is the oldest? Which is the shortest? Which is the youngest? Which striker is the fattest/youngest?
4) Python: Make it to 90% by Wednesday
5) R: Finish Swirl (all modules) by Wednesday

Work Update
1) Completed 90% of the Python course on Codecademy.
2) Learned about Bootstrap and revised HTML and CSS
3) Made an "About" page for Decision Stats by editing existing templates and adding some new elements (hosted using dropbox.com: http://imeckrdemo.kissr.com/)
4) Blogged on Bootstrap (https://python4analytics.wordpress.com/2014/07/01/make-responsive-website-with-bootstrap/)
5) Completed all modules of R programming in Swirl
6) Installed MySQL on my system. Read about MySQL and am currently learning it; I will do the assignment on it after clearing my doubts with you.
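
The World Cup exercise in task 3 can be sketched roughly as follows. This is only an illustration, not the assignment as submitted: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the team, height_cm, and age columns as well as every row are hypothetical placeholders added so that the questions can be answered.

```python
import sqlite3

# Hypothetical rows for illustration only; names, clubs, and numbers are made up.
# Columns: name, surname, team, club, position, height_cm, age
players = [
    ("Player", "One", "Team A", "Club X", "striker", 185, 27),
    ("Player", "Two", "Team A", "Club Y", "defender", 191, 31),
    ("Player", "Three", "Team B", "Club Z", "striker", 178, 22),
    ("Player", "Four", "Team B", "Club X", "midfielder", 174, 24),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE players (
           name TEXT, surname TEXT, team TEXT, club TEXT,
           position TEXT, height_cm INTEGER, age INTEGER)"""
)
conn.executemany("INSERT INTO players VALUES (?, ?, ?, ?, ?, ?, ?)", players)

# Each team-level question becomes "average per team, sort, take the first row".
QUERIES = {
    "tallest_team": "SELECT team, AVG(height_cm) AS v FROM players "
                    "GROUP BY team ORDER BY v DESC LIMIT 1",
    "shortest_team": "SELECT team, AVG(height_cm) AS v FROM players "
                     "GROUP BY team ORDER BY v ASC LIMIT 1",
    "oldest_team": "SELECT team, AVG(age) AS v FROM players "
                   "GROUP BY team ORDER BY v DESC LIMIT 1",
    "youngest_team": "SELECT team, AVG(age) AS v FROM players "
                     "GROUP BY team ORDER BY v ASC LIMIT 1",
    # Player-level question: filter by position, then sort by age.
    "youngest_striker": "SELECT name, surname, age FROM players "
                        "WHERE position = 'striker' ORDER BY age ASC LIMIT 1",
}

answers = {label: conn.execute(sql).fetchone() for label, sql in QUERIES.items()}

for label, row in answers.items():
    print(label, row)
```

The same GROUP BY and ORDER BY pattern should carry over to MySQL Workbench with little change, although that has not been checked here.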

References used www.getbootstrap.com and its documentation

Screenshots


Made This Demo Website


Day 14-18

Task Given

1) Read about SQL and SQL injection at http://decisionstats.com/2013/03/26/how-to-learn-sql-injection/ and try the demos at http://sqlzoo.net/hack/
2) What was the problem with the SAS installation? Blog on this AFTER you have successfully installed it and shown screenshots
3) Compile everything you have learnt into a 1-page essay, with an appendix of the day-wise submissions that you did.
4) Edit all the blogs

Work Update
1) Learned web scraping using iMacros; still facing some problems in extracting data from some sites. Also wrote a blog post on it: https://python4analytics.wordpress.com/2014/07/07/web-scrapingdata-extraction-from-web-using-imacros/
2) Created a basic table in a database using MySQL Workbench and practiced some basic queries on it.
3) Learned about SQL injection from the resources provided by you. Also blogged on it: http://python4analytics.wordpress.com/2014/07/08/sql-and-sql-injection-a-web-attack-technique/
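
To make the idea of SQL injection concrete, here is a minimal sketch of a vulnerable query next to a parameterized one. It is an illustration under assumptions, not taken from the blog post or the sqlzoo demos: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the users table, the login functions, and the sample credentials are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'secret')")  # hypothetical account


def login_unsafe(username, password):
    # Vulnerable: user input is pasted directly into the SQL text,
    # so quotes in the input can change the meaning of the query.
    query = (
        "SELECT COUNT(*) FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone()[0] > 0


def login_safe(username, password):
    # Safer: placeholders make the driver treat input as data, not as SQL.
    query = "SELECT COUNT(*) FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone()[0] > 0


# Classic injection string: it closes the quote and adds a condition that is always true.
attack = "' OR '1'='1"

unsafe_result = login_unsafe("alice", attack)  # the WHERE clause becomes always true
safe_result = login_safe("alice", attack)      # compared as a literal password

print("unsafe login accepted:", unsafe_result)
print("safe login accepted:", safe_result)
```

With the string-built query the attacker is let in without knowing the password, while the parameterized query rejects the same input.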

References used http://decisionstats.com/2013/03/26/how-to-learn-sql-injection/ http://sqlzoo.net/hack/

Screenshots


Day 19-20

Task Given

1) Read these papers: http://www.slideshare.net/ajayohri/using-r-for-cyber-security-part-1 and http://www.sis.pitt.edu/jjoshi/courses/IS2621/Spring2014/Lab3.pdf
2) Use Wireshark and/or Silk to capture some dummy data from a network (wifi or wherever)
3) Use paper 1 to import the data into R and visualize it
4) Additionally, download and install Wireshark and use the instructions from http://www.ict.kth.se/courses/II2202/II2202-quantitative-chip-R-20110918.pdf to help you with the analysis
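
As a rough sketch of what summarizing a capture can look like, the snippet below tallies packets and bytes per protocol from a CSV export. It is an illustration only: the task asks for R, while this uses Python's standard library; the column names are assumed to match the default columns of a Wireshark CSV export; and the sample rows are hypothetical, not captured data.

```python
import csv
import io
from collections import Counter

# Hypothetical sample shaped like a Wireshark CSV export (assumed default columns).
SAMPLE_CSV = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.168.1.10","192.168.1.1","DNS","74","Standard query"
"2","0.020000","192.168.1.1","192.168.1.10","DNS","90","Standard query response"
"3","0.050000","192.168.1.10","93.184.216.34","TCP","66","SYN"
"4","0.090000","93.184.216.34","192.168.1.10","TCP","66","SYN, ACK"
"5","0.100000","192.168.1.10","93.184.216.34","HTTP","400","GET / HTTP/1.1"
'''


def summarize(csv_text):
    """Count packets and total bytes per protocol in an exported capture."""
    packets_per_protocol = Counter()
    bytes_per_protocol = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        packets_per_protocol[row["Protocol"]] += 1
        bytes_per_protocol[row["Protocol"]] += int(row["Length"])
    return packets_per_protocol, bytes_per_protocol


packets, total_bytes = summarize(SAMPLE_CSV)

for protocol, count in packets.most_common():
    print(f"{protocol:5s} {count} packets, {total_bytes[protocol]} bytes")
```

For a real capture, the text of the exported file would be passed to summarize in place of SAMPLE_CSV, and the per-protocol counts could then be plotted.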

Work Update

1) Learned web scraping using iMacros; still facing some problems in extracting data from some sites. Also wrote a blog post on it: https://python4analytics.wordpress.com/2014/07/07/web-scrapingdata-extraction-from-web-using-imacros/
2) Created a basic table in a database using MySQL Workbench and practiced some basic queries on it.
3) Learned about SQL injection from the resources provided by you. Also blogged on it: http://python4analytics.wordpress.com/2014/07/08/sql-and-sql-injection-a-web-attack-technique/

References used

http://decisionstats.com/2013/03/26/how-to-learn-sql-injection/ http://sqlzoo.net/hack/

Screenshots


Day 20-26

Task Given

1) Complete Python on Codecademy.
2) Set up RStudio Server on AWS and blog on it
3) Giving user rights to you, choosing the appropriate user rights.
4) Set up IPython on AWS

Work Update
1) Completed Python on Codecademy
2) Created an AWS account and set up RStudio Server on it.
3) Blogged on it: https://python4analytics.wordpress.com/2014/07/18/setting-up-rstudio-server-on-aws-ec2-instance/

References used

http://www.s-anand.net/blog/ssh-tunneling-through-web-filters/

http://www.r-bloggers.com/instructions-for-installing-using-r-on-amazon-ec2/

Screenshots


Day 40+

Task Given

1) Read these:
http://www.slideshare.net/ajayohri/decision-making-in-the-era-of-cloud-computing-and-big-data
http://www.slideshare.net/ajayohri/big-data-big-analytics

2) Explore Hadoop

3) Complete the tutorials on Hortonworks Sandbox

Work Update
1) Read the two papers provided by you:
http://www.slideshare.net/ajayohri/decision-making-in-the-era-of-cloud-computing-and-big-data
http://www.slideshare.net/ajayohri/big-data-big-analytics
2) Explored Hadoop: what HDFS, MapReduce, Pig, Hive, etc. are
3) Resolved the problem that I was having with Pig and the Sandbox
4) Completed the first two tutorials on Hortonworks Sandbox and learned the following: basic commands in Pig (Grunt shell); downloaded a sample dataset and performed basic Hive queries on it
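
The MapReduce model mentioned above can be illustrated with a small in-memory word count. This is a sketch of the idea only, not Hadoop code and not part of the Hortonworks tutorials: the function names and the sample input lines are hypothetical, and a real job would run the map and reduce steps in parallel across a cluster with the data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input lines standing in for a file stored in HDFS.
counts = word_count(["pig hive pig", "hive hdfs"])
print(counts)
```

Pig and Hive let the same kind of grouping and counting be written as scripts or SQL-like queries, which are then run as jobs on the cluster.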

Blog Written
https://python4analytics.wordpress.com/2014/08/06/installing-hortonworks-sandbox-hadoop/

Screenshots


Thank You