movie data analysis


Upload: manvi-chandra

Post on 10-Feb-2017


Page 1: Movie data analysis

CIS 520: Software Engineering

Movie Data Analysis using HiveQL

Submitted to: Dr. Jongwook Woo

Kumari Parul Bisen, Krutik Shah, Manvi Chandra

Page 2: Movie data analysis

Table of Contents

Movies
Project Description
Hadoop, Hive, PowerView
Cloudberry Explorer for Azure Blob Storage
Flowchart
Relation to SDLC
Hive Queries for data analysis
Output and Visualization on graphs
Dashboard

Page 3: Movie data analysis

What is the Movie Dataset?

We extracted the movie data from http://www.the-numbers.com/.

The-Numbers has tracked over 20,000 movies.

This data analysis is based on MPAA (Motion Picture Association of America) ratings, rankings, genre, gross profit, and tickets sold.

Page 4: Movie data analysis

Project Description

We analyze the movie data using HiveQL.

The results are exported to Excel sheets.

The analyzed data is visualized using Power View in MS Excel.

Page 5: Movie data analysis

Hadoop

Hadoop is an open-source framework for the distributed storage and processing of very large datasets.

A Hadoop cluster is a special type of computational cluster built to store and analyze large volumes of unstructured data.

Hadoop clusters are gaining popularity for speeding up data analysis applications. They are also highly scalable.

Hadoop clusters are resistant to failures: data is replicated across nodes, so work can continue when a node goes down.

Page 6: Movie data analysis

Hive

Hive is a data warehouse system for Hadoop. It supports querying and data analysis through HiveQL. Hive enables users to project structure onto large unstructured data. Hive can read both structured and unstructured data, including text files whose fields are delimited by specific characters.
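The idea of applying structure to delimited text when it is read can be illustrated outside Hive. The following is a minimal Python sketch, not the project's code: the column names in SCHEMA, the tab delimiter, and the sample records are all assumptions made for illustration, since the slides do not list the actual schema.

```python
import csv
import io

# Hypothetical column names and types; the slides do not list the real schema.
SCHEMA = [("title", str), ("genre", str), ("gross", int), ("tickets", int)]


def read_delimited(text, delimiter="\t"):
    """Apply SCHEMA to raw delimited text and return one dict per record."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(fields) != len(SCHEMA):
            continue  # skip lines that do not match the declared structure
        rows.append(
            {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}
        )
    return rows


# Made-up sample records, for illustration only.
SAMPLE = "Movie A\tAction\t1000\t100\nMovie B\tDrama\t500\t40\n"
movies = read_delimited(SAMPLE)
```

The raw text stays untouched; the structure lives only in the schema that is applied while reading, which is roughly what a Hive table definition over a text file provides.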

Page 7: Movie data analysis

PowerView

PowerView is an add-in that lets users collect, store, build, and analyze large volumes of data in Excel.

PowerView provides intuitive data visualization of Power Pivot models.

PowerView acts as a visualization layer for Excel.

Page 8: Movie data analysis

Cloudberry Explorer for Blob Storage

It is leveraged by Microsoft Azure Storage Analytics.

It is available in two versions: freeware and Pro.

We used this tool to upload data from the local machine to Azure Blob Storage.

Page 9: Movie data analysis

Flowchart

1. Download data from the data source.
2. Format the file as txt.
3. Upload the files with Cloudberry Explorer for Microsoft Azure Blob Storage.
4. Use HiveQL to create external tables.
5. Use query results and PowerView to analyze the data.
6. Dashboard visualization.
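The step of formatting the file as txt could look roughly like the sketch below. It assumes the downloaded data is CSV and that the target is tab-delimited text with one record per line; neither detail is stated in the slides, and the function name and sample input are hypothetical.

```python
import csv
import io


def csv_to_tab_delimited(csv_text):
    """Convert CSV text to tab-delimited text, one record per line."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # drop blank lines
        # Collapse tabs and newlines inside a field to single spaces so they
        # cannot be mistaken for field or record delimiters.
        cleaned = [" ".join(field.split()) for field in row]
        lines.append("\t".join(cleaned))
    return "".join(line + "\n" for line in lines)


# Made-up sample input, for illustration only.
SAMPLE_CSV = 'title,genre,gross\n"Movie A, Part 2",Action,1000\n'
converted = csv_to_tab_delimited(SAMPLE_CSV)
```

A tab is used here because commas often appear inside movie titles, so a comma-delimited text file would split such titles into extra fields.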

Page 10: Movie data analysis

Relation with SDLC

1. Determining the scope, time estimation, and expected output.
2. Gathering data through The-Numbers.com and analyzing it.
3. Designing: acquiring the necessary software for execution, i.e., ODBC, HDInsight, Microsoft Azure.
4. Implementing: developed programs, prepared documents.
5. Testing and maintaining.

Page 11: Movie data analysis

Transfer data from Local to HDInsight

Page 12: Movie data analysis

Hive Queries

Page 13: Movie data analysis

Recommendation based on the Analysis

We use a recommendation technique called content-based filtering to identify the most popular movies.

Content-based filtering uses a set of discrete characteristics of an item to recommend additional items with similar properties.

To find the most popular movies in our dataset, we consider rank, gross revenue earned, and the number of tickets sold.
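The slides do not state how the three attributes are combined, so the following Python sketch shows only one possible reading: order by rank first, then break ties by gross revenue and by tickets sold. The Movie type, the most_popular function, and the sample records are hypothetical.

```python
from typing import NamedTuple


class Movie(NamedTuple):
    title: str
    rank: int     # lower is better
    gross: int    # gross revenue earned
    tickets: int  # number of tickets sold


def most_popular(movies, top_n=3):
    """Order movies by rank, breaking ties by gross and then tickets sold."""
    # Assumption: rank dominates; gross and tickets only break ties.
    ordered = sorted(movies, key=lambda m: (m.rank, -m.gross, -m.tickets))
    return [m.title for m in ordered[:top_n]]


# Made-up records, for illustration only.
SAMPLE = [
    Movie("Movie A", 2, 900, 90),
    Movie("Movie B", 1, 1000, 100),
    Movie("Movie C", 2, 950, 80),
]
top = most_popular(SAMPLE, top_n=2)
```

A weighted score over normalized values would be another option; the tie-breaking order above is simply the easiest one to reason about.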

Page 14: Movie data analysis

Output and Visualizations

Page 15: Movie data analysis

Output and Visualizations

Page 16: Movie data analysis

Output and Visualizations

Page 17: Movie data analysis

Output and Visualizations

Page 18: Movie data analysis

Output and Visualizations

Page 19: Movie data analysis

Conclusion

Data analysis using HiveQL.

Export of the analyzed data to Excel and data representation using PowerView.

Visualization using Dashboard.

Page 20: Movie data analysis

References

Github.com
https://azure.microsoft.com/en-us/documentation/samples/
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

Page 21: Movie data analysis

THANK YOU 😊