analysis of historical movie data by bhadra

COMPUTER SCIENCE AND ENGINEERING. ANALYSIS OF HISTORICAL MOVIE DATA BY USING HADOOP SYSTEM. INTERNAL GUIDE: T. CHANDRA SHEKAR REDDY. G. VEERABHADRA (13R21A05C8)

Upload: bhadra-gowdra

Post on 07-Apr-2017


Page 1: Analysis of historical movie data by BHADRA

COMPUTER SCIENCE AND ENGINEERING

ANALYSIS OF HISTORICAL MOVIE DATA BY USING HADOOP SYSTEM

INTERNAL GUIDE: T. CHANDRA SHEKAR REDDY

: G. VEERABHADRA (13R21A05C8)

Page 2: Analysis of historical movie data by BHADRA

INDEX

• Abstract
• Requirements
• Dataflow Diagram
• Methodology
• Screenshots
• Future Extension
• Conclusion
• References

Page 3: Analysis of historical movie data by BHADRA

ABSTRACT

A recommendation system learns a person's tastes and automatically finds new, desirable content for them, based on patterns in their likes and their ratings of different items. In this paper, we propose a recommendation system for the large volume of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (product, event, individual, or service), using the Hadoop framework.

Page 4: Analysis of historical movie data by BHADRA

REQUIREMENTS

• Hadoop 2.x
• MySQL
• HDFS
• Hive
• Pig
• Hue
• JDK 1.6

Page 5: Analysis of historical movie data by BHADRA

Dataflow Diagram

• MS Excel (datasets in CSV format)
• Import into Cloudera home
• Create database in MySQL
• Load the data into MySQL
• Load the data into Hive using Sqoop
• Load the data into Hue

Page 6: Analysis of historical movie data by BHADRA

HDFS

Hadoop Distributed File System (HDFS): HDFS is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.

An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.
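To make the idea of partitioning data across hosts concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks, each copied to several hosts. It assumes the Hadoop 2.x defaults of a 128 MB block size and a replication factor of 3, neither of which is stated in the slides. The host names are hypothetical, and the round-robin placement is a simplification: real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumption: Hadoop 2.x default block size (128 MB)
REPLICATION = 3                 # assumption: default replication factor


def plan_blocks(file_size, hosts, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file of file_size bytes into blocks and pick replica hosts.

    Returns a list of (block_index, block_length, replica_hosts) tuples.
    Placement is simple round-robin, not the rack-aware policy HDFS uses.
    """
    if replication > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        replicas = [hosts[(i + r) % len(hosts)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


# Hypothetical 300 MB file on a hypothetical four-node cluster.
plan = plan_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
for index, length, replicas in plan:
    print(index, length, replicas)
```

Because every block lives on several hosts, a task that processes a block can be scheduled on one of the hosts that already stores it, which is what "close to their data" refers to.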

Page 7: Analysis of historical movie data by BHADRA

HDFS Architecture:

Page 8: Analysis of historical movie data by BHADRA

HIVE

• Hive is a data warehousing framework in Hadoop in which data is stored in tables (a structured format). Hive runs on top of HDFS and MapReduce.
• The back-end storage for Hive is HDFS, and its execution model is MapReduce.
• Hive provides an SQL-like language called HiveQL (HQL). HQL is very similar to SQL.
• Hive is designed for scalability and ease of use.

Page 9: Analysis of historical movie data by BHADRA

HIVE primitive data types:

• TINYINT (1 byte)
• SMALLINT (2 bytes)
• INT (4 bytes)
• BIGINT (8 bytes)
• FLOAT (4 bytes)
• DOUBLE (8 bytes)
• STRING (max size 2 GB)
• VARCHAR (1 to 65535 characters; supported from Hive 0.12.0)
• BOOLEAN (true/false)
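The byte widths listed for the integer types determine the range of values each one can hold, since Hive integers are signed. The sketch below derives those ranges and picks the narrowest type for a value. The helper names are hypothetical and are not part of Hive.

```python
# Byte widths of the Hive integer types, narrowest first.
INTEGER_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(n_bytes):
    """Return (min, max) for a signed two's-complement integer of n_bytes."""
    bits = 8 * n_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type_for(value):
    """Return the narrowest integer type name whose range contains value."""
    for name, width in INTEGER_WIDTHS.items():
        low, high = signed_range(width)
        if low <= value <= high:
            return name
    raise ValueError("value does not fit in BIGINT")


for name, width in INTEGER_WIDTHS.items():
    print(name, signed_range(width))
```

For example, a release year such as 2017 does not fit in a TINYINT but does fit in a SMALLINT, which matters when choosing column types for a movie table.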

Page 10: Analysis of historical movie data by BHADRA

SQOOP

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
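As an illustration of what a Sqoop import looks like in practice, the sketch below assembles the argument list for a MySQL-to-Hive import. The flags used (`--connect`, `--username`, `--password`, `--table`, `--fields-terminated-by`, `--hive-import`, `-m`) are standard Sqoop 1 options. The function name, the database name `moviedb`, and the table name `movies` are hypothetical. The `-m` value sets how many parallel map tasks carry out the transfer.

```python
import shlex


def build_sqoop_import(database, table, username, password,
                       mappers=1, host="localhost"):
    """Build the argv list for a Sqoop import from MySQL into Hive."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", username,
        "--password", password,
        "--table", table,
        "--fields-terminated-by", ",",
        "--hive-import",      # also create and load a Hive table
        "-m", str(mappers),   # number of parallel map tasks
    ]


# Hypothetical database and table names, for illustration only.
argv = build_sqoop_import("moviedb", "movies", "root", "cloudera")
print(shlex.join(argv))
```

Building the command as a list, rather than as one string, keeps each value a separate argument, so a table or database name containing spaces cannot be split by the shell.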

Page 11: Analysis of historical movie data by BHADRA

COMMAND

Copy the file from Windows to Cloudera.

To create the database:
mysql> create database name;

To use the database:
mysql> use name;

Page 12: Analysis of historical movie data by BHADRA

COMMAND

To create the table:
mysql> create table tablename(….);

Page 13: Analysis of historical movie data by BHADRA

COMMAND

To import the datasets into MySQL, the following command is used to load the file:
mysql> load data local infile 'path of the file' into table tablename fields terminated by ',' enclosed by '"' lines terminated by '\r\n';
mysql> exit;
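The load options above describe the file layout: fields separated by commas, optionally wrapped in double quotes, and lines ending in `\r\n` (the Windows convention, which fits a file exported from Excel). The sketch below parses that layout with Python's `csv` module to show why the quoting matters. The column names follow the fields charted later in the slides (title, year, rating, Budget_in_crores, collection), but their order and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows in the layout the LOAD DATA options describe.
SAMPLE = (
    '"Movie A",2015,8.1,120,450\r\n'
    '"Movie B, Part 2",2016,6.4,80,95\r\n'
)


def parse_movies(text):
    """Parse comma-separated, double-quote-enclosed records into dicts."""
    # newline="" leaves the \r\n line endings for the csv module to handle.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=",", quotechar='"')
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        title, year, rating, budget, collection = record
        rows.append({
            "title": title,
            "year": int(year),
            "rating": float(rating),
            "budget_in_crores": float(budget),
            "collection": float(collection),
        })
    return rows


movies = parse_movies(SAMPLE)
for movie in movies:
    print(movie)
```

The second sample title contains a comma. Without the enclosing quotes it would be split into two fields, which is the case the `enclosed by '"'` option guards against.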

Page 14: Analysis of historical movie data by BHADRA

COMMAND

To import the data from MySQL into Hive, the following command is used:
sqoop import --connect jdbc:mysql://localhost/databasename --username root --password cloudera --table tablename --fields-terminated-by ',' --hive-import -m 1

To log in to Hue: username: Cloudera, password: Cloudera. Then go to the Hive editor.

On the left side, select the database; on the right side, run analytical queries on the tables created. Once the result is displayed, select a chart type, and repeat the same process for each of the respective years.
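The slides do not list the analytical queries themselves, so the following is only a sketch of the kind of query that could be run in the Hive editor. It uses Python's built-in `sqlite3` as a stand-in for Hive so that it runs anywhere. The queries use only plain SELECT, ORDER BY, LIMIT, and GROUP BY, which HiveQL also supports. The table name `movies`, its schema, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: title, year, rating, budget_in_crores, collection.
ROWS = [
    ("Movie A", 2015, 8.1, 120.0, 450.0),
    ("Movie B", 2015, 6.4, 80.0, 95.0),
    ("Movie C", 2016, 7.3, 150.0, 300.0),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the Hive table
conn.execute(
    "CREATE TABLE movies (title TEXT, year INTEGER, rating REAL, "
    "budget_in_crores REAL, collection REAL)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?, ?)", ROWS)

# Best-rated movie.
best_rated = conn.execute(
    "SELECT title, rating FROM movies ORDER BY rating DESC LIMIT 1"
).fetchone()

# Movie with the highest budget.
highest_budget = conn.execute(
    "SELECT title, budget_in_crores FROM movies "
    "ORDER BY budget_in_crores DESC LIMIT 1"
).fetchone()

# Total collection per year, the data behind a year-versus-collection chart.
collection_by_year = conn.execute(
    "SELECT year, SUM(collection) AS total_collection "
    "FROM movies GROUP BY year ORDER BY year"
).fetchall()

print(best_rated, highest_budget, collection_by_year)
```

Each result set has one label column and one numeric column, which is the shape a bar chart needs.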

Page 15: Analysis of historical movie data by BHADRA
Page 16: Analysis of historical movie data by BHADRA
Page 17: Analysis of historical movie data by BHADRA

Representing bar graphs between title and rating

Page 18: Analysis of historical movie data by BHADRA

Representing bar graphs between Budget_in_crores and collection

Page 19: Analysis of historical movie data by BHADRA

Representing bar graphs between Year and collection

Page 20: Analysis of historical movie data by BHADRA

FUTURE EXTENSION

Big Data is clearly in its early stages, and much remains to be discovered. The technology brings business benefits when leveraged across domains such as Big Data, Business Intelligence, and Analytics. These business benefits are:

• Speed and accelerated performance: good query performance for improved decision making, faster data load processes for low data latency, and accelerated memory planning capabilities.
• New business insights: self-service BI and more flexible modeling capabilities.
• Faster business processes.

Page 21: Analysis of historical movie data by BHADRA

CONCLUSION

The availability of Big Data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that, for the first time in history, we have the capabilities required to analyze astonishing data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in efficiency, productivity, revenue, and profitability. The Age of Big Data is here, and these are truly revolutionary times if both business and technology professionals continue to work together and deliver on the promise. The promises of Big Data include innovation, growth, and long-term sustainability.

From the results, we can analyze the movies and produce reports, such as the best-rated, highest-budget, and highest-collection movies, within a click.


Page 23: Analysis of historical movie data by BHADRA

Screenshot(Implementation):

Page 24: Analysis of historical movie data by BHADRA
Page 25: Analysis of historical movie data by BHADRA
Page 26: Analysis of historical movie data by BHADRA
Page 27: Analysis of historical movie data by BHADRA

Gantt Chart (definition): A Gantt chart is a chart in which a series of horizontal lines shows the amount of work done or production completed in certain periods of time, in relation to the amount planned for those periods.

Page 28: Analysis of historical movie data by BHADRA

Future Work: In the next phase, we will analyze the datasets loaded into Hive using Hue or the R tool.

Page 29: Analysis of historical movie data by BHADRA

Conclusion: In this project, we loaded a large set of datasets into HDFS using Sqoop and Hive. The movie data can then be easily analyzed using Hue.

Page 30: Analysis of historical movie data by BHADRA