
HOTEL INSPECTION DATASET ANALYSIS

A mini project on BIG DATA-HADOOP

Presented by,

SHARON MOSES

RAGINI AKULA

CONTENTS

Abstract

List of figures

List of screens

1. INTRODUCTION
   1.1 Motivation
   1.2 Existing System
   1.3 Problem Definition
      1.3.1 Storing
      1.3.2 Processing
   1.4 Proposed System
   1.5 Features of the Project
      1.5.1 Storing the Dataset
      1.5.2 Processing the Dataset
2. LITERATURE SURVEY
   2.1 Big Data
   2.2 Apache Hadoop
      2.2.1 Vendors
      2.2.2 Cloudera
      2.2.3 Hadoop Ecosystems
   2.3 Linux Ubuntu
   2.4 MySQL
3. SYSTEM REQUIREMENTS
   3.1 Identification of Needs
   3.2 Environmental Requirements
      3.2.1 Software Requirements
      3.2.2 Hardware Requirements
4. BUSINESS LOGIC
   4.1 System Analysis
      4.1.1 Functional Requirements
         4.1.1.1 Technical Feasibility
         4.1.1.2 Operational Feasibility
   4.2 System Design
      4.2.1 Business Flow
         4.2.1.1 Apache Hadoop Working Model I
         4.2.1.2 Apache Hadoop Working Model II
      4.2.2 Business Logic
5. PROJECT MODULES
   5.1 Modules Introduction
   5.2 Modules
      5.2.1 Analysing and Filtering the Data
      5.2.2 Identifying the Headers (Schema)
      5.2.3 Installing a Single-Node Hadoop Cluster
      5.2.4 Moving the Data to HDFS
      5.2.5 Creating the Tables in Hive
      5.2.6 Importing Data from HDFS to the Hive Warehouse
      5.2.7 Analysing the Data Based on Client Queries
      5.2.8 Generating the Reports
6. EXECUTION OF JOBS
   6.1 Methods of Execution
      6.1.1 Executing the Job from the Hive Prompt
      6.1.2 Executing the Job from the Terminal with Hadoop
      6.1.3 Executing the Job as a Script
   6.2 Execution of HiveQL Jobs
7. TESTING
   7.1 Introduction
   7.2 Sample Unit Testing
8. SCREENS
9. CONCLUSIONS
10. REFERENCES


ABSTRACT

Generally, hotels are complex and costly to maintain. Food quality and spaces with different schedules and uses, such as guest rooms, restaurants, health clubs, swimming pools, and retail stores, each depend on a functional engineering system that must be kept in working order. Maintenance therefore has to be done throughout the year, requiring competent staff to undertake building services, operation, and maintenance, supplemented by outsourced contractors.

In the hospitality industry, maintaining these engineering systems is important despite the complexity of the processes involved, because their effectiveness directly affects the quality of hotel service, food, and beverage, which in turn has a direct and significant effect on guests' impression of the hotel.

This project uses data from inspections carried out on hotels in various parts of the USA. The data records the violations made by hotel managements along with their violation codes, and also the action taken by the government against each hotel according to those codes.

We analyze the data to find out in which areas hotels violate the codes, so that new hotels can avoid these problems and survive in the market.


LIST OF FIGURES

Figure no. and name:

1. Big Data 3 V's
2. Data Measurements
3. Hadoop logo
4. Components of Hadoop
5. HDFS data distribution
6. MapReduce compute distribution
7. Performance and scalability
8. Apache Hadoop ecosystem
9. Ubuntu logo
10. MySQL logo
11. Apache Hadoop working model I
12. Apache Hadoop working model II
13. MapReduce logic
14. Job execution phases in Hadoop
15. Violation table schema diagram
16. Hotel table schema diagram


LIST OF SCREENS

Screen no. and name:

1. Raw dataset
2. Unnecessary data fields
3. Final dataset
4. JDK installation path
5. Java path
6. Hadoop location
7. Hadoop installation crosscheck
8. Hive installation crosscheck
9. hadoop-env.sh file
10. core-site.xml
11. mapred-site.xml
12. hdfs-site.xml
13. Creating a directory
14. Listing the directories
15. Moving data to HDFS
16. Checking the files in HDFS
17. Table created successfully
18. Checking created tables
19. Data loaded to the Hive warehouse and table
20. Table description
21. Verifying the data
22. Job execution
23. Query executed and data loaded to HDFS
24. Result moved to home directory
25. Stored output
26. Output generated by the query
27. Report generated from the query
28. Hive prompt
29. Query using hive -e
30. Query from script
31. Script home directory
32. Query written in script
33. Violation codes
34. Violations made
35. Inspections made area-wise
36. Violation counts from each restaurant
37. Types of cuisines inspected
38. Cuisines with the most inspections
39. Critical and non-critical issues


1. INTRODUCTION

1.1 MOTIVATION:

Generally, hotels are complex and costly to maintain. Food quality and spaces with different schedules and uses, such as guest rooms, restaurants, health clubs, swimming pools, and retail stores, each depend on a functional engineering system that must be kept in working order. Maintenance therefore has to be done throughout the year, requiring competent staff to undertake building services, operation, and maintenance, supplemented by outsourced contractors.

In the hospitality industry, maintaining these engineering systems is important despite the complexity of the processes involved, because their effectiveness directly affects the quality of hotel service, food, and beverage, which in turn has a direct and significant effect on guests' impression of the hotel. This project uses data from various inspections done on hotels in different parts of the USA. The data records the violations made by hotel managements along with their violation codes, and also the action taken by the government against each hotel according to those codes.

We analyze the data to find out in which areas hotels violate the codes, so that new hotels can avoid these problems and survive in the market.

1.2 EXISTING SYSTEM:

These days, for any organization, company, or business firm, the most important thing is to survive in the market and compete with its competitors. To do so, the firm needs to analyze its position in the market.


Analyzing the market requires the data that the firm has generated over many years. The data from past years has multiplied rapidly, creating many problems in storing it and in analyzing what has been stored.

These days storage technology has improved far more than analysis techniques. We face problems analyzing the data stored in our traditional RDBMSs (MySQL, DB2, and so on), and at the same time the data size is exceeding our storage capacity.

1.3 PROBLEM DEFINITION:

The following are the problems we face with the existing system.

1.3.1 Storing:

Over the past couple of years, data has increased drastically in size, creating many problems in storing it.

1.3.2 Processing:

Since the data is so large, we are unable to analyze the dataset within a fixed period of time, and so cannot obtain results efficiently.

1.4 PROPOSED SYSTEM:

In our proposed system we are using new technologies for analyzing the datasets. The

framework we are using is Hadoop.


Hadoop is a framework capable of storing very large amounts of data and of processing a dataset in less time and more efficiently than other technologies.

1.5 FEATURES OF PROJECT:

1.5.1 Storing the Dataset:

We extract the dataset from an external source into our Hadoop cluster using the Sqoop ecosystem tool.

1.5.2 Processing the Dataset:

Once the dataset is extracted, the data is analyzed using MapReduce and other ecosystem tools that work well with the dataset.


2. LITERATURE SURVEY

2.1 BIGDATA:

Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.

Big data describes a volume of data so large that it is difficult to process and exceeds current processing capacity.

Big data can be characterized by the 3 V's: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed.

Figure 1: Big Data 3 V's


An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records, e.g. web, sales, customer contact center, social media, and mobile data.

Figure 2: Data Measurements


2.2 APACHE HADOOP:

Figure 3: Hadoop logo

Hadoop is an open-source software framework for storing and processing big data in a

distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two

tasks: massive data storage and faster processing.

Doug Cutting, Cloudera's Chief Architect, helped create Apache Hadoop out of necessity as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing, and analyzing hundreds of terabytes, and even petabytes, of data.

Why is Hadoop important?

Since its inception, Hadoop has become one of the most talked about technologies. Why?

One of the top reasons (and why it was invented) is its ability to handle huge amounts of data –

any kind of data – quickly. With volumes and varieties of data growing each day, especially from

social media and automated sensors, that’s a key consideration for most organizations. Other

reasons include:

Low cost: The open-source framework is free and uses commodity hardware to store large

quantities of data.

Computing power: Its distributed computing model can quickly process very large volumes

of data. The more computing nodes you use, the more processing power you have.


Scalability: You can easily grow your system simply by adding more nodes. Little

administration is required.

Storage flexibility: Unlike traditional relational databases, you don’t have to preprocess data

before storing it. And that includes unstructured data like text, images and videos. You can store

as much data as you want and decide how to use it later.

Inherent data protection and self-healing capabilities: Data and application processing are

protected against hardware failure. If a node goes down, jobs are automatically redirected to

other nodes to make sure the distributed computing does not fail. And it automatically stores

multiple copies of all data.

Figure 4: Components of Hadoop

HDFS: (Hadoop Distributed File System)

HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster

of industry standard servers into a massively scalable pool of storage. Developed specifically for

large-scale data processing workloads where scalability, flexibility and throughput are critical,

HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming,

and scales to proven deployments of 100PB and beyond.


Figure 5: HDFS data distribution

Data in HDFS is replicated across multiple nodes for compute performance and data protection.
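The replication described above can be sketched with a toy block-placement function in Python. This is only an illustration: the round-robin strategy, node names, and sizes are made up, and real HDFS uses a rack-aware placement policy.

```python
import itertools

def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    """Toy placement: assign each block of a file to `replication`
    distinct nodes by walking round-robin over the node list."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    ring = itertools.cycle(nodes)
    return {b: [next(ring) for _ in range(replication)]
            for b in range(num_blocks)}

# A 200 MB file with a 64 MB block size splits into 4 blocks;
# each block is stored on 3 of the 5 nodes.
nodes = ["node1", "node2", "node3", "node4", "node5"]
plan = place_blocks(200, 64, nodes)
```

Because every block lives on several nodes, losing one node still leaves readable copies elsewhere, which is the self-healing property the text refers to.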

MapReduce:

MapReduce is a massively scalable, parallel processing framework that works in tandem

with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data,

rather than moving data to the compute location; data storage and computation coexist on the

same physical nodes in the cluster. MapReduce processes exceedingly large amounts of data

without being affected by traditional bottlenecks like network bandwidth by taking advantage of

this data proximity.

Figure 6: MapReduce compute distribution

MapReduce divides workloads up into multiple tasks that can be executed in parallel.


The MapReduce framework operates exclusively on <key, value> pairs, that is, the

framework views the input to the job as a set of <key, value> pairs and produces a set of <key,

value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be Serializable by the framework and hence need to

implement the Writable interface. Additionally, the key classes have to implement the

WritableComparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
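This pipeline can be illustrated with a small in-memory simulation in plain Python (not the Hadoop Java API; the hotel names and violation codes below are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # (k1, v1) -> [(k2, v2)]: emit (violation_code, 1) for each inspection
    for _hotel, violation_code in records:
        yield (violation_code, 1)

def reduce_phase(pairs):
    # sorting stands in for Hadoop's shuffle/sort, which groups keys
    pairs = sorted(pairs, key=itemgetter(0))
    # (k2, [v2, ...]) -> (k3, v3): sum the counts per violation code
    for code, group in groupby(pairs, key=itemgetter(0)):
        yield (code, sum(count for _, count in group))

records = [("Hotel A", "04L"), ("Hotel B", "10F"), ("Hotel C", "04L")]
counts = dict(reduce_phase(map_phase(records)))
# counts == {"04L": 2, "10F": 1}
```

In real Hadoop the map and reduce tasks run as separate JVM processes on the nodes holding the data; the key-value contract, however, is exactly the one shown here.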

2.2.1 Vendors:

Hadoop vendors share the Hadoop architecture from Apache Hadoop.

EMC:

Pivotal HD, the Apache Hadoop distribution from EMC, natively integrates EMC’s

massively parallel processing (MPP) database technology (formerly known as Greenplum, and

now known as HAWQ) with Apache Hadoop. The result is a high-performance Hadoop

distribution with true SQL processing for Hadoop. SQL-based queries and other business

intelligence tools can be used to analyze data that is stored in HDFS.

Hortonworks: Another major player in the Hadoop market, Hortonworks has the largest

number of committers and code contributors for the Hadoop ecosystem components.

(Committers are the gatekeepers of Apache projects and have the power to approve code

changes.)

Hortonworks is a spin-off from Yahoo!, which was the original corporate driver of the

Hadoop project because it needed a large-scale platform to support its search engine business. Of


all the Hadoop distribution vendors, Hortonworks is the most committed to the open source

movement, based on the sheer volume of the development work it contributes to the community,

and because all its development efforts are (eventually) folded into the open source codebase.

The Hortonworks business model is based on its ability to leverage its popular HDP

distribution and provide paid services and support. However, it does not sell proprietary

software. Rather, the company enthusiastically supports the idea of working within the open

source community to develop solutions that address enterprise feature requirements (for

example, faster query processing with Hive).

Hortonworks has forged a number of relationships with established companies in the data

management industry: Teradata, Microsoft, Informatica, and SAS, for example. Though these

companies don’t have their own, in-house Hadoop offerings, they collaborate with Hortonworks

to provide integrated Hadoop solutions with their own product sets.

The Hortonworks Hadoop offering is the Hortonworks Data Platform (HDP), which

includes Hadoop as well as related tooling and projects. Also unlike Cloudera, Hortonworks

releases only HDP versions with production-level code from the open source community.

IBM:

Big Blue offers a range of Hadoop offerings, with the focus around value added on top of

the open source Hadoop stack.


Intel:

The Intel Distribution for Apache Hadoop (Intel Distribution) provides distributed

processing and data management for enterprise applications that analyze big data.

MapR:

For a complete distribution for Apache Hadoop and related projects that’s independent of

the Apache Software Foundation, look no further than MapR. Boasting no Java dependencies or

reliance on the Linux file system, MapR is being promoted as the only Hadoop distribution that

provides full data protection, no single points of failure, and significant ease-of-use advantages.

Three MapR editions are available: M3, M5, and M7. The M3 Edition is free and

available for unlimited production use; MapR M5 is an intermediate-level subscription software

offering; and MapR M7 is a complete distribution for Apache Hadoop and HBase that includes

Pig, Hive, Sqoop, and much more.

Cloudera:

Perhaps the best-known player in the field, Cloudera is discussed in detail in section 2.2.2 below.

Figure 7: Performance and scalability

2.2.2 Cloudera:

Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting,

Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market

leader in the Hadoop space because it released the first commercial Hadoop distribution and it is

a highly active contributor of code to the Hadoop ecosystem.

Cloudera Enterprise, a product positioned by Cloudera at the center of what it calls the

“Enterprise Data Hub,” includes the Cloudera Distribution for Hadoop (CDH), an open-source-

based distribution of Hadoop and its related projects as well as its proprietary Cloudera Manager.

Also included is a technical support subscription for the core components of CDH.

Cloudera’s primary business model has long been based on its ability to leverage its

popular CDH distribution and provide paid services and support. In the fall of 2013, Cloudera

formally announced that it is focusing on adding proprietary value-added components on top of

open source Hadoop to act as a differentiator.

Also, Cloudera has made it a common practice to accelerate the adoption of alpha- and

beta-level open source code for the newer Hadoop releases. Its approach is to take components it

deems to be mature and retrofit them into the existing production-ready open source libraries that

are included in its distribution.

2.2.3 Hadoop Ecosystems:

The Hadoop platform consists of two key services: a reliable, distributed file system

called Hadoop Distributed File System (HDFS) and the high-performance parallel data

processing engine called Hadoop MapReduce, described in MapReduce below.


The combination of HDFS and MapReduce provides a software framework for

processing vast amounts of data in parallel on large clusters of commodity hardware (potentially

scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic

processing framework designed to execute queries and other batch read operations against

massive datasets that can scale from tens of terabytes to petabytes in size.

The popularity of Hadoop has grown in the last few years, because it meets the needs of

many organizations for flexible data analysis capabilities with an unmatched price-performance

curve. The flexible data analysis features apply to data in a variety of formats, from unstructured

data, such as raw text, to semi-structured data, such as logs, to structured data with a fixed

schema.

Hadoop has been particularly useful in environments where massive server farms are

used to collect data from a variety of sources. Hadoop is able to process parallel queries as big,

background batch jobs on the same server farm. This saves the user from having to acquire

additional hardware for a traditional database system to process the data (assuming such a system

can scale to the required size). Hadoop also reduces the effort and time required to load data into

another system; you can process it directly within Hadoop. This overhead becomes impractical in

very large data sets.

Many of the ideas behind the open source Hadoop project originated from the Internet

search community, most notably Google and Yahoo!. Search engines employ massive farms of

inexpensive servers that crawl the Internet retrieving Web pages into local clusters where they

are analyzed with massive, parallel queries to build search indices and other useful data

structures.


The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for

federating services and Oozie is a scheduling system. Avro, Thrift and Protobuf are platform-

portable data serialization and description formats.

MapReduce:

MapReduce is now the most widely-used, general-purpose computing model and runtime

system for distributed data analytics. It provides a flexible and scalable foundation for analytics,

from traditional reporting to leading-edge machine learning algorithms. In the MapReduce

model, a compute “job” is decomposed into smaller “tasks” (which correspond to separate Java

Virtual Machine (JVM) processes in the Hadoop implementation). The tasks are distributed

around the cluster to parallelize and balance the load as much as possible. The MapReduce

runtime infrastructure coordinates the tasks, re-running any that fail or appear to hang. Users of

MapReduce don’t need to implement parallelism or reliability features themselves. Instead, they

focus on the data problem they are trying to solve.

Pig:

Pig is a platform for constructing data flows for extract, transform, and load (ETL)

processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides

common data manipulation operations, such as grouping, joining, and filtering. Pig generates

Hadoop MapReduce jobs to perform the data flows. This high-level language for ad hoc analysis

allows developers to inspect HDFS stored data without the need to learn the complexities of the

MapReduce framework, thus simplifying the access to the data.

The Pig Latin scripting language is not only a higher-level data flow language but also

has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and


reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of

SQL and the low-level procedural style of MapReduce.

Hive:

Hive is a SQL-based data warehouse system for Hadoop that facilitates data

summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible

file systems (e.g., HDFS, MapR-FS, and S3) and some NoSQL databases. Hive is not a relational

database, but a query engine that supports the parts of SQL specific to querying data, with some

additional support for writing new tables or files, but not updating individual records. That is,

Hive jobs are optimized for scalability, i.e., computing over all rows, but not latency, i.e., when

you just want a few rows returned and you want the results returned quickly. Hive’s SQL dialect

is called HiveQL. Table schemas can be defined that reflect the data in the underlying files or data stores, and SQL queries can be written against that data. Queries are translated to MapReduce jobs to exploit the scalability of MapReduce. Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats, e.g., JSON and XML dialects. Hence, analysts have tremendous flexibility in working with data from many sources and in many different formats, with minimal need for complex ETL processes to transform data into more restrictive formats. Contrast this with Shark and Impala, which are aimed at low-latency queries.


Figure 8: Apache Hadoop ecosystem

2.3 LINUX UBUNTU:

Figure 9: Ubuntu logo

Ubuntu is an ancient African word meaning 'humanity to others'. It also means 'I am what

I am because of who we all are'. The Ubuntu operating system brings the spirit of Ubuntu to the

world of computers.

Linux was already established as an enterprise server platform in 2004, but free software

was not a part of everyday life for most computer users. That's why Mark Shuttleworth gathered


a small team of developers from one of the most established Linux projects – Debian – and set

out to create an easy-to-use Linux desktop: Ubuntu.

The vision for Ubuntu is part social and part economic: free software, available to

everybody on the same terms, and funded through a portfolio of services provided by Canonical.

The first official Ubuntu release, Version 4.10, codenamed the 'Warty Warthog', was launched in October 2004 and sparked dramatic global interest as thousands of free software enthusiasts and experts joined the Ubuntu community.

The governance of Ubuntu is somewhat independent of Canonical, with volunteer leaders

from around the world taking responsibility for many critical elements of the project. It remains a

key tenet of the Ubuntu Project that Ubuntu is a shared work between Canonical, other

companies, and the thousands of volunteers who bring their expertise to bear on making it a

world-class platform for anyone to use.

Ubuntu today has eight flavours and dozens of localised and specialised derivatives.

There are also special editions for servers, OpenStack clouds, and mobile devices. All editions

share common infrastructure and software, making Ubuntu a unique single platform that scales

from consumer electronics to the desktop and up into the cloud for enterprise computing.

The Ubuntu OS and the innovative Ubuntu for Android convergence solution make it an

exciting time for Ubuntu on mobile devices. In the cloud, Ubuntu is the reference operating

system for the OpenStack project, it’s a hugely popular guest OS on Amazon's EC2 and

Rackspace's Cloud, and it’s pre-installed on computers from Dell, HP, Asus, Lenovo and other


global vendors. And thanks to that shared infrastructure, developers can work on the desktop,

and smoothly deliver code to cloud servers running the stripped-down Ubuntu Server Edition.

After many years Ubuntu still is and always will be free to use, share and develop. We

hope it will bring a touch of light to your computing — and we hope that you'll join us in helping

to build the next version.

2.4 MySQL:

Figure 10: MySQL logo

MySQL is the world's most popular open source database software, with over 100 million

copies of its software downloaded or distributed throughout its history. With its superior speed,

reliability, and ease of use, MySQL has become the preferred choice for Web, Web 2.0, SaaS,

ISV, Telecom companies and forward-thinking corporate IT Managers because it eliminates the

major problems associated with downtime, maintenance and administration for modern, online

applications.

Many of the world's largest and fastest-growing organizations use MySQL to save time

and money powering their high-volume Web sites, critical business systems, and packaged

software — including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia,

YouTube, Wikipedia, and Booking.com.


The flagship MySQL offering is MySQL Enterprise, a comprehensive set of production-

tested software, proactive monitoring tools, and premium support services available in an

affordable annual subscription.

MySQL is a key part of LAMP (Linux, Apache, MySQL, PHP / Perl / Python), the fast-

growing open source enterprise software stack. More and more companies are using LAMP as an

alternative to expensive proprietary software stacks because of its lower cost and freedom from

platform lock-in.


3. SYSTEM REQUIREMENTS

The purpose of this SRS document is to identify the requirements and functionalities for the Hotel Inspection Dataset Analysis project. The SRS defines how our team and the client conceive the final product and the characteristics and functionality it must have. This document also notes the optional requirements that we plan to implement but that are not mandatory for the functioning of the project.

This phase appraises the requirements needed for the Hotel Inspection dataset; several processes are involved in evaluating the requirements systematically. The first step in analyzing the system's requirements is recognizing the nature of the system, so that the investigation is reliable and all the cases are formulated to better understand the analysis of the dataset.

Document Conventions:

The font-size conventions are the same as for the other documents in the project: section headings use the largest font, size 14; subheadings use size 12 bold; and the body text uses size 12. The priorities of the requirements are specified with the requirement statements.

Intended Audience and Reading Suggestions:


This document is intended for project developers, managers, users, testers, and documentation writers. It discusses design and implementation constraints, dependencies, system features, external interface requirements, and other non-functional requirements.

3.1 IDENTIFICATION OF NEEDS:

The foremost necessity for a business firm or an organization is to know how it is performing in the market and, in parallel, how to overcome its competitors in the market.

To do so we need to analyze our data based on all the available factors. The system requirements for the project are:

3.2 ENVIRONMENTAL REQUIREMENTS:

3.2.1 Software Requirements:

Development & Usage:

Linux Operating System.

Apache Hadoop.

Mozilla Firefox (or any other browser).

Microsoft Excel or OpenOffice.

3.2.2 Hardware Requirements:

Development & Usage:

Pentium 4 processor.


40 GB hard disk.

256 MB RAM (minimum) / 4 GB RAM (recommended).

A system with all standard accessories such as monitor, keyboard, mouse, etc.

4. BUSINESS LOGIC

Logic Features:

1. Store:

The main intention of the Hotel Inspection dataset analysis is to analyze the data based on the violations made by all inspected restaurants and hotels. To handle this, we first load the data into the Hadoop HDFS component.

2. Analysis:

This is the other major step for the dataset. This module depends on the type of dataset we have; since our Hotel Inspection dataset is structured data, we work with the Hadoop ecosystem component Hive.

4.1 SYSTEM ANALYSIS:

4.1.1 FUNCTIONAL REQUIREMENTS:

4.1.1.1 Technical Feasibility:

Evaluating the technical feasibility is the trickiest part of a feasibility study. This is because, at this point in time, there is not much detailed design of the system, making it difficult to assess issues like performance, costs (on account of the kind of technology to be deployed), etc.

A number of issues have to be considered while doing a technical analysis. Understand

the different technologies involved in the proposed system.


Before commencing the project, we have to be very clear about the technologies that are required for the development of the new system.

Find out whether the organization currently possesses the required technologies. Is the required technology available to the organization? If so, is the capacity sufficient?

For instance: "Will the current printer be able to handle the new reports and forms required for the new system?"

4.1.1.2 Operational Feasibility

Proposed projects are beneficial only if they can be turned into information systems that will meet the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed, and whether there are major barriers to implementation. Here are questions that will help test the operational feasibility of a project.

• Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people will not see reasons for change, there may be resistance.

• Are the current business methods acceptable to the users? If they are not, users may welcome a change that will bring about a more operational and useful system.

Apache Hadoop working model-II – workflow steps:

1. Create a Secure Shell (SSH) connection from the local host to the Linux (Ubuntu) kernel: ssh localhost
2. Start all daemons (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker): start-all.sh
3. Check whether all daemons are up: jps
4. Create a directory and move the dataset to HDFS using the Linux terminal.
5. Check the data format from the browser view; from the data's point of view, choose the ecosystem to work with.
6. Based on the ecosystem, design the platform and execute the jobs.
7. Once the jobs are executed, generate the reports based on the dataset and analyze them for the improvement of the firm.

• Have the users been involved in the planning and development of the project? Early involvement reduces the chances of resistance to the system in general and increases the likelihood of a successful project.

Since the proposed system helps reduce the hardships encountered in the existing manual system, the new system was considered operationally feasible.

4.2 SYSTEM DESIGN:

4.2.1 Business Flow:

4.2.1.1 Apache Hadoop Working Model-I:


Apache Hadoop working model-I – workflow steps:

1. Install a virtual machine (VMware).
2. Open a virtual machine already created from Cloudera.
3. Start CentOS from the virtual machine and work with the terminal.
4. Create a directory and move the dataset to HDFS using the Linux terminal.
5. Check the data format from the browser view; from the data's point of view, choose the ecosystem to work with.
6. Based on the ecosystem, design the platform and execute the jobs.
7. Once the jobs are executed, generate the reports based on the dataset and analyze them for the improvement of the firm.

4.2.1.2 Apache Hadoop Working Model-II:

Apache Hadoop Working Model-II


4.2.2 Business Logic:

Functional Programming:

Multithreading is one of the popular ways of doing parallel programming, but the major complexity of multi-threaded programming is coordinating the access of each thread to the shared data. We need constructs like semaphores and locks, and we must use them with great care; otherwise deadlocks will result.

User defined Map/Reduce functions:

MapReduce is a special form of such a DAG which is applicable in a wide range of use cases. It is organized as a "map" function which transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by its key and sent to the same node, where a "reduce" function is used to merge the values (of the same key) into a single result.

Mapper:

map(input_record) {
    ...
    emit(k1, v1)
    ...
    emit(k2, v2)
    ...
}

Reducer:

reduce(key, values) {
    aggregate = initialize()
    while (values.has_next) {
        aggregate = merge(values.next)
    }
    collect(key, aggregate)
}
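The mapper/reducer pseudocode above can be mirrored in plain Python to make the shuffle step concrete. This is only an illustrative sketch of the MapReduce idea on invented sample records, not Hadoop API code:

```python
from collections import defaultdict

def map_record(record):
    # "map": emit one (violation_code, 1) key/value pair per inspection record.
    yield (record["violation_code"], 1)

def shuffle(pairs):
    # Between map and reduce, Hadoop groups all values by key;
    # defaultdict(list) plays that role here.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    # "reduce": merge the values of one key into a single aggregate.
    return key, sum(values)

records = [{"violation_code": c} for c in ("10F", "08A", "10F")]
pairs = [pair for rec in records for pair in map_record(rec)]
result = dict(reduce_group(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'10F': 2, '08A': 1}
```

In Hadoop the same roles are played by the Mapper and Reducer classes, and the framework itself performs the sort-and-shuffle between them.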

MapReduce logic


Job execution phase in Hadoop


5. PROJECT MODULES

5.1 MODULES INTRODUCTION:

The dataset holds Hotel Inspection data from the last few years. We have taken the dataset from the reference website https://data.ny.gov/. The full dataset is very large, with around three lakh lines. We took only a part of it, as our basic systems cannot support such a huge dataset; handling all of it needs a well-configured cluster. For this project we have taken a subset of around twenty-five thousand lines.

We have analyzed the raw dataset, eliminated the unnecessary fields, and given the dataset a well-organized format.

The dataset is divided into two tables based on the data and their fields. The first table holds the inspection data, with parameters like id, name of restaurant, area, address, location, inspection date, violated code, critical point of violation, and type of inspection. The second table holds the violation code and the violation description.

5.2 MODULES:

5.2.1 Analyzing the Data and filtering the Data:

In the first step of the project we need to analyze the data and check how it has been formatted. We should be aware of the fields given to us and know the importance of each and every field. If we think there is some unnecessary information disturbing our dataset, we need to talk to our client before taking any step to change the dataset or to remove or move any columns.


Raw dataset


Unnecessary data fields

The unnecessary fields have been removed from the raw dataset, and the dataset has been divided into two separate tables.

Table 1 (violation) – dataset with the violation code and its explanation.

Table 2 (hotel) – dataset with the violation code and the remaining fields from the filtered dataset.

The final dataset will now be referred to as the filtered dataset.
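The filtering and splitting described above can be sketched with Python's csv module. The column subset and the sample rows below are invented for illustration; the real filtered dataset uses the full schema listed in section 5.2.2.

```python
import csv
import io

# A tiny stand-in for the raw export; "UNUSED" represents an unnecessary field.
raw = io.StringIO(
    "CAMIS,DBA,BORO,VIOLATION CODE,VIOLATION DESCRIPTION,UNUSED\n"
    "30075445,MORRIS PARK BAKE SHOP,BRONX,10F,Non-food contact surface,x\n"
    "30112340,WENDY,BROOKLYN,08A,Facility not vermin proof,y\n"
)

keep = ("CAMIS", "DBA", "BORO", "VIOLATION CODE")
hotel_rows = []       # Table 2 (hotel): filtered inspection fields
violation_table = {}  # Table 1 (violation): code -> description

for row in csv.DictReader(raw):
    # Keep only the useful columns for the hotel table.
    hotel_rows.append({field: row[field] for field in keep})
    # One entry per violation code for the violation table.
    violation_table[row["VIOLATION CODE"]] = row["VIOLATION DESCRIPTION"]

print(len(hotel_rows), sorted(violation_table))
```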


Final dataset

5.2.2 Identifying the headers (Schema):

The schema is generated based on the dataset and the data we have. This schema is for the table hotel.

Schema for Hotel:

Name of Header - Description - header name in schema

ID - id (primary key) - id

CAMIS - Refers to the store IDs - camis

DBA - Refers to the restaurant - dba

BORO - Place - boro

BUILDING - Building number - building

STREET - Street address - street

ZIPCODE - Area zip code - zipcode

PHONE - Store phone - phone

CUISINE DESCRIPTION - Type of cuisine - cuisine_description

INSPECTION DATE - Inspected on date - inspection_date

ACTION - Type of action - action

VIOLATION CODE - Violation codes - violation_code

CRITICAL FLAG - Seriousness of violations - critical_flag

SCORE - Rating - score

GRADE - Grade - grade

GRADE DATE - Grade date - grade_date

RECORD DATE - Record date - record_date

INSPECTION TYPE - Type of inspection - inspection_type

This schema is for the table Violation: the violation code and the violation description.

Schema for table Violation:

Name of Header - Description - header name in schema

ID - id (primary key) - id

VIOLATION CODE - Refers to the violation code - violation_code

VIOLATION DESCRIPTION - Refers to the description of the code - v_desc


Violation Table Schema diagram:

Violation Table Schema diagram

Hotel Table Schema diagram:


Hotel Table Schema diagram

5.2.3 Installing Single Node Hadoop Cluster:

Java Development Kit 1.7:

Download the Java Development Kit 1.7 from the official Oracle website. Once JDK 1.7 is downloaded, extract the file from Downloads and create a directory named java under /usr/lib; the path of the directory is "/usr/lib/java".

Once the java folder is created with sudo (administrator) permissions, move the extracted JDK to /usr/lib/java/, so that JDK 1.7 resides in /usr/lib/java/jdk1.7. The Java path is now "/usr/lib/java/jdk1.7".


Jdk installation path

Once this part is done, we need to register the executables so that Java is configured with the kernel; to do so, run the scripts below.

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/java/jdk1.7.0_67/bin/java" 1

sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/java/jdk1.7.0_67/bin/javac" 1

sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/java/jdk1.7.0_67/bin/javaws" 1


To check that the Java installation is complete, run the command "java -version".

Java path

Now edit the .bashrc file in Linux (Ubuntu); to do so, run the command

sudo gedit ~/.bashrc


and add the following lines to the file:

export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"

export PATH="$PATH:$JAVA_HOME/bin"

alias jps="/usr/lib/java/jdk1.7.0_67/bin/jps"

Install Hadoop 1.2:

Download the Hadoop 1.2 release from the Apache Hadoop website.

Create a directory named hadoop in the /usr/lib path. Once it is created, extract the downloaded Hadoop archive and move it to the "/usr/lib/hadoop" path with sudo permissions.

Hadoop location

Configure the Hadoop location in the .bashrc file: sudo gedit ~/.bashrc


Add the following lines to the file:

export HADOOP_HOME="/usr/lib/hadoop/hadoop-1.2.1"

export PATH=$PATH:$HADOOP_HOME/bin

Hadoop installation cross checking

Install Hive:

Now download the Hive 0.12.0 release from the Apache Hive website.

Create a directory named hive in the "/usr/lib" directory and move the extracted files to "/usr/lib/hive/"; this path is the Hive directory.

Open the .bashrc file: sudo gedit ~/.bashrc

Configure the file with the following lines:


# Hive Home Directory Configuration

export HIVE_HOME="/usr/lib/hive/hive-0.12.0"

export PATH=$PATH:$HIVE_HOME/bin

hive installation cross check

We need to configure four important files in the Hadoop environment. Open the Hadoop configuration directory at "/usr/lib/hadoop/hadoop-1.2.1/conf".

Open the files hdfs-site.xml, mapred-site.xml, core-site.xml and hadoop-env.sh, and add the following lines to these files respectively:


Hadoop-env.sh file

core-site.xml


mapred-site.xml

hdfs-site.xml


Hadoop installation in Cloudera:

5.2.4 Moving the data to HDFS:

Once the data schema is ready and the Hadoop installation is done, our next task is to move the data from our local file system to the Hadoop single-node cluster, i.e., to HDFS, the component of Hadoop where the data is stored as files.

The command we use is: hadoop fs -mkdir Hotel

This command creates a directory for our project in HDFS. Here we create a directory Hotel which is used to store our datasets:

1) Hotel dataset

2) Violation code dataset

Creating directory


hadoop fs -ls

This command displays all the directories in HDFS. We need to cross-check whether our directory Hotel has been created or not.

Listing the directory

hadoop fs -copyFromLocal <src> <dest>

This command moves a file from the local file system to HDFS.

We copy our file hotel.csv to the Hotel directory of HDFS:

hadoop fs -copyFromLocal '/home/username/Desktop/hotel.csv' /user/username/hotel/


'/home/username/Desktop/hotel.csv' indicates the location of the source file.

'/user/username/hotel/' indicates the destination location in HDFS.

hotel – indicates the HDFS directory.

Moving data to HDFS

From the images we can see that the two files hotel.csv and codes.txt have been moved to the HDFS directory hotel.

hadoop fs -ls hotel

This command displays all the files in the specified HDFS directory hotel. We need to cross-check whether our files have been copied or not.

hadoop fs -ls hotel


Checking the files in HDFS

It is clear that we have moved all our files to the HDFS hotel directory.

5.2.5 Creating the tables in hive:

We are all set to create the tables for our dataset.

The query for creating the hotel table:

hive -e "create table 360_hotel ( camis string, dba string, boro string, building string, street string, zipcode string, phone string, cuisine_description string, inspection_date string, action string, violation_code string, critical_flag string, score string, grade string, grade_date string, record_date string, inspection_type string) row format delimited fields terminated by ','"
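The statement above is the section 5.2.2 schema written out column by column. As a sanity check, the same DDL string can be generated from the column list in a few lines of Python (every column typed as string, exactly as in the query above):

```python
# Column names copied from the hotel schema in section 5.2.2.
columns = [
    "camis", "dba", "boro", "building", "street", "zipcode", "phone",
    "cuisine_description", "inspection_date", "action", "violation_code",
    "critical_flag", "score", "grade", "grade_date", "record_date",
    "inspection_type",
]

# Build the CREATE TABLE statement for a comma-delimited text file.
ddl = (
    "create table 360_hotel ( "
    + ", ".join(f"{col} string" for col in columns)
    + " ) row format delimited fields terminated by ','"
)
print(ddl)
```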


Table created successfully

To see the table:

hive -e "show tables"


Checking created table

5.2.6 Importing data from hdfs to hive warehouse:

To load the data:

hive -e "load data inpath '/user/training/hotel/hotel.csv' overwrite into table 360_hotel"

The data is now loaded into the Hive warehouse and table.


Hotel table description: hive -e "desc 360_hotel"

Table Description

Checking the tables:

hive -e "select * from hotel limit 3"

verifying the data


5.2.7 Analyzing the data based on the queries from the client:

- Most frequently violated codes.

- How many stores/restaurants have been inspected, location-wise.

- Number of violations made by each restaurant.

- How many areas have been covered in the inspection?

- Types of cuisines inspected.

- Cuisines with more inspections, in descending order.

- Violation codes in ascending order of count.

- Restaurants with no violations cited.

- Critical and non-critical violations.

- Critical and non-critical violation codes.

Frequent violated code:

hive -e "SELECT violation_code, COUNT(violation_code) FROM hotel GROUP BY violation_code HAVING ( COUNT(violation_code) > 1 ) limit 5"
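The GROUP BY ... HAVING logic of this query can be checked on a small in-memory sample with collections.Counter; the violation codes below are invented:

```python
from collections import Counter

# Sample violation_code column values.
violation_codes = ["10F", "08A", "10F", "04L", "08A", "10F"]

counts = Counter(violation_codes)
# Like HAVING ( COUNT(violation_code) > 1 ): keep only codes seen more
# than once, most frequent first.
frequent = [(code, n) for code, n in counts.most_common() if n > 1]
print(frequent)  # [('10F', 3), ('08A', 2)]
```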

Job execution


The above query displays the result on the screen, but we need the result set exported to a spreadsheet to generate the reports.

To do so, we store the result set in a table or in HDFS; we can then move the result data from HDFS to our local file system, and from there the dataset is exported to spreadsheet files (.ods or .xls) to generate reports.

hive -e "insert overwrite directory '/user/training/output.csv' SELECT violation_code, COUNT(violation_code) FROM hotel GROUP BY violation_code HAVING ( COUNT(violation_code) > 1 )"

The result set has been stored in HDFS under the name output.csv, and the path to access it is '/user/training/output.csv'.

To export the file from HDFS to Localfilesystem

hadoop fs -copyToLocal '/user/training/output.csv' /home/training/Desktop/

Query executed and the data loaded to the HDFS directory


The result set has been stored in HDFS. Now we need to move it to the local file system.

The result set has been moved to the home directory '/home/training/'.

Stored output


These are the result files in CSV format. We need to export this data to Excel to make the report in an efficient way.

The output generated by the query.

5.2.8 Generating the Reports:

In this module we deal with all the generated reports. We can use any data reporting tool, or we can go with Excel.


Report generated from the query

6. EXECUTIONS OF JOBS

6.1 METHODS OF EXECUTION:

We can execute jobs in Hive in three different ways:

6.1.1 Executing the job from the hive prompt:

The job is written directly at the hive prompt:


hive prompt

6.1.2 Executing the job from terminal with Hadoop:

The job is executed here from the Linux terminal using hive -e; there is no interaction with the hive prompt during the job execution:


Query using hive -e

6.1.3 Executing the job as a script:

The job is executed as a script here. Once the script has been written, it is placed in the home directory of the Linux environment.


query from script

script home directory


query written in script

6.2 EXECUTION OF HIVEQL JOBS:

How many stores/restaurants have been inspected, location-wise:

hive -e "insert overwrite directory '/user/training/output2-1.csv' select count(dba) from hotel where boro='BRONX'"

hadoop fs -copyToLocal '/user/training/output2-1.csv' /home/training/Desktop

hive -e "insert overwrite directory '/user/training/output2-2.csv' select count(dba) from hotel where boro='BROOKLYN'"

hadoop fs -copyToLocal '/user/training/output2-2.csv' /home/training/Desktop

hive -e "insert overwrite directory '/user/training/output2-3.csv' select count(dba) from hotel where boro='MANHATTAN'"

hadoop fs -copyToLocal '/user/training/output2-3.csv' /home/training/Desktop

hive -e "insert overwrite directory '/user/training/output2-4.csv' select count(dba) from hotel where boro='QUEENS'"

hadoop fs -copyToLocal '/user/training/output2-4.csv' /home/training/Desktop

hive -e "insert overwrite directory '/user/training/output2-5.csv' select count(dba) from hotel where boro='STATEN ISLAND'"

hadoop fs -copyToLocal '/user/training/output2-5.csv' /home/training/Desktop


Number of violations made by each restaurant:

hive -e "insert overwrite directory '/user/training/output2.csv' select distinct(dba) from hotel"

hive -e "insert overwrite directory '/user/training/output3.csv' select count(violation_code) from hotel where dba = 'MORRIS PARK BAKE SHOP'"

hive -e "insert overwrite directory '/user/training/output3-1.csv' select count(violation_code) from hotel where dba = 'WENDY'"

hive -e "insert overwrite directory '/user/training/output3-2.csv' select count(violation_code) from hotel where dba = 'DJ REYNOLDS PUB AND RESTAURANT'"

hive -e "insert overwrite directory '/user/training/output3-3.csv' select count(violation_code) from hotel where dba = 'RIVIERA CATERER'"

hive -e "insert overwrite directory '/user/training/output3-4.csv' select count(violation_code) from hotel where dba = 'TOV KOSHER KITCHEN'"

hive -e "insert overwrite directory '/user/training/output3-5.csv' select count(violation_code) from hotel where dba = 'BRUNOS ON THE BOULEVARD'"

hive -e "insert overwrite directory '/user/training/output3-6.csv' select count(violation_code) from hotel where dba = 'KOSHER ISLAND'"

hive -e "insert overwrite directory '/user/training/output3-7.csv' select count(violation_code) from hotel where dba = 'WILKEN\'S FINE FOOD'"

hive -e "insert overwrite directory '/user/training/output3-8.csv' select count(violation_code) from hotel where dba = 'REGINA CATERERS'"

hive -e "insert overwrite directory '/user/training/output3-9.csv' select count(violation_code) from hotel where dba = 'MAY MAY KITCHEN'"

hive -e "insert overwrite directory '/user/training/output3-10.csv' select count(violation_code) from hotel where dba = 'NATHAN\'S FAMOUS'"

hive -e "insert overwrite directory '/user/training/output3-11.csv' select count(violation_code) from hotel where dba = 'SEUDA FOODS'"

hive -e "insert overwrite directory '/user/training/output3-12.csv' select count(violation_code) from hotel where dba = 'CARVEL ICE CREAM'"

hive -e "insert overwrite directory '/user/training/output3-13.csv' select count(violation_code) from hotel where dba = 'GLORIOUS FOOD'"
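Running one query per restaurant works, but the same counts can come from a single pass. The sketch below mirrors the per-dba counting on invented rows; in HiveQL a single "select dba, count(violation_code) from hotel group by dba" would do the same.

```python
from collections import Counter

# Invented sample of (restaurant, violation) rows.
rows = [
    {"dba": "MORRIS PARK BAKE SHOP", "violation_code": "10F"},
    {"dba": "WENDY", "violation_code": "08A"},
    {"dba": "MORRIS PARK BAKE SHOP", "violation_code": "04L"},
]

# One counter keyed by restaurant replaces the repeated
# "select count(violation_code) ... where dba = '...'" queries.
violations_per_dba = Counter(row["dba"] for row in rows)
print(dict(violations_per_dba))  # {'MORRIS PARK BAKE SHOP': 2, 'WENDY': 1}
```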


How many areas have been covered in the inspection:

hive -e "select distinct(boro) from hotel"

Types of cuisines inspected:

"insert overwrite directory '/user/training/output.csv' select distinct(cuisine_description) from hotel"

"select count(cuisine_description) from hotel where cuisine_description='African'"

"select count(cuisine_description) from hotel where cuisine_description='American'"

"select count(cuisine_description) from hotel where cuisine_description='Armenian'"

"select count(cuisine_description) from hotel where cuisine_description='Bagels/Pretzels'"

"select count(cuisine_description) from hotel where cuisine_description='Bakery'"

"select count(cuisine_description) from hotel where cuisine_description='Café/Coffee/Tea'"

"select count(cuisine_description) from hotel where cuisine_description='Caribbean'"

"select count(cuisine_description) from hotel where cuisine_description='Chicken'"

"select count(cuisine_description) from hotel where cuisine_description='Chinese'"

"select count(cuisine_description) from hotel where cuisine_description='Continental'"

"select count(cuisine_description) from hotel where cuisine_description='Donuts'"

"select count(cuisine_description) from hotel where cuisine_description='German'"

"select count(cuisine_description) from hotel where cuisine_description='Greek'"

"select count(cuisine_description) from hotel where cuisine_description='Hamburgers'"

"select count(cuisine_description) from hotel where cuisine_description='Hotdogs'"

"select count(cuisine_description) from hotel where cuisine_description='Indian'"

"select count(cuisine_description) from hotel where cuisine_description='Japanese'"


Critical Violation and non critical violation codes:

hive -e "insert overwrite directory '/user/training/critical.csv' select violation_code from hotel where critical_flag = 'Critical'"

hive -e "insert overwrite directory '/user/training/not-critical.csv' select violation_code from hotel where critical_flag = 'Not Critical' "
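The two queries above partition the rows on the critical_flag column. A minimal Python sketch of the same split, on invented rows:

```python
# Invented sample rows with the two critical_flag values used in the queries.
rows = [
    {"violation_code": "04L", "critical_flag": "Critical"},
    {"violation_code": "10F", "critical_flag": "Not Critical"},
    {"violation_code": "08A", "critical_flag": "Critical"},
]

# Same predicates as critical_flag = 'Critical' / 'Not Critical'.
critical = [r["violation_code"] for r in rows if r["critical_flag"] == "Critical"]
not_critical = [r["violation_code"] for r in rows if r["critical_flag"] == "Not Critical"]
print(critical, not_critical)  # ['04L', '08A'] ['10F']
```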


7. TESTING

7.1 INTRODUCTION:

Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. The increasing visibility of software as a system element, and the attendant costs associated with a software failure, are motivating factors for well-planned, thorough testing. Testing is the process of executing a program with the intent of finding an error. The design of tests for software and other engineered products can be as challenging as the initial design of the product itself.

7.2 SAMPLE UNIT TESTING:

Unit testing is done when the data is loaded into HDFS. Once the data is loaded, we need to cross-check it by viewing it in the browser. Take sample data from the browser, say one chunk of the file: copy the data to a text file, load the sample data into HDFS, write the jobs on the sample data, execute the jobs, and store the results. If a job executes successfully on the sample data, then execute the job on the main dataset with the same parameters.


8. SCREENS

Violation codes

Violations made


Inspections made area wise

Violations counts from each restaurant


Types of cuisines inspected

More inspections by cuisine


Critical and non critical issues


9. CONCLUSIONS

Hadoop is a trending technology in the market. Hadoop solves the big data problem effectively and efficiently. More importantly, Hadoop can analyze any kind of data. Analyzing data with Hadoop requires very little time, which reduces production time and directly benefits the economy of the organization.

Analyzing the dataset with Apache Hadoop overcomes the issues caused by traditional RDBMS and master-slave server architectures.

In our project we analyze the Hotel Inspection dataset using Hadoop. This analysis lets us determine the total number of hotels, their violations, and the violation descriptions.


10. REFERENCES

Hadoop:

https://hadoop.apache.org/

Java:

http://www.oracle.com/technetwork/java/javase/downloads/

Hive:

https://hive.apache.org/

Linux:

http://www.ubuntu.com/
