Data Analysis Using HiveQL & Tableau
Jongwook Woo
HiPIC
CSULA
Geolocation Data Analysis for Safe Residence using HiveQL
By Priyanka Kale
Guide: Dr. Jongwook Woo
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
High Performance Information Computing Center
Jongwook Woo
CSULA
Myself
Name: Priyanka Kale
Experience: Member of HiPIC (High-Performance Information Computing Center), Computer Information Systems Department, California State University Los Angeles, since Dec 2015. http://web.calstatela.edu/centers/hipic/
Yelp Data Analysis:
Yelp provides an open dataset for its yearly challenge. Using this dataset, I built a Microsoft Azure Machine Learning model that predicts the rating (stars) a business will receive, based on its past review counts and ratings. The model is built with the Multiclass Logistic Regression algorithm. https://github.com/priya708/Yelp-Data-Analysis
Contents
• Introduction
• Big Data
• Process Flow
• Specifications
• Implementation
• Visualization
• GitHub
• Business Perspective
• References
Introduction
Goal: To determine whether a location is safe by analyzing a large crime dataset (1.3 GB) for the city of Chicago, IL, collected from 2001 to the present (November 2015).
This is a study of a real dataset provided by the United States government, using Big Data analytics and related tools.
Query output is visualized using different graphs and maps for better interpretation.
Big Data
• Volume
• Complexity
• Variety
• Variability
Process-Flow
Download the dataset
Upload the data file to Hue
Run queries via Beeswax
Download the result CSV file
Visualize the result file using Google Fusion Tables
Specifications
• Microsoft Azure Hortonworks Sandbox:
1. Linux system
2. No. of nodes: 4
3. 8 cores
4. Size: 14 GB
Implementation
Hue is a web application for browsing HDFS and working with Hive queries, Cloudera Impala queries, and MapReduce jobs.
Creation of tables in HCatalog
Hive and Beeswax
Hive is an infrastructure built on top of Hadoop for data summarization, query, and analysis. Beeswax is the Hue application for running Hive queries.
Processing in Beeswax
Queries and Visualization
Total number and rank of each crime type:
select primary_type, count(iucr), rank() over (ORDER BY count(iucr) desc) from crime group by primary_type limit 100;
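The count-then-rank logic of the query above can be sketched in plain Python on a handful of made-up rows. This is only an illustration under stated assumptions: the helper name `rank_by_count` and the sample values are hypothetical, and the sketch counts rows rather than non-null `iucr` values as `count(iucr)` would.

```python
from collections import Counter


def rank_by_count(values):
    """Return (value, count, rank) tuples, highest count first.

    Ties share a rank and the following rank is skipped, which mirrors
    the behavior of SQL's rank() window function.
    """
    counts = Counter(values)
    # Sort by descending count; break ties by name so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    prev_count = None
    prev_rank = 0
    for position, (value, count) in enumerate(ordered, start=1):
        rank = prev_rank if count == prev_count else position
        ranked.append((value, count, rank))
        prev_count, prev_rank = count, rank
    return ranked


# Made-up sample standing in for the primary_type column (hypothetical data).
sample_types = ["THEFT", "BATTERY", "THEFT", "NARCOTICS",
                "BATTERY", "THEFT", "ASSAULT"]

for row in rank_by_count(sample_types):
    print(row)
```

In this sample the two crime types that occur once share rank 3, as they would under `rank()` in HiveQL.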
Number of crimes per location type for a given area:
Select location_description, count(iucr) from crime where address = '008XX N MICHIGAN AVE' group by location_description limit 100;
Query to determine which type of crime is highest in a given area:
SELECT Crime_type, count(iucr), rank() over (ORDER BY count(iucr) desc) AS rank from Crime Where address = "037XX W OHIO ST" GROUP BY Crime_type;
Query to determine the number of crimes that occurred in each month:
SELECT Month, count (Case_number) from crime GROUP BY Month;
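The slides do not show how the Month column was derived. As a rough sketch of the same per-month grouping, the Python below assumes each record carries a timestamp string in the form `MM/DD/YYYY HH:MM:SS AM/PM`. That format, the helper name `crimes_per_month`, and the sample values are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; adjust if the real column differs.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def crimes_per_month(date_strings):
    """Count records per calendar month (1-12), across all years."""
    months = (datetime.strptime(s, DATE_FORMAT).month for s in date_strings)
    return dict(sorted(Counter(months).items()))


# Made-up timestamps (hypothetical data).
sample_dates = [
    "01/15/2015 10:30:00 PM",
    "01/20/2014 08:00:00 AM",
    "11/02/2015 12:45:00 PM",
]

print(crimes_per_month(sample_dates))
```

Like the query, this groups by month alone, so January 2014 and January 2015 fall into the same bucket.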
Final Outcome of Analysis:
Query to determine the least safe and the safest areas in Chicago:
CREATE TABLE UnsafeArea row format delimited fields terminated by ',' STORED AS RCFile AS select address, count(iucr) AS total_crimes,rank() over (ORDER BY count(iucr) desc) AS rank from crime GROUP BY address;
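Once crimes are counted per address, reading off the two ends of the ranking can be sketched as follows. The helper name `extreme_addresses` is hypothetical, and the sample rows are made up (the third address is invented for the example).

```python
from collections import Counter


def extreme_addresses(addresses, n=1):
    """Return (most_crimes, fewest_crimes) as lists of (address, count)."""
    counts = Counter(addresses)
    # Highest count first; ties broken by address for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:n], ordered[-n:]


# Made-up rows standing in for the address column (hypothetical data).
sample_addresses = (
    ["008XX N MICHIGAN AVE"] * 3
    + ["037XX W OHIO ST"] * 2
    + ["001XX N STATE ST"]
)

most, fewest = extreme_addresses(sample_addresses)
print("Most recorded crimes:", most)
print("Fewest recorded crimes:", fewest)
```

An address with no recorded crimes never appears in the data, so "safest" here can only mean the fewest recorded crimes among the addresses that are present.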
GitHub
URL: https://github.com/priya708/Project-520
Business Perspective
Get better advertisement
Predictive policing for police departments: the future of law enforcement?
• Reducing random gunfire
• Connecting burglaries and code violations
References
https://catalog.data.gov
https://cwiki.apache.org/confluence/display/Hive/Tutorial
https://hortonworks.com/tutorials