data science tools - github pages 11_ big data tool… · apache hadoop apache spark linux apache...

15
Data Science Tools

Upload: others

Post on 17-Jul-2020

8 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Data Science Tools

Page 2: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Challenges of Data Engineering

● Efficient retrieval, processing and storage

● High Volume Computing

● Parallel Computing

● Resilience/redundancy

Page 3: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Arsenal of Big Data Tools● Apache Hadoop● Apache Spark● Linux ● Apache Pig● Hortoworks

● R:○ parallel○ doSNOW

● SQL● Apache Hive

Page 4: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

R Gone Wild: parallel

● Package parallel● install.packages(“parallel”)● library(parallel)● Easy transition from non-parallel code● Remember to load variables and packages using

○ clusterExport for variables○ clusterEvalQ for packages

Page 5: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

The Battle of Clouds

1) AWS

2) Microsoft Azure

3) Google Cloud

Offers both high/low-level modules

Allows costs to be more variable

Page 6: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Local cluster(s)

● Easier to control

● Easier to personalize

● Large short-term expense

● May not optimal for cost

Page 7: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

High Level vs Low levelHigh level: UIs, SQL, R, Python, Azure

1) Easy to use, easy to understand

2) When things go wrong.. =(

Low Level: Linux, C, C++ (General use of terminals)

1) Much more tedious and complicated2) More control

Page 8: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Relational DataCharts, tables (the conventional)

1) Represents traditional form of data

a) Example: accounting books, information on students

2) SQL traditionally used to handle this data mysql, psql

3) Cannot effectively represent images, text, video

Page 9: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Non - relational DataImages, music, video, text

1) Sparsity issues if represented by a matrix

2) Large amounts of data in a few files

3) SQL traditionally used to handle this data

4) Examples: images, text, video

Page 10: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Hadoop Ecosystem● A non-relational file system

● Very scalable, parallel architecture

● A high-latency, high throughput system

● Built-in resilience and redundancies

● Uses large file blocks to maximize throughput

Page 11: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Parallel Computing Terminology● Core: processing unit, usually a cpu● Chip: a chip that contain cpus● Socket: Physical connector to a chip● Node: A single unit that can store, send, and

receive information (servers)● Thread: a single process that can be run

concurrently on a core

Page 12: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Hadoop Components: HDFS

http://backtobazics.com/wp-content/themes/twentyfourteen/images/hadoop/hdfs-1.x-architecture.jpg

Page 13: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Hadoop Components #2:YARN

https://cdn.edureka.co/blog/wp-content/uploads/2014/05/YARN-431x300.jpg

Page 14: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Hadoop Ecosystem● MapReduce

● Spark

● Hive: SQL platform

● Pig: High-level hadoop language

● Tez

Page 15: Data Science Tools - GitHub Pages 11_ Big Data Tool… · Apache Hadoop Apache Spark Linux Apache Pig Hortoworks R: parallel doSNOW SQL Apache Hive. R Gone Wild: parallel ... When

Coming UpYour assignment: Project 3 and survey

Next week: How to be lit in the Summer

See you then!