data science technology overview

Data Science Technology Overview SOOJUNG HONG (DEC 3. 2015) Main Reference : 1. Big Data : The next frontier for innovation, competition and productivity (McKinsey Global Institute Report) 2. Big Data at Work (Harvard Business Review Press) 3. And much other machine learning literature

Upload: soojung-hong

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Data science technology overview

Data Science Technology Overview

SOOJUNG HONG (DEC 3. 2015)

Main Reference :

1. Big Data : The next frontier for innovation, competition and productivity (McKinsey Global Institute Report)

2. Big Data at Work (Harvard Business Review Press)

3. And much other machine learning literature

Page 2: Data science technology overview

Data Analytics Application Spectrum

Data

Model and Algorithm

Analytic Insight

(Big Data Technology)

(Machine Learning Technology)

(Visualization Technology)

Source : Analytics Driven Organization : Atos white paper

Page 3: Data science technology overview

Source : Gartner, August 2014

Page 4: Data science technology overview

Data Science Innovation Funnel

Define the different parts of the innovation funnel

Part 1 : Data research & Hypothesis building

Data Science

Part 2 : ML solution building & implementation

ML Engineering

Part 3 : A/B Testing (if Web app) and Analysis

Data Science

Source : http://www.slideshare.net/xamat/10-more-lessons-learned-from-building-machine-learning-systems

Data Analytics Applications

Page 5: Data science technology overview

Right Technology

Each analytics application typically differs in data inputs

Technologies required for processing can vary significantly

Example A

Task : Storing and dynamically analysing

Data Type : Vast amount of unstructured data

Source : Web and social media

Example B

Task : Capture and analysis

Data Type : High volume & velocity

Source : Telecommunication network signals or IT infrastructure system logs

Page 6: Data science technology overview
Page 7: Data science technology overview

Part 1 : Big Data Technology

1. Data Source

Multiple sources : image, sound, context

Internal data sources (financial performance data, ERP, PLM, CRM, GPS data etc.)

External data sources (like social media, video, voice and plain text, biz-sector studies)

(ex) Data Example : Stream processing. Large real-time streams of event data for algorithmic trading in financial services, RFID event processing applications, fraud detection, process monitoring, location-based services in telco

2. Scalability

Scale up

Scale out

* 150 exabytes of data in healthcare (1 exabyte = 10^18 bytes; the binary exbibyte is 2^60 = 1,152,921,504,606,846,976 bytes)

* New York Stock Exchange captures 1 TB of trade information during each trading session
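
The byte count quoted above is exactly 2^60, which is the binary unit (exbibyte) rather than the decimal exabyte. A short check, with constant names chosen here only for illustration:

```python
# Decimal (SI) exabyte versus binary exbibyte, both in bytes.
EXABYTE = 10 ** 18    # SI definition
EXBIBYTE = 2 ** 60    # binary definition (EiB)

# Relative difference between the two units.
difference = (EXBIBYTE - EXABYTE) / EXABYTE

print(f"{EXBIBYTE:,}")       # the long figure quoted on the slide
print(f"{difference:.1%}")   # how much larger the binary unit is
```

At this scale the choice of unit changes a storage estimate by well over a tenth, so it is worth stating which one is meant.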

Page 8: Data science technology overview

BigTable

Proprietary distributed database system built on the Google File System (GFS)

Part of the inspiration for Hadoop

Google Cloud Bigtable : public version of BigTable (May, 2015)

Hadoop (MapReduce implementation)

Open source framework for distributed storage and processing of large data sets across multiple computers

Commercial Hadoop distributions (Cloudera, Hortonworks, EMC, Microsoft)

Cloud

public, hybrid and private cloud

S3 (Amazon Simple Storage Service), Microsoft Azure Storage

Big Data Technology (1) : How to store data

Page 9: Data science technology overview

Big Data Technology (2) : How to process data

In-memory distributed computing environments

Store, distribute and compute large amounts of data

Examples : Spark, H2O or SAP HANA

Data warehouse

Specialized database optimized for reporting, often used for storing large amounts of structured data.

Data is loaded using ETL (extract, transform, and load) tools; reports are often generated using BI tools.

Data mart

Subset of a data warehouse, used to provide data to users usually through business intelligence tools.

Dynamo. Proprietary distributed data storage system developed by Amazon.

Page 10: Data science technology overview

Big Data Technology (3) : How to manage data

HBase

An open source (free), distributed, non-relational database modeled on Google’s BigTable

Column-oriented database management system that runs on top of HDFS

Hive

Data warehouse infrastructure built on top of Hadoop

Provides data summarization, query, and analysis (i.e. SQL-like queries on top of Hadoop)

Cassandra

Open source database management system designed to handle huge amounts of structured data

Distributed, continuously available, linearly scalable performance

Originally developed at Facebook, now managed as a project of the Apache Software Foundation

Page 11: Data science technology overview

Hadoop

Page 12: Data science technology overview

MapReduce Example
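
A minimal single-machine sketch of the MapReduce pattern, using word counting as an assumed example (not necessarily the one shown on the slide). The function names are illustrative and are not Hadoop APIs; a real framework runs the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data big insight", "data beats algorithm"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over a cluster without changing the result.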

Page 13: Data science technology overview

Big Data technology ecosystem

Page 14: Data science technology overview

Coexistence Strategy

Most large companies will not replace their existing data-warehouse-based systems with a big data solution.

Coexistence strategy

The data warehouse can continue with its standard workload, using data from legacy operating systems

Page 15: Data science technology overview

Big Data Technology : Data Format

Non-Relational database (NoSQL Database)

In contrast to a relational database (which is based on tables, records, columns and rows)

A database that does not store data in tables (rows and columns)

Semi-structured data (NoSQL)

Data do not conform to fixed fields but contain tags and other markers to separate data elements

CSV, XML, HTML-tagged text and JSON documents are semi-structured documents

Unstructured data

Data do not conform to fixed fields

Free-form text (e.g., books, articles, body of e-mail messages)

Untagged audio, image and video data
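
To make the distinction concrete, here is a small sketch with made-up records: the JSON keys act as the tags that separate data elements, while the text values inside them remain free-form.

```python
import json

# Hypothetical semi-structured records: keys act as tags, and fields may be missing.
raw = (
    '[{"user": "a", "text": "great phone", "tags": ["review"]},'
    ' {"user": "b", "text": "late delivery"}]'
)

records = json.loads(raw)

# Tags let us pull out fields by name even though the records differ in shape.
users = [r["user"] for r in records]
tag_counts = [len(r.get("tags", [])) for r in records]

# The free-form "text" values are unstructured: nothing marks up their contents.
texts = [r["text"] for r in records]
print(users, tag_counts, texts)
```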

Page 16: Data science technology overview

Part 2 : Machine Learning Technology

Algorithms : TECHNIQUES FOR ANALYZING BIG DATA

All of the techniques we list here can be applied to Big Data

In general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones

Lessons Learned : More data beats a cleverer algorithm!

- ‘A Few Useful Things to Know about Machine Learning’ by P. Domingos

Page 17: Data science technology overview

Machine Learning Technology @Quora

Millions of questions and answers

Millions of users

Thousands of Topics

Page 18: Data science technology overview

Part 2 : (1) Machine Learning Technology

Supervised learning

A set of ML techniques (Decision Trees, K-NN, SVM, Naive Bayes, Neural Network, Logistic Regression)

Infer a function or relationship from a set of training data

Examples : classification and regression
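
As one concrete instance of inferring a function from training data, a k-nearest-neighbor classifier can be sketched in a few lines. The data points and labels below are invented for illustration; in practice a library such as scikit-learn would normally be used.

```python
from collections import Counter
import math

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (features, label).
training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.5), "high"),
]

print(knn_predict(training, (1.8, 1.2)))
print(knn_predict(training, (6.8, 6.9)))
```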

Unsupervised learning

A set of ML techniques (K-means, Mixture models, Hierarchical clustering)

Finds hidden structure in unlabeled data.

Example : Cluster analysis

Page 19: Data science technology overview

Association rule learning (Apriori Algorithm, Eclat algorithm)

A set of techniques for discovering interesting relationships (“association rules”) among variables

Can be applied in large databases (data warehouse)

Application example : Market basket analysis (Walmart’s Diaper and Beer)
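
The two measures that association rule algorithms rely on, support and confidence, can be computed directly. This sketch only evaluates one rule on invented baskets; it does not implement the Apriori or Eclat search for frequent itemsets.

```python
# Hypothetical market baskets; each basket is a set of purchased items.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is usually reported only when both values exceed thresholds chosen by the analyst.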

Classification (Decision Tree, SVM)

A set of techniques to identify the categories in which new data points belong, based on a training set

Training set contains data points that have already been categorized.

Application example : prediction of segment-specific customer behavior (e.g. churn rate)

Cluster analysis (K-means)

A statistical method for classifying objects that splits a diverse group into smaller groups of similar objects, whose characteristics of similarity are not known in advance
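
A minimal sketch of k-means (Lloyd's algorithm), assuming two-dimensional points and hand-picked starting centroids. Real implementations choose the starting centroids more carefully and stop when assignments no longer change.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate an assignment step and a centroid update step."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

No labels are supplied; the grouping emerges from the distances alone.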


Part 2 : (2) Machine Learning Technology

Page 20: Data science technology overview

Data fusion and data integration

A set of techniques that integrate and analyze data from multiple sources

Develops insights more efficiently and potentially more accurately than analysis of a single data source

Example : Social media data analyzed by natural language processing, combined with real-time sales data

Determine what effect a marketing campaign is having on customer sentiment and purchasing behavior

Ensemble learning (Bayes Optimal Classifier, Bagging, Boosting)

Using multiple predictive models (each developed using statistics and/or machine learning)

Obtain better predictive performance than could be obtained from any of the constituent models

A type of supervised learning.
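
Bagging can be sketched with very simple constituent models. Here each model is a one-feature threshold classifier trained on a bootstrap resample, and predictions are combined by majority vote; the data, the stump learner and all names are assumptions made for illustration.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a threshold classifier: pick the cut with the fewest training errors."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != label for x, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def bagging(data, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_models)]

def predict(thresholds, x):
    """Aggregate the constituent models by majority vote."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical data: (feature value, boolean label).
data = [(1, False), (2, False), (3, False), (4, False),
        (6, True), (7, True), (8, True), (9, True)]
model = bagging(data)
print(predict(model, 2.5), predict(model, 7.5))
```

Individual stumps vary with their resample, but the vote over many of them is more stable than any single one.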


Part 2 : (3) Machine Learning Technology

Page 21: Data science technology overview

Genetic algorithms

A technique used for optimization that is inspired by the process of natural evolution

Well-suited for solving nonlinear problems

Examples : Applications include improving job scheduling in manufacturing

Optimizing the performance of an investment portfolio.
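
A minimal genetic algorithm sketch. The objective is a toy one (count the 1-bits in a bit string) standing in for a real scheduling or portfolio score, and all parameter values are arbitrary assumptions.

```python
import random

def fitness(bits):
    """Toy objective: number of 1-bits (a stand-in for a real score)."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)          # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), best)
```

Because the parents survive unchanged into the next generation, the best score in the population never decreases.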

Neural networks (Both Supervised and Unsupervised Learning)

Computational models, inspired by the structure and workings of biological neural networks

Well-suited for finding nonlinear patterns (pattern recognition and optimization)

Examples : Identifying high-value customers that are at risk of leaving a particular company

Identifying fraudulent insurance claims.

Part 2 : (4) Machine Learning Technology

Page 22: Data science technology overview

Network analysis

Characterize relationships among discrete nodes in a graph or a network

In social network analysis, analyze connections between individuals in a community or organization

Examples : Identifying key opinion leaders to target for marketing

Identifying bottlenecks in enterprise information flows
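
One simple way to find well-connected individuals is degree centrality. The sketch below uses an invented communication network; other centrality measures (for example betweenness) are better suited to finding bottlenecks.

```python
from collections import defaultdict

# Hypothetical "who talks to whom" edges in an organization.
edges = [("ann", "bob"), ("ann", "cho"), ("ann", "dev"), ("bob", "cho"), ("dev", "eve")]

# Build an undirected adjacency list.
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Degree centrality: share of the other nodes each node is directly connected to.
n = len(neighbors)
centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```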

Optimization

Numerical techniques to redesign complex systems and processes to improve their performance

Objective measures can be cost, speed, or reliability

Genetic Algorithm can be used for optimization

Examples : Improving operational processes such as scheduling and routing

Part 2 : (5) Machine Learning Technology

Page 23: Data science technology overview

Regression

Techniques to determine how the value of the dependent variable changes when one or more independent variables are modified.

Often used for forecasting or prediction

Examples : Forecasting sales volumes based on various market and economic variables

Determining what manufacturing parameters most influence customer satisfaction
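
A minimal sketch of simple linear regression by ordinary least squares, with one independent variable. The spend and sales figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope

# Hypothetical data: advertising spend (independent) and sales volume (dependent).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(spend, sales)
forecast = intercept + slope * 6.0   # predicted sales for a spend of 6.0
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The slope is the estimated change in the dependent variable per unit change in the independent variable.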

Sentiment analysis (Opinion Mining)

Application of natural language processing and other analytic techniques

Identify and extract subjective information from source text material

Key aspects : identifying the feature, aspect, or product about which a sentiment is being expressed

Determining the type, “polarity” (i.e., positive, negative, or neutral) and the degree and strength of the sentiment

Examples : Companies applying sentiment analysis to social media to see how customers are responding to them
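
Polarity and strength can be illustrated with a tiny lexicon-based scorer. The word lists are made up, and this approach ignores negation and context; production systems use much larger lexicons or trained classifiers.

```python
# Tiny hypothetical sentiment lexicon.
POSITIVE = {"good", "great", "love", "fast"}
NEGATIVE = {"bad", "slow", "hate", "broken"}

def polarity(text):
    """Return (label, strength), where strength is the absolute word-count score."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, abs(score)

print(polarity("Great phone, I love the fast camera!"))
print(polarity("The battery is bad and the screen is broken."))
print(polarity("It arrived on Tuesday."))
```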

Part 2 : (6) Machine Learning Technology

Page 24: Data science technology overview

Example : Sentiment Classification Technique

Page 25: Data science technology overview

Spatial analysis (Multilayer Perceptron, Mixture of Experts, Support Vector Regression)

Using location information

Example : Manufacturing supply chain network performance analysis

Simulation

Modeling the behavior of complex systems used for forecasting, predicting and scenario planning

Monte Carlo simulation : repeated random sampling, i.e., running thousands of simulations, each based on different assumptions

Example : Assessing the likelihood of meeting financial targets given uncertainties
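
The financial-target example can be sketched as a small Monte Carlo simulation. The demand and price distributions and the target value are purely hypothetical assumptions.

```python
import random

def probability_of_meeting_target(target, n_runs=10_000, seed=42):
    """Estimate P(revenue >= target) by repeated random sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        # Each run draws its own assumptions (hypothetical distributions).
        units_sold = rng.gauss(1000, 150)      # uncertain demand
        unit_price = rng.uniform(9.0, 11.0)    # uncertain price
        if units_sold * unit_price >= target:
            hits += 1
    return hits / n_runs

estimate = probability_of_meeting_target(target=10_000)
print(estimate)
```

Raising `n_runs` narrows the sampling error of the estimate at the cost of more computation.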

Time series analysis (Mixture of Experts Model)

Statistical and signal processing for analyzing sequences of data points (values) at successive times

Extract meaningful characteristics from the data.

Examples : Hourly value of a stock market index

Predicting the number of people who will be diagnosed with an infectious disease.
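
One of the simplest ways to extract a characteristic (the trend) from a sequence is a moving average. The hourly values below are invented.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical hourly index values.
hourly = [100, 102, 101, 105, 107, 106, 110]
trend = moving_average(hourly, window=3)
print(trend)
```

A wider window smooths more noise but reacts more slowly to real changes.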

Part 2 : (7) Machine Learning Technology

Page 26: Data science technology overview

Options : Experimentation vs Production

Good intermediate option

1. ML researchers experiment in IPython Notebooks using Python tools (scikit-learn, Theano)

2. Implement the optimized version

3. Implement the abstraction layer on top of the optimized implementation so it can be accessed from regular experimentation tools

Things to be aware of in production

1. Value of the model = value it brings to the product

2. Bridge the gap between production design and the ML algorithm

Machine Learning Infrastructure

Page 27: Data science technology overview

ML Programming language & Tools

R : RStudio

Popular packages : dplyr, plyr, data.table, stringr, zoo, ggvis, caret (for machine learning)

Python : IPython Notebook, Spyder

Popular packages : pandas, SciPy, NumPy, scikit-learn (sklearn)

scikit-learn meets several criteria, such as speed and well-designed classes for handling data, models, and results

R vs. Python : http://blog.datacamp.com/r-or-python-for-data-analysis

Page 28: Data science technology overview

Example : R library for Machine Learning

Analyzing Data : mean, var, sd, min, max, sum, rowSums, colSums

Prediction : predict

Apriori : arules package, apriori

Logistic Regression : glm

K-Means clustering : kmeans

K-Nearest Neighbor Classification : knn

Naive Bayes : naiveBayes

Decision Tree : rpart

Support Vector Machine : svm

Page 29: Data science technology overview

Challenge

Processing other types of data (Big Data!) and how to show the results of Big Data analysis efficiently

Tag cloud

Visualization of the text in the form of a tag cloud, i.e., a weighted visual list

Words that appear most frequently are larger

Helps the reader to quickly perceive the most salient concepts
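
The weighting behind a tag cloud can be sketched as word counts scaled to font sizes. The size range, the crude length-based stop-word filter and the sample text are all assumptions.

```python
from collections import Counter

def tag_cloud(text, top=5, min_size=10, max_size=40):
    """Map the most frequent words to font sizes scaled by their frequency."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    counts = Counter(w for w in words if len(w) > 3)   # crude stop-word filter
    common = counts.most_common(top)
    highest = common[0][1]
    return {word: min_size + (max_size - min_size) * n / highest for word, n in common}

text = "Big data needs data storage, data processing and data visualization. Visualization helps."
sizes = tag_cloud(text)
print(sizes)
```

A rendering layer would then draw each word at its computed size.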

Part 3 : (1) Visualization

Page 30: Data science technology overview

Clustergram

Visualization technique used for cluster analysis

Displaying how individual members of a dataset are assigned to clusters as the number of clusters increases

History flow

Visualize the evolution of a document as it is edited by multiple authors

Time appears on the horizontal axis

Contributions to the text are on the vertical axis

Part 3 : (2) Visualization

Page 31: Data science technology overview

Spatial information flow

Depicts spatial information flows

Example : New York Talk Exchange

It shows the amount of Internet Protocol (IP) data flowing between NY and other cities

The size of the glow on a particular city location corresponds to the amount of IP traffic

Visualization allows us to determine quickly which cities are most closely connected to NY

Part 3 : (3) Visualization