cs 789 advanced big data analytics data representation

14
Mingon Kang, Ph.D. Department of Computer Science, University of Nevada, Las Vegas * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Upload: others

Post on 14-Jan-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Mingon Kang, Ph.D.

Department of Computer Science, University of Nevada, Las Vegas

* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

CS 789 ADVANCED BIG DATA ANALYTICS

DATA REPRESENTATION

Page 2: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation?

How does a computer represent data?

0 and 1 in the aspect of โ€œgeneralโ€ computer science

Vector/Matrix in the aspect of โ€œMachine Learningโ€

Page 3: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Scalar single number - usually write in italics

- lower-case variable names

- e.g., ๐‘  โˆˆ โ„, ๐‘› โˆˆ โ„• [1]

Vector array of numbers - arranged in order

- lower-case names written in bold typeface

- ๐ฑ =

๐‘ฅ1๐‘ฅ2โ‹ฎ๐‘ฅ๐‘›

, ๐ฑ = {๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘›}

- what is ๐ฑ๐ฌ when ๐ฌ = {1, 3, 6}?- Then, ๐ฑโˆ’๐ฌ?

1 https://en.wikipedia.org/wiki/List_of_mathematical_symbols

Page 4: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Matrix 2-D array of

numbers

- an element is identified by two indices

- upper-case variable name with bold

typeface, e.g., ๐—- ๐— โˆˆ โ„๐’Žโˆ—๐’: matrix has a height of m and a

width of n, and elements are real numbers

- e.g., ๐— =๐‘ฅ11 ๐‘ฅ12 ๐‘ฅ13๐‘ฅ21 ๐‘ฅ22 ๐‘ฅ23

Tensor array with more than

two axes

- three indices to identify an element

Page 5: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Types of Variable

Categorical variable: discrete or qualitative variables

Nominal:

โ—ผ Have two or more categories, but which do not have an

intrinsic order

Ordinal

โ—ผ Have two or more categories, which can be ordered or

ranked.

Continuous variable

Page 6: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Features

An individual measurable property of a phenomenon being observed

The number of features or distinct traits that can be used to describe each item in a quantitative manner

May have implicit/explicit patterns to describe a phenomenon

Samples

Items to process (classify or cluster)

Can be a document, a picture, a sound, a video, or a patient

Reference: https://en.wikipedia.org/wiki/Feature_(machine_learning)

Page 7: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Feature vector

An N-dimensional vector of numerical features that

represent some objects

A sample consists of feature vectors

Reference: http://www.slideshare.net/rahuldausa/introduction-to-machine-learning-38791937

Page 8: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example - Survey

Convert Data to a feature vector/sample matrix

๐‘ก๐‘–๐‘š๐‘’ = ๐‘Ž๐‘”๐‘Ÿ๐‘’๐‘’๐‘Ž๐‘ข๐‘‘๐‘–๐‘œ = ๐‘ฆ๐‘’๐‘ 

โ‹ฎ

Page 9: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example โ€“ Structured data

Convert Data to a feature vector/sample matrix

๐น๐‘–๐‘›๐‘Ž๐‘›๐‘๐‘’๐‘€๐‘Ž๐‘Ÿ๐‘˜๐‘’๐‘ก๐‘–๐‘›๐‘”

โ‹ฎ

Page 10: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example โ€“ Image data

Page 11: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example โ€“ Unstructured data

Feature Extraction

Unstructured data (e.g., text data)

Structured data (e.g., Bag-of-Words Model)

Page 12: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data in Machine Learning

๐‘ฅ๐‘–: input vector, independent variable

๐‘ฆ: response variable, dependent variable

๐‘ฆ โˆˆ {โˆ’1, 1} or {0, 1}: binary classification

๐‘ฆ โˆˆ โ„ค: multi-label classification

๐‘ฆ โˆˆ โ„: regression

Predict a label when having observed some new ๐‘ฅ

Page 13: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Visualization

Vector space model

Data is a set of features, ๐๐ข = ๐‘“1, ๐‘“2, โ€ฆ , ๐‘“๐‘

All data can be represented by vector

Ref: https://www.slideshare.net/pkgosh/the-vector-space-model

Page 14: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Visualization

Hand-written data (MNIST)

High-dimensional data

Can visualize data using Principle Component Analysis