big data & ds analytics - paarl | room 301, the national...


Post on 18-Jun-2018


Big Data & DS Analytics for PAARL

Albert Anthony D. Gavino, MBA
Data Scientist / DS Evangelist

About the speaker: Albert Anthony D. Gavino

Project profile

Program Objectives / Program Goals

Participants to be able to relate Big Data and Data Science applications to Library services.

1. What is Big Data?

Extremely large data sets that may be analyzed to reveal patterns, trends and associations

The BIG 3 V’s

• Variety: the different types of data (Facebook, Twitter, CCTV feed)

• Velocity: the speed at which data comes in (batch, streaming every second)

• Volume: the size of the data (1 GB, 1 TB, 1 PB, 1 ZB)

Library Data Resources

What resources does the library have (budget, staff, premises, media, opening hours etc.) and how is the library performing against traditional parameters, like lending figures, visitors and social media activity? This library data can also be combined with environmental information like community education levels, geographical distances, age and so on.

http://www.axiell.co.uk/getting the most from your library data/

DATA Analytics Challenges and Pitfalls

The challenges to creating a robust institutional data analytics program include culture, talent, cost, and data. We have deliberately mentioned culture first because it is very easy to jump to data challenges. In fact, most of the literature surrounding data analytics starts with challenges surrounding the data itself. However, we are convinced that institutional culture is the most important factor in determining the success of any given data analytics program, including the politics and process around questions of talent, cost, and data itself.

Reference: The Journal of Academic Librarianship, "Libraries and Institutional Data Analytics: Challenges and Opportunities"

63% of researchers and administrators expressed unhappiness with the use of metrics in higher education (Abbott et al., 2010)

What about New Tasks like streamlining for the Librarian?

If librarians take on new tasks, it is very important to track the amount of time and level of staff required when undertaking analytics projects. For example, collecting citation data for a researcher with a common name often requires manual and painstaking record-by-record searching in order to disambiguate that individual's research from others that share his/her name. This type of work requires a librarian with a deep and intimate knowledge of the bibliometric databases that are being used to harvest the bibliometric data.

Reference: The Journal of Academic Librarianship, "Libraries and Institutional Data Analytics: Challenges and Opportunities"

What is the Cost?

• Data analytics should be thought of as a strategic investment, not a cost-saving technique

• The real cost is the time spent on cultural change and on developing and educating a staff with the analytical skills that we need in our discipline

• A visionary analytics plan invests in people, in hiring and training, over data tools and platforms

Pitfalls of Data Sharing: Challenges on Institutional Data Analytics

Pitfalls and Possible Solutions

• Ownership: who owns the data? It could be the registrar, the library or IT services.
  Possible solution: an assigned office, e.g. the Office of the President or the Compliance Office, can release the official reports.

• Quality: deciding when data is accurate or good (data reliability).
  Possible solution: a Data Governance Unit assures the quality of data.

• Standards: what kinds of data variables are in use (string, numeric).
  Possible solution: this can be addressed by Data Management through data warehousing.

• Access: who has access to the data.
  Possible solution: user roles can be defined to control who has access.
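As a rough illustration of the "user roles" idea for data access, the sketch below maps each role to the datasets it may read. The role names and dataset names are hypothetical and are not taken from the talk; a real institution would define them through its data governance process.

```python
# Hypothetical roles and datasets, for illustration only.
ROLE_PERMISSIONS = {
    "registrar": {"enrollment"},
    "librarian": {"circulation", "enrollment_summary"},
    "data_governance": {"enrollment", "circulation", "enrollment_summary"},
}


def can_access(role: str, dataset: str) -> bool:
    """Return True if the given role is allowed to read the dataset."""
    # Unknown roles get an empty permission set, so they are denied by default.
    return dataset in ROLE_PERMISSIONS.get(role, set())


print(can_access("librarian", "circulation"))
print(can_access("librarian", "enrollment"))
print(can_access("visitor", "circulation"))
```

Denying unknown roles by default keeps the rule simple: access exists only where someone has explicitly granted it.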

Getting Started on Institutional Data

• Creating an inventory of institutional data

• Developing a data dictionary

• Designing an unambiguous process for cleaning up those data

• Creating an open data set that answers the most commonly asked data questions across campus
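The steps above can be sketched in a few lines of Python. This is only a minimal sketch: the field names, the sample records and the cleaning rule (drop any record that cannot be coerced to the expected type) are assumptions made for illustration, not part of the talk.

```python
from collections import Counter
from typing import Optional

# Hypothetical data dictionary: field name -> expected type and description.
DATA_DICTIONARY = {
    "patron_id": {"type": int, "description": "Unique library patron number"},
    "college": {"type": str, "description": "College the patron belongs to"},
    "loans": {"type": int, "description": "Items borrowed this term"},
}


def clean_record(raw: dict) -> Optional[dict]:
    """Coerce a raw record to the dictionary's types; return None if it cannot be cleaned."""
    cleaned = {}
    for field, spec in DATA_DICTIONARY.items():
        value = raw.get(field)
        if value is None:
            return None
        try:
            if spec["type"] is str:
                # Normalize spacing and capitalization of text fields.
                cleaned[field] = str(value).strip().title()
            else:
                cleaned[field] = spec["type"](value)
        except (TypeError, ValueError):
            return None
    return cleaned


# Hypothetical raw records, as they might arrive from different offices.
raw_records = [
    {"patron_id": "101", "college": " engineering ", "loans": "12"},
    {"patron_id": "102", "college": "Education", "loans": "n/a"},
    {"patron_id": "103", "college": "education", "loans": 4},
]

cleaned = [r for r in map(clean_record, raw_records) if r is not None]

# A commonly asked question: how many loans per college?
loans_by_college = Counter()
for record in cleaned:
    loans_by_college[record["college"]] += record["loans"]

print(dict(loans_by_college))
```

Writing the cleaning rule as code is one way to make the process unambiguous: everyone can see exactly which records are kept and which are dropped.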

Opportunities for Libraries on Big Data

• Libraries know metadata

• Libraries know strategy

• Libraries know assessment

• Libraries are neutral

• Libraries know the vendors

• Libraries are part of larger bodies like PAARL

• Libraries have influence over campuses

• Libraries know metrics

• Libraries have a user-centered culture

• Libraries know the politics and policy issues with commercial parties

• Libraries collaborate with both academic and academic support units

2. Building a BIG DATA culture

• Openness and acceptance to technology: Upper Management

• Willingness to invest in the Big Data Platform: which entails cost

• Training Staff and making sure of job security: Skills upgrade

• Make data sharing acceptable: Trust in the data quality and people

• Create Data Quality Assurance Team/s

• Foster collaboration among departments

• Continuous improvement of models

DATA Governance and DATA Management are different roles

Data governance is the designation of decision-rights and policy-making surrounding institutional data, while data management is the implementation of those decisions and policies. Institutions need both, and both require investment, but the senior leadership of our institutions need to design the former.

Data Governance Council

Data Management

Policies

Metrics

Data Quality Dept

Data Warehouse / Data Lake

Machine Learning

A type of artificial intelligence that gives computers the ability to learn without being explicitly programmed.

Market Basket Analysis on Book Recommendations (Association Rule Algorithm)
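To make the association-rule idea concrete, here is a minimal sketch that scores "borrowers of A also borrow B" rules by support, confidence and lift. It is a brute-force count over pairs of titles, not a full Apriori implementation, and the loan baskets, titles and thresholds are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical loan "baskets": the set of titles each patron borrowed together.
baskets = [
    {"Python Basics", "Statistics 101"},
    {"Python Basics", "Statistics 101", "Data Mining"},
    {"Python Basics", "Data Mining"},
    {"Statistics 101", "Cooking at Home"},
    {"Python Basics", "Statistics 101"},
]


def association_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence, lift) tuples for pairs of titles."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both titles
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # confidence vs. base rate of rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence, lift))
    # Strongest rules (highest confidence) first.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


rules = association_rules(baskets)
for lhs, rhs, support, confidence, lift in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two titles are borrowed together more often than chance would predict, which is the kind of pair a recommendation shelf or catalogue suggestion could use.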

Weather related information and reading a book (use of hash tags and location and weather data)

Pic from Marco Rasos

Social Listening – is the process of monitoring digital conversations to understand what customers are saying about a brand or service.

Online Research Journals and Click through Rates

Click through Rates (CTR)

The ratio of users who click a specific link, ad or button to the total number of users who see it.
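CTR is commonly computed as clicks divided by impressions (the number of times the link was shown). The sketch below applies that formula to two hypothetical journal links; the journal names and counts are invented for illustration.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks divided by impressions, or 0.0 when nothing was shown."""
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Hypothetical journal links: (clicks on the "full text" link, times the link was shown).
journal_links = {"Journal A": (45, 1500), "Journal B": (30, 400)}

ctr_by_journal = {
    name: click_through_rate(clicks, impressions)
    for name, (clicks, impressions) in journal_links.items()
}

for name, ctr in ctr_by_journal.items():
    print(f"{name}: {ctr:.1%}")
```

Comparing rates rather than raw clicks matters here: Journal B has fewer clicks than Journal A but a higher share of its viewers clicked through.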

OpenCV (Open Source Computer Vision Library)

Modern Day Data Scientists

Dr. Reina Reyes, Astrophysicist

Andrew Ng of Baidu, Coursera

Amy Smith, Uber Singapore

Data Science Conference 2016

YOU as the next Doctor Strange (Entering the world of Data Science)

Isaac Reyes, Data Scientist Talas Data Scientists

CRISP-DM Methodology

The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company

CRISP-DM Tasks

From regular data to BIG data, from stat to AI

Figure: Regular data to BIG data; Statistical modeling, Machine Learning, Deep Learning / A.I. (Traditional to Modern)

Trends in Data Science Domains

Data Science Domain: Current Status

• Natural Language Processing (NLP): Entered the market

• Predictive Analytics / Machine Learning: Entered the market

• Visualization / Dashboards: Entered the market

• Image Processing (OpenCV): Exploration

• Internet of Things (IoT): Exploration

• Artificial Intelligence: Exploration

DS/Big Data Applications to the Field of Study

Field of Study: Sample Specific Applications

• Agriculture: Climate forecast modeling to help farmers manage plantations (e.g. corn yields)

• Medical field: Image processing for chest X-rays, retina images for diabetic patients

• Linguistics: Natural Language Processing (NLP) for dialects and Sentiment Analysis applications

• Economics/Finance: Predicting a stock price based on certain indicators (e.g. noise, competitor price)

• Engineering: Internet of Things (IoT) application to Big Data

Building a Data Science Team

• DS role: Data Scientist
  Programming languages: R, Python, Spark ML
  Sample output: Neural Nets, Random Forest

• DS role: Data Engineer / DevOps
  Programming languages: Hadoop, Spark Core, Spark Streaming
  Sample output: RDD, dataframes, SQLContext

• DS role: Statistician
  Programming languages: SAS, SPSS, R, Matlab
  Sample output: Linear Regression, K-means clustering

• DS role: Viz Expert
  Programming languages: Tableau, Cognos, D3, JavaScript
  Sample output: Visualization, GIS maps

Data Science Team Composition


Trends on Programming Languages

Figure labels: Scala, R, Python, Spark, RapidMiner, EMC, Java

TOOLS: OPEN SOURCE vs PROPRIETARY SOFTWARE

• Pros
  Open source: No cost on software; packages are available faster
  Proprietary software: Easy to deploy

• Cons
  Open source: Takes some time to create and integrate with other software
  Proprietary software: Expensive software; you have to buy it in modules

• Tools
  Open source: Python, R, Apache Spark
  Proprietary software: SAS, IBM SPSS, AWS, Google

Small Data vs Big Data (in comparison)

• Small data: Sampling can be done (e.g. a survey)
  Big data: Uses all of the data in storage

• Small data: No need for in-memory computing; can be run on a regular PC/Mac
  Big data: Eats up memory and needs distributed computing

• Small data: Statistical assumptions hold true (normality, homoskedasticity, independence)
  Big data: Conventional measures such as p-values become less informative because the data is so large; an effect that is not significant in a small set can become significant, so be careful when relying on these assumptions
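The caution about p-values on very large data can be illustrated with a small calculation. The sketch below keeps the same tiny difference between two groups and only increases the sample size; it assumes a two-sided z-test with equal group sizes and a known common standard deviation, and the effect size (0.05 books, sd = 2) is hypothetical.

```python
import math


def two_sample_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test for a difference in means,
    assuming equal group sizes and a known common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical effect: group A borrows 0.05 more books on average (sd = 2).
p_values = {n: two_sample_p_value(0.05, 2.0, n) for n in (100, 10_000, 1_000_000)}

for n, p in p_values.items():
    print(f"n per group = {n:>9,}: p = {p:.4g}")
```

The difference between the groups never changes, yet the p-value falls as the sample grows, which is why effect size deserves as much attention as statistical significance when working with big data.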

Simple DS Cheat sheet

• Classifiers: Neural Nets, Random Forest

• Clustering: K-means

• Association: Association Rules

• Predicting: Linear Regression, Logistic Regression (binary), Cox Regression (Survival)

Hierarchical Clustering

SVM (Cancer Cells)

Medical

Visualization TOOLS

Color Hues and Functionality

Local Implications: Data Privacy Act (Republic Act 10173)

Sensitive personal information refers to personal information:

1. About an individual’s race, ethnic origin, marital status, age, color, and religious, philosophical or political affiliations;

2. About an individual’s health, education, genetic or sexual life of a person, or to any proceeding for any offense committed or alleged to have been committed by such individual, the disposal of such proceedings, or the sentence of any court in such proceedings;

3. Issued by government agencies peculiar to an individual which includes, but is not limited to, social security numbers, previous or current health records, licenses or its denials, suspension or revocation, and tax returns; and

4. Specifically established by an executive order or an act of Congress to be kept classified.

Solutions to the Data Privacy Act: Policies

Make sure you have the following in place

• Opt in for customers

• Opt out for customers

• Update your customer policy accordingly

• Make your policy publicly available, e.g. on your website

References

• www.coursera.org/learn/machine-learning

• www.kaggle.com

• www.crowdanalytix.com

• www.talas.ph

• www.facebook.com/analytics4pinoys

• www.linkedin.com/albertgavino