Big data presentation for University of Reykjavik, Iceland, March 22


Page 1: Big data presentation for University of Reykjavik, Iceland, March 22

BIG DATA

Thorhildur Jetzek, Ph.D. Postdoctoral Fellow

Department of IT management, CBS

Page 2: Big data presentation for University of Reykjavik, Iceland, March 22

Who am I?

• Matriculation exam, Physics Track I (Eðlisfræðibraut I), MR (Menntaskólinn í Reykjavík), 1991
• B.Sc. in Economics, 1994
• M.Sc. in Economics, 1998
• Ph.D. in Information Technology Management, 2015
• Have worked as an economist, IT consultant, assistant professor, project manager, program manager, director, industrial PhD and now postdoctoral researcher
• Have always focused on the use of technology

[Figure: "Traditional career" compared with "My career"]

@Thorhildur Jetzek CBS 2|

Page 3: Big data presentation for University of Reykjavik, Iceland, March 22

• High ranking
  – 2nd in Europe (behind LSE) and 22nd worldwide
• Focus on collaboration with industry
  – Industrial PhD (my PhD contract was at KMD)
  – Engaged scholarship & collaborative research (a current project of mine is sponsored by industry)
  – Crowdsourcing events:
    • Student competition where CBS students got access to anonymized data on 100,000 customers of Danske Bank and socio-economic data from KMD, as well as data from Danske Bank's public Facebook wall
    • Cash prizes (DKK 75,000 for 1st prize)

Page 5: Big data presentation for University of Reykjavik, Iceland, March 22

Five societal megatrends

We are in the eye of the storm…

Page 6: Big data presentation for University of Reykjavik, Iceland, March 22

Rise of Digitization

• The cost per gigabyte of consumer hard disk drives has declined by almost 40% a year on average since 1998 (OECD, 2013).

• The cost of shifting one bit per second has decreased by 38% a year since 1995 (OECD, 2013).

• More than 30 million interconnected sensors are now deployed worldwide, in areas such as security, health care, transport systems or energy control systems, and their numbers are growing by around 30% a year (McKinsey, 2011).

• 6 billion people have cellphones

• 30 billion pieces of content are shared on Facebook every month

• 2002: The year when the amount of information stored digitally surpassed non-digital information!
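A yearly percentage decline compounds, which is easy to underestimate. The sketch below is only an illustration of that arithmetic: it assumes the "almost 40% a year" rate holds at exactly 40% for 15 years (1998 to 2013), which is a simplification, and the function names are made up for this example.

```python
import math


def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the starting cost left after `years` of compounding decline."""
    return (1.0 - annual_decline) ** years


def halving_time(annual_decline: float) -> float:
    """Years until the cost has halved at a constant yearly decline."""
    return math.log(0.5) / math.log(1.0 - annual_decline)


# Assumption: a constant 40% yearly decline over 15 years (1998 to 2013).
storage_fraction = remaining_cost_fraction(0.40, 15)
storage_halving = halving_time(0.40)

print(f"Cost left after 15 years: {storage_fraction:.4%}")
print(f"Cost halves roughly every {storage_halving:.2f} years")
```

Under these assumptions a gigabyte ends up costing well under a tenth of a percent of its starting price, which is one way to see why storing "everything" became affordable.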

Page 7: Big data presentation for University of Reykjavik, Iceland, March 22

Changes...

Forbes highlights
• IT in the boardroom: Digital strategies
• Changing business models - platforms
• Big data and analytics
• Lacking skills: The EU estimates a 160% increase in demand for Big Data specialists between 2013 and 2020, to 346,000 new jobs

IDC predicts
• Market for big data analysis services over $16 billion in 2014, growing six times faster than the entire IT industry
• Cloud-based big data and analytics will grow three times faster than spending on on-premises solutions in 2015

Page 8: Big data presentation for University of Reykjavik, Iceland, March 22

Global open access

FLOSS – Free/Open Source Software

“…people's pursuit of visible carrots is at times interrupted by the larger quest for the invisible gold at the end of the rainbow.” (von Krogh et al., 2012a, p. 671)

• Collaborative projects: Wikipedia, Human Genome Project, Open.Nasa.gov
• Open NGO data: http://data.worldbank.org/ (and a multitude of similar sites)
• Open Government Data: http://data.gov / http://data.gov.uk (and 300 others)
• Open company data (open APIs): Facebook, Twitter, LinkedIn
• Platforms: CouchSurfing.com; NeighbourGoods.net

Page 9: Big data presentation for University of Reykjavik, Iceland, March 22

What is BIG data?

• The jury is still out
  – Davenport: New technologies – software and infrastructure plus the data itself
  – Forrester defines Big Data as “techniques and technologies that make handling data at extreme scale affordable”
  – McKinsey (2011): “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

Page 11: Big data presentation for University of Reykjavik, Iceland, March 22

Dimensions of big data: 4 Vs

Source: IBM, http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 12: Big data presentation for University of Reykjavik, Iceland, March 22

Utilization of data

Source: @PetteriA: http://www.slideshare.net/petterialahuhta/alahuhta-big-dataandanalytics24sep2014


Page 13: Big data presentation for University of Reykjavik, Iceland, March 22

Terminology

@HildaJetzek

[Diagram: types of data and how they relate]

• Small data
  – Structured, kept in relational databases
  – Purposefully entered into systems
• Big data
  – High volume
  – High velocity
  – Variety of structures
  – Uncertain veracity
  – A “by-product of activity”
• Open data: Data that are open for any use outside of organisational boundaries
• Master data: Data that are used as reference data across systems
• Also shown: Social data, Machine data

Page 14: Big data presentation for University of Reykjavik, Iceland, March 22

Liquid open data

Liquidity – reflects ability to link and stream data across systems

Openness – reflects ability to use data outside of organizational boundaries

[2×2 matrix: illiquid vs. liquid data against closed vs. open data]

• Illiquid (siloed) closed data: Data are stored where they originate and not reused. Example: most data within organizations
• Liquid closed data: Data are effectively reused across a variety of systems within a single organization. Example: internally shared data
• Illiquid (siloed) open data: Data are used outside of organizational boundaries but offer limited potential for automation or coupling of data. Example: many open government data initiatives
• Liquid open data: Data are used outside of organizational boundaries and easily coupled with other data and integrated across systems. Example: combining internal and external data for improved insights

Page 15: Big data presentation for University of Reykjavik, Iceland, March 22

How do we identify openness?

Dimension | Affordance | Explanation

Openness
Strategic | Availability | Data are open to all by default
Economic | Affordability | Data are free or charged for at no more than the marginal cost of reproduction
Legal | Reusability | Data are published with open licenses

Liquidity
Conceptual | Interoperability | Semantics and syntax are clear, data models and metadata are published, use of standard identifiers
Technical | Usability | Data are of high quality, published in machine-readable and standard formats, using contextual metadata
Technical | Discoverability | Data are easily found through central portals or published with searchable metadata or using linked data semantics
Technical | Accessibility | Data are easily downloadable or “query-able” through APIs

Page 16: Big data presentation for University of Reykjavik, Iceland, March 22

Binary or continuous?

• Data are not just open or closed, or liquid or illiquid – a continuous range
• Classification useful for strategy purposes
  – A part of an organization's data needs to be liquid across the company (customer master)
  – Other data could be open but illiquid (financial statement)
  – Some data are liquid and open (genomics data, geospatial data)

Page 17: Big data presentation for University of Reykjavik, Iceland, March 22

Highlights

• Why do we have so much data?
• What are the underlying societal changes we need to be aware of?
• Why has openness become so popular?
• Does it make sense to make more use of data, even if it is expensive to re-think how we handle data in the company?

Page 20: Big data presentation for University of Reykjavik, Iceland, March 22

People-generated big data

• Social data
  – Sources: Social media websites, blog sites, product reviews, search results
  – Unstructured, natural language
• Data from mobile phones
  – Most commonly geolocation
  – For example used to analyze traffic or movement of people or to do geo-tagging

Source: Waze https://www.waze.com/

Page 21: Big data presentation for University of Reykjavik, Iceland, March 22

Measurement data

• Nature/Environment
  – Sources: Measurements, such as meteorological, atmospheric and pollution big data
• Geospatial data
• Lifeforms
  – Sources: Genetic sequencing, patient databases

[Image: Cracking the Genetic Code of Brain Tumors]

Source: @Vishy Iyer, UT https://news.utexas.edu/2012/09/28/cracking-the-genetic-code-of-brain-tumors

Page 23: Big data presentation for University of Reykjavik, Iceland, March 22

Structure of data

• How do we define structured data?
  – Very often refers to data in structured relational databases
  – Known data model, identities and tabular formats (columns and rows)
  – Still, a lot of (big) data analytics tools/packages want tabular formats
    • R uses data frames
    • Tableau wants a tabular format
    • SAS/SPSS use a tabular format

Page 25: Big data presentation for University of Reykjavik, Iceland, March 22

Semi-structured data

• Typically data such as XML or JSON
  – Nested, not tabular but a known structure all the same
  – Could be applied to text files such as logs
  – Can be transformed into a tabular structure (with many empty cells)
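To make the last point concrete, here is a minimal sketch of turning nested JSON into a table. The records and field names are invented for illustration, and the flattening rule (dotted column names, lists joined with ";") is just one possible convention, not a standard.

```python
import json

# Hypothetical semi-structured records; field names are made up for illustration.
RAW = """
[
  {"id": 1, "name": "Anna", "contact": {"email": "anna@example.com"}},
  {"id": 2, "name": "Bjorn", "contact": {"phone": "555-0100"}, "tags": ["vip"]}
]
"""


def flatten(record, prefix=""):
    """Turn a nested dict into a flat {column_name: value} dict."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and prefix the child columns.
            flat.update(flatten(value, prefix=column + "."))
        elif isinstance(value, list):
            # List: squeeze into one cell (a simplification).
            flat[column] = ";".join(str(v) for v in value)
        else:
            flat[column] = value
    return flat


def to_table(records):
    """Return (columns, rows); cells a record does not have become None."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({c for r in flat_records for c in r})
    rows = [[r.get(c) for c in columns] for r in flat_records]
    return columns, rows


columns, rows = to_table(json.loads(RAW))
print(columns)
for row in rows:
    print(row)
```

Because the two records do not share all fields, each row has empty (None) cells, which is the "many empty cells" effect mentioned on the slide.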

Page 26: Big data presentation for University of Reykjavik, Iceland, March 22

Is there structure?


Page 27: Big data presentation for University of Reykjavik, Iceland, March 22

Unstructured files

• Photos and graphic images
• Videos
• PDF files
• PowerPoint presentations
• Emails
• Blog entries
• Wikis
• Word processing documents

Page 29: Big data presentation for University of Reykjavik, Iceland, March 22

Standard analytics

• Data analytics can take many different forms
• Common forms of data analytics include:
  – Static reporting: Annual reports, quarterly reports etc.
  – Dynamic reporting: Business intelligence, ability to choose columns and rows and reorganize data into a format that makes sense to the user
  – Simple analysis: Sums, filtering, pivot tables, max and min values, averages etc.
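The "simple analysis" operations are what a spreadsheet does, and they can be sketched in a few lines of code. The sales rows below are made up, and a real project would more likely use a spreadsheet or a data-frame library; this version uses only the Python standard library.

```python
from collections import defaultdict
from statistics import mean

# Made-up sales rows (region, product, amount); purely illustrative.
SALES = [
    ("North", "Ice cream", 120.0),
    ("North", "Coffee", 80.0),
    ("South", "Ice cream", 200.0),
    ("South", "Coffee", 50.0),
    ("South", "Ice cream", 100.0),
]

amounts = [amount for _, _, amount in SALES]

# Sums, max and min values, averages
total = sum(amounts)
largest, smallest = max(amounts), min(amounts)
average = mean(amounts)

# Filtering: keep only the ice cream rows
ice_cream_rows = [row for row in SALES if row[1] == "Ice cream"]

# Pivot table: regions as rows, products as columns, summed amounts as cells
pivot = defaultdict(lambda: defaultdict(float))
for region, product, amount in SALES:
    pivot[region][product] += amount

print(total, largest, smallest, average)
for region, products in pivot.items():
    print(region, dict(products))
```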

Page 30: Big data presentation for University of Reykjavik, Iceland, March 22

Visual analytics

• To explore and understand data by visualizing
• Most people have an easier time understanding a chart than t-values or large numeric matrices
• Can range from “traditional” bar charts or lines to word clouds (which highlight the most used words by making them bigger), heatmaps, placing items on geographic maps, use of treemaps, bubble diagrams etc.
• Visual analytics is (like all statistics really) a combination of art and science. It is difficult to tell a good story with one picture, but it is a very powerful tool if you succeed!

Page 32: Big data presentation for University of Reykjavik, Iceland, March 22

Basic analytics

• Use of somewhat more advanced statistics using Excel or Excel add-ins or tools such as SAS or SPSS.
• Correlational/regression analysis.
  – Could be used to see if there are any interesting correlations or explanations which can add to company decision making
  – Can be used for forecasting: for example, if there is a great weather forecast (external data), sales of ice cream are predicted to go up 15% => stock up on ice cream
  – Be aware of the uncertainty in such models. This is not the truth!
  – Be aware of spurious correlations: http://tylervigen.com/spurious-correlations
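The ice cream example can be sketched as a one-variable regression. The temperature and sales figures below are invented, and ordinary least squares with a single predictor is an assumption made for this illustration; the slide does not say which model is used.

```python
from math import sqrt
from statistics import mean

# Made-up history: forecast temperature (degrees C) and ice cream units sold.
TEMPERATURE = [10.0, 14.0, 18.0, 22.0, 26.0]
UNITS_SOLD = [105.0, 155.0, 185.0, 235.0, 270.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def correlation(x, y):
    """Pearson correlation coefficient between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


intercept, slope = fit_line(TEMPERATURE, UNITS_SOLD)
r = correlation(TEMPERATURE, UNITS_SOLD)

# Forecast for a (hypothetical) 30-degree day, outside the observed range.
forecast = intercept + slope * 30.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}, forecast={forecast:.0f}")
```

The forecast extrapolates beyond the observed temperatures, so the warning about uncertainty applies with extra force, and a high correlation on five points says nothing about causation.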

Page 33: Big data presentation for University of Reykjavik, Iceland, March 22

Basic analytics

• Time series analysis
  – Used to predict the future
  – Makes use of historical data and looks for trends in the data
  – Seasonal changes, growth etc.
  – Use of statistical methods like moving averages to figure out long-term trends
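A moving average can be sketched as follows. The quarterly sales numbers are made up so that they contain a repeating seasonal pattern on top of steady growth, and the window of four quarters is an assumption chosen to match that pattern.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Made-up quarterly sales: strong seasonality plus slow underlying growth.
QUARTERLY_SALES = [10, 20, 30, 12, 14, 24, 34, 16, 18, 28, 38, 20]

trend = moving_average(QUARTERLY_SALES, window=4)
print(trend)
```

Averaging over one full seasonal cycle cancels the seasonal swings, so what remains is the long-term trend.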

Page 34: Big data presentation for University of Reykjavik, Iceland, March 22

Advanced analytics

• There are many more, less commonly used statistical methods for quantitative data (numbers)
  – Dimension reduction: Search for any natural clusters in the measurements (columns) that help us identify composite variables
  – Cluster analysis: Search for any natural clusters in the data (rows). For instance, marketers like to cluster the general population of consumers into market segments with different buying behaviors
  – Social network analysis: Clusters and relationships. Which groups on Facebook are likely to connect?
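As an illustration of cluster analysis on rows, here is a minimal k-means sketch. k-means is only one of many clustering methods and is not named on the slide; the customer data, the choice of two clusters and the starting centroids are all assumptions for this example.

```python
from math import dist


def k_means(points, centroids, max_iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, clusters


# Made-up customers: (store visits per month, average basket value).
CUSTOMERS = [(1.0, 20.0), (2.0, 25.0), (1.5, 22.0), (8.0, 90.0), (9.0, 100.0), (8.5, 95.0)]

# Assumption: start from two hand-picked customers as initial centroids.
centroids, clusters = k_means(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

In practice the variables should be put on comparable scales first, otherwise the one with the largest numbers (here basket value) dominates the distance.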

Page 35: Big data presentation for University of Reykjavik, Iceland, March 22

Advanced analytics

• Structural equation modelling (SEM):
  • Simultaneously estimate multiple equations (multivariate)
  • Estimate variables and paths (relationships)
  • Can be based on covariance (CB-SEM) or multiple regression (PLS-SEM)
  • Confirmatory factor analysis is based on a similar technique but without estimating the paths

[Diagram labels]
Variable (typically based on more than one measure to reduce the risk of measurement bias)
Path – to understand the nature of the relationship

Page 36: Big data presentation for University of Reykjavik, Iceland, March 22

A combination

[Figure: scatter plot of countries. Axes: "Liquid open data" and "Extent of digital innovation". Each point is labelled with a country name (for example Iceland, Denmark, United States, Mali).]

Page 37: Big data presentation for University of Reykjavik, Iceland, March 22

Different types of use


Page 38: Big data presentation for University of Reykjavik, Iceland, March 22

Artificial intelligence

• Neural network analysis:
  – A computer program modeled after the human brain that can identify patterns in a way similar to how we do
  – This technique is particularly useful if you have a large amount of data, as it can reveal subtle patterns you haven't found or modelled ex ante
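To give a feel for how such a program "learns", here is a sketch of a single artificial neuron (a perceptron), the simplest building block of a neural network. It is not a full neural network: real ones stack many such units and are trained differently. The task (learning the logical AND of two inputs), the learning rate and the number of passes are all assumptions for this toy example.

```python
# Training examples for logical AND: ((input1, input2), expected output).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge weights toward the target whenever a prediction is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(weights, bias, predictions)
```

Nobody tells the neuron the rule "both inputs must be 1"; it arrives at suitable weights only by being corrected on examples.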

Page 39: Big data presentation for University of Reykjavik, Iceland, March 22

Machine learning

• Machine learning can use many different algorithms
  – Machine learning can use supervised, semi-supervised or unsupervised learning processes
  – Despite the fancy connotation, some machine learning algorithms are not that complex
  – Of course they can also be very complex (Google's self-driving car, IBM Watson's Jeopardy!-playing system)
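One example of a machine learning algorithm that is "not that complex" is k-nearest neighbours, a supervised method: a new case gets the label that is most common among the most similar known cases. The choice of this algorithm, the customer data and the labels are assumptions for illustration only.

```python
from collections import Counter
from math import dist

# Made-up labelled examples: ((visits per month, basket value in tens), label).
TRAINING = [
    ((1.0, 1.0), "low spender"),
    ((1.5, 2.0), "low spender"),
    ((2.0, 1.5), "low spender"),
    ((8.0, 8.0), "high spender"),
    ((8.5, 9.0), "high spender"),
    ((9.0, 8.5), "high spender"),
]


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


new_customer_a = knn_predict(TRAINING, (2.0, 2.0))
new_customer_b = knn_predict(TRAINING, (7.5, 8.0))
print(new_customer_a, new_customer_b)
```

The whole "model" is the stored examples plus a distance measure, which shows that the label "machine learning" does not always imply a complicated algorithm.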

Page 40: Big data presentation for University of Reykjavik, Iceland, March 22

Recommendation algorithm

Source: http://www.datasciencecentral.com/profiles/blogs/collaborative-filtering-tutorials-across-languages

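The linked tutorials cover collaborative filtering. As a rough sketch of the idea, the code below implements a user-based variant: items a user has not rated are scored from the ratings of similar users. The ratings, the names and the choice of cosine similarity are assumptions for this example and are not taken from the slide or the linked source.

```python
from math import sqrt

# Made-up ratings on a 1 to 5 scale.
RATINGS = {
    "Anna": {"Film A": 5, "Film B": 4, "Film C": 1},
    "Bjorn": {"Film A": 5, "Film B": 5, "Film D": 4},
    "Clara": {"Film A": 1, "Film C": 5, "Film D": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated items count as zero."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Predict ratings for items the user has not rated, best first."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue
            # Similar users' ratings count for more.
            scores[item] = scores.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity
    predictions = {item: scores[item] / weights[item] for item in scores if weights[item] > 0}
    return sorted(predictions.items(), key=lambda pair: pair[1], reverse=True)


recommendations = recommend("Anna", RATINGS)
print(recommendations)
```

Anna's predicted rating for Film D leans toward Bjorn's rating rather than Clara's, because Bjorn's tastes are closer to hers.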

Page 41: Big data presentation for University of Reykjavik, Iceland, March 22

Data mining

• Data mining:
  – A process of extracting value from large quantities of unstructured data, including text, images, voice and video. Includes pattern recognition, tagging and annotation
  – Data mining can really increase the value of the data
• Sentiment analysis:
  – Seeks to extract subjective opinion or sentiment from text, video or audio data
  – The basic aim is to determine the attitude of an individual or group regarding a particular topic or overall context
  – Used to understand stakeholder opinion
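The simplest form of sentiment analysis on text counts positive and negative words. The sketch below assumes a tiny hand-made word list and English text; real tools use much larger lexicons or trained models and handle negation, irony and context, which this example does not.

```python
import re

# Tiny, made-up sentiment lexicon; real lexicons hold thousands of words.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)


def classify(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up customer comments.
COMMENTS = [
    "I love the new app, great work",
    "Terrible support and slow delivery",
    "The parcel arrived on Tuesday",
]
labels = [classify(comment) for comment in COMMENTS]
print(labels)
```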

Page 42: Big data presentation for University of Reykjavik, Iceland, March 22

Text analysis

Source: Zimmerman, C., Stein, M. K., Hardt, D., & Vatrapu, R. (2015). Emergence of Things Felt. In Proceedings of the Thirty Sixth International Conference on Information Systems. ICIS 2015

Own analysis based on Twitter data: query for all tweets including "open data" OR opengovdata OR opengov in March 2012, 2013 and 2014; 100k rows in total

Page 44: Big data presentation for University of Reykjavik, Iceland, March 22

Data storage (NoSQL)

• Hadoop: Open-source software framework for distributed storage of very large datasets on computer clusters
• Cloudera: An enterprise solution to help businesses manage their Hadoop ecosystem
• MongoDB: Good for managing data that change frequently or data that are unstructured or semi-structured
• Apache Cassandra: Data replication, scalability and performance

Page 45: Big data presentation for University of Reykjavik, Iceland, March 22

Middleware: Data integration and management

• Talend: Master Data Management (MDM) offering, which combines real-time data, applications, and process integration with embedded data quality and stewardship

• Pentaho: A comprehensive data integration and business analytics platform, incl. embedded analysis

• Splunk: Monitor, search and analyze massive streams of machine data

• InfoSphere Master Data Management: Helps link unstructured content from external sources to the golden record for an enhanced 360-degree view

Page 46: Big data presentation for University of Reykjavik, Iceland, March 22

(Visual) Analytics

• Many of the middleware solutions reach into this space and vice versa – most of these tools have data integration possibilities and the others offer some analytics
• Tableau: Has focused on integration with various data sources (incl. Hadoop) and easy visualization of data – very easy to use
• Qlik: Very robust, offering options to create very nice dashboards, but has a somewhat steeper learning curve
• IBM Watson Analytics: Can use natural language to ask questions that are “translated” into a query

Page 47: Big data presentation for University of Reykjavik, Iceland, March 22

Advanced Analytics

• SPSS – tabular data only and narrow capabilities but relatively easy to use
• SAS – dynamic (a programming language) and a lot of options for analysis
• Matlab – a lot of flexibility for doing own programming
• R (open source) – can apply packages but you still have to do a lot of manual labour (code). Many options
• For structural equation modelling: specific packages such as SmartPLS or Amos

Page 49: Big data presentation for University of Reykjavik, Iceland, March 22

Economics of data

• I use the economist's approach and view data (of any size) as a resource
• Specific features of digital data
  – Low marginal costs – easy to distribute and reuse
  – Can be used for many different things
  – Value mostly from downstream activities

From an economic perspective, it makes sense to reuse data as much as possible

Page 50: Big data presentation for University of Reykjavik, Iceland, March 22

Data accounting

• Data as a resource
  – What data do we have?
  – Where do they originate from?
  – Where are they stored, and who is responsible?
  – Are they sensitive or can they be opened for reuse?
  – Are they streaming or static?
  – Are they mission critical or less important?
  – Are we using them optimally?
  – Do we have the right skills?
  – Do we have the right tools?
• We know about our human resources, machines, buildings, cars, production parts... Now we must have the same knowledge about data
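One way to start such data accounting is a simple inventory with one record per data asset, with fields that mirror the questions above. The sketch below is only a suggestion: the field names, the example assets and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataAsset:
    """One line in a (hypothetical) data inventory."""
    name: str               # What data do we have?
    origin: str             # Where do they originate from?
    storage: str            # Where are they stored?
    owner: str              # Who is responsible?
    sensitive: bool         # Sensitive, or could they be opened for reuse?
    streaming: bool         # Streaming or static?
    mission_critical: bool  # Mission critical or less important?


# Made-up example assets.
INVENTORY = [
    DataAsset("customer master", "CRM system", "relational database", "sales operations", True, False, True),
    DataAsset("store opening hours", "facilities team", "spreadsheet", "communications", False, False, False),
    DataAsset("web server logs", "web servers", "log files", "IT operations", True, True, False),
]


def reuse_candidates(inventory):
    """Assets that are not sensitive and could be considered for opening up."""
    return [asset.name for asset in inventory if not asset.sensitive]


def critical_assets(inventory):
    """Assets the business depends on and should manage most carefully."""
    return [asset.name for asset in inventory if asset.mission_critical]


print(reuse_candidates(INVENTORY))
print(critical_assets(INVENTORY))
```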

Page 52: Big data presentation for University of Reykjavik, Iceland, March 22

Business strategy

• Consider the competitive advantage offered by your own data
• Consider the potential value from using other available external data
• Consider costs and benefits
• Consider what other companies are doing (not necessarily in the same industry)

Sometimes it makes sense to reuse data internally
Sometimes it makes sense to fetch and use external data
Sometimes it makes sense to share own data

Page 53: Big data presentation for University of Reykjavik, Iceland, March 22

Value generation mechanisms

[Matrix diagram. One dimension: "Exploitation: Good governance" and "Exploration: Driving change". Other dimension: "Economic: Market mechanisms" and "Social: Information sharing mechanisms". Cells: ∆ Transparency, ∆ Civic engagement, ∆ Efficiency, ∆ Innovation.]

Page 54: Big data presentation for University of Reykjavik, Iceland, March 22

Model with 2-sided markets

[Diagram: model of sustainable value generation through intermediaries (MSPs). Labels shown:]

• Societal level structures: Openness of data; Digital leadership of government; Cost of high-speed networks; Effectiveness of data and privacy protection frameworks; Ease of reaching a skilled workforce
• Grouping labels: Resource; Opportunity; Motivation; Ability; Basic requirements; Soft infrastructure
• Intermediaries (MSPs)
  – Paying side: Buying and selling goods and services
  – Non-paying side: Sharing relevant content
  – Information sharing + market mechanisms = Synergy
• Societal level impact: Sustainable value

MSPs = Multi Sided Platforms

Page 56: Big data presentation for University of Reykjavik, Iceland, March 22

Examples of use

• Better understand and target customers
• Understand and optimize business processes
• Improving health
• Smart cities
• Improving sports performance

Page 58: Big data presentation for University of Reykjavik, Iceland, March 22

Customer Dashboard II

Source: IBM, PRNewswire: https://photos.prnewswire.com/prnvar/20150528/219085?max=1600

Page 59: Big data presentation for University of Reykjavik, Iceland, March 22

Increase efficiency

Source: QuantumBlack http://www.quantumblack.com/


Page 60: Big data presentation for University of Reykjavik, Iceland, March 22

Improving health


Page 62: Big data presentation for University of Reykjavik, Iceland, March 22

Smart cities

• Improve security and save money
• Analyze traffic -> implement automatic traffic controls

Page 63: Big data presentation for University of Reykjavik, Iceland, March 22

Improving sports performance

Prozone analyses over 750,000 data points per game, 300 GPS data points per training session and 110 data points per injury to create the most comprehensive injury database in sport. By analyzing over 36 million data points per club per season, we aim to identify the subtle patterns in a player's performance that predispose them to increased relative risk of injury, allowing the club to act to prevent injury.

Page 64: Big data presentation for University of Reykjavik, Iceland, March 22


THANK YOU!