Big data presentation for University of Reykjavik, Iceland, March 22
TRANSCRIPT
BIG DATA
Thorhildur Jetzek, Ph.D. Postdoctoral Fellow
Department of IT management, CBS
• Upper-secondary diploma (stúdent), Physics track I, MR, 1991
• B.Sc. in Economics 1994
• M.Sc. in Economics 1998
• Ph.D. in Information Technology Management 2015
• Have worked as an economist, IT consultant, assistant professor, project manager, program manager, director, industrial PhD and now postdoctoral researcher.
• Have always focused on use of technology
Who am I?
Traditional career
My career
@Thorhildur Jetzek CBS 2|
• High ranking
– 2nd in Europe (behind LSE) and 22nd worldwide
• Focus on collaboration with industry
– Industrial PhD (my PhD contract was at KMD)
– Engaged scholarship & collaborative research (current project of mine sponsored by industry)
– Crowdsourcing events:
• Student competition where CBS students got access to anonymized data on 100,000 customers of Danske Bank and socio-economic data from KMD, as well as data from Danske Bank's public Facebook wall
• Cash prizes (DKK 75,000 first prize)
Five societal megatrends
We are in the eye of the storm….
Rise of Digitization
An average decline of almost 40% a year in the cost per gigabyte of consumer hard disk drives since 1998 (OECD, 2013).
38% yearly decrease in the cost of shifting one bit per second since 1995 (OECD, 2013).
More than 30 million interconnected sensors are now deployed worldwide, in areas such as security, health care, transport systems or energy control systems, and their numbers are growing by around 30% a year (McKinsey, 2011).
6 billion people have cellphones
30 billion pieces of content are shared on Facebook every month
2002: The year when the amount of information stored digitally surpassed non-digital information!
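To make the compounding in the cost figures concrete: a constant decline of 40% a year leaves 60% of the cost each year, so after n years the cost is the starting cost times 0.6^n. The sketch below is only an illustration. The function name `cost_after` and the starting value of 100 are assumptions chosen for the example, not OECD figures.

```python
def cost_after(initial_cost: float, yearly_decline: float, years: int) -> float:
    """Cost remaining after compounding a constant yearly decline."""
    return initial_cost * (1.0 - yearly_decline) ** years


# Hypothetical starting cost of 100 units per gigabyte (illustrative only).
for years in (1, 5, 10, 15):
    remaining = cost_after(100.0, 0.40, years)
    print(f"after {years:2d} years: {remaining:.4f}")
```

The same function works for the 38% yearly decrease in the cost of shifting one bit per second by passing 0.38 as the decline.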
Changes...
Forbes highlights
• IT in the boardroom: Digital strategies
• Changing business models - platforms
• Big data and analytics
• Lacking skills: EU estimates 160% increase in demand for Big Data specialists between 2013-2020 to 346,000 new jobs
IDC predicts
• Market for big data analysis services over $16 billion in 2014, growing six times faster than the entire IT industry
• Cloud-based big data and analytics will grow three times faster than spending for on premise solutions in 2015
Global open access
FLOSS – Free/Open Source Software
“…people's pursuit of visible carrots is at times interrupted by the larger quest for the invisible gold at the end of the rainbow.” (von Krogh et al., 2012a, p. 671)
• Collaborative projects: Wikipedia, Human Genome Project, Open.Nasa.gov
• Open NGO data: http://data.worldbank.org/ (and a multitude of similar)
• Open Government Data: http://data.gov / http://data.gov.uk (and 300 others)
• Open company data (open APIs): Facebook, Twitter, LinkedIn
• Platforms: CouchSurfing.com; NeighbourGoods.net
What is BIG data?
• The jury is still out
– Davenport: New technologies – software and infrastructure plus the data itself
– Forrester defines Big Data as “techniques and technologies that make handling data at extreme scale affordable”
– McKinsey (2011): “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
Classification of big data
Dimensions of big data: 4 V‘s
Source: IBM, http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Utilization of data
Source: @PetteriA: http://www.slideshare.net/petterialahuhta/alahuhta-big-dataandanalytics24sep2014
Terminology (@HildaJetzek)
Figure: overview of data types
• Small data: structured, kept in relational databases; purposefully entered into systems
• Big data: high volume, high velocity, variety of structures, uncertain veracity; a “by-product of activity”
• Open data: data that are open for any use outside of organisational boundaries
• Master data: data that are used as reference data across systems
• Other labels in the figure: social data, machine data
Liquid open data (@HildaJetzek)
Liquidity – reflects ability to link and stream data across systems
Openness – reflects ability to use data outside of organizational boundaries
Figure: 2x2 matrix of liquid vs. illiquid data and closed vs. open data
• Liquid closed data: Data are effectively reused across a variety of systems within a single organization (example: internally shared data)
• Illiquid (siloed) closed data: Data are stored where they originate and not reused (example: most data within organizations)
• Illiquid (siloed) open data: Data are used outside of organizational boundaries but offer limited potential for automation or coupling of data (example: many open government data initiatives)
• Liquid open data: Data are used outside of organizational boundaries and easily coupled with other data and integrated across systems (example: combining internal and external data for improved insights)
How do we identify openness? (@HildaJetzek)
Dimension | Affordance | Explanation
Openness
Strategic | Availability | Data are open to all by default
Economic | Affordability | Data are free or charged for at no more than the marginal cost of reproduction
Legal | Reusability | Data are published with open licenses
Liquidity
Conceptual | Interoperability | Semantics and syntax are clear, data models and metadata are published, use of standard identifiers
Technical | Usability | Data are of high quality, published in machine readable and standard formats, using contextual metadata
Technical | Discoverability | Data are easily found through central portals or published with searchable metadata or using linked data semantics
Technical | Accessibility | Data are easily downloadable or “query-able” through APIs
Binary or continuous?
• Data are not just open or closed, or liquid or illiquid – a continuous range
• Classification useful for strategy purposes
– A part of an organization’s data need to be liquid across the company (customer master)
– Other data could be open but illiquid (financial statement)
– Some data are liquid and open (genomics data, geospatial data)
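One way to picture the continuous range is to score a dataset on each dimension as the share of affordances it satisfies, and only then map the two scores to a quadrant. The sketch below is an illustration under stated assumptions: equal weighting of affordances, the 0.5 threshold, and the names `Dataset`, `score` and `quadrant` are all hypothetical and not part of the framework itself.

```python
from dataclasses import dataclass

# Affordance names follow the openness/liquidity table in this deck.
OPENNESS = ("availability", "affordability", "reusability")
LIQUIDITY = ("interoperability", "usability", "discoverability", "accessibility")


@dataclass(frozen=True)
class Dataset:
    name: str
    affordances: frozenset  # affordances the dataset is judged to satisfy


def score(dataset: Dataset, dimension: tuple) -> float:
    """Share of a dimension's affordances that the dataset satisfies (0.0 to 1.0)."""
    return sum(a in dataset.affordances for a in dimension) / len(dimension)


def quadrant(dataset: Dataset, threshold: float = 0.5) -> str:
    """Map the two continuous scores to a quadrant (threshold is an assumption)."""
    liquid = "liquid" if score(dataset, LIQUIDITY) >= threshold else "illiquid"
    is_open = "open" if score(dataset, OPENNESS) >= threshold else "closed"
    return f"{liquid} {is_open}"


# Hypothetical assessments of the two examples mentioned on the slide.
customer_master = Dataset(
    "customer master",
    frozenset({"interoperability", "usability", "accessibility"}),
)
financial_statement = Dataset(
    "financial statement",
    frozenset({"availability", "affordability", "reusability"}),
)

for ds in (customer_master, financial_statement):
    print(ds.name, score(ds, OPENNESS), score(ds, LIQUIDITY), quadrant(ds))
```

Keeping the scores separate from the quadrant label preserves the point of the slide: the label is a convenience for strategy discussions, while the underlying position is continuous.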
Highlights
• Why do we have so much data?
• What are the underlying societal changes we need to be aware of?
• Why has openness become so popular?
• Does it make sense to make more use of data, even if it is expensive to re-think how we handle data in the company?
Machine-generated big data
• Sensors/IoT devices
– From car navigation systems, smart meters, unmanned security systems, sensors etc.
People-generated big data
• Social data
– Sources: Social media websites, blog sites, product reviews, search results
– Unstructured, natural language
• Data from mobile phones
– Most commonly geolocation
– For example used to analyze traffic or movement of people or to do geo-tagging
Source: Waze https://www.waze.com/
Measurement data
• Nature/Environment
– Sources: Measurements, such as meteorological, atmospheric and pollution Big Data
• Geospatial data
• Lifeforms
– Sources: Genetic sequencing, patient databases
Source: @Vishy Iyer, UT, “Cracking the Genetic Code of Brain Tumors” https://news.utexas.edu/2012/09/28/cracking-the-genetic-code-of-brain-tumors
Structure of data
• How do we define structured data?
– Very often referred to as data in structured relational databases
– Known data model, identities and tabular formats (columns and rows)
– Still, a lot of (big) data analytics tools/packages want tabular formats
• R uses data frames
• Tableau wants a tabular format
• SAS/SPSS use a tabular format
Semi-structured data
• Typically data such as XML or JSON
– Nested, not tabular but a known structure all the same
– Could be applied to text files such as logs
– Can be transformed into a tabular structure (with many empty cells)
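The last point can be sketched in a few lines: nested JSON records are flattened into dotted column names, and any column a record lacks becomes an empty cell. This is a minimal illustration using only the Python standard library. The records and the helper name `flatten` are made up for the example, and lists inside records are left as they are rather than expanded.

```python
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value  # lists are kept as-is in this sketch
    return flat


# Hypothetical semi-structured input: the two records have different fields.
raw = """[
  {"id": 1, "name": "Anna", "address": {"city": "Reykjavik"}},
  {"id": 2, "name": "Bjorn", "phone": {"mobile": "555-0100"}}
]"""

rows = [flatten(record) for record in json.loads(raw)]
columns = sorted({column for row in rows for column in row})
table = [[row.get(column, "") for column in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

Each record fills only the columns it has, which is where the many empty cells come from once records with different fields share one table.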
Is there structure?
Unstructured files
• Photos and graphic images
• Videos
• PDF files
• PowerPoint presentations
• Emails
• Blog entries
• Wikis
• Word processing documents
Standard analytics
• Data analytics can take many different forms
• Common forms of data analytics include:
– Static reporting: annual reports, quarterly reports etc.
– Dynamic reporting: Business intelligence, ability to choose columns and rows and reorganize data into a format that makes sense to the user
– Simple analysis: sums, filtering, pivot tables, max and min values, averages etc.
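The simple analysis operations listed above can be shown on a handful of rows. The sketch below uses only the Python standard library. The sales records are invented for illustration, and the nested dictionary standing in for a pivot table is a simplification of what a spreadsheet or BI tool would provide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records (illustrative values only).
sales = [
    {"region": "North", "quarter": "Q1", "amount": 120},
    {"region": "North", "quarter": "Q2", "amount": 150},
    {"region": "South", "quarter": "Q1", "amount": 90},
    {"region": "South", "quarter": "Q2", "amount": 60},
]

amounts = [row["amount"] for row in sales]
total = sum(amounts)                           # sum
average = mean(amounts)                        # average
highest, lowest = max(amounts), min(amounts)   # max and min values

# Filtering: keep only the rows for one region.
north = [row for row in sales if row["region"] == "North"]

# Pivot table: regions as rows, quarters as columns, summed amounts as cells.
pivot = defaultdict(dict)
for row in sales:
    cell = pivot[row["region"]].get(row["quarter"], 0)
    pivot[row["region"]][row["quarter"]] = cell + row["amount"]

print(total, average, highest, lowest)
print(north)
print(dict(pivot))
```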
Visual analytics
• To explore and understand data by visualizing
• Most people have an easier time understanding a chart than t-values or large numeric matrices
• Can range from “traditional” bar charts or lines to word clouds (highlights most used words by making them bigger), heatmaps, placing items on geographic maps, use of treemaps, bubble diagrams etc.
• Visual analytics is (like all statistics really) a combination of art and science. It is difficult to tell a good story with one picture, but it is a very powerful tool if you succeed!
Helps us understand
Source: SAS http://www.sas.com/en_nz/software/business-intelligence/visual-analytics.html
Basic analytics
• Use of slightly more advanced statistics using Excel or Excel add-ins or tools such as SAS or SPSS.
• Correlational/regression analysis.
– Could be used to see if there are any interesting correlations or explanations which can add to company decision making
– Can be used for forecasting, for example if there is a great weather forecast (external data), sales of ice cream are predicted to go up 15% => stock up on ice cream
– Be aware of the uncertainty in such models. This is not the truth!
– Be aware of spurious correlations: http://tylervigen.com/spurious-correlations
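The ice cream example can be sketched as a simple linear regression of sales on temperature, fitted by ordinary least squares. This is an illustration only: the temperature and sales figures are invented, the helper names `fit_line` and `pearson_r` are hypothetical, and a real forecast would also need the uncertainty estimates the slide warns about.

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical history: daily temperature (°C) and ice cream units sold.
temperature = [10, 12, 14, 16, 18, 20]
units_sold = [100, 110, 125, 130, 150, 155]

slope, intercept = fit_line(temperature, units_sold)
forecast_temperature = 22  # external data: tomorrow's weather forecast
predicted = slope * forecast_temperature + intercept

print(f"correlation: {pearson_r(temperature, units_sold):.3f}")
print(f"predicted units at {forecast_temperature} °C: {predicted:.1f}")
```

A high correlation here says nothing about cause, and the prediction extrapolates beyond the observed temperatures, so it should be read as a rough guide rather than a fact.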
Basic analytics
• Time series analysis
– Used to predict the future
– Makes use of historical data and looks for trends in the data
– Seasonal changes, growth etc.
– Use of statistical methods like moving averages to figure out long term trends
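A moving average smooths out short-term and seasonal swings by averaging each value with its neighbours, which makes the long-term trend easier to see. The sketch below is a minimal version. The function name `moving_average` and the quarterly figures are assumptions made for the example.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Hypothetical quarterly sales with a seasonal peak every fourth quarter.
quarterly_sales = [80, 90, 100, 140, 88, 99, 110, 154]

# A window of four quarters averages over one full seasonal cycle.
print(moving_average(quarterly_sales, 4))
```

Choosing the window equal to the length of the seasonal cycle is a common way to separate growth from seasonality; shorter windows follow the data more closely but smooth less.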
Advanced analytics
• There are many more, less commonly used statistical methods for quantitative data (numbers)
– Dimension reduction: Search for any natural clusters in the measurements (columns) that help us identify composite variables
– Cluster analysis: Search for any natural clusters in the data (rows). For instance, marketers like to cluster the general population of consumers into market segments with different buying behaviors
– Social network analysis: Clusters and relationships. Which groups on Facebook are likely to connect?
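Cluster analysis on rows can be illustrated with k-means, one common clustering algorithm (the slides do not name a specific method, so this choice is an assumption). The customer data are invented, and the naive start from the first k points is a simplification; practical implementations usually use random or smarter initialisation.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Naive k-means: assign each point to the nearest centroid, then re-centre."""
    centroids = list(points[:k])  # simplistic deterministic start (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (store visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 82)]

centroids, segments = k_means(customers, k=2)
for centroid, segment in zip(centroids, segments):
    print(centroid, segment)
```

In a marketing setting each resulting group would be inspected and named as a segment, for example occasional small-basket shoppers versus frequent large-basket shoppers.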
Advanced analytics
• Structural equation modelling (SEM):
– Simultaneously estimate multiple equations (multivariate)
– Estimate variables and paths (relationships)
– Can be based on covariance (CB-SEM) or multiple regression (PLS-SEM)
– Confirmatory factor analysis is based on a similar technique but without estimating the paths
Variable (typically based on more than one measure to reduce the risk of measurement bias)
Path – to understand nature of relationship
A combination (@HildaJetzek)
Figure: scatter plot of countries with axes “Liquid open data” and “Extent of digital innovation”
Different types of use
Artificial intelligence
• Neural network analysis:
– A computer program that is modeled after the human brain and can identify patterns in a way similar to how we do
– This technique is particularly useful if you have a large amount of data, which can reveal subtle patterns you haven’t found or modelled ex ante
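The basic building block of a neural network is a single artificial neuron that adjusts its weights whenever it makes a mistake. The sketch below trains one such neuron (a perceptron) on the logical AND pattern. It is a toy illustration of learning from examples, not a full neural network; the function names and the training settings are assumptions.

```python
def step(value):
    """Threshold activation: the neuron fires (1) only for positive input."""
    return 1 if value > 0 else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, learning_rate=1):
    """Perceptron rule: nudge weights in the direction that reduces each error."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Training examples for logical AND: output is 1 only when both inputs are 1.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
print(weights, bias)
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])
```

Real networks stack many such units in layers and use smoother activation functions, which is what lets them pick up the subtle patterns mentioned above.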
Machine learning
• Machine learning can use many different algorithms
– Machine learning can use supervised, semi-supervised or unsupervised learning processes
– Despite the fancy connotation, some machine learning algorithms are not that complex
– Of course they can also be very complex (Google's self-driving car, IBM Watson's Jeopardy!-playing system)
Recommendation algorithm
Source: http://www.datasciencecentral.com/profiles/blogs/collaborative-filtering-tutorials-across-languages
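Since the cited source covers collaborative filtering, here is a minimal sketch of the user-based variant: items are scored for a user by the ratings of other users, weighted by how similar those users are. The ratings, user names and the choice of cosine similarity are assumptions for illustration and are not taken from the slide or the linked tutorials.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "anna": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bjorn": {"book_a": 4, "book_b": 5, "book_d": 5},
    "clara": {"book_a": 1, "book_c": 5, "book_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, all_ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in all_ratings.items():
        if other == user:
            continue
        similarity = cosine(all_ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in all_ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("anna", ratings))
```

Anna's tastes overlap more with Bjorn's than with Clara's, so the item Bjorn rated highly ranks first for her.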
Data mining
• Data mining:
– A process of extracting value from large quantities of unstructured data, including text, images, voice and video. Includes pattern recognition, tagging and annotation
– Data mining can really increase the value of the data
• Sentiment analysis:
– Seeks to extract subjective opinion or sentiment from text, video or audio data
– The basic aim is to determine the attitude of an individual or group regarding a particular topic or overall context
– Used to understand stakeholder opinion
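The simplest form of sentiment analysis on text counts positive and negative words from a fixed word list. The sketch below shows that idea only; the tiny word lists and the function names are invented for the example, and production tools use far larger lexicons or trained models and handle negation, sarcasm and context.

```python
import re

# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_label(text):
    """Turn the score into a coarse attitude label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for comment in (
    "I love the great service",
    "Terrible and slow support",
    "The report is out",
):
    print(comment, "->", sentiment_label(comment))
```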
Text analysis
Source: Zimmerman, C., Stein, M. K., Hardt, D., & Vatrapu, R. (2015). Emergence of Things Felt. In Proceedings of the Thirty Sixth International Conference on Information Systems. ICIS 2015
Own analysis based on Twitter data, querying all tweets including "open data" OR opengovdata OR opengov in March 2012/13/14, a total of 100k rows
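The OR query above can be approximated on already collected text with a simple filter. This sketch assumes plain case-insensitive substring matching, which is not necessarily how Twitter's own search evaluates the query, and the example tweets are invented.

```python
# Terms from the query: "open data" OR opengovdata OR opengov
QUERY_TERMS = ("open data", "opengovdata", "opengov")


def matches_query(text, terms=QUERY_TERMS):
    """True if any query term occurs in the text, ignoring case (assumption)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Hypothetical tweets.
tweets = [
    "Open Data day in Reykjavik next week",
    "New #OpenGov portal launched today",
    "Big data is everywhere",
]

selected = [tweet for tweet in tweets if matches_query(tweet)]
print(selected)
```

Note that under substring matching the term opengov also covers opengovdata, while a hashtag written as one word, such as #opendata, would not match the phrase "open data".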
An example of solutions
Data storage (NoSQL)
• Hadoop: open-source software framework for distributed storage of very large datasets on computer clusters
• Cloudera: An enterprise solution to help businesses manage their Hadoop ecosystem
• MongoDB: It’s good for managing data that changes frequently or data that is unstructured or semi-structured
• Apache Cassandra: Data replication, scalability and performance
Middleware: Data integration and management
• Talend: Master Data Management (MDM) offering, which combines real-time data, applications, and process integration with embedded data quality and stewardship
• Pentaho: A comprehensive data integration and business analytics platform, incl. embedded analysis
• Splunk: Monitor, search and analyze massive streams of machine data
• InfoSphere Master Data Management: Helps link unstructured content from external sources to the golden record for that enhanced 360-degree view
(Visual) Analytics
• Many of the middleware solutions reach into this space and vice versa – most of these tools have data integration possibilities and the others offer some analytics
• Tableau: Has focused on integration to various data-sources (incl. Hadoop) and easy visualization of data – very easy to use
• Qlik: Very robust, offering options to create very nice dashboards, but has a bit steeper learning curve
• IBM Watson Analytics: Can use natural language to ask questions that are „translated“ into a query
Advanced Analytics
• SPSS – tabular data only and narrow capabilities but relatively easy to use
• SAS – dynamic (a programming language) and a lot of options for analysis
• Matlab – a lot of flexibility for doing own programming
• R (open source) – can apply packages but you still have to do a lot of manual labour (code). Many options
• For structural equation modelling: specific packages such as SmartPLS or Amos
Economics of data
• I use the economist's approach and view data (of any size) as a resource
• Specific features of digital data
– Low marginal costs – easy to distribute and reuse
– Can be used for many different things
– Value mostly from downstream activities
From an economic perspective, it makes sense to reuse data as much as possible
Data accounting
• Data as a resource
– What data do we have?
– Where do they originate from?
– Where are they stored, who is responsible?
– Are they sensitive or can they be opened for reuse?
– Are they streaming or static?
– Are they mission critical or less important?
– Are we using them optimally?
– Do we have the right skills?
– Do we have the right tools?
• We know about our human resources, machines, buildings, cars, production parts... Now we must have the same knowledge about data
...capitalizing on the benefits of digitization needs to be a strategic imperative
Value generation
Business strategy
• Consider the competitive advantage offered by your own data
• Consider the potential value from using other available external data
• Consider costs and benefits
• Consider what other companies are doing (not necessarily in the same industry)
Sometimes it makes sense to reuse data internally
Sometimes it makes sense to fetch and use external data
Sometimes it makes sense to share own data
Value generation mechanisms (@HildaJetzek)
Exploitation: Good governance
Exploration: Driving change
Economic: Market mechanisms
Social: Information sharing mechanisms
∆ Transparency
∆ Civic engagement
∆ Efficiency
∆ Innovation
Model with 2-sided markets (@HildaJetzek)
Figure labels:
Soft infrastructure
Sustainable value
Paying side: Buying and selling goods and services
Non-paying side: Sharing relevant content
Cost of high-speed networks
Openness of data
Societal level impact
Intermediaries (MSPs)
Information sharing + market mechanisms = Synergy
Effectiveness of data and privacy protection frameworks
Ease of reaching a skilled workforce
Motivation
Ability
Basic requirements
Resource
Digital leadership of government
Opportunity
Societal level structures
MSPs = Multi Sided Platforms
Examples of use
• Better understand and target customers
• Understand and optimize business processes
• Improving health
• Smart cities
• Improving sports performance
Customer dashboard I
Source: IBM
Customer Dashboard II
Source: IBM, PRNewswire: https://photos.prnewswire.com/prnvar/20150528/219085?max=1600
Increase efficiency
Source: QuantumBlack http://www.quantumblack.com/
Improving health
Health dashboards
Source: Heads Up Health http://headsuphealth.com/
Smart cities
• Improve security and save money
• Analyze traffic -> implement automatic traffic controls
Improving sports performance
Prozone analyses over 750,000 data points per game, 300 GPS data points per training session and 110 data points per injury to create the most comprehensive injury database in sport. By analyzing over 36 million data points per club per season we aim to identify the subtle patterns in a player’s performance that predispose them to increased relative risk of injury, allowing clubs to act to prevent injury.
THANK YOU!