no big data without small data
TRANSCRIPT
15/04/2023 2
Big Data - a definition
Big data is the name we give to a collection of data so large and complex that it can’t be processed using traditional IT applications and programs. In general, the volume is at least 1000 times larger than traditional sources of data.
© Decision Support Systems 2014
“You can have data without information but you can’t have information without data.” Daniel Keys Moran – American author
Small Data - a definition
Data (small data) is a synonym for facts; Merriam-Webster defines it as: “facts or information used usually to calculate, analyse, or plan something”.
But if the facts are not right then the analysis will be incorrect and we will make the wrong decisions!
Invoice Date | Customer                | Country        | Euros excl VAT | VAT    | Total  | Invoice Number | Payment
22/01/2014   | Mondea                  | Netherlands    | 795.00         | 166.95 | 961.95 | 2014.416       | iDeal
24/01/2014   | Physter Technology      | Czech Republic | 795.00         | 0      | 795.00 | 2014.417       | Visa
27/01/2014   | Copenhagen Airports A/S | Denmark        | 795.00         | 166.95 | 961.95 | 2014.421       | MC
28/01/2014   | Vista Group             | Finland        | 575.00         | 120.75 | 695.75 | 2014.423       | MC
07/02/2014   | Global Information      | USA            | 709.01         | 0      | 709.01 | 2014.441       | Invoice
14/02/2014   | DataPad Inc             | United States  | 795.00         | 0      | 795.00 | 2014.451       | MC
21/02/2014   | Scrip Companies         | USA            | 795.00         |        | 795.00 | 2014.464       | PayPal
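The invoice rows above contain a few small inconsistencies of the kind that make facts "not right": the same country is written two ways and one VAT cell is empty. A minimal sketch of how such rows could be checked is shown here. The rule set, the function names and the alias map (USA treated as a variant spelling of United States) are assumptions made for illustration, not part of the original slides.

```python
from decimal import Decimal

# Rows copied from the invoice table; None marks the VAT cell that is empty in the source.
INVOICES = [
    ("22/01/2014", "Mondea", "Netherlands", "795.00", "166.95", "961.95", "2014.416", "iDeal"),
    ("24/01/2014", "Physter Technology", "Czech Republic", "795.00", "0", "795.00", "2014.417", "Visa"),
    ("27/01/2014", "Copenhagen Airports A/S", "Denmark", "795.00", "166.95", "961.95", "2014.421", "MC"),
    ("28/01/2014", "Vista Group", "Finland", "575.00", "120.75", "695.75", "2014.423", "MC"),
    ("07/02/2014", "Global Information", "USA", "709.01", "0", "709.01", "2014.441", "Invoice"),
    ("14/02/2014", "DataPad Inc", "United States", "795.00", "0", "795.00", "2014.451", "MC"),
    ("21/02/2014", "Scrip Companies", "USA", "795.00", None, "795.00", "2014.464", "PayPal"),
]

# Assumption: one agreed spelling per country; the mapping itself is hypothetical.
COUNTRY_ALIASES = {"USA": "United States"}


def check_invoice(row):
    """Return a list of data-quality problems found in one invoice row."""
    _date, _customer, country, net, vat, total, _number, _payment = row
    problems = []
    if vat is None:
        problems.append("VAT is missing")
    elif Decimal(net) + Decimal(vat) != Decimal(total):
        problems.append("net + VAT does not equal total")
    if country in COUNTRY_ALIASES:
        problems.append(f"country '{country}' should be '{COUNTRY_ALIASES[country]}'")
    return problems


def check_all(rows):
    """Map invoice number to its problems, skipping rows that pass every check."""
    report = {}
    for row in rows:
        problems = check_invoice(row)
        if problems:
            report[row[6]] = problems
    return report


if __name__ == "__main__":
    for number, problems in check_all(INVOICES).items():
        print(number, problems)
```

Amounts are kept as Decimal rather than float so that the comparison of net plus VAT against the total is exact.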
The goals of working with data
the creation of a consistent, accurate and timely source of processed data (information) that can be used to support the decision making process
the creation of an historic information source which can be used uniquely as the basis of both comparative and predictive analysis
the integration of data from different sources (both internal and external)
the creation of “one source of the truth” which we need as the basis of making better decisions
Where does the data come from?
Small Data - structured
Internal applications
Spreadsheets!
Big Data – often unstructured
Organizational processes: measurements, websites, machines
Communication: e-mail, reports, presentations
Social media: Facebook, LinkedIn, Twitter
Sensors: temperature, weather, traffic, rainfall
Archives: old documents, old films
Unstructured data - a definition
Unstructured data is not directly accessible in a database. Examples are various sorts of documents such as Office documents, PDF, XML, email messages, pictures, videos and sound clips. The contents are often dates, numbers and other facts, but they are difficult to interpret directly with current technology.
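To illustrate why facts buried in unstructured text are hard to interpret, here is a minimal sketch that pulls dates and euro amounts out of a free-text message with regular expressions. The sample message, the patterns and the function name are all hypothetical; a real document collection would need far more than two patterns.

```python
import re
from datetime import datetime

# Assumption: dates are written dd/mm/yyyy and amounts as "€ 1,234.56".
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
AMOUNT_PATTERN = re.compile(r"€\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")

# Hypothetical e-mail text, invented for this example.
SAMPLE = (
    "Dear customer, your invoice of 22/01/2014 for € 961.95 is still open. "
    "We will call you next Friday to discuss it."
)


def extract_facts(text):
    """Return (dates, amounts) that match the two simple patterns above."""
    dates = [
        datetime.strptime(match.group(0), "%d/%m/%Y").date()
        for match in DATE_PATTERN.finditer(text)
    ]
    amounts = [
        float(match.group(1).replace(",", ""))
        for match in AMOUNT_PATTERN.finditer(text)
    ]
    return dates, amounts


if __name__ == "__main__":
    print(extract_facts(SAMPLE))
```

The explicit date and amount are found, but "next Friday" is also a date and is silently missed. That gap is the practical meaning of "difficult to interpret directly".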
A Letter from the Chairman of IBM
The market for data and analytics is estimated at $187 billion by 2015. To capture this growth potential, we have built the world’s broadest and deepest capabilities in Big Data and analytics, both technology and expertise. We have invested more than $24 billion, including $17 billion of gross spend on more than 30 acquisitions. We have 15,000 consultants and 400 mathematicians. Two thirds of IBM Research’s work is now devoted to data, analytics and cognitive computing. IBM has earned 4,000 analytics patents.
Big Data is an addition
Big Data is an additional source, not something that just exists independently
the goal is to complement the existing data
“revenue” from Big Data must have the same definition as “revenue” from Small Data
quality is just as important; if that is not the case Big Data is just a lot of Bad Data
What we call things is important
Revenue
How much did I sell? = € 100,000
How much can I book? = € 96,422
Data quality – a problem?
Just like most of the other IT analysts I am convinced that data quality forms a huge risk for our decision making processes – the problem is that the quality of the data is so bad that we can’t prove it!
Norman Manley, IT analyst
Small data – what are the problems?
the files have many different formats which makes them very difficult to read
it is often unclear what the contents of a field are (and also what they mean)
privacy is a problem – are we allowed to see some things and are we allowed to do anything with the data?
data is often missing (both individual fields and parts of files)
data is not up to date
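Two of the problems listed here, missing data and data that is not up to date, can be measured before any analysis starts. The sketch below counts empty fields and stale records in a small batch. The field names, the sample records and the one-year freshness threshold are assumptions chosen for illustration only.

```python
from datetime import date

# Hypothetical schema: the fields every record is expected to carry.
REQUIRED_FIELDS = ("customer", "country", "vat", "last_updated")

# Hypothetical records, invented for this example.
RECORDS = [
    {"customer": "A", "country": "Netherlands", "vat": 166.95, "last_updated": date(2014, 1, 22)},
    {"customer": "B", "country": "", "vat": 0, "last_updated": date(2012, 6, 1)},
    {"customer": "C", "country": "USA", "vat": None, "last_updated": None},
]


def profile(records, today, max_age_days=365):
    """Count missing values per field and records older than max_age_days."""
    missing = {field: 0 for field in REQUIRED_FIELDS}
    stale = 0
    for record in records:
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # A value of 0 is a real value; only None and empty strings count as missing.
            if value is None or value == "":
                missing[field] += 1
        updated = record.get("last_updated")
        if updated is not None and (today - updated).days > max_age_days:
            stale += 1
    return {"records": len(records), "missing": missing, "stale": stale}


if __name__ == "__main__":
    print(profile(RECORDS, today=date(2014, 3, 1)))
```

Treating 0 and "empty" as different things is a deliberate choice: a VAT of 0 is a fact, while a blank VAT is a gap.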
How useful is big data?
It seems that the 4 engines of a Boeing 747 generate more data on a flight to New York than most companies do in a year.
The question remains, do we need to save all the data, for how long, what are we going to do with it and can we generate “actionable information”?
Very Big Data!
At up to 500MB per flight this is a huge amount of data
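To put the per-flight figure in perspective, the short calculation below works out how many flights at the quoted upper bound of 500 MB it would take to reach one petabyte, and to reach the 2.8 petabytes quoted for Vestas later in this deck. Decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and the function name are assumptions; the slides do not say which convention is meant.

```python
MB = 10**6   # assumption: decimal megabyte
PB = 10**15  # assumption: decimal petabyte

BYTES_PER_FLIGHT = 500 * MB  # upper bound quoted on the slide


def flights_to_fill(target_bytes, bytes_per_flight=BYTES_PER_FLIGHT):
    """Number of flights needed to produce at least target_bytes (rounded up)."""
    return -(-target_bytes // bytes_per_flight)


if __name__ == "__main__":
    print(flights_to_fill(1 * PB))         # flights needed for 1 PB
    print(flights_to_fill(28 * PB // 10))  # flights needed for 2.8 PB
```

Under these assumptions one petabyte corresponds to two million flights, which is why the question of what to keep, and for how long, matters more than the size of any single flight.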
Big Data successes
Vestas, a Danish wind turbine manufacturer, collects data from 35,000 meteorological stations and 45,000 of its own turbines. This allows the company to choose the best locations, in terms of wind conditions, for placing new windmills. It expects to collect as much as 24 petabytes of data (it already has 2.8 petabytes). The time needed to analyse the suitability of a new location has been reduced from several weeks to 15 minutes.
Big Data successes
Los Angeles and Santa Cruz police, together with PredPol (a software vendor) and a mathematician from Santa Clara University, have developed a system that can predict where criminal activity will take place, accurate to an area of 50 m². A combination of historical data and feeds from “live” cameras is used to predict where the local police should patrol to prevent burglary, amongst other crimes. The number of burglaries has been reduced by 33% in the last year. This is called “predictive policing”.
Conclusions
if Small Data doesn’t work properly then Big Data has no chance
Big Data in itself has no value - but it does give us the possibility to generate new insights
the crux is accuracy – bad data quality leads to information that is even worse