what is big data? · sql-mapreduce, reduce function . data output from mass spectrometer ....

Global Sponsor: What is Big Data? Mark Whitehorn, Co-Founder, Penguinsoft Consulting Ltd.

Post on 04-Jul-2020

TRANSCRIPT

Page 1:

Global Sponsor:

What is Big Data?

Mark Whitehorn, Co-Founder, Penguinsoft Consulting Ltd.

Page 2:

It’s all about me…

Prof Mark Whitehorn Chair of Analytics School of Computing University of Dundee Scotland

Consultant Writer (author)

2

Presenter
Presentation Notes
8 minutes
Page 3:

It’s all about me…

Prof Mark Whitehorn
Teach a Masters in BI
And another in Data Science
Also research work

3

Presenter
Presentation Notes
8 minutes
Page 4:

Actually, it isn’t all about me…

Andy Cobley Chris Hillman Prof. Angus Lamond Dr. Yasmeen Ahmad

4

Presenter
Presentation Notes
8 minutes
Page 5:

What is Big data?

Is it really just a marketing campaign?
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf
“If you’re like me, the mere mention of Big Data now turns your stomach…. Why all the fuss? Why, indeed. Essentially, Big Data is a marketing campaign, pure and simple.”
Stephen Few

5

Page 6:

Big data

Clearly I am not like Stephen Few. I don’t believe I have a particular axe to grind; I simply find this interesting.
This talk is designed to try to explain:
what Big Data is
what characteristics we have found useful
why it may be of interest to you
a paradox

6

Page 7:

Data

All computer applications manipulate data

7

Page 8:

Data

So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application

8

Page 9:

Data

So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application
Which led directly to the development of database engines and, ultimately, relational ones (DB2, Oracle, SQL Server)

9

Page 10:

Data

Data has always existed in two, very broad, flavours…..

Data that is treated as small, discrete packages and is a good fit with the relational way of storing and querying data

Data that is not as above

10

Page 11:

Data is stored in tables

11

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

Page 12:

Data is stored in tables

Car

12

Each table has a name

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

Page 13:

Data is stored in tables

13

Data is atomic

Car

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

Page 14:

Data is stored in tables

14

Columns

Car

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

Page 15:

Columns

Data is stored in tables

15

Rows

Car

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

Page 16:

Data is stored in tables

16

Each row represents a unique entity in the ‘real’ world……

Car

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

Page 17:

17

Page 18:

Data

The manipulation typically consists of sub-setting the data by rows and columns and then maybe doing some sums:

    SELECT Make, Model    (chooses the columns)
    FROM Car
    WHERE Year < 1947     (chooses the rows)
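As a minimal sketch, the query on this slide can be run against the Car table from the earlier slides using Python's built-in `sqlite3` module. The column types and the choice of `LicenceNo` as primary key are assumptions made for the illustration; the slides do not state them.

```python
import sqlite3

# In-memory database; the schema details (types, key) are assumed for illustration.
car_db = sqlite3.connect(":memory:")
car_db.execute(
    "CREATE TABLE Car (LicenceNo TEXT PRIMARY KEY, Make TEXT, "
    "Model TEXT, Year INTEGER, Color TEXT)"
)
car_db.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk.VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk.VI", 1949, "Red"),
    ],
)

# SELECT chooses the columns, WHERE chooses the rows.
pre_1947 = car_db.execute(
    "SELECT Make, Model FROM Car WHERE Year < 1947"
).fetchall()
print(pre_1947)
```

Only the 1946 Bentley satisfies the row condition, so a single (Make, Model) pair comes back.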

18

Page 19:

Data

Note that this kind of manipulation is treating the data as atomic, which is fine, because the relational model assumes atomicity of data
Note also that the rows are unordered

19

Page 20:

Data

Data has always existed in two, very broad, flavors…..

Data that is inherently atomic and is a good fit with the relational way of storing and querying data

Data that is not as above

20

Page 21:

Examples

Examples of ‘other’ data:

Images
Music
Word docs
Sensor data
Web logs
Twitter
Machines
Point of Sale
Mass spectrometers

21

Page 22:

What’s in a name?

So, what do we call the ‘rest’?

Un-structured? Semi-structured? Multi-structured? Non-relational? Non-tabular?

22

Page 23:

What’s in a name?

What about:

Big data?

23

Page 24:

Other definitions?

V V V v v v v

Volume
Variety
Velocity
Value
Very interesting
Various other variations beginning with V…

24

Page 25:

Big Data – not new?

So why have we focused, for the last 30 years, almost exclusively on the first flavor?
Because it:
is easy (relatively easy – Jim Gray*)
represents a significant proportion of the available data

*Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques (1993). Turing Award 1998

25

Page 26:

Big Data has come of age

Two factors have changed

Rise of the Machines
Increase in data capture

There is a great synergy here

We are acquiring far more big data and we now have the computational power to extract the information it contains

26

Page 27:

Big Data is hard

Of the 3 Vs, perhaps the most important is Variety
We often want to look inside the data

Frequently non-atomic
Need custom functions for virtually every operation
“Find the rotating wing aircraft in the image”
“Identify the best customer”
“What does the blogosphere think of our company?”

27

Page 28:

Examples Log file Mass spec. Images

28

Big Data

Page 29:

Big Data

The problem here is that the order of the rows is significant
We want to know which page views lead to other page views
Of course we CAN do that in SQL, but it may not be efficient to do so
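To illustrate why row order matters for log files, here is a small sketch that counts which page views lead to which other page views. The log entries, the `(session, timestamp, page)` layout and the function name `page_transitions` are all hypothetical; the slides do not show the actual log format.

```python
from collections import Counter

# Hypothetical web log: (session_id, timestamp, page). Not taken from the slides.
LOG = [
    ("s1", 1, "/home"), ("s2", 1, "/home"), ("s1", 2, "/products"),
    ("s2", 2, "/about"), ("s1", 3, "/basket"), ("s2", 3, "/products"),
]

def page_transitions(log):
    """Count (from_page, to_page) pairs within each session, in time order."""
    transitions = Counter()
    previous = {}  # last page seen so far for each session
    for session, _, page in sorted(log, key=lambda row: (row[0], row[1])):
        if session in previous:
            transitions[(previous[session], page)] += 1
        previous[session] = page
    return transitions

TRANSITIONS = page_transitions(LOG)
for (src, dst), n in sorted(TRANSITIONS.items()):
    print(f"{src} -> {dst}: {n}")
```

The answer depends entirely on visiting the rows in order: each row is only meaningful relative to the one before it in the same session, which is the property a set-based query has to work around.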

29

Page 30:

Examples Log file Mass spectrometer Image

30

Big Data

Page 31:

31

Page 32:

Big Data

The problem here is that the order of the rows is significant (as before)

And the number of rows is likely to be overwhelming

32

Page 33:

SQL-MapReduce, Reduce Function

Data output from Mass Spectrometer

m/z          intensity
335.2094368  0
335.2105961  0
335.2117553  0
335.2129146  53.024086
335.2140739  184.1607361
335.2152332  264.3601074
335.2163925  259.6187134
335.2175518  239.7870178
335.2187111  313.8243713
335.2198704  490.8760071
335.2210297  634.064209
335.222189   589.8432007
335.2233483  351.9743347
335.2245077  65.21440887
335.225671   0
336.890869   0
336.892037   75.75605011
336.893205   179.8110657
336.894373   247.535553
336.895541   225.6489563
336.8967091  140.6246338
337.1257588  0
337.1280972  86.48993683
337.1292664  170.0835876
337.1304357  215.8146362
337.1316049  188.9733276
337.1327741  110.2854233
337.1912444  0
337.192414   0
337.1935835  143.2112122
337.1947531  357.401123
337.1959227  467.1167297
337.1970923  411.569458
337.1982619  245.5514221
337.1994315  80.80451202

Detecting centroids of peaks is highly complex using SQL as it is not a set based operation
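A rough sketch of the procedural alternative, using the rows shown on this slide: walk the data once in m/z order, treat each run of non-zero intensities as one curve, and report its intensity-weighted mean m/z as the centroid. This is a deliberate simplification and not the author's algorithm: it ignores the overlapping (double) peaks and intensity thresholds that the Java excerpt on a later slide handles, and the names `ROWS` and `centroids` are made up for the illustration.

```python
# (m/z, intensity) rows copied from the slide, already ordered by m/z.
ROWS = [
    (335.2094368, 0.0), (335.2105961, 0.0), (335.2117553, 0.0),
    (335.2129146, 53.024086), (335.2140739, 184.1607361),
    (335.2152332, 264.3601074), (335.2163925, 259.6187134),
    (335.2175518, 239.7870178), (335.2187111, 313.8243713),
    (335.2198704, 490.8760071), (335.2210297, 634.064209),
    (335.222189, 589.8432007), (335.2233483, 351.9743347),
    (335.2245077, 65.21440887), (335.225671, 0.0), (336.890869, 0.0),
    (336.892037, 75.75605011), (336.893205, 179.8110657),
    (336.894373, 247.535553), (336.895541, 225.6489563),
    (336.8967091, 140.6246338), (337.1257588, 0.0),
    (337.1280972, 86.48993683), (337.1292664, 170.0835876),
    (337.1304357, 215.8146362), (337.1316049, 188.9733276),
    (337.1327741, 110.2854233), (337.1912444, 0.0), (337.192414, 0.0),
    (337.1935835, 143.2112122), (337.1947531, 357.401123),
    (337.1959227, 467.1167297), (337.1970923, 411.569458),
    (337.1982619, 245.5514221), (337.1994315, 80.80451202),
]

def centroids(rows):
    """Single ordered pass: one weighted-mean m/z per run of non-zero intensity."""
    result = []
    sum_i = 0.0     # total intensity in the current run
    sum_mz_i = 0.0  # total of m/z * intensity in the current run
    for mz, intensity in rows:
        if intensity > 0:
            sum_i += intensity
            sum_mz_i += mz * intensity
        elif sum_i > 0:  # a zero row closes the current run
            result.append(sum_mz_i / sum_i)
            sum_i = sum_mz_i = 0.0
    if sum_i > 0:        # close a run that reaches the end of the data
        result.append(sum_mz_i / sum_i)
    return result

PEAKS = centroids(ROWS)
for peak in PEAKS:
    print(f"{peak:.4f}")
```

The loop relies on seeing each row next to its neighbours, which is exactly what makes the problem awkward to express as a set-based operation.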

33

Page 34:

SELECT file_id
    ,scan_id
    ,ren_tm
    ,ms_lvl
    ,mz
    ,i AS n_
    ,SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS p_i
    ,(CASE WHEN (i > 0) THEN 1 ELSE 0 END) AS Ind
    ,(Ind - SUM(ind) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))
    ,CAST((CASE WHEN B = 1 THEN CSUM(1,Ind) WHEN B = 0 AND Ind = 1 THEN 0 ELSE NULL END) AS DECIMAL(38,0)) AS CurveID
FROM dd_stg.mzml
WHERE ms_lvl = 1
) WITH DATA PRIMARY INDEX (mz)

Almost 800 lines of complex SQL

,(weighted_peak_mz * chrg) / 700000.000000000000000 AS delta_mz
,CASE WHEN (
    (CASE WHEN SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
               BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz)
          THEN 'Y' ELSE NULL END) = 'Y'
    OR (CASE WHEN SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
               BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz)
          THEN 'Y' ELSE NULL END) = 'Y'
    OR (CASE WHEN SUM((weighted_peak_mz * chrg)) OVER (PARTITION BY file_id, ms_lvl ORDER BY Weighted_peak_mz, scan_id ROWS BETWEEN 2 PRECEDING AND 2 PRECEDING)
               BETWEEN ((weighted_peak_mz * chrg) - delta_mz) AND ((weighted_peak_mz * chrg) + delta_mz)
          THEN 'Y' ELSE NULL END) = 'Y'
    OR (CASE WHEN

SELECT file_id,scan_id,ren_tm,ms_lvl,mz
    ,i
    ,CASE WHEN ind = 1
          THEN SUM(CurveID+Mark) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz, ind ROWS UNBOUNDED PRECEDING)
          ELSE NULL END AS CurveNum
FROM (SELECT file_id,scan_id,ren_tm,ms_lvl,mz,n_I AS i
    ,CASE WHEN (
        (CASE WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0 END)
        - SUM(CASE WHEN n_i - p_i > 0 THEN 1 WHEN n_i - p_i < 0 THEN -1 ELSE 0 END) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
      ) = 2 THEN 1 ELSE 0 END AS Mark
    ,Ind
    ,B
    ,CurveID

SELECT A.file_id
    ,A.ren_tm
    ,A.scan_id
    ,A.ms_lvl
    ,A.CurveNum
    ,A.Weighted_Peak_mz
    ,A.ren_tm
    ,A.sum_i
    ,A.ren_tm - B.ren_tm AS Diff_Ren_Tm
    ,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP
    ,B.CurveNum AS L_CurveNum
    ,B.Weighted_Peak_mz AS L_Weighted_Peak_mz
    ,B.ren_tm AS L_ren_tm
    ,B.sum_i AS L_Sum_I
FROM DD_STG.S2_WEIGHTED_CURVE AS A
INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B
    ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000
    AND A.ren_tm = B.ren_tm
    AND A.CurveNum <> B.CurveNum
    AND B.max_i > (0.66667 * A.max_i)

    ,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP
    ,B.CurveNum AS L_CurveNum
    ,B.Weighted_Peak_mz AS L_Weighted_Peak_mz
    ,B.ren_tm AS L_ren_tm
    ,B.sum_i AS L_Sum_I
FROM DD_STG.S2_WEIGHTED_CURVE AS A
INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B
    ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000
    AND A.ren_tm = B.ren_tm
    AND A.CurveNum <> B.CurveNum
    AND B.max_i > (0.66667 * A.max_i)
) AS J
LEFT JOIN DD_TAB.CHARGE_STATES AS C
    ON CAST(J.Diff_WP AS DECIMAL(18,2)) = CAST(C.chrg_mz_diff AS DECIMAL(18,2))

34

Page 35:

Procedural code uses 2 loops for same result

while (inputIterator.advanceToNextRow()) {
    currIntensity=inputIterator.getDoubleAt(5);
    maxIntensity=0.0;
    //Initialise Temp Array
    for (int i=0; i <= 50; i++){
        curveArray[0][i]=0;
        curveArray[1][i]=0;
    }
    if (overlapFlag==1){
        count = 1;
    } else {
        count = 0;
    }
    //Find start of Curve, lastintensity is 0
    //or previous lastintensity is higher than lastintensity – overlapping peaks (double peak curve)
    if (currIntensity > 0 && lastIntensity == 0 || overlapFlag==1){
        //Populate Temp Array with Curve points and find maxIntensity to derive threshold
        while (currIntensity > 0){
            if(maxIntensity < currIntensity) maxIntensity=currIntensity;
            if (overlapFlag==1){
                overlapFlag=0;
                curveArray[0][count-1]=overlapMZ;
                curveArray[1][count-1]=overlapIntensity;
                PI = overlapIntensity;
                currIntensity=inputIterator.getDoubleAt(5);
            }
            curveArray[0][count]=inputIterator.getDoubleAt(4);
            curveArray[1][count]=inputIterator.getDoubleAt(5);
            count++;
            inputIterator.advanceToNextRow();
            PI2 = PI;
            PI = currIntensity;
            currIntensity=inputIterator.getDoubleAt(5);
            if (currIntensity > PI && PI2 > PI){
                //Overlapping Peak found, store MZ and Intensity and start new Curve for next Iteration
                overlapFlag=1;
                overlapMZ=inputIterator.getDoubleAt(4);
                overlapIntensity=inputIterator.getDoubleAt(5);
                break;
            }
        }
        //Process Temp Array to create intermediate metrics
        while (curveArray[1][curveCount] > 0){
            if (curveArray[1][curveCount] > intensityThreshold){
                if (maxMZ < curveArray[0][curveCount]){
                    maxMZ=curveArray[0][curveCount];
                }
                if (minIntensity > curveArray[1][curveCount] || minIntensity == 0){
                    minIntensity=curveArray[1][curveCount];
                }
                if (minMZ > curveArray[0][curveCount] || minMZ == 0){
                    minMZ=curveArray[0][curveCount];
                }
                sumIntensity=sumIntensity+curveArray[1][curveCount];
                sumMZ=sumMZ+curveArray[0][curveCount];
                sumMZByIntensity=sumMZByIntensity+(curveArray[0][curveCount]*curveArray[1][curveCount]);
                curvePoints++;
            }
            curveCount++;
        }

35

Page 36:

Custom analysis and custom visualisation – vital tools in understanding big data

36

Page 37:

Examples Log file Mass spec. Images

37

Big Data

Page 38:
Page 39:

What is Big Data?

Examples Log file Mass spec. Images

BIG DATA

Page 40:

Big Data

The problem here is that this data is not atomic
A picture is worth a thousand words
In other words, we don’t know what question we may ask in the future

40

Page 41:

Just as you can always fit an aircraft engine into a car chassis, you can always put Big Data in a table, but it reaches a stage where it is no longer the most effective solution to do so
The analysis is not sub-setting the data by rows and columns
We are often interested in order
Each class of big data usually requires a (lovingly hand-crafted) custom analysis

41

Summary so far…

Page 42:

Big Data

Which means that we are going to be storing the data without imposing a schema upon it – in other words, we are going to be storing the data “schema-less”

42

Page 43:

Paradox

The relational model guarantees that any question can be asked of the data and that a consistent answer will be delivered. How does that work?

Big data doesn’t impose a schema on the data; the data is stored schema-less. One reason for this is that a schema would restrict the questions that we can ask of the data. How does that work?

43

Page 44:

Paradox

The paradox is that we are saying that:
if we impose a strict schema we can ask (and answer) any question
if we impose no schema we can ask (and answer) any question

44

Page 45:

Paradox

The relational model doesn’t restrict the questions that can be asked of data

This is essentially true as long as (note the qualification there!):
we are treating the data as atomic
we are not looking for order in the rows

We can subset by row and column and we can do difficult sums on the end result.

So, does the relational model allow us to drill inside the atomic data?
Well, no, but relational database engines often do
The model assumes atomic data
A query that finds all the last names of the employees paid more than $40,000 is relational
A query that finds all the employees where the third letter of their last name is ‘c’ is not relational
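The two queries can be sketched side by side with Python's built-in `sqlite3` module. The `Employee` table, its columns and its rows are invented for the illustration. The first query treats `LastName` as an atomic value; the second only works because the engine supplies `SUBSTR`, a function that looks inside the value, which is the sense in which engines go beyond the model.

```python
import sqlite3

# Hypothetical Employee table; names and salaries are made up for illustration.
emp_db = sqlite3.connect(":memory:")
emp_db.execute("CREATE TABLE Employee (LastName TEXT, Salary INTEGER)")
emp_db.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Jackson", 52000), ("Smith", 38000), ("Lucas", 41000), ("Archer", 30000)],
)

# Relational: rows are chosen by comparing whole (atomic) values.
well_paid = emp_db.execute(
    "SELECT LastName FROM Employee WHERE Salary > 40000 ORDER BY LastName"
).fetchall()

# Not relational in the strict sense: SUBSTR drills inside the LastName value.
third_letter_c = emp_db.execute(
    "SELECT LastName FROM Employee "
    "WHERE SUBSTR(LastName, 3, 1) = 'c' ORDER BY LastName"
).fetchall()

print(well_paid)
print(third_letter_c)
```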

45

Page 46:

Paradox

So, if the data is atomic and is treated as atomic, and order is unimportant, then the relational model allows any question to be asked

46

Page 47:

Paradox

Storing data in an ‘unstructured’ way allows you to ask any question of the data

This is essentially true as long as (note the qualification here!): we are prepared to design new functions for every new type of query that we want to run
So, imagine that you have some satellite images. They are stored without a schema being imposed by the database engine
We want to find rotating wing aircraft – so we write an algorithm that does that
Now we want to find all the penguins – we need another custom algorithm
So, a schema-less database allows any question to be asked of the data – as long as we are prepared to write a new custom algorithm for each new type of query
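The same trade-off can be sketched on a much smaller scale than satellite images. Below, a handful of hypothetical text posts are stored as opaque strings with no schema, and each new question needs its own hand-written function. The posts, the `@acme` handle, the word lists and both function names are invented for the illustration, and the sentiment scoring is deliberately naive.

```python
# Hypothetical raw posts, stored as-is with no schema imposed.
RAW_STORE = [
    "Loving the new @acme phone, great battery!",
    "@acme support was terrible today",
    "Great coffee this morning",
]

def count_mentions(store, handle):
    """Custom function for question 1: how many items mention a handle?"""
    return sum(1 for item in store if handle.lower() in item.lower())

def crude_sentiment(store, positive=("great", "loving"), negative=("terrible",)):
    """Custom function for question 2: positive minus negative word count."""
    score = 0
    for item in store:
        words = [word.strip(".,!?").lower() for word in item.split()]
        score += sum(word in positive for word in words)
        score -= sum(word in negative for word in words)
    return score

print(count_mentions(RAW_STORE, "@acme"))
print(crude_sentiment(RAW_STORE))
```

Nothing in the store had to change between the two questions; all the effort went into writing a new function, which is the qualification the slide points at.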

47

Page 48:

Case Study

Oil Rig data Gone fishing Sensor data

48

Page 49:

Case Study

Twitter Who loves you?

Social/text/sentiment

49

Page 50:

Case Study

Big Data in the Life Sciences World The massed spectrometers Why would anyone do that?

50

Page 51:

Lessons learned

Engagement

Choose your battles – look for an area where you can gain competitive advantage
Choose your platform carefully
Programming – algorithm development
Data scientists
Custom algorithms
Custom visualisations

51

Presenter
Presentation Notes
4 minutes
Page 52:

Global Sponsor:

Questions?

Page 53:

Global Sponsor:

Thank You for Attending