Transcript

copyright 2015

Big Data: debunking some of the myths

Chris Swan @cpswan

Agenda

• My background

• What do I mean by big data?

• Know your algorithm

• Know your data

• Performance

My background

• CTO

• CTO Client Experience

• Co-head CTO Security

• Corporate Finance fintech, early stage

• IT R&D – Networks and security

• Grid, app server engineering

• Combat System Engineer

Recent adventure with Big Data

Misquoting Roger Needham

Whoever thinks their analytics problem is solved by big data doesn’t understand their analytics problem and doesn’t understand big data

What do I mean by ‘big data’?

Overview

Based on a blog post from April 2012 – http://is.gd/swbdla

[Figure: Problem Types chart. Axes: Algorithm Complexity and Data Volume. Regions: Simple, Big Data, Quant.]

Simple problems

Low data volume, low algorithm complexity

Quant Problems

Any data volume, high algorithm complexity

Big Data Problems

High data volume, low algorithm complexity

Types of Big Data Problem:

1. Inherent

2. More data gives a better result than a more complex algorithm
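The three problem types described on the preceding slides can be sketched as a small classifier. This is only an illustration: the function name `classify_problem` and the idea of reducing each axis to a yes/no "high" flag are assumptions made here, not something the slides define, and where the boundary between low and high sits is left to the reader.

```python
def classify_problem(data_volume_high: bool, algorithm_complexity_high: bool) -> str:
    """Map the two axes of the Problem Types chart to a problem type.

    Hypothetical helper: the slides give no thresholds, so each axis
    is reduced to a boolean supplied by the caller.
    """
    if algorithm_complexity_high:
        # Quant problems: any data volume, high algorithm complexity.
        return "Quant"
    if data_volume_high:
        # Big Data problems: high data volume, low algorithm complexity.
        return "Big Data"
    # Simple problems: low data volume, low algorithm complexity.
    return "Simple"


for volume_high in (False, True):
    for complexity_high in (False, True):
        label = classify_problem(volume_high, complexity_high)
        print(f"volume_high={volume_high}, complexity_high={complexity_high}: {label}")
```

Checking algorithm complexity first reflects the "any data volume" wording used for Quant problems.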

The good, the bad and the ugly of Big Data

Good - Lots of new tools, mostly open source

Bad - Term being abused by marketing departments

Ugly - Can easily lead to over-reliance on systems that lack transparency and ignore specific data points: 'Computer says no', but nobody can explain why

It’s important to know your algorithms

Turning an assumption into a line

There are lots of algorithms to understand

Statisticians

Quants

Data scientist

It’s also important to know your data

Whatever we call our ‘experts’

Who’s heard of Anscombe’s quartet?

Same statistical properties, but…

http://en.wikipedia.org/wiki/Anscombe's_quartet
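The point of the quartet can be checked numerically. The sketch below uses the published Anscombe values and plain Python to compute the summary statistics for each of the four datasets; the names `QUARTET` and `summarize` are just illustrative choices made here. The means, variances, correlation and fitted line come out nearly identical across all four, even though plotting the points shows four very different shapes.

```python
from math import sqrt

# Anscombe's quartet: datasets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, Pearson r and the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Summary statistics alone cannot tell these datasets apart, which is why looking at the data itself matters before trusting a fitted line.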

Performance

Don’t agonise over distros

The performance of Hadoop distros is all the same to within 1 server within a cluster

Stefan Groschupf, one of the creators of Hadoop

Small is still beautiful

Because latency

In terms of distance

http://loci.cs.utk.edu/dsi/netstore99/docs/presentations/keynote/sld023.htm

Interactive > Real time

Questions?

