Big data now playing ..... at the sandbox

Post on 03-Jan-2016

DESCRIPTION

Big data now playing ..... at the sandbox. John.Dunne@cso.ie, 17th October 2014, IAOS, Vietnam. Overview: context; how CSO got interested in big data; the sandbox; learning from other industries; learning from the past; the sandbox – looking to the future.

TRANSCRIPT

Big data now playing ..... at the sandbox

John.Dunne@cso.ie

17th October 2014

IAOS, Vietnam

Overview

• Context
• How CSO got interested in big data
• The sandbox
• Learning from other industries
• Learning from the past
• The sandbox – looking to the future
• Concluding comments

Keywords – big data, modernisation, sandbox

Big data – working definition

Data that is difficult to collect, store or process within the conventional systems of statistical organizations.

Either their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.

Do more with less

Mindset - Opportunities exist with secondary data sources

Legal environment

Data Protection

Official Statistics

Freedom of Information

Key: 3 legislative pillars

Modernisation and big data

2011 Conference of European Statisticians endorses modernisation strategy

2012 Big data on modernisation agenda

2013 ESSC Scheveningen memorandum on Big data and official statistics

2013 International Big data team gets going

2014 Big data on UNSC agenda

2014 The sandbox goes live at MSIS Dublin

2013 CSO Project - To determine household composition using smart metering data

Origin of data : Consumer Behaviour Trials in 2009 and 2010

• Over 5000 households in pilot
• 3 months baseline data (reading every 30 mins)
• Pre-trial survey using CATI

http://www.unece.org/stats/documents/2013.09.coll.html

Project with pilot data brought challenges

Pilot: 7 million data points per month
ICHEC helped out

Go live: 2160 million data points per month
Joe, we need a bigger computer
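The jump from pilot to go-live volumes can be checked with some quick arithmetic. The sketch below is illustrative only: it assumes a 30-day month and exactly 5,000 pilot households (the slides say "over 5000"), and the function name `readings_per_month` is made up for this example. The implied number of go-live meters is an inference from the quoted figures, not a number stated in the presentation.

```python
# Back-of-envelope check of the smart meter data volumes quoted above.
# Assumptions (not from the slides): a 30-day month, exactly 5,000 pilot
# households, and one reading per meter every 30 minutes at go-live too.

MINUTES_PER_DAY = 24 * 60


def readings_per_month(meters, interval_min=30, days=30):
    """Number of meter readings produced in one month."""
    readings_per_day = MINUTES_PER_DAY // interval_min  # 48 at 30-minute intervals
    return meters * readings_per_day * days


pilot_points = readings_per_month(5_000)
go_live_points = 2_160_000_000  # "2160 million data points per month"
implied_meters = go_live_points // readings_per_month(1)

print(f"Pilot: {pilot_points:,} data points per month")
print(f"Go live implies about {implied_meters:,} meters")
print(f"Scale-up factor: {go_live_points / pilot_points:.0f}x")
```

Under these assumptions the pilot produces about 7.2 million data points per month, in line with the 7 million quoted, and the go-live figure is roughly a 300-fold increase.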

https://www.ichec.ie/

The hardware on which the sandbox system is based is a High Performance Computing cluster called Stoney. The cluster has been hosted at the National University of Ireland, Galway since April 2009 and is composed of 60 compute nodes, each of which has two 2.8GHz Intel (Nehalem EP) Xeon X5560 quad-core processors, 48GB of RAM and a 1TB local disk. Each node is connected to two networks: an InfiniBand network for accessing the shared Lustre filesystem and for high performance communications, and a Gigabit Ethernet network for management tasks. In addition, a 20TB shared filesystem is available to all nodes.

ICHEC will dedicate 20 compute nodes to enable a Hadoop cluster with 160 cores, almost 1TB of RAM and 20TB of HDFS distributed storage.
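The Hadoop cluster totals follow directly from the per-node specification. The short sketch below reproduces them. The constant names and the `usable_hdfs_tb` helper are hypothetical, and the replication factor of 3 is the stock HDFS default, assumed here for illustration. The slides do not say how the sandbox cluster was configured.

```python
# Aggregate resources of the 20 Stoney nodes dedicated to Hadoop,
# derived from the per-node figures quoted above.

NODES = 20
CPUS_PER_NODE = 2       # two Xeon X5560 processors per node
CORES_PER_CPU = 4       # quad-core
RAM_GB_PER_NODE = 48
DISK_TB_PER_NODE = 1    # local disk per node

total_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
total_ram_gb = NODES * RAM_GB_PER_NODE
raw_hdfs_tb = NODES * DISK_TB_PER_NODE


def usable_hdfs_tb(raw_tb, replication=3):
    """Rough usable capacity once each block is stored `replication` times.

    replication=3 is the HDFS default and an assumption here, not a
    documented setting of the sandbox.
    """
    return raw_tb / replication


print(f"Cores: {total_cores}")
print(f"RAM: {total_ram_gb} GB")
print(f"Raw HDFS storage: {raw_hdfs_tb} TB")
print(f"Usable at 3x replication: {usable_hdfs_tb(raw_hdfs_tb):.1f} TB")
```

The 960GB of RAM matches the "almost 1TB" quoted. The replication line is only a reminder that raw and usable HDFS capacity differ.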

The sandbox

The sandbox provides an environment to

o test feasibility of remote access and processing
o test whether existing standards/models/methods can be applied to big data
o evaluate the usefulness of big data software tools
o learn by doing with respect to potential uses, advantages and disadvantages of big data
o facilitate further collaboration in the international community

The toys (data sources)

o Twitter data
o mobile phone data
o satellite imagery / aerial photography
o price data / job vacancy data via scraping
o scanner data / price data sourced via large vendors
o data from road traffic sensors
o smart meter data on electricity/gas consumption

Some of the players

To play, contact Steven.Vale@unece.org

Learning from other industries - technical partners can have a role to play

Data Clearing Houses

Exchange of data for billing purposes

ROW Mobile Network Operators

Irish Mobile Network Operators (MNOs)

Learning from the past - think about the bigger picture

Nordbotten, Thygesen and the statistical archive concept

http://www.census.gov/history/pdf/kraus-natdatacenter.pdf

http://blog.modernmechanix.com/the-national-data-center-and-personal-privacy/

The National Data Center and Personal Privacy By Arthur R Miller

Learning from the past - do not underestimate privacy concerns

The sandbox - looking to the future

o Centres for Research and Development ?
o Centres of Excellence ?
o Partner organisations for collecting, processing or storing data of a less or non sensitive nature ???
o Significant partner organisations enabling the collection, processing or storing of data of a sensitive nature ?????

Concluding remarks

• Think about bigger picture / broader system
• An open mind to the possibility of new partners
• Be open and transparent
• Don’t underestimate privacy concerns
• Continue to collaborate and share
