Big Data and Large Scale Data Analysis, Andrew Mead, School of Life Sciences, 23rd October 2013


Page 1

Big Data and Large Scale Data Analysis

Andrew Mead
School of Life Sciences

23rd October 2013

Page 2

Big Data

• Modern technologies make it increasingly easy to collect large quantities of data
  – ‘Omics revolution
  – Remote sensing
  – Weather (and hence climate change applications)
  – Internet applications
  – Social networking
  – Shopping preferences
  – Health applications
  – …

• But how do we make the most of these data?

Page 3

Gene expression microarrays

• Data on many thousands of genes (spots) on each array

• Comparisons of multiple samples (treatments, time, individual plants or animals, …)

• Processing of data for each gene separately or in combination
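As an illustration of processing each gene separately, the sketch below computes one Welch t-statistic per gene for a two-group comparison. The gene names, the expression values, and the helper `welch_t` are hypothetical and are not taken from the slides; this is one possible per-gene summary, not the analysis used in the talk.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical expression values: gene -> (control arrays, treated arrays).
expression = {
    "gene_A": ([5.1, 4.9, 5.0, 5.2], [7.0, 6.8, 7.1, 6.9]),
    "gene_B": ([3.0, 3.2, 2.9, 3.1], [3.1, 3.0, 3.2, 2.9]),
    "gene_C": ([8.0, 8.3, 7.9, 8.1], [6.0, 6.2, 5.9, 6.1]),
}

# One statistic per gene, computed independently of all other genes.
t_stats = {gene: welch_t(ctrl, trt) for gene, (ctrl, trt) in expression.items()}

# Rank genes by the size of the group difference relative to its noise.
ranked = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
```

With many thousands of genes, statistics like these would normally need a multiple-testing adjustment before being interpreted.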

Page 4

Landscape data

• Land-use/cover for each land-parcel

• Basis for simulation studies of changes in land-use

• Summary of spatial data into simple statistics
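A minimal sketch of reducing parcel-level data to a simple statistic: the share of total area under each land cover. The parcel identifiers, cover classes, areas, and the function `area_share` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical parcels: (parcel id, land cover, area in hectares).
parcels = [
    ("p1", "arable", 12.0),
    ("p2", "grass", 8.0),
    ("p3", "arable", 20.0),
    ("p4", "wood", 10.0),
]


def area_share(parcels):
    """Return the fraction of total area occupied by each land cover."""
    area_by_cover = defaultdict(float)
    for _parcel_id, cover, area_ha in parcels:
        area_by_cover[cover] += area_ha
    total = sum(area_by_cover.values())
    return {cover: area / total for cover, area in area_by_cover.items()}


shares = area_share(parcels)
```

Recomputing the same summary for each simulated year would give a compact way to track change in land use across a simulation run.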

[Figure: JCA101 - Simulation01 - Run001 - Year2009]

Page 5

Challenges

• Storage of big data sets
• Management
  – Structured
  – Unstructured
• Analysis
  – Often similar questions to those asked of smaller data sets
  – Can become computationally intractable as data volume increases

Page 6

Multivariate Statistics and Data Mining

• Dimensionality reduction
  – Find the important combinations of variables
  – Use these in models

• Use computing power to search for “patterns”

• Challenge in connecting the analysis process to the data

• Distributed computing, massively parallel processing (MPP), machine learning, search-based applications (SBA), …
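One common way to find an "important combination of variables" is principal component analysis. The sketch below finds the leading principal component by power iteration on the sample covariance matrix, using only the standard library. The data and the function `first_principal_component` are hypothetical, and a numerical library would normally be used for data of any real size.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component."""
    n, p = len(rows), len(rows[0])
    # Centre each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration: repeatedly multiply by the covariance and renormalise.
    # Assumes the start vector is not orthogonal to the leading direction.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each observation onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores


# Hypothetical data: the second variable is roughly twice the first,
# while the third is mostly noise.
data = [
    [2.0, 4.1, 1.0],
    [3.0, 6.0, 1.2],
    [4.0, 7.9, 0.9],
    [5.0, 10.1, 1.1],
    [6.0, 12.0, 1.0],
]

direction, scores = first_principal_component(data)
```

The `scores` give a single derived variable per observation that could then be used in a model in place of the original three.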

Page 7

Statistics and Big Data

• Computing power is probably crucial!
• But statistical approaches are important
  – Designing the data collection
    • Sub-sampling?
  – Defining the problem
  – Managing the data
  – Dimension reduction

• Finding the signal amidst the noise!
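As one possible answer to the sub-sampling question, the sketch below draws a uniform random sample of fixed size from a data stream that is too large to hold in memory, using reservoir sampling (Algorithm R). The function name `reservoir_sample` and the example stream are hypothetical; the slides do not prescribe a particular sampling scheme.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stream of 100,000 record identifiers, read once.
subset = reservoir_sample(range(100_000), k=1000, seed=42)
```

Only `k` items are held in memory at any time, so the analysis can then proceed on `subset` with the methods used for smaller data sets.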