Big Data and Large Scale Data Analysis, Andrew Mead, School of Life Sciences, 23rd October 2013
TRANSCRIPT
![Page 1: Big Data and Large Scale Data Analysis Andrew Mead School of Life Sciences 23 rd October 2013](https://reader036.vdocuments.us/reader036/viewer/2022081603/56649f135503460f94c26a24/html5/thumbnails/1.jpg)
Big Data and Large Scale Data Analysis
Andrew Mead
School of Life Sciences
23rd October 2013
Big Data
• Modern technologies make it increasingly easy to collect large quantities of data
  – ‘Omics revolution
  – Remote sensing
  – Weather (and hence climate change applications)
  – Internet applications
  – Social networking
  – Shopping preferences
  – Health applications
  – …
• But how do we make the most of these data?
Gene expression microarrays
• Data on many thousands of genes (spots) on each array
• Comparisons of multiple samples (treatments, time, individual plants or animals, …)
• Processing of data for each gene separately or in combination
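As a rough illustration of processing each gene separately, the sketch below computes a Welch t statistic per gene to compare two groups of arrays, then ranks genes by the size of that statistic. The gene names, the expression values and the choice of Welch's t are all assumptions made for the example; the slides do not specify a particular method.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treated):
    """Welch's t statistic for two independent samples (unequal variances).

    Raises ZeroDivisionError if both samples have zero variance.
    """
    se = sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se


# Hypothetical log-expression values: gene -> (control arrays, treated arrays).
expression = {
    "gene_a": ([5.1, 4.9, 5.0], [7.0, 7.2, 6.8]),
    "gene_b": ([3.0, 3.4, 3.2], [3.1, 3.3, 3.2]),
}

# One statistic per gene, computed independently of every other gene.
t_by_gene = {gene: welch_t(c, t) for gene, (c, t) in expression.items()}

# Rank genes by the absolute size of the statistic, largest first.
ranked = sorted(t_by_gene, key=lambda g: abs(t_by_gene[g]), reverse=True)
```

The same loop scales to many thousands of genes because each gene is handled on its own; analysing genes "in combination" would instead need a multivariate method.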
Landscape data
• Land-use/cover for each land-parcel
• Basis for simulation studies of changes in land-use
• Summary of spatial data into simple statistics
[Figure: JCA101 - Simulation01 - Run001 - Year2009]
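One way to picture "summary of spatial data into simple statistics" is to reduce a land-cover grid to the proportion of parcels in each class, plus a single diversity index. The grid, the class names and the use of the Shannon index below are hypothetical choices for illustration, not taken from the simulation shown on the slide.

```python
from collections import Counter
from math import log

# Hypothetical land-cover grid: one class label per land parcel (cell).
grid = [
    ["arable", "arable", "grass", "wood"],
    ["arable", "grass", "grass", "wood"],
    ["urban", "arable", "grass", "wood"],
]

# Count parcels per class and convert the counts to proportions.
counts = Counter(cell for row in grid for cell in row)
total = sum(counts.values())
proportions = {cover: n / total for cover, n in counts.items()}

# Shannon diversity index: one number summarising the mix of land covers.
shannon = -sum(p * log(p) for p in proportions.values())
```

Repeating this for each simulated year would give a small table of summary statistics in place of a full map per run.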
Challenges
• Storage of big data sets
• Management
  – Structured
  – Unstructured
• Analysis
  – Often similar questions as for smaller data sets
  – Computationally intractable as data volume increases
Multivariate Statistics and Data Mining
• Dimensionality reduction
  – Find the important combinations of variables
  – Use these in models
• Use computing power to search for “patterns”
• Challenge in connecting the analysis process to the data
• Distributed computing, massively parallel processing (MPP), machine learning, search-based applications (SBA), …
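To make "find the important combinations of variables" concrete, here is a minimal sketch of principal component analysis, one common dimensionality reduction technique (the slides do not name a specific method). It approximates the leading principal component by power iteration on the sample covariance matrix. The function name, the data and the iteration count are assumptions for the example, and the pure-Python approach is only suitable for small inputs.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Approximate the leading principal component by power iteration.

    Returns the unit-length loading vector and the score of each row.
    Assumes the start vector is not orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centred variables.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centred row onto the component to get its score.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores


# Hypothetical data: three variables, the first two strongly correlated.
data = [
    [2.0, 4.1, 0.5],
    [3.0, 6.0, 0.4],
    [4.0, 7.9, 0.6],
    [5.0, 10.1, 0.5],
    [6.0, 12.0, 0.4],
]
loadings, scores = first_principal_component(data)
```

The loadings describe the combination of variables that carries most of the variation, and the scores are the single derived variable that could then be used in a model in place of the original three.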
Statistics and Big Data
• Computing power is probably crucial!
• But statistical approaches are important
  – Designing the data collection
    • Sub-sampling?
  – Defining the problem
  – Managing the data
  – Dimension reduction
• Finding the signal amidst the noise!
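The "Sub-sampling?" point can be illustrated with reservoir sampling, which draws a uniform random sample of fixed size from a data stream without holding the whole stream in memory. This is only one possible sub-sampling scheme, chosen here as an assumption; the function name, the seed and the stream are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stream of one million record identifiers.
subsample = reservoir_sample(range(1_000_000), k=100, seed=42)
```

A well-designed sub-sample of this kind lets the questions asked of smaller data sets be answered with standard methods, at the cost of some precision.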