Defining & Characterizing Big Data
DESCRIPTION
Presentation by James Barkley at the MIT Big Data Explorers "Crash Course" on 9/20/2014. "Defining & Characterizing Big Data". http://www.mitbigdataexplorers.com/
TRANSCRIPT
Defining & Characterizing Big Data
Big Data Crash Course,
an event sponsored by the MIT-SDM Big Data Explorers Club
Jim Barkley
The MITRE Corporation
MIT SDM Fellow
September 20, 2014
2
Defining Big Data
“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”
-Wikipedia
“Big Data is when the size of the data itself becomes [a significant] part of the problem.”
-O’Reilly Media
“large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”
-National Science Foundation
3
Characterizing “Big”
4
“Numbers Everyone Should Know” – SoCC 2010 Keynote, Jeffrey Dean, Google
5
• Preserving Privacy Values
• Responsible Educational Innovation
• Big Data & Discrimination
• Law Enforcement & Security
• Data as a Public Resource
Big Data and Government
http://www.whitehouse.gov/BigData
6
• National Science Foundation
• Department of Defense
• National Institutes of Health
• Department of Energy
• US Geological Survey
Big Data Research and Development Initiative
• DARPA XDATA
• http://www.darpa.mil/OpenCatalog/index.html
7
Big Data & Industry
http://blogs.the451group.com/information_management/2012/11/02/updated-database-landscape-graphic/
8
• Algorithms
• Bio-/Health-/Life- Sciences
• Infrastructure/City-Related
• Massive Scale/Data Optimization
• Risk; Privacy; Policy
• Social Media-Related Projects
• Visual/Scene Recognition
Big Data & MIT
9
• LABS:
– CSAIL; Intel Science & Tech Center
– MIT Geospatial Data Center
– MIT Information Quality (MITIQ)
– MIT Initiative on the Digital Economy (IDE)
– Laboratory for Information & Decision Systems (LIDS)
– Operations Research Center (ORC); Accenture & MIT Alliance on Business Analytics
– W3C Consortium Big Data Community Group
Big Data & MIT
• Example Research Project:
• TUNABLE FAST SIMILARITY SEARCH FOR HIGH-DIMENSIONAL DATA
• “Locality-Sensitive Hashing (LSH) is an efficient algorithm for finding pairs of similar (or highly correlated) objects in a database without enumerating all pairs of such objects. Example applications include searching for near-duplicate documents, similar images, highly correlated stocks etc.”
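The idea in the quote above, finding similar pairs without enumerating all pairs, can be illustrated with a minimal sketch of one common LSH scheme: MinHash signatures split into bands, applied to the near-duplicate documents case. The slide does not say which LSH family the MIT project uses, so the scheme itself, the function names (`shingles`, `minhash_signature`, `candidate_pairs`), the shingle size, and the band count are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, item):
    # Deterministic 64-bit hash; each seed acts as a different hash function.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=32):
    """MinHash signature: the minimum hash of the set under each hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify the candidates LSH proposes."""
    return len(a & b) / len(a | b)


def candidate_pairs(docs, num_hashes=32, bands=16):
    """Return pairs of document ids whose signatures agree on at least one band.

    Only documents that share a bucket are paired, so the full set of
    pairs is never enumerated. Illustrative parameters, not tuned values.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank",
        "b": "the quick brown fox jumps over the lazy dog near the river bend",
        "c": "stock prices moved sharply after the quarterly earnings report",
    }
    for x, y in sorted(candidate_pairs(docs)):
        print(x, y, round(jaccard(shingles(docs[x]), shingles(docs[y])), 2))
```

More bands with fewer rows each makes the sketch more likely to flag weakly similar pairs; fewer bands with more rows makes it stricter. Because banding is probabilistic, candidates are normally re-checked with an exact measure such as `jaccard`.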
10
• Thanks for coming today. You choose your own level of involvement
– “Some are naturally average, some settle, and some have mediocrity thrust upon them”
• Club Goals & Vision:
1. Serve as a learning platform through planned activities, sharing, and collaboration
2. Serve as a networking tool
3. Serve as an incubator for projects, investigations, research, and startups
• Future ideas for club activities:
– Use the listserv!!! Share articles and have discussions.
– Monthly club meetings (1-2 hours, touch base on current efforts, make club members present)?
– Small BOF sessions around specific technologies (e.g., MongoDB) or domains (e.g., Health Care)
– More full-day or weekend events. Hackathons, unconferences
– Subsidize local area conference attendance
– Kaggle team-ups and hacking for profit
Big Data Explorers & You
“Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts . . . A graphic representation of data abstracted from the banks of every computer in the human system.”
-William Gibson, Neuromancer