broad data (india 2015)
Tetherless World Constellation
Broad Data
Jim Hendler
Tetherless World Professor of Computer and Cognitive Science
Director, The Rensselaer Institute for Data Exploration and Applications (IDEA)
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler (twitter)
This talk
• What I’m not going to talk about much
  – The Semantic Web (per se)
    • http://www.slideshare.net/jahendler/semantic-web-the-inside-story
  – Social Machines
    • http://www.slideshare.net/jahendler/social-machines-oxford-hendler
  – My work with Watson and Cognitive Computing
    • http://www.slideshare.net/jahendler/watson-an-academics-perspective
    • http://www.slideshare.net/jahendler/watson-summer-review82013final
• What I am going to present
  – The rest of the big data story…
Data is important!
• Roughly every 50 years a new power source for the human race is found. Once upon a time it was chemical, then it was electrical, then nuclear, etc.
• Information – so not just data, but data being used – is the new power source for our generation.
http://www.slideshare.net/jahendler/the-science-of-data-science
The Rensselaer Institute for Data Exploration and Applications
Business Systems:
Built and Natural Environments:
Cyber-Resiliency:
Policy, Ethics and Stewardship:
Materials Informatics:
Data-driven Physical/Life Sciences:
Healthcare Analytics and Mobile Health:
Social Network Analytics:
Agents and Augmented Reality:
Office of Research
Developing a “Data Science” Research Agenda
Multiscale Sparsity
Abductive Agent-oriented
BIG Data
• The term “Big Data” is widely used nowadays to refer to a whole bunch of machine-readable data in one accessible (to the researcher) place
  – 3 main contexts
    • The large data collections of “big science” projects – in traditional data warehouse or database formats
    • The enterprise data of large, non-Web-based companies (IBM, TATA, etc.)
      – Generally in multiple data formats, stores, warehouses, etc.
    • The data holdings of a Google, Facebook or other large Web company
      – Include large “unstructured” holdings
      – Include “graph” data
But wait, there’s more!
• 4th context: Broad Data
  – The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured)
    • Example: The extended Facebook OGP graph (the part outside Facebook’s datasets)
    • Example: DBpedia, YAGO, Wikidata, and other indexed information sources
    • Example: The growing linked open data cloud of freely available linked data from many domains
    • Example: millions of datasets made freely available on the Web by governments around the world
BROAD data challenges
• For broad data the new challenges that emerge include
  – (Web-scale) data search
  – “Crowd-sourced” modeling and user testing
  – rapid (and potentially ad hoc) integration of datasets
  – visualization and analysis of only-partially modeled datasets
  – policies for data use, reuse and combination
• These are an overlooked but critical part of the KDD world
KDD Pipeline – as usually presented
Data Storage
(Big Data Warehouse)
KDD Pipeline – in the real world
• Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control.
• At increasing rates and scales
[Figure: Data Sources (Sensors … apps, Social Media, Customer Behaviors, Web Partners, Open data) feeding many separate Data Storage nodes; annotation: Formatting, standards use, data cleansing, data bias analysis, …]
Tough data integration challenges
Enterprise analytics
Open Data Integration
Hard problems!
DIVE into Data
Discover Integrate Validate Explore
Thinking outside the Database box
IDEA
Discovery needs semantics
How do you find the Data you need?
The answer isn’t: Middle Eastern Terrorists for $800 …
Discovery challenge: keyword search won’t work
World Bank: Africa
US Data.gov: Crop
Africover: Agriculture
Kenya: Agricultural
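A minimal sketch of why plain keyword matching falls short, using the four dataset labels above exactly as they appear on the slide. The `RELATED` term table and both search helpers are hypothetical illustrations, not the API of any data catalog.

```python
# Dataset labels as they appear on the slide (left as-is).
DATASETS = [
    "World Bank: Africa",
    "US Data.gov: Crop",
    "Africover: Agriculture",
    "Kenya: Agricultural",
]

# Hypothetical, hand-written "lightweight semantics": terms treated as related.
RELATED = {
    "agriculture": {"agriculture", "agricultural", "crop"},
}


def tokens(label):
    """Lower-case a label and split it into words, ignoring the colon."""
    return label.lower().replace(":", " ").split()


def keyword_search(query, datasets):
    """Plain keyword match: the query must appear as a word in the label."""
    return [d for d in datasets if query.lower() in tokens(d)]


def concept_search(query, datasets, related=RELATED):
    """Match any term the (assumed) term table relates to the query."""
    terms = related.get(query.lower(), {query.lower()})
    return [d for d in datasets if terms & set(tokens(d))]


print(keyword_search("agriculture", DATASETS))
print(concept_search("agriculture", DATASETS))
```

The keyword search finds only the label that contains the literal word, while the term table also reaches the "Crop" and "Agricultural" labels. Neither approach connects "World Bank: Africa" to a query about Kenyan agriculture; that takes geographic knowledge the labels do not carry.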
Integration challenge: need to understand the data
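Person
  RIN: 660125137
  Address #: 1118
  Address St: Pinehurst
  Address zip: 12203
  Course topic: CSCI
  Course #: 4961
Campus Personnel
  RPI ID: 660125137
  Name: Hendler
Campus Classes
  CRN: 1118
  Name: Intro to Physics
Person RIN 660125137 = Campus Personnel RPI ID 660125137? YES
Person Address # 1118 = Campus Classes CRN 1118? NO!!!!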
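The contrast between the two matches can be sketched in a few lines. This is an illustration under stated assumptions: the `meaning` annotations (`rpi:personId`, `postal:streetNumber`, `rpi:courseRegistrationNumber`) are made-up identifiers standing in for whatever semantic description the data would really carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    table: str
    name: str
    meaning: str  # hypothetical semantic annotation, not from the slide
    value: str


FIELDS = [
    Field("Person", "RIN", "rpi:personId", "660125137"),
    Field("Person", "Address #", "postal:streetNumber", "1118"),
    Field("Campus Personnel", "RPI ID", "rpi:personId", "660125137"),
    Field("Campus Classes", "CRN", "rpi:courseRegistrationNumber", "1118"),
]


def value_matches(fields):
    """Naive integration: link fields from different tables with equal values."""
    return [
        (a, b)
        for i, a in enumerate(fields)
        for b in fields[i + 1:]
        if a.table != b.table and a.value == b.value
    ]


def semantic_matches(fields):
    """Keep only value matches whose annotated meanings also agree."""
    return [(a, b) for a, b in value_matches(fields) if a.meaning == b.meaning]


for a, b in value_matches(FIELDS):
    verdict = "YES" if (a, b) in semantic_matches(FIELDS) else "NO"
    print(f"{a.table}.{a.name} = {b.table}.{b.name}? {verdict}")
```

Value equality alone links the street number 1118 to the course with CRN 1118; the annotation is what rules that link out while keeping the RIN / RPI ID link.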
Semantic Web and Linked Data (UK)
County Council
Ordnance Survey
Royal Mail
IOGDC Open Data Tutorial
But very hard for machines without people (or knowledge)
Head to head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California
* one of the most dangerous places in the US vs. one of the safest in the UK
* fails the “smell test”
Data + everything else you know
Same or different?
Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
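One way to turn these three questions into a check that runs before two figures are compared is sketched below. The metadata keys and the placeholder values are assumptions made for illustration; they do not describe how any real police force or agency records burglaries.

```python
# The three questions from the slide, as (assumed) metadata keys.
QUESTIONS = ("definition", "collection_method", "processing")


def comparability_issues(meta_a, meta_b):
    """List reasons why two statistics may not be directly comparable."""
    issues = []
    for key in QUESTIONS:
        a, b = meta_a.get(key), meta_b.get(key)
        if a is None or b is None:
            issues.append(f"{key}: unknown for at least one source")
        elif a != b:
            issues.append(f"{key}: differs ({a!r} vs {b!r})")
    return issues


# Placeholder metadata only; the values carry no real-world claims.
source_a = {
    "definition": "placeholder definition A",
    "collection_method": "placeholder method",
    "processing": "placeholder processing",
}
source_b = {
    "definition": "placeholder definition B",
    "collection_method": "placeholder method",
}

for issue in comparability_issues(source_a, source_b):
    print(issue)
```

An empty list does not prove the numbers are comparable; it only means the recorded metadata gives no reason to doubt it. Treating missing metadata as an issue, rather than as agreement, is a deliberate choice in this sketch.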
Exploration challenge: develop/test earlier in pipeline
[Figure: the real-world KDD pipeline (Sensors and apps, Social Media, Customer Behaviors, Web Partners, Open data; Formatting, standards use, data cleansing, data bias analysis, …; many Data Storage nodes) with Explore steps added]
Can we develop mechanisms to rapidly develop/test hypotheses prior to entering the full analytics pipeline?
Can human perceptual apparatus help?
Exploration challenge is to improve human/data interaction
Were there really no fires in 1985?
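A question like this can be raised automatically before anyone looks at a chart. The sketch below flags years that are absent from a series, and years recorded as zero between non-zero neighbours. The counts are invented for illustration and are not taken from any fire dataset.

```python
def suspicious_years(counts_by_year):
    """Flag years that are missing, or zero between non-zero neighbours."""
    flagged = {}
    for year in range(min(counts_by_year), max(counts_by_year) + 1):
        if year not in counts_by_year:
            flagged[year] = "missing"
        elif (
            counts_by_year[year] == 0
            and counts_by_year.get(year - 1, 0) > 0
            and counts_by_year.get(year + 1, 0) > 0
        ):
            flagged[year] = "zero between non-zero neighbours"
    return flagged


# Invented example data: 1985 is recorded as 0 and 1987 is absent.
fires = {1983: 120, 1984: 131, 1985: 0, 1986: 127, 1988: 140}

for year, reason in suspicious_years(fires).items():
    print(year, reason)
```

The check cannot say whether the zero is real; it only points a person at the years worth a second look, which is where human judgment comes in.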
Traditional Metadata
• Traditionally metadata tries to be comprehensive
  – Example: ISO 19115 (GIS standard)
    • >400 elements
    • 14 “packages”
    • Dozens of UML models (not all consistent w/ each other)
• After 50 years this still doesn’t work!
The alternative: Not your “father’s metadata”
• Big Data on the web
  – is moving away from traditional relational models (cf. NoSQL)
  – Moving towards third party application and extension (cf. JSON)
  – Focus on interoperability and exchange with “lightweight” semantics
    • Using ideas from the Semantic Web
      – Search: Schema.org
      – Social Networking: OGP
Google 2014
Google finds embedded metadata on >20% of its crawl – Guha, 2014
• The schema.org hierarchy and details are all available online
  – https://schema.org/docs/full.html
Dataset extension to schema.org - April, 2013
Schema.org/Dataset – add this to your pages!
USA “Project Data” – metadata
RDFa
Embedded metadata for Search, Web Apps
Based on Schema.org/Dataset
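As a rough illustration of what Schema.org/Dataset markup can look like, the sketch below builds a small description and wraps it for embedding in a page. It uses JSON-LD rather than the RDFa shown on the slide, since JSON-LD is easy to generate from Python. The helper names and the example.org URLs are assumptions; the property names (`name`, `description`, `url`, `keywords`, `distribution`, `contentUrl`, `encodingFormat`) come from the schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, url, keywords, download_url, media_type):
    """Build a minimal schema.org Dataset description as a Python dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": media_type,
        },
    }


def as_script_tag(doc):
    """Wrap the description so it can be embedded in an HTML page."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset; every value here is a placeholder.
EXAMPLE = dataset_jsonld(
    name="Example crop statistics",
    description="Placeholder description of an example dataset.",
    url="https://example.org/datasets/crops",
    keywords=["agriculture", "crop"],
    download_url="https://example.org/datasets/crops.csv",
    media_type="text/csv",
)

print(as_script_tag(EXAMPLE))
```

A real page would describe far more (license, publisher, coverage), but even this handful of properties gives a crawler something to index beyond the page text.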
Not just Govt sector
• IPTC rNews
  – Embedded format for online news publications
Not just Govt sector
• GoodRelations
  – Embedded format for online products/catalogs
Not just Govt sector
• Open Graph Protocol
  – Embedded format for Facebook relationships
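Embedded formats like these are simple to read back out of a page. The sketch below collects Open Graph `og:` properties from `<meta>` tags using only Python's standard-library HTML parser. The sample page and its example.org URLs are made up for illustration.

```python
from html.parser import HTMLParser


class OGPParser(HTMLParser):
    """Collect Open Graph properties from <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


# Made-up page for illustration.
SAMPLE_PAGE = """
<html><head>
<title>Example dataset page</title>
<meta property="og:title" content="Example dataset page" />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://example.org/datasets/crops" />
<meta property="og:image" content="https://example.org/crops.png" />
<meta name="description" content="Not an Open Graph property" />
</head><body></body></html>
"""

parser = OGPParser()
parser.feed(SAMPLE_PAGE)
for prop, content in parser.properties.items():
    print(prop, "=", content)
```

Only the four `og:` tags are collected; the ordinary `description` tag is ignored because it carries no `property` attribute.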
Next steps
Person            Date
Smith James       June 4
Jones Fred        May 17
O’Connell Frank   April 3
Chang Wu          February 21
Hoffman Bernd     December 9
It’s not enough just to describe the data elements…
Describing a dataset … requires a context
Person            Date
Smith James       June 4
Jones Fred        May 17
O’Connell Frank   April 3
Chang Wu          February 21
Hoffman Bernd     December 9
1976 Dates of Birth
Describing a dataset … requires a context
How do we capture more of this information?
Person            Date
Smith James       June 4
Jones Fred        May 17
O’Connell Frank   April 3
Chang Wu          February 21
Hoffman Bernd     December 9
1976 Cancer Mortality dates
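The point can be sketched with a small data structure that keeps element-level types apart from dataset-level context. The class and field names (`DatasetDescription`, `columns`, `context`, `date_meaning`) are hypothetical; the rows and the two contexts are the ones from the slides.

```python
from dataclasses import dataclass

# Two of the rows from the slides; identical in both datasets.
ROWS = [
    ("Smith James", "June 4"),
    ("Jones Fred", "May 17"),
]


@dataclass
class DatasetDescription:
    columns: dict  # element-level description: column name -> type
    context: dict  # dataset-level description (hypothetical keys)


births = DatasetDescription(
    columns={"name": "Person", "date": "Date"},
    context={"year": 1976, "date_meaning": "date of birth"},
)
deaths = DatasetDescription(
    columns={"name": "Person", "date": "Date"},
    context={"year": 1976, "date_meaning": "cancer mortality date"},
)


def describe(row, desc):
    """Render one row using the dataset-level context."""
    person, date = row
    return f"{person}: {desc.context['date_meaning']} {date}, {desc.context['year']}"


print(births.columns == deaths.columns)  # element types are identical
print(describe(ROWS[0], births))
print(describe(ROWS[0], deaths))
```

The column types alone cannot tell the two datasets apart; only the dataset-level context separates a birthday from a date of death.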
ARL Network-Science CTA
[Figure: two sets of four panels labeled A, B, C, D. First set: Count (ticks 1, 10, 100, 1000) vs. Time interval (# of days) (ticks 1, 10, 100, 1000), with series “Mentorship first” and “Housing trust first”. Second set: Count vs. Time interval (# of days), from -300 to 300.]
Algorithms designed in Y3 were tested against 220GB of data from the EverQuest II game, looking for proxy measures of trust. Performance results on real data showed good correspondence with theoretical results (but 220GB = 1 month of our 2 yrs of data).
Scaling inference for discovery, integration & validation
AI “rules on graphs” bring (limited) KR languages to supercomputing models
Weaver (PhD 2013) showed power of BlueGene/Q for AI computations
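For readers unfamiliar with “rules on graphs”, here is a toy single-machine sketch: two RDFS-style rules applied to a set of triples until nothing new can be derived. It is only meant to show the kind of inference involved and says nothing about how the supercomputing work cited above was implemented. The class and instance names are invented.

```python
def rdfs_closure(triples):
    """Naive forward chaining with two RDFS-style rules over (s, p, o) triples.

    Rule 1: A subClassOf B, B subClassOf C  =>  A subClassOf C
    Rule 2: x type A,       A subClassOf B  =>  x type B
    """
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p not in ("subClassOf", "type"):
                continue
            for s2, p2, o2 in facts:
                if p2 == "subClassOf" and s2 == o:
                    # Keeping p covers both rules: subClassOf or type.
                    new.add((s, p, o2))
        if new <= facts:  # fixed point reached: nothing new was derived
            return facts
        facts |= new


# Invented example graph.
GRAPH = {
    ("Professor", "subClassOf", "Faculty"),
    ("Faculty", "subClassOf", "Person"),
    ("alice", "type", "Professor"),
}

for triple in sorted(rdfs_closure(GRAPH) - GRAPH):
    print(triple)
```

Each pass scans every pair of facts, which is fine for three triples and hopeless for billions. That cost is why this kind of limited rule language is attractive to parallelize.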
From visualization to exploration
… Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration tool that scientists can use throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, may hold the key to lowering the cost of visualization generation and allow it to become a more integral part of the scientific process.
Conclusions
• Our data challenge is becoming “Broad Data”
  – World Wide Web trend towards more and more varied data
    • In many domains
      – E-commerce, Open Govt, many more (cf. Health/Medical care)
• Broad data requires
  – Modern, Web-oriented metadata
  – LINKING the metadata, not the data
• Broad data requires thinking outside the “Database” box
  – DIVE: discover, integrate, validate and
  – especially: EXPLORE (early, often, rapidly)