pnc 2013 kyoto, japan december 10 th 2013 jeanette zerneke electronic cultural atlas initiative...
TRANSCRIPT
PNC 2013 Kyoto, Japan
December 10th 2013
Jeanette Zerneke
Electronic Cultural Atlas Initiative (ECAI)
School of Information, UC Berkeley
Linking Multi-faceted Complex Data
From Vision to Reality
Analysis of whole corpora and collections of text…
Linking multiple types of information…
Sharing data from different sources…
Mapping multiple layers of complex data
Complex yet simple interactive visualizations…
Network analysis of everything
Open Linked Data… BIG Data
Starting with a Simplified Vision
Text Analysis: Word Counts and Networks
Space and Time: Maps, GIS and Historical GIS
Science Visualizations
Early Experiments
Alice in Wonderland
Interactive Word Cloud Generator
Initial steps and inspiration
Word Cloud Alice in Wonderland
TextArc ~ 2002
This visualization has all the words in the text, multiple interactive viewing modes, and it still works
Alice in Wonderland
Static Word Clouds – Single Dataset
123rf Royalty free stock photos
http://willsfamily.org/gwills/books/
The image shows the 300 most commonly occurring words in the text. Full tech details provided
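A word cloud of this kind starts from a plain frequency count. The following is a minimal sketch of that first step, not the method behind the image; the function name `top_words`, the tokenizing pattern, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 300) -> list[tuple[str, int]]:
    """Return the n most frequent words in text, with their counts."""
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample text; a real run would read the whole book.
    print(top_words("the cat and the hat", n=1))
```

A real word cloud would usually also drop very common function words before counting; that step is left out here.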
…a question that’s becoming increasingly central as the social sciences embrace new quantitative tools: Can number-crunchers outside these fields use the data to make bold, useful contributions? Or do they need more specialized knowledge to be able to ask the right questions in the first place?
…the unprecedented scale of the Google dataset, encompassing millions of books, has enticed scholars with the promise of a new, quantitative approach to language and culture—“culturomics,” as a pivotal Science paper dubbed it.
A language isn’t simply a set of words; it encompasses structures all the way from individual sounds to the combination of words and phrases into syntactic patterns. “This ‘language is words’ axiom is part of most people’s folk linguistics that we have to train people out of when they take Intro to Linguistics,” Fruehwald said. “That’s why it’s a little hard to take the work of these physicists seriously at first glance.”
When Physicists do Linguistics
Ben Zimmer, Boston Globe 2013
Inspired by The Internet
http://en.wikipedia.org/wiki/Internet
Partial map of the Internet based on the January 15, 2005 data found on opte.org.
This image of the internet shows a beautiful complex structure which appears to have a holographic structure
It resonates with us
It has inspired us to look for these structures in the rest of our world
Inspired by The Internet
These examples show many creative ways to present data
They are ‘authored’ – a human decides how it looks
Generally they present a static view of one type of data and/or relationship
The visualization expresses something about the characteristics of the data elements within the data universe, or the quality or pattern of the relationships between elements
Great Data Visualizations
There is a temptation to make the data fit the box so you can do amazing visualizations:
• To cleanse the data so it fits
• To forge ahead and forget the pedigree and links to sources
• To believe any text should be graph-able with a network graph, so it must be useful
But we know:
• analysis requires understandable algorithms
• evaluation requires judgment, not just visualization
There must be feedback between visual inspection / analysis / evaluation / human interaction and judgment
Challenges
We noticed that all the highly valued visualizations are authored visualizations.
They are designed by a person or group of people to express a specific point or an understanding of the data.
• A person has chosen both the data and visualization methodology
• A person has chosen the color, size, line width, etc.
This is quite similar to making maps… and there is a long tradition of study on how to make maps as accurate and readable as possible
Zooming in to what works
Digital mapping systems inherently have ways of handling data from different sources and of different data types
• Three primary mappable data types: point, line, and shape
• Attributes & metadata linkable at data set and element levels
They were originally built for both user interaction with the data and authoring views / maps
Initial ECAI projects took advantage of these features: Sasanian Seals, Missions of North America, Silk Road
Mapping & Spatio-temporal Visualization
Color coded routes of various explorers are displayed
Options on the left enable overlay of printed maps, links to videos, images, manuscripts and source documentation
ECAI Silk Road Routes
Use the same methodologies as mapping to explore a complete text corpus
• 3-dimensional representation of data
• Linkage to metadata, attributes and source info
• Multiple views – search within dataset
Creating a customized work environment for the scholar’s analysis, visualization and publication of results
Blue Dots Project
Linked Open Data – the details
6.7 Architecture of a Linked Data application that implements the crawling pattern.
Required functions to ingest data
The use of data sets generated by others in the past can be impeded in many different ways – the hard drive crashed, and there was no back-up; the person who could give permission cannot be found and so on. There are some clearly distinct barriers to be overcome. Here is one typology:
1. Discovery: Does a suitable data set exist?
2. Location: Where is a copy?
3. Deterioration: Is the copy too deteriorated and/or obsolete to be usable?
4. Permission: May it be used?
5. Interoperability: Is it standardized enough to be usable with acceptable effort?
6. Description: Is it clear enough what the data represent?
7. Trust: Are the lineage, version, and error rate understood and acceptable?
8. Use: Should I use it for my purpose?
Identifying Requirements for Data Reuse & Linking
Data Management as Bibliography by Michael Buckland, 2011
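The eight barriers in the typology above can be read as an ordered checklist, where a data set is only as reusable as the first barrier it fails to clear. The following is a minimal sketch under that reading; the names `BARRIERS` and `first_unresolved` and the `cleared` dictionary are hypothetical and are not part of Buckland's paper.

```python
from typing import Optional

# The eight barriers, in the order given in the typology above.
BARRIERS = [
    ("Discovery", "Does a suitable data set exist?"),
    ("Location", "Where is a copy?"),
    ("Deterioration", "Is the copy too deteriorated and/or obsolete to be usable?"),
    ("Permission", "May it be used?"),
    ("Interoperability", "Is it standardized enough to be usable with acceptable effort?"),
    ("Description", "Is it clear enough what the data represent?"),
    ("Trust", "Are the lineage, version, and error rate understood and acceptable?"),
    ("Use", "Should I use it for my purpose?"),
]


def first_unresolved(cleared: dict[str, bool]) -> Optional[str]:
    """Return the first barrier not marked as cleared, or None if all are cleared.

    `cleared` maps a barrier name to True once a reviewer judges that barrier
    overcome; a missing entry counts as not yet cleared.
    """
    for name, _question in BARRIERS:
        if not cleared.get(name, False):
            return name
    return None


if __name__ == "__main__":
    # Hypothetical review state: found and located, condition not yet checked.
    print(first_unresolved({"Discovery": True, "Location": True}))
```

Treating the barriers as strictly sequential is a simplification; in practice several may be assessed at once.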
Fitness for purpose:
• Is the data adequately documented to allow confidence in the content for your purpose?
• What is the uncertainty / ambiguity of the data?
• Are the sources of the data adequately documented?
Should you use it?
• Liberate the notes!
• Expand ‘publication’ to notes
• Notes as a primary resource
• Expand ‘library’
• Published volumes as derivative
• Modernize bibliography
• Preserve the ‘workshop’
• Make work environment closer to a shared office (convergence), enable collaboration, and support creativity.
Editor’s Notes – Supporting Scholars
ecai.org/mellon2010
ecai.org/KnowledgeUnix
More in Ryan Shaw’s presentation later in this session
Early California Cultural Atlas
UC Riverside and Electronic Cultural Atlas Initiative
with the California Center for Native Nations at UCR, Stanford Spatial History Lab, and the National Center for the Teaching of History in the Schools
Map of North America, 1685
Ranchos San Bonito and El Pescadero
ECPP: Baptism Record for Mission San Carlos
Most of the data layers come from institutionally supported data sources:
• California Digital Library
• State of California
• Huntington Library – Early California Population Project
• Library of Congress
• David Rumsey Map Collection
• ECAI ePublication – North American Missions
One new dataset had to be compiled from historical sources:
• Estimated locations of Indian villages
ECCA dataset sources
Usability Evaluation
• Discoverable, physically and legally available
• Functional with cost-effective methods of ingest
• Semantic compatibility, acceptable level of uncertainty
ECCA Topology of Uncertainty & Ambiguity
We defined:
Uncertainty as a combination of multiple factors that affect the accuracy and precision of data.
Ambiguity as uncertainty whose source is differences of opinion, perception or understanding of the data. In humanities projects, ambiguity is accepted. It is not expected to be eliminated.
For ECCA we divided uncertainty characteristics into two dimensions: source and type
ECCA Data Evaluation
In the ECCA project, each data layer has unique uncertainty and ambiguity. We have characterized the sources of this uncertainty below.
Spatio-temporal paradigm diversity (Ambiguity / Subjectivity)
• Perception of time and place for different communities affects how place and ‘land use’ are documented
Data recording and collection (kinds of ambiguity, accuracy, and precision)
• What was recorded and what has been preserved
• Cultural perspectives and technology have influenced what was recorded
• Events that followed affect preservation
Data characterization - categorization (generalization and interpretation)
• Deciding how to convert the collected data into categories and objects which can be visualized and analyzed
• Building / using ontologies with mapping
Sources of Uncertainty
ECCA composite characterization of the types of uncertainty:
Accuracy - Is there a knowable correct value? How close are we to it?
Precision - Exactness of measurement
Lineage of the data - Documenting sources & metadata
Legal / Protocol limitations - What data is available for use
Credibility - Reliability of information source
Completeness - Data sample size / number of observations
• What percent of the total items do we know?
• Documentation if there are known areas of missing data
Scale - For maps and timelines scale is important
• What scale is appropriate for what we know or can represent about the data?
Type of Uncertainty
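The two dimensions described above, source and type, suggest a simple record structure for annotating each data layer. The following is a minimal sketch of such a structure, not the ECCA project's actual schema; the class names, the `tally` helper, and the example layer names and notes are all hypothetical placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Where the uncertainty comes from (the three sources listed above)."""
    PARADIGM_DIVERSITY = "spatio-temporal paradigm diversity"
    RECORDING_AND_COLLECTION = "data recording and collection"
    CHARACTERIZATION = "data characterization - categorization"


class Kind(Enum):
    """What type of uncertainty it is (the seven types listed above)."""
    ACCURACY = "accuracy"
    PRECISION = "precision"
    LINEAGE = "lineage of the data"
    LEGAL_PROTOCOL = "legal / protocol limitations"
    CREDIBILITY = "credibility"
    COMPLETENESS = "completeness"
    SCALE = "scale"


@dataclass(frozen=True)
class UncertaintyNote:
    layer: str      # name of the data layer being described
    source: Source  # first dimension
    kind: Kind      # second dimension
    note: str       # free-text explanation for human readers


def tally(notes: list[UncertaintyNote]) -> Counter:
    """Count notes per (source, kind) cell of the two-dimensional grid."""
    return Counter((n.source, n.kind) for n in notes)


if __name__ == "__main__":
    # Placeholder annotations for illustration only, not project findings.
    notes = [
        UncertaintyNote("example layer A", Source.RECORDING_AND_COLLECTION,
                        Kind.COMPLETENESS, "placeholder: some records missing"),
        UncertaintyNote("example layer B", Source.CHARACTERIZATION,
                        Kind.PRECISION, "placeholder: locations are estimates"),
    ]
    for (source, kind), count in tally(notes).items():
        print(source.value, "/", kind.value, ":", count)
```

Keeping the note as free text reflects the point that ambiguity is documented rather than eliminated.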
Datasets for use at specific levels of ‘certainty’ – digits of resolution
For maps and timelines scale is a crucial component of the visualization design. Precision implies scale in spatial and temporal data. It affects the scale at which it is appropriate to represent the data. If the data is presented at an incorrect scale it can appear either more or less specific than the data warrants. Other aspects of uncertainty can also impact the appropriate scale of data representation.
Interactive maps display change of scale seamlessly with zoom functions. At small scales, lines are generalized and labels are moved around or even dropped for some items when they won’t fit. ‘Zooming in’ triggers display of data with greater precision. For some implementations we will need to have datasets customized for different scales of display.
Role of Scale
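One way to make "precision implies scale" concrete is to cap how far a map may zoom in on a layer according to the digits of resolution in its coordinates. The following is a rough sketch of that idea and not a method described in the talk. It assumes coordinates in decimal degrees, the common approximation of about 111 km per degree of latitude, and the usual Web Mercator tile scheme with 256-pixel tiles; the function names and the rule "one pixel should not be finer than the data" are assumptions.

```python
# Approximate ground distance covered by one degree of latitude, in meters.
METERS_PER_DEGREE = 111_320.0
# Web Mercator ground resolution at zoom 0 on the equator (256-pixel tiles).
METERS_PER_PIXEL_ZOOM0 = 156_543.03392


def precision_meters(decimal_digits: int) -> float:
    """Approximate ground precision implied by the digits kept after the decimal point."""
    return METERS_PER_DEGREE / (10 ** decimal_digits)


def max_zoom_for_precision(decimal_digits: int, max_zoom: int = 20) -> int:
    """Deepest zoom level at which one pixel still covers at least the data's precision."""
    limit = precision_meters(decimal_digits)
    for zoom in range(max_zoom, -1, -1):
        if METERS_PER_PIXEL_ZOOM0 / (2 ** zoom) >= limit:
            return zoom
    return 0


if __name__ == "__main__":
    for digits in (0, 2, 5):
        print(digits, "digits -> max zoom", max_zoom_for_precision(digits))
```

Other aspects of uncertainty, such as credibility or completeness, would still need human judgment; this covers only the numeric precision of the coordinates.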
Native California Ethno-geography
A case study of ambiguity in the origin of Indians who were baptized at mission San Juan Bautista
Complex GIS Data: Individual villages, networks of villages that functioned together, and villages that changed locations.
Villages and Networks
The Synthesis – A Gazetteer With one set of Locations for Linking to Data
Together the detailed and synthesis Gazetteers create a dataset that allows linking to objects at different scales
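The pairing of a detailed gazetteer with a synthesis gazetteer can be pictured as two lookup tables joined by a shared identifier. The following is a minimal sketch under that assumption, not the project's data model; the class names, field names, identifiers, and coordinates are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetailedPlace:
    """One entry in the detailed gazetteer, e.g. a single recorded village site."""
    place_id: str
    name: str
    lat: float
    lon: float
    synthesis_id: str  # link to the generalized location used for this place


@dataclass(frozen=True)
class SynthesisPlace:
    """One entry in the synthesis gazetteer: a single agreed location for linking."""
    synthesis_id: str
    name: str
    lat: float
    lon: float


def link_location(place_id: str,
                  detailed: dict[str, DetailedPlace],
                  synthesis: dict[str, SynthesisPlace],
                  detailed_view: bool) -> tuple[float, float]:
    """Return the coordinates to link to, at the detailed or the synthesis scale."""
    place = detailed[place_id]
    if detailed_view:
        return (place.lat, place.lon)
    general = synthesis[place.synthesis_id]
    return (general.lat, general.lon)


if __name__ == "__main__":
    # Placeholder records: a village recorded at two sites over time, both
    # linked to one synthesis location. Names and coordinates are invented.
    synthesis = {"S1": SynthesisPlace("S1", "Village A", 36.80, -121.50)}
    detailed = {
        "D1": DetailedPlace("D1", "Village A (earlier site)", 36.81, -121.52, "S1"),
        "D2": DetailedPlace("D2", "Village A (later site)", 36.79, -121.48, "S1"),
    }
    print(link_location("D1", detailed, synthesis, detailed_view=True))
    print(link_location("D2", detailed, synthesis, detailed_view=False))
```

Records such as baptisms can then point at the synthesis identifier, while the detailed entries remain available for close-up views.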
Visualization of Ethno-geographic Change over Time
Village Location
Active Baptisms
Depopulated Village Sites
Mission Site
Rancho Site
URL
What we are doing is creating custom research environments for each project
How can we create sharable tools that implement this process?
Can we create infrastructure that supports data collection and analysis, and that includes authoring systems customizable to match your research and publication needs?
Reality – Creating Research Environments
Research Process Support
• Development of infrastructure that helps researchers, data creators, and catalogers construct the metadata needed to support decision-making about fitness of data for re-use: for specified quality levels, specific analysis methods, and appropriate visualization tools.
Data Evaluation Interface
• Wouldn’t it be great if we had a plug-and-play interactive context and quality visualization interface for scholars and visualization authors, so you could see if the data was what you need?
Visualization Authoring Systems
• Easily customizable visualization authoring interfaces that can be deployed to match your research questions and allow publication of your discoveries.
What we still need – Back to Vision