
PNC 2013 Kyoto, Japan
December 10th 2013

Jeanette Zerneke
Electronic Cultural Atlas Initiative (ECAI)

School of Information, UC Berkeley

Linking Multi-faceted Complex Data

From Vision to Reality

Analysis of whole corpora and collections of text…
Linking multiple types of information…
Sharing data from different sources…
Mapping multiple layers of complex data…
Complex yet simple interactive visualizations…
Network analysis of everything…

Linked Open Data… BIG Data

Starting with a Simplified Vision

Text Analysis Word Counts and Networks

Space and Time Maps, GIS and Historical GIS

Science Visualizations

Early Experiments
Alice in Wonderland
Interactive Word Cloud Generator

Initial steps and inspiration

Word Cloud Alice in Wonderland

TextArc ~ 2002

This visualization has all the words in the text, multiple interactive viewing modes, and it still works.

Alice in Wonderland

Static Word Clouds – Single Dataset

123rf Royalty free stock photos

http://willsfamily.org/gwills/books/
The image shows the 300 most commonly occurring words in the text. Full technical details provided.

The I-Ching

These word clouds are ‘authored’, with specific cartographic choices.
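The counting step behind these static clouds is simple to reproduce. Below is a minimal sketch in Python: tokenize the text, drop common stopwords, and keep the 300 most frequent words (the cutoff the slide mentions). The file name and stopword list are placeholders, not details from the original generators.

```python
# Minimal sketch of the counting step behind a static word cloud:
# tokenize a plain-text file and keep the N most frequent words.
# The file name is a placeholder; the cutoff of 300 matches the slide.
import re
from collections import Counter

STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "it", "she", "he", "was", "i"}

def top_words(path: str, n: int = 300) -> list[tuple[str, int]]:
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)  # word-cloud generators scale font size by these counts

print(top_words("alice_in_wonderland.txt")[:10])
```

Everything after this step – font sizing, color, layout – is where the ‘authoring’ choices come in.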

Google Books - Enlightenment

Spiritual vs Enlightenment?

If you know what to ask?

…a question that’s becoming increasingly central as the social sciences embrace new quantitative tools: Can number-crunchers outside these fields use the data to make bold, useful contributions? Or do they need more specialized knowledge to be able to ask the right questions in the first place?

…the unprecedented scale of the Google dataset, encompassing millions of books, has enticed scholars with the promise of a new, quantitative approach to language and culture—“culturomics,” as a pivotal Science paper dubbed it.

A language isn’t simply a set of words; it encompasses structures all the way from individual sounds to the combination of words and phrases into syntactic patterns. “This ‘language is words’ axiom is part of most people’s folk linguistics that we have to train people out of when they take Intro to Linguistics,” Fruehwald said. “That’s why it’s a little hard to take the work of these physicists seriously at first glance.”

When Physicists do Linguistics

Ben Zimmer, Boston Globe 2013

Inspired by The Internet

http://en.wikipedia.org/wiki/Internet
Partial map of the Internet based on the January 15, 2005 data found on opte.org.

This image of the Internet shows a beautiful, complex structure that appears almost holographic.

It resonates with us

It has inspired us to look for these structures in the rest of our world

Inspired by The Internet

Web of Science

These examples show many creative ways to present data

They are ‘authored’ – a person decides how they look

Generally they present a static view of one type of data and/or relationship

The visualization expresses something about the characteristics of the data elements within the data universe, or the quality or pattern of the relationships between elements.

Great Data Visualizations

There is a temptation to make the data fit the box so you can do amazing visualizations:
To cleanse the data so it fits
To forge ahead and forget the pedigree and links to sources
To believe any text should be graphable with a network graph, so it must be useful

But we know:
Analysis requires understandable algorithms
Evaluation requires judgment, not just visualization

There must be feedback between visual inspection / analysis / evaluation / human interaction and judgment

Challenges

We noticed that all the highly valued visualizations are authored visualizations.

They are designed by a person or group of people to express a specific point or an understanding of the data.
A person has chosen both the data and the visualization methodology.
A person has chosen the color, size, line width, etc.

This is quite similar to making maps… and there is a long tradition of study on how to make maps as accurate and readable as possible

Zooming in to what works

Digital mapping systems inherently have ways of handling data from different sources and of different data types:
Three primary mappable data types: point, line, and shape (see the sketch after this slide)
Attributes & metadata linkable at dataset and element levels

They were originally built for both user interaction with the data and authoring views / maps

Initial ECAI projects took advantage of these features: Sasanian Seals, Missions of North America, Silk Road

Mapping & Spatio-temporal Visualization
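As a concrete illustration of the three mappable data types, here is a sketch using GeoJSON-style structures in Python. The coordinates and property values are invented for the example; only the point / line / shape split and the element-level vs. dataset-level attachment of attributes and metadata follow the slide.

```python
# Sketch of the three primary mappable data types as GeoJSON features,
# with attributes attached at the element level and metadata carried at
# the dataset (FeatureCollection) level. Coordinates are illustrative only.
village_point = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-121.5, 36.8]},
    "properties": {"name": "Example village", "source": "mission register"},
}
route_line = {
    "type": "Feature",
    "geometry": {"type": "LineString", "coordinates": [[-121.5, 36.8], [-121.3, 36.9]]},
    "properties": {"name": "Explorer route", "explorer": "unknown"},
}
territory_shape = {
    "type": "Feature",
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-121.6, 36.7], [-121.2, 36.7], [-121.2, 37.0],
                                  [-121.6, 37.0], [-121.6, 36.7]]]},
    "properties": {"name": "Territory", "land_use": "documented range"},
}
dataset = {
    "type": "FeatureCollection",
    "features": [village_point, route_line, territory_shape],
    # dataset-level metadata (not part of the GeoJSON standard itself,
    # but commonly carried alongside the features)
    "metadata": {"project": "ECAI example", "lineage": "compiled from historical sources"},
}
```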

Color coded routes of various explorers are displayed

Options on the left enable overlay of printed maps, links to videos, images, manuscripts and source documentation

ECAI Silk Road Routes

Use the same methodologies as mapping to explore a complete text corpus:
3-dimensional representation of data
Linkage to metadata, attributes, and source info
Multiple views – search within the dataset

Creating a customized work environment for the scholar’s analysis, visualization and publication of results

Blue Dots Project

Blue Dots Linked to Source Data

A custom work environment for the scholar
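The Blue Dots screenshots are not reproduced here, but the underlying idea – every unit of the corpus is a selectable element that links back to its source and metadata – can be sketched in a few lines. This is an illustrative model, not the project's actual implementation; all names and URLs are hypothetical.

```python
# Illustrative sketch (not the actual Blue Dots implementation): treat each
# text unit in a corpus as a "dot" that carries attributes and a link back
# to its source, so selecting it in a visualization can retrieve the metadata.
from dataclasses import dataclass

@dataclass
class BlueDot:
    text_id: str                # e.g., a chapter or section identifier
    position: tuple[int, int]   # grid position in the corpus view
    attribute: str              # value used to color the dot (e.g., a search hit)
    source_url: str             # link back to the source text and its metadata

corpus = [
    BlueDot("ch01", (0, 0), "match", "https://example.org/corpus/ch01"),
    BlueDot("ch02", (0, 1), "no-match", "https://example.org/corpus/ch02"),
]

def select(dots: list[BlueDot], attribute: str) -> list[BlueDot]:
    """Search within the dataset: return the dots whose attribute matches."""
    return [d for d in dots if d.attribute == attribute]

for dot in select(corpus, "match"):
    print(dot.text_id, dot.source_url)
```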

So can we link everything? e.g., Linked Open Data

ECAI Religious Atlas of Asia

‘30,000-Foot’ View: All Religions – Random Overlay – An Imprecise Tool

Linked Open Data – the details

Figure 6.7: Architecture of a Linked Data application that implements the crawling pattern.

Required functions to ingest data
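As a rough sketch of those ingest functions, the crawling pattern reduces to: dereference seed URIs, merge the returned triples, and follow links (e.g., owl:sameAs, rdfs:seeAlso) to further sources. The Python below assumes the rdflib library and uses an invented seed; a real application would add caching, politeness delays, and vocabulary mapping.

```python
# Minimal sketch of the crawling pattern for a Linked Data application,
# assuming rdflib is available: start from seed URIs, dereference them,
# merge the returned triples, and follow owl:sameAs / rdfs:seeAlso links.
from rdflib import Graph
from rdflib.namespace import OWL, RDFS

def crawl(seeds: list[str], max_fetches: int = 10) -> Graph:
    merged, queue, seen = Graph(), list(seeds), set()
    while queue and len(seen) < max_fetches:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        try:
            merged += Graph().parse(uri)   # dereference the URI, ingest its triples
        except Exception:
            continue                       # unreachable or non-RDF resource
        for pred in (OWL.sameAs, RDFS.seeAlso):
            for _, _, obj in merged.triples((None, pred, None)):
                queue.append(str(obj))     # follow links to further data
    return merged

graph = crawl(["http://dbpedia.org/resource/Kyoto"])
print(len(graph), "triples ingested")
```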

The use of data sets generated by others in the past can be impeded in many different ways – the hard drive crashed and there was no back-up; the person who could give permission cannot be found; and so on. There are some clearly distinct barriers to be overcome. Here is one typology:

1. Discovery: Does a suitable data set exist?
2. Location: Where is a copy?
3. Deterioration: Is the copy too deteriorated and/or obsolete to be usable?
4. Permission: May it be used?
5. Interoperability: Is it standardized enough to be usable with acceptable effort?
6. Description: Is it clear enough what the data represent?
7. Trust: Are the lineage, version, and error rate understood and acceptable?
8. Use: Should I use it for my purpose?

Identifying Requirements for Data Reuse & Linking

Data Management as Bibliography by Michael Buckland, 2011
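Buckland's typology reads naturally as a per-dataset checklist. The sketch below records each of the eight barriers as a yes/no field; the field names paraphrase the questions and are this example's own naming, not a schema from the paper.

```python
# Sketch of the reuse-barrier typology as a simple checklist a project
# could record for each candidate dataset. Field names are illustrative.
from dataclasses import dataclass, fields

@dataclass
class ReuseChecklist:
    discovery: bool         # does a suitable data set exist?
    location: bool          # is a copy locatable?
    deterioration: bool     # is the copy still usable (not degraded/obsolete)?
    permission: bool        # may it be used?
    interoperability: bool  # standardized enough for acceptable effort?
    description: bool       # is it clear what the data represent?
    trust: bool             # lineage, version, error rate understood/acceptable?
    use: bool               # should I use it for my purpose?

    def blockers(self) -> list[str]:
        return [f.name for f in fields(self) if not getattr(self, f.name)]

check = ReuseChecklist(True, True, True, False, True, True, True, False)
print("Barriers to overcome:", check.blockers())  # ['permission', 'use']
```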

Data Interoperability & Reuse

Buckland, Zerneke 2011

Fitness for purpose:
Is the data adequately documented to allow confidence in the content for your purpose?
What is the uncertainty / ambiguity of the data?
Are the sources of the data adequately documented?

Should you use it?

• Liberate the notes!
• Expand ‘publication’ to notes
• Notes as a primary resource
• Expand ‘library’
• Published volumes as derivative
• Modernize bibliography
• Preserve the ‘workshop’
• Make the work environment closer to a shared office (convergence), enable collaboration, and support creativity.

Editor’s Notes – Supporting Scholars

ecai.org/mellon2010
ecai.org/KnowledgeUnix

More in Ryan Shaw’s presentation later in this session

Early California Cultural Atlas
UC Riverside and Electronic Cultural Atlas Initiative

with the California Center for Native Nations at UCR, Stanford Spatial History Lab, and the National Center for the Teaching of History in the Schools

Map of North America, 1685

Ranchos San Bonito and El Pescadero

ECPP: Baptism Record for Mission San Carlos

Most of the data layers come from institutionally supported data sources:
California Digital Library
State of California
Huntington Library – Early California Population Project
Library of Congress
David Rumsey Map Collection
ECAI ePublication – North American Missions

One new dataset had to be compiled from historical sources: estimated locations of Indian villages

ECCA dataset sources

ECAI Early California (ECCA) Project Data

Usability Evaluation:
Discoverable, physically and legally available
Functional, with cost-effective methods of ingest
Semantic compatibility, acceptable level of uncertainty

ECCA Typology of Uncertainty & Ambiguity
We defined:
Uncertainty as a combination of multiple factors which affect the accuracy and precision of data.
Ambiguity as uncertainty whose source is differences of opinion, perception, or understanding of the data. In humanities projects, ambiguity is accepted; it is not expected to be eliminated.
For ECCA we divided uncertainty characteristics into two dimensions: source and type.

ECCA Data Evaluation

In the ECCA project, each data layer has unique uncertainty and ambiguity. We have characterized the sources of this uncertainty below.

Spatio-temporal paradigm diversity (ambiguity / subjectivity)
How different communities perceive time and place affects how place and ‘land use’ are documented

Data recording and collection (kinds of ambiguity, accuracy, and precision)
What was recorded and what has been preserved: cultural perspectives and technology have influenced what was recorded; events that followed affect preservation

Data characterization / categorization (generalization and interpretation)
Deciding how to convert the collected data into categories and objects which can be visualized and analyzed
Building / using ontologies with mapping

Sources of Uncertainty

ECCA composite characterization of the types of uncertainty:

Accuracy – Is there a knowable correct value? How close are we to it?
Precision – exactness of measurement
Lineage of the data – documenting sources & metadata
Legal / protocol limitations – what data is available for use
Credibility – reliability of the information source
Completeness – data sample size / number of observations: What percent of the total items do we know? Documentation of any known areas of missing data
Scale – for maps and timelines scale is important: What scale is appropriate for what we know or can represent about the data?

Type of Uncertainty
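One way to make such a characterization actionable is to record it as structured metadata attached to each data layer. The sketch below shows what that might look like for the compiled village-locations layer; the values are invented for illustration, not taken from the ECCA files.

```python
# Sketch of how ECCA-style uncertainty characteristics might be recorded
# per data layer, along the two dimensions (source and type) named above.
# All field values here are illustrative placeholders.
village_layer_uncertainty = {
    "layer": "Estimated locations of Indian villages",
    "sources": ["spatio-temporal paradigm diversity",
                "data recording and collection",
                "data characterization / categorization"],
    "types": {
        "accuracy": "location known only to the nearest few km",
        "precision": "point placed at landscape-feature resolution",
        "lineage": "compiled from mission registers and ethnographies",
        "legal_protocol": "no restrictions documented",
        "credibility": "secondary sources, cross-checked where possible",
        "completeness": "unknown fraction of villages ever recorded",
        "scale": "appropriate for regional display, not parcel-level",
    },
}
```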

Datasets for use at specific levels of ‘certainty’ – digits of resolution

For maps and timelines scale is a crucial component of the visualization design. Precision implies scale in spatial and temporal data. It affects the scale at which it is appropriate to represent the data. If the data is presented at an incorrect scale it can appear either more or less specific than the data warrants. Other aspects of uncertainty can also impact the appropriate scale of data representation.

Interactive maps display change of scale seamlessly with zoom functions. At small scales, lines are generalized and labels are moved around or even dropped for some items when they won’t fit. ‘Zooming in’ triggers display of data with greater precision. For some implementations we will need datasets customized for different scales of display.

Role of Scale
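The link between precision and scale can even be computed. The sketch below picks a maximum web-map zoom level from the number of decimal digits recorded in a coordinate, so the display never implies more precision than the data warrants. It assumes standard web-mercator tile resolutions (~156,543 m/pixel at zoom 0 at the equator) and ~111 km per degree of latitude; the ten-pixel slack is an arbitrary illustrative threshold.

```python
# Sketch of the "precision implies scale" idea: choose the deepest zoom
# level at which a coordinate's positional error still fits within a few
# screen pixels, so the display does not suggest false precision.
def max_zoom_for_precision(decimal_digits: int, pixels_of_slack: int = 10) -> int:
    ground_error_m = 111_000 / (10 ** decimal_digits)  # ~111 km per degree of latitude
    for zoom in range(22, -1, -1):                     # scan from deepest zoom outward
        meters_per_pixel = 156_543.03 / (2 ** zoom)    # web-mercator resolution at equator
        if ground_error_m <= pixels_of_slack * meters_per_pixel:
            return zoom                                # deepest zoom the data supports
    return 0

print(max_zoom_for_precision(2))  # coordinates like 36.80 -> coarse zoom only
print(max_zoom_for_precision(5))  # coordinates like 36.80214 -> street-level display ok
```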

Native California Ethno-geography

A case study of ambiguity in the origin of Indians who were baptized at Mission San Juan Bautista

Compile Village Location Information from multiple sources

Complex GIS Data: Individual villages, networks of villages that functioned together, and villages that changed locations.

Villages and Networks

Developing a classification scheme to use for mapping

Villages and Networks -- Detail

The Synthesis – A Gazetteer with One Set of Locations for Linking to Data

Together the detailed and synthesis Gazetteers create a dataset that allows linking to objects at different scales
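A sketch of that two-level design: the detailed gazetteer keeps every attested name variant and candidate location, while the synthesis gazetteer carries one agreed location per place, which is what other records link to. Village names, identifiers, and coordinates below are invented for illustration, not drawn from the project data.

```python
# Illustrative two-level gazetteer: detailed records preserve variants and
# their sources; the synthesis record holds the single linking location.
detailed_gazetteer = [
    {"id": "v1a", "name": "Ausaima", "synth_id": "V1", "location": (36.84, -121.54),
     "source": "mission register, 1797"},
    {"id": "v1b", "name": "Absayme", "synth_id": "V1", "location": (36.86, -121.50),
     "source": "ethnographic field notes"},
]
synthesis_gazetteer = {
    "V1": {"name": "Ausaima", "location": (36.85, -121.52)},  # one agreed location
}

def link_record(baptism: dict) -> tuple[float, float]:
    """Link a data record (e.g., a baptism entry) to its synthesis location."""
    variants = [d for d in detailed_gazetteer if d["name"] == baptism["village"]]
    return synthesis_gazetteer[variants[0]["synth_id"]]["location"]

print(link_record({"village": "Absayme"}))  # -> (36.85, -121.52)
```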

Visualization of Ethno-geographic Change over Time

Map legend: Village Location, Active Baptisms, Depopulated Village Sites, Mission Site, Rancho Site

What we are doing is creating custom research environments for each project

How can we create sharable tools that implement this process?

Can we create infrastructure that supports data collection and analysis? One that includes authoring systems that are customizable to match your research and publication needs?

Reality – Creating Research Environments

Research Process Support
Development of infrastructure that helps researchers, data creators, and catalogers construct the metadata needed to support decision-making about the fitness of data for re-use – for specified quality levels, specific analysis methods, and appropriate visualization tools.

Data Evaluation Interface
Wouldn’t it be great if we had a plug-and-play interactive context and quality visualization interface for scholars and visualization authors? So you could see if the data is what you need.

Visualization Authoring Systems
Easily customizable visualization authoring interfaces that can be deployed to match your research questions and allow publication of your discoveries.

What we still need – Back to Vision

Thank you

I hope we see much to inspire us in this session and the rest of the conference!