metadata and information visualization naomi dushay cornell information science national science...
Post on 21-Dec-2015
Metadata and Information Visualization
  Naomi Dushay
  Cornell Information Science
  National Science Digital Library
What is NSDL? National Science Digital Library
  Purpose: Educational
    broad definition of Science: also Technology, Engineering, Mathematics, etc.
    Production
    Research
  Users: teachers, students, researchers, general public
    K-gray
  http://nsdl.org
  http://comm.nsdl.org Virtual communities
NSDL: Metadata Aggregator
  Centralized Metadata Repository
  Two-tiered model: collections & items
    Item records harvested from collections
  Diverse metadata formats and granularity levels
  [Diagram: Metadata Repository containing collections, each collection containing items]
NSDL Architecture
  [Diagram labels: resources, Search Service, UI]
Goal: Provide Normalized Metadata
  Why?
    Quality of NSDL services (e.g. search results, or UI display)
    Enhance predictability of metadata for reharvesting services
    Improve metadata quality, when possible
  How?
Metadata Normalization Challenges
  Broad content
    Types of resources
    Topics
  Metadata Quality
    Wildly inconsistent (what fields are used, what info is present)
    Missing information
    Consistent, controlled vocabularies? Fuggedaboutit
  Disparate Quantities (by subject, by collection)
    7 vs. 300,000 items
  Virtual Communities
    Within communities, no agreement on needs
  Reduce human effort to keep costs down
Metadata in the MARC World
  Relatively controlled, closed system with strong community
  Comprehensive and current documentation
  Edit checks at MARC application and bibliographic utility levels
  Routine review at creation point
  Random sampling at import/export
  Trusted suppliers
Metadata Wild West
  Scattered community with many working in isolation, few with relevant background in describing resources
  Wide variety of resources to describe
  Insufficient documentation and training available
  Harvesting model developed well before notion of data quality
NSDL Harvesting Model
  [Diagram labels: collection AAA metadata and collection BBB metadata, each behind an OAI server; NSDL Metadata Repository (MR), where metadata is scrubbed & normalized; NSDL MR OAI server; NSDL Search Service (http://nsdl.org); NSDL Archive Service]
Continuum of Approaches (1)
  Random sampling (XMLSpy)
    Advantages
      Includes some formatting and color coding
    Disadvantages
      Assumes consistency/predictability
      Difficult to determine extent of problems found
      Tedious, at best
Continuum of Approaches (2)
  Spreadsheets (Microsoft Excel)
    Advantages
      Better sorting and control by reviewer
    Disadvantages
      Unwieldy for large files
      Requires sustained focus from reviewer
      Requires translation into tab-delimited file
Continuum of Approaches (3)
  Visual Graphical Analysis (Spotfire)
    Advantages
      View of several data dimensions simultaneously
      Reviewer controls data display
      Tends to pull reviewer focus to anomalies
      Handles fairly large files at one time, while allowing subset views
      Display manipulation possible without programmers
    Disadvantages
      High cost of software
      Requires translation into tab-delimited file
Visual Graphical Analysis:
  Allows you to review ALL the information in the file THOROUGHLY and QUICKLY.
  With a mouse click or two, you can:
    Reassign which characteristics the axes represent in a scatter plot
    Assign color, shape, and/or size to any characteristic to represent up to 5 dimensions simultaneously
    Display or not display specific values, including empty values, for any characteristic
    Display a selection of values and/or characteristics, and have the selection apply to other visualizations (e.g. tables and plots)
    View the information as a table, or in other representations
    Sort tables by characteristic column(s)
Metadata Analysis
  Spotfire demo
Metadata analysis questions:
  Are the elements’ values plausible?
  Are there any glaring errors that must be addressed?
Spotfire Table View
  Only DC Language elements are selected for display
  Sorted by element value
  DC Creator values in the language field!
  The ability to select interesting subsets of information – on the fly – allows for manageably sized, scrollable lists in which ALL values can be examined.
Metadata analysis questions:
  Are there non-empty values that supply no information and that may confuse end users?
  Are all the DC Date values in W3CDTF syntax?
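
The W3CDTF question above can also be asked programmatically. The following is a minimal sketch, not the Spotfire workflow the slides demonstrate: it assumes the DC Date values have already been pulled out of the records as plain strings, and the names `is_w3cdtf` and `non_w3cdtf` are made up for illustration. It checks syntax only (the six W3CDTF granularities), not whether the month or day numbers are in range.

```python
import re

# Shape of the six W3CDTF granularities:
#   YYYY, YYYY-MM, YYYY-MM-DD, and date plus Thh:mm[:ss[.s]] with a time zone.
# The time zone designator is "Z" or a +hh:mm / -hh:mm offset, and is
# required whenever a time is present.
W3CDTF = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)


def is_w3cdtf(value: str) -> bool:
    """True if the value has W3CDTF shape (syntax only, no range checks)."""
    return W3CDTF.fullmatch(value.strip()) is not None


def non_w3cdtf(values):
    """Distinct values that fail the syntax check, sorted for review."""
    return sorted({v for v in values if not is_w3cdtf(v)})
```

A reviewer could run `non_w3cdtf` over one collection's date values to get a short, sorted list of the values that need attention, much like the sorted table view in the demo.
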
Spotfire Table View
  Only DC Date elements are selected for display
  Sorted by element value
  Non-empty, “no information” values that may confuse end users
  The only W3CDTF syntax present is four digits.
Metadata analysis questions:
  Which of the values of the DC Type element are actually DCMIType terms?
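
As a rough illustration of the DC Type question, the sketch below partitions a list of type values into DCMI Type Vocabulary terms and everything else. It assumes the values are already extracted as strings; the function name `split_by_dcmitype` is hypothetical. Matching is exact and case-sensitive, so "text" is reported as a non-term even though "Text" is one.

```python
# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})


def split_by_dcmitype(values):
    """Partition distinct dc:type values into (DCMIType terms, other values)."""
    distinct = {v.strip() for v in values}
    valid = sorted(v for v in distinct if v in DCMI_TYPES)
    other = sorted(v for v in distinct if v not in DCMI_TYPES)
    return valid, other
```

The second list is the one worth reviewing: it holds both near misses that a safe transform could map to a term and free-text values that cannot be mapped.
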
Spotfire Table View
  Only DC Type elements are selected for display
  Sorted by element value
  Callouts: DCMIType term; not DCMIType terms
So …
  Visualizing metadata for analysis can:
    Improve efficiency and thoroughness of review efforts
    Improve predictability of transformation results
    Allow extensive data analysis without an ongoing need for programming support
How do we normalize metadata?
  Perform “safe” transforms to “smarten up” metadata
    XSL stylesheets -- from raw XML metadata to NSDL normalized XML metadata
  Principles:
    Do no harm (Don’t lose information)
    Add information, when possible
      Indicate schemes for valid values
    Remove meaningless text
      “…”, “not available”, “-”
      Empty elements
    Correct erroneous information
      “text/pdf” → “application/pdf”
    Remove characters that impede functionality or display
      Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …)
      Scrub URLs
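
The slides name XSL stylesheets as the mechanism for these transforms. The Python below is a swapped-in illustration of three of the principles (remove meaningless text, correct a known-bad format value, undo one level of double encoding), not the NSDL stylesheets themselves. It assumes each record has already been parsed into (element, value) pairs; the names `MEANINGLESS`, `MIME_FIXES`, `normalize_value`, and `normalize_record` are hypothetical, and the lookup tables contain only the examples given on the slide.

```python
# Values the slide lists as carrying no information (plus the empty string).
MEANINGLESS = {"", "...", "…", "-", "not available"}

# Known-bad format values and their corrections (only the slide's example).
MIME_FIXES = {"text/pdf": "application/pdf"}


def normalize_value(element, value):
    """Return the cleaned value, or None if it carries no information."""
    cleaned = " ".join(value.split())        # collapse stray whitespace
    # Assumption: values are already XML-parsed, so a literal "&amp;" left
    # in the text means the source was double encoded.
    cleaned = cleaned.replace("&amp;", "&")
    if cleaned.lower() in MEANINGLESS:
        return None
    if element == "format":
        cleaned = MIME_FIXES.get(cleaned.lower(), cleaned)
    return cleaned


def normalize_record(record):
    """Drop empty or meaningless pairs; keep everything else in order."""
    normalized = []
    for element, value in record:
        cleaned = normalize_value(element, value)
        if cleaned is not None:
            normalized.append((element, cleaned))
    return normalized
```

Each rule only removes text that supplies no information or replaces a value with a known correction, which is one way to read "do no harm"; anything the tables do not recognize passes through unchanged.
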
Goal 2: NSDL at a Glance
  What’s in the NSDL?
    Collections
    Subjects
  Intuitive UI
  Interactive GUI displays
NSDL at a Glance - Demos
  Spotfire
  Treemap http://www.smartmoney.com
  Star Tree http://nsdl.org/collections/ataglance/browseBySubject.html
How About Better Online Browsing?
Search and Browse
  False dichotomy!
  Many different user tasks
  Multiple ways to present results to users
    Should the presentation vary with quantity and/or context of results?
    e.g., “browse” may be a certain presentation of subject search results.
A Short List of User Tasks
  “Known Item Search”
  Single Item Search
  Answer to a Question
  x “Best” Resources
    Most informative? Easiest to access? Most appropriate to 8th graders?
  All Germane Resources
  Sense of the Information Space
  Serendipitous Finds
  … still looking for user needs and tasks analysis for information discovery …
  (Inputs may be fuzzy)
Problem Narrowed
  Improve evaluation of resource relevance without having to “go there”
    “See and Go Manifesto”, Ramana Rao
    Allow users to manipulate result presentation
  What do we miss when we can’t walk through the stacks?
    Sense of information space
    Serendipitous finds
Information Organization
  Books, Bookcases, Bookspines, Catalogs
    all evolved over time
    library staff/user needs
    bookstore staff/customer needs
    organized by subject
  We are taught how to use libraries
    how resources are organized
    how to use tools (card catalog, OPAC)
A Brief, Recent History of Information Discovery
  Card catalog (the world begins here)
  OPAC w/o keyword
  OPAC w/ keyword
  Internet, before WWW
  WWW before any cataloging
  Yahoo, Alta Vista, etc.
  Google
  (Open vs. Closed Stacks)
More Information Organization
  “binned” then
  (possibly) sub-binned then
  sorted (alphabetical, size, format …)
  Note tension between linear ordering and hierarchical classification
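
One way to see the tension between linear ordering and hierarchical classification is that a shelf order is just a sort on a composite key, while the hierarchy is a grouping of that same order. The sketch below is illustrative only: the field names `bin`, `sub_bin`, and `title` and both function names are assumptions, not anything from the prototype.

```python
from itertools import groupby


def shelve(items):
    """Flatten bin -> sub-bin -> sort into one linear shelf order."""
    return sorted(
        items,
        key=lambda it: (it["bin"], it["sub_bin"], it["title"].lower()),
    )


def as_hierarchy(shelf):
    """Recover the nested view {bin: {sub_bin: [titles]}} from shelf order."""
    tree = {}
    by_bins = groupby(shelf, key=lambda it: (it["bin"], it["sub_bin"]))
    for (bin_name, sub_bin), group in by_bins:
        tree.setdefault(bin_name, {})[sub_bin] = [it["title"] for it in group]
    return tree
```

The linear order forces every item into exactly one position; the nested view makes the classification visible but says nothing about what sits next to what across bin boundaries.
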
Location and Bookspine
  Bookspines
    Aid information discovery while allowing efficient book storage
    Surrogate for book
      surrogate closely related to resource
    Visual (color, size, shape …)
    Aimed at multiple audiences
      Bookstore staff
      Potential users
    NISO standard
Can We Improve Reality?
  A resource can be in multiple places at once
  2 or 3 dimensional organization instead of linear
  Organization can be dynamic
    User manipulability
  Can use proximity to indicate relationships
  Can we make visual surrogate richer?
    Semantic zoom for resource?
  Different users have different needs
    Visual surrogate … user selected?
  Staff can alter organization of stored resources without affecting users’ views
  Flexibility: organizing a very large collection has different constraints than organizing a small collection
The Big Questions
  How do we present shelves of bookspine information to our users within a monitor screen?
  What should a virtual bookspine look like?
  (demo)
Design Notes
  Tension
    intuitive, familiar vs. new capabilities, change
  Semantic zoom
    spec (partial bookspine info: color, position) → bookspine info → full metadata → resource itself
  User manipulability
  Text issues
    horizontal, not vertical
  Most materials in English
    default sort is alphabetical
Prototype Next Steps
  Click through for resource
  API
    Any fielded data
      Search results? Colored by rank?
    Any tree structure for any fielded data
  Multiple field values
  Jitter
  Scaling
    When too much, scroll it (a la Spotfire)?
  Table view (sortable, selectable, searchable, like Spotfire)
The Metadata Frontier
  Missing information
    Automatically generated (full text, iVia, kth nearest neighbor, support vector … based on training set)
    Via community (ENC?)
  Controlled vocabularies
    Automatic translation?
  Data mining?
  Value-added services to motivate providers
Thank You!
Goal 3 sub 1: Classification
  LCC files on order
  Star Tree?
  Windows Explorer?
  Other?
Metadata analysis questions:
  Which XML elements are present in the metadata and with what namespaces are they associated?
  Are there any non-DC elements in the metadata?
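
The two questions above amount to a tally of (namespace, element name) pairs, which is what the scatter plot on the next slide shows visually. A minimal sketch using Python's standard `xml.etree.ElementTree` follows. The function names are hypothetical, and the sketch treats only the Dublin Core elements namespace `http://purl.org/dc/elements/1.1/` as "DC", which is an assumption about what the slide means.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DC_NS = "http://purl.org/dc/elements/1.1/"


def element_counts(xml_text):
    """Count (namespace, local name) pairs for every element, root included."""
    counts = Counter()
    for el in ET.fromstring(xml_text).iter():
        tag = el.tag
        if tag.startswith("{"):
            # ElementTree writes qualified names as "{namespace}local".
            namespace, local = tag[1:].split("}", 1)
        else:
            namespace, local = "", tag
        counts[(namespace, local)] += 1
    return counts


def non_dc_elements(counts):
    """Sorted (namespace, local name) pairs outside the DC namespace."""
    return sorted(key for key in counts if key[0] != DC_NS)
```

Wrapper elements such as a record root will show up in the non-DC list, so a reviewer would normally skim that list rather than treat every entry as an error.
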
Element Names vs. Namespaces (Scatter Plot)
Metadata analysis questions:
  Do all the metadata records have
    DC Identifier
    DC Format
    …
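
A completeness check like this one can be sketched as a per-record report of absent elements. The code below assumes each record has been reduced to a mapping from element name to a list of values; the function name `missing_elements` is hypothetical. The list of required elements is a parameter because the slide elides the rest of its list; the usage shown passes only the two elements the slide names.

```python
def missing_elements(records, required):
    """Map record id -> required elements that are absent or blank.

    records:  {record_id: {element_name: [values, ...]}}
    required: iterable of element names every record should carry
    """
    report = {}
    for record_id, fields in records.items():
        missing = [
            name
            for name in required
            if not any(value.strip() for value in fields.get(name, []))
        ]
        if missing:
            report[record_id] = missing
    return report
```

An element whose only values are whitespace is counted as missing, on the view that an empty element gives the user nothing to act on.
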
Missing Elements (Scatter Plot)
  2 records without language element
  format element present inconsistently
  Easy to rescale axis on the fly and scroll through records
Metadata analysis questions:
  Exactly which elements use XML attributes?
  Do those elements also appear in the metadata without an attribute?
  (this approach can be used to isolate empty and non-empty elements)
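
The attribute questions can be approximated by counting, per element, how often it occurs with and without any XML attribute. This is a sketch under assumptions: the function names are hypothetical, and it relies on `xml.etree.ElementTree`, which does not report namespace declarations (`xmlns:...`) as attributes, so those never count as "with".

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def attribute_usage(xml_text):
    """Map element tag -> {'with': n, 'without': n} by attribute presence."""
    usage = defaultdict(lambda: {"with": 0, "without": 0})
    for el in ET.fromstring(xml_text).iter():
        usage[el.tag]["with" if el.attrib else "without"] += 1
    return dict(usage)


def mixed_usage(usage):
    """Tags that appear both with and without an attribute."""
    return sorted(
        tag for tag, n in usage.items() if n["with"] and n["without"]
    )
```

Elements returned by `mixed_usage` are the ones where a scheme or similar attribute is applied inconsistently.
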
Empty and Non-Empty Characteristics
  [Plot panels: all WITH an attribute present; all WITHOUT an attribute present]
  There are subject fields with and without the nsdl_dc:GEM attribute value
  There are no identifier fields without an attribute present
Data Problems: Missing Data
  Defining what’s “missing” partially dependent on nature of implementation
  Title and Description critical for user selection
  Format and Type particularly critical for NSDL filtering of search results
Data Problems: Incorrect Data
  In wrong element
    misunderstood definitions or careless crosswalking
  Nonsensical values (“promiscuous defaults”)
  Bad crosswalks (may be non-standard or too limited)
  Metadata record ID used for Identifier
Data Problems: Confusing Data
  Ambiguous separators (comma instead of semi-colon)
  HTML tagging within elements
  Encoding problems
    Double encoding: &
    Bad UTF-8
    Illegal XML characters (e.g., un-encoded ampersand)
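
Two of these problems are easy to flag mechanically once values have been parsed out of the XML. The sketch below is a rough heuristic, not a description of NSDL's validation step. It assumes the input is already-parsed element text, so a surviving entity such as `&amp;` suggests double encoding, and a surviving tag suggests HTML markup inside the element. The names `HTML_TAG`, `LEFTOVER_ENTITY`, and `confusing_flags` are hypothetical. Ambiguous separators are not covered, since a comma can be legitimate.

```python
import re

# An opening or closing tag left inside element text, e.g. <b> or </p>.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")

# A predefined or numeric entity still present after XML parsing.
LEFTOVER_ENTITY = re.compile(r"&(amp|lt|gt|quot|apos|#\d+);")


def confusing_flags(value):
    """List the heuristic problems found in one parsed element value."""
    flags = []
    if HTML_TAG.search(value):
        flags.append("html_tag")
    if LEFTOVER_ENTITY.search(value):
        flags.append("double_encoding")
    return flags
```

Flagged values are candidates for review; a title that really discusses the string `&amp;` would be a false positive.
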
Automated MR ingest process
  [Diagram labels: NSDL Collection Registration; provider OAI server; OAI Harvest; “raw” or “native” metadata; Validation (notify provider of problems; may need to halt processing); Normalize; Validation; normalized metadata; Metadata Repository; NSDL MR OAI server]