Paper at DC-ANZ 2005 (May 2005, Melbourne)
Young & Hughes, DC-ANZ 2005 1
If we’re not there yet, how far do we have to go?
A review of web metadata at The University of Melbourne
Eve Young, Metadata Coordinator
Information Acquisition and Organisation Section, Information Division
Baden Hughes, Research Fellow
Department of Computer Science and Software Engineering
The University of Melbourne
Overview
• Background
  - Web publishing policies circa 1999, 2001
  - Research projects
• Towards standardization
  - Dublin Core
  - UniMelb administrative metadata
• Broad scale compliance analysis
  - UniMelb web environment
  - DC Metadata on the UniMelb Web
  - UniMelb Metadata on the UniMelb Web
• Reflections and challenges for the future
Before Metadata on UniM site
• Existing standard (1999) not widely adopted
  - 9 metadata tags: expiryDate, maintainer, authoriser, author, description, keywords, lastModified, distribution, contentType
• Operational and implementation issues
  - Difficulty finding information
  - Suspected non-compliance
• Investigate and analyze
  - Manual research
Expiry Tag Analysis
• Expiry tag functionality important
• Analysis into non-compliance (608 pages)
• Only 27% of pages audited were compliant
• Of the remainder of pages reviewed, 441 had no date, or NA as value
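The audit categories on this slide (compliant, no date, NA) can be expressed as a small classifier. The following is a minimal sketch, not the audit procedure actually used: it assumes the expiry value is written as an ISO `YYYY-MM-DD` date and that a page is compliant when that date parses and has not yet passed. Neither the date format nor the exact compliance rule is stated in the slides, and `classify_expiry` and `audit` are hypothetical helpers.

```python
from collections import Counter
from datetime import date, datetime
from typing import Iterable, Optional


def classify_expiry(value: Optional[str], today: date) -> str:
    """Classify one page's expiry tag value into an (assumed) audit category."""
    if value is None or not value.strip():
        return "no_date"
    text = value.strip()
    if text.upper() in {"NA", "N/A"}:
        return "na"
    try:
        # Assumption: expiry dates are written as YYYY-MM-DD.
        expiry = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return "unparseable"
    return "compliant" if expiry >= today else "expired"


def audit(values: Iterable[Optional[str]], today: date) -> Counter:
    """Tally the categories over all audited pages."""
    return Counter(classify_expiry(v, today) for v in values)
```

Passing `today` explicitly keeps the tally reproducible, so an audit can be re-run later against the date on which the pages were originally collected.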
A to Z Index: Compliance Audit
• Audit of metadata on 78 web pages
• Highest compliance 84.6% (content type)
• Lowest 11.5% (expiry date)
• More unknown than known maintainers
• Default value tags had high degree of compliance
• Page specific tags (keywords) had lowest
Metadata Working Group
• Advise on implementation of a uniform approach to the creation of metadata
• Membership drew on expertise from across the university: academics, IT, web, metadata, and library
• Reviewed metadata standards: DC, IMS, AGLS
• Metadata use in large information-rich organizations, e.g. Australian Government, UK Government, UNSW libraries
UniMelb metadata standard
• 19 elements (meta tags) to describe and manage a resource
• 2003 revised standard endorsed by Information Strategy Committee
• Requirement on all University of Melbourne web pages
Why Dublin Core (besides this being a DC-ANZ conference)?
• ISO 15836
• 15 elements, simple
• International consensus
• Well supported
• Offers semantic interoperability
• Extensible
• Easy to implement in our environment
University of Melbourne DC Metadata Elements
DC.Title
DC.Creator
DC.Subject
DC.Description
DC.Publisher
DC.Contributor
DC.Rights
DC.Date
DC.Date.Modified
DC.Language
DC.Format
DC.Identifier
University of Melbourne Administrative Metadata Elements
UM.Creator.Email
UM.Authoriser.Name
UM.Authoriser.Title
UM.Maintainer.Name
UM.Maintainer.Department
UM.Maintainer.Email
UM.Date.ReviewDue
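To make the DC and UM element lists concrete, the sketch below renders a metadata record as HTML meta tags. It assumes the common `<meta name="..." content="...">` convention for embedding Dublin Core in HTML. The slides do not show the exact markup the University standard prescribes, so the tag shape, the `render_meta_tags` helper, and the sample values are illustrative assumptions only.

```python
from html import escape

# Element names as listed on the two slides above (12 DC + 7 UM = 19).
DC_ELEMENTS = [
    "DC.Title", "DC.Creator", "DC.Subject", "DC.Description",
    "DC.Publisher", "DC.Contributor", "DC.Rights", "DC.Date",
    "DC.Date.Modified", "DC.Language", "DC.Format", "DC.Identifier",
]
UM_ELEMENTS = [
    "UM.Creator.Email", "UM.Authoriser.Name", "UM.Authoriser.Title",
    "UM.Maintainer.Name", "UM.Maintainer.Department",
    "UM.Maintainer.Email", "UM.Date.ReviewDue",
]


def render_meta_tags(record: dict) -> str:
    """Render the known elements present in `record` as meta tags, in list order."""
    lines = []
    for name in DC_ELEMENTS + UM_ELEMENTS:
        if name in record:
            # escape() also converts quotes, so values are safe inside attributes.
            content = escape(str(record[name]))
            lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(render_meta_tags({
        "DC.Title": "Example page",
        "UM.Maintainer.Email": "maintainer@example.org",
    }))
```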
Broad Scale Compliance Analysis
• Full crawl of the University of Melbourne web presence in March 2005
• The crawl used the Internet Archive's Heritrix suite (an open-source, extensible, web-scale, archival-quality web crawler)
• A total of 57 GB of data was retrieved from www.unimelb.edu.au and its associated sub-domains over a period of 146 hours
• 1.4 million documents were retrieved
The UniMelb Web Environment
[Figure: Format Demographics of UniMelb Web. Chart legend: text/html, image/jpeg, image/gif, application/pdf, text/plain, application/msword, application/msexcel, application/mspowerpoint, application/postscript, others.]
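A format breakdown of this kind can be derived by tallying the content type recorded for each crawled document. The sketch below is one possible way to do that, not the analysis actually performed: it assumes the input is simply an iterable of `Content-Type` strings, and the `format_shares` helper and its `top_n` parameter are hypothetical. It groups everything beyond the most frequent types under "others", as the chart legend does.

```python
from collections import Counter
from typing import Dict, Iterable


def format_shares(content_types: Iterable[str], top_n: int = 9) -> Dict[str, float]:
    """Return the percentage share of each of the top_n MIME types, plus 'others'."""
    counts = Counter(
        # Drop parameters such as "; charset=utf-8" and normalise case.
        ct.split(";", 1)[0].strip().lower()
        for ct in content_types
        if ct and ct.strip()
    )
    total = sum(counts.values())
    shares: Dict[str, float] = {}
    other = 0
    for mime, n in counts.most_common():
        if len(shares) < top_n:
            shares[mime] = 100.0 * n / total
        else:
            other += n
    if other:
        shares["others"] = 100.0 * other / total
    return shares
```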
Observations
• HTML is no longer the dominant format
  - UniMelb’s metadata creation processes are primarily oriented at creating Dublin Core-extended metadata as simple HTML meta tags
  - Pure HTML content is in fact no longer the dominant format
• Web-accessibility of “non-native” document types
  - Many MIME types are not addressed by the UniMelb guidelines for metadata creation, yet offer some potential for restricted metadata inclusion
  - Emerging document types such as XML and RDF do not easily allow for the embedding of metadata internal to the resource
• The emergence of dynamic documents
  - Analysis of the “All Other” category shows that many (~38%) of these documents are dynamic, generated server side on demand by PHP, ASP, JSP etc.
  - No thought is currently given to the inclusion of metadata in automatically generated documents of this type
DC Metadata on the UniMelb Web
[Figure: Usage of DC Elements. Bar chart of % coverage (axis 0.0 to 90.0) by DC metadata element: DC.Title, DC.Creator, DC.Subject and DC.Description, DC.Publisher, DC.Contributor, DC.Rights, DC.Date, DC.DateModified, DC.Language, DC.Format, DC.Identifier, Overall Average. Series: % HTML pages with metadata in <HEAD>; % HTML pages with metadata in <BODY> element; total % HTML pages containing metadata in either <HEAD> or <BODY>.]
Observations
• Alignment with broad Dublin Core norms
  - These figures are generally in line with the findings of broad scale Dublin Core-oriented metadata communities
    - OAI (Ward, 2003)
    - OLAC (Hughes, 2004)
UM Metadata on the UniMelb Web
[Figure: Usage of UM Elements. Bar chart of % coverage (axis 0.0 to 80.0) by UM metadata element: UM.Date.ReviewDue, UM.Creator.E, UM.Authoriser.Title, UM.Maintainer.Name, UM.Maintainer.Email, Overall Average. Series: % HTML pages with metadata in <HEAD>; % HTML pages with metadata in <BODY> element; total % HTML pages containing metadata in either <HEAD> or <BODY>.]
Observations
• Differences between core Dublin Core and institutional metadata
  - Institutional metadata is more regularly contributed, despite the automatic creation of some DC by content creation applications
• Correlation with manual inspection statistics
  - These experiments suggest that trends detected in earlier focused studies such as Zajacek (2002a, 2002b) are valid
• Differences between metadata included in <HEAD> vs <BODY> elements
  - For institutional metadata, there is a strong tendency to include metadata in the <BODY> elements, where it is immediately visible on the page, rather than in the <HEAD> elements, which may reflect the emphasis of the training materials
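The <HEAD> versus <BODY> comparison can be sketched with Python's standard `html.parser` module. This is an illustration only, not the analysis code used in the study: it assumes metadata is detected as `<meta name="...">` elements and attributed to whichever of the two sections is open when the tag is seen. The slides do not say how metadata in <BODY> was actually identified, and `MetaLocator` and `coverage` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable


class MetaLocator(HTMLParser):
    """Record which named meta elements occur inside <head> and inside <body>."""

    def __init__(self) -> None:
        super().__init__()
        self.section = None  # "head", "body", or None when outside both
        self.found = {"head": set(), "body": set()}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif tag == "meta" and self.section in self.found:
            name = dict(attrs).get("name")
            if name:
                self.found[self.section].add(name.lower())

    def handle_endtag(self, tag):
        if tag in ("head", "body"):
            self.section = None


def coverage(pages: Iterable[str], element: str) -> Dict[str, float]:
    """Percentage of pages carrying `element` in head, in body, and in either."""
    key = element.lower()
    counts = {"head": 0, "body": 0, "either": 0}
    total = 0
    for html_text in pages:
        total += 1
        locator = MetaLocator()
        locator.feed(html_text)
        in_head = key in locator.found["head"]
        in_body = key in locator.found["body"]
        counts["head"] += in_head
        counts["body"] += in_body
        counts["either"] += in_head or in_body
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}
```

Element names are compared case-insensitively, since pages written by many different authors are unlikely to capitalise `DC.Title` consistently.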
Reflections and Challenges 1
• % coverage of HTML sources is relatively low, but it does account for a large number of documents (650K total in this survey)
  - Many documents are non-compliant for identifiable reasons, e.g. exclusion of metadata in template-based pages such as those within the learning management system
• External search engines no longer use meta tag information but perform full text indexing (see Richardson, 2004)
  - Benefit to general web searchers of institutional metadata creation is almost zero
  - May still retain currency for other administrative purposes, e.g. the authorization of web content publication
  - Need to distinguish the institution's need for web content management, and how metadata facilitates this goal, from the web search experience in general, and to decouple the two
Reflections and Challenges 2
• Potential impact of institution-wide Content Management System
  - Existing metadata standards failed to address distributed content creation (or underestimated the pervasive effect of making “publish to web” type technologies available to all staff)
  - Opportunity to increase compliance with new generation tools and practices
• Revisiting motivation for web metadata: search assistance or administrative processes?
  - Changes to work practices required for web publishing authorisation
• “Compliance audit” service
  - For run time verification of metadata compliance, with a “watermarking” service which automatically gives its imprimatur to compliant pages in the absence of manual inspection
  - Requires the formalisation of University of Melbourne metadata as a true Dublin Core application profile with an associated formal schema, and the creation of controlled vocabularies for extensions
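A run time compliance check of the kind proposed here could start from something like the sketch below. It is not a description of an existing service. It assumes that all 19 elements of the standard are mandatory and that an element counts as present when a `<meta>` tag with that name has non-empty content; the slides specify neither rule, and `audit_page` is a hypothetical helper.

```python
from html.parser import HTMLParser
from typing import Dict, Sequence

# Assumption: every element named in the standard is treated as mandatory.
REQUIRED = (
    "DC.Title", "DC.Creator", "DC.Subject", "DC.Description",
    "DC.Publisher", "DC.Contributor", "DC.Rights", "DC.Date",
    "DC.Date.Modified", "DC.Language", "DC.Format", "DC.Identifier",
    "UM.Creator.Email", "UM.Authoriser.Name", "UM.Authoriser.Title",
    "UM.Maintainer.Name", "UM.Maintainer.Department",
    "UM.Maintainer.Email", "UM.Date.ReviewDue",
)


class _MetaCollector(HTMLParser):
    """Collect non-empty meta name/content pairs from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.values: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content and content.strip():
            self.values[name.lower()] = content.strip()


def audit_page(html_text: str, required: Sequence[str] = REQUIRED) -> dict:
    """Report whether a page carries every required element, and which are missing."""
    collector = _MetaCollector()
    collector.feed(html_text)
    missing = [name for name in required if name.lower() not in collector.values]
    return {"compliant": not missing, "missing": missing}
```

A "watermarking" step could then act on the `compliant` flag, while the `missing` list gives page maintainers a concrete correction list.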
Reflections and Challenges 3
• Large number of pages which will be updated only at irregular intervals
  - Substantially increasing the coverage of institutional metadata in the short to medium term may require the deployment of an automated metadata creation service such as DCdot (Powell, 2000) or an augmentation service such as OLACdot (Hughes, 2005)
  - Early experiments with DCdot show significant promise, but need to be more carefully evaluated in light of recent research in the area (Greenberg, 2005)
• Training of critical importance
  - Significant effort was invested in training key personnel and in propagating the institutional standards and training notes online, but only a small number of face to face classes have been held
Conclusion
• UniMelb was identified as one of the leading universities with regard to metadata implementation (Ivanova, 2004)
• Empirical evidence suggests that The University of Melbourne still faces significant challenges
• Compliance in the age of moving standards: over a 2 year period the evolution of external standards, web content creation tools, and web content demography is significant
• Strong basis for institutional metadata was formed by the adoption of Dublin Core
  - The disparate content creation environment and rapidly changing composition of web content have induced a less than satisfactory application of these standards
• Automated metadata creation and assessment, forming a significant component of future work, may address this problem in part
Questions / Comments
• http://eprints.unimelb.edu.au/archive/00000983
• Eve Young, Metadata Coordinator, Information Acquisition and Organisation Section, Information Division, [email protected]
• Baden Hughes, Research Fellow, Department of Computer Science and Software Engineering, [email protected]