An Apache Module for Generating Self-Describing Web Resources
Joan A. Smith
Michael L. Nelson
Alliance for Information Science and Technology Innovation
East Coast Meeting
16 October 2007
16 October 2007 {jsmit,mln}@cs.odu.edu Slide # 2
Web Site Preservation: 2 Problems
The counting problem: How many pages are on that site?
To save it you have to find it
The representation problem: What’s that page all about?
Future use requires understanding
Guess the bean count, win the jar
A Crawler’s View of the Web Site
Web root: http://www.foo.edu/
[Figure: site tree below the web root, with some pages marked X]
The crawler has run into the counting problem, and doesn’t know it.
Pages Out of Crawler Reach
• Some pages linked from web root
• Some dynamic content
• Some orphaned pages
• Some pages protected with access controls
• Some pages too deep for a particular crawler
Sitemap protocol attempts to address these problems
Sitemap Protocol
• Google-driven initiative
  – Derives from earlier concept of graphical “site map” and alphabetical “site index”
  – Google standardized the protocol
• XML-formatted file
  – Created by webmaster of a web site
  – Can be “tweaked” to include/exclude files based on a wide variety of criteria
• Simplifies site resource exposure to search engines (i.e., to their robots)
• Supported by Google, MSN, Yahoo and ASK
But it doesn’t address inherent HTTP limitations
Web Crawling & The Counting Problem
Counting Problem
• HTTP cannot ask for only new or modified resources
  – Conditional GET by datestamp or ETag has limited benefit
  – Cannot get a list of pages that have been deleted, changed, or added
  – Each resource must be requested, one at a time, by name
• There is no “SELECT *” in HTTP
  – Crawlers cannot request a list of all URLs for the site
  – Crawlers can only GET one resource at a time, by name
  – HTTP cannot give a crawler a list of resources it has
Undiscovered resources will not be refreshed
Search Engine Solution
• Sitemaps
  – XML document lays out site structure (cf. http://www.sitemaps.org/protocol.php)
      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://www.example.com/</loc>
          <lastmod>2005-01-01</lastmod>
          <changefreq>monthly</changefreq>
          <priority>0.8</priority>
        </url>
      </urlset>
  – Provides minimal, crawl-oriented metadata (update frequency, etc.)
  – Can include dynamic URLs
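As an illustration of how a crawler might consume such a Sitemap, here is a minimal Python sketch. The helper name `list_sitemap_urls` and the embedded sample are assumptions made for this example; only the namespace and element names come from the Sitemap excerpt above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the Sitemap protocol example.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""


def list_sitemap_urls(sitemap_xml):
    """Return (loc, lastmod) pairs for every <url> entry (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in list_sitemap_urls(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The crawler still has to GET each listed URL one at a time; the Sitemap only tells it which names to ask for.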
A Web Page: Behind the Scenes
HTTP: Behind the Scenes
Resource example: http://foo.edu/jackJill.jpg
• Note the limited metadata from the HTTP GET request
• Binary content is not human-readable
• We only “GET” one resource at a time
Additional metadata could help the digital image archeologist of the future:
  – Color map
  – NISO information
  – Base64 encoding of resource
  – MD5 or other hash function
  – Subject matter
And metadata that could help preserve the Jack and Jill document content:
  – Language
  – Script type and version
  – Document summary/abstract
  – Keyword extraction
  – Lexical signature
% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà [15986 bytes of binary JPEG data, not human-readable]
Connection closed by foreign host.
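To make the "limited metadata" point concrete, the following Python sketch parses the response head from the telnet session and lists everything HTTP reports about the resource. The function name `parse_response_head` is an assumption for this example; the header values are copied from the session above.

```python
from email.parser import Parser

# Response head copied from the telnet session (status line plus headers).
RAW_RESPONSE_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 11 Jun 2007 16:49:25 GMT\r\n"
    "Server: Apache/1.3.33 (Unix)\r\n"
    "Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT\r\n"
    'ETag: "5800535-3e72-4312f924"\r\n'
    "Accept-Ranges: bytes\r\n"
    "Content-Length: 15986\r\n"
    "Content-Type: image/jpeg\r\n"
)


def parse_response_head(raw_head):
    """Split a raw HTTP response head into (status code, header dict)."""
    status_line, _, header_block = raw_head.partition("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = Parser().parsestr(header_block, headersonly=True)
    return int(code), dict(headers.items())


if __name__ == "__main__":
    status, headers = parse_response_head(RAW_RESPONSE_HEAD)
    print(status)
    for name, value in headers.items():
        print(name, "=", value)
```

Seven headers is all the server says: nothing about color map, language, or subject matter.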
Web Crawling & The Representation Problem
• HTTP provides limited metadata
  – Server-client communication, focused on the here and now
  – Not concerned with preservation-related information
  – Format obsolescence is not addressed
  – File content comprehension is the client’s problem
• Client and Server must be configured to handle each resource “type”
  – Default includes most common current file types
  – Older types, unusual files may get an error message from the client browser
Archives: Metadata-Rich
• Each model handles resource representation information in its own way
• Metadata helps ensure long-term persistence, availability of resources
The MPEG-21 DIDL Model
• A complex-object model combining the resource and its metadata together
Post-Harvest Processing required for ingestion
Harvest → Analyze/Examine/Process → Archive
Often a combination of manual and automated input
Metadata Generation Utility Examples
Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
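Two of the utilities in the table, MD5 and File Magic, are simple enough to sketch in a few lines of Python. This is only an illustration: the signature table below is a tiny hand-picked subset chosen for the example, not the database that the `file` utility uses, and the function names are hypothetical.

```python
import hashlib

# A few well-known leading byte signatures (illustrative subset only).
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF8", "image/gif"),
]


def sniff_type(data):
    """Guess a MIME type from the first bytes of the content, or None."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None


def md5_hex(data):
    """Return the MD5 message digest of the content as a hex string."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    sample = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # starts like a JPEG file
    print(sniff_type(sample), md5_hex(sample))
```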
# Webs >> # Archiving Institutions
Archivist
Web Sites
Typical ingest scenario
Harvest with Metadata
Metadata Magic: Get the resource together with its metadata
Harvest Pre-processed resource
Harnessing the Web Server
Archivist: mod_oai GetRecord request and response
User: standard GET request and response
Self-describing resource
Configuring the Web-Server for Metadata Magic
http://foo.edu/example.html
• No impact to everyday users
• Regular “GET” => “regular” response
• OAI-PMH “GetRecord” => “crate” response
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate
• Standard Apache “Location” directive
• mod_oai module configured with “plug-ins”
• Scripts, utilities, etc. can vary by MIME type
<Location /modoai>
  SetHandler modoai-handler
  modoai_plugin "jhove" "/opt/jhove/jhove -m jpeg-hul %s" "/opt/jhove/jhove --v" "image/jpeg"
  modoai_plugin "ots" "/usr/local/bin/ots --summary %s" "/usr/local/bin/ots -v" "text/*"
  modoai_plugin "jhove" "/opt/jhove/jhove -m pdf-hul %s" "/opt/jhove/jhove --v" "application/pdf"
  modoai_plugin "pronom" "java -jar DROID.jar -L%s" "java -jar DROID.jar -v" "*/*"
</Location>
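The MIME-based selection in this configuration can be sketched in Python as follows. This is not mod_oai's implementation, only an illustration of the matching rule the slide describes; the plugin table mirrors the directive above (the exact `ots` flags are normalized from the slide and should be treated as an assumption), and `commands_for` is a hypothetical helper.

```python
import shlex
from fnmatch import fnmatchcase

# (label, command template, MIME pattern), mirroring the modoai_plugin lines.
PLUGINS = [
    ("jhove", "/opt/jhove/jhove -m jpeg-hul %s", "image/jpeg"),
    ("ots", "/usr/local/bin/ots --summary %s", "text/*"),  # flags are an assumption
    ("jhove", "/opt/jhove/jhove -m pdf-hul %s", "application/pdf"),
    ("pronom", "java -jar DROID.jar -L%s", "*/*"),
]


def commands_for(resource_path, mime_type):
    """Return (label, command) for every plugin whose pattern matches the MIME type."""
    selected = []
    for label, template, pattern in PLUGINS:
        if fnmatchcase(mime_type, pattern):
            # "%s" is replaced by the (shell-quoted) resource name.
            command = template.replace("%s", shlex.quote(resource_path))
            selected.append((label, command))
    return selected


if __name__ == "__main__":
    for label, command in commands_for("/var/www/jackJill.jpg", "image/jpeg"):
        print(label, "->", command)
```

A JPEG resource matches the first and last entries, an HTML page matches `text/*` and `*/*`, and any other type falls through to the catch-all `*/*` plugin only.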
6 Verbs of the OAI-PMH
Verb                  Function
Metadata about the repository:
  Identify              description of repository
  ListMetadataFormats   metadata formats supported by repository
  ListSets              sets defined by repository
Harvesting verbs:
  ListIdentifiers       OAI unique ids contained in repository
  ListRecords           listing of N records
  GetRecord             listing of a single record
Most verbs can take qualifying arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)
• Compatible with HTTP
• Supports OAIS model
• Can support complex object model
OAI-PMH can help resolve the Counting and Representation Problems
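Because OAI-PMH requests are ordinary HTTP GETs with a `verb` argument, building one takes little more than URL encoding. The Python sketch below assumes a hypothetical helper `build_oai_request`; the base URL and arguments are the ones used elsewhere in these slides.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_request(base_url, verb, arguments=None):
    """Return a GET URL for the given verb and qualifying arguments."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown OAI-PMH verb: %s" % verb)
    params = {"verb": verb}
    params.update(arguments or {})
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_oai_request(
        "http://foo.edu/modoai/",
        "GetRecord",
        {"identifier": "http://foo.edu/jackJill.jpg", "metadataPrefix": "crate"},
    ))
```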
OAI-PMH Verbs and mod_oai:Addressing The Counting & Representation Problems
• Counting Problem
  – “ListIdentifiers” provides equivalent of Sitemap
  – “ListRecords” response serializes the site’s contents using a single request
  – Qualifiers (by date range, by MIME set) enable customized crawls
  – Simplifies update semantics
• Representation Problem
  – “ListRecords” and “GetRecord” responses can include a wealth of metadata
    • MPEG-21 DIDL
    • CRATE
  – Allows more sophisticated crawling and archival preparation
OAI-PMH verbs address inherent HTTP limitations
mod_oai lets the web server provide self-describing resources
What is a “Self-Describing” Resource?
Standard HTTP Headers:
  Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
  ETag: "5800535-3e72-4312f924"
  Content-Length: 15986
  Content-Type: image/jpeg
PLUS: Output from built-in utilities:
EXIF TOOL:
  File Name: 103_0315.JPG
  Camera Model Name: Canon EOS DIGITAL REBEL
  Date/Time Original: 2003:09:30 13:37:51
  Shooting Mode: Sports
  Shutter Speed: 1/2000
  Aperture: 7.1
  Metering Mode: Evaluative
  Exposure Compensation: 0
  ISO: 400
  Lens: 75.0 - 300.0mm
  Focal Length: 300.0mm
  Image Size: 3072x2048
  Quality: Normal
  Flash: Off
  White Balance: Auto
  Focus Mode: AI Servo AF
  Contrast: +1
  Sharpness: +1
  Saturation: +1
  Color Tone: Normal
  File Size: 1606 kB
  File Number: 103-0315
JHOVE TOOL:
  Date: 2007-06-18 14:35:50 EDT
  RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
    ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
    LastModified: 2007-01-16 23:09:07 EST
    Size: 27750
    Format: JPEG
    Version: 1.00
    Status: Well-Formed and valid
    SignatureMatches: JPEG-hul
    MIMEtype: image/jpeg
    Profile: JFIF
    JPEGMetadata:
      CompressionType: Huffman coding, Baseline DCT
      Images:
        Number: 1
        Image:
          NisoImageMetadata:
            MIMEType: image/jpeg
            ByteOrder: big-endian
            CompressionScheme: JPEG
            ColorSpace: YCbCr
            SamplingFrequencyUnit: inch
            XSamplingFrequency: 33
            YSamplingFrequency: 26
            ImageWidth: 172
            ImageLength: 146
            BitsPerSample: 8, 8, 8
            SamplesPerPixel: 3
          Scans: 1
          QuantizationTables:
            QuantizationTable:
              Precision: 8-bit
              DestinationIdentifier: 0
      Comments: LEAD Technologies Inc. V1.01
      ApplicationSegments: APP0
File/Magic:
  JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
MD5 Hash:
  58a54e8638db432f4515eedf89f44505
… CRATE: Wrapped together with the resource in simple XML
Apache: mod_oai Location Directive
Directive                               Explanation
<Location /modoai>                      Apply these rules to http://foo.edu/modoai
  SetHandler modoai-handler             Use modoai to process these requests
  modoai_plugin                         Plugin element: one utility per element, written on a single text line
    "jhove"                             Each has a label, used as a metadata “ID tag”
    "/opt/jhove/jhove -m jpeg-hul %s"   The command line or script to call the utility
    "/opt/jhove/jhove --v"              Include the version number of the installed utility
    "image/jpeg"                        Which MIME types should be analyzed (any JPEG); EOL here
  modoai_plugin "ots"                   Open Text Summarizer
    "/usr/local/bin/ots --summary %s"   “%s” means substitute resource name here
    "/usr/local/bin/ots -v"
    "text/*"                            Use on all text (plain, HTML, XML, etc.) resources
  modoai_plugin "jhove"                 Another invocation of the JHOVE utility
    "/opt/jhove/jhove -m pdf-hul %s"    Note the different hul used here
    "/opt/jhove/jhove --v"              Report the version
    "application/pdf"                   Use on all PDF resources (only)
  modoai_plugin "pronom"                The PRONOM DROID tool
    "java -jar DROID.jar -L%s"
    "java -jar DROID.jar -v"            Report the version
    "*/*"                               Use this utility on every resource
</Location>
• Scripts
• Pipes
• Executables
• MIME-based selective processing
Building a CRATE
• URI, UUID
• Standard HTTP Headers
• Plug-In Metadata
• Base64-Encoded Resource
CRATE
CRATE ID
METADATA
RESOURCE
A simple model for self-describing web resources
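The four ingredients listed on this slide can be assembled with a few lines of Python. This is a sketch only: the element names `crateContent`, `crateMetadata`, `description`, `label` and `data` follow the GetRecord example in these slides, while the `crate` wrapper, the `uuid` and `httpHeaders` elements, and the function name are assumptions made for illustration and are not taken from the draft CRATE schema.

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri, http_headers, plugin_outputs, resource_bytes):
    """Wrap a resource and its metadata in one XML document (illustrative layout)."""
    crate = ET.Element("crate")  # hypothetical wrapper element

    # CRATE ID: the URI plus a UUID (here derived from the URI, an assumption).
    ET.SubElement(crate, "identifier").text = uri
    ET.SubElement(crate, "uuid").text = str(uuid.uuid5(uuid.NAMESPACE_URL, uri))

    # METADATA: standard HTTP headers followed by plug-in output.
    metadata = ET.SubElement(crate, "crateMetadata")
    headers = ET.SubElement(metadata, "httpHeaders")  # hypothetical element name
    for name, value in http_headers.items():
        ET.SubElement(headers, "header", name=name).text = value
    for label, output in plugin_outputs.items():
        description = ET.SubElement(metadata, "description")
        ET.SubElement(description, "label").text = label
        ET.SubElement(description, "data").text = output

    # RESOURCE: the Base64-encoded bytes of the resource itself.
    content = ET.SubElement(crate, "crateContent")
    ET.SubElement(content, "data").text = base64.b64encode(resource_bytes).decode("ascii")
    return ET.tostring(crate, encoding="unicode")


if __name__ == "__main__":
    print(build_crate(
        "http://foo.edu/jackJill.jpg",
        {"Content-Type": "image/jpeg"},
        {"file magic": "JPEG image data"},
        b"\xff\xd8\xff\xe0",
    ))
```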
A self-describing resource using mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>"jhove"</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>
            Date: 2007-06-18 14:35:50 EDT
            RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
            ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
            LastModified: 2007-01-16 23:09:07 EST
            Size: 27750
            Format: JPEG
            Version: 1.00
            Status: Well-Formed and valid
            SignatureMatches: JPEG-hul
            MIMEtype: image/jpeg
            Profile: JFIF
            JPEGMetadata: CompressionType: Huffman coding, Baseline DCT
            Images: Number: 1
            Image: NisoImageMetadata:
              MIMEType: image/jpeg
              ByteOrder: big-endian
              CompressionScheme: JPEG
              ColorSpace: YCbCr
              SamplingFrequencyUnit: inch
              XSamplingFrequency: 33
              YSamplingFrequency: 26
              ImageWidth: 172
              ImageLength: 146
              BitsPerSample: 8, 8, 8
              SamplesPerPixel: 3
            Scans: 1
            QuantizationTables: QuantizationTable: Precision: 8-bit DestinationIdentifier: 0
            Comments: LEAD Technologies Inc. V1.01
            ApplicationSegments: APP0
          </data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
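On the harvesting side, a response like this can be unpacked with a standard XML parser. The Python sketch below uses a shortened, made-up sample (a plain-text resource whose Base64 payload is "hello") rather than the JPEG response above; the element layout follows the slide, and `unpack_crate` is a hypothetical helper.

```python
import base64
import xml.etree.ElementTree as ET

# All elements inherit the OAI-PMH default namespace.
OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header><identifier>http://foo.edu/hello.txt</identifier></header>
      <crateContent>
        <mimeType>text/plain</mimeType>
        <data>aGVsbG8=</data>
      </crateContent>
      <crateMetadata>
        <description><label>file magic</label><data>ASCII text</data></description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def unpack_crate(response_xml):
    """Return (identifier, resource bytes, {plug-in label: output})."""
    root = ET.fromstring(response_xml)
    record = root.find("oai:GetRecord/oai:record", OAI)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI)
    encoded = record.findtext("oai:crateContent/oai:data", namespaces=OAI)
    metadata = {}
    for description in record.findall("oai:crateMetadata/oai:description", OAI):
        label = description.findtext("oai:label", namespaces=OAI)
        metadata[label] = description.findtext("oai:data", namespaces=OAI)
    return identifier, base64.b64decode(encoded), metadata


if __name__ == "__main__":
    print(unpack_crate(SAMPLE_RESPONSE))
```

One request thus yields the resource bytes and every plug-in's output together, with no post-harvest analysis step needed to obtain them.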
Automatic, Best-Effort Metadata
• Automatic
  – Generated at time of dissemination
  – Integrates preservation functions with the web server
• Unverified
  – Utility results are not cross-checked
  – Output of analyses goes directly into XML response
• Undifferentiated
  – No categorization of output
  – Resource and metadata form complex-object response
A simple, easy-to-implement option for improving available preservation metadata for web resources
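The "automatic, unverified, undifferentiated" behavior can be sketched in Python as running a utility and copying whatever it prints into a `description` element. This is an illustration of the idea, not mod_oai's code: `run_plugin` is a hypothetical helper, and the demo uses the Python interpreter itself as a stand-in for a real metadata utility.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_plugin(label, argv, timeout=10):
    """Run one utility and wrap its raw output in a <description> element."""
    description = ET.Element("description")
    ET.SubElement(description, "label").text = label
    ET.SubElement(description, "exec").text = " ".join(argv)
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        output = result.stdout  # passed through unverified and uncategorized
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Best effort: a failing utility must not block the response.
        output = "plugin failed: %s" % exc
    ET.SubElement(description, "data").text = output
    return description


if __name__ == "__main__":
    # Stand-in utility: a one-line Python script that prints a fixed string.
    element = run_plugin("demo", [sys.executable, "-c", "print('JPEG image data')"])
    print(ET.tostring(element, encoding="unicode"))
```

Catching the failure and recording it, rather than aborting, is the design choice that makes the metadata "best effort": the crate is still delivered even when one utility is missing or times out.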
Preservation & Metadata
[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More). Labeled points: HTTP/HTML; Automatic metadata utilities/CRATE; Archival Information Package (AIP)]
Current Status
• mod_oai Open Source release at project’s completion
• Draft CRATE schema definition (XSD)
• Metrics Collection & Evaluation
  – Impact of utilities on web server performance
  – Examine utility compatibility and issues
  – Address security concerns
• Native Utility Efficiency
  – Language dependent (Java, C)
  – Improvements may depend on external pressure
• Security
  – Metadata vs information exposure risk
  – Access controls
Demo
AT MODOAI.ORG:
http://www.modoai.org/demos.html
Further Information
• The mod_oai project home page: http://www.modoai.org/
• JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
• IWAW 2007: CRATE: A Simple Model For Self-Describing Web Resources
• Authors’ webs:
  • http://www.cs.odu.edu/~mln/pubs/
  • http://www.joanasmith.com/pubs.html
Supplementary Slides
Robot Crawls of A Large, Deep Web
Google example here: http://www.joanasmith.com/deepWeb/animOdu2.gif
Addressing the Counting Problem Using OAI-PMH via mod_oai
Advantages for Crawler:
• Single request itemizes all resources in web tree: ListIdentifiers
• Can refine by MIME set, Datestamp
Original modoai was limited:
• No Dynamic URLs
• Web root tree only
• Same metadata as HTTP
Basic request: http://www.foo.edu/modoai/?verb=ListIdentifiers&metadataPrefix=oai_dc
Enhanced request: &from=2006-09-15&set=mime:video:mpeg
New version:
• Utilizes sitemap files
• Can include Dynamic URLs
• Rich metadata possibilities via “plugins”
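The basic and enhanced ListIdentifiers requests above can be generated programmatically. In this Python sketch the mapping from a MIME type such as video/mpeg to the set name mime:video:mpeg is inferred from the examples in these slides and should be read as an assumption; both function names are hypothetical.

```python
from urllib.parse import urlencode


def mime_to_setspec(mime_type):
    """Map 'video/mpeg' to 'mime:video:mpeg' (pattern inferred from the slides)."""
    major, _, minor = mime_type.partition("/")
    return "mime:" + major + ":" + minor


def list_identifiers_url(base_url, metadata_prefix="oai_dc", from_date=None, mime_type=None):
    """Build a ListIdentifiers request, optionally refined by date and MIME set."""
    params = [("verb", "ListIdentifiers"), ("metadataPrefix", metadata_prefix)]
    if from_date:
        params.append(("from", from_date))
    if mime_type:
        params.append(("set", mime_to_setspec(mime_type)))
    # urlencode percent-encodes the ':' characters in the set name.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base = "http://www.foo.edu/modoai/"
    print(list_identifiers_url(base))
    print(list_identifiers_url(base, from_date="2006-09-15", mime_type="video/mpeg"))
```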
Web Server Configuration: “conf” file
### Section 1: Global Environment
#
ServerType standalone
ServerRoot "/etc/httpd"
PidFile /var/run/httpd.pid
ResourceConfig /dev/null
AccessConfig /dev/null
Timeout 300
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 15
MinSpareServers 16
MaxSpareServers 64
StartServers 16
MaxClients 512
MaxRequestsPerChild 100000

### Section 2: 'Main' server configuration
#
Port 80

<IfDefine SSL>
  Listen 80
  Listen 443
</IfDefine>

User www
Group www
ServerAdmin [email protected]
ServerName www.openna.com
DocumentRoot "/home/httpd/ona"

<Directory />
  Options None
  AllowOverride None
  Order deny,allow
  Deny from all
</Directory>

<Directory "/home/httpd/ona">
  Options None
  AllowOverride None
  Order allow,deny
  Allow from all
</Directory>

<Files .pl>
  Options None
  AllowOverride None
  Order deny,allow
  Deny from all
</Files>

<IfModule mod_dir.c>
  DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>

#<IfModule mod_include.c>
#Include conf/mmap.conf
#</IfModule>

UseCanonicalName On

<IfModule mod_mime.c>
  TypesConfig /etc/httpd/conf/mime.types
</IfModule>

DefaultType text/plain
HostnameLookups Off

• Operational Rules
• Modules (mod_perl, etc.)
• Security
• Virtual Hosts