TDWG - SDD Oct 20, 2002
Search & Browsing
Beyond Keys over Standards
TDWG General Presentation Sessions
P. Bryan Heidorn
October 20, 2002
Why Structured Data Description?
• Interactive Keys
• Search
• Authoring
– Structure
– Vocabulary
Interactive Keys
• Interchange of information between programs
– E.g. DeltaAccess and LucID
SDD Supports text
• Hybrid text and defined characteristics
• Leaf characters may be defined while the flower characteristics are not.
• <Wording><Wording>
• Reuse of Flora for Search
• Automatic Markup and Classification
• Still can help people with ID
Search
• Warning: this is not an SDD standard but an SDD-inspired implementation based on the Australia discussion.
• SDD supports hybrid text and character-encoded documents
• How do you search a hybrid or partly encoded document?
• Some fields based on full text, some based on DB search
Search
• Biological Information Browsing Environment – (http://www.biobrowser.org)
• Search subparts of descriptions
– Find the blue flowers, not the Blue Mountains
– Flowers: … Blue
– Location: Blue Mountains
• Search Part Hierarchy
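The difference between full-text search and searching a subpart of a description can be sketched as follows. This is a minimal illustration, assuming each description is stored as a mapping from section name to text; the records, the section names and both helper functions are hypothetical and are not taken from BIBE.

```python
# Hypothetical records: each description is split into named sections.
descriptions = {
    "taxon-a": {"Flowers": "petals blue, 5-lobed", "Location": "coastal dunes"},
    "taxon-b": {"Flowers": "petals white", "Location": "Blue Mountains"},
}

def search_full_text(records, term):
    """Return ids of records where any section contains the term."""
    term = term.lower()
    return sorted(
        rid for rid, parts in records.items()
        if any(term in text.lower() for text in parts.values())
    )

def search_section(records, section, term):
    """Return ids of records whose named section contains the term."""
    term = term.lower()
    return sorted(
        rid for rid, parts in records.items()
        if term in parts.get(section, "").lower()
    )

print(search_full_text(descriptions, "blue"))           # matches flowers and mountains
print(search_section(descriptions, "Flowers", "blue"))  # matches flowers only
```

Restricting the match to one section is what lets a query for blue flowers skip a record that only mentions the Blue Mountains.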
Marking Structure
[Diagram labels: Taxon files, Fragment, Unlabeled fragments, Manual Labeling, Training Set, ML, Labeled Fragments, Reassembly, FNA Structured files]
Classification = Markup
• Support Vector Machine (Hong Cui)
• Any distance measure
• Clustering / Classification
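The idea that marking up a fragment is the same as classifying it can be sketched with a much simpler classifier than the Support Vector Machine named on the slide. The sketch below swaps in a nearest-neighbour rule over bag-of-words vectors with cosine similarity, purely to show how a distance measure turns into a tag; the training fragments and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical hand-labeled training fragments: (label, text).
TRAINING = [
    ("Description", "trees to 75m trunk bark brown leaves apex rounded"),
    ("Description", "seeds tan wing cones sessile scales bracts included"),
    ("Distribution", "moist coastal forests mountain slopes 0--1500m Calif. Oreg."),
    ("Distribution", "dry slopes and forests 500--2000m Idaho Mont. Wash."),
]

def vectorize(text):
    """Bag-of-words term counts for one fragment."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def label_fragment(text, training=TRAINING):
    """Assign the label of the most similar training fragment."""
    vec = vectorize(text)
    scored = [(cosine(vec, vectorize(sample)), label) for label, sample in training]
    return max(scored, key=lambda pair: pair[0])[1]

fragment = "coastal forests and slopes; Wash."
tag = label_fragment(fragment)
print(f"<{tag}>{fragment}</{tag}>")  # the predicted class becomes the markup tag
```

Any other similarity or distance function could be substituted for `cosine` without changing the rest of the sketch.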
Converting Documents to XML
[Diagram labels: FNA Composite files, Perl, FNA HTML files, FNA Family files, FNA Genus files, FNA Species files]
Simple FNA.dtd
<!ELEMENT FNA ANY>
<!ELEMENT NomenclaturalInfo (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Distribution (#PCDATA)>
<!ELEMENT Discussion (#PCDATA)>
<!ELEMENT Images (#PCDATA)>
<!ELEMENT ImagesMap (#PCDATA)>
<!ELEMENT Copyright (#PCDATA)>
<!ELEMENT Other (#PCDATA)>
<!ELEMENT Variability (#PCDATA)>
<!ELEMENT References0 (#PCDATA)>
<!ELEMENT References1 (#PCDATA)>
<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE FNA (View Source for full doctype...)>
<FNA>
  <Images>http://www.canis.uiuc.edu/~webvibe/fna_images/plates/I24900139.html</Images>
  <Other />
  <NomenclaturalInfo>3. Abies grandis (Douglas ex D. Don in Lambert) Lindley, Penny Cycl. 1: 30. 1833 - Grand fir, lowland white fir, sapin grandissime</NomenclaturalInfo>
  <NomenclaturalInfo>Pinus grandis Douglas ex D. Don in Lambert, Descr. Pinus [ed. 3] 2: unnumbered page between 144 and 145. 1832</NomenclaturalInfo>
  <Description>Trees to 75m; trunk to 1.55m diam.; crown conic, in age round, with age …. brown, often with reddish periderm visible in furrows bounded by hard flat opposite, light gray, sessile, apex rounded; scales ca. 2--2.5 × 2--2.5cm, densely … bracts included. Seeds 6--8 × 3--4mm, body tan; wing.</Description>
  <Distribution>Moist, coastal coniferous forests and mountain slopes; 0--1500m; B.C.; Calif., Idaho, Mont., Oreg., Wash.</Distribution>
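Once a treatment is marked up this way, its sections can be read back by element name. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the record above; the closing `</FNA>` tag is added here as an assumption so that the excerpt parses, and the `sections` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the record above. The closing </FNA> is an assumption
# added so that the fragment is well-formed.
RECORD = """<FNA>
  <NomenclaturalInfo>3. Abies grandis (Douglas ex D. Don in Lambert) Lindley</NomenclaturalInfo>
  <NomenclaturalInfo>Pinus grandis Douglas ex D. Don in Lambert</NomenclaturalInfo>
  <Description>Trees to 75m; trunk to 1.55m diam.</Description>
  <Distribution>Moist, coastal coniferous forests and mountain slopes; 0--1500m</Distribution>
</FNA>"""

def sections(xml_text):
    """Group the text of each top-level element by its tag name."""
    grouped = {}
    for child in ET.fromstring(xml_text):
        grouped.setdefault(child.tag, []).append((child.text or "").strip())
    return grouped

fna = sections(RECORD)
print(len(fna["NomenclaturalInfo"]))  # number of nomenclatural entries
print(fna["Distribution"][0])
```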
Structured Index
[Diagram labels: FNA Structured files, Swish-ex, Structured Index]
Swish-ex
• Based on Berkeley swish-e
• XML support added
• Hierarchical Key Structure Related to DTD
FNA
  Nomenclature 00
  Morphology 01
    Plant 01|000
    Bark 01|001
    Leaf 0|010
      Size 0|010|000
      Margin 0|010|001
      Apex 0|010|010
      Petiole 0|010|011
    Others.. 0|011…
  Distribution 10
  References 11
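One way such hierarchical keys could support searching a part hierarchy is prefix matching: if every child key begins with its parent's key, a single prefix selects a whole subtree. The sketch below assumes that property; the keys are written out in full for illustration (they are not copied from the slide, where some appear truncated), and `subtree` is a hypothetical helper rather than Swish-ex code.

```python
# Hypothetical index: key -> element path. Each child's key starts with its parent's key.
KEYS = {
    "01": "Morphology",
    "01|000": "Morphology/Plant",
    "01|001": "Morphology/Bark",
    "01|010": "Morphology/Leaf",
    "01|010|000": "Morphology/Leaf/Size",
    "01|010|001": "Morphology/Leaf/Margin",
    "10": "Distribution",
}

def subtree(keys, prefix):
    """Return paths for the node with this key and all of its descendants."""
    return [
        path for key, path in sorted(keys.items())
        if key == prefix or key.startswith(prefix + "|")
    ]

print(subtree(KEYS, "01|010"))  # Leaf and everything below it
print(subtree(KEYS, "10"))      # Distribution only
```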
XML Query Interface
Add the query string to a tree node, for example COMMON NAME = dog-face
The query propagates to the text field and is ready to be sent to the server/database
The results are fetched back from the database and shown on screen
Query results are displayed in BIBE as documents; the relationships between key terms are currently mapped as shown below.
Processing Thesaurus and Definition
[Diagram labels: Taxon files, Glossary files, Thesaurus Processor, Structured Thesaurus, Taxon Files With Definitions]
Thesaurus
• Automatic Query Expansion
• Vocabulary switching
• Novice to expert
• (acorn) becomes (acorn or glans)
• (roundish) becomes (obovate or roundish)
• (Illinois) becomes (ill or illinois)
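The three expansions above can be reproduced with a small lookup table. This is a minimal sketch of automatic query expansion, assuming a flat term-to-synonyms thesaurus; the `THESAURUS` table holds only the slide's three examples and `expand_query` is a hypothetical helper.

```python
# Hypothetical thesaurus holding only the three examples from the slide.
THESAURUS = {
    "acorn": ["glans"],
    "roundish": ["obovate"],
    "illinois": ["ill"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Replace each term with an OR-group of the term and its thesaurus entries."""
    expanded = []
    for term in query.lower().split():
        alternatives = sorted({term, *thesaurus.get(term, [])})
        if len(alternatives) > 1:
            expanded.append("(" + " or ".join(alternatives) + ")")
        else:
            expanded.append(term)  # no entry: leave the term unchanged
    return " ".join(expanded)

print(expand_query("roundish acorn Illinois"))
# (obovate or roundish) (acorn or glans) (ill or illinois)
```

The same table serves vocabulary switching in both directions, since a novice term and its expert equivalent end up in the same OR-group.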
Clicking on a set of documents reveals a list of files. By choosing a file, users investigate text and images from The Flora of North America.
Outline
• Run FNA files through Brill tagger (part-of-speech) to:
– tokenize
– tag the files, i.e. mark parts of speech: nouns, verbs, adjectives, adverbs
• Modify Brill lexicon (Wall Street Journal-based) to fit FNA files
• Run an extraction program through the tagged files
Marija Markovic, Adjective Extraction
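The final extraction step in the outline above can be sketched as follows, assuming the tagger writes tokens in `word/TAG` form with Penn Treebank tags (JJ, JJR, JJS for adjectives). The sample tagged string and the `extract_adjectives` helper are hypothetical and do not reproduce the project's extraction program.

```python
from collections import Counter

# Hypothetical tagger output in word/TAG form with Penn Treebank tags.
TAGGED = (
    "leaves/NNS abaxial/JJ ,/, bell-shaped/JJ flowers/NNS blue-violet/JJ "
    "and/CC larger/JJR ;/: leaves/NNS abaxial/JJ"
)

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def extract_adjectives(tagged_text):
    """Count the tokens whose part-of-speech tag marks an adjective."""
    counts = Counter()
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("/")  # split on the last slash
        if sep and tag in ADJECTIVE_TAGS:
            counts[word.lower()] += 1
    return counts

for word, n in sorted(extract_adjectives(TAGGED).items()):
    print(word, n)
```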
Why Adjectives
• Useful (with Nouns) for retrieval
• Complex plant morphology sections
• around 30 fields in FNA XML files
• Essential for other compounds
– art, definitions, links
Examples
conspicuous, foliaceous, #-keeled, #-locular, #-ribbed, #-ridged, #-veined, #-flowered, #-loculed, #-lobed, #-dentate,
#-merous, #-lobed, V-shaped, abaxial, apical, aromatic, axillary, basal, bell-shaped, bisexual, blue-violet, canary,
central, chasmogamous, cleistogamous, club-shaped, cream-brown, cuneate-obovoid, cylindric, distal, ellipsoid, elliptic, equal, fetid
Participants
• University of Illinois
– Grad. School of Library and Information Science
– Herbarium
• Illinois Natural History Survey and Herbarium
• University of North Carolina
– School of Information and Library Science
• North Carolina Botanical Garden
Participants
• University of Illinois
– Grad. School of Library and Information Science: Lesley Deem, Xiao Hu, Jing Wu
– Herbarium: Dave Siegler
• Illinois Natural History Survey and Herbarium: Ken Robertson, Michael Jeffords
• University of North Carolina
– School of Information and Library Science: Jane Greenberg, Evaline Daniels
– North Carolina Botanical Garden: Bob White
Project Objectives
• Develop a Distributed Search and Identification Support
– Character Lists, Taxonomic descriptions
• Develop Distributed Taxonomic Descriptions
– Text, in situ images, ID images, herbarium sheets
• Highlight Museum Holdings
• Audience: Citizen Scientists, Schools, Scientists
Addressed Issues Encountered with Collaboration
• Inability to produce a final character matrix
– Introduction of new characteristics
• Differing definitions of “like” terms
– Need to be defined in context
– Cannot proliferate (reuse)
Distributed Properties
• Image and/or Character States need not be harvested until access time
• No “change” or “new” log in HTTP, unlike Open Archive
• Property definitions but not references may change without notification
• New instances may be added without notification
Schemata
• Taxon Description
• Characteristic
• Character Value
• Character Image
• Contributor
Component of Description
[Diagram labels: Taxon; Characteristic; CharacterImages; CharacterState; *Contributor, ContributorID; ImageID; RefKey:CharacteristicID:Value; RefKey:CharacteristicID; ValueID]
*Contributor may be used for all objects
Decomposition and Distribution
• The selection of characteristics and their values can be distributed
• No need to have definitive set
• Potential for reuse in new taxonomic descriptions
• Potential for differing definitions but in well defined space
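The points above can be illustrated with a small merge of independently published character lists. The sketch assumes each contributor publishes a mapping from (character, value) to a definition; the contributor ids, the definitions and the `federate` helper are all hypothetical. Differing definitions of the same state are kept side by side under the contributor that supplied them, so no single definitive set is required.

```python
from collections import defaultdict

# Hypothetical character lists published independently by two contributors.
SOURCES = {
    "contributor-1": {
        ("Leaf Shape", "lanceolate"): "Lance-shaped; much longer than wide.",
    },
    "contributor-2": {
        ("Leaf Shape", "lanceolate"): "Narrow and tapering toward the tip.",
        ("Leaf Arrangement", "alternate"): "One leaf at each node.",
    },
}

def federate(sources):
    """Merge character states, keeping each contributor's definition separately."""
    merged = defaultdict(dict)
    for contributor, states in sources.items():
        for (character, value), definition in states.items():
            merged[(character, value)][contributor] = definition
    return dict(merged)

shared = federate(SOURCES)
for (character, value), definitions in sorted(shared.items()):
    print(character, value, sorted(definitions))
```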
Taxonomic Description
• Project Definition (à la Australia 2002)
• Identification – Taxon
<Taxon>
  <Rank>Species</Rank>
  <Name>Echinacea pallida</Name>
  <Authority>(Nutt.) Nutt</Authority>
  <Vernacular>Pale purple coneflower</Vernacular>
</Taxon>
Taxon (Australia, 2002)
<character keyref="stipules, color" type="nominal"> (type should be in Characteristic definitions)
  <state keyref="brown" autoordered="no">
    <modifier keyref="dark reddish-"/>
  </state>
  <state keyref="yellow" autoordered="no">
    <modifier keyref="extremely rarely"/>
  </state>
</character>
(Not SDD but SDD inspired)
Preamble…..
<xsd:element name = "CharacterState">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref = "Character"/>
      <xsd:element ref = "Value"/> (State)
      <xsd:element ref = "Image" minOccurs = "0"/>
      <xsd:element ref = "Definition" minOccurs = "0"/>
      <xsd:element ref = "Synonym" minOccurs = "0"/>
      <xsd:element ref = "BroaderTerms" minOccurs = "0"/>
      <xsd:element ref = "NarrowerTerms" minOccurs = "0"/>
      <xsd:element ref = "RelatedTerms" minOccurs = "0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
Character State
<Character>Leaf Shape</Character>
<State>lanceolate</State>
<Image>http://www.isrl.uiuc.edu/~openkey/demo/lanceolate.xml</Image>
<Image>http://www.isrl.uiuc.edu/~openkey/demo/lanceolate2.xml</Image>
<Definition>Lance-shaped; much longer than wide, with the widest point below the middle.</Definition>
<Synonym/>
<BroaderTerms>oval</BroaderTerms>
<NarrowerTerms/>
<RelatedTerms>elliptic-lanceolate</RelatedTerms>
</Characters>
Federation of Characters and States
[Diagram labels: Key, Key, Key; States; Shared Character Lists; Characters]
Character Sharing
[Diagram labels: UI Database, UNC Database, Federated Database, OA Server, OA Server, OA Harvester, Polyclave Server, The world, PV.XML, DELTAAccess, LuCID]
Character Schema Element
<xsd:element name = "Value" type = "xsd:string">
  <xsd:annotation>
    <xsd:documentation>This is the value of a "character" or
    "characteristic" of an item, i.e. the value of a property.
    For example, the legal values for /leaves/leaf_arrangement are
    alternate, opposite, basal, whorled and cauline.
    </xsd:documentation>
  </xsd:annotation>
</xsd:element>
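Because the schema types Value as a plain string, the set of legal values for a character has to be checked elsewhere. A minimal sketch of such a check follows, assuming a vocabulary keyed by character path; the `LEGAL_VALUES` table and `is_legal` helper are hypothetical, and the single entry uses the legal values listed in the schema documentation.

```python
# Hypothetical controlled vocabulary keyed by character path. The one entry
# uses the legal values given in the schema documentation.
LEGAL_VALUES = {
    "/leaves/leaf_arrangement": {"alternate", "opposite", "basal", "whorled", "cauline"},
}

def is_legal(path, value, vocabulary=LEGAL_VALUES):
    """True if the value is one of the legal values declared for the character path."""
    return value.strip().lower() in vocabulary.get(path, set())

print(is_legal("/leaves/leaf_arrangement", "Whorled"))     # a listed value
print(is_legal("/leaves/leaf_arrangement", "lanceolate"))  # a leaf shape, not an arrangement
```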
Image Schema
<xsd:element ref = "Identifier"/>
<xsd:element ref = "Description" minOccurs = "0"/>
<xsd:element ref = "CopyrightPermission"/>
<xsd:element ref = "ImageLocation" …>
<xsd:element ref = "CopyrightHolder" …>
<xsd:element ref = "Source" minOccurs = "0"/>
<xsd:element ref = "DerivedFrom" …>
<xsd:element ref = "DateCreated" …>
<xsd:element ref = "Species" minOccurs = "0"/>
<xsd:element ref = "ContributorID" …>
<xsd:element ref = "SpecimenID" …>
PlantCharacteristic
<?xml version="1.0"?>
<!DOCTYPE PlantCharacteristic SYSTEM "http://soldev.isrl.uiuc.edu/~webvibe/PlantCharacters/PlantImageCharacteristic.dtd">
<ImageElements>
  <Part>Leaf-Stem</Part>
  <ValueName>Alternate</ValueName>
  <Description>Leaf Arrangements - Alternate</Description>
  <ImageLocation>http://soldev.isrl.uiuc.edu/~webvibe/PlantCharacters/images/AlternateLeaf.jpg</ImageLocation>
  <CopyrightHolder>Illinois Natural History Survey</CopyrightHolder>
PlantCharacteristic
  <Source>Observing, Photographing, and Collecting Plants. Illinois Natural History Survey Circular 55, 1980</Source>
  <DateCreated>January 29, 2002</DateCreated>
  <Species></Species>
  <DerivedFrom>Another Image</DerivedFrom>
  <Contributor>
    <ContributorID>pbheidorn</ContributorID>
    <ContributorName>P. Bryan Heidorn</ContributorName>
    <ContributorDetails>University of Illinois, GSLIS</ContributorDetails>
  </Contributor>
</ImageElements>