Introducing an automated subject classifier
TRANSCRIPT
Pru Mitchell, Tine Grimston
Robert Parkes
With thanks to: Phil Anderson, Leidos #vala16 #s27
Cunningham Library
• Services
• ACER staff
• ACER students
• Education community
• Indexing services
Australian Education Index
• First print edition 1957
• Available on Informit as A+ Education, ProQuest, Taiwan
• Indexed by ACER staff and external contract indexers
Indexing varies with staffing levels and budget: “an increasingly onerous task”
[Chart: records indexed per year, 2006–2015; vertical axis 0–10,000]
Production steps
1. Identification of potential sources
2. Acquisition of identified sources
3. Selection of relevant material from these sources
4. Cataloguing or indexing of selected material
5. Quality assurance of indexed records
6. Dissemination of records to users
The product
Indexing database
Cunningham catalogue
One vocabulary to bind them
• AEI
• EdResearch Online
• Australian Education Research Theses
• IDP Database
• Learning Ground
• BOLDE
Australian Thesaurus of Education Descriptors (ATED)
Web docs, books, journal articles, conference papers
Machine learning
Automated classification
Why?
• More to index
• Less staff time available
• Increasing metadata feeds instead of print journals
• Increase efficiency
Our story
2009 First journal metadata
2011 Information Online presentation
2012 Increased metadata replacing print journals
2013 Feasibility study
2014 Initial installation in June, followed by continuous refinement of system
What is the classifier?
Two processes
1. Training: uses past data to create models of how each subject term should be used.
2. Classification: uses the models to assign subjects to new records, based on article title, abstract and journal title.
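The talk does not name the software behind the classifier, but the two-step design above can be sketched with a standard multi-label text-classification setup: one binary model per thesaurus term, trained on TF-IDF features of title, abstract and journal title. All records, subject terms and scores below are invented for illustration; this is a minimal sketch, not the actual system.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# 1. Training: past records (title + abstract + journal title) with the
# subject terms human indexers assigned to them (all invented here).
train_texts = [
    "Reading recovery outcomes in primary schools. Journal of Literacy.",
    "Teacher professional development and mentoring. Teaching Review.",
    "Numeracy intervention for primary students. Maths Education Journal.",
    "Mentoring beginning teachers in rural schools. Teaching Review.",
]
train_subjects = [
    {"Literacy", "Primary education"},
    {"Teacher development", "Mentors"},
    {"Numeracy", "Primary education"},
    {"Teacher development", "Mentors"},
]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(train_subjects)  # one column per subject term

# One binary model per subject term, over TF-IDF features of the text.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, y)

# 2. Classification: score every subject term for a new record and
# suggest the strongest-scoring terms to the human indexer.
new_record = "Mentoring and professional development for teachers. Teaching Review."
scores = model.decision_function([new_record])[0]
top = [binarizer.classes_[i] for i in np.argsort(scores)[::-1][:2]]
print(top)
```

In this setup the classifier only *suggests* descriptors; as the next slides show, a human indexer still reviews and corrects the suggested set.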
Training the classifier
• Selection of past records - not all are suitable
Running the classifier
What the human indexer sees
How the classifier has performed
• Provides a useful set of descriptors on the majority of records
• Average of 11.7 major descriptors assigned per record (Max=13)
• Average of 6.5 “correct” major descriptors per record
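The two averages above imply a rough per-record precision, a back-of-envelope check not stated explicitly in the talk:

```python
# Per-record precision implied by the quoted averages:
# correct major descriptors / assigned major descriptors.
avg_assigned = 11.7
avg_correct = 6.5
precision = avg_correct / avg_assigned
print(f"{precision:.0%}")  # prints 56%
```

That is, roughly half of the suggested major descriptors survive human review, which is why the indexer-facing step described next still matters.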
Findings
A particular challenge: “Horse-Girl Assemblages: Towards a Post-Human Cartography of Girls' Desire in an Ex-Mining Valleys Community” [Discourse, 35(3)]
• Classifier performance greatly dependent on abstract length, style and level of detail
• ACER indexes a wide variety of material, some of which is not easy to index using ATED
• The specific topic of an article might only have a more general term in ATED
• Quality vs efficiency
Workflow improvements
Classifier use increasing due to workflow improvements
Publisher feeds
• Taylor & Francis (2009–)
• SAGE (2013–)
• Wiley (2013–)
• Springer (2013–)
• Inderscience (2013–)
• Emerald (in negotiation)
• Many publishers can provide a metadata feed of education journals
• All in XML, but all different from each other
• 24,138 articles received in feeds in 2015, up from 5,006 in 2010
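Because each publisher's XML uses its own element names, a thin per-publisher mapping layer can normalise the feeds into one record shape before indexing. A minimal sketch, assuming two invented feed formats — the real publishers' schemas (and their namespaces) are not shown in the talk:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-publisher mappings from our common field names to the
# path of that field in the publisher's feed. Tag names are invented.
FIELD_MAPS = {
    "publisher_a": {"title": "articleTitle", "abstract": "abstract", "journal": "journalName"},
    "publisher_b": {"title": "meta/title", "abstract": "meta/description", "journal": "source"},
}

def normalise(publisher, xml_text):
    """Extract a common {title, abstract, journal} record from one feed entry."""
    root = ET.fromstring(xml_text)
    fields = FIELD_MAPS[publisher]
    return {name: root.findtext(path, default="") for name, path in fields.items()}

# Two differently shaped entries describing the same article:
feed_a = ("<article><articleTitle>Numeracy at school</articleTitle>"
          "<abstract>A study.</abstract>"
          "<journalName>Maths Ed</journalName></article>")
feed_b = ("<item><meta><title>Numeracy at school</title>"
          "<description>A study.</description></meta>"
          "<source>Maths Ed</source></item>")

print(normalise("publisher_a", feed_a) == normalise("publisher_b", feed_b))  # prints True
```

Adding a new publisher then means adding one mapping entry rather than a new parser, which is one way a small team could keep up with feed volume growing from 5,006 to 24,138 articles a year.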
Lessons
• Indexing from the abstract
• Thesaurus structure
• Metadata
• Process simplification
• Prioritisation
• Indexer experience
• Curation
• Skill set required in team
What next?
• Ongoing development of workflows
• Possible changes to our database structure
• More publisher feeds
• Other ways to get bibliographic metadata into the workflow, e.g. RSS feeds, search alerts from databases
• Develop selection processes further
• Documentation and dissemination
Questions