Introducing an automated subject classifier
TRANSCRIPT
Pru Mitchell, Tine Grimston
Robert Parkes
With thanks to: Phil Anderson, Leidos #vala16 #s27
Cunningham Library
• Services
• ACER staff
• ACER students
• Education community
• Indexing services
Australian Education Index
• First print edition 1957
• Available on Informit as A+ Education, ProQuest, Taiwan
• Indexed by ACER staff and external contract indexers
Indexing varies with staffing levels and budget: “an increasingly onerous task”
[Chart: records indexed per year, 2006–2015; vertical axis 0–10,000]
Production steps
1. Identification of potential sources
2. Acquisition of identified sources
3. Selection of relevant material from these sources
4. Cataloguing or indexing of selected material
5. Quality assurance of indexed records
6. Dissemination of records to users
The product
Indexing database
Cunningham catalogue
One vocabulary to bind them
• AEI
• EdResearch Online
• Australian Education Research Theses
• IDP Database
• Learning Ground
• BOLDE
Australian Thesaurus of Education Descriptors (ATED)
Web docs, books, journal articles, conference papers
Machine learning
Automated classification
Why?
• More to index
• Less staff time available
• Increasing metadata feeds instead of print journals
• Increase efficiency
Our story
2009 First journal metadata
2011 Information Online presentation
2012 Increased metadata replacing print journals
2013 Feasibility study
2014 Initial installation in June, followed by continuous refinement of system
What is the classifier?
Two processes
1. Training: uses past data to create models of how each subject term should be used.
2. Classification: uses the models to assign subjects to new records, based on article title, abstract and journal title.
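The talk does not name the software behind the classifier, but the two-step design above can be sketched with a standard multi-label text-classification setup: one binary model per thesaurus term, trained on TF-IDF features of title, abstract and journal title. All records, subject terms and scores below are invented for illustration; this is a minimal sketch, not the actual system.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# 1. Training: past records (title + abstract + journal title) with the
# subject terms human indexers assigned to them (all invented here).
train_texts = [
    "Reading recovery outcomes in primary schools. Journal of Literacy.",
    "Teacher professional development and mentoring. Teaching Review.",
    "Numeracy intervention for primary students. Maths Education Journal.",
    "Mentoring beginning teachers in rural schools. Teaching Review.",
]
train_subjects = [
    {"Literacy", "Primary education"},
    {"Teacher development", "Mentors"},
    {"Numeracy", "Primary education"},
    {"Teacher development", "Mentors"},
]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(train_subjects)  # one column per subject term

# One binary model per subject term, over TF-IDF features of the text.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, y)

# 2. Classification: score every subject term for a new record and
# suggest the strongest-scoring terms to the human indexer.
new_record = "Mentoring and professional development for teachers. Teaching Review."
scores = model.decision_function([new_record])[0]
top = [binarizer.classes_[i] for i in np.argsort(scores)[::-1][:2]]
print(top)
```

In this setup the classifier only *suggests* descriptors; as the next slides show, a human indexer still reviews and corrects the suggested set.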
Training the classifier
• Selection of past records - not all are suitable
Running the classifier
What the human indexer sees
How the classifier has performed
• Provides a useful set of descriptors on the majority of records
• Average of 11.7 major descriptors assigned per record (Max=13)
• Average of 6.5 “correct” major descriptors per record
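The two averages above imply a rough per-record precision, a back-of-envelope check not stated explicitly in the talk:

```python
# Per-record precision implied by the quoted averages:
# correct major descriptors / assigned major descriptors.
avg_assigned = 11.7
avg_correct = 6.5
precision = avg_correct / avg_assigned
print(f"{precision:.0%}")  # prints 56%
```

That is, roughly half of the suggested major descriptors survive human review, which is why the indexer-facing step described next still matters.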
Findings
A particular challenge: “Horse-Girl Assemblages: Towards a Post-Human Cartography of Girls' Desire in an Ex-Mining Valleys Community” [Discourse, 35(3)]
• Classifier performance greatly dependent on abstract length, style and level of detail
• ACER indexes a wide variety of material, some of which is not easy to index using ATED
• The specific topic of an article might only have a more general term in ATED
• Quality vs efficiency
Workflow improvements
Classifier use increasing due to workflow improvements
Publisher feeds
• Taylor & Francis (2009–)
• SAGE (2013–)
• Wiley (2013–)
• Springer (2013–)
• Inderscience (2013–)
• Emerald (in negotiation)
• Many publishers can provide a metadata feed of education journals
• All in XML, but all different from each other
• 24,138 articles received in feeds in 2015, up from 5,006 in 2010
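Because each publisher's XML uses its own element names, a thin per-publisher mapping layer can normalise the feeds into one record shape before indexing. A minimal sketch, assuming two invented feed formats — the real publishers' schemas (and their namespaces) are not shown in the talk:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-publisher mappings from our common field names to the
# path of that field in the publisher's feed. Tag names are invented.
FIELD_MAPS = {
    "publisher_a": {"title": "articleTitle", "abstract": "abstract", "journal": "journalName"},
    "publisher_b": {"title": "meta/title", "abstract": "meta/description", "journal": "source"},
}

def normalise(publisher, xml_text):
    """Extract a common {title, abstract, journal} record from one feed entry."""
    root = ET.fromstring(xml_text)
    fields = FIELD_MAPS[publisher]
    return {name: root.findtext(path, default="") for name, path in fields.items()}

# Two differently shaped entries describing the same article:
feed_a = ("<article><articleTitle>Numeracy at school</articleTitle>"
          "<abstract>A study.</abstract>"
          "<journalName>Maths Ed</journalName></article>")
feed_b = ("<item><meta><title>Numeracy at school</title>"
          "<description>A study.</description></meta>"
          "<source>Maths Ed</source></item>")

print(normalise("publisher_a", feed_a) == normalise("publisher_b", feed_b))  # prints True
```

Adding a new publisher then means adding one mapping entry rather than a new parser, which is one way a small team could keep up with feed volume growing from 5,006 to 24,138 articles a year.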
Lessons
• Indexing from the abstract
• Thesaurus structure
• Metadata
• Process simplification
• Prioritisation
• Indexer experience
• Curation
• Skill set required in team
What next?
• Ongoing development of workflows
• Possible changes to our database structure
• More publisher feeds
• Other ways to get bibliographic metadata into the workflow, e.g. RSS feeds, search alerts from databases
• Develop selection processes further
• Documentation and dissemination
Questions