introducing an automated subject classifier

Introducing an automated subject classifier. Pru Mitchell, Tine Grimston, Robert Parkes. With thanks to: Phil Anderson, Leidos #vala16 #s27

Upload: australian-council-for-educational-research

Post on 18-Jan-2017


TRANSCRIPT

Page 1: Introducing an automated subject classifier

Introducing an automated subject classifier

Pru Mitchell, Tine Grimston, Robert Parkes

With thanks to: Phil Anderson, Leidos #vala16 #s27

Page 2: Introducing an automated subject classifier

Cunningham Library

• Services
• ACER staff
• ACER students
• Education community
• Indexing services

Page 3: Introducing an automated subject classifier

Australian Education Index

• First print edition 1957
• Available on Informit as A+ Education, ProQuest, Taiwan
• Indexed by ACER staff and external contract indexers

Indexing varies with staffing levels and budget

“an increasingly onerous task”

[Chart: yearly values for 2006 to 2015; vertical axis from 0 to 10,000]

Page 4: Introducing an automated subject classifier

Production steps

1. Identification of potential sources

2. Acquisition of identified sources

3. Selection of relevant material from these sources

4. Cataloguing or indexing of selected material

5. Quality assurance of indexed records

6. Dissemination of records to users

Page 5: Introducing an automated subject classifier

The product

Page 6: Introducing an automated subject classifier

Indexing database

Cunningham catalogue

One vocabulary to bind them

• AEI
• EdResearch Online
• Australian Education Research Theses
• IDP Database
• Learning Ground
• BOLDE

Australian Thesaurus of Education Descriptors

• Web docs
• Books
• Journal articles
• Conf papers

Page 7: Introducing an automated subject classifier

Machine learning

Page 8: Introducing an automated subject classifier

Automated classification

Why

• More to index
• Less staff time available
• Increasing metadata feeds instead of print journals
• Increase efficiency

Our story

2009 First journal metadata
2011 Information online presentation
2012 Increased metadata replacing print journals
2013 Feasibility study
2014 Initial installation in June, followed by continuous refinement of system

Page 9: Introducing an automated subject classifier

What is the classifier?

Two processes

1. Training: Uses past data to create models of how each subject term should be used

2. Classifier: Uses the models to assign subjects to new records based on article title, abstract and journal title
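
The two processes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system ACER installed: the slides do not name the algorithm, so the sketch substitutes a simple word-count model per subject term scored by cosine similarity. The function names, record fields and sample records are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def record_text(record):
    """Join the three input fields named on the slide."""
    return " ".join([record["title"], record["abstract"], record["journal"]])


def train(past_records):
    """Training: build one word-count model per subject term."""
    models = defaultdict(Counter)
    for record in past_records:
        tokens = tokenize(record_text(record))
        for subject in record["subjects"]:
            models[subject].update(tokens)
    return dict(models)


def classify(models, record, max_terms=3):
    """Classifier: rank subject terms by cosine similarity to a new record."""
    words = Counter(tokenize(record_text(record)))
    word_norm = math.sqrt(sum(n * n for n in words.values()))
    scores = {}
    for subject, model in models.items():
        dot = sum(n * model[word] for word, n in words.items())
        if dot > 0:
            model_norm = math.sqrt(sum(n * n for n in model.values()))
            scores[subject] = dot / (word_norm * model_norm)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda s: (-scores[s], s))[:max_terms]


# Hypothetical past records, invented for this illustration.
past_records = [
    {
        "title": "Teaching fractions in primary school",
        "abstract": "Mathematics teachers used fraction games with primary students.",
        "journal": "Mathematics Education Journal",
        "subjects": ["Mathematics education"],
    },
    {
        "title": "Reading comprehension strategies",
        "abstract": "Literacy teachers taught reading strategies to improve comprehension.",
        "journal": "Literacy Review",
        "subjects": ["Reading comprehension"],
    },
]

models = train(past_records)
new_record = {
    "title": "Fraction games for mathematics lessons",
    "abstract": "Primary students learned fractions through games.",
    "journal": "Mathematics Education Journal",
}
suggested = classify(models, new_record, max_terms=1)
print(suggested)  # ['Mathematics education']
```

A human indexer would still review the suggested terms, as later slides describe.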

Page 10: Introducing an automated subject classifier

Training the classifier

• Selection of past records - not all are suitable

Page 11: Introducing an automated subject classifier

Running the classifier

Page 12: Introducing an automated subject classifier

What the human indexer sees

Page 13: Introducing an automated subject classifier

How the classifier has performed

• Provides a useful set of descriptors on the majority of records

• Average of 11.7 major descriptors assigned per record (Max=13)

• Average of 6.5 “correct” major descriptors per record
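
As a rough reading of these two averages, the share of assigned descriptors judged correct can be estimated by dividing one by the other. This is an approximation only: a ratio of averages is not the same as an average of per-record ratios, and the slides do not report the latter. The function name is hypothetical.

```python
def approximate_precision(avg_correct, avg_assigned):
    """Ratio of average correct descriptors to average assigned descriptors."""
    return avg_correct / avg_assigned


# Figures taken from the slide: 6.5 correct out of 11.7 assigned on average.
precision = approximate_precision(6.5, 11.7)
print(f"{precision:.1%}")  # 55.6%
```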

Page 14: Introducing an automated subject classifier

Findings

A particular challenge:

Horse-Girl Assemblages: Towards a Post-Human Cartography of Girls' Desire in an Ex-Mining Valleys Community [Discourse, 35(3)]

• Classifier performance greatly dependent on abstract length, style and level of detail

• ACER index a wide variety of material, some of which is not easy to index using ATED

• The specific topic of an article might only have a more general term in ATED

• Quality vs efficiency

Page 15: Introducing an automated subject classifier

Workflow improvements

Classifier use increasing due to workflow improvements

Page 16: Introducing an automated subject classifier

Publisher feeds

• Taylor & Francis 2009--
• SAGE 2013--
• Wiley 2013--
• Springer 2013--
• Inderscience 2013--
• Emerald (in negotiation)

• Many publishers can provide a metadata feed of education journals

• All in XML, but all different from each other

• 24,138 articles received in feeds in 2015, up from 5,006 in 2010
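
Because every feed is XML but each uses its own structure, one way to handle them is a per-publisher mapping onto a common record. The sketch below assumes this approach; the slides do not describe how the feeds are actually processed. The publisher keys, element names and sample documents are hypothetical and do not reflect any real publisher's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from a common field name to an element path.
# Real feeds use different element names and may need namespaces.
FIELD_PATHS = {
    "publisher_a": {
        "title": "article-title",
        "abstract": "abstract",
        "journal": "journal-title",
    },
    "publisher_b": {
        "title": "Title",
        "abstract": "Summary",
        "journal": "Source/Name",
    },
}


def normalise(xml_text, publisher):
    """Convert one article element into a common title/abstract/journal record."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, path in FIELD_PATHS[publisher].items():
        element = root.find(path)
        # Missing elements become empty strings rather than raising an error.
        record[field] = "".join(element.itertext()).strip() if element is not None else ""
    return record


# Two invented feeds describing the same article in different structures.
feed_a = (
    "<article><article-title>Fraction games</article-title>"
    "<abstract>Students learned fractions.</abstract>"
    "<journal-title>Maths Journal</journal-title></article>"
)
feed_b = (
    "<Item><Title>Fraction games</Title>"
    "<Summary>Students learned fractions.</Summary>"
    "<Source><Name>Maths Journal</Name></Source></Item>"
)

record_a = normalise(feed_a, "publisher_a")
record_b = normalise(feed_b, "publisher_b")
print(record_a == record_b)  # True
```

Keeping the mapping in data rather than code means a new publisher feed needs only a new entry in the table.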

Page 17: Introducing an automated subject classifier

Lessons

• Indexing from the abstract
• Thesaurus structure
• Metadata
• Process simplification
• Prioritisation
• Indexer experience
• Curation
• Skill set required in team

Page 18: Introducing an automated subject classifier

What next?

• Ongoing development of workflows
• Possible changes to our database structure
• More publisher feeds
• Other ways to get bibliographic metadata into the workflow, e.g. RSS feeds, search alerts from databases
• Develop selection processes further
• Documentation and dissemination

Page 19: Introducing an automated subject classifier

Questions
