scikit-talk: a toolkit for processing real-world

Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 252–256July 29–31, 2021. ©2021 Association for Computational Linguistics

252

Scikit-talk: A toolkit for processing real-world conversational speech data

Andreas Liesenfeld, Gabor Parti and Chu-Ren HuangThe Hong Kong Polytechnic University

Hong Kong SAR, [email protected], [email protected],

[email protected]

Abstract

We present Scikit-talk, an open-source toolkitfor processing collections of real-world con-versational speech in Python. First of its kind,the toolkit equips those interested in study-ing or modeling conversations with an easy-to-use interface to build and explore largecollections of transcriptions and annotationsof talk-in-interaction. Designed for appli-cations in speech processing and Conversa-tional AI, Scikit-talk provides tools to custom-build datasets for tasks such as intent proto-typing, dialog flow testing, and conversationdesign. Its preprocessor module comes withseveral pre-built interfaces for common tran-scription formats, which aim to make work-ing across multiple data sources more acces-sible. The explorer module provides a collec-tion of tools to explore and analyse this datatype via string matching and unsupervised ma-chine learning techniques. Scikit-talk servesas a platform to collect and connect differ-ent transcription formats and representationsof talk, enabling the user to quickly build mul-tilingual datasets of varying detail and granu-larity. Thus, the toolkit aims to make work-ing with authentic conversational speech datain Python more accessible and to provide theuser with comprehensive options to work withrepresentations of talk in appropriate detail forany downstream task. For the latest updatesand information on currently supported lan-guages and language resources, please refer to:https://pypi.org/project/scikit-talk/

1 Introduction

Real-world conversational speech data, also knownin the designated fields of study as transcription oftalk-in-interaction, is complex. Talk can be tran-scribed in varying level of detail and symbolic rep-resentations of elements in talk range from simpleword-level transcripts to fine-grained phonetic rep-resentations and multi-layered xml tags for all sorts

of audio-visual information that the representationaims to capture. When using this data type for anydownstream task, an important step is to decide onthe appropriate level of granularity and to strive fora systematic representation format. Scikit-talk isdesigned to aid anyone interested in processing talkwith these decisions and to make building datasetsof talk more efficient, flexible and accessible.

Designed for applications in speech processingand Conversational AI, Scikit-talk provides toolsto custom-build datasets for tasks such as intentprototyping, dialog flow testing, and conversationdesign. An important step in the development ofcommercial task-oriented bots is the initial designof intent labels and dialog flows. Depending onthe task, prototyping intent flows can be difficult.Breaking down the interaction of ordering a pizzainto discrete intents such as “select pizza type” and“input delivery address” is straight-forward. But,for instance, designing dialog flows for more com-plex interactions, such as bots for technical supporttasks, help, or FAQ hotlines can be more complex.How will these interactions unfold? What are typ-ical dialog flows for such interactions in the realworld? Scikit-talk provides an accessible tool fortasks such as bot prototyping for ConversationalAI. It offers a unified toolkit to process and querydatabases of real-world interactions to find andanalyse instances of a certain type of interactionfor data-driven conversation design and develop-ment cycles (Figure 1).

Prototype interaction flows sometimes do notmatch well with how users actually complete thetask in real-world scenarios, which can result inadditional development cycles until a satisfactoryperformance is reached (Yang et al., 2019). Initialdesign decisions may require revision later on, suchas editing, merging or deprecating intents once realusers interact with the bot. Scikit-talk addressesthis by providing builders and designers with a tool

253

Figure 1: A typical bot development spiral: from proto-type design to development and deployment cycles

to explore how to better match intent labels anddialog flows in the prototype stage. For instance,when prototyping of a bot for a customer supporttask, Scikit-talk can be used to explore similar ex-planatory sequences in real-world data. This helpsbuilders to ground their prototype in authentic dataand mitigate the risk of grounding developmentcycles on artificial premises.

Scikit-talk is the first open-source tool for pro-cessing transcripts of authentic speech specificallydesigned for research and development. It comeswith a range of data exploration tools for colloca-tion and concordance analysis, as well as a pipelineto automate annotation, analysis, and visualizationof interaction types or specific NLU intents (suchas greeting, accusing, deflecting, or apologizing).

In contrast to prototyping based on typed chat ormovie subtitle data, Scikit-talk enables insights thattake more detail of real talk into account, includ-ing seemingly incoherent or messy discourse struc-tures, disfluency, repair, repetition or restarts andother features that are common in natural spoken in-teraction. Designing for real talk means designingcloser to real-world scenarios.

Figure 2 shows an example of intent label anddialogue state design grounded in conversationaldata. The analysis highlights some of the issuesthat arise when working with this data type, such aslabelling various turns and dealing with features ofconversational interaction such as abandoned turnsor joint turn completions.

Inspired by Scikit-learn, the toolkit consists ofseveral modules to preprocess, explore and analyse

this data type that were designed with explainabilityand customisation in mind (Pedregosa et al., 2011).

The toolkit also enables the user to combine ex-isting language resources using pre-built interfacesfor common corpus transcription formats. The mo-tivation behind Scikit-talk is to build a tool to con-veniently explore large collections of authentic talk.This aim is reflected in the following design deci-sions:

• The fundamental unit of organization isthe turn, not the sentence. A turn is an ut-terance of any length by a speakers that onlyends when another speakers begins to talk.This means that data is segmented only byspeaker change, not punctuation.

• Compile datasets using custom transcrip-tion formats as well as existing languageresources Scikit-talk should provide the op-tion to configure custom formats as well asmaintain a collection of pre-built interfaces forpopular existing language resources of conver-sational speech.

• Retain as much detail of authentic speechas the transcription formats permit. Norepetitions, repairs, or non-lexical utterancesare discarded. This often includes any avail-able transcriptions of utterances such as “oh”or “erm”, laughing, sighing, pauses, trunca-tion and overlapping speech.

• Utterances should be explored within theirinteractional contexts. Scikit-talk shouldprovide a collection of state-of-the-art toolsfor concordance and collocation analysis oflexical as well as non-lexical elements in talk.

• Make the tool available to anyone inter-ested in processing real-world conversa-tional speech data. For the given tasks, thetoolkit should lower the technical barrier ofaccess and adoption, making it a useful open-source platform for developers, researchersand designers alike.

2 Overview of the Scikit-talk toolkit

The idea of a toolkit for conversational speech pro-cessing is inspired by an existing similar tool for on-line chat conversations (Chang et al., 2020). Scikit-talk aims to provide both a unified framework forseveral existing transcription conventions as well as

254

a range of tools to explore this data type in Python.The Preprocessor module is used to combine sev-eral existing datasets via pre-built interfaces forvarious data formats while preserving as much tran-scribed information as possible. Apart from corpusbuilding, Scikit-talk also includes tools for datamanipulations. It includes string matching toolsto explore and identify features of interest. And itincludes tools for unsupervised machine learningtechniques to explore distributional patterns in thedata.

As for custom datasets, Scikit-talk can be usedto analyse any transcriptions of talk in a turn-based format where the conversation unfolds in the“ABAB...” pattern for two speakers or “ABCBCA...”for three speakers, etc.1

3 Pre-built corpus interfaces

The main challenge of providing a comprehensivetool to explore how people talk in natural settingsdistributionally is that language resources of natu-ral conversational speech are few and far between.Unlike movie subtitle corpora, these datasets aresmall because building them requires painstak-ing manual transcription and annotation. Existinglarge-scale conversational corpora often come witha custom annotation formats. While this may be forgood linguistic reasons, it poses a significant bar-rier to facilitate work across different datasets. Pro-cessing and finding instances of a specific type ofinteraction or task in these datasets is hard becausetranscription conventions and data representationformats vary. Addressing this challenge, Scikit-talkprovides a platform that aims to collect and connectexisting computer-readable transcription formatsof talk.

This fragmentation of representation formatsposes a challenge to create custom datasets frommultiple data sources or across existing corpora.Scikit-talk here provides a practical solution by pro-viding a module that converts data across severalcommon formats, the CHILDES CHAT format2,that of the Spoken British National Corpus (Spo-kenBNC)3, and the UPenn Linguistic Data Con-sortium (LDC)4. This means the current versionof the toolkit already covers several of the largestavailable datasets of authentic conversations, such

1Or any other possible order of any number of speakers.2https://talkbank.org/manuals/CHAT.pdf3http://cass.lancs.ac.uk/cass-projects/spoken-bnc2014/4https://www.ldc.upenn.edu/

as SpokenBNC or any LDC datasets. More corpusinterfaces for major language resources in morelanguages are planned.

Preprocessor enables the user to interface datafrom different corpora which can significantly re-duce the workload of building large datasets andunifying data formats. The challenge here is thattranscription formats may differ in convention andgranularity, some providing more detailed informa-tion than others. Preprocessor merges transcriptionfeatures that are consistent across the formats whileautomatically discarding those that are not. Thisminimizes the loss of information for the sake ofconsistency. Typically, consistency issues arise re-garding the transcription of non-lexical conduct,pauses, overlaps, restarts and phonetic reductions.The module has been tested with SpokenBNC,LDC and CHAT format data in British English andLDC and CHAT format data in Mandarin Chinese.Combining several datasets using Preprocessor re-sults in a single, large dataset that enables the userto build collections of interactions that are morecomprehensive, provide broader coverage of a phe-nomenon, and enable more methods of statisticalanalysis (Figure 3).

3.1 String matching tools

The first step of exploring interactions, intents, anddialog flows in Scikit-talk is usually to compile aset of keywords to query the custom dataset. Thiscan be keywords related to individual actions suchas apologies or words typically occurring in a cer-tain type of interactions or talk on a certain topic.The Explorer module also comes with tools for ba-sic corpus query such as frequency, concordance,and collocation analysis.

The Regular Expression-based Annotator toolallows the user to build a collection of instances ofa type of interaction or intent. It can also be usedto annotate linguistic features for a more detailedanalysis to explore how, for instance, apologies areused across the dataset.

3.2 Unsupervised machine learning tools

The Cluster analysis module offers tools to conductexploratory data analyses using various unsuper-vised machine learning techniques.

Specifically, the module provides access to arange of dimensionality-reduction (Multidimen-sional scaling and t-SNE) and clustering techniques(Hierarchical Agglomerative Clustering). These

255

Figure 2: Example of data-driven intent label and dialog state analysis

Figure 3: Scikit-talk module overview

tools enable further comparative analysis of lin-guistic patterns by comparing the (dis)similarity ofany annotated features. The module also includesa range of customisable visualization tools.

Figure 3 shows a typical Scikit-talk workflow.First a dataset is defined. If the data is transcribedfollowing one of the pre-built formats, the data canbe interfaced in only one line of code. Based on a

unified representation format, possible next stepsare various string matching operations such as us-ing the built-in concordancer to explore the datasetor annotating features for collocation analysis usingvarious clustering techniques.

4 Conclusion and future work

Scikit-talk provides a platform to represent spo-ken conversational data, build datasets, and ex-plore how certain types of interaction unfold inauthentic talk. Tailored to the research and devel-opment community, the open-source toolkit makesworking with transcriptions of talk across corporamore accessible for anyone interested in processingthis data type. Scikit-talk features a Preprocessortool that provides pre-built corpus interfaces for(currently three) popular transcription formats andenables rapid construction of unified data repre-sentations. The Explorer, Annotator, and Clusteranalysis modules provide streamlined data explo-ration tools for collections of specific interactionor intent types using various string matching andunsupervised machine learning techniques.

In the future, Scikit-talk will be extended to pro-vide additional corpus interfaces, covering moreresources in more languages. Our aim is to main-tain compatibility with future major releases oflanguage resources of real-world conversationalspeech and to provide the research communitywith a useful tool for conversational speech pro-cessing to study and model talk in all its glori-

256

ous detail, grounded in the ways how people ac-tually communicate. More information on thecurrent state of this project can be found here:https://pypi.org/project/scikit-talk/

ReferencesJonathan P. Chang, Caleb Chiam, Liye Fu, An-

drew Wang, Justine Zhang, and Cristian Danescu-Niculescu-Mizil. 2020. ConvoKit: A toolkit for theanalysis of conversations. In Proceedings of the21th Annual Meeting of the Special Interest Groupon Discourse and Dialogue, pages 57–60, 1st virtualmeeting. Association for Computational Linguistics.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gram-fort, Vincent Michel, Bertrand Thirion, OlivierGrisel, Mathieu Blondel, Peter Prettenhofer, RonWeiss, Vincent Dubourg, et al. 2011. Scikit-learn:Machine learning in python. The Journal of ma-chine Learning research, 12:2825–2830.

Qian Yang, Justin Cranshaw, Saleema Amershi,Shamsi T. Iqbal, and Jaime Teevan. 2019. SketchingNLP: A Case Study of Exploring the Right ThingsTo Design with Language Intelligence. In Proceed-ings of the 2019 CHI Conference on Human Factorsin Computing Systems, CHI ’19, page 1–12, NewYork, NY, USA. Association for Computing Machin-ery.

https://www.aclweb.org/anthology/2020.sigdial-1.8

https://www.aclweb.org/anthology/2020.sigdial-1.8

https://doi.org/10.1145/3290605.3300415

https://doi.org/10.1145/3290605.3300415

https://doi.org/10.1145/3290605.3300415

scikit-talk: a toolkit for processing real-world

Documents