Integrating cyberinfrastructure into existing e-Social Science research

Svenja Adolphs,1 Bennett Bertenthal,2 Steve Boker,3 Ronald Carter,1 Chris Greenhalgh,4 Mark Hereld,5,6 Sarah Kenny,5 Gina Levow,7 Michael E. Papka,5,6,7 Tony Pridmore4

1 Centre for Research in Applied Linguistics, School of English, Nottingham University, UK
2 Department of Psychology, Indiana University, USA
3 Department of Psychology, University of Virginia, USA
4 School of Computer Studies and IT, Nottingham University, UK
5 Computation Institute, Argonne National Laboratory and The University of Chicago, USA
6 Mathematics and Computer Science Division, Argonne National Laboratory, USA
7 Department of Computer Science, The University of Chicago, USA

Email address of corresponding author: [email protected]

Abstract. This study has been facilitated by an NSF/ESRC exchange programme between researchers at the University of Chicago and the University of Nottingham. At the University of Nottingham the National Centre for e-Social Science node seeks to explore, understand and demonstrate the salience of new forms of digital record as they emerge from and for e-Social Science. The Nottingham Multimodal Corpus (NMMC) is a corpus of multimodal data that marries established coding schemes with visual mark-up systems to foster a richer understanding of the embodied nature of language use and its manifold relations to the production of distinctive social contexts. At the University of Chicago, the NSF-funded Social Informatics Data Grid (SIDGrid)1 is being built to enable researchers to collect real-time multimodal behaviour at multiple time scales. Multimedia data (voice, video, images, text, numerical) is stored in a distributed data warehouse that employs Web and Grid services to support data storage, access, exploration, annotation, integration, analysis, and mining of individual and combined data sets. With particular reference to the analysis and mark-up of hand gestures in spoken discourse, this paper explores some basic steps in integrating cyberinfrastructure into existing e-Social Science research, drawing on an interdisciplinary team with perspectives from linguistics, psychology, data and computing systems, and machine analysis of multi-modal data.

Introduction

Information technology advances have already made it possible to develop multi-million word databases or ‘corpora’ of spoken conversation as well as software tools to analyse this data quantitatively (see, for example, the 5 million word CANCODE2 corpus developed at Nottingham). However, social interactions are multi-modal in nature, combining both verbal and non-verbal components and units (Kress and van Leeuwen 2001). Non-verbal, multi-modal behaviour (e.g. body language, gestures, eye contact, etc.) plays an integral part in determining the meaning and function of spoken language (Baldry and Thibault 2004).

We report here on how we begin to exploit the emerging e-Science infrastructure to extend early research in the field3,4 and develop an integrated resource for interdisciplinary research into natural language use that responds to the challenges of multi-modal analysis. Selected sequences from twenty-five hours (approximately 125,000 words) of video data taken from naturally occurring dyadic tutorial supervision sessions are analysed using standard corpus linguistic search tools and key clusters of lexico-grammatical linguistic features are isolated. These key linguistic data clusters are then subjected to mark-up of relevant sections of the data by means of visual tracking based on computer vision technologies and to further analysis which, manually and in semi-automated fashion, marries the verbal and the visual components of the video data.

Previous spoken corpus research has highlighted discourse markers as key structuring devices of particular significance (Carter and McCarthy 2006). These units provide a particular site for analysis of the way in which key components of body language, such as, in the case of the Nottingham research, head nods and hand gestures, are utilised by speakers to provide visual support for discourse management. Ongoing research, reported at the 2006 e-Social Science conference, has focused on head nod movements as backchannels in conversation.5

The collaborative task that we set ourselves is to integrate SIDGrid cyberinfrastructure with the Handtalk gesture-in-communication project to enable us to explore the associated scientific, technical and social issues. The discussion is organized as follows – description of the problem and approach, the viewing tool, integration with SIDGrid, special issues encountered, and concluding remarks.

From Headtalk to Handtalk

The problems associated with identifying and classifying head-nods are multiplied when considering hand gestures, even when gesture is only considered in the form of hand and arm movement. A head-nod can be said to exist due to movement along one axis. Hands and arms can move much more freely, either in tandem with or independently of one another. Hands also perform several other practical functions during talk, including scratching, adjusting clothing and hair, writing, reaching for and moving objects, etc. Unlike head-nods, however, there already exists a substantial body of research into how hand gestures support and supplement spoken utterances. Much of this existing work on gesture (McNeill 1992) has attempted to represent the full scope of this complexity by employing teams of transcribers who manually code large amounts of video data and then cross-check each other’s work. Researchers start out by identifying everything they consider to be a gesture by the speaker they are observing and then add these to the orthographic transcript, using square brackets around the words where a gesture takes place and putting the words where the stroke of the gesture occurs in bold type. The gesture is then located in space, as illustrated in Figure 1.

2 CANCODE:
3 EUDICO:
4 TASX: tasxforce.lili.uni-bielefeld.de
5 HeadTalk:

Figure 1: Division of the gesture space for transcription purposes (from McNeill 1992: 378).

From this point the gestures can be coded according to classification systems for gesture type, form and meaning.

For this project it was decided to take a more ‘bottom-up’ approach and to work with a very simple set of gestures to which additional layers of complexity could be added at a later stage. To simplify the types and forms of the gestures for the study we decided to focus on the movement of arms only and not to look at hand shape. We also wanted to adapt the existing gesture region model shown in Figure 1. Preliminary examination of the supervision data we have collected suggested that a simplified model, dividing the area in front of the speaker where the gestures are ‘played out’ along vertical axes, would give some interesting results, as illustrated in Figure 2.

Figure 2 also illustrates the way in which key linguistic data clusters are then subjected to mark-up of relevant sections of the data by means of video analysis based on computer vision technologies. An interactive program allows users to apply a visual tracking algorithm to selected targets in an input video clip. In the current prototype the user indicates, with a mouse, the position of the head and hands in the first frame of the video. The system then automatically locates the torso and identifies the four regions of interest shown in Figure 2. An augmented version of the Kernel Annealed Mean-Shift (KAMS) algorithm of Naeem et al. (2007) tracks the position of the head and hands through the remainder of the video. The blue circles in Figure 2 denote the tracker’s estimates of the head and hands’ positions. Although no attempt has been made in NMMC to automate the clustering and recognition systems, the independent success of each strongly suggests that the automatic recognition of head and hand gestures is feasible in the type of image data considered here.

Figure 2: Computer image tracking applied to video.

Tracking is complicated when speakers bring their hands together, or to their face. Khan et al.’s (2004) interaction filter was therefore incorporated into the standard KAMS algorithm to prevent the trackers losing their targets when this happens. Torso and zone positions are updated as a function of head location, and a text file is produced which summarises the movement of the hands into and out of the four zones. Though the method could be applied to a wide variety of object representations, it has long been known that human skin colour clusters very tightly in some colour spaces, making the colour histogram ideal for tracking the face (and hands) of the speaker. The current system therefore represents each target object (hand or face) as a normalized 3D histogram of colour values; see Naeem et al. (2007) for details.
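The target representation described above can be illustrated with a short sketch. It is not the NMMC implementation: the bin count, the function names and the use of the Bhattacharyya coefficient as a similarity score (a measure commonly used in mean-shift tracking) are all assumptions made for illustration.

```python
import numpy as np

def colour_histogram(pixels, bins=8):
    """Normalized 3D colour histogram of an (N, 3) array of 8-bit colour values.

    The choice of 8 bins per channel is an assumption, not a value from the paper.
    """
    pixels = np.asarray(pixels, dtype=float).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    total = hist.sum()
    # Normalize so the bins sum to 1; leave an empty histogram unchanged.
    return hist / total if total > 0 else hist

def bhattacharyya(p, q):
    """Similarity of two normalized histograms: 1.0 for identical, 0.0 for disjoint."""
    return float(np.sum(np.sqrt(p * q)))
```

A tracker built on this representation would compare the histogram of the marked target in the first frame with histograms of candidate regions in later frames and prefer the candidate with the highest similarity.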

The vertical axes were chosen because, during supervisions, the speakers spend a good deal of time comparing and contrasting different ideas. These ideas often appear to exist in metaphorical compartments in front of the speaker. When these ideas are compared or contrasted the speaker will often move one or both hands along a horizontal axis to support the verbal element of the communication. Dividing the horizontal plane with vertical lines allows us to track this movement and to link it to the talk. A 4-point coding scheme was constructed.

1) Left hand moves to the left

2) Left hand moves to the right

3) Right hand moves to the left

4) Right hand moves to the right

This initial coding scheme focuses purely on the movement of the arms. Additional or more complex schemes will allow us to make further remarks about the way in which gesture supports and/or supplements the verbal part of the speaker’s message; initial analysis of correlations with various types of linguistic discourse markers is already under way. They will also allow us to marry this analysis with work undertaken with colleagues specializing in computational linguistics and psychology at the University of Chicago and the University of Virginia, who have particular expertise in prosodic analysis and in the relationship between pitch, intonation and gesture (Levow, 2005), and who have undertaken preliminary research on the role of gender in gesture communication.6
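As a minimal sketch of how the four-point scheme could be derived from tracker output, the code below maps horizontal hand positions to one of four zones and emits a code whenever a hand crosses a zone boundary. The zone geometry (three vertical boundaries centred on the head), the convention that "left" and "right" are taken in image coordinates, and all names are assumptions for illustration; the actual zone layout is derived from the torso position in the system described above.

```python
import bisect

# Codes follow the list above: 1 = left hand moves left, 2 = left hand moves right,
# 3 = right hand moves left, 4 = right hand moves right.
CODES = {("left", -1): 1, ("left", 1): 2, ("right", -1): 3, ("right", 1): 4}

def zone_index(hand_x, head_x, zone_width):
    """Return a zone number 0..3 (left to right in the image) for a hand position.

    Hypothetical geometry: boundaries at head_x - zone_width, head_x, head_x + zone_width.
    """
    boundaries = [head_x - zone_width, head_x, head_x + zone_width]
    return bisect.bisect_right(boundaries, hand_x)

def code_movements(zones_left, zones_right):
    """Return sorted (frame_index, code) events for every zone change of either hand."""
    events = []
    for hand, zones in (("left", zones_left), ("right", zones_right)):
        for i in range(1, len(zones)):
            delta = zones[i] - zones[i - 1]
            if delta != 0:
                events.append((i, CODES[(hand, 1 if delta > 0 else -1)]))
    return sorted(events)
```

The resulting list of frame-indexed codes corresponds to the kind of per-zone movement summary that could then be linked by time to the transcript.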

Applications and representation of data

The final concern of the corpus development is with how the multiple streams of coded data are physically re-presented in a re-usable operational interface format. With a multi-media corpus it is difficult to exhibit all features of the talk simultaneously. If all characteristics of specific instances where a word, phrase or coded gesture (in the video) occurs in talk were displayed, the corpus would have to involve multiple windows of data including concordance viewers, text viewers, audio and video windows. This may make the corpus impractical to use, and would mean that the corpus may be slow and sometimes prone to fail if the computer system is unable to deal with storing and replaying such high volumes of video data. It is further difficult to ‘read’ any large quantity of multiple tracks of such data simultaneously, as current corpora allow with text. The notion of selecting, for example, a section of text or a search token, and retrieving the exact point in the data at which it occurs is, itself, not straightforward. Again, the fact that gestures are not discrete units in the same way as words and utterances means that it is difficult to align the different modes exactly according to the time at which different actions or words occur.
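One simple way to approach the alignment problem just described is to treat both words and gestures as time intervals and to link a gesture to every word whose interval overlaps it. The sketch below assumes word timings of the form (start, end, text) in seconds; this data layout and the function name are hypothetical and are not taken from DRS.

```python
def overlapping_words(words, gesture_start, gesture_end):
    """Return the words whose time interval overlaps the gesture interval.

    words: list of (start_seconds, end_seconds, text) tuples (assumed layout).
    Two intervals overlap when each starts before the other ends.
    """
    return [text for (start, end, text) in words
            if start < gesture_end and end > gesture_start]
```

Because a gesture rarely begins and ends exactly on word boundaries, an overlap test of this kind yields a set of candidate words rather than a single aligned token, which reflects the difficulty noted above.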

In order to represent the data in a way that allows for the different data streams to be analyzed alongside one another, the multi-modal data is viewed through the Digital Replay System (DRS) interface (French et al. 2006). The software also enables the researcher to code and search the corpus data.

The Digital Replay System allows video data to be imported and a digital record to be created that ties sequences of video to a transcribed text log, accompanied, where appropriate, by samples of data that are also subjected to visual tracking (indicated by blue circles in Figure 2). The text log is linked by time to the video from which the transcript is derived so that the text log plays alongside the video. Further annotations can be added to the log to show where gestures – head nods in the above example – occur and these annotations are also tied to the video. An index of annotations is produced and each can be used to go to that part of the log and video at which it occurs. The annotation mechanism provides an initial means of marking up multi-modal data and of maintaining the coherence between spoken language and accompanying gestural elements. Note that in Figure 3 the second concordance line of the search term yeah has been selected (shown on the right side of the DRS interface). The video clip in which this utterance is spoken is shown to the left of the interface and can be played on the audio track (positioned at the bottom of the screen). Using this concordance viewer, the analyst can search across a large database of multi-modal data utilising specific types, phrases, patterns of language, or gesture codes as ‘search terms’.

6 Boker Lab:

Once the results are presented as a concordance view, the analyst may jump directly to the temporal location of each occurrence within the associated video or audio clip, as well as to the full transcript.
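The link between a concordance line and its temporal location can be sketched as follows. This is an illustrative keyword-in-context search over a time-stamped token list, not the DRS concordance tool; the token layout (time in seconds, word) and the function name are assumptions.

```python
def concordance(tokens, term, width=3):
    """Return (time, left_context, keyword, right_context) for each hit of term.

    tokens: list of (time_seconds, word) pairs in transcript order (assumed layout).
    width: number of context words shown on each side of the keyword.
    """
    words = [word for _, word in tokens]
    hits = []
    for i, (time, word) in enumerate(tokens):
        if word.lower() == term.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((time, left, word, right))
    return hits
```

Each returned time stamp is what a viewer would need in order to seek the video or audio player to the selected occurrence.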

Figure 3: A screenshot of the concordance tool in use within the DRS software interface.

SIDGrid development

The Social Informatics Data (SID) Grid (Bertenthal et al. 2007) is designed to enable researchers to collect multimodal behavior, and then to store and analyze different data types (e.g. voice, video, images, text, numerical) in a distributed multimedia data warehouse that employs web and grid services to support data storage, access, exploration, annotation, integration, analysis, and mining of individual and combined data sets.

The diagram in Figure 4 illustrates how SIDGrid connects data, analysis, and researchers. Previously collected corpora and data archives in raw or partially analyzed forms are supported, as are existing applications such as ELAN and DRS with the addition of suitable (and usually straightforward) interface code. An essential component of SIDGrid is its set of transparent data integration services, which allow this distributed data to be used simply and effectively by any infrastructure components and services. SIDGrid query, exploration and analysis services are based upon web and grid services.

Integrating access to SIDGrid services into DRS is taking a course similar to the path taken with ELAN7, an open source annotation and viewing tool already in use by practitioners around the world. We have begun work on the main phases of the integration effort.

Enabling the ability to download projects from SIDGrid requires a point of entry in the GUI – Import from SIDGrid in the File menu. The action associated with this menu item invokes a standard client-side SIDGrid module that provides the interface to the repository. The functions provided by this module include: browsing the repository, browsing metadata associated with data in the repository, content preview, data selection, and data download. This approach simplifies deploying feature and function improvements to all applications that incorporate the module, particularly if the application can use the Java code directly.

7 ELAN:

The data is downloaded to the application in the RDF format used by DRS for project import and export. Translation into this DRS-friendly form is provided by services running on the SIDGrid server. On return from our code, the download module need only signal the application where to find the data – a requirement that is already imposed by the existing Import Project task in the unmodified application – causing it to be folded into the internal DRS database.

This module was developed first for ELAN. Integrating it into DRS is providing us with the side benefit of an opportunity to insulate it more cleanly from the application so that it can be more easily inserted into future applications.

Figure 4: The SIDGrid architecture showing its relationship with applications like DRS, ELAN and the SIDGrid Portal.

Secondly, we are enabling upload of the current state of the DRS project data to the SIDGrid repository. This requires a symmetric set of modifications to the DRS code (compared with the download problem). Again, the action is tied into the GUI using an Export to SIDGrid item in the File menu. In this case, however, the module must collect the latest data resources together and ready them for upload. In general, this is a task that could be problematic because these resources are scattered throughout data structures maintained by the application, DRS in this case. Accessing these might well require replicating a lot of application-specific code and/or calling on many routines in the application code. Both of these approaches would require exorbitant effort to maintain against an evolving application software code base – too many dependencies to track. Instead we simplify the requirements placed on the code in the upload module by requiring that the user first save the project using the native Export RDF function.
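The design choice described above, in which the upload module depends only on a previously exported file rather than on the application's internal data structures, might be sketched as follows. The function name, the error message and the injected `upload` callable are hypothetical stand-ins; the actual SIDGrid client module and its interface are not shown in this paper.

```python
from pathlib import Path

def export_to_sidgrid(export_path, upload):
    """Upload a project file previously written by the application's own export.

    export_path: location of the file saved by the native export (assumed RDF).
    upload: callable(name, payload) standing in for the repository client,
            which is hypothetical here.
    """
    path = Path(export_path)
    if not path.is_file():
        # The module does not reach into application state; it only asks the
        # user to run the native export first.
        raise FileNotFoundError("Save the project with Export RDF before uploading")
    return upload(path.name, path.read_bytes())
```

Because the only coupling to the application is the location of the exported file, a module written this way would need little change when the application's internal code evolves.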

An alternative would be to intercept the native Export RDF action at the end of its work and simply push the results to the SIDGrid repository. This would require modification of the application code body, rather than of the more superficial code supporting GUI generation and action binding. Confining the changes to the GUI layer, as in the approach we have adopted, is somewhat cleaner, somewhat easier to implement, and somewhat easier to maintain.

Again, as this module was developed for ELAN, adding it to DRS requires less effort as both the mechanism and the code can be borrowed with almost no modification.

As we continue to design and lay these modifications into the DRS code and test the interaction of the application with the SIDGrid services we expect to run into difficulties not encountered with our ELAN effort. For one thing, DRS maintains a database rather than separate project files. We may also encounter complications when we write the format translation code, not from the mechanics of it, but rather from possible mismatches in terms and data types. For example, while outfitting ELAN for integration with SIDGrid we found that it did not support time series – a data type that is of critical importance to our user base. We had to fill the gap by developing a custom time series viewer module for ELAN, which was later replaced by a similar feature added by the application development team. Similar discoveries may await us in our DRS effort as our teams dig deeper into the co-development project.

We have also begun testing a SIDGrid port of Cvision to enable bulk analysis of large sets of video recordings; this effort leverages existing templates for similar application workflows. As of this writing we have installed a stand-in for Cvision that provides a test of the end-to-end workflow behavior including correct management of data inputs and data products. The next steps will be to integrate the Cvision core code and required libraries with this framework. An issue that remains to be solved is extricating the user from the setup and initialization phases in the existing code so that the code can be run autonomously against video files. The most difficult aspect will be automating the Pick Area to Track step currently required to initialize the tracking.
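The role of a stand-in in testing the end-to-end workflow can be illustrated with a small sketch. It is not the SIDGrid workflow code: the batch runner, the placeholder analysis and the `.avi` file check are all hypothetical, and the placeholder simply returns an empty movement summary where the real tracker would return results.

```python
def stand_in_tracker(video_name):
    """Placeholder for the real analysis: checks the input and returns an empty summary."""
    if not video_name.endswith(".avi"):  # file type chosen for illustration only
        raise ValueError("unsupported input")
    return {"video": video_name, "events": []}

def run_batch(video_names, analyse):
    """Apply analyse to each input; collect data products and failures separately."""
    products, failures = {}, {}
    for name in video_names:
        try:
            products[name] = analyse(name)
        except Exception as exc:
            # Record the failure and continue with the remaining inputs.
            failures[name] = str(exc)
    return products, failures
```

Swapping the placeholder for the real analysis leaves the batch logic unchanged, which is what allows the handling of inputs and data products to be tested before the core code is integrated.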

What does SIDGrid add to the workflow described in the earlier sections of this paper? This is what we anticipate: better data management, automated data and process provenance, data safety and security, opportunity for higher throughput analysis for computationally intensive projects such as gesture analysis, reuse of workflows, and application of analysis to archived datasets.

The integration approach described above creates an augmented workflow for the DRS application user. While continuing to benefit from the performance and convenience of working locally with data in the context of DRS, the user has access at network speeds to a potentially large volume of data maintained in the SIDGrid repository. Some of the browsing, tagging, and computing features provided by the Portal but not in the enhanced DRS will appeal to different users depending on the research problems and methods that they already employ.

Emergent considerations

As the collaboration between the groups at the University of Nottingham and the University of Chicago progresses we will be interested in identifying and exploring some of the key issues, opportunities, problems, and possibilities afforded by emerging cyberinfrastructure technologies.

In particular, we investigate the benefits of sharing data beyond the group responsible for collecting it, the additional science that can be generated from the shared data, and the technical, social and ethical barriers to this effort. In addition, how does a Grid environment change the research being done with this data?

Data stored on the SIDGrid repository can be enabled by its owners to be shared with a larger audience. This sharing of the data requires consideration of a number of issues that are not traditionally considered when collecting data for use by a single group. Prominent among these are: what metadata should be included to ensure more general utility of the data, and should annotations adhere to stricter standards and conventions? At the same time mechanisms exposed via the SIDGrid infrastructure need to be able to track the data’s use, maintain the integrity of the original data and revisions of its evolution during analysis. Are the existing features and provisions sufficient?

In addition to the data management aspects and consequences introduced by including a shared data repository in the research process, SIDGrid exposes a number of new issues and capabilities in connection with the Grid environment made available to researchers for use in data analysis. What are the opportunities and challenges afforded by automation of steps in the annotation and analysis of experimental data? Are new research patterns enabled – detailed manual analysis of an instance followed by guided automatic methods leveraging grid computation? What are the potential benefits of uniform analysis of collections of data – single instances could be anomalous, improved statistics, fodder for cluster analysis?

There are many ethical considerations faced throughout the development of multi-modal corpora. Another important priority for future research in this area is thus the development of tools and methods to address such issues; for example, to anonymise video data while still being able to extract the salient features that are the focus of the analysis. Pixelating faces or using shadow representations of heads and bodies can blur distinctions between gestures and language forms and, when taken to its logical conclusion, anonymisation should also include replacing voices with voice-overs and with other speakers. Ethical considerations of re-using and sharing contextually-sensitive video data as part of a multi-modal corpus resource need to be addressed further in consultation with end users, informants, researchers and ethics advisors. The issues are especially acute when tools are shared or are developed to be web-enabled.

Conclusions

In this paper we have described a collaborative project whose goals include scientific, technical and social aspects. The collaboration is built around an effort to integrate an ongoing experimental program with newly developed cyberinfrastructure. Specifically, the Handtalk project is developing methods and tools to study gesture in communication. This includes computer vision code for extracting head and hand motion from video. The SIDGrid project is developing infrastructure to support multi-modal data exploration and analysis.

The technical aspects of the project are well underway. Soon enough we will be in a position to evaluate the utility of the developed infrastructure as we apply it to the analysis of our data. The larger questions – from the value of shared data in the e-social sciences to the consequent ethical issues – will require time and experience to understand.

Acknowledgements

The research on which this article is based is funded in part by the UK Economic and Social Research Council (ESRC), e-Social Science Research Node DReSS, the ESRC e-Social Science small grants project HeadTalk (Grant No. RES-149-25-1016), the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38, and by the NSF under Grant No. BCS-05-37849.

