18-03-2013 Hung LST Day 1 Language Technology for the Humanities: why and how? Steven Krauwer Utrecht University CLARIN ERIC Executive Director


18-03-2013 Hung LST Day 1

Language Technology for the Humanities: why and how?

Steven Krauwer

Utrecht University

CLARIN ERIC Executive Director


Overview

Why?

How?
  CLARIN in a nutshell
  The dream
  The vision
  Phasing
  CLARIN ERIC
  The nightmare
  The challenge

Why join?

Concluding remarks


Why (1)

Wealth of digital language data, spread all over Europe in archives, repositories, libraries

Reflects human behaviour, communication, knowledge, culture etc

Rich source of data, information and knowledge for Humanities and Social Sciences (HSS) scholars (historians, philosophers, social scientists, …)

In addition: the results of 30 years of European HLT (human language technology) efforts

In brief: a great opportunity for HSS to innovate itself and to become world leaders, especially because of our multilinguality

BUT …


Why (2)

BUT …

How do HSS scholars know what data exists?

How can they get access to data from all over Europe?

How do they know what tools exist to retrieve, explore and exploit these data?

How do they know how to decompose their HSS research questions into sub-questions that can be answered by digital methods?

OUR ANSWER:

CLARIN: the Common Language Resources and Technology Infrastructure for the Humanities and Social Sciences


How: CLARIN in a nutshell

Common Language Resources and Technology Infrastructure (http://www.clarin.eu)

Basic idea:
  European federation of digital repositories with language data and tools (text, speech, multimodal, gesture …)
  with access to language and speech technology tools through web services to retrieve, manipulate, enhance, explore and exploit data
  with uniform single sign-on access to archives and tools
  target audience: humanities and social sciences scholars
  to cover all EU and associated countries and all languages relevant for the target audience


The CLARIN dream

give me digital copies of all contemporary documents in European archives that discuss the Great Plague of England (1348-1350)

give me all negative articles about Islam or about soccer in the Slovenski Narod daily newspaper (1868-1943)

find European TV news interviews that involve speakers with a Hungarian accent

summarize all articles in European newspapers of August 2012 about OCR – in Portuguese

show me the pronoun systems of the languages of Nepal


The vision: the role of language

Language is at the heart of many disciplines in the Humanities and Social Sciences (HSS), e.g.
  as an object of study
  as a means of human communication
  as a means of human expression
  as a record of our history
  as part of one’s cultural identity
  as carrier of knowledge and information

CLARIN wants to support them all

Language and speech technology are part of this (e.g. in the form of computational linguistics or speech science) – essential, but just a part!


The vision: what CLARIN wants to offer

CLARIN makes it possible for the researcher to find resources (metadata search), and to refer to them in a persistent way (persistent identifiers)

CLARIN allows for content search in and across collections

CLARIN offers access to web services and workflows to perform complex linguistic & content operations and visualisations

CLARIN covers both historical and contemporary language material in all modalities

CLARIN serves both expert and non-expert users

CLARIN offers access to depositing and long term preservation services

Ultimate goal: advancing HSS in order to get a better understanding of our society at a European scale


Phasing of CLARIN

Does CLARIN exist? Yes and no.

2008-2011: CLARIN Preparatory Phase Project, 26 countries, EC funded
  Goal: designing the infrastructure technically and organisationally, and lining up the players

2012-2015: Construction Phase, jointly funded by the participating countries, no EC funding
  Goal: building the European infrastructure

2015-…: Exploitation Phase, jointly funded by the participating countries, no EC funding
  Goal: making and keeping it running, populating it, and ensuring that it follows new trends in technology and research – covering all EU and associated countries


CLARIN ERIC

CLARIN ERIC is the governance and coordination body, but will not run or fund operational data services

An ERIC is a new type of intergovernmental legal entity, created by the EC, essentially a consortium of countries, with no end point

CLARIN ERIC member countries pay a modest annual fee

Countries will each set up a national CLARIN consortium that will provide data and linguistic services and create data and tools

It is up to the countries to decide how to shape and fund their CLARIN consortia and how to relate them to other activities at the national level (e.g. research programmes, digitisation programmes, etc.)

CLARIN ERIC established by the EC on Feb 29th 2012, with 9 founding members: AT, BG, CZ, DE, DK, EE, NL, PL, DLU (Dutch Language Union)

More in the pipeline, NO (Norway) joining at this moment – but we need all European countries!


What is so nice about ERICs?

They are legal entities, not projects, which helps to make them more sustainable

Members are governments, committing themselves for longer periods of time (min. 5 years)

CLARIN ERIC is a sign of recognition by governments and EC of the importance of sharing language resources

Closeness to funding agencies may help to enforce use of standards and sharing of data in projects they fund

Good starting point for international collaboration as third countries can join or make collaboration agreements (e.g. through agencies or data centres)

ERICs may submit proposals for EC funding

But: bulk of the funding dependent on funding mechanisms and cycles in participating countries – NOT from EC


The CLARIN nightmare

give me digital copies of all contemporary documents in European archives that discuss the Great Plague of England (1348-1350)

give me all negative articles about Islam or about soccer in the Slovenski Narod daily newspaper (1868-1943)

find European TV news interviews that involve speakers with a Hungarian accent

summarize all articles in European newspapers of August 2012 about OCR – in Portuguese

show me the pronoun systems of the languages of Nepal


The CLARIN nightmare, example 1

give me digital copies of all contemporary documents in European archives that discuss the Great Plague of England (1348-1350)

“All” means from all countries and all archives, not just some archives in some (now 10) CLARIN ERIC member countries

If contemporary docs exist in digital form at all they are probably pictures – how do we get access to the content? Is OCR doable?

Can we rely on standardized metadata to find them?

Are our topic detection technologies good enough?

Many of the docs may be in Latin, can we handle that, and what about other languages, e.g. Hungarian?

How would a non-technical scholar know how to formulate this query?


The CLARIN Challenge

Do HSS scholars realize at all that they should be interested in these things?

Some do, most don’t; we should make an effort to show them the potential benefits of adopting these new methods

Showcases and visualisation tools are indispensable

Distinguish between lost and future generation

Are the tools offered by language and speech technology the direct answers to the problems of HSS scholars as they see them?

Major technological efforts are needed, but technologists have a strong tendency to offer more and better gearboxes to people who are just waiting for a bus with comfortable seats (and a gearbox)

Technologies that work for modern versions of big languages may not work for older versions or not even exist for digitally less favoured languages

Use and adaptation of existing tools to specific HSS questions may always require intervention by technologically skilled people


What would it take to join

Only countries can be ERIC members, not individual research institutions; countries that join CLARIN ERIC would have to
  recognize the ERIC as a legal entity (done for EU countries)
  commit themselves for at least 5 years
  pay an annual membership fee (ranging from 12.000 to 200.000 euro, depending on GDP; for HU ca. 12.000 euro)
  set up and fund a national CLARIN consortium (universities, data archives, etc.) to provide access to their data, and to create new data and tools according to their national research priorities
  identify (and fund) at least one existing data centre as the national hub that is linked to the rest of CLARIN
  commit themselves to sharing resources and adoption of CLARIN standards in nationally funded projects


The benefits from joining

Access to the CLARIN Infrastructure, i.e. to all CLARIN language resources and technology services for scholars in the humanities and social sciences (HSS)

Access to expertise from all over Europe via the CLARIN knowledge sharing infrastructure

Embedding in mainstream European HSS research community, with access to the same data

Better visibility of their research results, their resources, their language and their cultural heritage in the European research community

Open doors for cross-lingual and cross-cultural research

Embedding in the European Research Area

Opportunities to participate in EU projects initiated by CLARIN ERIC


What if Hungary does not join?

The bright side:
  No need to pay an annual 12000 euro membership fee
  No need to agree on and comply with standards intended to facilitate exchange of data
  No obligation to share and preserve digital results from projects with public funding after their completion
  No need to set up a national consortium to coordinate infrastructure building and creation of data and tools at the national level
  No need to collaborate with European partners to make tools and resources interoperable at the European level

Researchers whose horizon lies within Hungary wouldn’t even notice!


What if Hungary does not join?

The less bright side for Hungarian researchers:
  They would have to make their own individual arrangements to get access to data and services outside Hungary
  Not having access to the same data and tools might create obstacles for cross-national collaboration
  Their data and tools might be less visible in the European research community, and results not reproducible and therefore not recognized
  Hungary was one of the leading players in the CLARIN project and risks gradually lagging behind

The less bright side for CLARIN:
  We would have to do without the excellent human and linguistic resources we know the Hungarian research community has to offer
  We would have no alternative way to cover the Hungarian language and to provide access to its data collections to the HSS research community


What makes CLARIN interesting in comparison with other RIs?

No cash contribution other than the annual fee to pay for governance and coordination; other than that no cross-border funding

Fee fixed for 5 years with 2% annual increase, no surprises

Commitment to investing at the national level, but no major capital investment required, no fixed prescribed amounts

Selection of data and tools to be created follows from own research priorities and economic situation – not centrally decided

HSS scholars have no digital tradition: unique opportunity to innovate research

HSS scholars tend to work in isolation: unique opportunity to become part of the mainstream European research community
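
To make the "fee fixed for 5 years with 2% annual increase" point concrete, here is a rough sketch of what the fee schedule could look like for a country in the lowest fee band. It rests on assumptions that are not on the slide: that the 2% increase compounds yearly, and that the first-year fee is the ca. 12.000 euro quoted for Hungary. The function name `fee_schedule` is made up for illustration.

```python
# Illustrative sketch only, not an official CLARIN ERIC fee calculation.
# Assumptions (not stated on the slide): the 2% increase compounds yearly,
# and year 1 is charged at the base fee (ca. 12,000 euro quoted for Hungary).

def fee_schedule(base_fee: float, years: int = 5, annual_increase: float = 0.02) -> list[float]:
    """Return the fee for each year of the period, with year 1 equal to base_fee."""
    return [round(base_fee * (1 + annual_increase) ** year, 2) for year in range(years)]


if __name__ == "__main__":
    schedule = fee_schedule(12_000)
    for year, fee in enumerate(schedule, start=1):
        print(f"Year {year}: {fee:,.2f} euro")
    print(f"Total over {len(schedule)} years: {sum(schedule):,.2f} euro")
```

Under these assumptions the fee stays below 13.000 euro in the fifth year, which is the sense in which the slide promises "no surprises".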


Concluding remarks

CLARIN has a lot to offer to the Hungarian research community in terms of access to data, tools and expertise, and participation in CLARIN will move Hungarian forward towards full participation in the Digital Age

Hungary has a lot to offer to CLARIN, as is demonstrated by its successful participation in the CLARIN Preparatory Phase and in sister initiatives such as META / CESAR

In times of crisis it is hard for the funding bodies to assign priorities to competing research infrastructure initiatives, but it should be kept in mind that
  in financial terms CLARIN is a low cost entry model research infrastructure with no financial risks
  with its language Hungary has a unique selling point in Europe!