Tailor Made Concordancer: (Semi-) Big Data Corpora and Flexible Open Source Software

Upload: dataconomy-media

Post on 12-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: "Tailor Made Concordancer: (Semi-) Big Data Corpora and Flexible Open Source Software", Tobias Gärtner, Consultant Big Data at Advisori FTC GmbH

Tailor Made Concordancer: (Semi-) Big Data Corpora and Flexible Open Source Software

Page 2

www.advisori.de

Some explanations – Writing Centres

Page 3

Student Diversity

Page 4

The Writing Centre Triangle

Student

Scientific Instructor

Writing Instructor

no communication

Missing knowledge: content, linguistics, academic traditions

Missing knowledge: content, linguistics, academic traditions

Page 5

A Mismatch in Communication

A Chinese student of mechanical engineering writing a bachelor's thesis in German: no idea of German metalanguage and German academic traditions

A German language instructor with a master's degree in social sciences: no idea of mechanical engineering in terms of content and academic traditions

Page 6

An Example

• Which verb goes together with “regression”?

a. Fit
b. Estimate
c. Calculate
d. Predict
e. Compute
f. I-hope-it-is-not-contagious

Page 7

Solution Strategies

• Ask a dictionary: There are no specialised dictionaries
• Ask Google: How would you?
• Ask the student: She/he does not know
• Ask someone else: Your colleagues know as much as you know
• Have a look at the respective literature: A good starting point

Page 8

Old-Fashioned Knowledge Mining

Page 9

Corpus Linguistics / Text Mining for Automated Knowledge Generation

Page 10

The Task

Design, programme and implement a tool that helps language instructors working at writing centres to support students writing in a foreign language

Page 11

Some Challenges

• No one wants to use a programme with such a syntax:
  • [a-z]*\[vbp\]\s[a-z\s]*\sregression[a-z]
• Sentence boundaries need to be respected
• It needs to run online, offline, on Windows, Windows Server, Linux, Linux servers and Mac (hey, why not on a smartphone as well)
• It needs to be easily maintainable
• It needs to return high-quality results without being too techy regarding IT and linguistic special terms
• It needs to be cheap (i.e. free)
• It needs to work with German, English and Russian texts
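One way to spare users the pattern syntax is to wrap it in a function that takes plain words. The following is only a sketch of that idea, not the Hannover Concordancer's code. It assumes lower-case text in which every token carries its POS tag in square brackets (for example `fit[vbp]`), as the pattern above suggests, and that the text has already been split into sentences. The name `verbs_before` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Assumed token format: word[tag], lower-case ASCII only.
# German or Russian texts would need a broader character class.
TOKEN = re.compile(r"([a-z]+)\[([a-z]+)\]")


def verbs_before(sentences, target, tag="vbp"):
    """Count words tagged `tag` that occur before the first `target` in a sentence.

    Working sentence by sentence keeps every hit inside sentence boundaries.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = TOKEN.findall(sentence)      # list of (word, tag) pairs
        words = [word for word, _ in tokens]
        if target not in words:
            continue
        end = words.index(target)             # position of the first target
        for word, pos in tokens[:end]:
            if pos == tag:
                counts[word] += 1
    return counts


# Hypothetical sample sentences in the assumed format
sample = [
    "we[prp] fit[vbp] a[dt] linear[jj] regression[nn]",
    "they[prp] estimate[vbp] the[dt] regression[nn] model[nn]",
    "we[prp] fit[vbp] it[prp]",
]
verb_counts = verbs_before(sample, "regression")
```

The instructor then only types the word of interest; the tag and the pattern stay hidden behind a default argument.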

Page 12

The Hannover Concordancer – A Joint Venture

Page 13

The Architecture

Texts

Metadata

LSA Database

Local / Remote Server Client

Page 14

Text Preparation Workflow

PDF TXT XML RData DB

RData Index

Texts

Meta Information

Document Term Matrix

Backend

Pre-Processing
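The document-term matrix named in this workflow can be illustrated with a small sketch. It is an assumption-laden illustration: the RData files on the slide suggest the original pipeline was written in R, while Python is used here only for demonstration. Whitespace tokenisation, the function name `document_term_matrix` and the two sample documents are hypothetical.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Naive tokenisation: lower-case and split on whitespace (an assumption)
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample documents
docs = ["we fit a regression", "we estimate a regression model"]
vocabulary, matrix = document_term_matrix(docs)
```

Each row is one document and each column one term, which is the form that LSA and frequency comparisons work on.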

Page 15

Query Input and Programme Output

Query Input: Words, Lemmata, POS Tags (of each up to 5); One Corpus, Two Corpora

Output: KWIC, Collocations, N-Grams, Readings, LSA Associations, Frequencies

The diagram arranges both query input and output along a "Complexity" axis.
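Two of the outputs listed here, KWIC (keyword in context) lines and n-gram counts, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's implementation: the input is an already tokenised list of words, and the function names `kwic` and `ngrams`, the window size and the sample text are hypothetical.

```python
from collections import Counter


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def ngrams(tokens, n=2):
    """Count all runs of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text
text = "we fit a linear regression to the data and plot the regression line"
tokens = text.split()
hits = kwic(tokens, "regression", window=2)
bigrams = ngrams(tokens, 2)
```

With this sample, `hits` holds one tuple per occurrence of "regression", each with up to two words of context on either side.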

Page 16

Contact Details

Feel free to contact me:

Via E-Mail: [email protected]
On Xing: https://www.xing.com/profile/Tobias_Gaertner35
On LinkedIn: https://www.linkedin.com/in/tobias-g%C3%A4rtner-b11205125/

Did you know we are hiring?