"tailor made concordancer: (semi-) big data corpora and flexible open source software",...
Tailor Made Concordancer: (Semi-) Big Data Corpora and Flexible Open Source Software
www.advisori.de 2
Some explanations – Writing Centres
Student Diversity
The Writing Centre Triangle
Student
Scientific Instructor
Writing Instructor
No communication
Missing knowledge: content, linguistics, academic traditions
A Mismatch in Communication
A Chinese student of mechanical engineering writing a bachelor's thesis in German:
no idea of German metalanguage and German academic traditions
A German language instructor with a master's degree in social sciences:
no idea of mechanical engineering in terms of content and academic traditions
An Example
• Which verb goes together with "regression"?
a. Fit
b. Estimate
c. Calculate
d. Predict
e. Compute
f. I-hope-it-is-not-contagious
Solution Strategies
• Ask a dictionary: There are no specialised dictionaries
• Ask Google: How would you?
• Ask the student: She/he does not know
• Ask someone else: Your colleagues know as much as you know
• Have a look at the respective literature: A good starting point
Old-Fashioned Knowledge Mining
Corpus-Linguistics / Text-Mining for automated Knowledge Generation
The Task
Design, programme and implement a tool that helps language instructors
working at writing centres to support students
writing in a foreign language
Some Challenges
• No one wants to use a programme with such a syntax:
  [a-z]*\[vbp\]\s[a-z\s]*\sregression[a-z]
• Sentence boundaries need to be respected
• It needs to run online, offline, on Windows, Windows Server, Linux, Linux servers and Mac (hey, why not on a smartphone as well)
• It needs to be easily maintainable
• It needs to return high-quality results without being too techy regarding IT and linguistic special terms
• It needs to be cheap (i.e. free)
• It needs to work with German, English and Russian texts
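One way to hide such a syntax from the user is to generate the regular expression from a few plain parameters. The following is only a minimal sketch of that idea, not the Hannover Concordancer's actual implementation. It assumes a hypothetical input format in which every token is written as word_TAG and every list element holds exactly one sentence, so a match can never cross a sentence boundary. The function names build_pattern and find_partners are made up for this illustration.

```python
import re


def build_pattern(tag_prefix, target, max_gap=3):
    """Build a regex for: a word whose tag starts with `tag_prefix`,
    followed by at most `max_gap` other tokens, followed by `target`.
    Assumes tokens look like word_TAG (an assumption of this sketch)."""
    token = r"[^\s_]+_[A-Z$]+"  # one token in the assumed word_TAG format
    return re.compile(
        rf"(?<!\S)([^\s_]+)_{re.escape(tag_prefix)}[A-Z]*"  # captured word
        rf"(?:\s{token}){{0,{max_gap}}}"                     # tokens in between
        rf"\s{re.escape(target)}_[A-Z$]+(?!\S)",             # the target word
        re.IGNORECASE,
    )


def find_partners(sentences, tag_prefix, target, max_gap=3):
    """Return the words found before `target`, searching sentence by sentence."""
    pattern = build_pattern(tag_prefix, target, max_gap)
    hits = []
    for sentence in sentences:  # one sentence per element: boundaries respected
        for match in pattern.finditer(sentence):
            hits.append(match.group(1).lower())
    return hits


# Invented example data, one tagged sentence per list element.
tagged = [
    "We_PRP fit_VBP a_DT linear_JJ regression_NN ._.",
    "They_PRP estimate_VBP the_DT regression_NN ._.",
    "Results_NNS vary_VBP ._.",
    "Regression_NN is_VBZ common_JJ ._.",
]

print(find_partners(tagged, "VB", "regression"))
```

With this kind of wrapper, the instructor only supplies a part of speech, a target word and a maximum distance; the regular expression itself stays out of sight.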
The Hannover Concordancer – A Joint Venture
The Architecture
Texts
Metadata
LSA Database
Local / Remote Server Client
Text Preparation Workflow
PDF TXT XML RData DB
RData Index
Texts
Meta Information
Document Term Matrix
Backend
Pre-Processing
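The document term matrix named in the workflow can be illustrated with a small example. This is a sketch under assumptions, not the tool's own pre-processing: the slides mention RData, so the real pipeline presumably runs in R, and the tokenizer, the function names and the two sample texts below are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters (any alphabet)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, matrix): matrix maps each document id to a row of
    term counts, one column per vocabulary entry."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {
        doc_id: [term_counts[term] for term in vocabulary]
        for doc_id, term_counts in counts.items()
    }
    return vocabulary, matrix


# Invented sample documents standing in for the converted texts.
docs = {
    "thesis_a": "We fit a regression. We fit it twice.",
    "thesis_b": "They estimate a regression.",
}

vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Each row describes one text and each column one term, which is the kind of input that frequency counts and LSA typically start from.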
Query Input and Programme Output
Query Input: Words, Lemmata, POS Tags (of each up to 5); One Corpus or Two Corpora
Output: KWIC, Collocations, N-Grams, Readings, LSA Associations, Frequencies
Both axes: Complexity
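Two of the listed outputs, KWIC (keyword in context) and collocations, can be sketched in a few lines. This is an illustrative sketch only, assuming an already tokenized sentence and a fixed window of neighbouring words; the function names and the sample sentence are invented, and the real tool may count and rank collocations differently.

```python
from collections import Counter


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def collocates(tokens, keyword, window=3):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample: one tokenized sentence, so no sentence boundary is crossed.
tokens = ("we fit a linear regression to the data and then "
          "we fit a second regression").split()

for left, keyword, right in kwic(tokens, "regression", window=2):
    print(f"{left:>12} | {keyword} | {right}")

print(collocates(tokens, "regression", window=3).most_common(3))
```

Calling both functions once per sentence keeps contexts from running across sentence boundaries, in line with the requirement stated among the challenges.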
Contact Details
Feel free to contact me:
Via E-Mail: [email protected]
On Xing: https://www.xing.com/profile/Tobias_Gaertner35
On LinkedIn: https://www.linkedin.com/in/tobias-g%C3%A4rtner-b11205125/
Did you know we are hiring?