
Natural language processing class

dr. Slavko Žitnik

University of Ljubljana, Faculty of Computer and Information Science

February 2021

dr. Slavko Žitnik (FRI) NLP class (63555) February 2021 1 / 32


Natural language processing

visualization

association analysis

link mining

information extraction

pattern recognition

lexical analysis

information retrieval

coreference resolution

relation modeling

named entity recognition

document summarization

sentiment analysis

entity extraction

concept extraction

text clustering

text categorization


Screenshot: Google search results for "ivan cankar" (about 293,000 results), listing pages from Wikipedia, Wikisource, kam.si, Dijaski.net, Slovenska biografija, Hervardi and Občina Vrhnika.

Screenshot: the same results page with a knowledge panel added:

Ivan Cankar, Writer

Ivan Cankar was a Slovene writer, playwright, essayist, poet and political activist. Together with Oton Župančič, Dragotin Kette, and Josip Murn, he is considered as the beginner of modernism in Slovene literature. (Wikipedia)

Born: May 10, 1876, Vrhnika

Died: December 11, 1918, Ljubljana

Education: University of Vienna

People also search for: France Prešeren, Oton Župančič, Dragotin Kette, Srečko Kosovel, Josip Murn


DBPedia entry


Text excerpt


Zoogle - Traditional

http://tradicionalni-iskalnik.si

Query: Ivan Cankar [Search]

666 hits found:

Ivan Cankar - Wikipedija, prosta enciklopedija: Ivan Cankar was born in the house at Na klancu 141, as one of twelve children of an artisan-proletarian family. In 1882 he enrolled in primary …

Ivan Cankar, the greatest master of the Slovene word: Ivan Cankar was born in Vrhnika (Na Klancu) on 10 May 1876 as the eighth child in the declining artisan-proletarian family of a market-town tailor …

Cankar's death was a political murder: Cankar's death was a political murder. Writer, politician and people's tribune Ivan Cankar. Ivan Cankar is considered the greatest Slovene writer and …

[PDF] Ivan Cankar (1876 - 1918): The greatest master of the Slovene word and the central figure of modern literature comes from a poor family …

BIOGRAPHY Ivan Cankar: Ivan Cankar was born on 10 May 1876 into a peasant family in Vrhnika. There were eight children in the family. He left the family. Because he was very …

Ivan Cankar memorial house: Ivan Cankar memorial house. Ivan Cankar (1876 – 1918) is considered to be Slovenia's most important writer. The original house …

Pages: 1 | 2 | 3 | 4 | 5 | … | Last

Zoogle - Semantic

http://semantični-iskalnik.si

Query: Ivan Cankar [Search]

Ivan Cankar - Wikipedija, prosta enciklopedija: Ivan Cankar was born in the house at Na klancu 141, as one of twelve children of an artisan-proletarian family. In 1882 he enrolled in primary …

Ivan Cankar, the greatest master of the Slovene word: Ivan Cankar was born in Vrhnika (Na Klancu) on 10 May 1876 as the eighth child in the declining artisan-proletarian family of a market-town tailor …

Information extracted from 25 of the 666 hits found:

Ivan Cankar

rojenV (born in): Vrhnika, Na Klancu

seJeŠolal (studied at): Ljubljanska Realka

mati (mother): Cankarjeva mati (Cankar's mother)

prijatelj (friend): Kosovel

prijatelj (friend): Josip Murn

prijatelj (friend): Dragotin Kette

jeNapisal (wrote): Hlapec Jernej in njegova pravica


Information extraction

Definition

Information extraction

type of information retrieval

goal: to automatically extract structured data from unstructured data sources

Subtasks

named entity recognition

relationship extraction

coreference resolution

Preprocessing

Information extraction method


Information extraction

Preprocessing

Example input: John is married to Jena . They work at OBI .

Sentence detection: "John is married to Jena ." | "They work at OBI ."

Tokenization: John, is, married, to, Jena, ., They, work, at, OBI, .

Lemmatization: John be marry to Jena . They work at OBI .

Part-of-speech tagging: NNP VBZ VBN TO NNP . PRP VBP IN NNP .

Dependency parsing (arc labels shown on the slide): nsubjpass, auxpass, prep, pobj for the first sentence; nsubj, prep, pobj for the second sentence
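The first three preprocessing steps can be sketched with the Python standard library alone. This is only an illustration under simplifying assumptions: the regular expressions are naive, and the `LEMMAS` lookup table is a hypothetical stand-in that covers just the example sentence. A real pipeline would use a trained tokenizer, lemmatizer, tagger and parser.

```python
import re

# Toy lemma dictionary (hypothetical): covers only the example sentence.
LEMMAS = {"is": "be", "married": "marry"}


def split_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatize(tokens):
    """Look each token up in the toy dictionary; unknown tokens are kept unchanged."""
    return [LEMMAS.get(token.lower(), token) for token in tokens]


text = "John is married to Jena. They work at OBI."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
lemmas = [lemmatize(t) for t in tokens]

for sentence, toks, lems in zip(sentences, tokens, lemmas):
    print(sentence)
    print("  tokens:", toks)
    print("  lemmas:", lems)
```

Part-of-speech tagging and dependency parsing are left out on purpose, because they need a trained model rather than a few lines of rules.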


Information extraction

General approaches

Information extraction

Pattern-based: Discovery, Rules

Machine learning-based: Probabilistic, Induction

Examples named on the slide: HMM, CRF; N-gram; SVM, naive Bayes, ...; Linguistic; Structural; JAPE; Taxonomy label matching; Seed expansion


Information extraction Main information extraction tasks

Named entity recognition

John is married to Jena . He is a mechanic at OBI and she also works there .

It is a DIY market .

Entity labels: John (Person), Jena (Person), mechanic (Position), OBI (Organization), DIY market (Organization)
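A minimal sketch of named entity recognition on the example, using a dictionary (gazetteer) lookup. The `GAZETTEER` contents and the `tag_entities` helper are assumptions made for illustration; they are not a method prescribed by the course, and trained sequence models are the usual choice in practice.

```python
# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "John": "Person",
    "Jena": "Person",
    "mechanic": "Position",
    "OBI": "Organization",
    "DIY market": "Organization",
}


def tag_entities(text):
    """Return (start offset, surface form, entity type) for every gazetteer match.

    Matching is plain substring search, so it ignores word boundaries;
    that is acceptable for this toy example only.
    """
    found = []
    for surface, entity_type in GAZETTEER.items():
        start = text.find(surface)
        while start != -1:
            found.append((start, surface, entity_type))
            start = text.find(surface, start + 1)
    return sorted(found)


text = ("John is married to Jena. He is a mechanic at OBI "
        "and she also works there. It is a DIY market.")
entities = tag_entities(text)

for start, surface, entity_type in entities:
    print(f"{start:3d}  {surface:<12} {entity_type}")
```

A lookup like this cannot label names it has never seen, which is one reason learned taggers are preferred.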


Information extraction Main information extraction tasks

Relationship extraction

John is married to Jena . He is a mechanic at OBI and she also works there .

It is a DIY market .

Relation labels: employedAt, employedAt, isA, hasProfession, marriedWith
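Pattern-based relationship extraction can be illustrated with a few regular expressions over the example text. The patterns below are hypothetical and hand-written for these two sentences only; the relation names follow the labels on the slides.

```python
import re

# Hypothetical surface patterns: (compiled regex, relation label).
PATTERNS = [
    (re.compile(r"(\w+) is married to (\w+)"), "marriedTo"),
    (re.compile(r"(\w+) is an? (\w+) at"), "hasProfession"),
    (re.compile(r"(\w+) (?:works|is an? \w+) at (\w+)"), "employedAt"),
]


def extract_relations(text):
    """Return (subject, relation, object) triples for every pattern match."""
    triples = []
    for pattern, label in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group(1), label, match.group(2)))
    return triples


text = "John is married to Jena. He is a mechanic at OBI and she also works there."
relations = extract_relations(text)

for subject, label, obj in relations:
    print(f"{label}({subject}, {obj})")
```

The subject of the last two triples is the pronoun "He", not "John". Linking the two is the job of coreference resolution.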


Information extraction Main information extraction tasks

Coreference resolution

It is a DIY market .

John is married to Jena . He is a mechanic at OBI and she also works there .
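As a rough sketch of the idea, the pronouns in the example can be resolved with a "most recent compatible antecedent" heuristic. The gender lexicon `PERSONS` is an assumption introduced here, and the heuristic handles only "he" and "she"; mentions such as "there" or "It" would need more than this.

```python
# Assumed lexicons (hypothetical): known person names and pronoun genders.
PERSONS = {"John": "male", "Jena": "female"}
PRONOUNS = {"he": "male", "she": "female"}


def resolve_pronouns(tokens):
    """Replace each pronoun with the most recent person name of the same gender."""
    last_seen = {}  # gender -> most recently mentioned name
    resolved = []
    for token in tokens:
        gender = PRONOUNS.get(token.lower())
        if token in PERSONS:
            last_seen[PERSONS[token]] = token
            resolved.append(token)
        elif gender is not None and gender in last_seen:
            resolved.append(last_seen[gender])
        else:
            resolved.append(token)
    return resolved


tokens = ("John is married to Jena . He is a mechanic at OBI "
          "and she also works there .").split()
resolved = resolve_pronouns(tokens)
print(" ".join(resolved))
```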


Information extraction Main information extraction tasks

End-to-end information extraction

John is married to Jena . He is a mechanic at OBI and she also works there .

It is a DIY market .

Entity labels: Person, Person, Position, Organization, Organization

Relation labels: employedAt, employedAt, isA, hasProfession, marriedTo


About the course

Course goals

Goals

To study algorithms and methods for building computational models of natural language processing.

To study issues involved in understanding natural languages together with cognitive and linguistic phenomena.

To identify a text processing problem, design a solution and practically solve it.

To get to know existing NLP approaches, techniques, tools and the state-of-the-art in the field.

To become proficient in handling end-to-end text processing problems.


About the course

Syllabus

Proposed syllabus

Corpora acquisition and tagging, preprocessing techniques

Information extraction tasks and systems

Slovene text processing

Regular expressions, rule based systems

Semantic web and ontologies (optional in 2021)

Unsupervised learning and visualisation

Classification and tagging techniques

(Deep) Neural networks for text

Text processing assignment (practical work)


About the course

Timetable - tentative


About the course

Grading

Grading

First defense (10 points): task selection & simple corpus processing/analysis

Introduction, existing solutions, initial ideas.

Interim defense (10 points): at least one example of a solution to a problem

Introduction, related work, implemented baseline, future directions

Final defense (30 points): full submission and presentation

clean Git repository (fully reproducible) and final report

Rules

Attendance is preferred, but it is mandatory on the assignment defense dates.

Workshop pass condition: at least 25 points.


About the course

Assignment conditions

At the assignment defense dates at least one member of a group must be present; otherwise all members need to provide a doctor's note.

At the final assignment defense all members must be present and must understand all parts of the submitted solution.

Students must work in groups of two to three members! Groups of two have the same pass conditions, since they need less coordination and are more independent.

Each group will have to grant the assistant (GitHub username: szitnik) at least read access to a private Git repository.

The distribution of work between members should be visible from the commits within the repository.


About the course

Assignment goals

Goals

End-to-end NLP task processing

Understanding of existing works

Presenting results in an article-like report

Start early so that there will be enough time for fine tuning!

Tools

Python 3.6 (Anaconda) and a deep NLP library (PyTorch, TensorFlow, Keras)

Preferred IDE: JetBrains PyCharm (free for students, otherwise Community Edition)

Other tools & languages: whichever you prefer (e.g. Scala and IntelliJ IDEA)


Submission

SUBMISSION


Submission

Submission

PDF Report

Follow the IMRAD (Introduction-Methods-Results-And-Discussion) structure. A full paper should have Abstract, Introduction, Related work, (Data), Methods/Algorithms/..., Results/Evaluation/Experiments, (Discussion) and Conclusion.

Use the provided LaTeX template.

Submit a manuscript of 6-8 pages (max), including references. Longer papers should be discussed with the assistant.

Authors of better manuscripts will be allowed to publish them on arXiv.org, while the best could be extended into a journal or conference paper (Slovenscina 2.0, Uporabna informatika, etc.)


LaTeX template


Submission

Submission

Git repository

Organize code and repository structure according to your task.

The proposed structure should keep code, models, data and other materials in separate folders.

README.md should contain a short description of the project, instructions for compiling and running the project, and course/author data.

Follow good practices from existing Git repositories that you will find during the course work.


Related work

RELATED WORK


Related work

Related work

Related work search

NLP Journals, conferences

Computational Linguistics, TACL, NLE, IJCL

ACL 2020, EMNLP 2020, NIPS 2020

NLP Shared Tasks, workshops

CoNLL 2020, SemEval 2020, CodaLab

Code, paper repositories

arXiv.org, Google Scholar, Papers with code

Slovene NLP resources

NLP Journals, conferences

Slovenscina 2.0, JTDH 2020

NLP resources, projects

CLARIN.SI, CJVT, slovenscina.eu


Assignments

ASSIGNMENTS


Assignments

Assignment selection

The selected assignment task should involve richer text data processing and not only feature extraction and direct machine learning classification. Evaluation and discussion are the main objective.

Possible options:

(A) IMapBook collaborative discussions classification

(B) Offensive language exploratory analysis

(C) Cross-lingual offensive language identification

(D) Automatic language translation (joint work with UL FF)

Custom ideas must be approved by the assistant (proposals must be based on strong knowledge of related work and on solution ideas).


Assignments

(A) IMapBook collaborative discussions classification

Difficulty: easy

Data was gathered within the IMapBook system in schools. After participants (primary school pupils) read a book, book clubs were formed. Each book club had a discussion question to answer. Participants could chat to collaboratively provide the final answer (as in Google Docs). The goal is to classify each of the chat messages into specified categories.

Data (TO BE PROVIDED) contains approx. 800 chat messages, the book text, final responses and annotation instructions.

Goals:

How well can we classify postings into predefined categories?

Could (a) the result of a collaborative discussion or (b) the eBook text help train an algorithm that provides better results?
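One possible way to quantify how well messages are classified is per-class F1 and its macro average. The sketch below assumes that metric; the slides do not prescribe one. The category names are made up for illustration, because the real annotation scheme comes with the data.

```python
def f1_per_class(gold, predicted):
    """Compute the F1 score of every label that occurs in gold or predicted."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical labels for four chat messages (not the real IMapBook categories).
gold = ["content", "content", "greeting", "other"]
predicted = ["content", "greeting", "greeting", "other"]

scores = f1_per_class(gold, predicted)
macro_f1 = sum(scores.values()) / len(scores)
print(scores)
print("macro F1:", round(macro_f1, 3))
```

Macro averaging weights every category equally, which matters when some message categories are rare.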


Assignments

(B) Offensive language exploratory analysis

Difficulty: easy/moderate

Data (TO BE PROVIDED) consists of approx. 65 offensive language datasets (around 25 in English).

Goals:

Offensive language exploratory analysis (importance of specific keywords, relationships between categories, ...).

Description of existing offensive language texts using BoW, TF-IDF and pre-trained word embeddings (non-contextual: Word2Vec, GloVe, fastText; contextual: BERT, ELMo).

Cross-lingual mappings (e.g. from English to Slovene) using e.g. the LASER toolkit, and explanations.

Meaningful visualizations/representations of distances between existing annotation classes.
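To make the TF-IDF representation concrete, here is a small standard-library sketch. It uses one common variant (relative term frequency times log(N / df)) and whitespace tokenization; both are simplifying assumptions, and library implementations typically add smoothing and normalization. The example documents are neutral placeholders, not assignment data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df).

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["you are great", "you are awful", "awful awful weather"]
weights = tf_idf(docs)

for doc, doc_weights in zip(docs, weights):
    top_term = max(doc_weights, key=doc_weights.get)
    print(f"{doc!r}: most distinctive term = {top_term}")
```

Terms that occur in many documents, such as "you" here, get a lower weight than terms that single out one document.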


Assignments

(C) Cross-lingual offensive language identification

Difficulty: moderate

Data (TO BE PROVIDED) consists of approx. 65 offensive language datasets (around 25 in English).

Offensive language text classification for selected datasets (max. one dataset from Twitter)

Training on English data and transferring the model to Slovene using multilingual pre-trained models (e.g. CroSloEn BERT, mBERT, XLM-R) or embedding alignment (see Ales Zagar's master's thesis or Zan Pecovnik's diploma thesis)

Slovene data retrieval and automatic classification into offensive/non-offensive classes. Manual error analysis (at least 100 examples).


Assignments

(D) Automatic language translation (joint work with UL FF)

Difficulty: moderate/hard

Available parallel corpora: https://opus.nlpl.eu. Useful datasets: OpenSubtitles 2016 and 2018 (use other subtitle data with caution!), EUparl, EMEA, DGT, ELRC. Do not use bad corpora such as EUbookshop!

Cooperation with UL FF students (translators) to prepare additional data, a validation set, ...

Main task: selecting a translation framework and running models on distributed GPU-enabled machines (e.g. SLING). Detailed description of your work and analyses. Examples of frameworks: Fairseq (FB), Marian NMT (MS), T5 (Google), XLM-Roberta.
