michael p. oakes university of sunderland. contents proposals for a master’s programme in natural...
Michael P. Oakes
University of Sunderland
Contents
• Proposals for a Master’s programme in Natural Language Processing
• Future research plans / link with Wolverhampton
• Plans for publications
• Plans for grant proposals
• Other funding ideas
Proposals for a Master’s programme in Natural Language Processing
• Some preliminaries:
• Entry requirements: first or second class degree in a related discipline. Computer programming will be taught from scratch.
• Funding: Erasmus, European Social Fund, ESRC Master’s training package scheme for programme development, work-based learning
• Students must receive an accurate idea of the content of the programme beforehand
• Induction week: meet the teaching team, familiarisation with the University, formal registration, etc.
• Diploma, Certificate and Master’s awards. 8 taught modules (24 lectures, 18 hours’ practical, 58 directed reading, 50 self-directed research).
Certificate Stage
REPLI (Research, Ethics, Professionalism and Legal Issues). Generic research skills such as referencing, statistics, experimental design. BCS Accreditation. *
Programming. PERL for string handling, R for statistics (can handle Bayesian statistics, text mining and graphs), Introduction to Java for general computing
Overview of NLP. Phonetics, morphology, lexis, parts-of-speech, syntax, semantics, pragmatics.
Empirical Linguistics. Corpora, annotation, alignment, collocations, anaphora resolution. *
Diploma Stage
Symbolic NLP. Finite-state transducers, parsing, semantic representation: first-order predicate calculus, semantic networks.
Machine Translation. Statistical, symbolic, example-based.
Information Retrieval. vector space model, indexing, summarisation, evaluation, clustering, text classification, text data mining. *
Research Seminars. All members of the group (and outside speakers) talk about their research. Assessment is to produce a good project proposal.
Project
• Close links with industry established through 3-month industrial placements, based either with the company or at the University.
• The sponsor will be from either industry or academia, and a staff member from Wolverhampton will also act as supervisor.
• Project management (TOR, reviews), poster, viva, dissertation (typically introduction, research, analysis, implementation, evaluation / experiments, reflective conclusions).
Administration
• Programme board of studies: Institute Director or deputy, student representatives, one or more employers’ representatives, module leaders, programme leader. Responsible for the management of the programme and the well-being of each module.
• Board of assessment: to decide student progression. External Examiner, no student representatives
• Internal (prior to hand-out) and External (sample work shown prior to programme assessments) moderation.
• Other quality control: student and staff feedback, EE’s report, programme annual report.
• Each student has a personal tutor and student handbook.
• Timely, face-to-face assessment may improve student satisfaction.
Future Research Plans,
• And how these might complement the research topics of the Research Group in Computational Linguistics.
Automatic Summarisation
• CAST Project produced an automatic summarisation tool: “term-based summarisation”
• Content-Based Abstracting (Paice).
• TRESTLE (Gaizauskas).
• David Evans: evaluation of information extraction
• Query-based summaries. Intrinsic (representativeness) vs. Extrinsic (judgeability) evaluation (Liang).
• SumTrain: reached second round of EU evaluation.
• Extraction of statistics-related phrases, e.g. “greater than”, “significant reduction in”, “was directly proportional to”, “did not affect”.
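The extraction of statistics-related phrases mentioned in the last bullet might be sketched with a simple pattern match. This is only an illustration: the pattern list contains just the four example phrases from the slide, the function name `find_stat_phrases` is hypothetical, and the sample sentence is invented.

```python
import re

# Phrase patterns taken from the four examples on the slide.
# A real system would need a much larger, curated list.
STAT_PATTERNS = [
    r"greater than",
    r"significant reduction in",
    r"was directly proportional to",
    r"did not affect",
]
STAT_RE = re.compile("|".join(STAT_PATTERNS), re.IGNORECASE)


def find_stat_phrases(text):
    """Return (phrase, start_offset) pairs for each match, in text order."""
    return [(m.group(0).lower(), m.start()) for m in STAT_RE.finditer(text)]


# Invented sample sentence, for illustration only.
sample = ("Fungicide gave a significant reduction in mildew, "
          "but sowing date did not affect yield.")
hits = find_stat_phrases(sample)
```

Each hit keeps its character offset so that the surrounding clause could later be pulled into a summary.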
Concept-Based Abstracting Project
• window length = 4
• STOP 6 "and foliar treatment AGEN"
• 5 "foliar treatment AGEN +"
• 5 "treatment AGEN + AGEN"
• 4 "effect of mildew AGEN"
• 3 "AGEN gave a significant"
• 2 "AGEN was the most"
• 2 "AGEN at different sowing"
• 2 "AGEN increased fertile tillers"
• LOW-FQ 1 "effect of AGEN sprays"
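A frequency list like the one above could be produced by counting every 4-word window that contains the AGEN marker. The following is a minimal sketch under that assumption; it is not the project's actual code, the function names are hypothetical, the toy sentences are invented, and it does not model the STOP and LOW-FQ labels.

```python
from collections import Counter


def windows_with_marker(tokens, marker="AGEN", length=4):
    """Yield every run of `length` consecutive tokens that contains `marker`."""
    for i in range(len(tokens) - length + 1):
        window = tokens[i:i + length]
        if marker in window:
            yield " ".join(window)


def count_windows(sentences, marker="AGEN", length=4):
    """Count marker-bearing windows over whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(windows_with_marker(sentence.split(), marker, length))
    return counts


# Toy sentences (invented), with agent mentions already replaced by AGEN.
sentences = [
    "seed and foliar treatment AGEN reduced mildew",
    "a foliar treatment AGEN + AGEN was applied",
    "AGEN gave a significant yield increase",
]
counts = count_windows(sentences)

# Most frequent contexts first, as on the slide.
for window, freq in counts.most_common(3):
    print(freq, repr(window))
```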
Automatic Terminology Processing
• Le An Ha looked at the concept of a terminology rather than individual terms. Knowledge patterns from glossaries: store of terms and relations between them.
• David Evans. Identification of terms using TF.IDF and other statistical methods (see slide 20).
• Shiyan Ou. Sentiment classification (see slide 20).
• Constantin Orasan. Corpus of junk mail (spam filters, Farrow).
• Constantin Orasan. Analysis of genre differences – project on “Language, Computation and Style” (authorship).
• Englishes, Scrip newsfeeds, BELGA: “feature extraction for text classification”.
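For readers unfamiliar with TF.IDF, mentioned above as one way of identifying terms, here is a minimal sketch. It uses raw term frequency times log(N / df), which is only one common variant; the slides do not say which weighting was used, and the toy documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Weight = raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Invented toy documents.
docs = [
    "the anaphora resolution algorithm resolves the anaphora".split(),
    "the summarisation tool ranks the sentences".split(),
    "the tagger labels the tokens".split(),
]
weights = tf_idf(docs)
top_term = max(weights[0], key=weights[0].get)
```

A word that occurs in every document, such as "the", gets a weight of zero, while a word concentrated in one document rises to the top as a term candidate.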
Annotation tools
• Constantin Orasan: PALinkA, automatic annotation of anaphoric links.
• Lewandowska, Oakes & Rayson: part-of-speech and semantic code tagging in English; alignment enables partial semantic tagging of L2.
Annotation: Aligned and Partially Tagged Polish text (Lewandowska, Oakes and Rayson)
• Tak jest_A3+ mowi Polemarch_Z99 a do_Z5 tego jeszcze urzadra nocne nabozenstwo, ktore_Z8 warto zobaczyc
• “_”_PUNC That_DD1_Z8 ’s_VBZ_A3+ the_AT_Z5 way_NN1_X4.2 of_IO_Z5 it_PPH1_Z8 ,_,_PUNC “_”_PUNC said_VVD_Q2.1 Polymarchus_NP1_Z99 _,_,PUNC “_”_PUNC and_CC_Z5 ,_,_PUNC besides_RR_Z5 _,_,PUNC there_EX_Z5, is_VBZ_A3+ to_TO_Z5 be_VBI_A3+ a_AT1_Z5 night_NNT1_T1.3 festival_NN1_K1/S1.1.3+ which_DDQ_Z8 will_VM_T1.1.3 be_VBI_A3+ worth_II_I1.3 seeing_VVG_X3.4 ._._PUNC
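The underscore-separated annotation shown above can be read back into structured records with a few lines of code. This sketch assumes the layout `word_POS_SEM` on the English side and `word_SEM` (or a bare word) on the partially tagged Polish side; the function name `parse_token` is hypothetical, and punctuation tokens that themselves contain underscores would need extra handling.

```python
def parse_token(token):
    """Split 'word_POS_SEM', 'word_SEM' or a bare 'word' into a record."""
    parts = token.split("_")
    word = parts[0]
    if len(parts) == 3:
        return {"word": word, "pos": parts[1], "sem": parts[2]}
    if len(parts) == 2:
        return {"word": word, "pos": None, "sem": parts[1]}
    return {"word": word, "pos": None, "sem": None}


# First few tokens of the Polish line from the slide.
polish = "Tak jest_A3+ mowi Polemarch_Z99 a do_Z5 tego".split()
parsed = [parse_token(t) for t in polish]

# How many Polish tokens received a semantic code through alignment?
tagged = sum(1 for p in parsed if p["sem"])
print(f"{tagged} of {len(parsed)} tokens carry a semantic code")
```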
Mobile Devices
• Laura Hasler and Dalila Mekhaldi: QALL-ME, Question-Answering for Digital Phones.
• Chufeng Chen: Annotation of digital photographs taken with a GPS camera. A gazetteer “translated” latitude and longitude data into a place name and geographical feature, e.g. Lat = 54.91, Long = -1.4, place = Sunderland, feature = harbour. Episodic memory.
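The gazetteer lookup could be sketched as a nearest-neighbour search over known places. This is an assumption about how such a lookup might work, not a description of the actual system. Only the Sunderland row comes from the slide (read as latitude 54.91, longitude -1.4); the other rows, the 5 km cut-off and the function names are illustrative.

```python
import math

# Hypothetical gazetteer rows: (latitude, longitude, place, feature).
# Only the Sunderland row comes from the slide; the others are rough,
# illustrative values.
GAZETTEER = [
    (54.91, -1.40, "Sunderland", "harbour"),
    (52.59, -2.13, "Wolverhampton", "city centre"),
    (51.50, -0.12, "London", "river"),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def annotate(lat, lon, max_km=5.0):
    """Return (place, feature) of the nearest entry within max_km, else None."""
    best = min(GAZETTEER, key=lambda row: haversine_km(lat, lon, row[0], row[1]))
    if haversine_km(lat, lon, best[0], best[1]) <= max_km:
        return best[2], best[3]
    return None


# A photograph taken a few hundred metres from the gazetteer point.
label = annotate(54.912, -1.395)
```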
Other Related Work
• Andrea Mulloni: Corpus Linguistics.
• Empirical vs. Chomskyan
• Own interest “Statistics for Corpus Linguistics”.
• Driving the process rather than merely testing for statistical significance, e.g. Mutual Information to find collocations.
• Irina Temnikova: Machine Translation
• Alignment for example-based machine translation (Lewandowska & Oakes).
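As an illustration of using Mutual Information to find collocations, here is a minimal sketch that scores adjacent word pairs by pointwise mutual information. The frequency cut-off `min_count`, the function name and the toy text are assumptions made for the example.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Pointwise mutual information for adjacent word pairs.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than min_count times are
    dropped, since PMI tends to overrate rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Invented toy text.
text = ("strong tea and strong coffee but powerful computers "
        "and strong tea with powerful computers").split()
scores = pmi_scores(text)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Here "powerful computers" outranks "strong tea" because "strong" also combines with other words, which lowers its association with any one partner.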
Plans for Publications (1)
• Book Chapters in press:
• Processing Multilingual Corpora, Chapter 32 of Corpus Linguistics: An International Handbook, eds. Anke Lüdeling and Merja Kytö, Mouton de Gruyter.
• Corpus Linguistics and Stylometry, Chapter 52, ibid.
• Corpus Linguistics and Language Variation, in Contemporary Approaches to Corpus Linguistics, ed. Paul Baker, Continuum.
• Javanese, in “Languages of the World”, ed. Bernard Comrie, Routledge.
• J. Vilares, M. Oakes and M. Vilares: A Knowledge-Light Approach to Query Translation in CLIR. RANLP V, ed. N. Nicolov, Benjamins.
Plans for Publications (2)
• Under second review:
• S-W. Ke, C. Bowerman and M. Oakes, “Automatic classification of personal email with PERC and time-related strategies”, ACM Transactions on Information Systems.
• W-C Lin, M. Oakes and J. Tait, “Improving image annotation via representative feature selection”, Cognitive Processing.
Plans for Publications (3)
• Future plans:
• VITALAS: Video and image Indexing and reTrievAl in the LArge Scale.
• Update “Statistics for Corpus Linguistics” – sold over 1500 copies, but now 10 years old
• Last chapter was “Literary Detective Work”, which could be a book in its own right: disputed authorship (compendium of techniques, Shakespeare, religious texts, still unsolved mysteries e.g. The Quiet Don, Marxism and the Philosophy of Language), unknown languages (Linear B, Voynich manuscript). JLLC, QL.
Plans for Grant Proposals (1)
• Closing the Semantic Gap
• Related to machine learning (boosting), caption analysis, gazetteers, alignment of low level image content features and high level semantic features (words)
• Son of VITALAS?
Image content | Semantic description
H = 0, S = 1, V = 0.5, F = 0.9 | Kim Clijsters, tennis
H = 1, S = 0.6, V = 0, F = 0.125 | Palace of Brussels
H = 0.3, S = 0.3, V = 1, F = 0.9 | Centre Court, Wimbledon, tennis
Plans for Grant Proposals (2)
• Which words are truly characteristic of a corpus? X² etc.
• Countable linguistic features.
• Measures from IR e.g. PageRank (Łódź, Palomino).
• AHRC (if theoretical, Englishes), ESRC (if applied, e.g. spam filters).
• Sentiment analysis (Thijs Westerveld at Teezir): mining online opinions. Cheerful, chic, cheap, clean vs. chaos, cranky, cumbersome, damaged.
• Interface between NLP and IR: sentence analysis e.g. adjectives, negatives; follow links to navigate websites.
• IR relevant vs. irrelevant documents.
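The question of which words are characteristic of a corpus can be illustrated with Pearson's X² on a 2x2 contingency table (word vs. all other words, corpus 1 vs. corpus 2). This is a minimal sketch: the function names and the spam/non-spam word counts are invented for the example.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson's X² for a 2x2 contingency table.

                  corpus 1   corpus 2
    word             a          b
    other words      c          d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def characteristic_words(freq1, freq2):
    """Rank words by X², comparing two {word: count} frequency lists."""
    total1, total2 = sum(freq1.values()), sum(freq2.values())
    scores = {}
    for word in set(freq1) | set(freq2):
        a, b = freq1.get(word, 0), freq2.get(word, 0)
        scores[word] = chi_squared_2x2(a, b, total1 - a, total2 - b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented counts for a spam corpus and a non-spam corpus.
spam = {"the": 50, "free": 40, "meeting": 10}
ham = {"the": 50, "free": 10, "meeting": 40}
ranked = characteristic_words(spam, ham)
```

X² says only that a word's frequency differs between the corpora, not in which one it is over-represented, so the two raw frequencies still need to be compared.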
Plans for Grant Proposals (3)
• Temporal relations in query language modelling (Dawei Song).
• Temporal similarity + semantic similarity → overall similarity.
• The temporal similarity between texts (e.g. query and document) can be estimated by a) time stamp, b) temporal logic between the texts (Andrea Setzer).
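One way to picture the combination of temporal and semantic similarity is as a weighted sum. This sketch covers only option (a), the time stamp. The exponential decay, the 30-day half-life, the equal weighting and the function names are all assumptions made for illustration, and temporal logic between texts, option (b), is not modelled.

```python
import math
from collections import Counter
from datetime import date


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def temporal_similarity(d1, d2, half_life_days=30.0):
    """1.0 for identical time stamps, halving every half_life_days apart."""
    gap = abs((d1 - d2).days)
    return 0.5 ** (gap / half_life_days)


def overall_similarity(q_tokens, q_date, d_tokens, d_date, weight=0.5):
    """Weighted sum of temporal and semantic similarity (weight is a guess)."""
    return (weight * temporal_similarity(q_date, d_date)
            + (1 - weight) * cosine(q_tokens, d_tokens))


# Invented query and document, one month apart.
query = "tennis final".split()
doc = "tennis final result".split()
score = overall_similarity(query, date(2008, 1, 1), doc, date(2008, 1, 31))
```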
Plans for Grant Proposals (4)
• Corpus Profiling Workshop on October 18th.
• Aims: to explore how corpus characteristics affect the behaviour of techniques in IR and NLP, and to set out a roadmap for a shared research agenda.
• Data set profile impacts on automatic classification, IR, anaphora resolution, automatic summarisation and word sense disambiguation.
Other Funding Ideas
• IRSG-like “Industry Day” to foster industrial contacts (consultancy? Grant proposals?)
• Organise conferences, e.g. bid for Corpus Linguistics, CLEF, ECIR.
• Exploitation of Intellectual Property.
• Is there an equivalent of CEDEC (Computing and Engineering Distance Education Centre) with whom we can discuss marketing programmes world-wide / part-time? Work-based learning?