Database Systems Research Group Heidelberg University April 22, 2020 Software Practicals Summer Semester 2020




Slides Online

The slides are available on our webpage: https://dbs.ifi.uni-heidelberg.de/teaching/current/


Organization


Outline

● Overview of topics (today)
  ○ send application for a topic by Monday, April 27, 1 pm
  ○ assignment of topics by April 29

● First milestone (mid/end May)
  ○ prototype/part of software
  ○ summary of research (literature and related systems/tools)
  ○ further milestones in agreement with supervisor

● End of practical (mid/end July)
  ○ code in local Gitlab
  ○ report / documentation as local Wiki document
  ○ presentation / demo of practical and software (10-12 minutes)


Organizational issues

● Application
  ○ by email directly to supervisor
  ○ brief list of relevant courses / prior knowledge / “Anwendungsgebiet” (application area)
  ○ schedule and milestones for the practical
  ○ group work is not possible
  ○ application is binding (don’t apply if you don’t want to do the practical)

● Deadlines
  ○ presentation: planned for last week of July 2020
  ○ report & Gitlab upload: end of August 2020
  ○ no extension possible
  ○ not finished = failed (grade 5.0)


Assessment

● Credit points (Leistungspunkte)
  ○ Beginners Practical (IAP, 2+4 ECTS) [Bachelor students]
    ■ workload: 180 h (~1 ½ days/week)
  ○ Advanced Practical (IFP, 8 ECTS)
    ■ workload: 240 h (~2 days/week)

● Grading based on
  ○ code (readability, structure, functionality)
  ○ documentation (README, comments)
  ○ commitment and self-reliance
  ○ cool ideas!!

● IMPORTANT
  ○ talk to / communicate with your advisor


Supervisors

● Michael Gertz (MG)

[email protected]

● Satya Almasian (SA)

[email protected]

● Dennis Aumiller (DA)

[email protected]

● Philip Hausner (PH)

[email protected]


Project Topics


Overview of Topics

1. Implement Citation Extraction in spaCy, BP/AP, (Aumiller)

2. Outline Generation for Wikipedia Articles, AP, (Aumiller/Almasian)

3. Analysis of RNV Delays, BP/AP, (Aumiller/Hausner)

4. Time-dependent analysis of COVID-19 case development, BP/AP, (Hausner)

5. Time-dependent Political Twitter Analysis, AP, (Hausner)

6. Annotating Numerical Relations in News Articles, AP, (Almasian)

7. Numerical Word Co-occurrence Networks (extension), BP/AP, (Almasian)

8. YouTube Video Comment Extractor and Exploration, AP, (Gertz)

9. Extraction and Management of Bundestag Documents, BP/AP, (Gertz)


BP/AP: Implement Citation Extraction in spaCy (DA)

Given:
1. Rule-based extraction algorithm by Openlegaldata.io
2. Dataset of ~1,000 manually annotated references

Tasks:
• Transfer functionality to spaCy’s rule-based entity extractor
• Publish package that makes this easily usable in spaCy

Subtasks:
• Create detailed flow-chart of existing RegEx coverage

Languages / Tools:
• Python; spaCy; RegEx
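
To give a feel for what rule-based reference extraction involves, here is a minimal sketch using only Python's `re` module. It assumes the references are German statute citations of the form "§ 823 Abs. 1 BGB". The pattern, the function name `extract_law_refs`, and the sample sentence are illustrative assumptions; this is not the Openlegaldata.io rule set and it does not use spaCy.

```python
import re

# Hypothetical, simplified pattern for citations such as "§ 823 Abs. 1 BGB".
# It is an illustration only, not the Openlegaldata.io rule set.
LAW_REF = re.compile(
    r"§\s*(?P<section>\d+[a-z]?)"            # section number, e.g. 823 or 15a
    r"(?:\s+Abs\.\s*(?P<paragraph>\d+))?"    # optional paragraph ("Absatz")
    r"\s+(?P<code>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*)"  # law abbreviation, e.g. BGB, StGB
)


def extract_law_refs(text):
    """Return (start, end, section, paragraph, code) for every match."""
    return [
        (m.start(), m.end(), m.group("section"), m.group("paragraph"), m.group("code"))
        for m in LAW_REF.finditer(text)
    ]


# Made-up sample sentence for illustration.
sample = "Der Anspruch folgt aus § 823 Abs. 1 BGB sowie § 242 StGB."
refs = extract_law_refs(sample)
```

Returning character offsets keeps the sketch close to the first task: spans of this kind can be turned into entities with spaCy's `Doc.char_span`, provided the offsets line up with token boundaries.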


AP: Outline Generation for Wikipedia Articles (DA/SA)

Given:
1. Cleaned dataset of articles from Wikipedia
2. Paper by Zhang et al. [1]

Tasks:
• Implement efficient data loader
• Try to reproduce training results from the paper
• Implement alternative scoring (RAND score, etc.)

Subtasks:
• Learn details about implementation and investigate improvements
• Investigate evaluation metrics

Languages / Tools:
• Python; PyTorch; Neural Networks (!!)
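
As an illustration of the "alternative scoring" task, the sketch below computes the plain (unadjusted) Rand index between two assignments of paragraphs to sections. How the paper by Zhang et al. represents outlines is not stated on the slide, so the label-list representation and the example labels are assumptions.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two groupings agree.

    A pair counts as an agreement if both groupings put the two items
    in the same group, or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # zero or one item: nothing to disagree on
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Made-up example: five paragraphs, gold sections vs. predicted sections.
gold = [0, 0, 1, 1, 2]
predicted = [0, 0, 0, 1, 1]
score = rand_index(gold, predicted)  # 6 of 10 pairs agree
```

The pairwise loop is quadratic in the number of paragraphs, which is fine for single articles; a contingency-table formulation would be the usual choice for larger inputs.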


BP/AP: Analysis of RNV Delays (DA/PH)

Given:
1. Start.Info API (RNV API) [1]
2. Previous outside project: RNV Monitor [2]

Tasks:
• Crawl all data (not just delays)
• Broader analysis of delays (time of day, line, etc.)
• Create time-dependent geographical heat map

Subtasks:
• Compare results to RNV Monitor dump
• Create suitable database schema

Languages / Tools:
• Python; REST API; SQL
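
One possible starting point for the database schema and the delay analysis is sketched below with Python's built-in `sqlite3` module. The table names, column names, and sample rows are made up for illustration; they are not the structure of the RNV API or of the RNV Monitor dump.

```python
import sqlite3

# Hypothetical two-table schema: stops with coordinates (for the heat map)
# and one row per observed departure.
SCHEMA = """
CREATE TABLE stop (
    stop_id TEXT PRIMARY KEY,
    name    TEXT NOT NULL,
    lat     REAL,
    lon     REAL
);
CREATE TABLE departure (
    id        INTEGER PRIMARY KEY,
    stop_id   TEXT NOT NULL REFERENCES stop(stop_id),
    line      TEXT NOT NULL,
    planned   TEXT NOT NULL,      -- 'YYYY-MM-DD HH:MM:SS'
    delay_min INTEGER NOT NULL
);
"""


def mean_delay_by_line_and_hour(conn):
    """Average delay in minutes, grouped by line and hour of day."""
    return conn.execute(
        """
        SELECT line,
               CAST(strftime('%H', planned) AS INTEGER) AS hour,
               AVG(delay_min) AS mean_delay
        FROM departure
        GROUP BY line, hour
        ORDER BY line, hour
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative rows only, not real measurements.
conn.execute("INSERT INTO stop VALUES ('S1', 'Example Stop', 49.41, 8.69)")
conn.executemany(
    "INSERT INTO departure (stop_id, line, planned, delay_min) VALUES (?, ?, ?, ?)",
    [
        ("S1", "5", "2020-05-04 08:15:00", 2),
        ("S1", "5", "2020-05-04 08:45:00", 4),
        ("S1", "5", "2020-05-04 17:05:00", 9),
        ("S1", "22", "2020-05-04 08:30:00", 0),
    ],
)
rows = mean_delay_by_line_and_hour(conn)
```

Keeping coordinates on the stop table means the same query, joined with `stop`, can feed a time-dependent heat map.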


BP/AP: Time-dependent Analysis of COVID-19 (PH)

Given:
1. Public data set for Germany [1]
2. Reference work from RKI [2]

Tasks:
• Crawl data set
• Identify locations with high increase of case numbers
• Create time-dependent geographical heat map

Subtasks:
• Create suitable database schema
• Structure in time-dependent fashion

Languages / Tools:
• Python; JavaScript (vis.js); REST API; SQL
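
A simple way to approach "locations with high increase of case numbers" is to compare the most recent week with the week before it. The sketch below assumes records of the form (location, date, new cases); this record layout, the function name `weekly_increase`, and the sample numbers are illustrative assumptions and do not reflect the format of the public data set.

```python
from datetime import date, timedelta


def weekly_increase(records, as_of, window_days=7):
    """Rank locations by change in new cases between two adjacent windows.

    records: iterable of (location, day, new_cases) with day as datetime.date.
    Returns (location, change) pairs, largest increase first.
    """
    window = timedelta(days=window_days)
    current, previous = {}, {}
    for location, day, new_cases in records:
        if as_of - window < day <= as_of:
            current[location] = current.get(location, 0) + new_cases
        elif as_of - 2 * window < day <= as_of - window:
            previous[location] = previous.get(location, 0) + new_cases
    locations = set(current) | set(previous)
    changes = [
        (loc, current.get(loc, 0) - previous.get(loc, 0)) for loc in locations
    ]
    return sorted(changes, key=lambda item: (-item[1], item[0]))


# Made-up numbers for two fictional districts.
records = [
    ("District A", date(2020, 4, 15), 30),
    ("District A", date(2020, 4, 10), 10),
    ("District A", date(2020, 4, 1), 100),  # outside both windows, ignored
    ("District B", date(2020, 4, 18), 5),
    ("District B", date(2020, 4, 8), 20),
]
ranking = weekly_increase(records, as_of=date(2020, 4, 20))
```

Absolute differences favor populous locations, so a per-capita or relative change may be the more useful variant for a heat map.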


BP/AP: Time-dependent Political Twitter Analysis (PH)

Given:
1. Twitter data

Tasks:
• Structure information around creation dates of Twitter posts
• Identify important topics for certain dates
• Take into account all terms or only hashtags

Subtasks:
• Investigate different weighting schemes

Languages / Tools:
• Python; SQL
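
One candidate weighting scheme is a TF-IDF variant in which each day is treated as a "document": hashtags that are frequent on one day but rare on other days score highest. The sketch below is restricted to hashtags and assumes tweets arrive as (ISO date, text) pairs; the input format, the function name, and the sample tweets are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#\w+")


def top_hashtags_per_day(tweets, k=3):
    """Rank hashtags per day by tf * log(number_of_days / days_with_tag)."""
    counts = defaultdict(Counter)
    for day, text in tweets:
        counts[day].update(tag.lower() for tag in HASHTAG.findall(text))
    n_days = len(counts)
    # Document frequency: on how many days does each hashtag occur?
    day_freq = Counter(tag for day_counts in counts.values() for tag in day_counts)
    result = {}
    for day, day_counts in counts.items():
        scores = {
            tag: tf * math.log(n_days / day_freq[tag])
            for tag, tf in day_counts.items()
        }
        # Highest score first; ties broken alphabetically.
        result[day] = sorted(scores, key=lambda tag: (-scores[tag], tag))[:k]
    return result


# Made-up tweets for illustration.
tweets = [
    ("2020-04-20", "#corona update #lockdown"),
    ("2020-04-20", "More on #lockdown rules #corona"),
    ("2020-04-21", "#corona numbers and #maskenpflicht"),
    ("2020-04-21", "#maskenpflicht starts soon"),
]
top = top_hashtags_per_day(tweets, k=1)
```

In this example `#corona` occurs on every day and therefore gets weight zero, which is exactly the effect that distinguishes this scheme from raw counts.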


AP: Annotating Numerical Relations in News Articles (SA)

Given:
1. Corpus of economic news articles

Tasks:
• Extract high-confidence relations that contain numerical information from news articles
• Apply Named Entity Disambiguation to the entities and numbers
• Save the annotated dataset in MongoDB

Subtasks:
• Get familiar with OpenIE for information extraction
• Use AIDA for Named Entity Disambiguation
• Detect quantities with Illinois Quantifier

Languages / Tools:
• Python; MongoDB; basic knowledge of Java is also recommended
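
The sketch below shows one way the annotated relations could be shaped before they are stored: keep only relations that carry a quantity and exceed a confidence threshold, then convert them to plain dictionaries, which is the form a document store such as MongoDB accepts. All class names, field names, the threshold, and the sample relations are hypothetical; the actual output formats of OpenIE, AIDA, and Illinois Quantifier are not modeled here.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Quantity:
    value: float
    unit: str


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str
    confidence: float
    quantity: Optional[Quantity] = None  # None if no number was detected


def to_documents(relations: List[Relation], min_confidence: float = 0.9) -> List[dict]:
    """Keep confident relations with a quantity and convert them to dicts."""
    return [
        asdict(rel)
        for rel in relations
        if rel.quantity is not None and rel.confidence >= min_confidence
    ]


# Made-up relations about a fictional company.
relations = [
    Relation("ACME Corp", "increased revenue by", "5 percent", 0.95, Quantity(5.0, "percent")),
    Relation("ACME Corp", "is based in", "Springfield", 0.97),
    Relation("ACME Corp", "cut", "200 jobs", 0.40, Quantity(200.0, "jobs")),
]
documents = to_documents(relations)
```

Only the first relation survives the filter: the second has no quantity and the third falls below the threshold.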


BP/AP: Numerical Word Co-occurrence Networks (SA)

Given:
1. English Wikipedia corpus

Tasks:
• Improve an existing pipeline that builds a word co-occurrence graph from the sentences containing numerical information
• Enhance the NER (using MetaMap from UMLS)
• Enhance the numerical extractor (using Illinois Quantifier)

Subtasks:
• Explore the distribution of the numerical values with respect to the surrounding words to extract valid ranges

Languages / Tools:
• Python; scikit-learn; basic knowledge of Java is also recommended
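
For orientation, a word co-occurrence graph over numerical sentences can be reduced to counting which words appear together in sentences that contain a number. The sketch below is a toy version under stated assumptions: tokenization is a simple regular expression, every number is collapsed into a `<NUM>` placeholder, and the sentences are made up. It is not the existing pipeline mentioned on the slide.

```python
import re
from collections import Counter
from itertools import combinations

NUMBER = re.compile(r"\d+(?:\.\d+)?")
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+")


def cooccurrence_edges(sentences):
    """Count co-occurring token pairs in sentences that contain a number.

    Every number is replaced by the placeholder "<NUM>". Each edge key is
    an alphabetically sorted pair, so (a, b) and (b, a) are the same edge.
    """
    edges = Counter()
    for sentence in sentences:
        tokens = {
            "<NUM>" if NUMBER.fullmatch(tok) else tok.lower()
            for tok in TOKEN.findall(sentence)
        }
        if "<NUM>" not in tokens:
            continue  # skip sentences without numerical information
        edges.update(combinations(sorted(tokens), 2))
    return edges


# Made-up sentences for illustration.
sentences = [
    "The tower is 300 meters tall",
    "The bridge is 2 km long",
    "The tower is old",
]
edges = cooccurrence_edges(sentences)
```

To study value distributions per surrounding word, the placeholder would be replaced by the parsed value, for example by collecting all numbers that co-occur with "meters".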


AP: YouTube Comment Extractor/Exploration (MG)

Given:
1. Existing pipeline to extract comments from YouTube
2. Comprehensive documentation of the data

Tasks:
• Implement Web-based dashboard to view comment statistics
• Provide Web-based search interface on comments

Subtasks:
• Port pipeline to Elasticsearch
• Decide which features to realize in dashboard
• Develop search methods for comments

Languages / Tools:
• Python; Elasticsearch
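
Both the search interface and the dashboard statistics can be served by a single Elasticsearch request that combines a full-text query with aggregations. The sketch below only builds the request body as a Python dictionary and does not talk to a server. The field names `text`, `published_at`, and `video_id` are assumptions about how comments might be indexed, not the layout of the existing pipeline.

```python
import json


def comment_search_body(query_text, size=10):
    """Build a search body: full-text match plus two dashboard aggregations."""
    return {
        "size": size,
        "query": {"match": {"text": query_text}},
        "aggs": {
            # Number of matching comments per calendar day.
            "comments_per_day": {
                "date_histogram": {
                    "field": "published_at",
                    "calendar_interval": "day",
                }
            },
            # Videos with the most matching comments.
            "top_videos": {"terms": {"field": "video_id", "size": 5}},
        },
    }


body = comment_search_body("great video")
body_json = json.dumps(body, indent=2)  # what would be sent to the _search endpoint
```

The `terms` aggregation assumes `video_id` is indexed as a keyword field; aggregating on an analyzed text field would require a different mapping.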


AP: Bundestag Documents (MG)

Given:
1. Printed papers (Drucksachen) and plenary minutes (Plenarprotokolle) [DIPBT]

Tasks:
• (Adaptive) crawler for printed papers
• Store the documents in Solr (structured)
• Faceted search over the documents via a web frontend

Subtasks:
• Data model for documents
• Model for faceted search

Languages / Tools:
• Python; Solr
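
Faceted search in Solr is driven by request parameters: `facet=true` switches faceting on, each `facet.field` names a field to count by, and each `fq` narrows the result set to a selected facet value. The sketch below only assembles such a request URL. The core name `bundestag`, the field names `doc_type` and `electoral_term`, and the example query are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_query_url(base_url, text, facet_fields, filters=(), rows=10):
    """Build a Solr /select URL with faceting and optional filter queries."""
    params = [("q", text), ("rows", str(rows)), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", flt) for flt in filters]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and fields; the user has clicked the "Drucksache" facet.
url = facet_query_url(
    "http://localhost:8983/solr/bundestag",
    "Klimaschutz",
    facet_fields=["doc_type", "electoral_term"],
    filters=['doc_type:"Drucksache"'],
)
params = parse_qs(urlsplit(url).query)
```

A web frontend would render the returned facet counts as clickable filters and add one `fq` per selected value on the next request.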
