ipw slides
EUROPEAN LEGISLATIVE
RESPONSES TO
INTERNATIONAL TERRORISM
A Database of Laws in German Plenary Protocols
Outline
1. Introduction
2. Xtract: software for extraction
3. Expected results
4. Discussion
1. Introduction
Linking Laws and Plenary Protocols
Extract agenda items and participants' information from plenary protocols of terms 12 – 16
Use GESTA as an index of laws
Link laws to plenary speeches and vice versa
1 introduction
We have ...
Plenary protocol PDFs from electoral terms 12 – 16
1990-12-10 – present
120,655 pages in 1,162 documents
GESTA database of laws, terms 8 – 16
: ) and ambition to deliver excellent results
We want to ...
Extract from 1990 up to the present time
For each plenary session
Session number, date, ...
For each item on the agenda
Descriptions
list of participants
printed matter references
speech texts
tables
Link the results with our database of laws
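The extraction targets listed above suggest a simple nested record per session. The following Python sketch is only an illustration of such a structure: Xtract itself is written in C#, and the class names, field names, the `link_laws` helper and the idea of linking via shared printed-matter references are all assumptions, not the project's actual design. The sample data is invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Set, Tuple


@dataclass
class AgendaItem:
    description: str
    participants: List[str] = field(default_factory=list)
    printed_matter: List[str] = field(default_factory=list)  # e.g. "16/1"
    speeches: List[str] = field(default_factory=list)


@dataclass
class PlenarySession:
    term: int
    number: int
    held_on: date
    agenda: List[AgendaItem] = field(default_factory=list)


def link_laws(
    sessions: List[PlenarySession],
    laws: Dict[str, Set[str]],
) -> Dict[str, List[Tuple[int, int, str]]]:
    """Map each law id to the agenda items that share a printed-matter reference.

    `laws` is a hypothetical index: law id -> set of printed-matter references.
    """
    links: Dict[str, List[Tuple[int, int, str]]] = {law_id: [] for law_id in laws}
    for session in sessions:
        for item in session.agenda:
            refs = set(item.printed_matter)
            for law_id, law_refs in laws.items():
                if refs & law_refs:  # at least one shared reference
                    links[law_id].append((session.term, session.number, item.description))
    return links


# Invented placeholder data, for illustration only.
sessions = [
    PlenarySession(16, 1, date(2006, 1, 1), [
        AgendaItem("Example item", ["Speaker A"], ["16/1"], ["..."]),
    ]),
]
laws = {"law-1": {"16/1"}, "law-2": {"16/2"}}
links = link_laws(sessions, laws)
```

Keeping the link as a plain dictionary makes it easy to invert, so the same data can answer both "which debates mention this law" and "which laws does this agenda item concern".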
Challenges
Older electoral terms are not digitized
Each electoral term requires different pattern matching strategies
GESTA tables generated for the project
No consistent, direct links to plenary protocols
Course of legislation lacks detail
Quality difference between older and newer terms
OCR errors
GESTA Database – no improvements possible for older terms
2. Xtract
Xtract – software for data mining
a set of modern tools to annotate plenary protocols with relevant pieces of information
preserves document layout
uses multiple strategies to mark important text blocks
location, shape and internal structure of blocks
pattern matching
Euclidean distances
statistics
comes with its own document viewer
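As an illustration of the pattern matching strategy, the Python sketch below pulls printed-matter references out of a line of protocol text. It assumes references appear in the form "Drucksache <term>/<number>"; the regular expression and the function name `find_printed_matter` are hypothetical, and real protocols would likely need a separate pattern per electoral term. Xtract itself is implemented in C#, so this is not its actual code.

```python
import re

# Assumed citation form: "Drucksache 16/1234" (or plural "Drucksachen 16/1234").
PRINTED_MATTER = re.compile(r"Drucksachen?\s+(\d{1,2})/(\d+)")


def find_printed_matter(text):
    """Return (term, number) pairs for every reference found in `text`."""
    return [(int(term), int(number)) for term, number in PRINTED_MATTER.findall(text)]


refs = find_printed_matter("Beratung des Antrags auf Drucksache 16/1234 und Drucksache 16/987")
```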
2 software
Xtract – implementation details
PDF access
pdftohtml (custom builds)
Acrobat Professional 9 Extended (older terms)
Data manipulation
C# 4.0: LINQ to XML
Visualization
C# 4.0: WPF (Windows Presentation Foundation)
Statistics
CORSIS: my personal open-source project for corpus analysis
Xtract – why XML?
Simple and highly 'liquid' file format
based on simple international standards
excellent APIs in many programming languages
converts easily into other formats
used in Microsoft Office, OpenOffice.org
Xtract – XML crash course
<event>
  <speaker id="12">
    <name>Franz Müntefering</name>
    <is>Bundesminister für Arbeit und Soziales</is>
  </speaker>
</event>
elements
attributes
hierarchical relations
elements: event, speaker, name, is
attributes: id
children: event → speaker
parents: event ← speaker
descendants: event → speaker, name, is
siblings: name ↔ is
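The terms from the crash course can be checked against the example document with a few lines of code. This is a minimal sketch using Python's standard `xml.etree.ElementTree` module; Xtract itself uses C# and LINQ to XML, so the code is only an illustration of the concepts, not of the tool.

```python
import xml.etree.ElementTree as ET

XML = """<event>
  <speaker id="12">
    <name>Franz Müntefering</name>
    <is>Bundesminister für Arbeit und Soziales</is>
  </speaker>
</event>"""

event = ET.fromstring(XML)
speaker = event.find("speaker")

# Elements: every tag in the document, in document order.
elements = [el.tag for el in event.iter()]

# Attributes: name/value pairs attached to an element.
attributes = dict(speaker.attrib)

# Children: elements directly below a given element.
children = [child.tag for child in event]

# Descendants: everything below an element, at any depth.
descendants = [el.tag for el in event.iter() if el is not event]

# Siblings: elements that share the same parent (here: speaker).
siblings = [child.tag for child in speaker]

# Parents: ElementTree stores no parent pointer, so build a lookup table.
parent_of = {child: parent for parent in event.iter() for child in parent}
```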
Xtract – how does it function?
extracts text from PDF files along with layout information
merges texts into proximity blocks
marks ambient constructs
marks agenda items
annotates blocks with sections they belong to
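The step of merging texts into proximity blocks can be sketched as a small clustering routine. The Python code below is an assumption about how such a step might work, not Xtract's actual algorithm: it treats each text fragment as a point (its top-left corner, with y growing downward) and groups fragments whose Euclidean distance is at most a threshold. `Fragment`, `merge_into_blocks` and `max_distance` are hypothetical names, and the coordinates are invented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    x: float  # left edge on the page
    y: float  # top edge on the page (grows downward)
    text: str


def merge_into_blocks(fragments, max_distance):
    """Group fragments so each one lies within max_distance of another in its block."""
    parent = list(range(len(fragments)))  # union-find forest over fragment indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Join every pair of fragments that lie close enough together.
    for i, a in enumerate(fragments):
        for j in range(i + 1, len(fragments)):
            b = fragments[j]
            if math.hypot(a.x - b.x, a.y - b.y) <= max_distance:
                parent[find(i)] = find(j)

    blocks = {}
    for i, fragment in enumerate(fragments):
        blocks.setdefault(find(i), []).append(fragment)

    # Reading order: top to bottom, then left to right, inside and across blocks.
    result = [sorted(block, key=lambda f: (f.y, f.x)) for block in blocks.values()]
    result.sort(key=lambda block: (block[0].y, block[0].x))
    return result


fragments = [
    Fragment(0, 0, "first line"),
    Fragment(0, 12, "second line"),
    Fragment(300, 0, "other column"),
]
blocks = merge_into_blocks(fragments, max_distance=15)
```

The pairwise comparison is quadratic in the number of fragments, which is acceptable for a single page; a production tool would probably also take fragment width and height into account rather than corner points alone.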
3. Expected results
DIGESTA
Based on 'GESTA Gesamtausgaben': terms 14 – 16
Always up-to-date
Detailed course of legislation information
Direct links to plenary protocols
Can be complemented with keywords from MZES
http://corsis.sf.net/ipw/digesta/
3 results
Done!!
PLEDA – Plenary Protocols Database
Based on plenary protocols
Links agenda items multidirectionally with participants
Interesting for different linguistic/political research purposes
PLEDA – Project Status
Term: 12 13 14 15 16
OCR Run X X - - -
Correction - - -
XML Conversion * * X X X
Division C./S. X X X
Block Merging * * X X X
Ambient Constructs X X X
Page Sections X X X
Interjections * * X X X
Contents * * X
Speeches * * X
Contents-speech links * * X
GLIT – German Legislative Resp ...
Laws
• .law files
• from GESTA
Protocols
• .pro files
• from BTP
GLIT
• German part of ELIT
4. Discussion
Open questions
Project hosting
Where can we host the results?
Initial GLIT interface
Web service?
Rich client-side app?
Any questions from your side?
4 discussion