

Semantic Edge Labeling over Legal Citation Graph and Case Prediction

Semantic Edge Labeling

University of Florida Projects in Data Sciences 2016 Ali Sadeghian, Ben Grider & Laksshman Sundaram

Summary of Results

Visual Graph Analysis

Entity Extraction

CONCLUSIONS

• Citations, in which one statute cites another, differ in meaning, and we aim to annotate each citation edge with a semantic label that expresses that meaning or purpose. Our work involves defining the labels, annotating citations, and automatically assigning a specific semantic label to each citation edge. We define a golden set of labels that covers the vast majority of citation types appearing in the United States Code (US Code) while remaining specific enough to group citations meaningfully. We proposed a linear-chain CRF based model to extract the features needed to label each citation. The extracted features were then mapped to a vector space using a word embedding technique, and clustering methods were used to group the citations under their corresponding labels. This paper analyzes the content and structure of the US Code, but most of the techniques used can be generalized to other legal documents. During this process we also collected a human-labeled data set of US Code citations that can be useful for future research.

The UI was designed to let a user navigate the citation graph, examining each node and its subgraphs with a breadth-first search. It also provides functions to search for a node, view the context in which the node was cited, and explore precomputed cycles.

Cycles were selected as a particularly interesting case on which to test the UI, as the number and types of cycles in the US Code had not yet been reported. 9,269 cycles were precomputed and integrated into the UI for a user to view. The example to the right shows a case that could interest a user examining the robustness of the code, where every node is connected to another node through a definition.
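The breadth-first exploration and cycle precomputation described above can be illustrated with a minimal sketch. It assumes the citation graph is a plain adjacency dictionary without duplicate edges. The function names and the `max_len` bound are hypothetical and are not taken from the project's implementation.

```python
from collections import deque


def bfs_neighborhood(graph, start, max_depth):
    """Return {node: depth} for every node within max_depth hops of start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in graph.get(node, ()):
            if nxt not in depths:
                depths[nxt] = depths[node] + 1
                queue.append(nxt)
    return depths


def simple_cycles(graph, max_len):
    """Enumerate simple directed cycles with at most max_len nodes.

    Each cycle is rooted at its smallest node, so rotations of the same
    cycle are reported only once.
    """
    cycles = []

    def extend(root, path, on_path):
        for nxt in graph.get(path[-1], ()):
            if nxt == root:
                cycles.append(list(path))  # closed a cycle back to the root
            elif nxt > root and nxt not in on_path and len(path) < max_len:
                on_path.add(nxt)
                path.append(nxt)
                extend(root, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for root in sorted(graph):
        extend(root, [root], {root})
    return cycles


# Toy graph (hypothetical): an edge a -> b means section a cites section b.
toy_graph = {"a": ["b"], "b": ["c", "a"], "c": ["a"], "d": ["a"]}
print(bfs_neighborhood(toy_graph, "d", 2))
print(simple_cycles(toy_graph, 3))
```

Bounding the cycle length keeps the enumeration tractable on a large graph. A production system would more likely rely on a dedicated algorithm such as Johnson's.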

The UI was implemented with MongoDB, ExpressJS, and CytoscapeJS, a JavaScript graph visualization library.

1. We proposed a label set large enough to cover almost all of the citations yet small enough for practical use.

2. We trained and evaluated a linear-chain CRF for predicate extraction (see Table).

3. We collected a dataset of 400 manually annotated citations.

4. We designed multiple automatic labeling schemes based on the extracted predicates:

a) K-NN: 63.2%

b) K-means: 61.6%

c) Multiclass SVM (1 vs. 1): 68.3%

d) Human: at most 71%*

* The individuals are trained and have a special background in law.
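As an illustration of the K-NN labeling scheme listed above, the following minimal sketch classifies a citation's predicate vector by majority vote among its most similar labeled neighbors. The two-dimensional vectors, the label names "definition" and "exception", and the choice of cosine similarity are all assumptions made for the example. The poster does not specify the embedding, the label names, or the distance measure.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_label(query, examples, k=3):
    """Majority label among the k examples most similar to the query.

    examples is a list of (vector, label) pairs.
    """
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(test_set, train_set, k=3):
    """Fraction of test citations whose predicted label matches the annotation."""
    hits = sum(knn_label(vec, train_set, k) == label for vec, label in test_set)
    return hits / len(test_set)


# Hypothetical embedded predicates with hand-assigned labels.
train = [
    ([1.0, 0.1], "definition"), ([0.9, 0.2], "definition"), ([0.8, 0.0], "definition"),
    ([0.1, 1.0], "exception"), ([0.2, 0.9], "exception"), ([0.0, 0.8], "exception"),
]
test = [([0.95, 0.05], "definition"), ([0.05, 0.9], "exception")]
print(accuracy(test, train, k=3))
```

Percentages such as those reported above would presumably come from applying a function like `accuracy` to held-out annotated citations. The toy data here is far too small and too clean to say anything about real performance.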

NLP techniques were applied to extract key information from each case. The entities extracted were the judges' names, law firm partners, locations, organisation names, and the orders passed by the judges.

For the citation graph, we set up the golden labels and verified their thoroughness with the help of manual annotators. We also applied several machine learning paradigms, such as SVMs and CNNs, to achieve state-of-the-art accuracy on the labelling task. For case prediction, we extracted many features that are hard to extract from the documents. We successfully designed a model that predicts the length of a case.

• Individual legal rules seldom exist in isolation, but instead typically occur as components of broader statutory, regulatory, and common-law frameworks consisting of numerous interconnected rules, regulations, and rulings.

• The complexity of these frameworks impedes comprehension and compliance by government agencies, businesses, and citizens and makes amending legislation laborious and error-prone for regulatory agencies and legislative drafters.

• Systems are needed to ease the process of understanding these regulations.

Motivation

Background

Various studies address the extraction and resolution of the citation text itself, e.g.:

• M. Adedjouma et al. investigate the natural language patterns used in cross-reference expressions to automatically detect a citation and link it to its target.

• Oanh Thi Tran et al. create a system that automatically detects references and then extracts their referents. (Previous work limited itself to detecting and resolving references to document-level targets.)

This signals the difficulty of the task from its very first stage.

• In a two-paper series, Mohammad Hamdaqa et al. lay the groundwork for an automated system that can semantically label citations:

1. Studying the challenges and complexities, such as nonstandard formats, the variety of citation types, and NLP obstacles.

2. Extracting the citations using regular expressions (limited to those that follow the standard citation schemes described in the Bluebook and ALWD).

3. Proposing a label set and identifying the labels.

4. Proposing an automated method of labeling, without evaluation.
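To make the regular-expression step concrete, here is a minimal sketch that extracts Bluebook-style United States Code citations such as "42 U.S.C. § 1983". The pattern is a simplified assumption written for this example, not the expression used by Hamdaqa et al. It covers only the "title U.S.C. § section(subsection)" form and ignores ranges, "et seq.", and spelled-out variants.

```python
import re

# Simplified, hypothetical pattern: <title> U.S.C. § <section>(<subsection>)...
USC_CITATION = re.compile(
    r"(?P<title>\d+)\s+U\.S\.C\.\s+§+\s*"
    r"(?P<section>\d+[a-z]?(?:-\d+)?)"
    r"(?P<subsections>(?:\([0-9A-Za-z]+\))*)"
)


def extract_citations(text):
    """Return (title, section, [subsections]) for each citation found in text."""
    found = []
    for match in USC_CITATION.finditer(text):
        subsections = re.findall(r"\(([0-9A-Za-z]+)\)", match.group("subsections"))
        found.append((int(match.group("title")), match.group("section"), subsections))
    return found


sample = "As defined in 42 U.S.C. § 1983 and 26 U.S.C. § 501(c)(3), the term applies."
print(extract_citations(sample))
```

A pattern like this misses any citation that departs from the standard scheme, which is the limitation noted in step 2.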

Case Prediction

USITC - “involve allegations of infringement of patents or other intellectual property rights”

EDIS web service provides data access

E-Discovery will enable us to extract features of each case

Litigation - E-Discovery will help predict the length of a case and its monetary value

Helps make informed decisions on which cases to pursue

Case Prediction

Case length was predicted with a KNN approach that considers the first 25 weeks of a case to estimate when the case will be completed. The other model we employed is a KNN approach that uses the file patterns and the number of judges involved in the case.

• k = 6, Error: 46.4096, MAPE: 54%

• k = 6, Error: 52.6140, MAPE: 48%
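For reference, the mean absolute percentage error (MAPE) over n cases is 100/n times the sum of |actual − predicted| / |actual|. The sketch below pairs it with a plain KNN regressor that averages the lengths of the k most similar past cases. The feature vectors, the Euclidean distance, and the unweighted average are assumptions made for the example. The poster does not describe how the first 25 weeks of a case were encoded or how neighbors were combined.

```python
import math


def knn_predict(query, history, k=6):
    """Predict case length as the mean length of the k nearest past cases.

    history is a list of (feature_vector, case_length) pairs.
    """
    nearest = sorted(history, key=lambda case: math.dist(query, case[0]))[:k]
    return sum(length for _, length in nearest) / len(nearest)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical past cases: (features from early activity, length in weeks).
history = [([1, 0], 10), ([2, 0], 20), ([10, 0], 100)]
print(knn_predict([1.5, 0], history, k=2))
print(mape([100, 200], [90, 220]))
```

Because MAPE divides by the true length, short cases weigh heavily on it. That may explain how a model can have a higher absolute error yet a lower MAPE, as in the two results above.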