
Bringing Order to Legal Documents: An Issue-based Recommendation System via Cluster Association

Qiang Lu¹ and Jack G. Conrad²

¹ Kore Federal, 7600A Leesburg Pike, Falls Church, VA 22043, USA
² Thomson Reuters Global Resources, Catalyst Lab, Neuhofstrasse 1, Baar, ZG 6340, Switzerland

[Both authors worked for Thomson Reuters Corporate Research & Development when this work was conducted.]
[email protected], [email protected]

Keywords: unsupervised learning, recommendation, clustering, labeling, document-cluster associations

Abstract: The task of recommending content to professionals (such as attorneys or brokers) differs greatly from the task of recommending news to casual readers. A casual reader may be satisfied with a couple of good recommendations, whereas an attorney will demand precise and comprehensive recommendations from various content sources when conducting legal research. Legal documents are intrinsically complex and multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of issues. Consequently, a high quality content recommendation system for legal documents requires the ability to detect significant topics from a document and recommend high quality content accordingly. Moreover, a litigation attorney preparing for a case needs to be thoroughly familiar not only with the principal arguments associated with various supporting opinions, but also with the secondary and tertiary arguments. This paper introduces an issue-based content recommendation system with a built-in topic detection/segmentation algorithm for the legal domain. The system leverages existing legal document metadata, such as topical classifications, document citations, and click stream data from user behavior databases, to produce an accurate topic detection algorithm. It then links each individual topic to a comprehensive pre-defined topic (cluster) repository via an association process. A cluster labeling algorithm is designed and applied to provide a precise, meaningful label for each of the clusters in the repository, where each cluster is also populated with member documents from across different content types. This system has been applied successfully to very large collections of legal documents, O(100M), which include judicial opinions, statutes, regulations, court briefs, and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the algorithms in topic detection, cluster association, and cluster labeling. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the resulting recommendations across different content types is close to that of recommendations created by human experts.

1 INTRODUCTION

The goal of a recommendation system is to suggest items of interest to a user based on the user's own historical behavior or the behaviors of a community of other users. Practical applications have been widely adopted in different areas, such as recommending books, music, and other products from online shopping sites (Linden et al., 2003), movies at Netflix (Bennett and Lanning, 2007), and news from Google and Yahoo! (Liu et al., 2010; Li et al., 2010).

Depending on the targeted user, however, the recommendation task may differ greatly. A casual news reader or an online shopper may be satisfied with a few good recommendations, regardless of other poor suggestions. Professionals such as attorneys or brokers, by contrast, may demand more precise and comprehensive recommendations. For example, when an attorney is performing legal research, she has to comb through a huge amount of material to identify the most authoritative documents across different sources of the law, such as judicial opinions, statutes, and court briefs, to name just a few. In the field of law, it is well known how important high recall is for attorneys who may be preparing for a trial or related litigation proceeding. They cannot afford to miss a key case that may have significant bearing on their current legal and logical strategy. Furthermore, being presented relevant documents may not be enough; that is, they may not be sufficiently granular, since it is the specific sub-document-level topics or arguments that are critical. Legal practitioners must familiarize themselves not only with the primary arguments that may be marshaled against them; they must also anticipate secondary or alternative arguments associated with the legal issue. While ineffective legal research wastes time and money, inaccurate and incomplete research can lead to claims of malpractice (Cohen and Olsen, 2007).

In this paper, we present a robust and comprehensive legal recommendation system; it delivers to its users the top-level legal issues underlying a case, along with secondary and supplemental issues. So for a given document, e.g., in a search result, a user is presented with additional documents that are closely related, and these recommended documents are grouped together based on the issues that are discussed in the original document. This issue-based recommendation approach, relative to the arguments in the original document, is essential to the effectiveness of a legal research system due to the complete coverage provided. Legal documents are complex and multi-topical in nature. For example, a judicial opinion may deal with a customer suing a hotel for negligence after an injury-causing slip and fall in the shower, with the customer later awarded compensation by summary judgment. At least three legal issues are present in this case, namely negligence, summary judgment procedures, and the appropriateness of compensation, and each could lead to a collection of other important documents specifically addressing that topic. By providing such an issue-oriented recommendation tool, users are able to explore legal topics within any document at much greater depth rather than simply surveying them.¹ Furthermore, this examination is enabled without explicitly summarizing the issues and constructing different search queries.

One additional means of ensuring an understanding of the scope of a particular issue in a cluster of documents is to provide a meaningful label. A simple phrase may be sufficient for many traditional documents, such as those in a news collection. By contrast, a hierarchically structured labeling approach may be more suitable for complex legal documents. Such an approach can provide a more precise and detailed description of many complicated legal issues at a fine-grained level yet still maintain a coherent and broad topical overview at the root level. Legal information providers often deploy editorial resources to organize and index content to support specific information needs, and topical taxonomies and encyclopedias are two examples of such editorial tools for content navigation purposes. Nodes in taxonomies can offer well-crafted descriptions to form a solid foundation for a labeling algorithm. It is up to the research scientist to devise the algorithmic means of generating such informative, multi-tiered labeling captions. Although the labeling component of the system represents a post-cluster-generation process involving the characterization rather than the generation of the clusters, it is arguably as essential as any of the other components of the recommendation system. Stated simply, if users cannot effectively understand and interpret the meaning of the labels, the quality of the system's clusters and document-cluster associations may not matter; the system may still fail to meet its intended performance objectives.

¹ In the interest of simplification, the expressions legal issues and legal topics will generally be used interchangeably.

In the interest of clarity, it is also instructive to emphasize the fundamental underlying use case that motivates this work. It is to provide a robust and effective complement to search. Along with search, navigation, and personalization, the recommendation approach we implement can greatly improve the efficiency and effectiveness of an otherwise comprehensive information retrieval system for legal research. Given a non-trivial legal domain-specific query, when a user performs a professional search, in addition to receiving ranked lists of search results delivered across multiple content types, the recommendation component of the system presents a supplemental list of ranked clusters, one for each of the key issues present in the selected document from the search results. In short, this cluster recommendation system is enabled by the document-to-cluster association process described in Section 4. The process by which the clusters have been topically defined and populated is described in an earlier work (Lu et al., 2011).

The recommendation system presented in the paper can thus be described as follows:

1. It identifies major issues discussed in individual documents from different content types;
2. It links each issue to the back-end repository of important legal issues via an association process;²
3. For every legal issue, it identifies, populates, and updates a set of the most important documents for that issue;
4. It provides a meaningful and sufficiently encompassing label for every issue in the universe.

By defining such a collective framework, we provide a powerful, coherent, and comprehensive approach to supplementing legal research results with additional relevant yet important documents in our issue-based recommendation system. Just as importantly, the system will produce a more comprehensive coverage of the essential arguments associated with these legal issues.

² This back-end repository is sometimes referred to as the "universe" of legal issues.

[Figure 1: Workflow of the document recommendation system. The two principal components are Topic Universe Generation (document topic segmentation, topic clustering, topic/cluster population) and Document Recommendation (document topic segmentation, document-topic association), both drawing on a metadata repository (headnotes, Key Numbers, noun phrases, citations) and a topic label repository fed by a label generation step.]

Figure 1 illustrates the overall workflow of this recommendation system. At the highest level, the primary activities are represented by the two key components of the system: (1) define and generate the legal topic universe, and (2) perform document recommendations via associations (shown in the two light blue boxes delineated by dotted lines). As the sub-components indicate, both the generation of the universe and the association processes rely on the document topic segmentation results that precede them.³

Other significant components of the workflow include: (a) populating the clusters with the most important documents, depending on the topics they contain, and (b) labeling the clusters, which is an important piece of any outward rendering of the clusters. Each of these components will be discussed in the remainder of the paper in order to clarify their roles.

The paper is organized as follows. Section 2 briefly surveys related work. Section 3 gives a short description of the metadata available across the different content types in the legal domain. Section 4 presents the overall issue-based recommendation system, which includes a document topic detection algorithm and a document association algorithm. The labeling algorithm is covered in Section 5, followed by a description of the evaluation results and system performance in Section 6. Finally, Section 7 concludes the work, while Section 8 discusses future research directions.

³ The two "Document Topic Segmentation" boxes in Figure 1 are functionally the same.

2 RELATED WORK

Topic detection in documents can be applied at two different levels: at the individual document level, and at the document collection level. At the individual level, topic detection is often referred to as topic segmentation, the task of identifying and partitioning a document into topically coherent segments. Algorithms in this category often exploit lexical cohesion information, based on the observation that related or similar words and phrases tend to be repeated in coherent segments and that segment boundaries often correspond to a change in vocabulary (Choi, 2000; Hearst, 1997; Utiyama and Isahara, 2001). Complementary semantic knowledge extracted from dictionaries and thesauruses, or additional domain knowledge such as the use of hyponyms or synonyms, can be incorporated to improve the segmentation (Choi et al., 2001; Beeferman et al., 1997). Topic detection at the document collection level, on the other hand, aims to identify underlying common topics among different documents. Many studies derive topics by document clustering, either using entire documents (Aggarwal and Yu, 2006) or sentences (Bun and Ishizuka, 2002; Chen et al., 2007). Topic modeling approaches, such as Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999), Latent Dirichlet Allocation (LDA) (Blei et al., 2003), and Non-negative Matrix Factorization (NMF) (Lee and Seung, 1999), take a more intuitive notion of topics in a collection by characterizing a topic as a cluster of words rather than documents. For example, Prasad et al. (Prasad et al., 2011) applied a dictionary-learning technique, NMF, to decompose a document collection matrix represented in the vector space model into two non-negative matrices, one of which was a term-topic matrix with each entry being the "confidence" of a term in a particular topic. Due to the non-negative nature of the topical matrix, an original document can be reconstructed approximately as a combination of topics using only the allowed additive operation.

Labeling topics is another significant yet often somewhat neglected issue in topic detection research. Topics have traditionally been interpreted via top-ranked terms based on either marginal probability in LDA or some statistical measure such as tf.idf or tf.pdf (Bun and Ishizuka, 2002). A good overview of cluster labeling topics can be found in Jain et al. (1999). Historically, less work has focused on cluster labeling than on cluster generation itself. Popescul and Ungar obtained impressive results by combining χ² and the collection frequency of a term (Popescul and Ungar, 2000). Glover et al. labeled clusters of Web pages by relying on the information gain associated with those pages (Glover et al., 2002a). Whereas Stein and zu Eissen use an ontology-based approach (Stein and zu Eissen, 2004), the more challenging problem of labeling nodes in a hierarchy (and the related general vs. specific problem) is addressed by Glover et al. (2002b) and Treeratpituk and Callan (2006). More recently, Fukumoto and Suzuki performed cluster labeling by relying on concepts in a machine-readable dictionary (Fukumoto and Suzuki, 2011), with positive results. In another distinct recent work, Malik et al. focused on finding patterns (i.e., labels) and clusters simultaneously, as an alternative to explicitly identifying labels for existing clusters (Malik et al., 2010).

A comprehensive review of document recommendation systems, as part of broad-based recommender systems, can be found in (Adomavicius and Tuzhilin, 2005). Such systems are usually classified into three categories, based on how recommendations are made: content-based recommendations, collaborative filtering recommendations, and hybrid methods. Content-based recommendations suggest documents similar to those a user has preferred in the past; this type of approach has its roots in information retrieval and information filtering research. Often, both users and documents are represented by sets of features, and documents that best match the user profile and the profiles of previously viewed documents get recommended. A news recommendation system recently described in (Li et al., 2010) is an example of such a system; in addition, the authors introduced an exploitation factor into the algorithm to bring new content to users' attention. The basic concept behind collaborative filtering is to utilize the preferences and evaluations of a peer group to predict the interests of other users. It has been successfully applied to suggest products on shopping sites, such as the one used by Amazon (Linden et al., 2003), but it is rarely used independently in document recommendation systems. Instead, collaborative filtering approaches are often combined with content-based approaches into hybrid systems. Google News (Liu et al., 2010) adopted this hybrid approach, with heavy emphasis on click-log analysis over its rich history of usage data, to recommend personalized news to different users. Document recommendation technology has also been successfully adapted to legal documents in recent years. Al-Kofahi et al. (2007) described a legal document recommendation system blending retrieval and categorization technologies. They designed a two-step process. In the first step, a ranked list of suggestions was generated based on content-based similarity using the CaRE indexing system (Al-Kofahi et al., 2001), a meta-classifier consisting of Vector Space, Bayesian, and KNN modules. In the second step, recommendations were re-ranked based on user behavior and document usage data. The recommendation system described in this paper differs from theirs in several respects: (1) it recommends documents based on the important topics discussed in the original document and groups them as such, (2) it links the topics in the document to topics in a universe instead of to each individual document, and (3) each topic in the document is presented with a precise and hierarchically structured label.

3 LEGAL DOMAIN CHARACTERISTICS

Documents in the legal domain possess some noteworthy characteristics. These characteristics include being intrinsically multi-topical, relying on well-crafted, domain-specific language, and possessing a broad and unevenly distributed coverage of legal issues.

3.1 Data and Metadata Resources

Legal documents in the U.S. tend to be complex in nature because they are the product of a highly analytical and often adversarial process that involves determining relevant law, interpreting such law, and applying it to the dispute to which it pertains. Legal publishers not only collect and publish the judicial opinions from the courts, but also summarize and classify them into topical taxonomies such as the Key Number System (described in 3.1.3). We mention some features of Thomson Reuters' products below to illustrate a point about the kinds of resources legal publishers harness in order to offer researchers multiple entry points and search indexes into the content. Other legal publishers have their own analogous means of accessing their content.

3.1.1 Judicial Opinions Making up the Case Law Corpus

A judicial opinion (or a case law document) contains a court's analysis of the issues relevant to a legal dispute, citations to relevant law and historical cases to support such analysis, and the court's decision. Thomson Reuters' Westlaw System adds several annotations to these documents to summarize the points of law within and make them more accurate using a consistent language for the purposes of legal research. These include a synopsis of the case, a series of summaries of the points of law addressed in the case (3.1.2), classification of these points to a legal taxonomy (3.1.3), and a historical analysis of the case to determine whether its holdings and mandate remain intact or whether they have been overruled in part or in whole (3.1.4).

3.1.2 Case Law Annotated Points of Law

Thomson Reuters' Westlaw System creates "headnotes" for case law documents, which are short summaries of the points of law made in the cases. A typical case law document produces approximately 7 headnotes, but cases with over one hundred headnotes are not rare.

3.1.3 Headnote Classification, Key Number System

Headnotes are further classified to a legal taxonomy known as the West Key Number System,⁴ a hierarchical classification of the headnotes across more than 100,000 distinct legal categories. Each category is given a unique alpha-numeric code, known as a Key Number, as its identifier, along with a descriptive name, together with a hierarchically structured descriptor known as a catchline. An example of a headnote and its key numbers (including catchlines) is shown in Figure 2.

3.1.4 Citation System

Legal documents contain rich citation information, just as documents from other domains do, such as scientific publications and patents. These are the legal domain's equivalent to URLs. A case law document tends to cite previous related cases to argue for or against its legal claims; therefore, it is not unusual for landmark cases decided by the U.S. Supreme Court to have hundreds of thousands of cites to them.

⁴ http://store.westlaw.com/westlaw/advantage/keynumbers/

3.1.5 Statutes' Notes of Decisions

Another primary law source is statutes, which are laws enacted by a state legislature or Congress. For some statutes, at both the federal and state levels, Notes of Decisions (NODs) are created. NODs are editorially chosen case law documents that construe or apply the statute. In particular, NODs are tied to specific headnotes in the applying cases. An example of NODs is shown in Figure 2, as the blue hyperlink at the end of the headnote text (i.e., 8 U.S.C.A....).

4 RECOMMENDATION VIA CLUSTER ASSOCIATION

Given the complexity and multi-topical nature of so many legal documents, recommendation algorithms that treat documents as single atomic units tend to produce results that are topically unbalanced, in the sense that recommended documents with different topical focuses are blended together in a single list, and it is left up to users to decipher which section of the original document is the target of the recommendation. To tackle this deficiency, the recommendation approach we have developed has a built-in topic detection algorithm that explicitly segments each document into coherent legal topics. Recommendations are thus tailored toward each individual topic. The algorithm consists of two steps: (1) segment documents into topics, and (2) associate topics with clusters of documents containing the most relevant like topics. The document segmentation algorithm is designed to accommodate various content types, and the association step is based on the similarity between topics and clusters.

As alluded to in the aforementioned second step, prior to the association process, a universe of topic clusters (i.e., topics) with comprehensive topical coverage of the legal domain needs to be defined, such that topics detected in each individual document can be linked to this universe. Without this universal coverage, one could not envision issue-based recommendations, similar to "more like this" for the Open Web, where legal researchers could discover topics related to a document they are examining or probe deeper into a topic of interest. It is also critical that most if not all legal documents (regardless of their type) be linked to these clusters. The clusters, described in prior work (Lu et al., 2011), are meant to contain the most important case law documents on a legal topic. Yet they are also populated with other types of legal documents, such as statutes, regulations, and court briefs, to ensure comprehensive coverage. In short, the utility of the clusters as a means to organize legal content around issues or topics is as much a function of the quality of the clusters themselves as it is a function of coverage. As such, the cluster universe generation process is not part of the recommendation algorithm, but we mention it here because it is the foundation upon which our recommendation algorithm is built.

Figure 2: An example of a headnote with its assigned Key Number.

4.1 Topic Segmentation

The topic segmentation algorithm leverages available metadata from different document types. We found this approach to provide better-quality results than traditional topic segmentation algorithms that rely upon lexical cohesion and utilize document text alone.

The segmentation algorithm differs depending on the availability of metadata in different document types. For the purposes of illustrating variations of the segmentation algorithm, we briefly discuss case law documents (possessing headnotes), statutes, and court briefs below.

4.1.1 Case law topic segmentation

As stated before, headnotes are short summaries of points of law in case law documents. Collectively, they provide near-complete coverage of the main legal issues within a case. By grouping headnotes within a case law document based on their "similarities," it is possible to identify the main legal topics within a document. We use the vector-space model to represent headnotes in a case. A headnote is depicted in terms of four types of features: text, key numbers, KeyCite citations, and noun phrases. The similarity between a pair of headnotes, sim(h_i, h_j), is defined as the weighted sum of the similarities between the corresponding component vectors. The weights are determined using heuristics.
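In symbols (the component superscript notation and the weight symbols w_c are introduced here only for clarity; they are not taken verbatim from the original), the pairwise headnote similarity is the weighted sum of per-component similarities:

\mathrm{sim}(h_i, h_j) \;=\; \sum_{c \,\in\, \{\text{text},\ \text{key numbers},\ \text{citations},\ \text{noun phrases}\}} w_c \cdot \mathrm{sim}_c\big(h_i^{(c)}, h_j^{(c)}\big)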

The similarity functions for the component vectors are defined using cosine similarity or one of its variations, and an agglomerative clustering algorithm groups similar headnotes to generate the topics for a case law document. The algorithm merges headnotes together while maximizing the following objective,

F = \max \frac{\tau}{\varepsilon}   (4.1)

where

\tau = \max \sum_{r=1}^{k} \sum_{h_i \in T_r} \mathrm{sim}(h_i, \bar{T}_r)   (4.2)

and

\varepsilon = \min \sum_{r=1}^{k} n_r \, \mathrm{sim}(\bar{T}_r, \bar{T})   (4.3)

\bar{T}_r = \frac{1}{n_r} \sum_{h_i \in T_r} h_i   (4.4)

\bar{T} = \frac{1}{k} \sum_{T_r \in T} \bar{T}_r   (4.5)

where τ is the intra-cluster similarity, ε is the inter-cluster similarity, k denotes the total number of topics in a document, T denotes the set of topics for a document, T_r denotes an individual topic, and n_r is the number of headnotes in topic T_r. Also, \bar{T}_r and \bar{T} represent the center of a single topic and of all topics, respectively.

Notice that the algorithm does not require the number of topics as an input parameter; rather, it depends on an intra-topic similarity threshold to control the granularity of the topics in a document. The threshold is determined empirically by analyzing the histogram of intra-cluster similarities. We use a set of documents with known topic segmentations to guide our threshold selection process.
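To make the grouping step concrete, the following is a minimal sketch of a threshold-driven, centroid-based agglomerative grouping of headnotes. It is not the production implementation: the dictionary representation of headnotes, the component weights, and the greedy centroid-merge criterion (used here in place of the exact τ/ε objective of Eqs. (4.1)-(4.5)) are illustrative assumptions.

```python
import math
from itertools import combinations

# Hypothetical sparse representation: each headnote is a dict mapping a feature
# component to a {term: weight} vector.
COMPONENTS = ("text", "key_numbers", "citations", "noun_phrases")
# Illustrative weights; the paper only states that weights are set heuristically.
WEIGHTS = {"text": 0.4, "key_numbers": 0.3, "citations": 0.2, "noun_phrases": 0.1}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def headnote_sim(h1, h2):
    """sim(h_i, h_j): weighted sum of per-component cosine similarities."""
    return sum(WEIGHTS[c] * cosine(h1[c], h2[c]) for c in COMPONENTS)

def centroid(group):
    """Component-wise average of the headnote vectors in a group (a topic center)."""
    center = {c: {} for c in COMPONENTS}
    for h in group:
        for c in COMPONENTS:
            for t, w in h[c].items():
                center[c][t] = center[c].get(t, 0.0) + w / len(group)
    return center

def segment_topics(headnotes, threshold=0.35):
    """Greedy agglomeration: repeatedly merge the two most similar groups
    until no pair of group centroids exceeds the intra-topic threshold."""
    groups = [[h] for h in headnotes]
    while len(groups) > 1:
        best_sim, best_pair = -1.0, None
        for (i, gi), (j, gj) in combinations(list(enumerate(groups)), 2):
            s = headnote_sim(centroid(gi), centroid(gj))
            if s > best_sim:
                best_sim, best_pair = s, (i, j)
        if best_sim < threshold:   # granularity is controlled by the threshold,
            break                  # not by a preset number of topics
        i, j = best_pair
        groups[i] = groups[i] + groups[j]
        del groups[j]              # j > i, so index i stays valid
    return groups                  # each surviving group of headnotes = one topic
```

The example threshold of 0.35 is a placeholder; as described above, the actual threshold is chosen empirically from the histogram of intra-cluster similarities.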

4.1.2 Statutes topic segmentation

Unlike case law, federal or state statutes do not have headnotes; however, some of them contain NODs, as stated in 3.1.5, and these NODs are tied to specific headnotes that construe or apply the statute. By grouping these headnotes in the same manner as stated in 4.1.1, legal topics inside the statutes can be identified.


4.1.3 Court briefs topic segmentation

A majority of court briefs do not contain headnotes; however, they have rich citation links, both in-links (other documents citing the brief) and out-links (the brief citing other documents, mostly case law). Utilizing the headnotes available from these directly cited documents and grouping them in the same manner as stated in 4.1.1 can help identify the main legal topics within a brief.

One distinction between court briefs and case law documents, however, makes further topic segmentation refinement necessary for briefs. Case law opinions in general contain the complete legal facts regarding a lawsuit, while court briefs have a much narrower focus; most often they deal only with certain aspects of the complete set of facts. To help identify these specific aspects, a generic document summarization algorithm (Schilder and Kondadadi, 2008) is first applied to the brief text to extract the most important sentences. These sentences are then sent to a key number classification system built with CaRE (Al-Kofahi et al., 2001) to retrieve the major key numbers for the brief. By analyzing the commonalities between the topics implied by the retrieved key numbers and the aforementioned grouped topics (which have associated key numbers as well) from the cited case law documents, irrelevant legal topics can be filtered out in a straightforward manner, and the remaining ones form the legal topics for the brief.
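The refinement step for briefs can be summarized with the sketch below. Here `summarize` and `classify_key_numbers` are hypothetical stand-ins for the generic summarizer and the CaRE-based Key Number classifier, and the simple set-overlap test is a simplification of the commonality analysis described above.

```python
def refine_brief_topics(brief_text, cited_case_topics, summarize, classify_key_numbers,
                        top_sentences=10):
    """Illustrative refinement step for court briefs (hypothetical helper
    signatures): keep only the topics built from cited-case headnotes whose
    Key Numbers overlap those predicted for the brief's own summary."""
    # 1. Extract the most important sentences from the brief.
    summary_sentences = summarize(brief_text, n_sentences=top_sentences)

    # 2. Retrieve the major Key Numbers predicted for those sentences.
    brief_key_numbers = set(classify_key_numbers(summary_sentences))

    # 3. Filter out topics whose Key Numbers never intersect the brief's
    #    predicted Key Numbers; the survivors become the brief's legal topics.
    return [topic for topic in cited_case_topics
            if brief_key_numbers & set(topic["key_numbers"])]
```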

4.2 Recommendation via topic association

The above topic segmentation algorithm produces legal topics for each document regardless of its content type, and each topic serves as an anchor point (i.e., a source topic) for the recommendation algorithm to retrieve the most similar clusters in the universe. As stated earlier, each cluster has been pre-populated with the most important documents from different content types, namely case law, briefs, statutes, regulations, administrative decisions, jury verdicts, trial court orders, expert witness reports, pleadings, motions & memoranda, and analytical documents. By assembling and organizing these documents for each individual segmented topic, issue-based recommendations can then be generated for that document.

We treat the retrieval of the most similar cluster for an anchor topic as a ranking problem: candidates are first generated and sorted, and the top-ranked cluster is then picked at the completion of the process.

To generate candidate clusters, a document classification engine, CaRE, is used, although other indexing engines (e.g., Lucene) could be used. For each cluster, three classifiers are trained, one per feature type: headnote text, key numbers, and citation patterns.

Given an anchor topic, we use CaRE to retrieve a pool of candidate clusters. We then use a ranker SVM to determine which cluster in the candidate pool is most similar to the anchor topic. To do this, we represent each anchor-candidate pair in terms of a feature vector. The features include CaRE scores, and variations of the noun phrase and key number features described in 4.1.1. These similarity features are then used to train a ranker SVM to rank clusters in the candidate pool, and the top-ranked cluster is selected as the recommendation for the anchor topic in the document. The same process is performed for each segmented topic in the document. A series of evaluations of this process is presented in Section 6.
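A simplified sketch of this association step follows. The field names, the Jaccard similarities, and the linear scoring function (standing in for the trained ranker SVM) are illustrative assumptions; only the overall shape reflects the description above: per-pair features built from CaRE scores plus noun-phrase and Key Number similarities, with the top-scoring cluster returned.

```python
import numpy as np

def jaccard(a, b):
    """Set-overlap similarity used here for noun-phrase and Key Number features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_features(anchor, cluster):
    """Hypothetical feature vector for one anchor-topic / candidate-cluster pair."""
    return np.array([
        cluster["care_text_score"],        # headnote-text classifier score
        cluster["care_key_number_score"],  # Key Number classifier score
        cluster["care_citation_score"],    # citation-pattern classifier score
        jaccard(anchor["noun_phrases"], cluster["noun_phrases"]),
        jaccard(anchor["key_numbers"], cluster["key_numbers"]),
    ])

def recommend_cluster(anchor, candidate_clusters, ranker_weights):
    """Score every candidate with a linear ranking function (a stand-in for the
    trained ranker SVM) and return the top-ranked cluster."""
    scored = [(float(np.dot(ranker_weights, candidate_features(anchor, c))), idx)
              for idx, c in enumerate(candidate_clusters)]
    best_score, best_idx = max(scored)
    return candidate_clusters[best_idx]
```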

5 CLUSTER LABELING

Considerable research and development effort was dedicated to the labeling algorithm for these clusters. The breadth and depth of this research – generating distinct labels for approximately 360K clusters – is beyond the scope of this paper and merits its own research report. Nonetheless, some of the key features explored for the labels in question are briefly discussed here. We harness a portion of a heading from an existing taxonomy (the Key Number System; see Section 3), express a bias towards shorter over longer labels, and strive to avoid any redundancy. These last two features were the direct result of early evaluation rounds which employed attorney domain experts and the feedback we received from the business unit supporting this research.

Our current labels generally consist of three segments. These segments range from general headings to more specific ones. The most effective number of segments was determined empirically. The current algorithm evolved from our experiments and represents a hybrid of two earlier versions. The first stage consists of identifying the first two, more general, segments of the label. As stated, it leverages an existing taxonomy of case law topical classifications (multi-level classifications coming from the Key Number System). The second stage relies on a ranking process that orders noun phrases identified in the headnotes of the cases in the given cluster. The resulting algorithm uses both information from a seed case for a cluster and the case law documents with which the cluster itself has been populated. The basic idea is to pick the most representative noun phrases from the headnotes in the top N cases in the given cluster whose key number(s) are the same as those of the clustered headnotes in the seed case. These are appended to the first two segments from the first stage (the ones consisting of the topic titles from the most popular key numbers in the top N cases). Earlier experiments revealed that noun phrases from headnotes classified to specific key numbers were more accurate than those from more generally assembled headnotes. Furthermore, the optimum number of top key numbers to use was determined empirically (i.e., more than 5).

The process can be summarized as follows:

• [Segments 1 and 2] Select the top two headings from the most popular key numbers;

• Get the headnote text from the top N cases with the same key number(s) as in the clustered headnotes in the seed case;

• Add back the clustered headnotes from the seed case;

• Use a noun phrase chunker on these text strings to identify optimal noun phrases (NPs); then rank the extracted NPs based on several simple features, such as the length of the NP (how many words), the combined within-headnote term frequency (TF) for each of the words in the NP, the TF of the NP itself, the cross-collection document frequency (DF) of the NP, its inverse (IDF), etc. (a sketch of this ranking step appears after this list);

• [Segment 3] Pick the top NP after ranking, and append it to the top two headings from the key numbers above.
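The sketch below illustrates the NP ranking step from the list above. The candidate NPs are assumed to come from an external noun phrase chunker, and the scoring formula and its weights are placeholders; the deployed system's exact weighting is not specified here.

```python
import math
from collections import Counter

def rank_noun_phrases(candidate_nps, headnote_texts, collection_doc_freq, n_docs):
    """Illustrative ranking of candidate noun phrases for label segment 3,
    combining a few of the simple features listed above (NP length, word TF
    within the headnotes, TF of the NP itself, collection DF/IDF)."""
    text = " ".join(headnote_texts).lower()
    word_tf = Counter(text.split())

    def score(np_phrase):
        words = np_phrase.lower().split()
        tf_words = sum(word_tf[w] for w in words)           # combined word-level TF
        tf_phrase = text.count(np_phrase.lower())           # TF of the NP itself
        df = collection_doc_freq.get(np_phrase.lower(), 1)  # cross-collection DF
        idf = math.log(max(n_docs / df, 1.0))               # its inverse (IDF)
        length_bias = 1.0 / len(words)                      # favor shorter NPs
        return (0.4 * tf_phrase + 0.3 * tf_words / len(words)) * idf * length_bias

    return sorted(candidate_nps, key=score, reverse=True)
```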

Based on an iterative manual assessment and error analysis process, we identified a set of potential enhancements that were worth investigating. These include:

1. suppressing redundancies across label segments to foster clarity and the absence of duplication within a label;

2. enforcing normalization across all term and document statistics used (tf, idf, tf.idf, ...) to ensure consistency across the universe of clusters;

3. promoting label segments that contain more frequently occurring n-grams (frequently occurring substantive noun phrases);

4. creating a bias towards shorter vs. substantially longer labels to encourage clearer, more readable labels;

5. adding a label duplication detection step, and a priority-based process by which duplicate labels are methodically avoided.

The baselines used in our experiments include: (1) the titles from the most popular key numbers in the top N cases (termed the 'original baseline'), and (2) the most popular (i.e., most frequent) NPs in the top N cases. See Figure 3 for an example of a set of four cluster labels for a legal case on "adoption" where religion plays a pivotal role. Also see Table 3 for a comparative set of these results. Evaluations performed on the labeling process, which examine the potential contribution of several feature enhancements, are presented in the next section.

Figure 3: An example of a set of clusters associated with a case on adoption. (Note the 2-3 tier layout.)

6 PERFORMANCE AND EVALUATION

To assess the performance of our recommendation algorithm, we generated over 360,000 clusters from about 7 million U.S. case law documents, which collectively contain more than 22 million headnotes classified to approximately 100,000 key numbers. Each cluster is populated with the most important case law documents, as well as statutes, regulations, administrative decisions, and the 6 other previously mentioned content types, as its members. Over 100 million documents have gone through the association process to generate recommendations. Over one thousand documents of different document types have been manually reviewed by legal experts in order to determine the quality of the recommendations, along with the labels for the associated clusters. The following results are based on these reviewed samples.

6.1 Evaluation Design

We report three different evaluations as follows:

6.1.1 Evaluation I: Association and recommendation quality

In Evaluation I, the quality of the clusters associated with input documents of different content types was graded by legal experts using a five-point Likert scale from A (high quality associated cluster) to F (low quality associated cluster). In addition, the top-ranked recommended documents of different content types within those clusters were scored using a coherence measurement, which is defined as the extent to which the documents in a given cluster address the same specific legal issue with respect to the anchor topic extracted via segmentation from the original source document. Furthermore, the recommended case law documents were given a utility score, which is defined as the usefulness of the documents in the given cluster to a legal researcher. The rationale for assessing utility in addition to coherence is that it would be possible to have a cluster with a high coherence score that is not very useful to a legal researcher. For both metrics, the reviewers used a five-point scale ranging from 5 (high coherence with the current cluster's central topic, or high utility to a legal researcher) to 1 (low coherence with the current cluster's central topic, or low utility to a legal researcher).

6.1.2 Evaluation II: Label quality

For the label evaluations performed, human domain experts (attorneys) with many months of clustering assessment experience provided the judgments. A five-point Likert scale was again used, from A (highest quality) to F (lowest quality), based on accuracy and clarity. Specifically:

A = on point, completely captures the topic (5)
B = captures the main topic in a general sense (4)
C = captures some of the main topical focus (3)
D = captures secondary or minor topical focus (2)
F = misses the topical focus of the cluster (1)

6.1.3 Evaluation III: Composite legal report and issue quality

In Evaluation III, a group of legal professionals was involved in creating 10 research reports drawn from a cross-section of U.S. jurisdictions and covering different topical areas. Each of the reports included 7 or fewer of the most authoritative documents, including both primary sources, such as case law and statutes, and secondary sources, such as analytical materials. Legal topics were identified manually for each of the documents by domain experts. Further, they found that each of the 10 reports in this study had a common 'thread' (i.e., a common legal issue) running through it. However, the common thread did not always appear in every document in each of the reports.

The algorithm was applied to the same set of reports to detect clusters associated with these documents. The objective of this assessment was to evaluate two things: (1) is the recommendation algorithm able to discover all the legal topics in these documents, and (2) is the recommendation algorithm able to find the common legal issue in each of the reports.

We used precision and recall to measure performance for the first objective, in which:

• Precision, P, is defined as the number of correctly identified topics of a document (compared against the manually identified topics) divided by the total number of topics identified for the document, and

• Recall, R, is defined as the number of correctly identified topics of a document divided by the total number of manually identified topics of the document (the ground truth).

For the second objective, we used the true positive rate, TPR, which is defined as the number of common legal issues identified among the documents in all reports divided by the number of common legal issues manually identified among the documents in all reports by experts. These definitions are restated compactly below.
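Stated compactly (the fraction notation is introduced here only for clarity and is not from the original text):

P = \frac{|\text{correctly identified topics}|}{|\text{topics identified by the system}|}, \qquad
R = \frac{|\text{correctly identified topics}|}{|\text{manually identified topics}|}, \qquad
\mathrm{TPR} = \frac{|\text{common issues found by the system}|}{|\text{common issues identified by experts}|}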

6.2 Performance

6.2.1 Evaluation I

In Evaluation I, legal experts were asked to grade the quality of 105 clusters associated with the topics identified in 25 source documents. During this exercise, they reached a consensus on the clusters, with grade A being "excellent", B being "good", C being "acceptable", D being "marginal", and F being "poor". In addition, the experts defined the "precision rate" as the ratio of clusters receiving an A or B grade, and the "success rate" as the ratio of clusters receiving an A, B, or C grade, to the total number of associated clusters, respectively.

Table 1 shows the association quality for case law documents, federal statutes, and court briefs in ranked order. The precision and recall rates decline with rank, which is not unexpected. This behavior justifies our decision to pick the top-ranked cluster as the output of the association process. In the court briefs category, we also show the quality difference between using and not using the summarization tool (which picked the first N ranked sentences (e.g., N=10) from the document). As demonstrated by these results, summarization noticeably improved the quality of the top-ranked clusters. Table 1 also illustrates that in order to achieve high recall, i.e., in the upper 80% range and above, it is a challenge to obtain comparable precision rates. These percentages are noticeably lower, with summary-aided briefs highest, followed by case law documents, and finally statutes.

Table 2 shows the quality assessment of 105 recommended topics from 25 documents represented by case law opinions, U.S. secondary law materials (a.k.a. analytical materials), and court briefs, in terms of coherence and utility, and the overall recommendation quality per topic level and per document level (additional scores to assess the overall recommendation quality across all document types). As one can observe from Table 2, the recommendation quality at the topic level across document types remains high, in the 4.0 range. For case law documents, this is true in terms of both utility and coherence. Further, the overall recommendation quality at the topic level and the document level similarly remains high, above 4.0, in both instances.


Rank | Case Law (Precision / Recall) | Federal Statutes (Precision / Recall) | Court Briefs w/ Summarization (Precision / Recall) | Court Briefs w/o Summarization (Precision / Recall)
1 | 60% / 93% | 51% / 86% | 74% / 95% | 61% / 87%
2 | 55% / 91% | 45% / 86% | 69% / 95% | 63% / 88%
3 | 51% / 91% | 48% / 85% | 68% / 94% | 59% / 87%
4 | 50% / 90% | 42% / 85% | 66% / 90% | 56% / 88%
5 | 48% / 89% | 41% / 84% | 66% / 90% | 56% / 88%

Table 1: Association Quality on Rank (Expert Assessment)

Grade | Case Utility | Case Coherence | Analytical Coherence | Court Brief Coherence | Overall Recommendation per Topic | Overall Recommendation per Document
5 | 49 | 41 | 39 | 47 | 67 | 17
4 | 39 | 23 | 37 | 29 | 1 | 0
3 | 12 | 29 | 23 | 17 | 30 | 8
2 | 5 | 7 | 3 | 8 | 2 | 0
1 | 0 | 5 | 3 | 4 | 5 | 0
Average | 4.2 | 3.8 | 4.0 | 4.0 | 4.2 | 4.4

Table 2: Recommendation Quality Assessment (Expert Assessment)

6.2.2 Evaluation II

The following cluster labeling results were based on 100 final clusters, which included 60 from an initial set of clusters associated with case law documents and another 40 from clusters associated with annotated statutes. They were produced by running the system in our newly developed operational environment. The label sets assessed included our original taxonomy-based baseline label set, our NP-based baseline label set, a set where redundancies across the label segments were eliminated, a set where features such as df were comprehensively normalized, a set where labels with more frequently occurring n-grams were promoted, and lastly, a set where a bias towards shorter labels was used.

As is evident from Table 3, performance of the original baseline label set was solid and in fact difficult to outperform. Although the initial baseline received some of the highest scores from the assessors, for being understandable and closer to human-like quality, it is also worth noting that these labels did not harness the computational effort that subsequent versions did in attempting to produce a readable and precise (i.e., more granular) three-segment label. Thus, even though highly readable, they arguably do not contain the same degree of information as their more mature and labored counterparts in subsequent versions. That said, given other presentation-related factors, such as a bias towards shorter labels, results such as those produced for version 5 may be preferable. The other features tested in the label experiments proved not to be as consistently effective as anticipated.

6.2.3 Evaluation III

Regarding the first objective (the algorithm's ability to discover legal topics within a document set), the precision and recall of the system on the 10 reports across different document types is shown in Figure 4. Overall, the algorithm achieves reasonably high precision, but the recall in this instance was quite low, especially for case law documents. The main reason for this performance is the aggressive filtering, i.e., the adoption of much higher thresholds, in the post-processing of the system in order to achieve high precision.

For the second objective (the algorithm's ability to identify a common legal issue within the document set), Table 4 shows the performance of the system on each individual report, as well as overall.

Figure 4: Precision and Recall of Evaluation III

In this analysis, each of the 10 reports has a common legal issue running through it. However, this common 'thread' did not appear in every document in each of the reports.


Grade | Original (Baseline 1) | Version 1 (Baseline 2) | Version 2 (Redundancy Suppression) | Version 3 (Normalized Features) | Version 4 (N-Gram Promotion) | Version 5 (Short Length Bias)
A | 18 | 16 | 16 | 18 | 19 | 19
B | 47 | 46 | 45 | 20 | 38 | 41
C | 23 | 27 | 29 | 55 | 26 | 29
D | 9 | 9 | 8 | 18 | 14 | 9
F | 3 | 2 | 2 | 2 | 4 | 2
Total | 100 | 100 | 100 | 100 | 100 | 100
Average | 3.68 | 3.65 | 3.65 | 3.08 | 3.52 | 3.66

Table 3: Quality Assessment of Baseline and Hybrid Labels

Rpt ID | Total No. Docs | Docs w/ Same Theme (Ed. Gen. Assns) | Docs w/ Same Theme (Alg. Gen. Assns)
AE 2 | 6 | 6 | 6
AE 20 | 6 | 6 | 6
AE 22 | 4 | 4 | 4
AE 24 | 5 | 3 | 3
AE 31 | 6 | 6 | 6
AE 32 | 7 | 5 | 5
AE 35 | 6 | 6 | 5
AE 36 | 6 | 6 | 6
AE 37 | 5 | 3 | 3
AE 39 | 7 | 7 | 7
Total | 58 | 52 | 51
Precision | | 89.7% | 87.9%

Table 4: Precision of common legal issues of Evaluation III

In summary, the experts manually created clusters by identifying a common thread through all documents in 7 of the 10 reports (unshaded rows); our system identified a common thread through all documents in 6 of these 7 reports. In the remaining report, our system missed the common thread in one of the documents, and that report is thus considered a failure. Across the entire set, experts manually created topics and identified the common thread in 52 of the 58 documents (89.7%). Our system created clusters and identified the common thread in 51 of the 58 documents (87.9%), a small but still appreciable drop of just under two percentage points in TPR.
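For reference, the reported rates follow directly from the counts in Table 4:

\mathrm{TPR}_{\text{experts}} = \frac{52}{58} \approx 89.7\%, \qquad \mathrm{TPR}_{\text{system}} = \frac{51}{58} \approx 87.9\%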

As demonstrated in the different evaluations reported above, the assessment of our proposed recommendation system for legal documents consistently yielded reliable results overall.

7 CONCLUSIONS

Document recommendation remains an active area of research containing a spectrum of challenging research problems, as different applications have different needs. It is particularly challenging in the legal domain, where documents are intrinsically complex, multi-topical, and contain carefully crafted, professional, domain-specific language. Recommendations made in the legal domain require not only high precision but also high recall, since legal researchers cannot afford to miss important documents when preparing for a trial or related litigation proceedings. In addition, legal practitioners must familiarize themselves not only with primary arguments but also with the secondary or tertiary arguments associated with the legal issue. To this end, we describe in this paper an effective legal document recommendation system that relies on a built-in topic segmentation algorithm. The system is capable of producing high quality recommendations of important documents among different document types, recommendations that are specifically tailored to each of the individual legal issues discussed in the source document. The performance of the system is encouraging, especially given its validation by human legal experts through a series of different test assessments. Given these results, this paper makes three contributions to the field. First, it demonstrates how one can expand the set of original relevant legal documents using "more like this" functionality, one that does not require explicitly defining legal issues and constructing queries. Second, by providing meaningful, hierarchically structured labels by way of our labeling algorithm for legal issues, we show that users can effectively identify interesting and useful topics. And third, the system is highly scalable and flexible, as it has been applied to on the order of 100 million associations across different document types.

Based on our studies, users, especially legal researchers, often prefer to have the ability to drill down and focus on key issues common within a document set, as opposed to getting a high-level overview of a document collection. Attention to fine-grained legal issues, robustness, and the resulting topically homogeneous but content-type heterogeneous, high quality document clusters, not to mention scalability, are the chief characteristics of this issue-based recommendation system. It represents a powerful research tool for the legal community.

8 FUTURE WORK

Our future work will focus on improvements to the existing topic segmentation algorithm for documents which contain little metadata. We have been experimenting with topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and Non-negative Matrix Factorization (NMF) (Lee and Seung, 1999), in other related projects, and have witnessed very promising outcomes. Human-quality labels remain a challenge, since up until now substantial manual reviews by human experts have been required to ensure quality. We are pursuing this subject as another future research direction.

9 ACKNOWLEDGEMENTS

We thank John Duprey, Helen Hsu, Debanjan Ghosh, and Dave Seaman for their help in developing software for this work, and we are also grateful for the assistance of Julie Gleason and her team of legal experts for their detailed quality assessments and invaluable feedback. We thank Khalid Al-Kofahi, Bill Keenan, and Peter Jackson as well for their ongoing feedback and support.

REFERENCES

Adomavicius, G. and Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749.

Aggarwal, C. C. and Yu, P. S. (2006). A framework for clustering massive text and categorical data streams. In Proceedings of the Sixth SIAM International Conference on Data Mining (SDM 2006).

Al-Kofahi, K. et al. (2007). A document recommendation system blending retrieval and categorization technologies. In Proceedings of the AAAI Workshop on Recommender Systems in e-Commerce, pages 9–16.

Al-Kofahi, K., Tyrrell, A., Vachher, A., Travers, T., and Jackson, P. (2001). Combining multiple classifiers for text categorization. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM01), pages 97–104. ACM Press.

Beeferman, D., Berger, A. L., and Lafferty, J. D. (1997). A model of lexical attraction and repulsion. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL97), pages 373–380.

Bennett, J. and Lanning, S. (2007). The Netflix Prize. In Proceedings of the KDD Cup and Workshop.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Bun, K. K. and Ishizuka, M. (2002). Topic extraction from news archive using TF*PDF algorithm. In Proceedings of the Third International Conference on Web Information Systems Engineering (WISE02), pages 73–82.

Chen, K.-Y., Luesukprasert, L., and Chou, S.-c. T. (2007). Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Transactions on Knowledge and Data Engineering, 19(8):1016–1025.

Choi, F. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the Applied Natural Language Processing Conference (ANLP00), pages 26–33.

Choi, F. Y., Wiemer-Hastings, P. M., and Moore, J. (2001). Latent semantic analysis for text segmentation. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP01), pages 109–117.

Cohen, M. L. and Olsen, K. C. (2007). Legal Research in a Nutshell. Thomson West, Saint Paul, MN, 9th edition.

Fukumoto, F. and Suzuki, Y. (2011). Cluster labeling based on concepts in a machine-readable dictionary. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1371–1375. AFNLP.

Glover, E. J., Kostas, T., Lawrence, S., Pennock, D. M., and Flake, G. W. (2002a). Using web structure for classifying and describing web pages. In Proceedings of the World Wide Web Conference, pages 562–569. ACM Press.

Glover, E. J., Pennock, D. M., Lawrence, S., and Krovetz, R. (2002b). Inferring hierarchical descriptions. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM02), pages 507–514. ACM Press.

Hearst, M. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23:33–64.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International SIGIR Conference, pages 50–57. ACM Press.

Jain, A., Narasimha, M., and Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3):264–332.

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the World Wide Web Conference (WWW10), pages 661–671.

Linden, G., Smith, B., and York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80.

Liu, J., Dolan, P., and Pedersen, E. R. (2010). Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces (IUI10), pages 31–40.

Lu, Q., Conrad, J. G., Al-Kofahi, K., and Keenan, W. (2011). Legal document clustering with built-in topic segmentation. In Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM11), pages 383–392. ACM Press.

Malik, H. H., Kender, J. R., Fradkin, D., and Mörchen, F. (2010). Hierarchical document clustering using local patterns. Data Mining and Knowledge Discovery, 21(1):153–185.

Popescul, A. and Ungar, L. H. (2000). Automatic labeling of document clusters. Unpublished manuscript.

Prasad, S., Melville, P., Banerjee, A., and Sindhwani, V. (2011). Emerging topic detection using dictionary learning. In Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM11), pages 745–754. ACM Press.

Schilder, F. and Kondadadi, R. (2008). FastSum: Fast and accurate query-based multi-document summarization. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL08), pages 205–208.

Stein, B. and zu Eissen, S. M. (2004). Topic identification: Framework and application. In Proceedings of the 4th International Conference on Knowledge Management (KNOW04), pages 353–360.

Treeratpituk, P. and Callan, J. (2006). Automatically labeling hierarchical clusters. In Proceedings of the 2006 International Conference on Digital Government Research (DG.O 06), pages 167–176.

Utiyama, M. and Isahara, H. (2001). A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL01), pages 499–506.