India Research Lab
Auto-grouping Emails for Faster eDiscovery
Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp*
IBM Research – India *IBM Software Group
|
Outline of the Talk
eDiscovery Process
A new way of eDiscovery Review: Group Level Review
Creating Syntactic Groups
Creating Semantic Groups
Experiments and Conclusion
|
eDiscovery Process
Discovery: process in the pre-trial phase
- Produce relevant information
eDiscovery: FRCP 2006 amendment
- Produce relevant Electronically Stored Information (ESI)
- Emails, chats, Word documents, presentations, etc.
Huge volumes of ESI
- Process is expensive
- 60% of cases warrant some form of eDiscovery
- A 4.8-billion-dollar industry in 2011
|
eDiscovery Process
High cost due to the review stage
- Lawsuit between the Clinton administration and tobacco companies (U.S. v. Philip Morris)
Apply text mining techniques to reduce the high costs of the eDiscovery process
|
Architecture of eDiscovery Review Systems
[Figure: architecture diagram. Labeled components: Named Entity Annotator, Language Annotator, Signature Annotator]
|
Group Level Review
Review groups of documents that are “related” instead of individual documents
- Mark a whole group as responsive/unresponsive or privileged
- Efficient and consistent
- Syntactically similar documents: automated messages, near and exact duplicates
- Semantically similar documents: threads, semantic categories
|
Detecting Syntactic Groups: Automated Messages
|
Detecting Near Duplicates
S1: I am away from 17/2/2011 to 19/2/2011. Please mail [email protected] in case of any need
S2: I am away from 26/7/2011 to 31/7/2011. Please mail [email protected] in case of any need
Notion of Similarity: Resemblance
$S_1$ = set of all chunks for sentence 1 with window $k$
$S_2$ = set of all chunks for sentence 2 with window $k$
$$r(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$
Use fingerprinting (Rabin) instead of actual chunks.
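A minimal sketch of the resemblance measure above, assuming chunks are word-level windows of size k. It compares the actual chunks rather than Rabin fingerprints, and the address in the sample sentences is replaced by the hypothetical phrase "the helpdesk"; the function names are illustrative only.

```python
def chunks(text, k):
    """Return the set of all k-word chunks (windows) of a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(text_a, text_b, k=3):
    """r(S1, S2) = |S1 & S2| / |S1 | S2| over the two chunk sets."""
    s1, s2 = chunks(text_a, k), chunks(text_b, k)
    if not s1 and not s2:
        return 1.0  # assumption: two texts with no chunks count as identical
    return len(s1 & s2) / len(s1 | s2)


# Sample auto-replies; "the helpdesk" is a hypothetical stand-in for the address.
s1 = "I am away from 17/2/2011 to 19/2/2011. Please mail the helpdesk in case of any need"
s2 = "I am away from 26/7/2011 to 31/7/2011. Please mail the helpdesk in case of any need"

r = resemblance(s1, s2, k=3)
print(f"resemblance with k=3: {r:.3f}")
```

Only the chunks that touch one of the two dates differ, so the two messages keep a sizeable resemblance even though they are not exact duplicates.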
|
Efficient Detection of Near Duplicates
For a document of n words, there are n - K + 1 chunks with a window size of K
It suffices to keep for each document a relatively small fixed size signature
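Let $S_n$ be the set of permutations of $[n]$, and let $\pi$ be chosen uniformly at random over $S_n$
$$S_D \subseteq \{0, \ldots, n-1\} = [n] \text{ for each document } D$$
$$\Pr\left(\min\{\pi(S_A)\} = \min\{\pi(S_B)\}\right) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} = r(A, B)$$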
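The min-wise property above can be checked exactly on a small universe. The sketch below enumerates every permutation of [n] for n = 6 and counts how often the two minima coincide; the sets S_A and S_B are made-up examples, and exhaustive enumeration is only feasible for tiny n.

```python
from fractions import Fraction
from itertools import permutations


def min_match_probability(set_a, set_b, n):
    """Exact Pr[min(pi(A)) == min(pi(B))] over all permutations pi of range(n)."""
    matches = 0
    total = 0
    for pi in permutations(range(n)):
        total += 1
        # pi[x] is the image of element x under this permutation
        if min(pi[x] for x in set_a) == min(pi[x] for x in set_b):
            matches += 1
    return Fraction(matches, total)


def resemblance(set_a, set_b):
    """r(A, B) = |A & B| / |A | B| as an exact fraction."""
    return Fraction(len(set_a & set_b), len(set_a | set_b))


# Hypothetical chunk sets drawn from the universe {0, ..., 5}
S_A = {0, 1, 2, 3}
S_B = {2, 3, 4}

probability = min_match_probability(S_A, S_B, n=6)
print(probability, resemblance(S_A, S_B))
```

The two fractions agree because the smallest image over the union is equally likely to come from any element of the union, and the minima match exactly when that element lies in the intersection.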
|
Signature Annotator
In practice choosing the permutations randomly is hard
Use a set of n one-to-one functions fi and keep only the smallest value for each fi
Keep only the j least significant bits of each value
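One way the signature steps above could look in code. This is a sketch under several assumptions not stated in the slides: SHA-1 truncated to 56 bits stands in for Rabin fingerprints, the one-to-one functions are linear maps modulo a prime, and the names `chunk_fingerprints`, `make_functions`, and `signature` are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 56-bit fingerprint


def chunk_fingerprints(text, k):
    """Fingerprint every k-word chunk as a 56-bit integer (SHA-1 stand-in for Rabin)."""
    words = text.split()
    fingerprints = set()
    for i in range(len(words) - k + 1):
        chunk = " ".join(words[i:i + k])
        digest = hashlib.sha1(chunk.encode("utf-8")).digest()
        fingerprints.add(int.from_bytes(digest[:7], "big"))
    return fingerprints


def make_functions(n, seed=0):
    """Build n one-to-one maps f_i(x) = (a * x + b) mod PRIME with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(fingerprints, functions, j):
    """For each f_i keep the smallest value, truncated to its j least significant bits."""
    if not fingerprints:
        raise ValueError("document has no chunks")
    mask = (1 << j) - 1
    return [min((a * x + b) % PRIME for x in fingerprints) & mask
            for a, b in functions]


functions = make_functions(n=16, seed=42)
doc_a = "I am away this week. Please mail the helpdesk in case of any need"
doc_b = "I am away next week. Please mail the helpdesk in case of any need"
sig_a = signature(chunk_fingerprints(doc_a, 3), functions, j=8)
sig_b = signature(chunk_fingerprints(doc_b, 3), functions, j=8)
matches = sum(x == y for x, y in zip(sig_a, sig_b))
print(f"{matches} of {len(sig_a)} signature positions agree")
```

The fraction of agreeing positions serves as an estimate of the resemblance. Keeping fewer bits shrinks the signature but lets unrelated values collide more often.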
|
Discovering Automated Messages
Generating groups of near duplicates: index-based clustering
- For each document d in index I do
  - If d is not covered
    - Let S = {S1, S2, …, Sn} be the signature of document d
    - D = Query(I, atleast(S, k))
    - For each document d’ in D, mark d’ as covered
Discovering groups of automated messages
- Automated messages, groups of bulk emails, groups of forwarded emails
- Use MD5 to detect bulk emails
- Emails with one segment are automated messages
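The index-based clustering loop can be sketched as follows, assuming each document already has a fixed-length signature. The inverted index keyed by (position, value) and the helper names are assumptions for illustration, not the product's actual index API. Removing already covered documents from each new group is also an added assumption, made so that the groups stay disjoint.

```python
from collections import defaultdict


def build_index(signatures):
    """Inverted index: (position, value) -> ids of documents with that signature entry."""
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for pos, val in enumerate(sig):
            index[(pos, val)].add(doc_id)
    return index


def query_at_least(index, sig, k):
    """Return documents that share at least k signature entries with sig."""
    counts = defaultdict(int)
    for pos, val in enumerate(sig):
        for doc_id in index.get((pos, val), ()):
            counts[doc_id] += 1
    return {doc_id for doc_id, count in counts.items() if count >= k}


def cluster(signatures, k):
    """Group near duplicates: each uncovered document seeds one group."""
    index = build_index(signatures)
    covered = set()
    groups = []
    for doc_id, sig in signatures.items():
        if doc_id in covered:
            continue
        group = query_at_least(index, sig, k) - covered  # assumption: keep groups disjoint
        covered |= group
        groups.append(sorted(group))
    return groups


# Hypothetical 4-entry signatures for five documents
signatures = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 9],
    "c": [7, 8, 5, 6],
    "d": [7, 8, 5, 0],
    "e": [0, 0, 0, 0],
}
groups = cluster(signatures, k=3)
print(groups)
```

Each document is queried at most once as a seed, so the number of index queries is bounded by the number of groups rather than the number of document pairs.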
|
Detecting Semantic Groups: Email Threads
A tree like structure
A link denotes that the child node was written as a reply to the parent node.
Capture the context in which an email was written
|
Detecting Email Threads
Metadata-based methods
- Headers are not consistently used
Content of the old mail remains in the new mail
- A segment contains the text of only one communication
An email ei contains ej iff ei approximately contains all the segments of ej
|
© 2007 IBM Corporation
Method for Thread Detection
Email Segment Generator (ESG)
– Splits an email into segments, where each segment contains the content of only one email.
Segment Signature Generator (SSG):
– Generates a signature for a segment
• Use near duplicate signatures
For a practical implementation, we limit the number of segment signatures (N) that can be associated with an email, e.g., to 20 segments.
|
Method: Processing at Indexing Time
[Figure: indexing-time diagram. Labeled components: ESG, SSG, word index (w1, w2, ..., wn), meta index, signature index]
|
Method: Processing at Query Time
[Figure: query-time diagram. Labeled components: query q, word index (w1, w2, ..., wn), meta index, signature index]
Generating the candidate thread set: use the signature of the first segment
|
Detecting Email Threads
Given a candidate thread set
- Identify the email with only the root segment
- An email ec is a child of an email ep if ec minimally contains ep
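A sketch of how the parent-child rule could be applied to a candidate thread set. It rests on two simplifying assumptions: each email is represented by the set of its segment signatures, and approximate containment is reduced to exact set containment. "Minimally contains" is read here as "contains ep, with no other candidate email in between"; the email ids and signatures are hypothetical.

```python
def build_thread(emails):
    """emails: dict of email id -> frozenset of segment signatures.

    Returns (roots, parent), where roots are emails holding only one segment
    and parent maps each child email to the email it minimally contains.
    """
    roots = [eid for eid, segs in emails.items() if len(segs) == 1]
    parent = {}
    for child, child_segs in emails.items():
        # All emails whose segments are strictly contained in the child
        contained = [eid for eid, segs in emails.items() if segs < child_segs]
        # Keep those not strictly contained in another contained email
        closest = [p for p in contained
                   if not any(emails[p] < emails[q] for q in contained)]
        if closest:
            parent[child] = sorted(closest)[0]  # deterministic pick if several qualify
    return roots, parent


# Hypothetical candidate thread set: r is the root segment, a, b, c are replies
emails = {
    "e1": frozenset({"r"}),
    "e2": frozenset({"r", "a"}),
    "e3": frozenset({"r", "b"}),
    "e4": frozenset({"r", "a", "c"}),
}
roots, parent = build_thread(emails)
print(roots, parent)
```

Here e4 contains both e1 and e2, but e2 sits between them, so e4 is attached to e2 rather than directly to the root.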
|
Creating Semantic Categories
Focus categories
- Documents that are likely to be responsive
- Legal content, financial communication, intellectual property
- High recall
Filter categories
- Documents that are likely to be unresponsive
- Bulk emails, private communication, jokes
- High precision
|
Creating Semantic Categories
Email Segmentation
Pattern based annotation: Use System T based method
Consolidation
- Each concept is independent
- Apply additional constraints over concepts
|
Experiments – Near Duplicate Detection
Enron corpus
- 517K emails from 150 users
Measuring precision
- Manually evaluated near-duplicate sets for 500 queries
- With more bits, precision is 100% even with a 40% similarity threshold
Only 33.3% of emails are unique
|
Experiments – Email Thread Detection
No ground truth for threads
Subject approximation method: based on “Re:”, “Fw:”, etc. in the subject
Manually verified the thread results for our method and the subject approximation method
- The union of correct emails in a thread for both approaches is treated as ground truth
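A sketch of how a pooled ground truth of this kind could be used to score each method. The slides do not state which metrics were computed, so precision and recall here are an assumption, and the email ids are made up.

```python
def pooled_scores(returned, verified_correct, other_verified_correct):
    """Score one method against the union of both methods' verified-correct emails."""
    ground_truth = verified_correct | other_verified_correct
    correct = returned & ground_truth
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical thread results for one query
ours_returned = {"e1", "e2", "e3", "e4"}
ours_correct = {"e1", "e2", "e3"}      # e4 was judged wrong on manual review
subject_returned = {"e1", "e2", "e5"}
subject_correct = {"e1", "e2", "e5"}

ours = pooled_scores(ours_returned, ours_correct, subject_correct)
subject = pooled_scores(subject_returned, subject_correct, ours_correct)
print(ours, subject)
```

Pooling gives recall a denominator without labeling the whole corpus, at the cost of missing any thread member that neither method returned.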
|
Experiments – Semantic Group
Ground truth: Sampled 2200 emails using generic keywords and then manually labeled
|
Conclusions
We developed a framework that allows group-level review of documents
We developed methods for finding syntactic groups such as automated messages
We developed methods for finding email threads and semantic groups
We showed a significant reduction in review time by using group-level review, and integrated the proposed techniques with the IBM InfoSphere eDiscovery Analyzer product