open analytics dc june 2012 presentation
Document Analysis and Big Data: Making Sense out of the Flood
Agenda
• Define Big Data and Document Analysis
• The Infinit.e Solution
• Questions
What is Big Data?
"Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time."
Source: http://en.wikipedia.org/wiki/Big_data
This is what Big Data Feels Like
Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
"Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."
Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the form of noun-verb-noun), e.g.:
    • John Doe lives in Washington, D.C.
    • John Doe is married to Jane Doe
    • John Doe is a Virgo
    • John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
  Who: people, organizations, facilities, companies
  What: events, summaries, facts, themes
  When: past, present, future dates
  Where: city, state, country, coordinate
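One way to picture such a unified structure is sketched below. The `Entity` and `Association` classes and their field names are hypothetical, chosen only for illustration, and are not Infinit.e's actual schema. Each entity carries one of the four dimensions, and each association links two entities through a verb, following the John Doe examples above.

```python
from dataclasses import dataclass

# Hypothetical types for illustration only; not Infinit.e's actual schema.
@dataclass(frozen=True)
class Entity:
    name: str
    type: str        # e.g. "Person", "City", "Country"
    dimension: str   # one of "Who", "What", "When", "Where"

@dataclass(frozen=True)
class Association:
    subject: Entity  # first noun
    verb: str
    obj: Entity      # second noun

john = Entity("John Doe", "Person", "Who")
jane = Entity("Jane Doe", "Person", "Who")
dc = Entity("Washington, D.C.", "City", "Where")
mexico = Entity("Mexico", "Country", "Where")

associations = [
    Association(john, "lives in", dc),
    Association(john, "is married to", jane),
    Association(john, "traveled to", mexico),
]

def by_dimension(assocs, dimension):
    """Return the associations whose object falls in the given dimension."""
    return [a for a in assocs if a.obj.dimension == dimension]

# All "Where" facts known about the document set.
where_facts = by_dimension(associations, "Where")
```

Keeping the dimension on the entity rather than on the association means the same entity can be reused across many facts without repeating its classification.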
• Infinit.e is an open source document discovery and analysis platform with some very cool open source tools lurking under the hood.
The Infinit.e Solution
github.com/ikanow/Infinit.e
The Infinit.e Solution
Infinit.e is a scalable framework for collecting, storing, enriching, retrieving, analyzing, and visualizing structured and unstructured documents.
Harvesting
• Infinit.e's harvester:
  – Collects documents from specified data sources (URLs, RDBMSs via JDBC, file shares)
  – Marshals each document through the enrichment process
  – Saves each metadata document, entity, and association created to MongoDB
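The three harvester steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names (`collect`, `enrich`, `save`, `harvest`) are hypothetical, the enrichment is a toy name match rather than real NLP, and a plain dict of lists stands in for the MongoDB collections.

```python
# Hypothetical sketch of the collect -> enrich -> save flow.
# A dict of lists stands in for MongoDB; nothing here is Infinit.e's real API.

def collect(sources):
    """Yield raw documents from each configured source (hard-coded here)."""
    for source in sources:
        for doc in source["documents"]:
            yield {"url": doc["url"], "title": doc["title"], "text": doc["text"]}

def enrich(doc, known_people):
    """Toy enrichment: tag any known person name that appears in the text."""
    entities = [
        {"disambiguated_name": name, "type": "Person", "dimension": "Who"}
        for name in known_people
        if name in doc["text"]
    ]
    return {**doc, "entities": entities, "associations": []}

def save(store, doc):
    """Persist the metadata document plus the entities and associations it produced."""
    store["metadata"].append(doc)
    store["entities"].extend(doc["entities"])
    store["associations"].extend(doc["associations"])

def harvest(sources, known_people):
    store = {"metadata": [], "entities": [], "associations": []}
    for raw in collect(sources):
        save(store, enrich(raw, known_people))
    return store

# Made-up source configuration for the example.
sources = [{
    "documents": [{
        "url": "http://example.com/1",
        "title": "Example",
        "text": "John Doe traveled to Mexico.",
    }]
}]
store = harvest(sources, known_people=["John Doe", "Jane Doe"])
```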
Source Ingestion Data Flow
Sample RSS Document
<rss version="2.0">
  <channel>
    …
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
      <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
      <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description>
      <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
    …
  </channel>
</rss>
Full Text Source
Document Metadata
• doc_metadata.metadata
{
  "_id" : ObjectId("4f93638e0cf212156d0559d2"),
  "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
  "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
  "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
  "created" : ISODate("2012-04-22T01:49:02Z"),
  "metadata" : {…},
  "associations" : […],
  "entities" : […],
  ...
}
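The mapping from an RSS item to the title, url, and description fields of a metadata document can be sketched with Python's standard XML parser. This is an assumption-laden illustration, not the harvester's actual code: `item_to_metadata` is a hypothetical helper, the sample is an abbreviated copy of the item above with the `dc:` elements left out, and the remaining fields (`_id`, `created`, `entities`, and so on) are not reproduced.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample item; the dc: elements are omitted here.
RSS = """<rss version="2.0">
<channel>
<item>
<title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ... </title>
<link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
<description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. </description>
</item>
</channel>
</rss>"""

def item_to_metadata(item):
    """Hypothetical helper: copy the basic RSS fields into a metadata dict."""
    return {
        "title": item.findtext("title", default="").strip(),
        "url": item.findtext("link", default="").strip(),
        "description": item.findtext("description", default="").strip(),
    }

root = ET.fromstring(RSS)
docs = [item_to_metadata(item) for item in root.iter("item")]
```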
Harvested Document Metadata
• document.metadata
"metadata" : {
  "location" : [
    {
      "region" : "South Asia",
      "citystateprovince" : {
        "stateprovince" : "Rolpa",
        "city" : "Newang"
      },
      "country" : "Nepal"
    }
  ],
  "icn" : [ "200573487" ],
  "incidentdate" : [ "07/25/2005" ],
  "organization" : [
    "Communist Party of Nepal (Maoist)/United People's Front"
  ],
  ...
},
Document Enrichment
• Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs.
Harvested Entities
• feature.entity
{
  "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
  "alias" : [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali"
  ],
  "batch_resync" : true,
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(143),
  "db_sync_time" : "1338751174988",
  "dimension" : "Who",
  "disambiguated_name" : "Zine El Abidine Ben Ali",
  "doccount" : 152,
  "index" : "zine el abidine ben ali/person",
  "totalfreq" : 353,
  "type" : "Person"
}
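In the sample record, the "index" value looks like the lowercased name joined to the lowercased type with a slash. The sketch below assumes that rule (inferred from this one record, not confirmed by the source) and shows how several spellings could fold into a single entity feature. `entity_index` and `merge_mention` are hypothetical helpers, and real disambiguation is more involved than this.

```python
def entity_index(name, entity_type):
    """Build a key such as 'zine el abidine ben ali/person' (assumed rule)."""
    return f"{name.lower()}/{entity_type.lower()}"

def merge_mention(features, name, entity_type, dimension):
    """Fold one mention into a dict of entity features keyed by index."""
    key = entity_index(name, entity_type)
    feature = features.setdefault(key, {
        "index": key,
        # Simplification: the first spelling seen becomes the display name.
        "disambiguated_name": name,
        "type": entity_type,
        "dimension": dimension,
        "alias": [],
        "totalfreq": 0,
    })
    if name not in feature["alias"]:
        feature["alias"].append(name)
    feature["totalfreq"] += 1
    return feature

features = {}
for spelling in [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali",
]:
    merge_mention(features, spelling, "Person", "Who")

ben_ali = features["zine el abidine ben ali/person"]
```

Here all three spellings share one key, so `ben_ali["alias"]` ends up holding the three variants and `ben_ali["totalfreq"]` counts the three mentions.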
Harvested Entities
Harvested Associations
• feature.association
{
  "_id" : ObjectId("4f9189d48baf188282a1ca24"),
  "assoc_type" : "Fact",
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(70),
  "db_sync_time" : "1338491609281",
  "doccount" : NumberLong(73),
  "entity1" : [
    "zine el abidine ben ali",
    "zine el abidine ben ali/person"
  ],
  "entity1_index" : "zine el abidine ben ali/person",
  "entity2" : [
    "president",
    "president/position"
  ],
  "entity2_index" : "president/position",
  "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
  "verb" : [
    "career",
    "current",
    "past"
  ],
  "verb_category" : "career"
}
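Because each association stores the index of both entities, facts about an entity can be looked up by those keys. The following is a minimal in-memory sketch of such a lookup, not a query against a real Infinit.e store: `associations_for` is a hypothetical helper, the first record is trimmed from the sample above, and the second record is made up for contrast.

```python
associations = [
    {   # trimmed from the sample record above
        "assoc_type": "Fact",
        "entity1_index": "zine el abidine ben ali/person",
        "entity2_index": "president/position",
        "verb_category": "career",
        "doccount": 73,
    },
    {   # hypothetical record, invented for this example
        "assoc_type": "Fact",
        "entity1_index": "john doe/person",
        "entity2_index": "mexico/country",
        "verb_category": "travel",
        "doccount": 1,
    },
]

def associations_for(assocs, entity_index, verb_category=None):
    """Return associations that mention the entity on either side."""
    hits = [
        a for a in assocs
        if entity_index in (a["entity1_index"], a["entity2_index"])
    ]
    if verb_category is not None:
        hits = [a for a in hits if a["verb_category"] == verb_category]
    return hits

career_facts = associations_for(
    associations, "zine el abidine ben ali/person", verb_category="career"
)
```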
Harvested Associations
Geolocation of Entities/Events
• feature.geo
{
  "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
  "search_field" : "cairo",
  "country" : "Egypt",
  "country_code" : "EG",
  "city" : "cairo",
  "region" : "Al Qahirah",
  "region_code" : "EG11",
  "population" : 7734602,
  "latitude" : "30.05",
  "longitude" : "31.25",
  "geoindex" : {
    "lon" : 31.25,
    "lat" : 30.05
  }
}
Note: MongoDB 2d Index
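A geospatial index exists to answer proximity questions such as "which known place is closest to this coordinate?". The pure-Python sketch below illustrates that kind of lookup over records shaped like the "geoindex" field above. It is a stand-in only: it scans a list and uses great-circle (haversine) distance, and it does not reproduce how MongoDB's 2d index stores or compares points. The Tunis record and its approximate coordinates are added for the example and are not from the slides.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

places = [
    # from the sample record above
    {"search_field": "cairo", "country_code": "EG",
     "geoindex": {"lon": 31.25, "lat": 30.05}},
    # hypothetical second record with approximate coordinates
    {"search_field": "tunis", "country_code": "TN",
     "geoindex": {"lon": 10.18, "lat": 36.81}},
]

def nearest(places, lat, lon):
    """Return the place whose geoindex is closest to the given point."""
    return min(
        places,
        key=lambda p: haversine_km(
            lat, lon, p["geoindex"]["lat"], p["geoindex"]["lon"]
        ),
    )

closest = nearest(places, lat=30.0, lon=31.0)
```

Note that "geoindex" lists longitude before latitude, while the helper takes latitude first; mixing up the two orders is a common source of bugs in geo code.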
Geolocation of Entities/Events
Who, What, Where and When
Thank You!
github.com/ikanow/Infinit.e
Craig Vitter
www.ikanow.com