Open Analytics DC June 2012 Presentation

Document Analysis and Big Data: Making Sense out of the Flood

Upload: ikanow

Post on 30-Jun-2015


TRANSCRIPT

Page 1: Open Analytics DC June 2012 Presentation

Document Analysis and Big Data: Making Sense out of the Flood

Page 2: Open Analytics DC June 2012 Presentation

Agenda

• Define Big Data and Document Analysis
• The Infinit.e Solution
• Questions

Page 3: Open Analytics DC June 2012 Presentation

What is Big Data?

“Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.”
Source: http://en.wikipedia.org/wiki/Big_data

Page 4: Open Analytics DC June 2012 Presentation

This is what Big Data Feels Like

Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/

Page 5: Open Analytics DC June 2012 Presentation

What is Document Analysis?

“Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.”
Source: http://www.text-tech.com/docanalysis/definition.html

Page 6: Open Analytics DC June 2012 Presentation

Document Analysis

• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the form of noun-verb-noun), e.g.:
    • John Doe lives in Washington, D.C.
    • John Doe is married to Jane Doe
    • John Doe is a Virgo
    • John Doe traveled to Mexico on July 6th, 2011

• And…
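A minimal sketch of how the noun-verb-noun associations listed on this slide could be modeled. The class and function names (`Entity`, `Association`, `objects_of`) are illustrative assumptions, not part of Infinit.e.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A person, place, or thing extracted from a document."""
    name: str
    type: str  # e.g. "Person", "Place" (hypothetical type labels)


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun triple linking two entities."""
    subject: Entity
    verb: str
    obj: Entity


john = Entity("John Doe", "Person")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "Place")),
    Association(john, "is married to", Entity("Jane Doe", "Person")),
    Association(john, "traveled to", Entity("Mexico", "Place")),
]


def objects_of(subject, verb, assocs):
    """Return the names of entities linked to `subject` by `verb`."""
    return [a.obj.name for a in assocs if a.subject == subject and a.verb == verb]
```

With triples stored this way, a question such as "where does John Doe live?" becomes a simple filter: `objects_of(john, "lives in", associations)`.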

Page 7: Open Analytics DC June 2012 Presentation

Document Analysis

• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.

Who: people, organizations, facilities, companies
What: events, summaries, facts, themes
When: past, present, future dates
Where: city, state, country, coordinate

Page 8: Open Analytics DC June 2012 Presentation

The Infinit.e Solution

• Infinit.e is an open source document discovery and analysis platform that has some very cool open source tools lurking under the hood.

github.com/ikanow/Infinit.e

Page 9: Open Analytics DC June 2012 Presentation

The Infinit.e Solution

Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Structured and Unstructured Documents

Page 10: Open Analytics DC June 2012 Presentation

Harvesting

• Infinit.e’s harvester:
  – Collects documents from specified data sources (URLs, RDBMSs via JDBC, file shares)
  – Marshals each document through the enrichment process
  – Saves each metadata document, entity, and association created to MongoDB
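The three harvester steps above (collect, enrich, save) can be sketched as a small pipeline. This is an illustration under stated assumptions, not Infinit.e's implementation: `harvest`, `fake_rss_source`, and `toy_enrich` are hypothetical names, a plain list stands in for MongoDB, and the "enrichment" is a toy capitalized-word picker rather than real NLP.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, object]


def harvest(sources: Iterable[Callable[[], Iterable[Document]]],
            enrich: Callable[[Document], Document],
            store: List[Document]) -> int:
    """Collect documents from each source, enrich them, and save them.

    `store` is an in-memory list standing in for a MongoDB collection.
    Returns the number of documents saved.
    """
    saved = 0
    for fetch in sources:
        for doc in fetch():
            store.append(enrich(doc))
            saved += 1
    return saved


def fake_rss_source():
    """Hypothetical source; a real one would read a URL, JDBC query, or file share."""
    yield {"title": "Example", "description": "Cairo hosts a conference."}


def toy_enrich(doc):
    """Toy enrichment: treat capitalized words as entities (not real NLP)."""
    enriched = dict(doc)
    words = str(doc["description"]).split()
    enriched["entities"] = [w.strip(".,") for w in words if w[0].isupper()]
    return enriched
```

Passing the sources and the enrichment step in as callables keeps the collection loop independent of where documents come from and which extraction library processes them.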

Page 11: Open Analytics DC June 2012 Presentation

Source Ingestion Data Flow

Page 12: Open Analytics DC June 2012 Presentation

Sample RSS Document

<rss version="2.0">
  <channel>
    …
    <item>
      <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
      <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
      <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description>
      <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
    …
  </channel>
</rss>
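As a rough illustration of how an RSS item like the one above could be turned into a document record, here is a sketch using Python's standard library. The `SAMPLE` feed is a shortened stand-in with placeholder values, the `xmlns:dc` declaration is an assumption (the slide elides the feed header), and the output field names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace; assumed to be what the feed's "dc:" prefix maps to.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example title</title>
      <link>http://example.com/story.html</link>
      <description>Example description</description>
      <dc:creator>unknown</dc:creator>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, with illustrative field names."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title"),
            "url": item.findtext("link"),
            "description": item.findtext("description"),
            "creator": item.findtext("dc:creator", namespaces=DC_NS),
            "published": item.findtext("dc:date", namespaces=DC_NS),
        })
    return docs
```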

Page 13: Open Analytics DC June 2012 Presentation

Full Text Source

Page 14: Open Analytics DC June 2012 Presentation

Document Metadata

• doc_metadata.metadata

{
  "_id" : ObjectId("4f93638e0cf212156d0559d2"),
  "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
  "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
  "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
  "created" : ISODate("2012-04-22T01:49:02Z"),
  "metadata" : {…},
  "associations" : […],
  "entities" : […],
  ...
}

Page 15: Open Analytics DC June 2012 Presentation

Harvested Document Metadata

• document.metadata

"metadata" : {
  "location" : [
    {
      "region" : "South Asia",
      "citystateprovince" : {
        "stateprovince" : "Rolpa",
        "city" : "Newang"
      },
      "country" : "Nepal"
    }
  ],
  "icn" : [ "200573487" ],
  "incidentdate" : [ "07/25/2005" ],
  "organization" : [
    "Communist Party of Nepal (Maoist)/United People's Front"
  ],
  ...
},

Page 16: Open Analytics DC June 2012 Presentation

Document Enrichment

• Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:

Page 17: Open Analytics DC June 2012 Presentation

Harvested Entities

• feature.entity

{
  "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
  "alias" : [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali"
  ],
  "batch_resync" : true,
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(143),
  "db_sync_time" : "1338751174988",
  "dimension" : "Who",
  "disambiguated_name" : "Zine El Abidine Ben Ali",
  "doccount" : 152,
  "index" : "zine el abidine ben ali/person",
  "totalfreq" : 353,
  "type" : "Person"
}
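In the entity record above, the `index` value looks like the lowercased disambiguated name joined to the lowercased type with a slash. The sketch below assumes that convention (it is inferred from this one sample, not from Infinit.e documentation) and uses it to map every alias to a single key. The function names are hypothetical.

```python
def entity_index(disambiguated_name: str, entity_type: str) -> str:
    """Build a "name/type" key, assuming the lowercase convention seen in the sample."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def build_alias_lookup(entities):
    """Map each lowercased alias to its entity's index key."""
    lookup = {}
    for entity in entities:
        key = entity_index(entity["disambiguated_name"], entity["type"])
        for alias in entity["alias"]:
            lookup[alias.lower()] = key
    return lookup


sample_entity = {
    "alias": [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali",
    ],
    "disambiguated_name": "Zine El Abidine Ben Ali",
    "type": "Person",
}
alias_lookup = build_alias_lookup([sample_entity])
```

Resolving spelling variants to one key is what lets counts such as `doccount` and `totalfreq` accumulate per entity rather than per spelling.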

Page 18: Open Analytics DC June 2012 Presentation

Harvested Entities

Page 19: Open Analytics DC June 2012 Presentation

Harvested Associations

• feature.association

{
  "_id" : ObjectId("4f9189d48baf188282a1ca24"),
  "assoc_type" : "Fact",
  "communityId" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(70),
  "db_sync_time" : "1338491609281",
  "doccount" : NumberLong(73),
  "entity1" : [
    "zine el abidine ben ali",
    "zine el abidine ben ali/person"
  ],
  "entity1_index" : "zine el abidine ben ali/person",
  "entity2" : ["president", "president/position"],
  "entity2_index" : "president/position",
  "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
  "verb" : ["career", "current", "past"],
  "verb_category" : "career"
}

Page 20: Open Analytics DC June 2012 Presentation

Harvested Associations

Page 21: Open Analytics DC June 2012 Presentation

Geolocation of Entities/Events

• feature.geo

{
  "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
  "search_field" : "cairo",
  "country" : "Egypt",
  "country_code" : "EG",
  "city" : "cairo",
  "region" : "Al Qahirah",
  "region_code" : "EG11",
  "population" : 7734602,
  "latitude" : "30.05",
  "longitude" : "31.25",
  "geoindex" : {
    "lon" : 31.25,
    "lat" : 30.05
  }
}

Note: MongoDB 2d Index
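A hedged sketch of what using a MongoDB 2d index on the `geoindex` field might look like from Python. The block only builds the index specification and a legacy-coordinate `$near` query as plain data, so it runs without a database. The pymongo calls in the comments, the database and collection names, and the helper name `near_query` are assumptions for illustration; the units of `$maxDistance` for legacy coordinate pairs should be checked against the MongoDB documentation.

```python
# Index specification for a legacy 2d index on the "geoindex" field.
# With pymongo this could be passed as: collection.create_index(GEO_INDEX_SPEC)
GEO_INDEX_SPEC = [("geoindex", "2d")]


def near_query(lon, lat, max_distance):
    """Build a legacy-coordinate $near query, longitude first.

    `max_distance` is passed through unchanged; consult the MongoDB docs
    for the units expected with legacy coordinate pairs.
    """
    return {"geoindex": {"$near": [lon, lat], "$maxDistance": max_distance}}


# Hypothetical usage against a running server (names are assumptions):
#   from pymongo import MongoClient
#   geo = MongoClient()["feature"]["geo"]
#   geo.create_index(GEO_INDEX_SPEC)
#   nearby = list(geo.find(near_query(31.25, 30.05, 1)).limit(5))
cairo_query = near_query(31.25, 30.05, 1)
```

Note the longitude-first ordering, which matches the `"lon"` then `"lat"` layout of the `geoindex` sub-document in the record above.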

Page 22: Open Analytics DC June 2012 Presentation

Geolocation of Entities/Events

Page 23: Open Analytics DC June 2012 Presentation

Who, What, Where and When

Page 24: Open Analytics DC June 2012 Presentation

Thank You!

github.com/ikanow/Infinit.e

Craig Vitter

www.ikanow.com
