[email protected]
http://compbio.ucdenver.edu
A Distributed Framework for Computation on the Results of Large Scale NLP
Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston, Lawrence Hunter (UC Denver)
Questions that could be answered using large corpora
• Second source of data for validation/corroboration
– Ligand binding site validation – Verspoor et al.
• Rough ideas/leads to protein-protein interactions (PPI) from co-occurrence
• Protein co-occurrence fraction for use in Hanalyzer networks
• Mine more, and more recent, knowledge than is available from curated ontologies
Available Tools and Data
• Data
– Large corpora: PMC OA, publisher-arranged collections
– Curated ontologies: PRO, GO, etc.
• Tools
– UIMA for NLP processing
– Batch schedulers (SGE, Torque) to scale UIMA
– Hadoop to collate data
– RDF to represent knowledge
– Triple store (Franz AllegroGraph) to store and access large amounts of RDF data
Bio Trends: a Sample Integration Project
• Function
– Count occurrences of proteins in articles
– Collate by date, and display on a web app
• Design
– UIMA over SGE for protein ID; store in RDF files
– Read RDF files and collate with Hadoop
• Call out to AllegroGraph for ID and attribute lookup
– Format resulting data as JSON for availability to web app
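The last design step, formatting collated counts as JSON for the web app, can be sketched as follows. This is a minimal illustration only: the field names and the PRO-style identifier are assumptions, not the project's actual schema.

```python
import json

def to_json(counts):
    """counts: dict mapping (protein_id, year) -> mention count.
    Groups the per-year counts under each protein and serializes
    to JSON for consumption by a web app (assumed layout)."""
    series = {}
    for (protein, year), n in sorted(counts.items()):
        series.setdefault(protein, []).append({"year": year, "count": n})
    return json.dumps(series, indent=2)

# Illustrative data: a PRO-style ID with counts for two years.
counts = {("PR:000004403", 2008): 17, ("PR:000004403", 2009): 25}
print(to_json(counts))
```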
Prepare Available Data
• Start with raw text: PMC Open Access
– 250k full-text journal articles
• Identify (annotate) interesting spans (genes)
– UIMA pipeline; NERs: ABNER, BANNER, etc.; ConceptMapper on a PRO dictionary to normalize
– Output to RDF for various uses
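The "output to RDF" step can be pictured as emitting N-Triples lines per annotation. The namespace URIs and predicate names below are illustrative assumptions, not the project's actual vocabulary; only the PRO PURL pattern follows the standard OBO form.

```python
# Hypothetical annotation namespace for illustration only.
EX = "http://example.org/annotation/"

def annotation_triples(ann_id, doc_uri, start, end, pro_id):
    """Serialize one gene-mention annotation as N-Triples lines:
    which document it came from, its character span, and the
    normalized PRO concept it denotes."""
    s = f"<{EX}{ann_id}>"
    return [
        f'{s} <{EX}document> <{doc_uri}> .',
        f'{s} <{EX}spanStart> "{start}" .',
        f'{s} <{EX}spanEnd> "{end}" .',
        f'{s} <{EX}denotes> <http://purl.obolibrary.org/obo/{pro_id}> .',
    ]

for t in annotation_triples("a1", "http://example.org/doc/PMC12345",
                            100, 105, "PR_000004403"):
    print(t)
```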
Options to Analyze Data
• Load into triple store and query
– Necessary for exploring queries with complex results over the entire graph
– Ex.
• Load individual files into an in-memory store and query in small groups
– Possible for exploring simple queries over many small regions of the graph: article-related
– Easier to federate
• Hybrid
– Some data not available from the RDF files, only from the triple store
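The second option, querying small groups of triples in memory, can be illustrated with a toy pattern matcher. A real deployment would use a proper store (e.g. AllegroGraph or an in-memory RDF library); this pure-Python sketch, with invented data, only shows the idea.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching a (subject, predicate, object) pattern;
    None acts as a wildcard, as in a basic SPARQL triple pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Illustrative annotation triples loaded from one small group of files.
triples = [
    ("ann1", "denotes", "PR_000004403"),
    ("ann1", "inDocument", "PMC12345"),
    ("ann2", "denotes", "PR_000004403"),
    ("ann2", "inDocument", "PMC67890"),
]

# Which documents mention this protein?
docs = [o for a, _, _ in match(triples, p="denotes", o="PR_000004403")
        for _, _, o in match(triples, s=a, p="inDocument")]
print(docs)   # ['PMC12345', 'PMC67890']
```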
Map-Reduce
• Inspired by Lisp functions "map" and "reduce"
– Map applies a function to each element of a list
• (a1, a2, … an), f(x) → (f(a1), f(a2), … f(an))
– Reduce combines lists by applying a function successively
• (a1, a2, … an), f(x,y) → f(f(f(a1,a2), a3), a4)
• (1, 2, … n), + → (((1 + 2) + 3) + 4)
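The two primitives above map directly onto Python's built-in `map` and `functools.reduce`:

```python
from functools import reduce

xs = [1, 2, 3, 4]

# Map: apply f to each element -> (f(a1), f(a2), ... f(an))
squared = list(map(lambda x: x * x, xs))
print(squared)   # [1, 4, 9, 16]

# Reduce: fold the list with f -> (((1 + 2) + 3) + 4)
total = reduce(lambda x, y: x + y, xs)
print(total)     # 10
```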
Map Reduce on HashMaps
• Map can be used to transform from one kind of (key, value) to a different kind of (key, value)
– (filename, text) → (gene, count)
• Reduce must have the same kind of key and value output as input. A call to reduce gets all values for a particular key.
– (gene, count) → (gene, count)
– (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5)
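The (filename, text) → (gene, count) flow above can be sketched in-process: map emits key-value pairs, a grouping step (the "shuffle") collects all values per key, and reduce sums each group. The gene matching here is a naive token lookup, purely for illustration.

```python
from collections import defaultdict

GENES = {"BRCA1", "TP53"}   # toy gene dictionary

def map_fn(filename, text):
    """(filename, text) -> list of (gene, 1) pairs."""
    return [(tok, 1) for tok in text.split() if tok in GENES]

def reduce_fn(gene, counts):
    """(gene, [count, ...]) -> (gene, total); same key/value types in and out."""
    return (gene, sum(counts))

docs = {"a.txt": "BRCA1 binds BRCA1 and TP53", "b.txt": "TP53 again TP53"}

grouped = defaultdict(list)              # the "shuffle" step
for fname, text in docs.items():
    for gene, n in map_fn(fname, text):
        grouped[gene].append(n)

results = dict(reduce_fn(g, ns) for g, ns in grouped.items())
print(results)   # {'BRCA1': 2, 'TP53': 3}
```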
Hadoop: a distributed map-reduce on maps or hash tables
• Can divide work into parallel-friendly tasks by key
• Distributes files over the network
• Reduces network traffic by performing computation where the data is
• Map is used to move from one key-value type to another: from (filename → contents) to (protein-protein pair → co-occurrence count)
• Reduce is used to collate results
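One way to run such a job is Hadoop Streaming, where the mapper and reducer are plain stdin/stdout scripts; this is an assumed setup for illustration (the project itself may have used native Java Hadoop). The snippet simulates the framework's map → sort-by-key (shuffle) → reduce stages in-process.

```python
from io import StringIO
from itertools import groupby

GENES = {"BRCA1", "TP53"}   # toy gene dictionary

def mapper(instream, outstream):
    """Emit a tab-separated (gene, 1) line for each gene token."""
    for line in instream:
        for tok in line.split():
            if tok in GENES:
                outstream.write(f"{tok}\t1\n")

def reducer(instream, outstream):
    """Read key-sorted (gene, count) lines; sum counts per gene."""
    pairs = (line.rstrip("\n").split("\t") for line in instream)
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        outstream.write(f"{gene}\t{sum(int(v) for _, v in group)}\n")

# Simulate the framework: map, sort lines by key (the shuffle), reduce.
mapped = StringIO()
mapper(StringIO("BRCA1 binds TP53\nBRCA1 again\n"), mapped)
shuffled = StringIO("".join(sorted(mapped.getvalue().splitlines(True))))
reduced = StringIO()
reducer(shuffled, reduced)
print(reduced.getvalue(), end="")
```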
Results
• PMC OA
• Medline Abstracts
Screen Shot
Thank You / Questions
• http://www.compbio.ucdenver.edu/bio-trends
• Co-authors
– William Baumgartner for data generation
– Kevin Livingston for RDF and Clojure help
• Grants and PIs
– Larry Hunter, UC Denver SOM
• NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
– Karin Verspoor, UC Denver SOM
• NIH R01 LM010120-01
– Gully Burns, ISI
• NSF 0849977