[email protected]
http://compbio.ucdenver.edu
A Distributed Framework for Computation on the Results of Large Scale NLP
Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston, Lawrence Hunter (UC Denver)
Questions that could be answered using large corpora
• Second source of data for validation/corroboration
– Ligand binding site validation – Verspoor et al.
• Rough ideas/leads to protein-protein interactions (PPI) from co-occurrence
• Protein co-occurrence fraction for use in Hanalyzer networks
• Mine more, and more recent, knowledge than is available from curated ontologies
Available Tools and Data
• Data
– Large corpora: PMC OA, publisher-arranged collections
– Curated ontologies: PRO, GO, etc.
• Tools
– UIMA for NLP processing
– Batch schedulers (SGE, Torque) to scale UIMA
– Hadoop to collate data
– RDF to represent knowledge
– Triple store (Franz AllegroGraph) to store and access large amounts of RDF data
Bio Trends: a Sample Integration Project
• Function
– Count occurrences of proteins in articles
– Collate by date, and display on a web app
• Design
– UIMA over SGE for protein ID; store in RDF files
– Read RDF files and collate with Hadoop
• Call out to AllegroGraph for ID and attribute lookup
– Format resulting data as JSON for availability to web app
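The last design step, formatting collated counts as JSON for the web app, can be sketched as follows. This is a minimal illustration only: the field names and the PRO-style identifier are assumptions, not the project's actual schema.

```python
import json

def to_json(counts):
    """counts: dict mapping (protein_id, year) -> mention count.
    Groups the per-year counts under each protein and serializes
    to JSON for consumption by a web app (assumed layout)."""
    series = {}
    for (protein, year), n in sorted(counts.items()):
        series.setdefault(protein, []).append({"year": year, "count": n})
    return json.dumps(series, indent=2)

# Illustrative data: a PRO-style ID with counts for two years.
counts = {("PR:000004403", 2008): 17, ("PR:000004403", 2009): 25}
print(to_json(counts))
```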
Prepare Available Data
• Start with raw text: PMC Open Access
– 250k full-text journal articles
• Identify (annotate) interesting spans (genes)
– UIMA pipeline; NERs: ABNER, BANNER, etc.; ConceptMapper on a PRO dictionary to normalize
– Output to RDF for various uses
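The "output to RDF" step can be pictured as emitting N-Triples lines per annotation. The namespace URIs and predicate names below are illustrative assumptions, not the project's actual vocabulary; only the PRO PURL pattern follows the standard OBO form.

```python
# Hypothetical annotation namespace for illustration only.
EX = "http://example.org/annotation/"

def annotation_triples(ann_id, doc_uri, start, end, pro_id):
    """Serialize one gene-mention annotation as N-Triples lines:
    which document it came from, its character span, and the
    normalized PRO concept it denotes."""
    s = f"<{EX}{ann_id}>"
    return [
        f'{s} <{EX}document> <{doc_uri}> .',
        f'{s} <{EX}spanStart> "{start}" .',
        f'{s} <{EX}spanEnd> "{end}" .',
        f'{s} <{EX}denotes> <http://purl.obolibrary.org/obo/{pro_id}> .',
    ]

for t in annotation_triples("a1", "http://example.org/doc/PMC12345",
                            100, 105, "PR_000004403"):
    print(t)
```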
Options to Analyze Data
• Load into triple store and query
– Necessary for exploring queries with complex results over the entire graph
– Ex.
• Load individual files into an in-memory store and query in small groups
– Possible for exploring simple queries over many small regions of the graph: article-related
– Easier to federate
• Hybrid
– Some data not available from the RDF files, only from the triple store
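The second option, querying small groups of triples in memory, can be illustrated with a toy pattern matcher. A real deployment would use a proper store (e.g. AllegroGraph or an in-memory RDF library); this pure-Python sketch, with invented data, only shows the idea.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching a (subject, predicate, object) pattern;
    None acts as a wildcard, as in a basic SPARQL triple pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Illustrative annotation triples loaded from one small group of files.
triples = [
    ("ann1", "denotes", "PR_000004403"),
    ("ann1", "inDocument", "PMC12345"),
    ("ann2", "denotes", "PR_000004403"),
    ("ann2", "inDocument", "PMC67890"),
]

# Which documents mention this protein?
docs = [o for a, _, _ in match(triples, p="denotes", o="PR_000004403")
        for _, _, o in match(triples, s=a, p="inDocument")]
print(docs)   # ['PMC12345', 'PMC67890']
```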
Map-Reduce
• Inspired by Lisp functions "map" and "reduce"
– Map applies a function to each element of a list
• (a1, a2, … an), f(x) → (f(a1), f(a2), … f(an))
– Reduce combines lists by applying a function successively
• (a1, a2, … an), f(x,y) → f(f(f(a1,a2), a3), a4)
• (1, 2, … n), + → (((1 + 2) + 3) + 4)
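The two primitives above map directly onto Python's built-in `map` and `functools.reduce`:

```python
from functools import reduce

xs = [1, 2, 3, 4]

# Map: apply f to each element -> (f(a1), f(a2), ... f(an))
squared = list(map(lambda x: x * x, xs))
print(squared)   # [1, 4, 9, 16]

# Reduce: fold the list with f -> (((1 + 2) + 3) + 4)
total = reduce(lambda x, y: x + y, xs)
print(total)     # 10
```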
Map Reduce on HashMaps
• Map can be used to transform from one kind of (key, value) to a different kind of (key, value)
– (filename, text) → (gene, count)
• Reduce must have the same kind of key and value output as input. A call to reduce gets all values for a particular key.
– (gene, count) → (gene, count)
– (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5)
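The (filename, text) → (gene, count) flow above can be sketched in-process: map emits key-value pairs, a grouping step (the "shuffle") collects all values per key, and reduce sums each group. The gene matching here is a naive token lookup, purely for illustration.

```python
from collections import defaultdict

GENES = {"BRCA1", "TP53"}   # toy gene dictionary

def map_fn(filename, text):
    """(filename, text) -> list of (gene, 1) pairs."""
    return [(tok, 1) for tok in text.split() if tok in GENES]

def reduce_fn(gene, counts):
    """(gene, [count, ...]) -> (gene, total); same key/value types in and out."""
    return (gene, sum(counts))

docs = {"a.txt": "BRCA1 binds BRCA1 and TP53", "b.txt": "TP53 again TP53"}

grouped = defaultdict(list)              # the "shuffle" step
for fname, text in docs.items():
    for gene, n in map_fn(fname, text):
        grouped[gene].append(n)

results = dict(reduce_fn(g, ns) for g, ns in grouped.items())
print(results)   # {'BRCA1': 2, 'TP53': 3}
```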
Hadoop: a distributed map-reduce on maps or hash tables
• Can divide work into parallel-friendly tasks by key
• Distributes files over the network
• Reduces network traffic by performing computation where the data is
• Map is used to move from one key-value type to another: from (filename → contents) to (protein-protein pair → co-occurrence count)
• Reduce is used to collate results
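One way to run such a job is Hadoop Streaming, where the mapper and reducer are plain stdin/stdout scripts; this is an assumed setup for illustration (the project itself may have used native Java Hadoop). The snippet simulates the framework's map → sort-by-key (shuffle) → reduce stages in-process.

```python
from io import StringIO
from itertools import groupby

GENES = {"BRCA1", "TP53"}   # toy gene dictionary

def mapper(instream, outstream):
    """Emit a tab-separated (gene, 1) line for each gene token."""
    for line in instream:
        for tok in line.split():
            if tok in GENES:
                outstream.write(f"{tok}\t1\n")

def reducer(instream, outstream):
    """Read key-sorted (gene, count) lines; sum counts per gene."""
    pairs = (line.rstrip("\n").split("\t") for line in instream)
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        outstream.write(f"{gene}\t{sum(int(v) for _, v in group)}\n")

# Simulate the framework: map, sort lines by key (the shuffle), reduce.
mapped = StringIO()
mapper(StringIO("BRCA1 binds TP53\nBRCA1 again\n"), mapped)
shuffled = StringIO("".join(sorted(mapped.getvalue().splitlines(True))))
reduced = StringIO()
reducer(shuffled, reduced)
print(reduced.getvalue(), end="")
```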
Results
• PMC OA
• Medline Abstracts
Screen Shot
Thank You / Questions
• http://www.compbio.ucdenver.edu/bio-trends
• Co-authors
– William Baumgartner for data generation
– Kevin Livingston for RDF and Clojure help
• Grants and PIs
– Larry Hunter, UC Denver SOM
• NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
– Karin Verspoor, UC Denver SOM
• NIH R01 LM010120-01
– Gully Burns, ISI
• NSF 0849977