mining pubmed articles in watson explorer …...to understand watson explorer analytics components,...

Mining PubMed Articles in Watson Explorer Analytics Components

Cognitive Medical Computing for Research

Contents

Introduction

  National Center for Biotechnology Information

  JATS XML Format

  Artifacts you will need:

Installation

  Importing the Collection configuration archive

  Installing the Processing Engine Archive

  Installing the crawler plugin

    Set the default title field

  Confirm Index fields and Facets.

  Confirm Facet Processing Configuration

  Build the Collection

Mining the NCBI PubMed Collection for Insights

  Basic Principles: the Facet

  Basic Principles: the Correlation Index

  Programmatic Mining using REST

Author: Kameron Cole ([email protected])

Date: April 25, 2016

Version: 0.1

Introduction

National Center for Biotechnology Information

JATS XML Format

Artifacts you will need:

1) One of two Analytics Collection configuration archives:

a. (optional) - a configuration to be used with IBM BigInsights integration, for Hadoop-enabled indexing

b. Xxx is a configuration to be used without Hadoop processing.

2) The hci_annotator_01.pear Processing Engine Archive

3) The crawler plugin xml2field.jar archive (version 1, or version 2)

a. The first version creates index fields for all the JATS elements in each document and also creates a searchfields.xml file (this file, when complete, can reach gigabytes in size).

b. The second version only creates the index fields. This option is used with the Analytics Collection configuration.

4) collection.xml (optional)

5) Four zipped portions of PubMed documents

Installation

The installation consists of importing a preconfigured Analytics Collection, installing custom processing components, and “crawling”, “parsing”, and “indexing” the documents.

Importing the Collection configuration archive

All instructions assume that the WEX system is running.

___1. Open a command prompt as the *ESADMIN* user and enter the following command:

esadmin import -fname <configuration zip file> -cid <collection id> -name <new collection name>

___2. You may notice that two files are not found during the import:

a. The crawler plugin

b. The HealthAnalyticsTS.pear file for annotations (not shown), which may not be installed on your system

___3. Check that the collection was installed, and check components

a. Navigate in a browser to ESAdmin (http://<hostname>/ESAdmin). You

should see the collection:

b. Check the Text Analytics Archive installation. In ESAdmin, look in Parse

and Index->More->Text Processing Options

c. Check in System->Parser->Configure system text analysis engines-Text

Analytics Engines for gene01

d. If not installed, proceed to Installing the Processing Engine Archive

Installing the Processing Engine Archive

___1. In System->Parser->Configure system text analysis engines-Text Analytics

Engines, click the button: Add System Text Analysis Engine

___2. Configure the installation options

a. Give the TAE a name

b. Check Use processing engine archive

c. Browse to the path, either locally, or on another server

d. Verify installation

___3. Extract configuration files from the .pear archive

a. Using any archive tool, open the .pear file

b. Extract the entire /config directory

c. These files will be extracted

___4. Associate the TAE with the new Collection

a. Navigate to Collections->Parse and Index->More->Text Processing Options->Select a system text analysis engine

b. Install the mapping descriptor for CAS to index. It is in the /config folder you unzipped above.

Installing the crawler plugin

___1. Copy the JATS XML libraries into the <ES_NODE_ROOT>/logs directory. The libraries in this installation kit preserve the required directory structure exactly, so copy the contents of the kit's /logs directory into <ES_NODE_ROOT>/logs.

___2. Check the crawler plugin location in ESAdmin. Navigate to

Crawlers, and you should see:

___3. Hover over the lower right-hand corner, until you see the pencil

icon. You will edit the crawler properties (not crawlspace)

___4. IMPORTANT: This crawler is either a Unix file system crawler or a Windows file system crawler. If the crawler type is correct for your machine, simply adjust the path to the raw NCBI PubMed data. Otherwise, continue with step 5.

___5. If the crawler type does not match your machine, delete this crawler and create one that matches your system:

a. Delete the crawler.

b. Click the plus (+) button.

c. Choose the correct crawler for your file system.

d. In the crawler properties, give the crawler any name. Under

Advanced options, configure the crawler plugin. You should have

placed the plugin file in some directory – this location is used for

the Plug-in class path element. The Plug-in class name should be

the same.

e. Select the directory where you have the expanded data files to be

crawled:

f. Complete the rest of the wizard, accepting defaults.

Set the default title field

In the current scenario, the title field is populated from the <title> element in the PubMed XML. This element is not the right source. You will see in the raw XML that the PubMed developers have used <title> as an HTML-like markup field, rather than for the intuitive notion of a title:

This results in an unacceptable title display.

One way to fix this is to map the <article-title> element to the default title element in WEX, since <article-title> is generally the more desirable title field:

In the WEX ESAdmin web application, find the crawler and edit the crawlspace

Choose Edit Metadata

From the dropdown menu next to the _$Title$_ field, select the articletitle field as a

mapping. Note that this field is derived from the <article-title> xml field, in the code of

the crawler plugin you installed. Otherwise, it would not be available.

Confirm Index fields and Facets.

The index field and facets should have been configured when you did the initial import.

Please check the following.

___1. Navigate to the index fields pane.

___2. You should see quite a few index fields. To be sure that everything

is correct, click the Import Index Fields button.

___3. Select the searchfields.xml file that came with this installation kit. Do not use the searchfields.xml that you extracted from the /config directory of the .pear file.

___4. When you look in the Import column, all the boxes should be grayed out, indicating that all these fields are already in the configuration.

___5. Back in the Parse and Index pane, navigate to Analytics Resources->Facet Tree. You should see Facets like those below.

Confirm Facet Processing Configuration

With such a collection as this, some exceptional processing is required. This collection has

a large taxonomy – the ontology of cataloged facets. The taxonomy cache stores generated

facets that are used by the indexer component. Watson Explorer 10 provides three types of

taxonomy caches:

LRU: Partially in memory. Scalability is limited.

TrieL2O: Completely in memory. This cache type is 2-5 times faster than LRU.

DA: Completely in memory. This cache is 5-10% faster than TrieL2O.

Either the LRU cache or the DA cache should be used depending on the size of memory

assigned to the indexer. By default, the LRU cache is set. If you have enough memory,

use a complete cache, i.e. DA. If you want to process a large document set over a longer

duration with a smaller memory footprint, you can configure the system to use the LRU

cache that partially loads the taxonomy index in memory.

For this collection, ensure the following:

___1. In the $ES_NODE_ROOT/master_config/collection_ID.indexservice/collection.xml file, find the <index> element with a <type> element value of Facet. The XPath of this element is: /config/collection/indexes/index[type='Facet']

___2. Add the CacheType property to the <index> element and set its value to DA:

<index>
  <type>Facet</type>
  <path>facets</path>
  ...
  <property name="CacheType" value="DA"/>
</index>

___3. If you use the DA taxonomy cache, you should also set the number of partitions. Note that this is a different setting from the collection partitions mentioned previously. Because concurrent read/write operations can be performed for each partition, setting the number of partitions lets facet processing use the available CPU resources efficiently. The valid range for the number of partitions is 1 to 36.

a. To set the number of partitions of the DA taxonomy cache, add <property name="NumberOfCachePartitions" value="16"/> as a sibling element of the <property name="CacheType" value="DA"/> element that you inserted above.
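The two property edits above can also be scripted. The following is a minimal Python sketch, assuming collection.xml follows the /config/collection/indexes/index layout given by the XPath in step 1. The inline sample document and the function name set_facet_cache are illustrative and not part of WEX; back up the real file before applying anything like this to it.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for collection.xml; the real file contains much more.
SAMPLE = """<config><collection><indexes>
  <index><type>Facet</type><path>facets</path></index>
</indexes></collection></config>"""


def set_facet_cache(root, cache_type="DA", partitions=16):
    """Add or update the two cache properties on the Facet <index> element."""
    if not 1 <= partitions <= 36:
        raise ValueError("number of partitions must be between 1 and 36")
    # Relative form of /config/collection/indexes/index[type='Facet']
    index = root.find("collection/indexes/index[type='Facet']")
    if index is None:
        raise LookupError("no <index> element with <type>Facet</type> found")
    settings = (
        ("CacheType", cache_type),
        ("NumberOfCachePartitions", str(partitions)),
    )
    for name, value in settings:
        prop = index.find(f"property[@name='{name}']")
        if prop is None:
            # Property not present yet: create it as a child of <index>
            prop = ET.SubElement(index, "property", {"name": name})
        prop.set("value", value)
    return index


root = ET.fromstring(SAMPLE)
facet_index = set_facet_cache(root)
print(ET.tostring(facet_index, encoding="unicode"))
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE), followed by a write of the tree back to disk once the result has been checked.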

Build the Collection

Now you should turn on all the runtime processes, and create the collection!

___1. Turn on the Parser Indexer

a. Configure memory – these are just suggestions. You will have to

adjust according to your available resources.

b. Turn on runtime

___2. Turn on Crawler

___3. Turn on Searcher

a. Configure memory

b. Turn on runtime

You can monitor the progress of the Crawler by clicking the “eyeball” icon. The Parser Indexer should say waiting.

Navigate in a browser to the http://<hostname>/ui/analytics application. Make sure you select the new collection:

You should see the collection complete with Facet tree. Click on the Facet pane and navigate through the Facet tree. You should have Facet values.

You should also check one of the JATS/PubMed Facets, which are in

lower-case in the Facet tree.

Mining the NCBI PubMed Collection for Insights

Basic Principles: the Facet

The fundamental building block of Watson Explorer Analytics Components is the Facet. The name Facet is particularly appropriate, as opposed to, say, “keyword”. A Facet is similar to a category in a traditional ontological taxonomy. The Facets in this collection are created in two ways.

The first way is through a custom plugin, which creates Facets directly from the Journal Article Tag Suite (see endnote i). The Journal Article Tag Suite (JATS) is a standard (NISO Z39.96-2012) that defines a set of XML elements and attributes for tagging journal articles and describes three article models. JATS is a continuation of the NLM Archiving and Interchange DTD work begun in 2002 by NCBI.

The JATS tags are converted directly into Facets, which can participate in the statistical calculations of the Miner Application.

Here is a sample of the raw article text, with some of the tags highlighted:

<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<?properties open_access?>
<front>
  <journal-meta>
    <journal-id journal-id-type="nlm-ta">AAPS J</journal-id>
    <journal-id journal-id-type="iso-abbrev">AAPS J</journal-id>
    <journal-title-group>
      <journal-title>The AAPS Journal</journal-title>
    </journal-title-group>
    <issn pub-type="epub">1550-7416</issn>
    <publisher>
      <publisher-name>Springer US</publisher-name>
      <publisher-loc>Boston</publisher-loc>
    </publisher>
  </journal-meta>
  <article-meta>
    <article-id pub-id-type="pmid">20711763</article-id>
    <article-id pub-id-type="pmc">2976985</article-id>
    <article-id pub-id-type="publisher-id">9226</article-id>
    <article-id pub-id-type="doi">10.1208/s12248-010-9226-9</article-id>
    <article-categories>
      <subj-group subj-group-type="heading">
        <subject>White Paper</subject>
      </subj-group>
    </article-categories>
    <title-group>
      <article-title>NonClinical Dose Formulation Analysis Method Validation and Sample Analysis</article-title>
    </title-group>
    <contrib-group>
      <contrib contrib-type="author" corresp="yes">
        <name>
          <surname>Whitmire</surname>
          <given-names>Monica Lee</given-names>
        </name>
        <address>
          <email>[email protected]</email>
        </address>
        ...

In the Miner, these will appear as lower-case facets.
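To make the tag-to-Facet idea concrete, here is a small Python sketch that walks a JATS fragment and collects element text under lower-case field names. It is not the code of the xml2field.jar plugin. The naming rule (drop hyphens, lower-case the tag) is an assumption, inferred only from the fact that <article-title> surfaces as the articletitle field, and the function name jats_fields is hypothetical. The fragment is an abbreviated, well-formed cut of the sample above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated, well-formed fragment of the sample article shown above.
SAMPLE = """<article article-type="research-article"><front>
<journal-meta>
  <journal-title-group><journal-title>The AAPS Journal</journal-title></journal-title-group>
  <issn pub-type="epub">1550-7416</issn>
</journal-meta>
<article-meta>
  <article-id pub-id-type="pmid">20711763</article-id>
  <article-id pub-id-type="pmc">2976985</article-id>
  <title-group>
    <article-title>NonClinical Dose Formulation Analysis Method Validation and Sample Analysis</article-title>
  </title-group>
</article-meta>
</front></article>"""


def jats_fields(xml_text):
    """Map each text-bearing JATS element to a lower-case field name."""
    fields = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:  # skip pure container elements such as <front>
            # Assumed naming rule: <article-title> becomes "articletitle"
            fields[elem.tag.replace("-", "").lower()].append(text)
    return dict(fields)


fields = jats_fields(SAMPLE)
for name, values in sorted(fields.items()):
    print(name, values)
```

Each key in the resulting dictionary corresponds to one candidate lower-case Facet, and each list holds the values found for it in the document.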

The rest of the Facets are created using the IBM Health Care Accelerator Annotators, an asset now in its seventh year of maturity that uses advanced UIMA Text Analytics to turn complex semantic segments into quantifiable entities. In Studio, the depth of information of each annotator can be reviewed. For example, the “Problem” Annotator has multiple features, many of which were taken directly from the Unified Medical Language System (UMLS) Metathesaurus.

Each of the properties (called Features in UIMA) can be mapped to a Facet in the Miner. For example, Concept ID is a standard defined in SNOMED CT.

It allows for a technically powerful “grouping” of related medical conditions. In search engine terms, this maps directly onto the notion of a Facet. As a Facet, I can use the concept ID, instead of a keyword, to make my search results more inclusive without making them less precise. Compare:

http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/


http://biomedicalontologies.com/tag/concept-id/

Figure 1: Keyword Search, yields 2385

Figure 2: Faceted search, yields 24882

Note the search syntax for the Faceted search: /"Facet Name"/"Facet Value"/. One of the powerful features of the CID annotator is that the actual CID number need not occur in the text. The linguistic construction of the Annotator “derives” the appropriate CID.

Basic Principles: the Correlation Index

Of the three statistical indexes built into WEX’s Content Miner, the Correlation index is the most valuable, albeit the least understood. This statistical formula, along with the rest of what makes up the core analytics of this product, originally known by the Japanese name TAKMI, was included among IBM’s 100 Icons of Progress to celebrate its Centennial (see endnote ii).

It is not, in fact, a statistical correlation at all; rather, it describes a relationship in the density of occurrences within a data corpus (D) between two Facets, A and B:

The letter D represents the entire collection of documents, and the # symbol represents the number of documents in a set. The left and the right sides of the equation are equal to each other.

The right side of the equation is a ratio between the actual density of A ∩ B, which is #(A ∩ B)/#D, and the product of the density of A and the density of B, (#A/#D)(#B/#D); it represents the deviation of A and B from independence. The right side is more intuitive than the left side as a 2-dimensional index, i.e. the Facet Pairs View in the Miner interface.
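Written out from the verbal description above, the index can be sketched as follows. This is a reconstruction under two assumptions: that the joint density sits in the numerator (so that a value of 1 means independence and larger values mean a stronger association), and that the conditional form shown last is one plausible reading of the "left side". The notation corr(A, B) is introduced here for convenience only.

```latex
\[
\mathrm{corr}(A,B)
  \;=\; \frac{\#(A \cap B)/\#D}{(\#A/\#D)\,(\#B/\#D)}
  \;=\; \frac{\#(A \cap B)\cdot \#D}{\#A \cdot \#B}
  \;=\; \frac{\#(A \cap B)/\#B}{\#A/\#D}
\]
```

The last form reads as "the density of A inside B, divided by the density of A in the whole corpus", which is the comparison made in the example that follows.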

This can be illustrated as described in the following graphic. For example, although only 5% of all the

documents in a data corpus are about obtaining an instruction manual (A), this figure rises to 20% when

only personal computer-related documents (B) are examined.
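The arithmetic behind this example can be checked with a few lines of Python. This is a minimal sketch: the function name correlation and the document counts are hypothetical, chosen only so that the percentages match the 5% and 20% figures above, and the formula assumes the joint density is divided by the product of the individual densities.

```python
def correlation(n_a, n_b, n_ab, n_d):
    """Density-ratio index: (#(A and B)/#D) / ((#A/#D) * (#B/#D))."""
    if min(n_a, n_b, n_d) <= 0:
        raise ValueError("counts for A, B and D must be positive")
    # Algebraically the same ratio, rearranged to avoid intermediate fractions
    return (n_ab * n_d) / (n_a * n_b)


# Hypothetical counts matching the percentages in the example
n_d = 1000   # all documents in the corpus
n_a = 50     # 5% of all documents are about obtaining a manual (A)
n_b = 200    # documents related to personal computers (B)
n_ab = 40    # 20% of the B documents are also about A

print(correlation(n_a, n_b, n_ab, n_d))  # 4.0, i.e. 20% / 5%
print(correlation(50, 200, 10, 1000))    # 1.0, A and B are independent
```

A value of 1.0 means A occurs inside B exactly as often as it does in the corpus as a whole; values above 1.0 mean A is denser inside B.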

Sample Scenario: Gene Research as influenced by health care insurance

The ICD9 Code 199.1 has the following description:

We can build a data mining investigation around the relationship between this ICD9 Code and particular genes, as always with this data corpus, within the context of medical research. Initially, we use the Facets View, ranked by Correlation value, to see which genes are most densely related to this ICD9 code.

The FacetPairs View gives us a “heat map” of the relative densities of ICD9 Codes to genes:

We should investigate those cells which show yellow to red; the color indicates the strength of the Correlation Index value.

Finally, the Connections View gives us additional edges in a directed graph, as opposed to the 2-dimensional array provided in the FacetPairs View. We now have:

Edge 1: Gene -> ICD9

Edge 2: ICD9 -> Gene

Edge 3: Gene -> Gene

Edge 4: ICD9 -> ICD9

The strength of the Correlation value for each edge is indicated by the red-gradient.

Programmatic Mining using REST

The REST API provided with WEX is one of its most powerful features. Every aspect of

the Miner UI is created through REST queries. We will look at the “cube” query, which

has the following syntax:

http://<hostname>:8393/api/v10/search/facet/cube?collection=<col_id>&facets=[{"namespace":"keyword","id":"$<.facet_path>","count":50},{"namespace":"keyword","id":"$<.facet_path>","count":50}]&correlation=facetPairs&query=*:*&output=application/xml

For example, with the quotation marks percent-encoded:

http://x18n04.pbm.ihost.com/api/v10/search/facet/cube?collection=PubMed_NCBI&facets=[{%22namespace%22:%22keyword%22,%22id%22:%22$.icd9%22,%22count%22:50},{%22namespace%22:%22keyword%22,%22id%22:%22$.gene%22,%22count%22:50}]&correlation=facetPairs&query=*:*&output=application/xml

The same request with the brackets and braces also percent-encoded:

http://x18n04.pbm.ihost.com/api/v10/search/facet/cube?collection=PubMed_NCBI&facets=%5b%7b%22namespace%22:%22keyword%22,%22id%22:%22$.icd9%22,%22count%22:50%7d,%7b%22namespace%22:%22keyword%22,%22id%22:%22$.gene%22,%22count%22:50%7d%5d&correlation=facetPairs&query=*:*&output=application/xml
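Hand-encoding the facets parameter is error-prone, so it can help to build the URL in code. The sketch below uses only the Python standard library and only the parameters that appear in the URLs above. The function name cube_url and the localhost base URL are placeholders, and the sketch only constructs the request string; it does not contact a server.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs


def cube_url(base_url, collection, facet_paths, query="*:*", count=50):
    """Build a "cube" query URL for the given facet paths."""
    facets = [
        {"namespace": "keyword", "id": path, "count": count}
        for path in facet_paths
    ]
    params = {
        "collection": collection,
        # Compact JSON; urlencode percent-encodes quotes, brackets and braces
        "facets": json.dumps(facets, separators=(",", ":")),
        "correlation": "facetPairs",
        "query": query,
        "output": "application/xml",
    }
    return f"{base_url.rstrip('/')}/api/v10/search/facet/cube?{urlencode(params)}"


# Placeholder host and port; substitute your own WEX server.
url = cube_url("http://localhost:8393", "PubMed_NCBI", ["$.icd9", "$.gene"])
print(url)

# Round trip: decode the query string to confirm what the server would see
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["facets"][0]))
```

Passing a different query argument changes only the third dimension of the cube; the two facet dimensions stay the same.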



You will notice that the correlation value for each ICD9 code is 1.0. In Watson Explorer Analytics Components, this is equivalent to no correlation. The same is true of the gene codes returned. This is critical to understanding the “cube” query: it must have 3 dimensions (like a cube). The third dimension is the query parameter.

Note that the query issued was “*:*”. This is the reserved, all-documents query. So, the essence of the third cube dimension here is vacuous; it’s like saying “whatever.” Everything correlates with “whatever.”

If we change the value to something more specific, say, 199.1, we get

The subtle point here is that the query for 199.1 is not the same thing as a query for the ICD9 code “199.1”. The current query matches any string containing 199.1. To get the “real” answer for the correlation values of ICD9 and Gene with regard to ICD9 199.1, the query must be “faceted”:

http://<hostname:port>/api/v10/search/facet/cube?collection=PubMed_NCBI&facets=[{"namespace":"keyword","id":"$.icd9","count":50},{"namespace":"keyword","id":"$.gene","count":50}]&correlation=facetPairs&query=keyword::/ICD9/199.1&output=application/xml

These programmatic returns are analogous to the FacetPairs View shown in the previous section. The special consideration required when processing REST returns is that they are not ordered, as they are in the FacetPairs View. Thus, one must search through the returned XML.
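Once the facet pairs and their correlation values have been pulled out of the XML (the element names depend on the response schema, which is not reproduced here), restoring an order is straightforward. The following Python sketch assumes the values have already been extracted into simple records; the FacetPair type, the rank_pairs function, and the gene names and numbers are placeholders for illustration, not results from the collection.

```python
from typing import NamedTuple


class FacetPair(NamedTuple):
    row: str            # e.g. an ICD9 code
    column: str         # e.g. a gene
    correlation: float  # Correlation Index value for the pair


def rank_pairs(pairs, threshold=1.0):
    """Drop pairs at or below the threshold and sort the rest, strongest first."""
    kept = [p for p in pairs if p.correlation > threshold]
    return sorted(kept, key=lambda p: p.correlation, reverse=True)


# Placeholder values for illustration only; not results from the collection.
pairs = [
    FacetPair("199.1", "GENE_A", 1.0),
    FacetPair("199.1", "GENE_B", 3.2),
    FacetPair("199.1", "GENE_C", 7.5),
]

for pair in rank_pairs(pairs):
    print(pair.row, pair.column, pair.correlation)
```

The default threshold of 1.0 discards pairs that show no correlation, mirroring the cells one would ignore in the heat map.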

i http://jats.nlm.nih.gov/archiving/1.0/

ii http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/takmi/
