RASMUS WINTER QSPACE VISUALISATION OF MEDLINE ARTICLES
QSPACE VISUALISATION OF MEDLINE ARTICLES
A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER
FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY
OF ENGINEERING AND PHYSICAL SCIENCES
2005
Rasmus Winter
SCHOOL OF COMPUTER SCIENCE
TABLE OF CONTENTS
List of Figures
Abstract
Declaration
Copyright
Acknowledgements
The Author
1. Introduction
1.1 Context of the Study
1.2 Existing Software
1.3 Structure of Dissertation
2. Analysis
2.1 System Users
2.2 Requirements Analysis
2.3 Utilised Technology
2.3.1 MEDLINE
2.3.2 PubMed
2.3.3 MAVERIK 6.2
2.3.4 Q-SPACE
2.3.5 Qt 3.3
2.4 Programming Languages
2.5 Developmental Approach
3. System Design
3.1 PubMed Data Collecting and Processing
3.2 Visualisation
3.2.1 Q-SPACE Structure
3.2.2 BioQSpace Visualiser Structure
3.3 File Structures
3.4 GUI Design
4. Implementation
4.1 Abstract Comparison Attributes
4.2 pubmed.pl
4.2.1 Querying PubMed
4.2.2 Processing the Results
4.2.3 Saving the Results
4.3 BioQSpace Visualiser
4.3.1 Article Storage and Comparison Algorithms
4.3.2 GUI
4.3.3 MAVERIK Navigation
5. Testing and Evaluation
5.1 Testing
5.2 Evaluation
5.3 Installation
6. Conclusions
6.1 Summary
6.2 Performance Issues
6.3 Further Work
Glossary
Bibliography
Appendix A: E-Utility Results
Appendix B: Files used by pubmed.pl
LIST OF FIGURES
Figure 2.1: Minimal Spanning Trees
Figure 2.2: A set of tuples in its final configuration
Figure 3.1: A screenshot of the original Q-SPACE
Figure 3.2: Original Q-SPACE program structure
Figure 3.3: A MAV_qobj
Figure 3.4: A MAV_hull
Figure 3.5: A series of MAV_qobjs linked by a MAV_trail
Figure 3.6: The structure of BioQSpace
Figure 3.7: The basic graphical user interface design
Figure 4.1: The final graphical user interface design
Figure 4.2: The menu bar
Figure 4.3: Mark articles by attribute dialog
Figure 4.4: Help window
Figure 4.5: The word stems window
Figure 4.6: About BioQSpace window
Figure 4.7: The toolbar
Figure 4.8: Action of the ‘show labels’ and ‘use tooltips’ checkboxes
Figure 4.9: The advanced options dialog
Figure 4.10: The article information panel
Figure 4.11: The attribute weight sliders
Figure 6.1: Completion times of loading and reloading sets of articles
Figure 6.2: A method for parallelising the comparison algorithm
ABSTRACT
Upon querying the citation and biomedical article database PubMed, interpreting
and spotting relationships in the resulting list of articles can be difficult. Some sort of
visualisation to help with these processes is highly desirable, and to that end,
BioQSpace was designed and built. BioQSpace attempts to visualise the relationships
between the articles by rendering them as clustered sets of objects in a navigable 3D
environment.
The application will perform a PubMed search on a given query, parse the resulting
article list, calculate the relationships between each of the articles, and finally cluster
and colour them in 3D.
This thesis describes the design and development of BioQSpace, its usage, and a
critical analysis of the final product.
DECLARATION
No portion of the work referred to in this thesis has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.
COPYRIGHT
1. Copyright in text of this thesis rests with the Author. Copies (by any process)
either in full, or of extracts, may be made only in accordance with instructions
given by the Author and lodged in the John Rylands University Library of
Manchester. Details may be obtained from the Librarian. This page must
form part of any such copies made. Further copies (by any process) of copies
made in accordance with such instructions may not be made without the
permission (in writing) of the Author.
2. The ownership of any intellectual property rights which may be described in
this thesis is vested in the University of Manchester, subject to any prior
agreement to the contrary, and may not be made available for use by third
parties without the written permission of the University, which will prescribe
the terms and conditions of any such agreement.
3. Further information on the conditions under which disclosures and
exploitation may take place is available from the Head of the Department of
Computer Science.
ACKNOWLEDGEMENTS
The author wishes to express his thanks to Steve Pettifer and Anna Divoli for their
help, guidance and their many contributions to the direction and content of this
project.
THE AUTHOR
The author graduated from Manchester University in 2004 with the degree of
Bachelor of Science in Computer Science and Maths, and stayed on to study for a
Master’s degree in Computer Science, of which the work described in this thesis forms a
substantial part.
1. INTRODUCTION
1.1 Context of the Study
PubMed is a publicly accessible search and retrieval system for a number of medical
literature databases, the biggest being MEDLINE, the US National Library of
Medicine’s (NLM’s) database of biomedical citations and abstracts, which contains
over 12 million entries from 4,800 journals. The sheer number of articles resulting
from a PubMed search can often be overwhelming, and PubMed’s simple textual
presentation of them does not provide any clues as to the relationships between them.
This forces the user to read through the titles and abstracts of each one, or to check
the related articles links, when trying to find relevant articles – a task at which
humans are particularly inefficient.
A visual representation of the relationships between the articles would help the user
to focus their attention on groups of similar articles, instead of searching linearly
through a somewhat arbitrarily ordered list. This requires a metric for the similarity
of different articles, which can incorporate many factors such as any drugs, diseases
or biological terms their titles or abstracts have in common.
The purpose of this project is to explore how this similarity measure can be
calculated, and to implement it as an algorithm in an application that will display the
results of a query in a more structured way. The resulting application, BioQSpace,
presents the articles as points in 3D space, and allows the user to explore that space,
both in terms of 3D navigation and the raw comparison data from the articles, and to
adjust the comparison algorithm dynamically, placing more emphasis on particular
attributes of the comparison.
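As a minimal sketch of how a weighted similarity measure of this kind might be computed, the C++ fragment below scores two articles by the overlap of their term sets, one set per attribute, and combines the scores using user-supplied weights. The use of Jaccard overlap, the type TermSet and the function names are assumptions made for illustration; they are not taken from BioQSpace's comparison algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical container: one set of terms per comparison attribute
// (for example drugs, diseases, title words). Not BioQSpace's actual type.
using TermSet = std::set<std::string>;

// Jaccard overlap |A n B| / |A u B|, in [0, 1]. Two empty sets score 0,
// so an attribute with no data contributes nothing (an assumption).
double jaccard(const TermSet& a, const TermSet& b) {
    if (a.empty() && b.empty()) return 0.0;
    std::size_t shared = 0;
    for (const std::string& term : a) {
        if (b.count(term) > 0) ++shared;
    }
    const std::size_t all = a.size() + b.size() - shared;
    return static_cast<double>(shared) / static_cast<double>(all);
}

// Weighted mean of the per-attribute overlaps. Weights are expected to be
// non-negative; if they sum to zero the articles are treated as unrelated.
double weightedSimilarity(const std::vector<TermSet>& a,
                          const std::vector<TermSet>& b,
                          const std::vector<double>& weights) {
    assert(a.size() == b.size() && a.size() == weights.size());
    double total = 0.0;
    double weightSum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        total += weights[i] * jaccard(a[i], b[i]);
        weightSum += weights[i];
    }
    return weightSum > 0.0 ? total / weightSum : 0.0;
}
```

Raising the weight of one attribute pulls the combined score towards that attribute's overlap, which is the kind of emphasis the user is meant to control.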
In addition to simply assisting in locating relevant information, it is hoped that
BioQSpace can be used in a variety of research topics, such as: discovering previously
unnoticed relationships, by combining two separate search results into one
visualisation; understanding trends in treatment, by combining results for different
time spans; or exploring the relationships of syntax vs. semantics.
The emphasis of the project is on visualising data and determining relationships
within the data, not on natural language processing (NLP) in terms of the extraction
of important data from the articles, though some techniques will be explored. Much
research has been done in the area of text mining of biomedical literature [SGM05,
KSBG04, SJORB05, KBSP04], and the developed techniques can be fairly complex,
but discussions of them are kept brief.
1.2 Existing Software
Several pieces of software exist that try to explore and visualise relationships amongst
MEDLINE articles, or concepts discussed within them. They all use the PubMed
database query tools, and present results using either text or diagrams, in 2D or 3D.
The application with features most similar to those of BioQSpace is RefViz [Ref]. It
allows searching of ISI Web of Science and OCLC in addition to PubMed, and by
analysing keywords in titles, abstracts and notes it can produce 2D diagrams
of abstracts organised into themed clusters based on their content.
XplorMed [PBA01, Xpl] is a web-based online tool for exploring MEDLINE,
filtering the abstracts produced from a query to extract the ones that most fit the
user’s requirements. It uses a step-by-step interactive procedure, asking the user to
eliminate or elaborate the sets of articles resulting from the previous step, starting
with a standard PubMed query. Among the stages involved in narrowing down the
results are a categorisation using MeSH Terms (see section 2.3.2) and fuzzy binary
relation calculations for words in the same abstract. The user can perform the process
iteratively, to minimise the number of irrelevant results as much as possible.
Chilibot [CS04, Chi] is also web-based, and is used to generate graphical
representations of the relationships among user-provided terms, using PubMed and
NLP techniques. Chilibot searches PubMed for the user-provided terms (typically
gene or protein names), and analyses abstracts in which two or more of the terms
appear, to determine if they are related. If they are, the sentences are further analysed
to find out whether or not it is an interactive relationship, whether the relationship is
stimulatory or inhibitory, and to what extent the terms are expressed. From this
information, a 2D line graph is produced showing all the valid relationships between
the terms.
BiopathwayBuilder [LPP04, Bio] uses information extraction (IE) of MEDLINE
abstracts to build and display gene and protein interaction networks, and allows the
user to enhance the usefulness of the automatic IE results by manually removing or
amending relationships in a 3D environment.
None of these tools performs the same analysis as BioQSpace: that of calculating
relationships between all members of a data set. Visualisation tools often assume
some biased perspective of the data, trying to categorise the elements based on
arbitrarily imposed rules. BioQSpace is completely unbiased in the method it uses to
cluster the data, which is interactive and customisable by the user – weights and
thresholds can be used to change how much each of the attributes that make up the
comparison algorithm (used for clustering) contributes.
1.3 Structure of Dissertation
This chapter has described the motivation for the project, outlined the key features of
BioQSpace, and suggested some of its possible uses.
The next chapter outlines the target users and the requirements, reviews the existing
technologies on which BioQSpace was built, and discusses the development process
and specification.
Chapter three covers the design process of BioQSpace, and is followed in chapter
four by a discussion of the implementation process and the issues involved therein.
Chapter five describes the testing and evaluation of BioQSpace, concluding with the
installation process.
Chapter six contains concluding remarks, performance issues and ideas for further
work.
2. ANALYSIS
This chapter begins with the identification of the likely users of the system, and a
requirements analysis that lists the capabilities and features that the system is
expected to provide. Then there is a description of the technology relevant to the
project, followed by a brief description of how these existing tools and resources can
be combined and adapted for the purposes specific to BioQSpace. This is followed by
a discussion of the programming languages and the developmental process to be
used, with justification for the chosen options. The chapter is concluded with a
systems analysis and specification of the major features.
2.1 System Users
It is vital to identify the users of this type of system, as well as their abilities and
experience with computer systems. The target users of BioQSpace are medical
researchers or bioinformaticians, and it is unlikely that they all have considerable
computer knowledge. To that end, much of this work has been designed and
constructed with input from Anna Divoli, a member of staff in the bioinformatics department
at Manchester University, who has experience writing applications for the target
users and recommends that ease of use be a top priority.
2.2 Requirements Analysis
The expectations from the user are listed below, where the bold items are MUSTs
that are essential to the system and have to be completed for a successful project, and
the rest are SHOULDs – non-essential items that would be nice to have, but may not
be possible due to time constraints.
1. The system should include tools to query PubMed, parse the results to
extract important attributes, and visualise the relationships between the
articles in 3D.
2. It should be possible to save queries in different directories, and load them
at a later date.
3. Navigation of the visualisation should be possible (and intuitive) using a
standard mouse and/or a graphical user interface.
4. The user should be able to change the way the comparison values are
calculated by changing the weights for the attributes.
5. The user should be able to select and highlight articles using the mouse.
6. The user should be able to highlight articles that have certain attributes.
7. Further information about selected articles should be displayed.
8. The user should be able to remove articles from the visualisation.
9. A comprehensive help system should be available.
10. The two components of the system – querying/parsing and visualisation –
should be separate entities, but linked by an application that can execute both.
11. The user should be able to tweak the comparison algorithm to be more/less
thorough in the data it considers.
2.3 Utilised Technology
The following software, resources and tools are all to be incorporated into the final
system, in varying degrees.
2.3.1 MEDLINE
MEDLINE (Medical Literature Analysis and Retrieval System Online) is a
bibliographic database of citations to journal articles in life sciences, covering the
fields of medicine, nursing, dentistry, veterinary medicine, the health care system,
and the preclinical sciences, but with a particular focus on biomedicine [Meda]. The
referenced papers generally range from 1966 to the present, and total an estimated 12
to 15 million (depending on the reference source), collated from 4,800 journals, with
between 1,500 and 3,500 references added most days of the week, ten months per
year, since 2002.
The majority of entries in MEDLINE record the articles’ authors, title, abstract, date
of publication, and other pertinent information, though it is not required for all
possible fields to be filled for each citation. The list of fields can be seen at [Ovi].
Although MEDLINE does not contain the entire text of the articles it cites, the
titles, abstracts and other pertinent information are available, and the term ‘article’ is
used throughout this dissertation to refer to the collection of data associated with a
cited article.
MEDLINE cannot be directly searched for free, but its contents can be accessed
through several portals including PubMed [Puba], Infotrieve [Inf] or Medportal
[Medb], some of which are freely accessible, others requiring subscription fees.
2.3.2 PubMed
This is how the PubMed website [Puba] describes its service:
PubMed, available via the NCBI Entrez retrieval system, was developed by the National Center
for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at
the National Institutes of Health (NIH). ... PubMed was designed to provide access to citations
from biomedical literature. LinkOut provides access to full-text articles at journal Web sites and
other related Web resources. PubMed also provides access and links to the other Entrez
molecular biology resources.
PubMed is a database of citations to articles that encompasses MEDLINE, and
references several other databases, including OLDMEDLINE, which predates
MEDLINE and lacks some of its fields. In addition to the MEDLINE fields,
PubMed provides lists of related articles, links to external resources (such as the
article in full), and MeSH Terms for most of the articles.
The citations are manually indexed using terms from NLM’s controlled vocabulary,
MeSH (Medical Subject Headings) [Mes], which describe the contents of the article,
primarily to assist in searching PubMed. MeSH consists of a set of terms naming
descriptors in an alphabetical and hierarchical structure, with broad headings such as
‘Anatomy’ or ‘Mental Disorders’ at the top, and more specific headings lower down,
in an eleven-level hierarchy. MeSH is annually updated during November and
December, but at the time of writing there are 22,997 descriptors.
The related articles are calculated using an algorithm that computes similarity scores
based on MeSH term frequencies and frequencies of words/phrases in the titles and
abstracts of each of the articles, recording those with the highest score [Com].
PubMed can be searched in a web browser using NLM’s Entrez tool [Enta]. A basic
search can be performed simply using key concepts, such as treatment or disease
terms, but it can be refined by using search tags to search only certain fields, such as
the title ([ti]), the authors list ([au]), or the journal ([ta]). For a complete listing of the
available search tags, see [Pubb].
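To illustrate how search tags combine into a single query, the following sketch joins (term, tag) pairs with the Boolean operator AND. The function name buildQuery and the pair-based input are hypothetical; only the tags [ti], [au] and [ta] come from the description above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Joins (term, tag) pairs into one PubMed query string, for example
// "asthma[ti] AND smith j[au]". An empty tag leaves the term untagged,
// so it is not restricted to a single field.
std::string buildQuery(
        const std::vector<std::pair<std::string, std::string>>& parts) {
    std::string query;
    for (const auto& part : parts) {
        if (part.first.empty()) continue;        // skip blank terms
        if (!query.empty()) query += " AND ";    // combine terms with AND
        query += part.first;
        if (!part.second.empty()) query += "[" + part.second + "]";
    }
    return query;
}
```

The resulting string can be typed into the PubMed search box as it stands; it needs URL encoding before being sent programmatically.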
By default, a search produces a list of article summaries that contain the authors, the
article title, the publication journal and date, related articles, and a unique PubMed
identifier (PMID). A variety of other presentation formats are available that contain
varying amounts of information, from a single-line brief summary to a complete
description.
As well as the web browser interface, Entrez provides the Entrez Programming Utilities
(E-Utilities), which retrieve raw PubMed data in several of the presentation formats, in
HTML or XML, for parsing and use in other applications [Entb]. These include
ESearch, for performing a search on a query term; EFetch, for getting the information
about a particular article or set of articles; and ELink, for retrieving the list of related
articles for a particular article.
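As an illustration of how an ESearch request might be assembled, the sketch below percent-encodes a query and appends it to the ESearch address together with the db and retmax parameters. The helper names urlEncode and esearchUrl are hypothetical, and the base address shown is the one currently published by NCBI, which may differ from the address in use when this project was written. The sketch only builds the URL; it does not perform the request.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Percent-encodes every byte except the RFC 3986 unreserved characters.
std::string urlEncode(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            char buf[4];  // '%' + two hex digits + terminating NUL
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Builds an ESearch URL for the PubMed database, limited to maxResults hits.
std::string esearchUrl(const std::string& term, int maxResults) {
    assert(maxResults > 0);
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
           "?db=pubmed&retmax=" + std::to_string(maxResults) +
           "&term=" + urlEncode(term);
}
```

The identifiers returned by such a search would then be passed to EFetch or ELink in the same manner.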
2.3.3 MAVERIK 6.2
MAVERIK, the MAnchester Virtual EnviRonment Interface Kernel, is a publicly
available virtual reality system, capable of producing complex virtual environments
and interacting with 3D peripherals [Mav]. It is written in C, and provides several
core services vital for producing an interactive 3D environment, including the
following features which play important roles in this project:
• A complete set of default primitive objects.
• A spatial management system.
• High performance algorithms for culling, navigation and collision detection.
The primitive objects include boxes, cylinders, spheres, cones and polygons, and can
be rendered using any colours or textures available. MAVERIK graphical objects are
not restricted to the primitives; new ones can be created by writing functions for
drawing, intersections, bounding boxes and so on, and associating them with a
MAV_class. Typically, all objects will contain information related to dimension,
location and orientation, which can be recorded using MAVERIK data types such as
MAV_vector or MAV_matrix, and associated functions like mav_vectorRotate
or mav_matrixMult.
To save the programmer having to keep track of all graphical objects in an
environment, they are placed in a Spatial Management Structure (SMS) that controls
how and when they are rendered, and plays a central role in culling, object selection
and collision detection [CH02].
2.3.4 Q-SPACE
Q-SPACE is a tool for visualising sets of comparable items, consisting of objects
positioned in a three dimensional environment, and is written in a combination of C
and C++, utilising the graphical capabilities of MAVERIK [PC01, PCM01].
Q-SPACE consists of a single MAVERIK window containing graphical representations
of the objects and their relationships, and is navigated and controlled using a
3D-mouse and keyboard. To implement Q-SPACE for a particular set of data, one has to
write a mechanism to create instances of a subclass of QSERV_tuple, which needs
to provide data structures to store appropriate attribute data and an algorithm to
compare elements of the data set, which returns a value between 0 (elements are
completely different) and 1 (elements are identical).
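The contract described above can be sketched as follows. The class ComparableTuple is a hypothetical stand-in for the tuple base class, since the actual QSERV_tuple interface is not reproduced here, and ArticleTuple with its term-overlap score is likewise only an assumed example of a subclass that stores attribute data and returns a value between 0 and 1.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a Q-SPACE tuple base class; the real
// QSERV_tuple interface may look quite different.
class ComparableTuple {
public:
    virtual ~ComparableTuple() = default;
    // Returns 0 for completely different elements and 1 for identical ones.
    virtual double compare(const ComparableTuple& other) const = 0;
};

// Example subclass: an article reduced to a set of terms (an assumption).
class ArticleTuple : public ComparableTuple {
public:
    explicit ArticleTuple(std::set<std::string> terms)
        : terms_(std::move(terms)) {}

    double compare(const ComparableTuple& other) const override {
        const auto* article = dynamic_cast<const ArticleTuple*>(&other);
        if (article == nullptr) return 0.0;  // other kinds of tuple never match
        if (terms_.empty() && article->terms_.empty()) return 1.0;
        std::size_t shared = 0;
        for (const std::string& term : terms_) {
            if (article->terms_.count(term) > 0) ++shared;
        }
        const std::size_t all = terms_.size() + article->terms_.size() - shared;
        const double score =
            static_cast<double>(shared) / static_cast<double>(all);
        assert(score >= 0.0 && score <= 1.0);
        return score;
    }

private:
    std::set<std::string> terms_;
};
```

Keeping the comparison behind a virtual function means the clustering code needs to know nothing about what an article contains.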
Q-SPACE uses the tuple class’ comparison algorithm to pair-wise compare all
created instances, storing their comparison values in a triangular matrix. These
values are used to make an ordered list of comparisons, with the most similar pairing
at the head, and the least similar at the tail. This list is used to create a minimal
spanning tree (MST) of the tuples (see figure 2.1), using the similarity values as edge
weights, which is used to ‘colour’ them in groups, where a tuple and its parent are
determined to be in different groups if their similarity is below a given threshold. A
large threshold yields a small number of highly populated groups; a small threshold
splits the tree up into a large number of sparsely populated groups. A tuple that forms
a new group is said to be the dominant tuple in that group.
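The steps in this paragraph can be sketched in C++ as below. This is an illustration under stated assumptions rather than Q-SPACE's code: the similarities are held in a flat, strictly upper triangular array; the tree is built with Kruskal's algorithm, taking the most similar pairs first (the source does not name the algorithm used); the tree is walked from tuple 0, which is an arbitrary choice of root; and the names triIndex, Edge and colourGroups are hypothetical. A tuple starts a new group when its similarity to its parent is below the threshold, as stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Position of pair (i, j), i < j, in a strictly upper triangular matrix
// of n items stored row by row in a flat array.
std::size_t triIndex(std::size_t i, std::size_t j, std::size_t n) {
    assert(i < j && j < n);
    return i * n - i * (i + 1) / 2 + (j - i - 1);
}

struct Edge { std::size_t a, b; double similarity; };

// Union-find root lookup with path halving.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
}

// Returns one group id per tuple.
std::vector<int> colourGroups(const std::vector<double>& tri,
                              std::size_t n, double threshold) {
    assert(tri.size() == n * (n - 1) / 2);
    // Ordered list of comparisons, most similar pairing at the head.
    std::vector<Edge> edges;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            edges.push_back({i, j, tri[triIndex(i, j, n)]});
    std::stable_sort(edges.begin(), edges.end(),
                     [](const Edge& x, const Edge& y) {
                         return x.similarity > y.similarity;
                     });
    // Kruskal: accept an edge unless it would close a cycle.
    std::vector<std::size_t> parent(n);
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    std::vector<std::vector<Edge>> tree(n);   // tree[v]: edges leaving v
    for (const Edge& e : edges) {
        const std::size_t ra = findRoot(parent, e.a);
        const std::size_t rb = findRoot(parent, e.b);
        if (ra == rb) continue;
        parent[ra] = rb;
        tree[e.a].push_back(e);
        tree[e.b].push_back({e.b, e.a, e.similarity});
    }
    // Walk the tree; a child starts a new group (and becomes its dominant
    // tuple) when its similarity to its parent is below the threshold.
    std::vector<int> group(n, -1);
    int nextGroup = 0;
    std::vector<std::size_t> pending;
    if (n > 0) { group[0] = nextGroup++; pending.push_back(0); }
    while (!pending.empty()) {
        const std::size_t node = pending.back();
        pending.pop_back();
        for (const Edge& e : tree[node]) {
            if (group[e.b] != -1) continue;   // already coloured
            group[e.b] = (e.similarity < threshold) ? nextGroup++ : group[node];
            pending.push_back(e.b);
        }
    }
    return group;
}
```

For four tuples with similarities 0.9 for pair (0,1), 0.8 for (2,3) and at most 0.3 elsewhere, a threshold of 0.5 yields two groups, {0, 1} and {2, 3}.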
Figure 2.1: Minimal Spanning Trees.
Given a graph of vertices with weighted edges (a), a spanning
tree is a subgraph that contains all of the vertices and is a tree
((b), (c) and (d)). For it to be an MST, the sum of the weights
of edges must be the minimum for all possible spanning trees
(d) [Gou03].
Once the colouring is complete, the tuples are displayed as cubes (MAV_boxes) in
their group colour, with all members of a group encapsulated by a semi-transparent
minimal convex hull (in the same colour), and with the dominant tuples of different
groups joined by lines. The tuples are iteratively positioned in 3D using a force
placement algorithm that exploits the MST structure by attracting the dominant
tuples from each colour group, then repelling the tuples in each group away from
their dominant tuple. The tuples all begin at the origin, then move outwards,
‘organising themselves’ into linked groups until a stable formation has been reached,
where the tuples are separated by a distance proportional to their similarity. Figure
2.2 shows Q-SPACE’s default data set in its final configuration.
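The details of the force placement algorithm are not given here, but the idea of iterating until a stable formation is reached can be illustrated with a simple spring relaxation. Everything in this sketch is an assumption: the types Vec3 and Spring, the function relax, and the use of one spring with a fixed rest length per linked pair. It does not use MAVERIK data types and does not reproduce Q-SPACE's attract and repel rules.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// A spring between tuples a and b that is at rest when they are
// restLength apart (how rest lengths are derived is left open here).
struct Spring { std::size_t a, b; double restLength; };

// One relaxation pass. Each spring moves both of its end points along the
// line joining them, correcting a fraction `rate` of the length error.
// Returns the largest length error seen, so a caller can repeat the pass
// until the value falls below a tolerance (a stable formation).
double relax(std::vector<Vec3>& pos, const std::vector<Spring>& springs,
             double rate) {
    double worst = 0.0;
    for (const Spring& s : springs) {
        Vec3& p = pos[s.a];
        Vec3& q = pos[s.b];
        const double dx = q.x - p.x, dy = q.y - p.y, dz = q.z - p.z;
        const double length = std::sqrt(dx * dx + dy * dy + dz * dz);
        // Coincident points (all tuples start at the origin) have no
        // direction, so an arbitrary one along the x axis is used.
        double ux = 1.0, uy = 0.0, uz = 0.0;
        if (length > 1e-9) { ux = dx / length; uy = dy / length; uz = dz / length; }
        const double error = length - s.restLength;  // > 0: too far apart
        worst = std::max(worst, std::fabs(error));
        const double shift = 0.5 * rate * error;     // each end moves half
        p.x += ux * shift; p.y += uy * shift; p.z += uz * shift;
        q.x -= ux * shift; q.y -= uy * shift; q.z -= uz * shift;
    }
    return worst;
}
```

Calling relax repeatedly with a rate below 1 lets springs that share a tuple settle gradually instead of oscillating.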
[Figure 2.1 diagram: (a) Graph, with edge weights 8, 6, 3, 10, 5 and 1; Spanning Trees: (b) Tree weight = 12, (c) Tree weight = 24, (d) Tree weight = 10]
Figure 2.2: A set of tuples in its final configuration.
The semi-transparent clusters contain all the tuples from the
same group, with separate clusters joined by lines.
2.3.5 Qt 3.3
Qt is a cross-platform C++ Graphical User Interface (GUI) toolkit designed and
maintained by Trolltech [Tro]. Qt has formed the basis of thousands of applications
worldwide, and is the basis of the KDE Linux desktop environment [Qta]. Although
version 4 of Qt is now available, it is a fairly substantial redesign compared to version
3.3 and is incompatible with MAVERIK in its current form, which is unlikely to
change.
Qt offers a large collection of object-oriented graphical widgets such as buttons, labels
and dialogs, as well as tools for handling streams, databases and threads.
2.4 Programming Languages
In order to be able to adapt Q-SPACE, which is written in a combination of C and
C++, the code for features specific to BioQSpace must also be in C and C++. The
use of C++ allows Qt to be used to build the GUI, and enables the use of the
Standard Template Library (STL) [SGI], a collection of container classes, algorithms
and iterators, most of which are templates and so can be used with any data type.
The choice of language for the visualisation, however, does not restrict the choice of
language for the information extraction part of the system, and for this Perl was
chosen, due to its simple and efficient text and regular expression handling
capabilities.
2.5 Developmental Approach
To produce an application capable of visualising the relationships between
MEDLINE abstracts, all of the components described above have to be integrated.
PubMed will be queried to gather the MEDLINE abstracts and related fields, which
can then be parsed to extract all useful attributes and stored in a subclass of
QSERV_tuple, which in turn can be integrated into a version of Q-SPACE in a Qt
GUI for visualisation.
Q-SPACE forms the bulk of the system, but is a complete application in itself, and as
such, the number of possible developmental approaches is restricted. When building
an application from the ground up, the Waterfall process is a desirable and efficient
methodology [Kol05]. This requires having a complete specification of every element
of the system, implementing each of them separately, then combining and testing
them as a whole. As much of the software for this system has already been written,
and changes to the original code are undesirable, the Waterfall process is not suitable,
so the Incremental Prototyping model is used instead [Pro]. This consists of
producing a basic functioning application, then incorporating new features according
to the list of requirements, or as they are thought of. This approach allows plenty of
feedback as the project progresses, ensuring that each feature performs exactly as
intended, without conflicts between them. In addition, if at some point in the
production process it is decided that a new comparison attribute is needed, say, it
should not be difficult to incorporate it.
In summary, BioQSpace should provide a tool to query PubMed using E-Utilities,
parse and process the results, and save comparison attributes for the articles in local
files. It should also provide a tool to visualise the contents of those files in a navigable
and interactive 3D environment. The details of these tools are elaborated in the next
chapter.
3. SYSTEM DESIGN
This chapter covers the design of the two parts of the system: the PubMed data
collecting and processing, and the visualisation, which are to be implemented as two
standalone applications. The design discussion covers program and file structures, as
well as GUI design.
3.1 PubMed Data Collecting and Processing
To satisfy parts of user requirements 1 and 2, the PubMed interaction part of the
system needs to perform three jobs:
• Query PubMed with the user’s search term and a maximum number of
results, and store all of the required data locally, in a directory specified by the
user.
• Process the data, extracting key words and phrases to use as comparison
attributes.
• Save the processed data in a format that can be efficiently used by the
visualisation application.
These tasks indicate two appropriate structural decisions: that the script should take
three arguments (target directory, maximum number of results and search query);
and that the jobs should be separated into three subroutines, which will allow each
task to be changed, tested and evaluated independently of the others.
The errors anticipated in this script are those related to disk I/O and the internet
connection; all should be caught and dealt with appropriately, informing
the user what went wrong.
All of this will be contained in one Perl script, pubmed.pl.
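The structure described above can be sketched as follows. This is a minimal outline written in Python for brevity rather than Perl; the function names (query_pubmed, process_data, save_data, main), the usage message and the exit codes are illustrative assumptions, not the contents of pubmed.pl.

```python
import sys


def query_pubmed(target_dir, max_results, query):
    """Placeholder (hypothetical): fetch raw records and store them in target_dir."""
    return []


def process_data(raw_records):
    """Placeholder (hypothetical): extract comparison attributes."""
    return raw_records


def save_data(target_dir, processed):
    """Placeholder (hypothetical): write the files read by the visualiser."""
    return None


def main(argv):
    # Expect exactly: target directory, maximum number of results, search query.
    if len(argv) != 3:
        print("usage: pubmed <target_dir> <max_results> <query>", file=sys.stderr)
        return 2
    target_dir, max_text, query = argv
    try:
        max_results = int(max_text)
    except ValueError:
        print("error: maximum number of results must be an integer", file=sys.stderr)
        return 2
    try:
        raw = query_pubmed(target_dir, max_results, query)
        processed = process_data(raw)
        save_data(target_dir, processed)
    except OSError as err:
        # OSError covers disk I/O failures; urllib's URLError also derives from it.
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0
```

Keeping the three jobs in separate functions means each can be replaced or tested on its own, which is the motivation given in the text.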
3.2 Visualisation
The general structure of the Q-SPACE and BioQSpace applications will now be
described. Details of the variables and functions associated with each of the
mentioned components have been deliberately omitted here, as they will be explained
in later sections if and when they are deemed necessary.
3.2.1 Q-SPACE Structure
Q-SPACE consists of a curious mix of C and C++ files (and associated headers) that
uses true object-oriented C++ design in some places and pseudo-object-oriented C
design in others, the latter especially when directly interacting with MAVERIK,
which is written purely in C.
A list of computer file information is used for the default data set, where the tuples
are created from the output of the Linux command ls -l, which has been saved to a
text file, files.txt. The attributes used are the file name, file type, directory and size.
A screenshot of the original application is shown in figure 3.1. The structure of the
program, with the interactions and hierarchies of the Q-SPACE C structs and C++
classes, is shown in figure 3.2, and the important elements are summarised below it.
Figure 3.1: A screenshot of the original Q-SPACE.
Figure 3.2: Original Q-SPACE program structure, interactions and hierarchies.
[Diagram components: linked list interface (DEVA_link, DEVA_list); logging tools
(DEVA_traceString, DEVA_tracer); hull; twine; MAV_m2n; MAVERIK modules (MAV_hull,
MAV_hiliteBox, MAV_spangly, MAV_tooltip, MAV_trail, MAV_qobj, MAV_qpit);
QSERV_tStore; QSERV_tuple; QSERV_tupleFile; list of files (files.txt); qserver;
qpit_main.]
DEVA_link, DEVA_list
These are templates for creating linked lists of objects, and are used throughout the
application.
DEVA_traceString, DEVA_tracer
DEVA_tracer is a class with static methods for logging the progress of the
application, either to stderr or to a file, primarily for debugging purposes. It uses
DEVA_traceStrings to easily concatenate primitive data
types to strings.
There are 13 different types of logging message, including warnings, fatal errors and
sanity messages, and any or all of them can be output by setting the mode
appropriately.
MAVERIK Modules
These are pseudo-object-oriented classes that define how they are created, are drawn,
deal with intersections and are deleted (amongst others) by registering function
callbacks with MAVERIK.
MAV_qobj
This is a visual representation of a single tuple, and is simply a coloured cube (figure
3.3(a)), although it has three attributes that can alter its appearance: if it is selected, it
has a rotating white box (MAV_hiliteBox) drawn around it (figure 3.3(b)); if it is
marked, it has flashing white lines (MAV_spangly) emanating from it (figure 3.3(c)); if
it is deleted, then it is not drawn.
MAV_hull
This is a semi-transparent minimal convex hull that encapsulates all MAV_qobjs in
the same group. MAV_hull does not construct the hull itself, but uses code written
by Joseph O'Rourke, John Kutcher, Catherine Schevon and Susan Weller to
calculate the minimum set of vertices needed to define the faces of the hull, and in what
order, and then draws planes for each face. A MAV_hull is shown in figure 3.4.
Figure 3.3: A MAV_qobj (a) unselected and unmarked, (b)
selected using a MAV_hiliteBox, (c) marked using a
MAV_spangly.
Figure 3.4: A MAV_hull, surrounding all of the MAV_qobjs
in the same group.
MAV_tooltip
This is intended to be a rectangle that appears at the position of the cursor if it pauses
long enough over a MAV_qobj or MAV_hull, and which contains information about
that object. However, because version 6.2 of MAVERIK does not implement some required
text-related rendering functions, nothing actually appears.
MAV_trail
A trail (figure 3.5) is used to track the visited tuples. When a MAV_qobj is clicked
on, it is added to the end of the trail’s list of MAV_qobjs. A MAV_trail is drawn
by using the twine library, written by James Marsh, to calculate intermediate points
that smoothly interpolate the 3D locations of consecutive MAV_qobjs in the list,
and then joining up those points with yellow lines to
produce a continuous curve that passes through them all.
Figure 3.5: A series of MAV_qobjs linked by a MAV_trail:
a smooth 3D curve.
MAV_qpit
A MAV_qpit is a container for all of the MAV_qobjs in the visualisation, each of
which has its own drawing function, so its draw callback does no actual drawing, but
instead recalculates the forces between the components in order to reposition them in
every loop.
MAV_m2n
This is a tool for ‘flying’ from the current location to a clicked-on MAV_qobj, and
remaining focused on it until another MAV_qobj is selected. It prevents the user
from selecting another MAV_qobj while a flight is in progress.
QSERV_tuple
This is an abstract class with virtual functions that have to be overridden for the
specific uses of subclasses. The most important virtual functions are compare and
compareAttribute, which define how two instances of the same type of tuple
should be compared, using their attributes.
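The role of the abstract tuple class can be illustrated with a small sketch. It is written in Python rather than C++, and the class names Tuple and FileTuple, the method signatures and the equal-weight average in compare are all assumptions made for illustration; they are not the Q-SPACE declarations.

```python
from abc import ABC, abstractmethod


class Tuple(ABC):
    """Hypothetical stand-in for an abstract tuple: subclasses define comparison."""

    @abstractmethod
    def compare_attribute(self, other, name):
        """Similarity in [0, 1] between self and other for one named attribute."""

    @abstractmethod
    def compare(self, other):
        """Overall similarity in [0, 1] between self and other."""


class FileTuple(Tuple):
    """Toy subclass that compares file information by exact match."""

    def __init__(self, name, file_type, directory):
        self.values = {"name": name, "type": file_type, "directory": directory}

    def compare_attribute(self, other, name):
        return 1.0 if self.values[name] == other.values[name] else 0.0

    def compare(self, other):
        # Assumption: every attribute counts equally towards the overall value.
        names = list(self.values)
        return sum(self.compare_attribute(other, n) for n in names) / len(names)
```

Because the base class cannot be instantiated, any new data set only has to supply a subclass with these two methods.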
QSERV_tupleFile
This is a particular subclass of QSERV_tuple that compares file information, used
for Q-SPACE’s default data set.
QSERV_tStore
This contains all of the (subclasses of) QSERV_tuples in the data set, and provides
all the necessary functions for clustering the data and changing the appearance
attributes of the MAV_qobjs.
qpit_main
This is the main body of the program. It begins by initialising the progress monitoring
tools, MAVERIK and all of the modules, then creates a MAV_qpit, a MAV_trail
and a MAV_tooltip and adds them to the main MAVERIK SMS.
When the MAV_qpit is made, it creates a QSERV_tStore, and adds to it a
QSERV_tupleFile instance for each line of files.txt. The tuples are then
clustered and assigned a MAV_qobj to represent them visually.
It then enters the MAVERIK infinite rendering loop which acts upon any input
events (e.g. from the mouse or keyboard), updates the hulls for each of the groups in
the MAV_qpit, and draws everything in the SMS.
3.2.2 BioQSpace Visualiser Structure
The original structure of Q-SPACE should be retained as much as possible, but with
a new interface and additional features. The major programming changes required to
adapt Q-SPACE to satisfy the user requirements should include:
• Construction of the GUI as part of the existing qpit_main initialisation
function.
• Addition of new functions to qpit_main to perform the actions associated
with the interactive GUI widgets.
• Replacement of the existing 3D-mouse navigation system with a 2D-mouse
and/or GUI navigation system.
• Replacement of the DEVA_link and DEVA_list classes, examples of
obscure legacy code, with suitable classes from the STL.
• Writing of a new subclass of QSERV_tuple, QSERV_tupleArticle, to
deal with all of the data associated with MEDLINE articles.
• Writing of a function in MAV_qpit to read and parse the results from
pubmed.pl to create QSERV_tupleArticle instances.
• Rewriting of all user feedback code, so that messages are displayed in a dialog
as part of the GUI, in addition to being written to the terminal.
The ways in which these changes are implemented are described in chapter 4.3. The
amended structure diagram is shown in figure 3.6.
Figure 3.6: The structure of BioQSpace.
The new or substituted elements are in black, and those that
remain from the Q-SPACE structure are in grey.
[Diagram components: hull; twine; MAV_m2n; logging tools (DEVA_traceString,
DEVA_tracer); MAVERIK modules (MAV_hull, MAV_hiliteBox, MAV_spangly, MAV_tooltip,
MAV_trail, MAV_qobj, MAV_qpit); QSERV_tStore; QSERV_tuple; QSERV_tupleArticle;
files output from pubmed.pl; Qt widgets; qserver; qpit_main.]
3.3 File Structures
Once pubmed.pl has been run, the output files can be parsed by the visualiser as
many times as the user wishes, so it is sensible to try to minimise the amount of work
required for parsing by organising the data intelligently. Minimising the size of the
output files is also a desirable feature.
The output files will consist of a main file, qspace_main.txt, that contains all of
the data for all of the articles returned from PubMed, and a file for each attribute,
with names of the form qspace_[attribute_name].txt, that list all
encountered examples of the attribute. Instead of listing in the main file all of an
article’s attribute examples in full, the line number of the corresponding attribute file
can be listed instead. As an example, let one of the attributes be disease names, and
the 10th disease name (alphabetically) be Alzheimer’s. Any article that contains the
word Alzheimer’s can then save the number ‘10’ instead of the word ‘Alzheimer’s’ in
qspace_main.txt. Not only will this drastically reduce the file size of
qspace_main.txt, but MAV_qpit can also read in all of the attribute files and index
their contents in arrays before parsing the main file, so that the values can then be
quickly extracted from the arrays.
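The indexing scheme can be sketched in a few lines. The sketch below is in Python rather than the C++ of the visualiser, and the function names load_attribute_values and resolve are hypothetical; it assumes, as in the Alzheimer’s example, that line numbers start at 1.

```python
def load_attribute_values(lines):
    """Index an attribute file's lines so that line number n is values[n]."""
    # Slot 0 is a dummy entry so that 1-based line numbers can be used directly.
    return [None] + [line.rstrip("\n") for line in lines]


def resolve(line_numbers, values):
    """Turn the line numbers stored for an article back into attribute values."""
    return [values[n] for n in line_numbers]
```

For instance, with an attribute file whose second line is Alzheimer’s, resolve([2], values) yields that name through a single list lookup, with no string searching.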
Attributes that are numbers or involve scores should be normalised to lie between 0
and 1. For a given set of articles, normalisation need only be performed once, so
should be done before the data is saved. This way, the visualiser does not need to do
any normalisation of the raw attribute values.
3.4 GUI Design
A good GUI design follows a number of sound principles. The list below is an
abridged version of the guidelines from [IBM]:
• Simplicity: Don’t compromise usability for function. Keep the interface
simple and straightforward, minimising clutter. Common functions should be
immediately apparent, keeping advanced options less obvious.
• Support: Place the user in control and provide proactive assistance. Do not
restrict the user in the number of ways they can complete tasks: provide
alternative routes that they may be more comfortable with. Provide assistance
with achieving tasks, but in an unobtrusive way.
• Familiarity: Build on users’ prior knowledge. If the GUI performs similarly
to software the user is familiar with, and the behaviour is consistent across the
GUI, the interface will be easier to learn and operate. The design should be
based around what the user would expect to find.
• Obviousness: Make objects and their controls visible and intuitive. Using
real-world representations for application functions, such as an icon of a trash
can for a discard function, can help familiarise the user with the associations
between the controls and their functionality.
• Encouragement: Make actions predictable and reversible. Allow the user to
explore the tools the application provides, without fear of being unable to
recover a previous state. Do not bundle actions together in a way the user may
not anticipate.
• Satisfaction: Create a feeling of progress and achievement. Reflect the
results of actions immediately, instead of forcing the user to wait. If this is not
practical, communicate the progress of the process, or offer a preview of a
likely outcome of the action.
• Availability: Make all objects available at all times. Users should be able to
use all of their objects in any sequence and at any time. Restrictions on the
availability of objects can frustrate the user and should be avoided.
• Safety: Keep the user out of trouble. Every attempt should be made to
prevent the user from being able to cause errors. In cases where errors are out
of the system’s control, two-way communication is necessary to clarify what
the user intends, or to remedy the problem.
• Versatility: Support alternative interaction techniques. Allow the user to
choose a method of interaction that suits them best. This includes input
methods, including the mouse, keyboard, microphone or stylus, and output,
such as spoken instruction.
• Personalisation: Allow users to customise. Customisation of colour schemes
and backgrounds can help make an interface comfortable and familiar.
Providing the ability to change default values can enable them to save time
and effort.
• Affinity: Bring objects to life through good visual design. The final result
should be an intuitive and familiar representation that is second nature to
users.
Figure 3.7: The basic graphical user interface design.
[Layout: menu bar at the top; toolbar to the left of the MAVERIK viewport; abstract
attribute data and weights of attributes below them; status bar at the bottom.]
Not all of these concepts are applicable to the BioQSpace GUI, but considerable
efforts will be made to satisfy them where appropriate. To this end, the general layout
in figure 3.7 should be adhered to. The components are:
• Toolbar: Common functionality should be placed here, organised in an
intuitive manner. In many applications, this is positioned on the left of the
window, so this GUI should do the same.
• Menu Bar: More advanced options should be accessible through menu items.
The items should have intuitive shortcut keys associated with them.
• MAVERIK viewport: This is the focus of the GUI, where practically all of
the actions will take place, so should comprise the bulk of the window. When
the window is resized, this is the only component that should grow in both
directions.
• Abstract attribute data: When an object in the visualisation is clicked on, the
user will expect to find out more information about it. That information
should appear in this box in a clear, easy-to-read format.
• Weights of attributes: One of the main requirements of the system is that the
user can adjust the way the abstracts are compared, via changing the weights
of the attributes. This action should have a clear interface, physically
separated from the rest of the available interactive actions.
• Status bar: Any progress or status messages that do not require user
acknowledgement should appear here.
4. IMPLEMENTATION
This chapter describes the main issues involved in the building of BioQSpace, using
the design concepts from the previous chapter, and how the important functions were
implemented in the two parts of the system: pubmed.pl and the visualiser. As the
incremental method was used to develop the system, the order in which elements are
described in this chapter does not reflect the order in which they were implemented;
the descriptions are of the final result.
4.1 Abstract Comparison Attributes
To compare two abstracts, a set of attributes is needed, each of which can be
compared individually in a suitable way, and the comparison values then combined. The 15
considered attributes and their meanings are listed below, ordered by decreasing
importance, as judged by the author.
• Title words. All of the words in the title with their associated importance
scores.
• Abstract words. All of the words in the abstract with their associated
importance scores.
• Title & abstract words combined. All of the words in both the title and the
abstract with their associated importance scores.
• MeSH Terms. The MeSH terms that were used to classify the article in
PubMed.
• Drugs. Any drugs mentioned in the title or abstract.
• Diseases. Any diseases mentioned in the title or abstract.
• Function Terms. Any words or phrases in the title or abstract that refer to the
functionality of biological entities.
• Structure Terms. Any words or phrases in the title or abstract that refer to the
structure of biological entities.
• Location Terms. Any words or phrases in the title or abstract that refer to the
locations where biological entities act or are acted upon.
• User-defined Terms. Any words or phrases in the title or abstract that are in a
custom list provided by the user.
• PubMed Related Articles. The list of related articles as calculated by PubMed
(used as part of two different attributes).
• Publication Date. The year of publication.
• Authors. The list of contributing authors.
• Journal. The journal that the article was published in, the publishing house
the journal belongs to, and any portals that the journal can be accessed
through.
4.2 pubmed.pl
The three tasks pubmed.pl performs – querying PubMed, processing the results and
saving the results – are now described in more depth. It is assumed that the three
arguments (target directory, maximum number of results and search query) have all
been provided. If any of them is missing, the script exits, informing the user of the
missing arguments.
4.2.1 Querying PubMed
Using the ESearch E-Utility, a query is performed using the user’s search term and
maximum number of results. The URL takes the form
http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=[max number]&usehistory=n&term=[search term]
This produces an XML-formatted file containing the PMIDs of the abstracts that
match the search term, plus additional information relating to the number of
occurrences of the search term, which is unused.
The file is parsed to extract the PMIDs only, which are concatenated (separated by
commas) and used in the URL for the EFetch E-Utility:
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=[list of PMIDs]&retmode=html&rettype=medline
This results in an HTML file (containing only the bare minimum of HTML code) that lists
all utilised MEDLINE fields and the corresponding values for each PMID in the list.
This file is saved as medline_data.txt in the target directory.
The PMIDs are then individually fed into the ELink E-Utility, for which the URL is:
http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=[PMID]&db=pubmed
This also produces an XML-formatted file, which contains the PubMed-defined list of related
articles with their similarity scores. These pairs of numbers are extracted and saved
as [PMID]:[score] pairs in a file, rel_[PMID], in the related subdirectory of the
target directory.
Examples of the results from performing each of these E-Utilities can be seen in
Appendix A, and an example of a file put into related is in Appendix B.
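As an illustration of how the three request URLs can be assembled, the sketch below builds them with Python’s standard urllib.parse module (pubmed.pl itself is written in Perl). The function names are hypothetical, the base URL is the one quoted in the text and may since have changed, and db is used as the name of ELink’s target database parameter.

```python
from urllib.parse import urlencode

# Base address as quoted in the text; treat it as an assumption today.
BASE = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, max_results):
    """URL that searches PubMed and returns up to max_results PMIDs."""
    params = {"db": "pubmed", "retmax": max_results, "usehistory": "n", "term": term}
    return BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """URL that fetches the MEDLINE fields for a list of PMID strings."""
    params = {"db": "pubmed", "id": ",".join(pmids),
              "retmode": "html", "rettype": "medline"}
    # safe="," keeps the comma separators readable instead of encoding them as %2C.
    return BASE + "efetch.fcgi?" + urlencode(params, safe=",")


def elink_url(pmid):
    """URL that lists the PubMed articles related to a single PMID."""
    params = {"dbfrom": "pubmed", "id": pmid, "db": "pubmed"}
    return BASE + "elink.fcgi?" + urlencode(params)
```

Using urlencode rather than string concatenation ensures that spaces and other special characters in the user’s search term are escaped correctly.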
4.2.2 Processing the Results
First, medline_data.txt and the contents of the related subdirectory are
parsed to extract the title, abstract, MeSH terms, authors, journal title, publication
date and related articles for each PMID. The related article scores are normalised so
they lie between 0 and 1.
Then all of the titles and abstracts are scanned for drugs, diseases, function terms,
structure terms and location terms, by using regular expressions listed in 5 files (see
Appendix B), and for user-defined terms, which are listed in a file that the user can
optionally create.
An attempt is then made to identify the importance of all of the words in the title and
abstract, by performing term frequency – inverse document frequency (tf-idf) analysis
on them [Tfi]. First, each word is changed to lower case and, if it is not listed in
common_words.txt, is stemmed using the Porter Stemmer Algorithm [Por80].
This way, related words such as disease, Diseases and DISEASED will all be
counted using the same stem. This approach has a drawback though: some related
biological entities have names that are spelt the same, but differ in the case of the
letters (e.g. Myc = protein, myc = gene). This algorithm will not differentiate between
the two.
For each encountered word stem for each PMID, two values are calculated: $n_i$, the
number of times the stem $i$ appears in the title (or abstract, or both); and $d_i$, the
number of titles (or abstracts, or both) the stem appears in. The tf-idf value for each
word can then be calculated using the formula
\[ \text{tf-idf}_i = \frac{n_i}{\sum_k n_k} \cdot \log \frac{|D|}{d_i} \]
where $\sum_k n_k$ evaluates as the number of word stems in the title (or abstract, or both)
and $|D|$ is the total number of PubMed results.
This produces high values if a stem appears frequently within a title (or abstract, or
both), but appears in only a small fraction of titles (or abstracts, or both) in the whole
set. Those stems that score highly are considered to be important words.
The tf-idf values are then normalised.
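The calculation can be made concrete with a short sketch. It is written in Python rather than Perl, takes documents that have already been lower-cased and stemmed, and uses the natural logarithm, which is an assumption since the text does not state a base; the function name tf_idf is hypothetical and the final normalisation step is omitted.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of lists of word stems. Returns one {stem: score} per document."""
    total_docs = len(documents)                      # |D|
    # d_i: the number of documents in which each stem appears at least once.
    doc_freq = Counter(stem for doc in documents for stem in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)                        # n_i for this document
        length = len(doc)                            # sum over k of n_k
        scores.append({
            stem: (n / length) * math.log(total_docs / doc_freq[stem])
            for stem, n in counts.items()
        })
    return scores
```

A stem that occurs in every document scores zero, because the logarithm of 1 is 0, which matches the intent that only distinctive stems are treated as important.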
4.2.3 Saving the Results
Each encountered attribute value is saved into one of:
• qspace_authors.txt
• qspace_diseases.txt
• qspace_drugs.txt
• qspace_functions.txt
• qspace_journals.txt
• qspace_locations.txt
• qspace_mesh_terms.txt
• qspace_structures.txt
• qspace_user_terms.txt
• qspace_words.txt
Then each of the articles is saved to qspace_main.txt in the 16-line format
below, where indices indicate the line numbers in the corresponding qspace file, with
the items separated by | characters. Non-applicable fields are indicated with a #
character.
PMID
Title
Author indices list
MeSH term indices list
Journal name : provider indices list
Publication date
Drug indices list
Disease indices list
Function indices list
Structure indices list
Location indices list
User term indices list
Title word indices list, with tf-idf scores
Abstract word indices list, with tf-idf scores
Entire document word indices list, with tf-idf scores
Related article PMIDs list, with similarity scores
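A reader for this layout might look like the following sketch, given in Python rather than the C++ of the visualiser. The field names and the function read_records are hypothetical. Because the text does not spell out how scores are paired with indices or how the journal line is subdivided, those items are left as raw strings.

```python
# Hypothetical field names, in the order of the 16-line format.
FIELDS = [
    "pmid", "title", "authors", "mesh_terms", "journal", "date",
    "drugs", "diseases", "functions", "structures", "locations",
    "user_terms", "title_words", "abstract_words", "document_words", "related",
]
PLAIN_TEXT = ("pmid", "title", "journal", "date")


def read_records(lines):
    """Group lines into 16-line records and split the list fields on '|'."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % len(FIELDS) != 0:
        raise ValueError("expected a multiple of 16 lines")
    records = []
    for start in range(0, len(lines), len(FIELDS)):
        record = {}
        for name, text in zip(FIELDS, lines[start:start + len(FIELDS)]):
            if name in PLAIN_TEXT:
                record[name] = text          # kept as written
            elif text == "#":
                record[name] = []            # '#' marks a non-applicable field
            else:
                record[name] = text.split("|")
        records.append(record)
    return records
```

The fixed number of lines per article means a record can be located by simple arithmetic on the line number, with no delimiter scanning between articles.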
A reference file, journal_list, is used to find the publishing houses and portals of
each known journal. It is created from [Lin05], and should be occasionally updated
by the user with the short getJournals.pl script when new journals, publishing
houses or journal access portals are established.
Also output is qspace_word_stems.html, a collection of all of the word stems
and the words they can represent.
4.3 BioQSpace Visualiser
The implementations of the features added to the original Q-SPACE application,
which together create the BioQSpace visualisation application, are now described.
4.3.1 Article Storage and Comparison Algorithms
MAV_qpit’s list file parsing function has been replaced with one to parse the files
created by pubmed.pl, extract the attribute values and create
QSERV_tupleArticle instances with them. The majority of the attributes are lists
of words or phrases, but the title words, abstract words, title + abstract words and
related articles lists have associated scores. The comparison algorithm for the lists
without scores calculates the fraction of the items from the lists that both articles
share. For the lists with scores, the pseudocode for the comparison value is below:
set comparison value to 0
for each item in list 1
    if list 2 contains item
        multiply score from list 1 with score from list 2
        add result to comparison value
divide comparison value by total number of unique items
    from list 1 and list 2
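Both comparison algorithms can be expressed compactly. The sketch below is in Python rather than C++ and uses hypothetical function names. For the lists without scores it interprets "the fraction of the items that both articles share" as shared items divided by all distinct items, which is an assumption; the scored version follows the pseudocode above, with each list represented as a mapping from item to score.

```python
def compare_unscored(items_1, items_2):
    """Shared items as a fraction of all distinct items (assumed interpretation)."""
    set_1, set_2 = set(items_1), set(items_2)
    union = set_1 | set_2
    return len(set_1 & set_2) / len(union) if union else 0.0


def compare_scored(scores_1, scores_2):
    """scores_1 and scores_2 map item -> score, as in the pseudocode."""
    value = 0.0
    for item, score in scores_1.items():
        if item in scores_2:
            value += score * scores_2[item]      # product of the two scores
    unique = len(set(scores_1) | set(scores_2))  # unique items across both lists
    return value / unique if unique else 0.0
```

Storing the scored lists as mappings makes the "list 2 contains item" test a constant-time lookup rather than a scan.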
The lists can be quite large, so to reduce the time spent calculating these values (at the
expense of accuracy), subsets of the lists are made in the constructor which contain
the items with the highest scores, so the pseudocode becomes:
set comparison value to 0
for each item in list subset 1
    if list 2 contains item
        multiply score from list 1 with score from list 2
        add result to comparison value
for each item in list subset 2
    if list 1 contains item
        multiply score from list 1 with score from list 2
        add result to comparison value
divide comparison value by total number of unique items
    from list subset 1 and list subset 2
Score thresholds are used in creating the list subsets, which are automatically chosen
when loading a data set so that they consist of approximately 25% of the complete
lists. The thresholds can be changed using the GUI, which alters the size of the list
subsets. A lower threshold means that more items are considered, which means the
calculation takes longer; a higher threshold creates subsets with fewer items, which
can dramatically speed up comparison calculations.
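The subset variant and the automatic choice of threshold might be sketched as follows, in Python with hypothetical function names. Picking the threshold as the score that leaves roughly a quarter of the items is one plausible reading of the text, not a description of the actual code. The comparison follows the pseudocode literally, so an item that appears in both subsets contributes twice.

```python
def pick_threshold(all_scores, fraction=0.25):
    """Score at or above which roughly `fraction` of the scores lie (assumed rule)."""
    ordered = sorted(all_scores, reverse=True)
    if not ordered:
        return 0.0
    keep = max(1, round(len(ordered) * fraction))
    return ordered[keep - 1]


def subset(scores, threshold):
    """Items whose score reaches the threshold."""
    return {item: s for item, s in scores.items() if s >= threshold}


def compare_with_subsets(scores_1, scores_2, threshold):
    sub_1, sub_2 = subset(scores_1, threshold), subset(scores_2, threshold)
    value = 0.0
    for item in sub_1:                       # subset 1 against the full list 2
        if item in scores_2:
            value += scores_1[item] * scores_2[item]
    for item in sub_2:                       # subset 2 against the full list 1
        if item in scores_1:
            value += scores_1[item] * scores_2[item]
    unique = len(set(sub_1) | set(sub_2))
    return value / unique if unique else 0.0
```

Raising the threshold shrinks both loops, which is where the time saving described in the text comes from.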
The PubMed related articles list is used in two attributes: one uses the algorithm
described above (PubMed Related Articles); the other checks if the PMID of the first
article is included in the second article’s related articles list, and vice versa (Direct
PubMed relation).
The publication date comparison score is the difference between the years of
publication, as a fraction of the complete range of years for the data set.
The publication journal comparison score yields 1 if the journals are the same;
otherwise it is calculated with the same algorithm as for the lists without scores, using
each journal’s list of publishing houses and portals.
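The three simpler attribute comparisons can be sketched as below, in Python with hypothetical function names. Several details are assumptions: the direct relation is taken to require both directions to hold, the date function returns the difference itself (the text does not say how it is converted into a similarity), and the shared fraction for journals is computed over all distinct publishers and portals.

```python
def directly_related(pmid_1, related_1, pmid_2, related_2):
    """True if each article appears in the other's related-articles list (assumed 'and')."""
    return pmid_1 in related_2 and pmid_2 in related_1


def date_difference(year_1, year_2, earliest, latest):
    """Difference between publication years as a fraction of the data set's range."""
    span = latest - earliest
    return abs(year_1 - year_2) / span if span else 0.0


def journal_score(journal_1, providers_1, journal_2, providers_2):
    """1 for the same journal, else the shared fraction of publishers and portals."""
    if journal_1 == journal_2:
        return 1.0
    set_1, set_2 = set(providers_1), set(providers_2)
    union = set_1 | set_2
    return len(set_1 & set_2) / len(union) if union else 0.0
```

All three return values between 0 and 1 (or a boolean), so they can be combined with the list-based comparisons on a common scale.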
The maximum comparison values for the different attributes differ greatly: the
publication date and journal values will often evaluate to 1, but the related articles
value can peak at only about 10^-2. This greatly undermines the intuitiveness of the
weights, so normalisation values are precalculated when the data set is loaded, which
ensures that all comparison values are relatively well spread out in the range (0, 1].
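One simple way to obtain such normalisation values is to scale each attribute by the largest comparison value observed for it. The Python sketch below assumes that approach, which the text does not confirm; the function name normalisation_factors is hypothetical.

```python
def normalisation_factors(values_by_attribute):
    """values_by_attribute maps attribute name -> raw comparison values for all pairs.

    Returns a multiplier per attribute so that its largest value becomes 1
    (an assumed scheme; attributes with no positive values are left unscaled).
    """
    factors = {}
    for name, values in values_by_attribute.items():
        top = max(values, default=0.0)
        factors[name] = 1.0 / top if top > 0 else 1.0
    return factors
```

With a related-articles attribute that peaks near 0.01, the factor is about 100, so a weight slider for that attribute has an effect comparable to one for the journal attribute.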
4.3.2 GUI
The final design of the GUI is shown in figure 4.1. The layout is mostly consistent
with the proposed design in figure 3.7, having a menu bar at the top, a toolbar of
common actions on the left-hand side, article information and weight sliders at the
bottom, and the MAVERIK viewport filling most of the window. The only
component missing is the status bar, which has been replaced with a status dialog.
The GUI is built from default Qt widgets, some customised widgets (made by
subclassing existing Qt widgets) and a MAVERIK window. Actions (slots) are
associated with signals that are produced when interactive widgets are activated (such
as the clicked() signal from a QPushButton, or the valueChanged(int)
signal from a QSlider) using the QObject::connect function, which works in a
similar way to how C callback functions are registered [Qtb].
Figure 4.1: The final graphical user interface design.
Figure 4.2: The menu bar.
Menus (figure 4.2)
The path on the right hand side of the menu is the directory whose contents are being
displayed. The File menu contains options to load a new set of articles, which
opens a directory selection dialog, and to exit. The
Select Articles menu contains actions to let the user select all the articles, no articles,
the inverse of the current selection, or any that are marked. The Mark Articles menu
contains actions to mark the selected articles, no articles, or those that fit attribute
criteria, which uses the dialog in figure 4.3 to let the user choose the attribute values
to mark. The Help menu consists of links to the help system (figure 4.4), the word
stem meanings file produced by pubmed.pl (figure 4.5) and the About BioQSpace
window (figure 4.6).
Figure 4.3: Mark articles by attribute dialog.
Figure 4.4: Help window.
Figure 4.5: The word stems window.
Figure 4.6: About BioQSpace window.
Figure 4.7: The toolbar.
Toolbar (figure 4.7)
This panel consists of two sections: navigation, and miscellaneous options and
operations. The navigation system is discussed in chapter 4.3.3, which describes the
functions of the eight direction buttons, the two zoom buttons and the focus on last
selected article checkbox. The traverse trail buttons navigate through any articles
that have been added to the trail, and the fly to article drop-down list allows the user
to navigate to an article with the selected PMID.
The show labels and use tooltips checkboxes toggle the labels and tooltips, as
illustrated in figure 4.8. The label is the PMID of the article, which appears below the
MAV_qobj. The faulty code for MAV_tooltip has been fixed so that it now appears
as expected, displaying the PMID and the keywords from the article.
Delete selected articles and clear trail act as one would expect. Advanced options
displays the dialog shown in figure 4.9. The top four sliders change the thresholds
described in chapter 4.3.1; the last one changes the threshold that the Q-SPACE
grouping algorithm uses to cluster the articles.
Figure 4.8: Action of the ‘show labels’ and ‘use tooltips’
checkboxes.
Figure 4.9: The advanced options dialog.
Figure 4.10: The article information panel.
Article Information (figure 4.10)
When an article is selected, this panel lists all of its attribute values, described in
chapter 4.1, though only the important subsets of the whole lists are included for
those attributes with scores.
Figure 4.11: The attribute weight sliders.
Weight Sliders (figure 4.11)
With these sliders, the user can change the way the comparison value is calculated.
Higher or lower emphasis can be given to each attribute by positioning the handles
appropriately. When the user is happy with the choice, they click the recalculate
with new weights button, and wait for the comparisons to be recalculated, after
which the new visualisation is displayed. If any of the similarity values evaluates to
zero, the user is warned that the visualisation may be unrepresentative of the data,
and is told how many pairs have zero similarity.
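The effect of the sliders can be illustrated with a weighted mean of the per-attribute comparison values. This is a Python sketch under that assumption; the text does not give the exact combining formula, and the function names are hypothetical.

```python
def combined_similarity(attribute_scores, weights):
    """Weighted mean of per-attribute scores; both arguments map attribute -> value."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    weighted = sum(w * attribute_scores.get(name, 0.0) for name, w in weights.items())
    return weighted / total


def count_zero_pairs(pair_scores, weights):
    """Number of article pairs whose combined similarity is exactly zero."""
    return sum(1 for scores in pair_scores
               if combined_similarity(scores, weights) == 0.0)
```

A count like this is what would drive the warning: if many pairs have zero similarity under the chosen weights, the layout carries little information about them.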
Error, Warning and Progress Messages
QMessageBoxes are used to display all error and warning messages. Progress
messages are displayed line-by-line in a custom QDialog, which automatically
disappears when a process has finished.
4.3.3 MAVERIK Navigation
The 3D space is of little use without effective methods of navigation. In Q-SPACE,
this was performed using a 3D mouse, but in BioQSpace the user can use
either the right button of a standard 2D mouse or buttons on the GUI.
There are two modes of navigation: free and focused. In free navigation, the user can
look up, down, left and right, and by holding shift can move backwards and
forwards. Focused navigation is initiated when an article is clicked on or otherwise
navigated to (via the trail traversal buttons, for instance). In focused navigation, the
user orbits the selected article, which remains at the centre of the viewport. Zooming
in is restricted so that the article cannot be passed through. The navigation mode can
be toggled using the ‘focus on last selected article’ checkbox on the GUI. Returning
to focused mode after being in free mode flies the view back to the last selected article.
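Focused navigation can be pictured as a camera moving on a sphere around the selected article. The sketch below is an illustration only: it assumes a yaw/pitch/distance parameterisation and a hypothetical minimum distance, and it does not reproduce the MAVERIK navigation code.

```python
import math

MIN_DISTANCE = 1.0  # hypothetical closest approach to the selected article


def orbit_position(target, yaw, pitch, distance):
    """Camera position on a sphere around target; angles are in radians."""
    # Zooming in is clamped so the camera can never pass through the article.
    distance = max(distance, MIN_DISTANCE)
    x = target[0] + distance * math.cos(pitch) * math.sin(yaw)
    y = target[1] + distance * math.sin(pitch)
    z = target[2] + distance * math.cos(pitch) * math.cos(yaw)
    return (x, y, z)


# A zoom request closer than the limit is held at MIN_DISTANCE.
camera = orbit_position((0.0, 0.0, 0.0), yaw=0.0, pitch=0.0, distance=0.2)
```

Because the camera always looks at the target, the article stays at the centre of the viewport whatever the angles.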
In Q-SPACE, left-clicking on an article focuses on it, selects it and adds it to the trail,
while right-clicking does nothing. Holding shift while left-clicking allows articles to
be selected in a group. In BioQSpace, the right mouse button is given functionality:
right-clicking an article focuses on it only, so the user can examine an article without
being forced to add it to the trail.
5. TESTING AND EVALUATION
All software should be thoroughly tested to locate possible bugs and limitations. A
comparison of the final product with the initial specification and requirements is a
good method of evaluating the success of the system. This chapter describes both of
those processes when applied to BioQSpace, and concludes with instructions for
installation of the software.
5.1 Testing
Testing a large, complex software application can never be exhaustive, so a program
cannot be proved to be correct. However, through rigorous testing of the individual
components and the integrated system, one can be more confident that it is correct.
Successful tests should expose flaws in the system, and as such they should be
carefully designed to perform tasks that the programmer would not predict or expect
to be attempted.
Writing in C or C++, languages that do not offer garbage collection, means that in
addition to testing that functions perform as expected, memory allocation and
deallocation must be kept under control. Memory leaks are a common problem, and
can be difficult to pin down, but tools such as Valgrind can help locate where the
problems originate.
It is assumed that the Q-SPACE code is essentially correct, so the sections that have
survived unchanged in BioQSpace need not be tested as thoroughly as the new or
adapted code.
pubmed.pl
The only potential problems outside the control of the programmer are in the data
collection stage, which is susceptible to internet timeouts. Disconnecting the
computer from the network at various points while running the script meant that the
way the script copes with network issues could be analysed, and, in two cases, fixed.
Other problems can arise in the way the files are written to and read. Searches were
performed that were known to contain results that lacked one or more of the fields,
e.g. relatively recent submissions that have not yet been allocated MeSH terms, to
check that they were saved properly.
Many tests of the resulting files were made to ensure the format was consistently
correct.
User Interface
Using various combinations of the actions of: navigating through the 3D space;
clicking on multiple articles; traversing along the trail; deleting articles; clearing the
trail; and changing between the two different navigation modes, the application’s
stability could be thoroughly tested.
Setting all the weights to zero simply gives all attributes equal weighting, and throws
up no errors.
When the user tries to change the source path, an error message is displayed if the
directory does not exist, or does not contain all the required files.
Cancelling all of the dialogs, even if changes were made, was checked to ensure any
changes were not retained.
5.2 Evaluation
The testing process proved fruitful in locating bugs and unexpected behaviour. The
majority of the discovered problems were fixed, though some known bugs (and, of
course, some unknown bugs) remain that, given more time, would also be remedied.
The remaining known bugs are all minor, cosmetic issues that do not impair the use
of BioQSpace, for example deleting the currently focused article or article(s) in the
trail, which can result in focused navigation around an empty point in space.
All eight of the MUST requirements in chapter 2.2 have been satisfied. The user can:
• Query PubMed and visualise relationships between the resulting articles, in a
  3D environment, saving results of queries.
• Navigate the environment, and interact with the (graphical representations of)
  articles via the mouse and GUI.
• Tweak the comparison algorithm to place emphasis on particular attributes.
• View the attribute information of the articles.
• Remove articles from the visualisation.
The SHOULD requirements have been partially satisfied: the two halves of the
system have been implemented in separate applications, and the user can tweak the
amount of data considered in the comparison algorithm. The help system is not
comprehensive, but does cover the basics of how to use the system, and no joining
application was written.
With a little practice, the navigation is simple to use, and the GUI buttons allow
Apple Mac users to navigate despite the lack of a second mouse button. Overall, the
GUI design is uncluttered and intuitive, yet still provides all of the required
functionality.
No attempt has been made to quantitatively evaluate the usefulness of BioQSpace, so
it is unknown to what extent it achieves the original intended purpose of assisting
users of PubMed to find what they want or to conduct research. This would likely
require a lot more time and research, and is beyond the scope of this project.
5.3 Installation
The requirements for BioQSpace are:
• Internet access
• MAVERIK version 6.2, compiled with the --QT option
• Qt version 3.3
• Perl interpreter
The twine library used for the trail comes zipped in the BioQSpace package, but can
be unzipped anywhere. The following environment variables need to be set to the
home directories of the corresponding applications in order to compile and run
BioQSpace:
• MAV_HOME (MAVERIK)
• QTDIR (Qt)
• TWINE_HOME (twine)
• BIOQSPACE (BioQSpace)
LD_LIBRARY_PATH needs to be appended with $MAV_HOME/lib and
$TWINE_HOME/lib. Running make in $BIOQSPACE/bioqspace will then
compile the application into the same directory.
If $BIOQSPACE is added to the PATH environment variable, pubmed.pl and the
visualisation can be run from anywhere.
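Before running make, the environment described above can be checked with a few lines of script. The following is a sketch only: the helper names and the example install locations are hypothetical, while the variable names are those listed above.

```python
REQUIRED_VARS = ("MAV_HOME", "QTDIR", "TWINE_HOME", "BIOQSPACE")


def missing_variables(env):
    """Names of the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def library_path(env):
    """LD_LIBRARY_PATH with the MAVERIK and twine lib directories appended."""
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    for home in ("MAV_HOME", "TWINE_HOME"):
        lib_dir = env[home] + "/lib"
        if lib_dir not in parts:
            parts.append(lib_dir)
    return ":".join(parts)


# Hypothetical install locations, used only to illustrate the check.
example_env = {
    "MAV_HOME": "/opt/maverik",
    "QTDIR": "/usr/lib/qt-3.3",
    "TWINE_HOME": "/opt/twine",
    "LD_LIBRARY_PATH": "/usr/local/lib",
}
unset = missing_variables(example_env)
```

In practice the functions would be called with `os.environ` rather than an example dictionary.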
This is a demanding installation procedure that can prove difficult on non-Linux
machines, and time-consuming on machines lacking the required libraries, which
have to be downloaded and installed first. Attempts were made to install BioQSpace
on an Apple Mac and Microsoft Windows (using Cygwin [Cyg]), but were
abandoned due to excessive problems with dynamic libraries. Many hours were spent
trying to fix the problems, but after too little progress was made it was judged to be
futile.
6. CONCLUSIONS
This chapter concludes the dissertation by summarising the intentions and
achievements of the project, and providing some suggestions for improvements and
extensions to the system.
6.1 Summary
BioQSpace was developed as a visualisation tool to graphically represent the
relationships between articles gathered from PubMed, an online medical literature
database. The project involved adapting an existing application, one that visualises
relationships among abstract data sets, for a specific purpose, and developing tools to
interact with the visualisation.
This thesis began with some background information on the motivation for the
project, and a discussion of related software in existence. The rest of the document
described the stages involved in developing BioQSpace, from requirements analysis,
through design and implementation to testing and evaluation.
6.2 Performance Issues
There are two main bottlenecks to the performance of the visualiser:
• Comparing the articles.
• Positioning and rendering the MAV_qobjs using the force placement
  algorithm.
The first is a one-off job, whereas the second is performed continuously, but for both,
the faster they can be completed, the higher the level of user satisfaction. The time
complexities of the algorithms are O(N²) and O(N log N) [PCM01] respectively, so
both can dramatically slow down the software when the number of articles gets too
large; being O(N²), comparing the articles will have the bigger impact. The force
placement algorithm has been kept unchanged from Q-SPACE, and it is unlikely to
be possible to improve on its complexity. The time taken for article comparisons,
however, depends on the attribute values of the individual results, on how many of
the weights are zero, and on the thresholds in the advanced options dialog, so it can
vary greatly.
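The quadratic growth can be made concrete with a short calculation. Assuming that each unordered pair of articles is compared exactly once, N articles require C(N) comparisons, where:

```latex
\[
C(N) = \frac{N(N-1)}{2}, \qquad
C(50) = 1225, \qquad
C(500) = 124750, \qquad
\frac{C(500)}{C(50)} \approx 102 .
\]
```

A tenfold increase in the number of articles therefore produces roughly a hundredfold increase in the number of comparisons.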
BioQSpace was tested under Linux Fedora Core 2, running on a computer with an
AMD Athlon XP 2200+ processor with 512MB RAM and an NVIDIA GeForce4
MX 440 with AGP8x. The query term ‘transcription factor[ti]’ was used, with
maximum results ranging from 50 to 500. As the number of articles approached the
500 mark, the frame rate (determined by the speed of the force placement algorithm)
became noticeably sluggish, though still usable. Figure 6.1 shows the completion
times for loading a set of articles (which comprises reading files, comparison of the
articles, normalisation of comparison values, and estimation of the list subset
thresholds) and reloading a set of articles (recalculating the comparisons without
changing the weights). Each task was performed 5 times, and averages taken.
Figure 6.1: Completion times of loading and reloading sets of articles.
[Chart ‘Completion Times’: time to complete (secs, 0 to 250) plotted against number
of articles (0 to 500). Series: first load time; average load time (excluding first);
reload time.]
The graph uses quadratic trendlines to approximately interpolate the points, due to
the algorithms being O(N²). As expected, the time taken to reload a set of articles is
approximately half of that to load it, but it takes longer to load a set of articles for the
first time, which is both unexpected and unexplained. Because of this, the time taken
for the first attempt is plotted as a curve separate from the average of the following 4
times.
6.3 Further Work
There are a number of aspects to the system that are unsatisfactory, or have potential
to be greatly improved.
The comparison algorithm, which is performed whenever new data sets are loaded,
or the attribute weights or thresholds are changed, is the most time-consuming part of
the system. The following pseudocode shows how the algorithm works:
    make a list of all of the articles
    for each article a1 in list
        for each article a2 after a1 in list
            if neither are deleted
                similarity = compare a1 with a2
            else
                similarity = 0
            record similarity in matrix
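The pseudocode can be turned into a short runnable sketch. It is written in Python for brevity rather than in the C++ of BioQSpace, and it makes two assumptions: the matrix is stored symmetrically, and the comparison function is a simple placeholder (word overlap between titles), not the actual BioQSpace comparison.

```python
def build_similarity_matrix(articles, compare, deleted=frozenset()):
    """Compare every unordered pair once and record the result in a matrix."""
    n = len(articles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if i not in deleted and j not in deleted:
                similarity = compare(articles[i], articles[j])
            else:
                similarity = 0.0
            # Assumption: the matrix is symmetric, so both halves are filled.
            matrix[i][j] = matrix[j][i] = similarity
    return matrix


def shared_word_fraction(a, b):
    """Placeholder comparison: Jaccard overlap of the words in two titles."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


titles = ["lupus nephritis therapy", "lupus therapy outcomes", "zinc finger domains"]
matrix = build_similarity_matrix(titles, shared_word_fraction, deleted={2})
```

The inner loop starts at `i + 1`, so each pair is visited once; this is the source of the N(N-1)/2 comparison count.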
This is an easily parallelisable algorithm. By recruiting multiple processors to help
compare groups of articles, the time taken to populate the similarity matrix would be
greatly reduced. One possible way of doing it would be to send the data required to
complete each outer loop (an article and all articles following it in the list) to a
separate processor. As each iteration of the loop uses one less article than the
previous one, this would be a very uneven distribution and would be wasteful of the
resources, so pairs of outer loops could be sent to each processor – one from the front
and one from the rear, as illustrated in figure 6.2.
Figure 6.2: A method for parallelising the comparison
algorithm.
N is the number of articles. ⌊ ⌋ and ⌈ ⌉ mean floor and ceiling
respectively.
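The balancing effect of pairing a loop from the front with one from the rear can be checked with a small sketch. It uses 0-based loop indices running from 0 to N-1, which is an assumption that differs slightly from the labels in the figure, and the function names are illustrative only.

```python
def pair_outer_loops(n):
    """Pair outer loop i with loop n-1-i; the middle loop of an odd n stands alone."""
    pairs = [(i, n - 1 - i) for i in range(n // 2)]
    if n % 2:
        pairs.append((n // 2,))
    return pairs


def comparisons_in_loop(i, n):
    """Outer loop i compares article i with the n-1-i articles that follow it."""
    return n - 1 - i


n = 6
workloads = [sum(comparisons_in_loop(i, n) for i in group)
             for group in pair_outer_loops(n)]
```

For six articles, each of the three processors receives five comparisons, whereas sending single loops would give workloads ranging from five down to zero.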
Another possibility is to output the comparison values for each of the attributes in a
data set into a file the first time it is visualised. Then, any further visualisations of the
same data set can read those values back in, adapting them to reflect the attribute
weights and the thresholds, rather than recalculating them from scratch. A potential
drawback to this approach could be some disk space wastage if many sets of abstracts
are visualised only a small number of times, so some method of clearing out
unnecessary files would be required.
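A minimal sketch of this caching idea follows. It assumes a JSON file holding the unweighted per-attribute scores for each pair and a weighted-average combination; the file layout, the function names and the scores are all hypothetical.

```python
import json
import os
import tempfile


def save_attribute_scores(path, scores):
    """Write per-pair, per-attribute similarities; keys are "i,j" pair labels."""
    with open(path, "w") as handle:
        json.dump(scores, handle)


def load_and_reweight(path, weights):
    """Read cached scores and combine them under the current slider weights."""
    with open(path) as handle:
        scores = json.load(handle)
    total = sum(weights.values())  # assumes at least one non-zero weight
    return {pair: sum(weights[a] * values[a] for a in weights) / total
            for pair, values in scores.items()}


# First visualisation: compute once and cache (hypothetical scores).
cache_file = os.path.join(tempfile.mkdtemp(), "query_cache.json")
save_attribute_scores(cache_file, {"0,1": {"title": 0.5, "mesh": 1.0}})

# Later visualisation: only the cheap re-weighting step is repeated.
reweighted = load_and_reweight(cache_file, {"title": 3.0, "mesh": 1.0})
```

Only the combination step is repeated on later runs, so the cost of the pairwise comparison is paid once per data set.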
As the primary focus of the system was on the visualisation side, rather than the text
mining side, the algorithms used to process the PubMed data in pubmed.pl, such as
tf-idf, are somewhat basic and naïve. Implementing some more advanced NLP
techniques in the analyses of the titles and abstracts would improve the quality and
relevance of the attribute data that BioQSpace uses in the comparison algorithm.
Figure 6.2 (diagram contents): the comparison algorithm distributes pairs of outer
loops across processors as follows.
    Processor 1:   loop 0, loop N
    Processor 2:   loop 1, loop N-1
    Processor 3:   loop 2, loop N-2
    Processor 4:   loop 3, loop N-3
    Processor N/2: loop ⌊N/2⌋, loop ⌈N/2⌉
A particular example of where the comparison algorithm is lacking is with the MeSH
terms. Currently they are strictly compared as complete strings, which does not take
advantage of their hierarchical nature. The levels of the hierarchy are marked with a
/, so are easily identified, and important elements are marked with a *. Instead of
rejecting MeSH terms that are not identical, partial comparison values can be
assigned if MeSH terms share some of their hierarchy. A higher value can be given if
there is a *.
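One possible form of such a partial comparison is sketched below. The scoring scheme (the fraction of leading levels shared, plus a bonus when both terms carry a *) and the bonus value are assumptions made for illustration; they are not part of BioQSpace.

```python
def parse_mesh(term):
    """Split a MeSH entry into its levels; a leading * marks an important element."""
    levels = []
    starred = False
    for part in term.split("/"):
        if part.startswith("*"):
            starred = True
            part = part[1:]
        levels.append(part.strip().lower())
    return levels, starred


def mesh_similarity(term_a, term_b, star_bonus=0.25):
    """Partial score for terms that share leading levels of their hierarchy."""
    levels_a, star_a = parse_mesh(term_a)
    levels_b, star_b = parse_mesh(term_b)
    shared = 0
    for a, b in zip(levels_a, levels_b):
        if a != b:
            break
        shared += 1
    if shared == 0:
        return 0.0
    score = shared / max(len(levels_a), len(levels_b))
    if star_a and star_b:
        score += star_bonus  # hypothetical bonus for two starred terms
    return min(score, 1.0)


partial = mesh_similarity("Health Care Reform/*legislation & jurisprudence",
                          "Health Care Reform/economics")
```

Under this scheme two terms with the same heading but different subheadings score 0.5 instead of being rejected outright.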
The current progress status dialog is somewhat primitive and unreliable due to some
clumsy usage of QThreads. Improving the way it is displayed would reassure the
user that the application is working properly.
The GUI does not follow one of the major GUI design principles, that of allocating
representative icons to the buttons and menu items – the only elements with icons are
the navigation buttons. Some time could be spent designing some user-friendly icons
to improve BioQSpace’s usability.
The 10th user requirement from chapter 2.2, that the two components of the system
should be linked with a single application that can run both, was not implemented.
The system would be more cohesive if this linking application were written.
Finally, to address the complication of the installation process, it would be good if
BioQSpace could be packed into a self-contained application that does not require the
separate installation of Qt and MAVERIK. Finding out how to perform an
installation on Mac and Windows machines would also make BioQSpace more
accessible, and hence more useful.
GLOSSARY
GUI. Graphical User Interface.
IE. Information Extraction.
MAVERIK. The MAnchester Virtual EnviRonment Interface Kernel, a Virtual Reality system.
MEDLINE. A bibliographic biomedicine database; the largest component of PubMed.
MeSH. Medical Subject Headings, the NLM’s controlled vocabulary thesaurus.
NLM. The National Library of Medicine.
NLP. Natural Language Processing.
PubMed. One of the services provided by the NLM’s Entrez retrieval system, providing tools to query MEDLINE and other databases.
Q-SPACE. An application to visualise similarity amongst a set of abstract data.
Qt. A C++ GUI widget library.
STL. The C++ Standard Template Library.
tf-idf. Term frequency – inverse document frequency.
BIBLIOGRAPHY
[Bio] BiopathwayBuilder [online, cited 15 September 2005]. Available from
World Wide Web: http://www.biopathway.org/BiopathwayBuilder.
[Bio05] Anna Divoli. BioIE – Extracting information from the biomedical
literature [online, cited 15 September 2005]. Available from World Wide
Web: http://umber.sbs.man.ac.uk/dbbrowser/bioie.
[CH02] Jon Cook and Toby Howard (Editors). MAVERIK Programmer’s Guide.
University of Manchester, 2002.
[Chi] Chilibot: finding gene and protein relationships from MEDLINE [online,
cited 15 September 2005]. Available from World Wide Web:
http://www.chilibot.net.
[Com] Computation of Related Articles [online, cited 15 September 2005].
Available from World Wide Web:
http://www.ncbi.nlm.nih.gov/entrez/query/static/computation.html.
[CS04] Hao Chen and Burt M. Sharp. Content-rich biological network
constructed by mining PubMed abstracts. BMC Bioinformatics, 5:147,
2004.
[Cyg] Cygwin Information and Installation. [online, cited 15 September 2005].
Available from World Wide Web: http://www.cygwin.com.
[Enta] Entrez PubMed [online, cited 15 September 2005]. Available from World
Wide Web: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed.
[Entb] Entrez Utilities [online, cited 15 September 2005]. Available from World
Wide Web: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html.
[FL04] Tancred Frickey and Andrei Lupas. CLANS: a Java application for
visualizing protein families based on pairwise similarity. Bioinformatics
Applications Note, pages 3702-3704, 2004.
[Gou03] Graham Gough. Algorithms and Data Structures (Lecture notes). The
University of Manchester, 2003.
[HDCV04] Robert Hoffman, Joaquin Dopazo, Juan C. Cigudosa and Alfonso
Valencia. HCAD, closing the gap between breakpoints and genes. Nucleic
Acids Research, 2004.
[IBM] IBM Ease of Use – Design basics [online, cited 15 September 2005].
Available from World Wide Web: http://www.3.ibm.com/ibm/easy/
eou_ext.nsf/Publish/6.
[Inf] Infotrieve Online [online, cited 15 September 2005]. Available from
World Wide Web:
http://www4.infotrieve.com/newmedline/search.asp.
[KBSP04] Ronald N. Kostoff, Joel A. Block, Jesse A. Stump and Kirstin M. Pfeil.
Information content in Medline record fields. International Journal of
Medical Informatics, 73:515-527, 2004.
[Kol05] Adam Kolawa. Which Development Method is Right for your Project?
[online, cited 15 September 2005]. Available from World Wide Web:
http://www.stickyminds.com/sitewide.asp?ObjectId=3152&Function=
DETAILBROWSE&ObjectType=ART
[KSBG03] Thomas Karopka, Thomas Scheel, Sven Bansemer and Änne Glass.
Automatic construction of gene relation networks using text mining and
gene expression data. Medical Informatics, 2:169-183, 2004.
[Lee04] James Lee (with Simon Cozens and Peter Wainwright). Beginning Perl
second edition. Apress, 2004.
[Lin05] LinkOut Journals by Provider [online, cited 15 September 2005].
Available from World Wide Web:
http://www.ncbi.nlm.nih.gov/entrez/linkout/journals/jourlists.cgi?type
id=1&type=journals&format=text&operation=Show
[LPP04] Changsu Lee, Jinah Park and Jong C. Park. A graphic tool for curating
molecular interaction networks from the literature. Computers in Biology
and Medicine, 35:555-564, 2004.
[Mas03] Louis Massey. On the quality of ART1 text clustering. Neural Networks,
16:771-778, 2003.
[Mav] The Advanced Interfaces Group – MAVERIK [online, cited 15
September 2005]. Available from World Wide Web:
http://aig.cs.man.ac.uk/maverik/maverik.php.
[Meda] MEDLINE fact sheet [online, cited 15 September 2005]. Available from
World Wide Web: http://www.nlm.nih.gov/pubs/factsheets/
medline.html.
[Medb] Medportal [online, cited 15 September 2005]. Available from World
Wide Web: http://www.medportal.com.
[Mes] Medical Subject Headings [online, cited 15 September 2005]. Available
from World Wide Web: http://www.nlm.nih.gov/mesh/
meshhome.html.
[Oua97] Steve Oualline. Practical C Programming 3rd edition. O’Reilly, 1997.
[Ovi] Ovid MEDLINE Field Guide [online, cited 15 September 2005].
Available from World Wide Web:
http://www2.umdnj.edu/rwjlbweb/ovidpuzz/startscope.htm.
[PBA01] Carolina Perez-Iratxeta, Peer Bork and Miguel A. Andrade. XplorMed: a
tool for exploring MEDLINE abstracts. TRENDS in Biochemical Sciences,
September 2001.
[Por80] Martin Porter. Porter Stemming Algorithm [online, cited 15 September
2005]. Available from World Wide Web: http://tartarus.org/~martin/
PorterStemmer.
[PC01] Steve Pettifer and Jonathan Cook. Exploring Realtime Visualisation of
Large Abstract Data Spaces with QSPACE. IEEE, 2001.
[PCM01] Steve Pettifer, Jon Cook and John Mariani. Towards Real-Time
Interactive Visualisation in Virtual Environments. Virtual Reality
International Conference, May 2001.
[Pro] Production Processes [online, cited 15 September 2005]. Available from
World Wide Web: http://www.scism.sbu.ac.uk/law/Section5/chap6/
s5c6p2.html.
[Puba] PubMed Overview [online, cited 15 September 2005]. Available from
World Wide Web: http://www.ncbi.nlm.nih.gov/entrez/query/static/
overview.html.
[Pubb] Searching PubMed: Table 1. Search Field Descriptions and tags [online,
cited 15 September 2005]. Available from World Wide Web:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.table.pub
medhelp.T37.
[Qta] Qt 3.3: About Qt [online, cited 15 September 2005]. Available from
World Wide Web: http://doc.trolltech.com/3.3/aboutqt.html.
[Qtb] Qt 3.3: Signals and Slots [online, cited 15 September 2005]. Available
from World Wide Web: http://doc.trolltech.com/3.3/
signalsandslots.html.
[Ref] RefViz [online, cited 15 September 2005]. Available from World Wide
Web: http://www.refviz.com.
[RKKGKHKRF00] Andrey Rzhetsky, Tomohiro Koike, Sergey Kalachikov, Shawn
M. Gomez, Michael Krauthammer, Sabina H. Kaplan, Pauline Kra,
James J. Russo and Carol Friedman. A knowledge model for analysis
and simulation of regulatory networks. Bioinformatics Ontology, 12:1120-
1128, 2000.
[SFW03] Ellen Siever, Stephen Figgins and Aaron Weber. Linux in a Nutshell 4th
edition. O’Reilly, 2003.
[SGI] SGI – Standard Template Library Programmer’s Guide [online, cited 15
September 2005]. Available from World Wide Web:
http://www.sgi.com/tech/stl.
[SGM04] Jeremy R. Semeiks, L. R. Grate and I. S. Mian. Text-based analysis of
genes, proteins, aging and cancer. Mechanisms of Ageing and Development,
126:193-208, 2004.
[SJORB05] Jasmin Šarić, Lars Juhl Jensen, Rossitza Ouzounova, Isabel Rojas and
Peer Bork. Extraction of regulatory gene/protein networks from
Medline. Bioinformatics, 2005.
[SK05] Nicholas A. Solter and Scott J. Kleper. Professional C++. Wiley
Publishing Inc., 2005.
[Tfi] Tf-idf – Wikipedia, the free encyclopedia [online, cited 15 September
2005]. Available from World Wide Web:
http://en.wikipedia.org/wiki/Tfidf.
[Tro] Trolltech – Qt Product Overview [online, cited 15 September 2005].
Available from World Wide Web:
http://www.trolltech.com/products/qt.
[Xpl] XplorMed: eXploring Medline abstracts [online, cited 15 September
2005]. Available from World Wide Web:
http://www.bork.embl-heidelberg.de/xplormed.
APPENDIX A: E-UTILITY RESULTS
Listed here are the Entrez Programming Utilities used in BioQSpace, with example
URLs and their results.
ESearch
URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10&usehistory=n&term=lupus
Result:
    <?xml version="1.0"?>
    <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN"
     "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
    <eSearchResult>
      <Count>47317</Count>
      <RetMax>10</RetMax>
      <RetStart>0</RetStart>
      <IdList>
        <Id>16127435</Id>
        <Id>16127360</Id>
        <Id>16127015</Id>
        <Id>16127001</Id>
        <Id>16126989</Id>
        <Id>16126986</Id>
        <Id>16126985</Id>
        <Id>16126984</Id>
        <Id>16126981</Id>
        <Id>16126980</Id>
      </IdList>
      <TranslationSet>
        <Translation>
          <From>lupus</From>
          <To>("lupus"[MeSH Terms] OR "systemic lupus erythematosus"[Text Word] OR "lupus erythematosus, systemic"[MeSH Terms] OR lupus[Text Word])</To>
        </Translation>
      </TranslationSet>
      <TranslationStack>
        <TermSet>
          <Term>"lupus"[MeSH Terms]</Term>
          <Field>MeSH Terms</Field>
          <Count>795</Count>
          <Explode>Y</Explode>
        </TermSet>
        <TermSet>
          <Term>"systemic lupus erythematosus"[Text Word]</Term>
          <Field>Text Word</Field>
          <Count>35520</Count>
          <Explode>Y</Explode>
        </TermSet>
        <OP>OR</OP>
        <TermSet>
          <Term>"lupus erythematosus, systemic"[MeSH Terms]</Term>
          <Field>MeSH Terms</Field>
          <Count>32040</Count>
          <Explode>Y</Explode>
        </TermSet>
        <OP>OR</OP>
        <TermSet>
          <Term>lupus[Text Word]</Term>
          <Field>Text Word</Field>
          <Count>47317</Count>
          <Explode>Y</Explode>
        </TermSet>
        <OP>OR</OP>
        <OP>GROUP</OP>
      </TranslationStack>
    </eSearchResult>
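The following sketch shows how an ESearch URL of this form could be built and how the article identifiers could be read from the result. It is written in Python rather than the Perl of pubmed.pl, the function names are hypothetical, and the sample XML is an abbreviated copy of the result above so that the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH_BASE = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=10):
    """Build an ESearch query URL with the parameters used in the example above."""
    params = [("db", "pubmed"), ("retmax", retmax),
              ("usehistory", "n"), ("term", term)]
    return ESEARCH_BASE + "?" + urlencode(params)


def parse_esearch(xml_text):
    """Return the total hit count and the list of PubMed IDs in the result."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))  # top-level Count only
    ids = [element.text for element in root.findall("./IdList/Id")]
    return count, ids


# Abbreviated copy of the result shown above (first three IDs only).
sample = """<eSearchResult>
<Count>47317</Count><RetMax>10</RetMax><RetStart>0</RetStart>
<IdList><Id>16127435</Id><Id>16127360</Id><Id>16127015</Id></IdList>
</eSearchResult>"""
count, ids = parse_esearch(sample)
```

The identifiers returned in this way are the values that would be passed on to EFetch and ELink.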
EFetch
URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=14527253&retmode=html&rettype=medline
Result:
    <Html><Title>PmFetch response</Title><Body>
    <Pre>
    PMID- 14527253
    OWN - NLM
    STAT- MEDLINE
    DA - 20031006
    DCOM- 20040323
    LR - 20041117
    PUBM- Print
    IS - 0278-2715
    VI - Suppl Web Exclusives
    DP - 2003 Jan-Jun
    TI - Creating consensus on coverage choices.
    PG - W3-199-211
    AB - The framework for reaching near-universal coverage outlined in this paper
          combines tax credits for private insurance and public program expansions.
          It illustrates how a series of incremental steps could be phased in to
          achieve near-universal coverage. Hallmarks include creation of a
          Congressional Health Plan; use of the income tax system to provide tax
          credits and enroll uninsured people; creation of a state Family Health
          Insurance Program open to everyone below 150 percent of poverty; and
          creation of a Medicare Part E, open to the disabled and uninsured older
          adults. The paper provides coverage and cost estimates and identifies
          potential sources of revenue to finance coverage.
    AD - Commonwealth Fund, New York City, New York, USA.
    FAU - Davis, Karen
    AU - Davis K
    FAU - Schoen, Cathy
    AU - Schoen C
    LA - eng
    PT - Journal Article
    PL - United States
    TA - Health Aff (Millwood)
    JID - 8303128
    SB - IM
    CIN - Health Aff (Millwood). 2003 Jan-Jun;Suppl Web Exclusives:W3-212-5. PMID: 14527254
    CIN - Health Aff (Millwood). 2003 Jan-Jun;Suppl Web Exclusives:W3-216-8. PMID: 14527255
    MH - Financing, Government/legislation & jurisprudence
    MH - Health Care Reform/*legislation & jurisprudence
    MH - Humans
    MH - Income Tax/legislation & jurisprudence
    MH - Insurance, Health/*legislation & jurisprudence
    MH - Medically Uninsured
    MH - Medicare/legislation & jurisprudence
    MH - Politics
    MH - Privatization/legislation & jurisprudence
    MH - Program Development
    MH - Tax Exemption/*legislation & jurisprudence
    MH - United States
    MH - Universal Coverage/*legislation & jurisprudence
    EDAT- 2003/10/07 05:00
    MHDA- 2004/03/24 05:00
    PST - ppublish
    SO - Health Aff (Millwood) 2003 Jan-Jun;Suppl Web Exclusives:W3-199-211.
    </Pre></Body></Html>
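A record in this tagged layout can be read into a simple field dictionary. The sketch below is in Python rather than the Perl of pubmed.pl and is illustrative only: the function name is hypothetical, repeated tags such as AU and MH are collected into lists, and indented lines are assumed to continue the previous field.

```python
import re
from collections import defaultdict

# A field line is a 2-4 letter tag, optional padding, a hyphen, then the value.
FIELD_LINE = re.compile(r"^([A-Z]{2,4})\s*-\s?(.*)$")


def parse_medline(text):
    """Map each MEDLINE tag to the list of values found for it."""
    record = defaultdict(list)
    tag = None
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            tag = match.group(1)
            record[tag].append(match.group(2).strip())
        elif tag and line.strip():
            # Continuation line: append to the most recent value.
            record[tag][-1] += " " + line.strip()
    return dict(record)


# Excerpt of the record shown above.
sample = """PMID- 14527253
TI - Creating consensus on coverage choices.
AB - The framework for reaching near-universal coverage outlined in this paper
      combines tax credits for private insurance and public program expansions.
AU - Davis K
AU - Schoen C
MH - Health Care Reform/*legislation & jurisprudence
MH - Humans"""
record = parse_medline(sample)
```

Keeping every value in a list means that fields which may legitimately be absent, such as MH for recent submissions, simply do not appear as keys.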
ELink
URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10000000&db=pubmed
Result (extract):
    <?xml version="1.0"?>
    <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
     "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
    <eLinkResult>
      <LinkSet>
        <DbFrom>pubmed</DbFrom>
        <IdList> <Id>10000000</Id> </IdList>
        <LinkSetDb>
          <DbTo>pubmed</DbTo>
          <LinkName>pubmed_pubmed</LinkName>
          <Link> <Id>10000000</Id><Score>2147483647</Score> </Link>
          <Link> <Id>9979842</Id><Score>10559862</Score> </Link>
          <Link> <Id>9994279</Id><Score>9966528</Score> </Link>
          <Link> <Id>10009023</Id><Score>9874436</Score> </Link>
          <Link> <Id>12398705</Id><Score>9812176</Score> </Link>
          <Link> <Id>10009474</Id><Score>9781612</Score> </Link>
          <Link>
          . . .
          <Link> <Id>12747441</Id><Score>6657858</Score> </Link>
          <Link> <Id>10472728</Id><Score>6651632</Score> </Link>
          <Link> <Id>11689937</Id><Score>6642576</Score> </Link>
        </LinkSetDb>
      </LinkSet>
    </eLinkResult>
APPENDIX B: FILES USED BY PUBMED.PL
B.1 Lists of regular expressions used to extract attribute values from titles and
abstracts. They are printed in columns here, but the actual files consist of only one
column.
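The way such a list might be applied to a title or abstract is sketched below in Python (pubmed.pl itself is written in Perl). The three patterns are copied from disease_list; the word-boundary wrapping, the function names and the example title are assumptions made for illustration.

```python
import re

# Excerpt of disease_list, one pattern per line as in the actual file.
DISEASE_PATTERNS = """\
[C|c]ancers{0,1}
[D|d]iabet[a-z]{1,2}
[a-z]{1,16}itis
"""


def load_patterns(text):
    """Compile one regular expression per non-empty line."""
    # Assumption: patterns should match whole words, hence the \\b anchors.
    return [re.compile(r"\b(?:%s)\b" % line.strip())
            for line in text.splitlines() if line.strip()]


def extract_terms(patterns, passage):
    """Distinct matches in the passage, in pattern-list order."""
    found = []
    for pattern in patterns:
        for match in pattern.finditer(passage):
            if match.group(0) not in found:
                found.append(match.group(0))
    return found


title = "Chronic hepatitis and diabetes as risk factors for liver cancer"
terms = extract_terms(load_patterns(DISEASE_PATTERNS), title)
```

Note that a character class such as `[C|c]` also matches a literal `|`, which is harmless here but could be written more simply as `[Cc]`.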
drug_list [T|t]herapies [T|t]herapeutic compounds{0,1} [M|m]edical compounds{0,1} [M|m]edicinal compounds{0,1} [T|t]herapeutic anti[a-z]{1,16} agents{0,1} [a-z]{0,16}prazoles{0,1} [a-z]{0,16}pamines{0,1} [a-z]{0,16}lamines{0,1} [a-z]{0,16}imines{0,1} [a-z]{0,16}piroles{0,1} [a-z]{0,16}adines{0,1}
[a-z]{0,16}udines{0,1} [a-z]{0,16}acins{0,1} [a-z]{0,16}izines{0,1} [a-z]{0,16}cagons{0,1} [a-z]{0,16}hyllines{0,1} [a-z]{0,16}razones{0,1} [a-z]{0,16}nisteins{0,1} [a-z]{0,16}benecids{0,1} [a-z]{0,16}othymines{0,1} [a-z]{0,16}osporines{0,1} [a-z]{0,16}oprines{0,1} [a-z]{0,16}mycins{0,1}
disease_list [A|a]nemia [A|a]naemia Alzheimer [A|a]nxiety [A|a]nomal[a-z]{1,3} [A|a]ngina [A|a]bnormalit[a-z]{1,3} [A|a]llerg[a-z]{1,2} [A|a]ttacks{0,1} [A|a]troph[a-z]{1,2} [A|a]sthma [A|a]sthmatics{0,1} [A|a]utoimmune [A|a]utosomal recessive [A|a]utosomal-recessive [A|a]utosomal dominant [A|a]utosomal-dominant [B|b]ruxism blood coagulation blood clotting Crohn Creutzfeldt [C|c]oronary [C|c]ancers{0,1} [C|c]arcinomas{0,1} [C|c]arcinogenesis conditions{0,1} [C|c]ongenital Cushing [C|c]hemotherapy [C|c]linical [D|d]ementia [D|d]epression Down [D|d]iabet[a-z]{1,2} [D|d]egenarat[a-z]{1,3} [D|d]iagnosis [D|d]iseases{0,1} [D|d]isorders{0,1} [D|d]ysplasia [D|d]yspepsia
[D|d]ystrophy [D|d]ysfunctions{0,1} [D|d]efects{0,1} [D|d]eficit [D|d]osage [D|d]rugs{0,1} [E|e]pilepsy [E|e]epileptic fever fibrosis failure [H|h]emophilia [H|h]aemophilia [H|h]emorrag[a-z]{1,2} [H|h]aemorrag[a-z]{1,2} [H|h]emorag[a-z]{1,2} [H|h]aemorag[a-z]{1,2} [H|h]ereditary [H|h]allucinations [H|h]ealth [H|h]ealing Huntington [I|i]schemi[a|c] [I|i]schaemi[a|c] [I|i]nfections{0,1} [I|i]nfect[a-z]{0,3} [i|I]nflammations{0,1} [I|i]nflammatory [I|i]nflammat[a-z]{1,2} [I|i]nherited [I|i]nheritable [I|i]njury [I|i]njuries [I|i]nsomnia [L|l]ymphoma [L|l]eukemia [L|l]eukaemia [M|m]alignant [M|m]alignancy [M|m]elanomas{0,1} [M|m]etastasis
[M|m]edical [M|m]igraine [M|m]yocardial infarction [N|n]eoplasms{0,1} [N|n]eoplastic [O|o]steoporosis [P|p]athogens{0,1} [P|p]athogenesis of [P|p]athogenesis [I|i]n patients with [P|p]atients{0,1} Parkinson's Parkinson [P|p]aralysis [P|p]soriasis [P|p]neumonia [P|p]rophylaxis [P|p]rophylactics{0,1} [P|p]redisposition [P|p]rognosis [R|r]adiotherapy [S|s]troke [S|s]clerosis [S|s]ickness [S|s]ick [S|s]chizophrenia [S|s]ymptoms{0,1} [S|s]yndromes{0,1} [T|t]hrombosis [T|t]uberculosis [T|t]rauma [G|g]ene therapy [T|t]herapy preventative treatment antibiotic treatment [T|t]reatments{0,1} [T|t]umorigenesis [T|t]umors{0,1} [T|t]umours{0,1} [T|t]halassaemia [T|t]halassemia
ulcer wound virulence vomiting [a-z]{0,16}carcinomas{0,1}
[a-z]{1,16}itis [a-z]{1,16}pathy [a-z]{1,16}pathies [a-z]{1,16}penia [A-Z][a-z]{1,16}itis [A-Z][a-z]{1,16}pathy
[A-Z][a-z]{1,16}pathies [A-Z][a-z]{1,16}penia [a-z]{1,16}pathic [a-z]{1,16}ergic activity
function_list abolish[a-z]{0,3} accompan[a-z]{0,4} accumulat[a-z]{0,3} acts{0,1} acted acting deactivate[a-z]{0,1} deactivating deactivation activate[a-z]{0,1} activating activation activity affinity aim[a-z]{0,3} antagoni[s|z]e[a-z]{0,1} antagoni[s|z]ing associate[a-z]{0,1} associating attenuat[a-z]{0,3} alkylat[a-z]{1,3} activators{0,1} agonists{0,1} annealing antagonists{0,1} antigens{0,1} antiporters{0,1} antireceptors{0,1} antiterminators{0,1} bending branching bind[a-z]{0,3} block[a-z]{0,2} blocking bound carboxylat[a-z]{1,3} cataly[s|z][a-z]{0,3} characteri[s|z]ed by cleave[a-z]{0,1} compartmentali[s|z]at[a-z]{0,3} compet[a-z]{0,3} contribut[a-z]{0,3} control{0,4} coordinat[a-z]{0,3} couple[a-z]{0,1} coupling channels{0,1} capping carry carrying carrie[s|d] chaperones{0,1} cleaves{0,1} co-activators{0,1} coagulations{0,1} conductors{0,1}
constituents{0,1} dock docking donates{0,1} donating decreas[a-z]{0,3} depend[a-z]{0,3} dephosphorylat[a-z]{0,3} desensiti[s|z][a-z]{1,5} determine dimeris[a-z]{0,3} dissociate[a-z]{0,1} dissociating downregulat[a-z]{0,3} down-regulat[a-z]{0,3} encoded by encod[a-z]{0,3} enhance[a-z]{0,1} enhancing expressed expression of elongat[a-z]{0,3} endocytosis energi[z|s]ers{0,1} escort{0,3} export{0,3} function[a-z]{0,3} form[a-z]{0,3} generate[a-z]{0,1} generating hydroxylat[a-z]{1,3} hetedimer[a-z]{0,7} homodimer[a-z]{0,7} implicat[a-z]{0,3} increas[a-z]{0,3} induc[a-z]{0,3} inhibition of inhibit[a-z]{0,1} inhibiting interact[a-z]{0,3} involv[a-z]{0,3} inhibitors{0,1} inhibitory initiations{0,1} ligands{0,1} ligate[s|d]{0,1} locali[z|s]ers{0,1} lyases{0,1} lytic lead[a-z]{0,3} led link[a-z]{0,3} methylat[a-z]{1,3} mediate[a-z]{0,1} modulate[a-z]{0,1} modulating mediators{0,1}
motors{0,1} oxidat[a-z]{1,3} participate[a-z]{0,1} participating phosphorylat[a-z]{0,3} plays{0,1} protect[a-z]{0,3} prevent[a-z]{0,3} produce[a-z]{0,1} producing proliferat[a-z]{0,3} promote promote[s|d] promoting polymeri[s|z]ing recogni[s|z][a-z]{1,3} reduc[a-z]{0,3} regulate[a-z]{0,1} regulating regulation of relat[a-z]{0,3} relea[s|z][a-z]{0,3} requir[a-z]{0,3} respond[a-z]{0,3} response results in resulted in result[a-z]{0,3} roles{0,1} secret[a-z]{0,3} singal{0,4} splic[a-z]{1,3} stimulate[a-z]{0,1} stimulating stimulation stop[a-z]{0,4} suppress[a-z]{0,3} switch[a-z]{0,3} synthesi[s|z][a-z]{1,3} transcript[a-z]{0,3} transduc[a-z]{1,3} transfer[a-z]{0,3} transport[a-z]{0,3} target[a-z]{0,3} trafficking transactivat[a-z]{0,3} transfect[a-z]{0,2} transfecting transfer[a-z]{0,3} translate[a-z]{0,1} translating trigger[a-z]{0,3} up-regulat[a-z]{0,3} upregulat[a-z]{0,3} uncouple[a-z]{0,1} uncoupling
structure_list 3D [d+] amino acids [A|a]ngstrom angle
box boxes covalent bonds{0,1} hydrogen bonds{0,1}
phi bonds{0,1} psi bonds{0,1} van der waals bonds{0,1} bonds{0,1}
bridges{0,1} bundles{0,1} alpha-barrel beta-barrel barrel central region chains chiral cleft conformational change conformation cores{0,1} compris{0,3} composed components{0,1} consist[a-z]{0,2} consisting covalently bound covalently linked covalent-linked covalent crystallography highly-conserved highly conserved conserved contain[a-z]{0,2} containing coils{0,1} characteri[s|z]ed by degrees heterodimer[a-z]{0,7} homodimer[a-z]{0,7} homology hydrophobic hydrophilic detergent dimers{0,1} dimeri[s|z][a-z]{1,5} distributed disulphides domains{0,1} dipolar [D|d]altons{0,1} forms{0,1} folds{0,1} folded self-folding foldings{0,1} frames{0,1} framework zinc fingers{0,1} fingers{0,1} major grooves{0,1} minor grooves{0,1}
grooves{0,1} alpha-helix alpha-helices helix helical helices hydration interchains{0,1} identity identical to isomer isomeri[s|z]ed [d+]nm [d+] nm [d+].[d+] nm [d+]A [d+]-A [d+] A [d+].[d+] A [d+]kDa [d+]kilodalton [d+]kilobases{0,1} [d+]kbp [d+] kDa [d+] kilodalton [d+] kilobases{0,1} [d+] kbp [d+]-kDa [d+]-kilodalton [d+]-kilobase{0,1} [d+]-kbp kDa kilodalton lipid bilayers{0,1} bilayers{0,1} layers{0,1} loops{0,1} linked together linked to membrane molecular mass molecular weight motifs{0,1} monomers{0,1} monomeric multidomains{0,1} multi-domains{0,1} NMR organised in organized in occupied by antiparallel anti-parallel parallel
patterns{0,1} pockets{0,1} pores{0,1} protein complex primary sequences{0,1} primary structures{0,1} quaternary repeat-regions{0,1} repeat regions{0,1} regions{0,1} residues{0,1} resolution repeats{0,1} rod-like rings{0,1} beta sheets{0,1} beta-sheets{0,1} sheets{0,1} scattering sequences{0,1} similarity size subunits{0,1} surface alpha-strands{0,1} beta-strands{0,1} strands{0,1} structural structures{0,1} segments{0,1} symmetry asymmetric symmetric scaffolds{0,1} active site binding site coordination site site tail terminus termini tetramers{0,1} tetrameric tertiary tetrahedral C-terminal C terminal N-terminal N terminal terminal transmembranes{0,1} topology
location_list is found in are found in is common in are common in allocat[a-z]{0,3} derived detected in distributed in distributed along discovered in encoded in expressed in exist[a-z]{0,3} in extracellular found within found only in
found throughout found within found in found at found on colocalise[a-z]{0,1} with colocalize[a-z]{0,1} with co-localise[a-z]{0,1} with co-localize[a-z]{0,1} with colocalise[a-z]{0,1} colocalize[a-z]{0,1} co-localise[a-z]{0,1} co-localize[a-z]{0,1} contained in
intracellular inside localise[a-z]{0,1} in localize[a-z]{0,1} in localise[a-z]{0,1} at localize[a-z]{0,1} at localise[a-z]{0,1} on localize[a-z]{0,1} on localise[a-z]{0,1} localize[a-z]{0,1} locat[a-z]{0,3} observed in obtained from outside originat[a-z]{0,3} occurs{0,1} in
present in position[a-z]{0,3} in position[a-z]{0,3} at position[a-z]{0,3} recognised in recognized in subcellullar topology in plants in mammals in animals in humans in algae in fungi in bacteria
in yeast in the brain in the liver in the bowel in the pancreas in the kidney in the heart in the cerebellum in arteries in the aorta in the ileum in the intestine in the small intestine in the large intestine in the duodenum
in the cytoplasm in the cytoskeleton in the endothelium in the endothelial in the endothelia in the nucleus in the endoplasmic reticulum in the mitochondria in the mitochondrium in the vacuole in the outer part in the periplasmic
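The entries in the lists above are written as regular expressions, so a term such as abolish[a-z]{0,3} covers several inflected forms at once. The following is a minimal Python sketch of how such entries could be compiled and applied to a sentence. It is an illustration only, not the implementation used in this project: the sample entries are copied from function_list, while the names FUNCTION_PATTERNS, compile_terms and find_terms, and the use of word boundaries around each entry, are assumptions.

```python
import re

# Hypothetical sample of entries copied from function_list.
# Assumption: each entry is treated as one independent pattern.
FUNCTION_PATTERNS = [
    r"abolish[a-z]{0,3}",
    r"antagoni[s|z]e[a-z]{0,1}",
    r"inhibitors{0,1}",
    r"regulation of",
]


def compile_terms(patterns):
    """Compile each entry, anchored at word boundaries (an assumption),
    so that a short entry does not match inside a longer word."""
    return [re.compile(r"\b(?:%s)\b" % p) for p in patterns]


def find_terms(text, compiled):
    """Return every matched term, grouped in the order of the pattern list."""
    hits = []
    for rx in compiled:
        hits.extend(m.group(0) for m in rx.finditer(text))
    return hits


compiled = compile_terms(FUNCTION_PATTERNS)
sentence = "The drug abolishes binding and antagonises inhibitors; regulation of growth"
print(find_terms(sentence, compiled))
```

Multi-word entries such as "regulation of" are matched as a single phrase, which is why the sketch keeps each entry as one pattern rather than splitting on spaces.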
B.2 Extract from related articles file rel_15167971. Lines consist of PMID:score pairs.
11705672:54128223 15140465:50038799 15843060:46688894 8632154:46445710 11411616:46267554 1708303:46101814 7890729:46012060 11005628:45825644 1988561:45243645 9302078:45242854 1377078:44776853 9884076:44762947 11494368:44442335 10653166:44122862 15105791:43335631 . . . 15341182:35590820 15682931:28640352 15952243:27408332 15928825:17280546 15968139:11776961 15335893:7420826 13263611:6926708
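As an illustration of the PMID:score format, the following Python sketch reads such pairs into a list of integer tuples. It is a hedged example rather than the code used in this project: the function name parse_related is hypothetical, and treating both fields as integers is an assumption based on the extract above.

```python
def parse_related(text):
    """Parse whitespace-separated PMID:score tokens into (pmid, score) tuples.

    Tokens without a colon (for example the '. . .' elision marker in the
    printed extract) are skipped.
    """
    pairs = []
    for token in text.split():
        if ":" not in token:
            continue
        pmid, score = token.split(":", 1)
        pairs.append((int(pmid), int(score)))
    return pairs


sample = "11705672:54128223 15140465:50038799 . . . 13263611:6926708"
for pmid, score in parse_related(sample):
    print(pmid, score)
```

In the extract the pairs appear in descending order of score, so the first tuple returned would be the highest-scoring related article.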