visual tools for databade queries and analysis
TRANSCRIPT
Visual Tools for Queries and Display of Quantitative Information
in a Cancer Research Database
JESSE STEWART and JERZY W. JAROMCZYKDepartment of Computer Science
University of Kentucky, Lexington KY
The Kentucky Cancer Registry
• The Markey Cancer has the singular mission to eliminate the morbidity and mortality of cancer
• Since its founding, the Markey Cancer Center and the UK Chandler hospital have served 2000-2200 new patients a year and is one of the few institutions nationwide that address both clinical care as well as cancer research.
• The KCR’s case count exceeded 30,000 annually as of 2009
• The KCR houses a wealth of historical data for hundreds of cancer variants, associated treatments, and their relative success across the state of Kentucky.
Patient
Events
Abstracting
CPDMS.NET
Internet
HTTPS
Registry DB
MySQL
Data Collection
Cancer Abstracts
• A cancer abstract contains up to 240 different elements ranging from patient demographics to staging information to therapy history
• KCR alone stores tens of thousands of unique abstracts
• Each abstract is created by a registrar, a professional trained to understand cancer data standards, formats and coding rules
Accelerating Cancer Research
Develop Queries
Visualize Data Sets
Discover Important
Correlations
Registry Databases and Research
Challenges in Research•Coded Data•SQL•Complex DB Schemas•Access Control•Visualization
Valuable Information•Survival Trends•Incidence Rates•Behavioral and Geographical Correlation
Software Solutions
• Define Queries (Data Sets)– Intuitive: no programming required
– Flexible: allow any data set to be explored
– Accessible: Visual cross-browser application
– Re-use: Save, modify and combine Data Sets
• Data Analysis and Visualization:– Context-specific diagrams
– Compare data sets singularly or side-by-side
– Customizable appearance
The Query Builder
• Presents a high-level abstraction of the Registry Database
• Patient, Case, Therapy data variables are easily recognizable and categorized
• Separates the user from the actual database structure and coded information
– Example: Treatment is encoded as:
• No Treatment=0, Treatment=1, Surveillance=2
The Query Builder
• Translates a question about cancer data into SQL (Structured Query Language) which can be understood by the computer system
• Parses and stores the query for modification and reuse later
Example Query
• Patients diagnosed between Jan 1, 2005 and Dec 31, 2008
• Patients diagnosed in Kentucky• Patients treated with immunotherapy
• SQL may be complex
case_data.diagdate >= 20050101 and case_data.diagdate <= 20081231 and case_tx.txtype = ‘I’ and case_data.diagstate = ‘KY’ from case_data, case_txwhere case_tx.hospkey = case_data.hospkey and case_tx.patkey = case_data.patkey and case_data.incomplete = 0;
Interface Design
• To make writing a query like the previous example simple, the Query Builder must provide intuitive controls permitting a user to define each query component
• Variable names and coded values should be descriptive and easy to locate
• Conditions should be combined in a natural way with Boolean operators
• Tree-like layout chosen to represent queries
Query Builder in Action
Custom UI Controls
• For each variable, DB schema information is used to display a customized UI control, eg:
– Dates: date fields or ranges
– Discrete variables: drop-down list or multiselect
– Variable with many values: autofill field
Syntax Tree
Internal Representation
• Program maintains an abstract syntax tree for the query as it is created
• Captures the essential structure of the query but omits SQL-specific syntax
• This data structure serves as an intermediary between the interface and the database system
• Permits two code-generation targets: JSON and SQL
Serialization and Storage
• Each query once created by the user may be saved for future analysis or manipulation
• The program stores the AST for the query as a JavaScript object, which can then be serialized into JSON (JavaScript Object Notation) and then stored.
• Deserialization and conversion to SQL is performed later for analysis
Query Management
Query Storage
• Queries are often referred to as ‘study groups’ by researchers
• The serialized queries and associated metadata is stored in a database table
study_groups: id | Name | Query| User | LastModified | LastUsed
• MySQL database was chosen for convenience since registry data is stored using this system
Visualization Tools
– Scaled Venn Diagrams• User can quickly ascertain relative size of data sets and
their relationship to one another
– Bar and Histogram Charts• Flexible view of variable distribution for different sets
– Survival Trends• View and compare survival rates over time
– Statistics• Common descriptive statistics
• Comparison with Chi-square, Log rank, T-, Z-tests
Venn Diagrams
• Venn diagrams show logical relationships between a number of sets
• Subset of Euler diagrams – all possible subsets must be displayed
• Can quickly convey how data sets overlap and relate to one another
Area Proportionality
• Area-proportional venn diagrams show the relative size of datasets and their intersections
• Very useful for rapid exploration of data sets such as cancer data
• Although typical venn digrams often display 3 sets, area-proportional diagrams cannot always be drawn with circles for more than 2 sets [1]
• The vast majority of research needs involve comparison of two data sets
Drawing To Scale
Circle-intersection problemTriangle(C1,C2,A) = Triangle(C1,C2,B)Triangle(C1,C2,A) + Triangle(C1,C2,B) + Lens = Sector(C1,A,B) + Sector(C2,A,B)Lens = Sector(C1,A,B) + Sector(C2,A,B) - 2*Triangle(C1,C2,A)
Drawing to Scale
Lens = Sector(C1,A,B) + Sector(C2,A,B) - 2*Triangle(C1,C2,A)
• By applying formulas for the area of a circular sector and triangle, we arrive at this result for the distance between the circles’ centers:
• The value must be approximated, to do so the Root-bisection method was used in implementation.
Visualization: Venn Diagrams
Visualization: Venn Diagrams
Reports
• Several customizable reports were implemented to further leverage the query builder’s utility.
• Each is implemented in PHP, and produces an SQL query using the saved criteria and the settings selected by the user for the report
Data List Tool
Cross-Tab Analysis
Graph Settings Interface
Visualization: Histogram
Visualization: Survival Trends
Chi-square Analysis
Censored Life Table
Success
• The Visual Query Builder and Data Analysis tools have become an integral part of CPDMS.NET – the online abstracting system developed at the KCR.
• Over 5000 study groups have been created by users of the system.
• Features have been added and improved resulting from feedback given by researchers and registrars (cancer data professionals).
• Future developments may include:– Wider array of statistical tests
– Functions to analyze more than two data sets at once
References
• The Kentucky Cancer Registry – A History
• http://www.kcr.uky.edu/about.php
• F. Ruskey and M. Weston – A Survey of Venn Diagrams
• http://www.combinatorics.org/Surveys/ds5/VennEJC.html
• S. Chow and F. Ruskey, Drawing Area-Proportional Venn and Euler Diagrams
• Circle-Circle Intersection Problem
• http://mathworld.wolfram.com/Circle-CircleIntersection.html
Acknowledgements
Eric Durbin, Kentucky Cancer Registry
Dr. Jerzy Jaromczyk, UK Computer Science
Software