IT Issues for large surveys
IASC / IASS Summer School
Knowledge Discovery in Large Surveys
June 2001
Andrew Westlake
Survey & Statistical Computing
www.sasc.co.uk
© Survey & Statistical Computing 2001
Structure of the Presentation
• Motivation
» What is the Problem?
• Database Issues: Relational Databases
» Things every Statistician should know
• Modelling Issues
» Useful additional tools for Projects
What is the problem?
• Why should statisticians need to know about IT?
• Is there anything important to know?
• Why not leave it all to the IT specialists?
Statistical Analysis is Simple
[Diagram: Raw Data, Statistical Method, Answers]
Statistical Analysis: not so simple
[Diagram with Inputs, Process and Outputs. Elements shown: Raw Data, Meta Data, Prior Knowledge, Population Databank, Analysis Specifications, Analysis History, Statistical Method, Transformations and Updates, Summaries, Results, Tables, Estimates, Record, Revision, Interpretation, Conclusions]
Statistical Analysis: the Survey View
[Diagram. Elements shown: Planning, Survey Design, Questionnaire Design, Sampling specifications, Entry Specifications, Validity Rules, Questionnaires or forms, Experiment, Fieldwork, Data Entry and Checking, Meta Data, Raw Data, Data Archive, Data Bank, Data Reduction, Statistical Procedure, Conclusion, Publication, Enquiries]
Statistical Analysis: the Statistical Office Process View
[Diagram. Elements shown: Sampling Frame; Sample Design and Selection; Design parameters; Form Production, Data Capture, Scanning; Raw Data; DBMS; Outliers, Imputation; Adjustments; Grossing, Aggregation; Disclosure Control; Standard Analyses; Dissemination ad hoc Analyses; Results; Processes]
Why not use Packages?
• Standard packages provide Functionality
» Often adequate for particular tasks
> Elements of analysis
> Questionnaire design
> Data Storage
» Implements a particular view of a problem area
> Often helpful, but can be limiting
• Not sufficient for implementing Processes
» Sequence, Control and Validation (Knowledge)
• Potential as Components in overall System
Why get Involved?
• We need good data
» Cannot do good analysis with bad data
• We need good access to data
» Systems often designed inflexibly, to limited initial targets
• It can save time in the long run
» We’ll get involved anyway in cleaning up the mess
» Easier to work with properly focussed system
• IT people have useful tools and concepts
» Can help us to do our ordinary tasks
Development Opportunities
• Continuous Processes
» Continuous or repeated surveys, business enquiries
• Repeated Processes
» Design and analysis service for small surveys
» Regular reporting, with adjustments and estimation
• Sharing
» Secondary analysis of data (through Data Archive?)
» Dissemination systems
• Very Large Projects
» E.g. Census
Applications to statistical problems
• Statistics is not just about analysis
» Need to be concerned about good data in and good use of conclusions
• Examples of development areas
» Processing
> HIV and AIDS notification system
> Processing and Manipulation of results from a Demographic Survey in Pakistan
> Result processing from Construction Industry inquiries
» Statistical Databases
> Support for dissemination to users and policy makers, and for further analysis
» Metadata
> Initiatives to standardise concepts and structures
» Integrated statistical analysis systems
> Analysis tools seen as a component that integrates with other components, data store, metadata, dissemination, etc.
» Distributed resource systems
> Data owners retain control, but users can see distributed resources as an integrated whole
Developing new Software and Systems
• Don’t do it unless it’s essential
» Difficult to do well, Expensive, Time-consuming
• Do it properly
» Don’t leave it all to the IT experts
> They don’t know what you want
> They don’t know what is important
» User-Centred design (HCI) is not enough
> Important advance, but leaves the power with the IT people
» Learn the Concepts and Jargon
> Useful tools for thinking about structure and design
» Take part in the development process
> Not too much detail (but enough)
> Concentrate on functionality
> Use proper tools and a proper methodology
Tools IT People use
• Concepts about Structure
» Relational database ideas
» Object-oriented concepts
» Modelling of structure and process
• Implementation standards
» OLAP – handling aggregate data
» XML – Interchange of complex structures
» Component architecture – Cooperating tools, not monolithic structures
• Design and Development
» UML – for system design based on objects
» Methodologies
> Contextual Inquiry – for User Requirements
> Feature-Driven Development, Rational Unified Process – for managing the development process
Database Issues: Relational Databases
• Relational Model and SQL
» Components
» Operations
• Data Warehouses
» What is different?
• Aggregate Data and Data Cubes
» Structure
» Functionality
• Examples
» Processing the PFFPS database in MS Access and SQL
Database Schema Levels
[Diagram: three schema levels. External Schema: External Views for User 1, User 2, Application 1 and Application 2 (Search, Data Entry/Update, Report, Cases for statistical analysis, Statistical results). Logical Schema: Logical Model of the Database. Internal Schema: Internal Model]
Relational databases
• Relational structures and operations
» Agreed model of components and behaviour (Codd)
» Standardised implementations through SQL
• Views
» Definitions of extractions and combinations can be stored and used as though they were physical tables
• Physical model choice depends on intended usage as well as logical structure
• Normalisation: method to avoid duplication and dependencies
» Important for transaction systems
» Easier to be consistent, faster updates
• ODBC standard for access to micro data from applications
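The View idea above can be sketched as follows. This is a minimal illustration only, using Python's built-in sqlite3 module as a stand-in for any SQL database; the table name, columns and data are invented for the example and are not taken from the PFFPS database.

```python
import sqlite3

# In-memory database; the table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE household_member (
        cluster INTEGER, h_no INTEGER, q01 INTEGER, sex TEXT, age INTEGER
    );
    INSERT INTO household_member VALUES
        (1, 1, 1, 'F', 34), (1, 1, 2, 'M', 36),
        (1, 2, 1, 'F', 61), (2, 1, 1, 'F', 17);
    -- A view stores the definition of an extraction, not a copy of the data.
    CREATE VIEW women_15_49 AS
        SELECT cluster, h_no, q01, age
        FROM household_member
        WHERE sex = 'F' AND age BETWEEN 15 AND 49;
""")

# The view is queried exactly as though it were a physical table.
view_rows = conn.execute(
    "SELECT * FROM women_15_49 ORDER BY cluster, h_no, q01"
).fetchall()

# Rows added to the base table later appear in the view automatically.
conn.execute("INSERT INTO household_member VALUES (2, 2, 1, 'F', 25)")
view_count_after = conn.execute("SELECT COUNT(*) FROM women_15_49").fetchone()[0]
```

Because only the definition is stored, the view cannot drift out of step with the underlying records.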
Structure of the PFFPS database
[Schema diagram: tables and key columns]
Cluster: PK Cluster
Household: PK,FK1,I1 CLUSTER; PK H_NO
Household Member: PK,FK1,I1 CLUST_HHM; PK,FK1,I1 H_NO_HHM; PK Q01
Eligible Woman: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01
Birth History: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01; PK LINE_WBH
Pregnancy Detail: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01; PK,FK1 LINE_WPB
Contraceptive Methds: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01; PK Q301
Contraceptive Use: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01
Marriage and Fertility: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01
Background: PK,FK1 CLUST_HHM; PK,FK1 H_NO_HHM; PK,FK1 Q01
SQL Data Manipulation
SELECT {DISTINCT} <expression list>
FROM <table list>
WHERE <condition>
GROUP BY <column list>
HAVING <group condition>
ORDER BY <column list>
• All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations.
• The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement.
• Retrieval is usually done through a Query interface, which generates the SQL.
• Basis of ODBC standard for linking to relational databases.
Data Manipulation – Project and Restrict
• Project operation chooses columns
SELECT B, C FROM R1
SELECT SEX, AGE_AT_DIAGNOSIS, DIAGNOSIS FROM CANCER_REGISTRATION
• Restrict chooses rows (Where clause)
SELECT * FROM R2 WHERE A < 6
SELECT * FROM CANCER_REGISTRATION WHERE AGE_AT_DIAGNOSIS <= 65
R1
A B C D
1 b1 c1 d1
2 b2 c2 d2
3 b3 c3 d3
R2
A B C D
3 b3 c3 d3
4 b4 c4 d4
5 b5 c5 d5
6 b6 c6 d6
7 b7 c7 d7
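The Project and Restrict statements on R1 and R2 can be tried directly. The sketch below loads the two small tables shown on the slide into an in-memory SQLite database through Python's sqlite3 module; the choice of SQLite and the ORDER BY clauses (added only to make the output order predictable) are assumptions of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT);
    INSERT INTO R1 VALUES (1,'b1','c1','d1'), (2,'b2','c2','d2'), (3,'b3','c3','d3');
    CREATE TABLE R2 (A INTEGER, B TEXT, C TEXT, D TEXT);
    INSERT INTO R2 VALUES (3,'b3','c3','d3'), (4,'b4','c4','d4'), (5,'b5','c5','d5'),
                          (6,'b6','c6','d6'), (7,'b7','c7','d7');
""")

# Project: keep only columns B and C of R1.
projected = conn.execute("SELECT B, C FROM R1 ORDER BY A").fetchall()

# Restrict: keep only the rows of R2 that satisfy the condition.
restricted = conn.execute("SELECT * FROM R2 WHERE A < 6 ORDER BY A").fetchall()
```

Each result is itself a table (here a list of row tuples), so it could be stored or used as the input to a further statement.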
Data Manipulation - Join
• The Join operation is the combination of a Product operation with Restrict to select the rows of the result
SELECT R1.*, R3.* FROM R1, R3 WHERE A = X
or
SELECT R1.*, R3.* FROM R1 INNER JOIN R3 ON A = X
• This is an Equi-Join
• Natural Join is based on columns with the same name
SELECT R1.*, X, E FROM R1, R3 WHERE R1.D = R3.D
or
SELECT * FROM R1 NATURAL JOIN R3
Join A=X
A B C R1.D X R3.D E
1 b1 c1 d1 1 d3 e1
2 b2 c2 d2 2 d5 e2
3 b3 c3 d3 3 d3 e3
Natural Join
A B C D X E
3 b3 c3 d3 1 e1
3 b3 c3 d3 3 e3
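The two join results can be reproduced with a short script. This is a sketch using Python's sqlite3 module; the contents of R3 (columns X, D, E) are not listed on the slide and have been read back from the join results shown, so treat them as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT);
    INSERT INTO R1 VALUES (1,'b1','c1','d1'), (2,'b2','c2','d2'), (3,'b3','c3','d3');
    -- R3 is inferred from the result tables, not given explicitly.
    CREATE TABLE R3 (X INTEGER, D TEXT, E TEXT);
    INSERT INTO R3 VALUES (1,'d3','e1'), (2,'d5','e2'), (3,'d3','e3');
""")

# Equi-join: product of R1 and R3, restricted to rows where A = X.
equi_join = conn.execute(
    "SELECT R1.*, R3.* FROM R1 INNER JOIN R3 ON A = X ORDER BY A"
).fetchall()

# Natural join: matches on the shared column name D, which appears once.
natural_join = conn.execute(
    "SELECT * FROM R1 NATURAL JOIN R3 ORDER BY X"
).fetchall()
```

In the equi-join both D columns survive (R1.D and R3.D); in the natural join only the rows where the two D values agree are kept, and D is reported once.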
RDBMS Strengths
• Data Modelling
» Useful tools for understanding data structures and flows
» Entity – Relationship (ER) model widely used for structure
• Relational Model
» Precise, formal mathematical specification of structure and behaviour
• SQL
» International Standard (SQL/92), widely implemented
• Current Implementations
» Widely available, well supported, good implementations, integration with other products, add-on market for tools
Data warehouse
• Most RDB systems are designed to support Transactions
» Fast access to single (or few) records for updating
» Not imposed by the Relational Model
• Data Warehouse systems designed for analysis, not transactions
» Different physical model for access, but can still follow relational principles
> Many (selected) records, structured classification variables and measures
> Different ideas about redundancy (normalisation)
» Extensions for manipulating Aggregated Data
» Tools for gathering and cleaning data from source databases
» Analysis tools (Data Mining)
> Hypothesis generation
> Inference
> Often weak on statistical principles (particularly over-fitting), but improving
Star Schema
[Figure: star schema diagram. Source: MS OLAP pages]
Aggregate data
• Can produce with aggregation functions in SQL
» Count, Sum, Group By
• Induces new concepts, not in relational model
» Data Cube (Multi-way table), Dimensions, Classifications, Levels, Measures
• Requires new functionality
» Exploration, Manipulation, Presentation
• Commercial products developing
» OLAP (anticipated by Codd), usually within Warehouse
» Usually limited to simple aggregation
» Some specialised Statistical products
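As a small illustration of producing aggregate data with Count, Sum and Group By, the sketch below builds a two-way table from micro data. It uses Python's sqlite3 module, and the reports table with its region, disease and cases columns is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical micro data: one row per report received.
    CREATE TABLE reports (region TEXT, disease TEXT, cases INTEGER);
    INSERT INTO reports VALUES
        ('North', 'A', 5), ('North', 'B', 3), ('South', 'A', 2);
""")

# One row per region: number of reports and total cases.
by_region = conn.execute("""
    SELECT region, COUNT(*) AS n_reports, SUM(cases) AS total_cases
    FROM reports
    GROUP BY region
    ORDER BY region
""").fetchall()

# Grouping by two classification variables gives the cells of a two-way table.
by_region_disease = conn.execute("""
    SELECT region, disease, SUM(cases)
    FROM reports
    GROUP BY region, disease
    ORDER BY region, disease
""").fetchall()
```

The result is still a flat relational table; the ideas of dimensions, levels and measures have to be layered on top of it.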
Aggregate Results, as Multi-way Table
[Diagram: data cube with three dimensions, each with a hierarchy of levels]
Disease Classification (ICD): Major Group, Minor Group, Detail
Location: Country, Region, District
Period: Year, Month, Week, Day
Measures: Reports received, Population at risk, Estimated Incidence rate, SD of Incidence rate
This example has three dimensions (so that it can be visualised). In reality, for this application, we would need at least two more, Age and Gender.
Functionality for Cubes
• Focussing on Subsets
» Slice and Dice
• Change level of Detail
» Drill down, Roll up, implies a structure of levels over classifications for each dimension
• Aggregation rules for measures
» May not be sums, may have different base
• Derived measures
» What is compatible, sensible
• Manipulation between cubes
• Presentation issues
» Layout on 2 dimensions
» Annotations and descriptions
• And so on – only basic issues addressed in OLAP products
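A rough sketch of Roll up, and of why aggregation rules for measures may not be sums, is given below in plain Python. The districts, regions and figures are invented; the point is only that counts and populations add up across districts, whereas a rate has to be recomputed from the rolled-up numerator and denominator.

```python
from collections import defaultdict

# Hypothetical cube cells: (district, year) -> (reports, population at risk).
cells = {
    ("D1", 2000): (4, 1000),
    ("D2", 2000): (6, 3000),
    ("D3", 2000): (5, 5000),
}

# One level of the Location classification: District -> Region.
district_to_region = {"D1": "R1", "D2": "R1", "D3": "R2"}


def roll_up(cells, mapping):
    """Aggregate additive measures from district level to region level."""
    totals = defaultdict(lambda: [0, 0])
    for (district, year), (reports, population) in cells.items():
        key = (mapping[district], year)
        totals[key][0] += reports
        totals[key][1] += population
    return {key: tuple(value) for key, value in totals.items()}


def rate_per_1000(reports, population):
    """Derived measure: not additive, so it is recomputed at each level."""
    return 1000 * reports / population


regions = roll_up(cells, district_to_region)
region_rates = {key: rate_per_1000(*value) for key, value in regions.items()}
district_rates = {key: rate_per_1000(*value) for key, value in cells.items()}
```

For region R1 the district rates are 4.0 and 2.0 per 1000, but the region rate is 2.5, which is neither their sum nor their simple mean.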
Relational Databases for Statistical Data
• Can support complex structure
• Can support complex processing
• Can link easily to many statistical packages
• Can do more data manipulation than most statistical packages
» Examples from Pakistan Fertility survey
Structure of the PFFPS database
[Schema diagram repeated. Tables: Cluster, Household, Household Member, Eligible Woman, Birth History, Pregnancy Detail, Contraceptive Methds, Contraceptive Use, Marriage and Fertility, Background]
RDBMS: Pakistan FFPS in Access
PFFPS: Sex distribution for children by Age
PFFPS: Using the Query Interface
PFFPS: Generated SQL
PFFPS: Crosstab in Access
PFFPS: Joining Tables
Parity for all women of Childbearing Age
PFFPS: Parity for all women of Childbearing Age
PFFPS: Fertility Rate Calculations
• Numerators
» Count Births by Period and Mother’s Age
• Denominators
» Years of ‘exposure’ by Period and Age at the time
» Months give adequate precision
PFFPS: Births by Mother’s Current Age and Years Ago of Birth
PFFPS: Fertility Numerator Calculations
• Years Ago: Int( ([interview cmc]-[child cmc]-1)/12 )
• Women: Sum( IIF ( NZ([line_wbh])<2, [WT_EW], 0 ) )
• Children: Sum( IIF ( IsNull([line_wbh]), 0, [WT_EW] ) )
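The three Access expressions can be restated in Python to make their logic explicit. This is a sketch under stated assumptions: "cmc" is taken to be a century month code (months counted from the start of 1900), the input rows stand for Eligible Woman joined to Birth History with a null line_wbh for women with no births, and the sample rows and weights are invented.

```python
def cmc(year, month):
    """Century month code: months since the start of 1900 (assumed meaning of 'cmc')."""
    return (year - 1900) * 12 + month


def years_ago(interview_cmc, child_cmc):
    """Int( ([interview cmc]-[child cmc]-1)/12 ): whole years between birth and interview."""
    return (interview_cmc - child_cmc - 1) // 12


# Hypothetical joined rows: (woman id, line_wbh, WT_EW).
# line_wbh is None for a woman with no birth history records.
rows = [
    ("w1", 1, 1.5),
    ("w1", 2, 1.5),
    ("w2", None, 0.75),
    ("w3", 1, 1.0),
]

# Women: each woman contributes her weight once, on her first birth line
# or on the single null row (NZ turns the null into 0, and 0 < 2).
women = sum(wt for _, line, wt in rows if (line or 0) < 2)

# Children: every non-null birth line contributes the mother's weight.
children = sum(wt for _, line, wt in rows if line is not None)
```

With these rows, women is 3.25 (three women, each counted once) and children is 4.0 (three births, one mother counted twice).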
Modelling Issues
• ER Models for Databases
• Objects and Object-Oriented Concepts
• UML: Unified Modelling Language
• XML: eXtensible Markup Language
• Examples: Process analysis for HIV and AIDS Notifications
Models and Modelling
• Development process for systems
» Different from statistical models
• Explicit sets of concepts and procedures
» For identifying and describing the components of some system
» Often have associated notation
» Different levels and purposes: Semantic, conceptual, logical, physical
» Can provide useful ways of thinking about and representing data and structure
• ER models for database
» Conceptual approach to entities and relationships, more than the relational model. Not just structure, but purpose (semantics).
• Object models
» Components of a system, what is their structure and behaviour, how do they work, how are they related, how do they interact, what are the sequences in which things happen?
Structure of the PFFPS database
[Schema diagram repeated. Tables: Cluster, Household, Household Member, Eligible Woman, Birth History, Pregnancy Detail, Contraceptive Methds, Contraceptive Use, Marriage and Fertility, Background]
Object-based methods
• Representation of complex content
» Process, not just structure
» Central idea of a generic Class definition, with
> Complex elements, including other objects and collections
> Behaviour, implemented as functions (methods)
> Interfaces, that control external access to attributes and behaviour
• Object
» Instance of a Class (Person, Account)
» Has identity, that can be transmitted
> Set CurrentCust = New Person
> Set CurrentCust.Account = New Account
> Set Operator.Customer = CurrentCust
» Can be interrogated and asked to perform operations
> CurrentCust.Age; CurrentCust.Account.Balance
> CurrentCust.Print; CurrentCust.Account.Debit(£50)
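The Person and Account fragments above are written in a Visual Basic style. A rough Python equivalent is sketched below; the class bodies, the starting balance and the customer's details are invented for illustration, and only the class and attribute names follow the slide.

```python
class Account:
    def __init__(self, balance=0.0):
        self.balance = balance

    def debit(self, amount):
        """Behaviour implemented as a method of the class."""
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.account = None  # a complex element: another object

    def print_details(self):
        return f"{self.name}, age {self.age}"


class Operator:
    def __init__(self):
        self.customer = None


# Set CurrentCust = New Person
current_cust = Person("A. Customer", 42)
# Set CurrentCust.Account = New Account
current_cust.account = Account(balance=200.0)
# Set Operator.Customer = CurrentCust (the identity is passed on, not a copy)
operator = Operator()
operator.customer = current_cust

# The object can be interrogated and asked to perform operations.
current_cust.account.debit(50)
balance_seen_by_operator = operator.customer.account.balance
```

Because the operator holds the same object rather than a copy, the debit made through current_cust is visible through operator.customer.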
Some Statistical Objects
• Dataset
» Matrix of Cases and Fields
• Scale
» Set of codes and meanings for the values used in a Field
• Variable
» Combination of a Field and a Scale
• Classification
» Defined over a Scale, but allowing re-grouping
• Classification Set
» Hierarchy (or Tree) of Classifications, with mappings between levels
• Measure
» Single-valued expression derived from one or more fields, with the derivation formula
• Dimension
» Combination of a Variable and a Classification set
• Summary Table (Data Cube)
» Combination of Statistical Population, a set of Dimensions and a set of Measures
Object-Oriented Concepts
• Inheritance
» One class can be defined as an extension of another
» Inherits all the structure and methods, but can extend or alter as required
> Eligible Woman <<extends>> Household Member
> Head of Household <<extends>> Household Member
• Polymorphism
» Behaviour of a method depends on the class for which it is invoked, eg Print
> The Class is responsible for providing a suitable method (can be inherited)
• Encapsulation
» Attributes are private to the object, only exposed through methods
• Rich way of thinking about structure
» Pervasive for programming
» Appearing in database systems (some special products)
» Supported by modelling tools
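Inheritance, polymorphism and encapsulation can be shown together in a few lines. The sketch below uses the Household Member example from the slide; the attributes (line number, age, number of births) and the describe method, which stands in for Print, are assumptions made for the illustration.

```python
class HouseholdMember:
    def __init__(self, line_no, age):
        # Encapsulation: attributes are kept private (by Python convention)
        # and exposed only through properties and methods.
        self._line_no = line_no
        self._age = age

    @property
    def age(self):
        return self._age

    def describe(self):
        return f"Member {self._line_no}, age {self._age}"


class EligibleWoman(HouseholdMember):
    """Eligible Woman extends Household Member with extra structure."""

    def __init__(self, line_no, age, births):
        super().__init__(line_no, age)
        self._births = births

    def describe(self):
        # Polymorphism: the class supplies its own version of the method.
        return super().describe() + f", {self._births} births"


class HeadOfHousehold(HouseholdMember):
    """Head of Household extends Household Member and inherits describe()."""


members = [HouseholdMember(1, 12), EligibleWoman(2, 30, 3), HeadOfHousehold(3, 55)]
descriptions = [member.describe() for member in members]
```

The calling code treats every member alike; which describe runs is decided by the class of each object.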
Unified Modelling Language (UML)
• Standard (OMG) way to represent object models
» Collection of diagram types and components to represent various types of object and behaviour
» Formal specification with semantics and conventions for representation
• Recognises complexity
» Same objects can participate in multiple diagrams, with different emphasis or different level of detail or abstraction
» Multiple Levels
> From User Requirements (Use Case diagrams) down to coding and implementation (Statechart, Activity, Sequence, Component and Deployment diagrams).
• Emphasis on software implementations
» But much wider application for design
» Rich and complex. Extensible.
Package Diagram
• Types of Information in a Statistical Database
[Diagram: packages for Data Records, Aggregations, Synthetic Data, Models, Metadata, Analysis]
Use Case Diagram
• Database Users
[Diagram: LATS Database. Actors shown: System User, Data User, Data Maintainer, System Administrator, «system» External System, «system» Information Resource, «system» Analysis Tool, «Data File» Extracted Data; relationships are marked «subclass» and «extends».
Use cases shown: Use Information (Discovery, Manipulation and Display, Extraction, Transfer and Linking); Load and Maintain Information Store (Load Survey Data, Load and link metadata, Store derived information, Store results of analysis); Administer System (User roles and permissions, Access rules for information)]
Extension in Use Case Diagram
[Diagram: the Use Information use case and its extensions; relationships are marked «extends» and «uses».
Use cases shown: Discovery (Catalogue Search, Text Search, Traverse Links, View Information); Extraction (Select subset and variables, Select Data format); Manipulation and Display (Store Results of Analysis, Modelling); Transfer and Linking (Target system); User Validation (Guest, Data User); Store Results of Analysis; Modelling?
Also shown: «system» Information Resource, «system» Analysis Tool, «Data File» Extracted Data]
Activity Diagram
• Similar to familiar Flowchart
[Diagram: processing of a New Notification, in columns for Operator, Operator 2 and Supervisor.
Activities shown: Enter Identification Data, Verify Identification data, Transfer Identification data, Seek Match, Create New Patient, Transfer Identification information, Check Identification Information, Check Notification type, Create New, Enter or Check Details, Refer for Review, Enter New Notification.
Conditions shown: [Machine-readable data], [On Paper], [No Match], [Patient Found], [Not present], [Decision not clear]]
Class Diagram
• ebXML Registry Model
[Figure: class diagram. Source: Chris Nelson]
IQML Registry Model
[Class diagram. Source: Chris Nelson]
Association: associationType : Integer; associationRole : String; sourceObjectId : String; targetObjectId : String; sequenceNumber : String
Package
RegistryEntryAttribute: type : String; value : String
RegistryEntry
ClassificationNode: +Parent 0..1, +Child 0..*
Classification: classificationSchemeId : String; classificationNodeId : String
[Association multiplicities shown in the diagram: 0..*, 1..*, 0..1, 1, 2]
UML Summary
• Rich system of elements and diagrams for expressing designs
» Can be complex to learn, but rewarding
» Formal definition and semantics of elements, so designs can be precise
• Focussed on software development
» Round-trip development tools
> Rational Rose, Together, … (particularly for Java)
» Wider application for any design or specification context
» Lots of diagram software, from free to expensive
» Model building can take time, but if done in detail, implementation in software can be fast
XML: Purpose
• Designed to express complex structures of information in a way that can easily be moved between applications
• Markup Language (based on SGML)
» Text with Tags (<Field> field contents </Field>)
» Nested Tags => multiple hierarchies
<ClassificationVersion>
  <identifier> Sex </identifier>
  <items>
    <ClassificationItem> <code> 1 </code> <title> male </title> </ClassificationItem>
    <ClassificationItem> <code> 2 </code> <title> female </title> </ClassificationItem>
  </items>
</ClassificationVersion>
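To show how easily such a structure moves into an application, the sketch below reads the ClassificationVersion example with Python's standard xml.etree.ElementTree module. The XML text is the one from the slide; the choice of parser and the variable names are assumptions of the example.

```python
import xml.etree.ElementTree as ET

document = """
<ClassificationVersion>
  <identifier> Sex </identifier>
  <items>
    <ClassificationItem> <code> 1 </code> <title> male </title> </ClassificationItem>
    <ClassificationItem> <code> 2 </code> <title> female </title> </ClassificationItem>
  </items>
</ClassificationVersion>
"""

root = ET.fromstring(document)

# Text inside a tag keeps its surrounding spaces, so strip them on reading.
identifier = root.findtext("identifier").strip()

# Walk the nested items and build a code -> title lookup.
codes = {
    item.findtext("code").strip(): item.findtext("title").strip()
    for item in root.iter("ClassificationItem")
}
```

The parser knows nothing about classifications; it only understands the generic tag structure, and the application supplies the meaning.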
XML: Structure
• Generic language
» Tags not defined, only the language structure
» Generic tools to read and write XML
> Interface tools for application developers, (DOM, SAX)
> Presentation and transformation tools, style sheets (XSL)
» Tolerant applications
> Can detect omissions and skip additions
• Schema and DTD - Document Type Definition
» Rules about the specific tags and structures allowed in a specific context
> Can have generic tools to check conformance
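A tolerant application, in the sense above, might look like the following sketch. It is a hand-written check for one element type, not a DTD or Schema validator (Python's standard ElementTree module does not validate against either); the list of required tags and the extra note tag are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Tags this (hypothetical) application expects inside a ClassificationItem.
REQUIRED = ("code", "title")


def read_item(xml_text):
    """Return (known values, missing tags), ignoring any unrecognised tags."""
    element = ET.fromstring(xml_text)
    values = {child.tag: (child.text or "").strip() for child in element}
    missing = [tag for tag in REQUIRED if tag not in values]
    known = {tag: values[tag] for tag in REQUIRED if tag in values}
    return known, missing


# An addition (<note>) is skipped; an omission (<title>) is detected.
with_addition = read_item(
    "<ClassificationItem><code>1</code><title>male</title>"
    "<note>added later</note></ClassificationItem>"
)
with_omission = read_item("<ClassificationItem><code>2</code></ClassificationItem>")
```

A generic conformance tool does the same job from a declared set of rules instead of code written for each structure.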
XML: Uses and Standards
• Linear text, solves problem of exchange between applications
• Still have to define and agree on structures
» Agreement can be complex, but then easy to generate XML Schema (eg from UML)
• Many proposals and agreements on standard structures in XML
» DDI, Triple-S, MathML, GML, ebXML, …
• Only handles structure, not semantics or behaviour
HIV and AIDS Reporting System
[Diagram: HIV Notification, On Paper (Green Form) or In File, goes to the HIV File (dBase); AIDS Notification, On Paper (Blue Form) or In File, goes to the AIDS File (dBase); Death Notification, In File, goes to the Death File (dBase)]
• System
» Separate file for each type of Notification
• Problems
» Duplicate Notifications
» People have records in more than one file
» Difficult to identify and match individuals
» Analysis based on People
• Solution in Use
» Cross-linking identifiers in all files
• But
» Difficult to maintain, not reliable
HAP System: Processes
Use Case diagram of system activities
[Diagram: HAP System. Actors shown: Operator, External Reporter, External Contact, Publication section, Government. Use cases shown: Enter Notification, Review Data Quality («extends»), Quarterly Reports (New Extract), External Inquiry (New Information), ad hoc Report Request]
[Annotation: What is really important, so should be efficient]
[Annotation: Can afford to spend time getting information right]
HAP System: Data Stores
Redesign database to focus on Patients
[Diagram: Patient, with HIV Infection, AIDS Diagnosis and Death Report; Notification, with Identification and Type]
Need components for the various stages (may not appear in order)
Important to keep record of Notifications as Sources
HAP System: Process Notifications
[Diagram: use case "Submit new Notification information", listing: Match to existing Patient; New Patient; New HIV Infection, or New AIDS diagnosis; New Death report; Update Patient information. Data stores shown: Patient, with HIV Infection, AIDS Diagnosis and Death Report; Notification, with Identification and Type]
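The patient-centred processing of notifications can be sketched in code. Everything below is illustrative: the class fields, the type labels and above all the matching rule (an exact match on an identification string) are assumptions, and real matching of individuals is much harder, as noted earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    identification: str
    type: str  # assumed labels: "HIV", "AIDS" or "Death"


@dataclass
class Patient:
    identification: str
    events: dict = field(default_factory=dict)         # one entry per stage
    notifications: list = field(default_factory=list)  # every source kept


def process(patients, notification):
    """Match a notification to a patient, creating the patient if needed."""
    patient = patients.get(notification.identification)
    if patient is None:  # no match: New Patient
        patient = Patient(notification.identification)
        patients[notification.identification] = patient
    # Keep the notification as a source, even if it is a duplicate.
    patient.notifications.append(notification)
    # Record the stage once; stages may arrive in any order.
    patient.events.setdefault(notification.type, notification)
    return patient


patients = {}
for incoming in [
    Notification("P001", "AIDS"),
    Notification("P001", "HIV"),
    Notification("P001", "HIV"),  # duplicate notification
    Notification("P002", "HIV"),
]:
    process(patients, incoming)
```

Analysis can then be based on people (the patients dictionary), while the full list of notifications is retained for audit.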
Summary
• Good systems produce better data
» Easier to analyse
» Better quality results
• Good systems save work
» Less effort on cleaning data
» Better and more metadata
» Can move repetitive tasks into the system
• Get involved
» Learn the conceptual ideas
» Use the tools
» Contribute to the design and development process
References: Databases
• Date (2000). Introduction to Database systems, 7th Edition. Addison Wesley. ISBN: 0 201 68419 5.
» This is the standard ‘bible’ for relational database systems, hard work, but important if you want a deep understanding of the strengths and limitations of relational systems.
• Date & Darwen (1997). A guide to the SQL Standard, 4th Edition, Addison-Wesley
• Dowling. Database design and management using Microsoft Access. Letts. ISBN: 1 85805361 7.
» A cheap book that works through a development project using Access
References: Systems Design and UML
• Booch, Jacobson, & Rumbaugh. The Unified Modelling Language User Guide. Addison Wesley. ISBN: 0 201 57168 4.
» Further information about UML in general and in the context of Rational Software products is available at www.rational.com
• Coad, Lefebvre & De Luca. Java Modelling in Colour with UML: Enterprise Components and Process. Prentice Hall. ISBN: 0 130 11510 X.
» See also www.togethersoft.com
• Holtzblatt. Contextual Design : A Customer-Centered Approach to Systems Designs. Academic Press. ISBN: 1558604111
• Kruchten. The Rational Unified Process – an Introduction. Addison Wesley. ISBN: 0 201 70710 1
• McConnell, 1996. Rapid Development – Taming Wild Software Schedules. Microsoft Press. ISBN: 1556159005
• Reed. Developing Applications with Visual Basic and UML. ISBN: 0 201 61579 7
• Sheridan & Sekula. Iterative UML development using VB6. ISBN 1 75622701 9.
» This book introduces the ideas of iterative development and works through some projects
References: UML Software
• Microsoft Visio 2000 (Professional or Enterprise edition)
» Contains UML and database modelling tools, as well as general diagram facilities. It links in with MS Visual Studio, which has further UML code design and development tools, based on Rational Rose. A free 60-day evaluation version of Visio is available from Microsoft. New version (2002) just released.
• Rational Rose
» The market leader in UML diagram and development support, and various suites are available (such as for Analysts or Developers) containing additional tools, including a requirements management database and code generation tools. Various presentations and evaluations are available from Rational Software Ltd, www.rational.com/uk, or 01344 295000.
• Together
» Another UML modelling tool, though more focussed on Java and round-trip code generation. www.togethersoft.com
• ArgoUML
» Open Source UML tool. argouml.tigris.org
References: Other Software
• Beyond 20/20
» A dissemination and manipulation tool for multi-way tables (data cubes) aimed at statistical users. There are versions for independent use with downloaded files, and for building web dissemination servers. It is being used by a number of statistical offices, including ONS, Unesco, Statistics Canada and the US Census Bureau. The UK distributors are Forvus, at www.forvus.co.uk, and the developer site, at www.beyond2020.com, has various demos and descriptions, including downloads.
• Bridge
» A repository database for statistical metadata, developed under the Eurostat project IMIM (Integrated Metadata Information Management). See www.runsw.com
• Microsoft Office
» The Pivot Table component in Excel 2000 is a good demonstration of the manipulation facilities developed by non-statisticians for data cubes (the earlier versions have less general functionality).
» Access (all versions) is a good example of a relational database system, suitable for projects up to moderate scale (in terms of complexity and number of users, as well as physical size).
• Microsoft Project
» Provides the classic project management tools, including PERT and Gantt charts. Evaluation versions are available from Microsoft. There are also a number of heavyweight project management systems available from other companies