
Page 1: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

IT Issues for large surveys

IASC / IASS Summer School

Knowledge Discovery in Large Surveys

June 2001

Andrew Westlake

Survey & Statistical Computing

www.sasc.co.uk

Page 2:

© Survey & Statistical Computing 2001

Structure of the Presentation

• Motivation
  » What is the Problem?
• Database Issues: Relational Databases
  » Things every Statistician should know
• Modelling Issues
  » Useful additional tools for Projects

Page 3:

What is the problem?

• Why should statisticians need to know about IT?
• Is there anything important to know?
• Why not leave it all to the IT specialists?

Page 4:

Statistical Analysis is Simple

[Diagram with three elements: Raw Data, Statistical Method, Answers]

Page 5:

Statistical Analysis: not so simple

[Diagram arranged as Inputs, Process, Outputs. Labels: Raw Data; Meta Data; Prior Knowledge; Analysis History; Analysis Specifications; Population Databank; Statistical Method; Transformations and Updates; Revision; Interpretation; Record; Results; Summaries; Tables, Estimates; Conclusions]

Page 6:

Statistical Analysis: the Survey View

[Diagram. Labels: Planning; Experiment; Survey Design; Questionnaire Design; Sampling specifications; Entry Specifications; Validity Rules; Meta Data; Questionnaires or forms; Fieldwork; Data Entry and Checking; Raw Data; Data Archive; Data Bank; Statistical Procedure; Data Reduction; Conclusion; Publication; Enquiries]

Page 7:

Statistical Analysis: the Statistical Office Process View

[Diagram. Labels: Processes; Sample Design and Selection; Form Production, Data Capture, Scanning; Outliers, Imputation; Grossing, Aggregation; Standard Analyses; Disclosure Control; Dissemination; ad hoc Analyses; DBMS; Sampling Frame; Design parameters; Raw Data; Adjustments; Results]

Page 8:

Why not use Packages?

• Standard packages provide Functionality
  » Often adequate for particular tasks
    > Elements of analysis
    > Questionnaire design
    > Data Storage
  » Each implements a particular view of a problem area
    > Often helpful, but can be limiting
• Not sufficient for implementing Processes
  » Sequence, Control and Validation (Knowledge)
• Potential as Components in overall System

Page 9:

Why get Involved?

• We need good data
  » Cannot do good analysis with bad data
• We need good access to data
  » Systems often designed inflexibly, to limited initial targets
• It can save time in the long run
  » We’ll get involved anyway in cleaning up the mess
  » Easier to work with properly focussed system
• IT people have useful tools and concepts
  » Can help us to do our ordinary tasks

Page 10:

Development Opportunities

• Continuous Processes
  » Continuous or repeated surveys, business enquiries
• Repeated Processes
  » Design and analysis service for small surveys
  » Regular reporting, with adjustments and estimation
• Sharing
  » Secondary analysis of data (through Data Archive?)
  » Dissemination systems
• Very Large Projects
  » E.g. Census

Page 11:

Applications to statistical problems

• Statistics is not just about analysis
  » Need to be concerned about good data in and good use of conclusions
• Examples of development areas
  » Processing
    > HIV and AIDS notification system
    > Processing and Manipulation of results from a Demographic Survey in Pakistan
    > Result processing from Construction Industry inquiries
  » Statistical Databases
    > Support for dissemination to users and policy makers, and for further analysis
  » Metadata
    > Initiatives to standardise concepts and structures
  » Integrated statistical analysis systems
    > Analysis tools seen as a component that integrates with other components, data store, metadata, dissemination, etc.
  » Distributed resource systems
    > Data owners retain control, but users can see distributed resources as an integrated whole

Page 12:

Developing new Software and Systems

• Don’t do it unless it’s essential
  » Difficult to do well, Expensive, Time-consuming
• Do it properly
  » Don’t leave it all to the IT experts
    > They don’t know what you want
    > They don’t know what is important
  » User-Centred design (HCI) is not enough
    > Important advance, but leaves the power with the IT people
  » Learn the Concepts and Jargon
    > Useful tools for thinking about structure and design
  » Take part in the development process
    > Not too much detail (but enough)
    > Concentrate on functionality
    > Use proper tools and a proper methodology

Page 13:

Tools IT People use

• Concepts about Structure
  » Relational database ideas
  » Object-oriented concepts
  » Modelling of structure and process
• Implementation standards
  » OLAP – handling aggregate data
  » XML – Interchange of complex structures
  » Component architecture – Cooperating tools, not monolithic structures
• Design and Development
  » UML – for system design based on objects
  » Methodologies
    > Contextual Inquiry – for User Requirements
    > Feature-Driven Development, Rational Unified Process – for managing the development process

Page 14:

Database Issues: Relational Databases

• Relational Model and SQL
  » Components
  » Operations
• Data Warehouses
  » What is different?
• Aggregate Data and Data Cubes
  » Structure
  » Functionality
• Examples
  » Processing the PFFPS database in MS Access and SQL

Page 15:

Database Schema Levels

[Diagram of three schema levels: External Schema, Logical Schema, Internal Schema. Labels: User 1; User 2; Application 1; Application 2; External Views (Search; Data Entry/Update; Report; Cases for statistical analysis; Statistical results); Logical Model; Internal Model; Database]

Page 16:

Relational databases

• Relational structures and operations
  » Agreed model of components and behaviour (Codd)
  » Standardised implementations through SQL
• Views
  » Definitions of extractions and combinations can be stored and used as though they were physical tables
• Physical model choice depends on intended usage as well as logical structure
• Normalisation: method to avoid duplication and dependencies
  » Important for transaction systems
  » Easier to be consistent, faster updates
• ODBC standard for access to micro data from applications
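The idea of a stored view can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only, not the PFFPS database itself: the table names, the sex and age columns and the data are hypothetical, and SQLite stands in for whichever RDBMS is actually used.

```python
import sqlite3

# In-memory database with two normalised tables (names and data are hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE household (
    cluster INTEGER, h_no INTEGER,
    PRIMARY KEY (cluster, h_no));
CREATE TABLE member (
    cluster INTEGER, h_no INTEGER, q01 INTEGER, sex TEXT, age INTEGER,
    PRIMARY KEY (cluster, h_no, q01),
    FOREIGN KEY (cluster, h_no) REFERENCES household (cluster, h_no));
INSERT INTO household VALUES (1, 1), (1, 2);
INSERT INTO member VALUES (1, 1, 1, 'F', 34), (1, 1, 2, 'M', 8), (1, 2, 1, 'F', 61);

-- A view stores the definition of a combination, not the data.
CREATE VIEW household_size AS
    SELECT h.cluster, h.h_no, COUNT(m.q01) AS n_members
    FROM household AS h
    LEFT JOIN member AS m ON m.cluster = h.cluster AND m.h_no = h.h_no
    GROUP BY h.cluster, h.h_no;
""")

# The view is queried exactly as though it were a physical table.
view_rows = con.execute(
    "SELECT * FROM household_size ORDER BY cluster, h_no").fetchall()
print(view_rows)
```

Household size is not stored anywhere; it is derived on demand, so it cannot become inconsistent with the member records.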

Page 17:

Structure of the PFFPS database

[Entity-relationship diagram. Tables and key columns:]

Cluster: Cluster (PK)
Household: CLUSTER (PK, FK1, I1); H_NO (PK)
Household Member: CLUST_HHM (PK, FK1, I1); H_NO_HHM (PK, FK1, I1); Q01 (PK)
Eligible Woman: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1)
Birth History: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1); LINE_WBH (PK)
Pregnancy Detail: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1); LINE_WPB (PK, FK1)
Contraceptive Methds: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1); Q301 (PK)
Contraceptive Use: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1)
Marriage and Fertility: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1)
Background: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1)

Page 18:

SQL Data Manipulation

    SELECT {DISTINCT} <expression list>
    FROM <table list>
    WHERE <condition>
    GROUP BY <column list>
    HAVING <group condition>
    ORDER BY <column list>

• All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations.
• The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement.
• Retrieval is usually done through a Query interface, which generates the SQL.
• Basis of ODBC standard for linking to relational databases.
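As a rough illustration of how the clauses of a single SELECT fit together, the sketch below runs one statement that uses all of them. The registration table and its contents are invented for the example, and SQLite (through Python's sqlite3 module) is only one possible engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table: one row per registered case.
con.execute("CREATE TABLE registration (region TEXT, sex TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO registration VALUES (?, ?, ?)",
    [("North", "F", 70), ("North", "M", 52), ("North", "F", 66),
     ("South", "M", 81), ("South", "F", 45), ("East", "M", 90)],
)

query = """
SELECT region, COUNT(*) AS n, AVG(age) AS mean_age   -- expression list
FROM registration                                    -- table list
WHERE age >= 50                                      -- condition on rows
GROUP BY region                                      -- column list defining groups
HAVING COUNT(*) >= 2                                 -- condition on groups
ORDER BY region                                      -- sort order of the result
"""
# The result is itself a table: a list of rows with the selected columns.
result = con.execute(query).fetchall()
print(result)
```

Only the North region survives both filters: WHERE removes the one case under 50, and HAVING then drops the regions left with a single case.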

Page 19:

Data Manipulation – Project and Restrict

• Project operation chooses columns
    SELECT B, C FROM R1
    SELECT SEX, AGE_AT_DIAGNOSIS, DIAGNOSIS
    FROM CANCER_REGISTRATION
• Restrict chooses rows (Where clause)
    SELECT * FROM R2 WHERE A < 6
    SELECT *
    FROM CANCER_REGISTRATION
    WHERE AGE_AT_DIAGNOSIS <= 65

R1
A  B   C   D
1  b1  c1  d1
2  b2  c2  d2
3  b3  c3  d3

R2
A  B   C   D
3  b3  c3  d3
4  b4  c4  d4
5  b5  c5  d5
6  b6  c6  d6
7  b7  c7  d7
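The two statements on R1 and R2 can be tried directly. The sketch below loads the example tables from the slide into an in-memory SQLite database; the ORDER BY clauses are an addition, used only to make the output order predictable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT)")
con.execute("CREATE TABLE R2 (A INTEGER, B TEXT, C TEXT, D TEXT)")
# Rows follow the slide's pattern: (i, bi, ci, di).
con.executemany("INSERT INTO R1 VALUES (?, ?, ?, ?)",
                [(i, f"b{i}", f"c{i}", f"d{i}") for i in (1, 2, 3)])
con.executemany("INSERT INTO R2 VALUES (?, ?, ?, ?)",
                [(i, f"b{i}", f"c{i}", f"d{i}") for i in (3, 4, 5, 6, 7)])

# Project: keep columns B and C of every row in R1.
projected = con.execute("SELECT B, C FROM R1 ORDER BY A").fetchall()

# Restrict: keep all columns of the rows of R2 where A < 6.
restricted = con.execute("SELECT * FROM R2 WHERE A < 6 ORDER BY A").fetchall()

print(projected)
print(restricted)
```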

Page 20:

Data Manipulation - Join

• The Join operation is the combination of a Product operation with Restrict to select the rows of the result
    SELECT R1.*, R3.* FROM R1, R3 WHERE A = X
  or
    SELECT R1.*, R3.* FROM R1 INNER JOIN R3 ON A = X
• This is an Equi-Join
• Natural Join is based on columns with the same name
    SELECT R1.*, X, E FROM R1, R3 WHERE R1.D = R3.D
  or
    SELECT R1.*, R3.* FROM R1 NATURAL JOIN R3

Join A=X
A  B   C   R1.D  X  R3.D  E
1  b1  c1  d1    1  d3    e1
2  b2  c2  d2    2  d5    e2
3  b3  c3  d3    3  d3    e3

Natural Join
A  B   C   D   X  E
3  b3  c3  d3  1  e1
3  b3  c3  d3  3  e3
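The join results shown on the slide can be reproduced with a small script. R3's contents are inferred from the Join A=X result table (columns X, D, E). The natural join is written here as SELECT *, because the way engines handle qualified column lists with NATURAL JOIN varies; in SQLite this form returns the shared column D once. Treat this as a sketch for one engine, not as portable SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT)")
con.execute("CREATE TABLE R3 (X INTEGER, D TEXT, E TEXT)")
con.executemany("INSERT INTO R1 VALUES (?, ?, ?, ?)",
                [(1, "b1", "c1", "d1"), (2, "b2", "c2", "d2"), (3, "b3", "c3", "d3")])
# R3 as implied by the slide's "Join A=X" result.
con.executemany("INSERT INTO R3 VALUES (?, ?, ?)",
                [(1, "d3", "e1"), (2, "d5", "e2"), (3, "d3", "e3")])

# Equi-join written as Product + Restrict ...
product_restrict = con.execute(
    "SELECT R1.*, R3.* FROM R1, R3 WHERE A = X ORDER BY A").fetchall()
# ... and with explicit INNER JOIN syntax; both give the same rows.
inner_join = con.execute(
    "SELECT R1.*, R3.* FROM R1 INNER JOIN R3 ON A = X ORDER BY A").fetchall()

# Natural join: matches on the shared column name D.
natural_join = con.execute(
    "SELECT * FROM R1 NATURAL JOIN R3 ORDER BY X").fetchall()

print(inner_join)
print(natural_join)
```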

Page 21:

RDBMS Strengths

• Data Modelling
  » Useful tools for understanding data structures and flows
  » Entity-Relationship (ER) model widely used for structure
• Relational Model
  » Precise, formal mathematical specification of structure and behaviour
• SQL
  » International Standard (SQL/92), widely implemented
• Current Implementations
  » Widely available, well supported, good implementations, integration with other products, add-on market for tools

Page 22:

Data warehouse

• Most RDB systems are designed to support Transactions
  » Fast access to single (or few) records for updating
  » Not imposed by the Relational Model
• Data Warehouse systems designed for analysis, not transactions
  » Different physical model for access, but can still follow relational principles
    > Many (selected) records, structured classification variables and measures
    > Different ideas about redundancy (normalisation)
  » Extensions for manipulating Aggregated Data
  » Tools for gathering and cleaning data from source databases
  » Analysis tools (Data Mining)
    > Hypothesis generation
    > Inference
    > Often weak on statistical principles (particularly over-fitting), but improving

Page 23:

Star Schema

[Figure: star schema diagram. Source: MS OLAP pages]

Page 24:

Aggregate data

• Can produce with aggregation functions in SQL
  » Count, Sum, Group By
• Induces new concepts, not in relational model
  » Data Cube (Multi-way table), Dimensions, Classifications, Levels, Measures
• Requires new functionality
  » Exploration, Manipulation, Presentation
• Commercial products developing
  » OLAP (anticipated by Codd), usually within Warehouse
  » Usually limited to simple aggregation
  » Some specialised Statistical products

Page 25:

Aggregate Results, as Multi-way Table

[Figure: data cube with three dimensions, each with several levels]

Dimensions
• Disease Classification (ICD): Major Group, Minor Group, Detail
• Location: Country, Region, District
• Period: Year, Month, Week, Day

Measures
• Reports received
• Population at risk
• Estimated Incidence rate
• SD of Incidence rate

This example has three dimensions (so that it can be visualised). In reality, for this application, we would need at least two more, Age and Gender.

Page 26:

Functionality for Cubes

• Focussing on Subsets
  » Slice and Dice
• Change level of Detail
  » Drill down, Roll up, implies a structure of levels over classifications for each dimension
• Aggregation rules for measures
  » May not be sums, may have different base
• Derived measures
  » What is compatible, sensible
• Manipulation between cubes
• Presentation issues
  » Layout on 2 dimensions
  » Annotations and descriptions
• And so on – only basic issues addressed in OLAP products
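Roll-up, slicing and the point that not every measure aggregates by summation can be sketched in a few lines of plain Python. Everything here is hypothetical: the cube is a dictionary keyed by (district, year), the district-to-region mapping stands in for a classification hierarchy, and the figures are invented.

```python
from collections import defaultdict

# Hypothetical detailed cube: (district, year) -> (reports received, population at risk)
cells = {
    ("D1", 2000): (10, 1000),
    ("D2", 2000): (30, 2000),
    ("D3", 2000): (5, 500),
    ("D1", 2001): (12, 1000),
}

# Hypothetical hierarchy for the Location dimension: district -> region
district_to_region = {"D1": "North", "D2": "North", "D3": "South"}


def roll_up(cube, mapping):
    """Move the first dimension to a coarser level, summing both base measures."""
    totals = defaultdict(lambda: [0, 0])
    for (district, year), (reports, population) in cube.items():
        key = (mapping[district], year)
        totals[key][0] += reports
        totals[key][1] += population
    return {key: tuple(value) for key, value in totals.items()}


def slice_year(cube, year):
    """Slice: keep only the cells for one value of the Period dimension."""
    return {key: value for key, value in cube.items() if key[1] == year}


def incidence_rate(cell):
    """Derived measure: recomputed from the aggregated base measures, never summed."""
    reports, population = cell
    return reports / population


regional = roll_up(cells, district_to_region)
north_2000 = regional[("North", 2000)]
print(north_2000, incidence_rate(north_2000))
```

Adding the two district rates (0.010 and 0.015) would give 0.025, whereas the regional rate recomputed from 40 reports over 3000 people is about 0.0133. This is the kind of aggregation rule that a simple summing tool gets wrong.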

Page 27:

Relational Databases for Statistical Data

• Can support complex structure
• Can support complex processing
• Can link easily to many statistical packages
• Can do more data manipulation than most statistical packages
  » Examples from Pakistan Fertility survey

Page 28:

Structure of the PFFPS database

[Entity-relationship diagram of the PFFPS database, repeated from Page 17]

Page 29:

RDBMS: Pakistan FFPS in Access

Page 30:

PFFPS: Sex distribution for children by Age

Page 31:

PFFPS: Using the Query Interface

Page 32:

PFFPS: Generated SQL

Page 33:

PFFPS: Crosstab in Access

Page 34:

PFFPS: Joining Tables

Parity for all women of Childbearing Age

Page 35:

PFFPS: Parity for all women of Childbearing Age

Page 36:

PFFPS: Fertility Rate Calculations

• Numerators
  » Count Births by Period and Mother’s Age
• Denominators
  » Years of ‘exposure’ by Period and Age at the time
  » Months give adequate precision

Page 37:

PFFPS: Births by Mother’s Current Age and Years Ago of Birth

Page 38:

PFFPS: Fertility Numerator Calculations

• Years Ago: Int( ([interview cmc]-[child cmc]-1)/12 )
• Women: Sum( IIF ( NZ([line_wbh])<2, [WT_EW], 0 ) )
• Children: Sum( IIF ( IsNull([line_wbh]), 0, [WT_EW] ) )
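The three Access expressions can be read as follows. This Python sketch is an interpretation, not the original query. It assumes that "cmc" is a month count (so differences are in months), that the rows come from joining each eligible woman to her birth history (one row per birth, or a single row with a Null line_wbh for a woman with no births), that WT_EW is the woman's weight, and that NZ is meant to treat Null as 0. The example rows are invented.

```python
def years_ago(interview_cmc, child_cmc):
    """Whole years between birth and interview.

    Access Int() rounds towards minus infinity; Python's // does the
    same for integers.
    """
    return (interview_cmc - child_cmc - 1) // 12


def weighted_women(rows):
    """Count each woman's weight once: on her no-birth row (Null) or her first birth line."""
    return sum(r["wt_ew"] for r in rows if (r["line_wbh"] or 0) < 2)


def weighted_children(rows):
    """Count the mother's weight once for every birth row."""
    return sum(r["wt_ew"] for r in rows if r["line_wbh"] is not None)


# Hypothetical joined rows: woman 1 has two births, woman 2 has none.
rows = [
    {"woman": 1, "line_wbh": 1, "wt_ew": 1.5},
    {"woman": 1, "line_wbh": 2, "wt_ew": 1.5},
    {"woman": 2, "line_wbh": None, "wt_ew": 2.0},
]

print(years_ago(1200, 1187), weighted_women(rows), weighted_children(rows))
```

The test on line_wbh is what stops a woman with several births from being counted several times in the Women total, since the join repeats her row once per birth.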

Page 39:

Modelling Issues

• ER Models for Databases
• Objects and Object-Oriented Concepts
• UML: Unified Modelling Language
• XML: eXtensible Markup Language
• Examples: Process analysis for HIV and AIDS Notifications

Page 40:

Models and Modelling

• Development process for systems
  » Different from statistical models
• Explicit sets of concepts and procedures
  » For identifying and describing the components of some system
  » Often have associated notation
  » Different levels and purposes: Semantic, conceptual, logical, physical
  » Can provide useful ways of thinking about and representing data and structure
• ER models for database
  » Conceptual approach to entities and relationships, more than the relational model. Not just structure, but purpose (semantics).
• Object models
  » Components of a system, what is their structure and behaviour, how do they work, how are they related, how do they interact, what are the sequences in which things happen?

Page 41:

Structure of the PFFPS database

[Entity-relationship diagram of the PFFPS database, repeated from Page 17]

Page 42:

Object-based methods

• Representation of complex content
  » Process, not just structure
  » Central idea of a generic Class definition, with
    > Complex elements, including other objects and collections
    > Behaviour, implemented as functions (methods)
    > Interfaces, that control external access to attributes and behaviour
• Object
  » Instance of a Class (Person, Account)
  » Has identity, that can be transmitted
    > Set CurrentCust = New Person
    > Set CurrentCust.Account = New Account
    > Set Operator.Customer = CurrentCust
  » Can be interrogated and asked to perform operations
    > CurrentCust.Age; CurrentCust.Account.Balance
    > CurrentCust.Print; CurrentCust.Account.Debit(£50)
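The Person and Account fragments above are written in a Visual Basic style. A rough Python equivalent is sketched below; the class bodies, attribute names and starting balance are assumptions made for the illustration, since the slide shows only how the objects are used.

```python
class Account:
    """A class bundles data (balance) with behaviour (debit)."""

    def __init__(self, balance=0):
        self.balance = balance

    def debit(self, amount):
        self.balance -= amount


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.account = None  # a complex element: another object

    def describe(self):
        return f"{self.name}, aged {self.age}"


class Operator:
    def __init__(self):
        self.customer = None


# Create objects (instances) and pass their identity around.
current_cust = Person("A. Customer", 40)
current_cust.account = Account(balance=200)
operator = Operator()
operator.customer = current_cust  # the same object, not a copy

# Interrogate the object and ask it to perform an operation.
current_cust.account.debit(50)
print(operator.customer.describe(), operator.customer.account.balance)
```

Because the operator holds a reference to the same Person object, the debit made through current_cust is visible through operator.customer as well. That is what "has identity, that can be transmitted" means in practice.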

Page 43:

Some Statistical Objects

• Dataset
  » Matrix of Cases and Fields
• Scale
  » Set of codes and meanings for the values used in a Field
• Variable
  » Combination of a Field and a Scale
• Classification
  » Defined over a Scale, but allowing re-grouping
• Classification Set
  » Hierarchy (or Tree) of Classifications, with mappings between levels
• Measure
  » Single-valued expression derived from one or more fields, with the derivation formula
• Dimension
  » Combination of a Variable and a Classification set
• Summary Table (Data Cube)
  » Combination of Statistical Population, a set of Dimensions and a set of Measures
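A few of these objects can be sketched as small Python classes to show how they relate. The class layouts, the field name Q05 and the marital-status codes are all hypothetical; the slide defines the concepts only in words.

```python
from dataclasses import dataclass


@dataclass
class Scale:
    """Set of codes and their meanings."""
    codes: dict  # code -> meaning


@dataclass
class Variable:
    """A Field combined with the Scale that interprets its values."""
    field_name: str
    scale: Scale

    def label(self, code):
        return self.scale.codes[code]


@dataclass
class Classification:
    """Defined over a Scale, but allowing codes to be re-grouped."""
    scale: Scale
    groups: dict  # group label -> list of codes

    def group_of(self, code):
        for label, codes in self.groups.items():
            if code in codes:
                return label
        raise KeyError(code)


# Hypothetical example: a marital-status question and a coarser grouping.
marital = Scale({1: "single", 2: "married", 3: "widowed", 4: "divorced"})
q05 = Variable("Q05", marital)
ever_married = Classification(
    marital, {"never married": [1], "ever married": [2, 3, 4]})

print(q05.label(2), "->", ever_married.group_of(2))
```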

Page 44:

Object-Oriented Concepts

• Inheritance
  » One class can be defined as an extension of another
  » Inherits all the structure and methods, but can extend or alter as required
    > Eligible Woman <<extends>> Household Member
    > Head of Household <<extends>> Household Member
• Polymorphism
  » Behaviour of a method depends on the class for which it is invoked, eg Print
    > The Class is responsible for providing a suitable method (can be inherited)
• Encapsulation
  » Attributes are private to the object, only exposed through methods
• Rich way of thinking about structure
  » Pervasive for programming
  » Appearing in database systems (some special products)
  » Supported by modelling tools
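The three concepts can be shown together using the slide's own example of an Eligible Woman extending a Household Member. The attributes (line number, age, number of births) and the describe method are assumptions for the sketch. Note that Python enforces privacy only by convention: a leading underscore signals "internal".

```python
class HouseholdMember:
    def __init__(self, line_no, age):
        # Encapsulation: attributes are internal, exposed through methods/properties.
        self._line_no = line_no
        self._age = age

    @property
    def age(self):
        return self._age

    def describe(self):
        return f"Member {self._line_no}, age {self._age}"


class EligibleWoman(HouseholdMember):
    """Inheritance: gets all structure and methods, then extends them."""

    def __init__(self, line_no, age, births):
        super().__init__(line_no, age)
        self._births = births

    def describe(self):
        # Alters the inherited behaviour by adding detail to it.
        return super().describe() + f", {self._births} births"


# Polymorphism: the same call does the right thing for each class.
members = [HouseholdMember(1, 45), EligibleWoman(2, 30, 3)]
descriptions = [m.describe() for m in members]
print(descriptions)
```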

Page 45:

Unified Modelling Language (UML)

• Standard (OMG) way to represent object models
  » Collection of diagram types and components to represent various types of object and behaviour
  » Formal specification with semantics and conventions for representation
• Recognises complexity
  » Same objects can participate in multiple diagrams, with different emphasis or different level of detail or abstraction
  » Multiple Levels
    > From User Requirements (Use Case diagrams) down to coding and implementation (Statechart, Activity, Sequence, Component and Deployment diagrams).
• Emphasis on software implementations
  » But much wider application for design
  » Rich and complex. Extensible.

Page 46:

Package Diagram

• Types of Information in a Statistical Database

[Package diagram. Packages: Data Records; Aggregations; Synthetic Data; Models; Metadata; Analysis]

Page 47:

Use Case Diagram

• Database Users

[Use case diagram for the LATS Database]

Actors: System User; Data User; Data Maintainer; System Administrator; «system» External System; «system» Information Resource; «system» Analysis Tool. Actor relationships are marked «subclass» and «extends».

Use cases:
• Use Information: Discovery; Manipulation and Display; Extraction; Transfer and Linking
• Load and Maintain Information Store: Load Survey Data; Load and link metadata; Store derived information; Store results of analysis
• Administer System: User roles and permissions; Access rules for information

Other elements: «Data File» Extracted Data

Page 48:

Extension in Use Case Diagram

[Use case diagram expanding Use Information]

Actors: Guest; Data User; «system» Information Resource; «system» Analysis Tool

Use cases (linked by «extends» and «uses» relationships):
• Use Information
• User Validation
• Discovery: Catalogue Search; Text Search; Traverse Links; View Information
• Extraction: Select subset and variables; Select Data format
• Manipulation and Display: Store Results of Analysis; Modelling
• Transfer and Linking: Target system
• Store Results of Analysis
• Modelling?

Other elements: «Data File» Extracted Data

Page 49:

Activity Diagram

• Similar to familiar Flowchart

[Activity diagram with swimlanes: Operator 2; Operator; Supervisor]

Activities and guards: New Notification ([Machine-readable data] or [On Paper]); Enter Identification Data; Verify Identification data; Transfer Identification data; Seek Match; Create New Patient [No Match]; Transfer Identification information; Check Identification Information [Patient Found]; Check Notification type; Create New [Not present]; Enter or Check Details; Refer for Review [Decision not clear]; Enter New Notification

Page 50:

Class Diagram

• ebXML Registry Model

[Figure: class diagram. Source: Chris Nelson]

Page 51:

IQML Registry Model

[Class diagram. Source: Chris Nelson. The multiplicities on the associations (0..1, 1, 2, 0..*, 1..*) cannot be assigned to specific links from this transcript.]

Classes and attributes:
• Association: associationType : Integer; associationRole : String; sourceObjectId : String; targetObjectId : String; sequenceNumber : String
• Package
• RegistryEntry
• RegistryEntryAttribute: type : String; value : String
• ClassificationNode (self-association with roles +Parent 0..1 and +Child 0..*)
• Classification: classificationSchemeId : String; classificationNodeId : String

Page 52:

UML Summary

• Rich system of elements and diagrams for expressing designs
  » Can be complex to learn, but rewarding
  » Formal definition and semantics of elements, so designs can be precise
• Focussed on software development
  » Round-trip development tools
    > Rational Rose, Together, … (particularly for Java)
  » Wider application for any design or specification context
  » Lots of diagram software, from free to expensive
  » Model building can take time, but if done in detail, implementation in software can be fast

Page 53: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

© Survey & Statistical Computing 2001

XML: Purpose

• Designed to express complex structures of information in a way that can easily be moved between applications
• Markup Language (based on SGML)
» Text with Tags (<Field> field contents </Field>)
» Nested Tags => multiple hierarchies

<ClassificationVersion>
  <identifier> Sex </identifier>
  <items>
    <ClassificationItem> <code> 1 </code> <title> male </title> </ClassificationItem>
    <ClassificationItem> <code> 2 </code> <title> female </title> </ClassificationItem>
  </items>
</ClassificationVersion>
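To show how easily such text moves into an application, here is a minimal sketch that reads the classification example above with Python's standard `xml.etree.ElementTree` module. The function name `read_classification` and the choice of a dictionary as the result are assumptions for illustration, not part of the original slides.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<ClassificationVersion>
  <identifier> Sex </identifier>
  <items>
    <ClassificationItem> <code> 1 </code> <title> male </title> </ClassificationItem>
    <ClassificationItem> <code> 2 </code> <title> female </title> </ClassificationItem>
  </items>
</ClassificationVersion>
"""


def read_classification(xml_text):
    """Return the classification identifier and a {code: title} mapping."""
    root = ET.fromstring(xml_text.strip())
    identifier = root.findtext("identifier", default="").strip()
    items = {}
    for item in root.iter("ClassificationItem"):
        code = item.findtext("code", default="").strip()
        title = item.findtext("title", default="").strip()
        items[code] = title
    return identifier, items


identifier, items = read_classification(XML_TEXT)
print(identifier, items)  # Sex {'1': 'male', '2': 'female'}
```

The whitespace inside the tags is part of the text content, so the sketch strips it before use.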

Page 54: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

XML: Structure

• Generic language
» Tags not defined, only the language structure
» Generic tools to read and write XML
> Interface tools for application developers (DOM, SAX)
> Presentation and transformation tools, style sheets (XSL)
» Tolerant applications
> Can detect omissions and skip additions
• Schema and DTD - Document Type Definition
» Rules about the specific tags and structures allowed in a specific context
> Can have generic tools to check conformance
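The idea of a tolerant application can be sketched with the SAX interface in Python's standard library. The handler below collects the `code` and `title` of each `ClassificationItem`, silently skips any tag it does not know (an addition), and records any item that lacks an expected field (an omission). The `<note>` tag, the class name `ItemHandler` and the wording of the problem messages are hypothetical, chosen only for this illustration.

```python
import xml.sax


class ItemHandler(xml.sax.ContentHandler):
    """Collect code/title pairs, skipping unknown tags and noting omissions."""

    WANTED = ("code", "title")

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per ClassificationItem
        self.problems = []   # descriptions of omitted fields
        self._current = None
        self._field = None
        self._text = []

    def startElement(self, name, attrs):
        if name == "ClassificationItem":
            self._current = {}
        elif self._current is not None and name in self.WANTED:
            self._field = name
            self._text = []
        # Any other tag is an addition we do not understand: skip it.

    def characters(self, content):
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            self._current[name] = "".join(self._text).strip()
            self._field = None
        elif name == "ClassificationItem":
            missing = [f for f in self.WANTED if f not in self._current]
            if missing:
                self.problems.append(
                    f"item {len(self.items) + 1} lacks {missing}"
                )
            self.items.append(self._current)
            self._current = None


DOC = b"""<items>
<ClassificationItem><code>1</code><title>male</title><note>extra</note></ClassificationItem>
<ClassificationItem><code>2</code></ClassificationItem>
</items>"""

handler = ItemHandler()
xml.sax.parseString(DOC, handler)
print(handler.items)
print(handler.problems)
```

This hand-written check is not a substitute for Schema or DTD validation; it only illustrates how an application can keep working when a document has more or less than it expects.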

Page 55: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

XML: Uses and Standards

• Linear text, solves problem of exchange between applications
• Still have to define and agree on structures
» Agreement can be complex, but then easy to generate XML Schema (e.g. from UML)
• Many proposals and agreements on standard structures in XML
» DDI, Triple-S, MathML, GML, ebXML, …
• Only handles structure, not semantics or behaviour

Page 56: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

HIV and AIDS Reporting System

[Diagram labels: HIV Notification, On Paper (Green Form); HIV Notification, In File; AIDS Notification, On Paper (Blue Form); AIDS Notification, In File; Death Notification, In File; HIV File (dBase); AIDS File (dBase); Death File (dBase)]

• System
» Separate file for each type of Notification
• Problems
» Duplicate Notifications
» People have records in more than one file
» Difficult to identify and match individuals
» Analysis based on People
• Solution in Use
» Cross-linking identifiers in all files
• But
» Difficult to maintain, not reliable
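A small sketch may help show why cross-linking identifiers across separate files is hard to maintain. Every link has to be stored twice, once in each file, and nothing forces the two copies to agree. The record layout, the field names (`hiv_id`, `aids_id`, `death_id`) and the function `broken_links` are all hypothetical; the slides do not describe the actual dBase file structures.

```python
# Hypothetical records: each file stores the ids of the matching records
# in the other two files (None where no match has been recorded).
files = {
    "hiv": {
        "H1": {"aids_id": "A1", "death_id": None},
        "H2": {"aids_id": "A2", "death_id": None},
    },
    "aids": {
        "A1": {"hiv_id": "H1", "death_id": "D1"},
        "A2": {"hiv_id": None, "death_id": None},  # back-link never entered
    },
    "death": {
        "D1": {"hiv_id": None, "aids_id": "A1"},
    },
}


def broken_links(files):
    """List links that point at a missing record or are not reciprocated."""
    problems = []
    for name, records in files.items():
        for rec_id, rec in records.items():
            for other in files:
                if other == name:
                    continue
                target_id = rec.get(f"{other}_id")
                if target_id is None:
                    continue
                target = files[other].get(target_id)
                if target is None:
                    problems.append(
                        (name, rec_id, other, target_id, "missing record")
                    )
                elif target.get(f"{name}_id") != rec_id:
                    problems.append(
                        (name, rec_id, other, target_id, "no back-link")
                    )
    return problems


problems = broken_links(files)
print(problems)
```

Checks like this have to be written and run by hand in a file-per-notification design, whereas a design built around people makes the link structural.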

Page 57: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

HAP System: Processes

[Use Case diagram of system activities, within the HAP System boundary]

• Actors
» External Contact
» Operator
» External Reporter
» Publication section
» Government
• Use cases
» Enter Notification
» Review Data Quality
» Quarterly Reports (New Extract)
» External Inquiry (New Information)
» ad hoc Report Request
• One «extends» relationship is shown
• Annotations on the diagram
» What is really important, so should be efficient
» Can afford to spend time getting information right

Page 58: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

HAP System: Data Stores

Redesign database to focus on Patients

[Diagram: Patient; HIV Infection; AIDS Diagnosis; Death Report; Notification; IdentificationType]

» Need components for the various stages (may not appear in order)
» Important to keep record of Notifications as Sources
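One way to picture a patient-focused design is as a small relational schema, sketched here with Python's built-in `sqlite3` module. The table names follow the boxes in the diagram, but every column, key and constraint is an assumption made for this sketch, as is the rule that each stage occurs at most once per patient. The role of IdentificationType is not stated on the slide, so it appears only as a standalone lookup table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Patient (
    patient_id INTEGER PRIMARY KEY
);
-- Role not stated on the slide; shown as a plain lookup table.
CREATE TABLE IdentificationType (
    type_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
-- Every notification is kept as the source of what is known.
CREATE TABLE Notification (
    notification_id INTEGER PRIMARY KEY,
    patient_id      INTEGER NOT NULL REFERENCES Patient(patient_id),
    kind            TEXT NOT NULL CHECK (kind IN ('HIV', 'AIDS', 'Death')),
    received        TEXT
);
-- One row per patient per stage (an assumption of this sketch).
CREATE TABLE HIVInfection (
    patient_id      INTEGER PRIMARY KEY REFERENCES Patient(patient_id),
    notification_id INTEGER REFERENCES Notification(notification_id)
);
CREATE TABLE AIDSDiagnosis (
    patient_id      INTEGER PRIMARY KEY REFERENCES Patient(patient_id),
    notification_id INTEGER REFERENCES Notification(notification_id)
);
CREATE TABLE DeathReport (
    patient_id      INTEGER PRIMARY KEY REFERENCES Patient(patient_id),
    notification_id INTEGER REFERENCES Notification(notification_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One patient with an HIV notification and a later AIDS notification.
conn.execute("INSERT INTO Patient (patient_id) VALUES (1)")
conn.execute("INSERT INTO Notification VALUES (1, 1, 'HIV', '2001-01-10')")
conn.execute("INSERT INTO Notification VALUES (2, 1, 'AIDS', '2001-03-02')")
conn.execute("INSERT INTO HIVInfection VALUES (1, 1)")
conn.execute("INSERT INTO AIDSDiagnosis VALUES (1, 2)")

# Analysis is now based on people, not on records in separate files.
both_stages = conn.execute(
    """SELECT COUNT(*) FROM Patient p
       JOIN HIVInfection h ON h.patient_id = p.patient_id
       JOIN AIDSDiagnosis a ON a.patient_id = p.patient_id"""
).fetchone()[0]
n_notifications = conn.execute("SELECT COUNT(*) FROM Notification").fetchone()[0]
print(both_stages, n_notifications)
```

Because each stage table is keyed on the patient, a second HIV notification for the same person adds a Notification row but cannot create a second infection record.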

Page 59: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

HAP System: Process Notifications

• Submit new Notification information
» Match to existing Patient
» New Patient
» New HIV Infection, or New AIDS diagnosis
» New Death report
» Update Patient information

[Diagram: Patient; HIV Infection; AIDS Diagnosis; Death Report; Notification; IdentificationType]
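The notification process can be sketched in a few lines of Python. A new notification is matched to an existing patient or creates a new one, the relevant stage is recorded if it is not already known, and the notification itself is always kept as a source. The `match_key` field stands in for whatever identifying information the real system uses, and the class and function names are hypothetical; the slides do not specify the matching rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Notification:
    kind: str        # "HIV", "AIDS" or "Death"
    match_key: str   # hypothetical stand-in for the identifying details
    date: str


@dataclass
class Patient:
    match_key: str
    hiv_infection: Optional[str] = None
    aids_diagnosis: Optional[str] = None
    death_report: Optional[str] = None
    sources: List[Notification] = field(default_factory=list)


STAGE_FIELD = {
    "HIV": "hiv_infection",
    "AIDS": "aids_diagnosis",
    "Death": "death_report",
}


def process_notification(patients: Dict[str, Patient], note: Notification) -> str:
    """Match or create the patient, record the stage, keep the source."""
    patient = patients.get(note.match_key)
    if patient is None:
        patient = Patient(match_key=note.match_key)
        patients[note.match_key] = patient
        outcome = "new patient"
    else:
        outcome = "matched"
    attr = STAGE_FIELD[note.kind]
    if getattr(patient, attr) is None:
        # First report of this stage; a duplicate leaves the stage unchanged.
        setattr(patient, attr, note.date)
    patient.sources.append(note)
    return outcome


patients: Dict[str, Patient] = {}
# Stages may arrive out of order: here the AIDS notification comes first.
r1 = process_notification(patients, Notification("AIDS", "k1", "2001-03-02"))
r2 = process_notification(patients, Notification("HIV", "k1", "2001-01-10"))
r3 = process_notification(patients, Notification("HIV", "k1", "2001-04-15"))
print(r1, r2, r3, len(patients), len(patients["k1"].sources))
```

The duplicate HIV notification is retained in `sources` but does not overwrite the infection already recorded, which is the behaviour the separate-file system could not guarantee.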

Page 60: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

Summary

• Good systems produce better data
» Easier to analyse
» Better quality results
• Good systems save work
» Less effort on cleaning data
» Better and more metadata
» Can move repetitive tasks into the system
• Get involved
» Learn the conceptual ideas
» Use the tools
» Contribute to the design and development process

Page 61: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

References: Databases

• Date (2000). An Introduction to Database Systems, 7th Edition. Addison Wesley. ISBN: 0 201 68419 5.
» This is the standard ‘bible’ for relational database systems, hard work, but important if you want a deep understanding of the strengths and limitations of relational systems.
• Date & Darwen (1997). A Guide to the SQL Standard, 4th Edition. Addison-Wesley.
• Dowling. Database design and management using Microsoft Access. Letts. ISBN: 1 85805361 7.
» A cheap book that works through a development project using Access.

Page 62: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

References: Systems Design and UML

• Booch, Jacobson, & Rumbaugh. The Unified Modeling Language User Guide. Addison Wesley. ISBN: 0 201 57168 4.
» Further information about UML in general and in the context of Rational Software products is available at www.rational.com
• Coad, Lefebvre & De Luca. Java Modeling in Color with UML: Enterprise Components and Process. Prentice Hall. ISBN: 0 130 11510 X.
» See also www.togethersoft.com
• Holtzblatt. Contextual Design: A Customer-Centered Approach to Systems Designs. Academic Press. ISBN: 1558604111.
• Kruchten. The Rational Unified Process: An Introduction. Addison Wesley. ISBN: 0 201 70710 1.
• McConnell (1996). Rapid Development: Taming Wild Software Schedules. Microsoft Press. ISBN: 1556159005.
• Reed. Developing Applications with Visual Basic and UML. ISBN: 0 201 61579 7.
• Sheridan & Sekula. Iterative UML development using VB6. ISBN: 1 75622701 9.
» This book introduces the ideas of iterative development and works through some projects.

Page 63: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

References: UML Software

• Microsoft Visio 2000 (Professional or Enterprise edition)
» Contains UML and database modelling tools, as well as general diagram facilities. It links in with MS Visual Studio, which has further UML code design and development tools, based on Rational Rose. A free 60-day evaluation version of Visio is available from Microsoft. New version (2002) just released.
• Rational Rose
» The market leader in UML diagram and development support, and various suites are available (such as for Analysts or Developers) containing additional tools, including a requirements management database and code generation tools. Various presentations and evaluations are available from Rational Software Ltd, www.rational.com/uk, or 01344 295000.
• Together
» Another UML modelling tool, though more focussed on Java and round-trip code generation. www.togethersoft.com
• ArgoUML
» Open Source UML tool. argouml.tigris.org

Page 64: IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

References: Other Software

• Beyond 20/20
» A dissemination and manipulation tool for multi-way tables (data cubes) aimed at statistical users. There are versions for independent use with downloaded files, and for building web dissemination servers. It is being used by a number of statistical offices, including ONS, Unesco, Statistics Canada and the US Census Bureau. The UK distributors are Forvus, at www.forvus.co.uk, and the developer site, at www.beyond2020.com, has various demos and descriptions, including downloads.
• Bridge
» A repository database for statistical metadata, developed under the Eurostat project IMIM (Integrated Metadata Information Management). See www.runsw.com
• Microsoft Office
» The Pivot Table component in Excel 2000 is a good demonstration of the manipulation facilities developed by non-statisticians for data cubes (the earlier versions have less general functionality).
» Access (all versions) is a good example of a relational database system, suitable for projects up to moderate scale (in terms of complexity and number of users, as well as physical size).
• Microsoft Project
» Provides the classic project management tools, including PERT and Gantt charts. Evaluation versions are available from Microsoft. There are also a number of heavyweight project management systems available from other companies.
