05/04/2019
1
Using metadata analytics to understand
legacy digital collections prior to
disposal
David Canning, DRO - Cabinet Office
The National Archives - 4 April 2019
Our strategy is to design and build a digital archiving
capability that will operate over three phases.
2
Acquisition
Management
Review & Transfer
Using metadata analytics to understand legacy digital collections prior to disposal. David Canning, DRO - Cabinet Office. The National Archives - 4 April 2019
The Acquisition Phase collects, catalogues
and carries out first review at seven years
3
Process: Annual ‘Spring Clean’
Input: On an annual basis, every May, Business Units surrender one (financial) year’s worth of information more than 7 years after its actual or assumed creation date.
Outcome: Information is able to be collected and isolated for analysis.

Process: Records discovery at business unit level
Input: Central team work with business areas to identify main themes of record (e.g. projects, policy, legislation) over a Parliament and/or Prime Ministerial term of office.
Outcome: Information related to the work theme is located in file plan.

Process: First Review and Disposal
Input: Remove ROT according to policy rules identified via analytics.
Outcome: Ephemera is destroyed and remaining information is reconstructed into (where possible) chronological record.

Process: Cabinet Office ‘Story’
Input: A summarised version of the records discovery process, noting key events in the Department’s history during the period.

Operational Selection Policy
The Acquisition Phase collects, catalogues
and carries out first review at seven years
4
Process: First Review and Disposal
Input: Remove ROT according to policy rules identified via analytics.
Outcome: Ephemera is destroyed and remaining information is reconstructed into (where possible) chronological record.
We’ll be looking at how we are approaching this part of the Acquisition phase in detail today.
The Management Phase curates our corporate
memory and makes it accessible for use
5
Process: Assisted Search
Input: The business may ask the Archivists to carry out research or searches of the archived material in order to exploit the department’s corporate memory.
Outcome: The department is able to exploit its corporate memory.

Process: Structured digital archives
Input: Where possible, archivists reconstruct the record around the work themes identified through records discovery into chronological order, grouped into Parliaments, subdivided by financial year. Includes various media and formats in original and converted form.
Outcome: We are confident that the record exists in an organised system.

Process: Digital Catalogue
Input: The list of what is available in the Archive is made available to the business via an online catalogue. This is the department’s first line knowledge resource.
Outcome: The contents of the archive are clear and accessible to our people.
Phase Three is selection, review and transfer
6
Process: Sensitivity Review
Input: As now, reviewers will apply explanatory memoranda and judgement. E-discovery or similar may be used along with search strings to identify sensitive areas and reduce the volume requiring human eyes.
Outcome: We avoid opening material that should not be published.

Process: Appraisal and Selection
Input: As now, the Archivists work with TNA to identify records of particular historic interest. Informed by OSP and the CO Story.
Outcome: Records in scope for transfer identified and agreed.

Process: Disposal
Input: Information may be of historic value (to TNA or withheld) or continued knowledge value to Cabinet Office. Remainder considered to be of little/no value is destroyed (weeded).
Outcome: We manage the heap, keeping only what is useful or needs to be retained for security purposes.

Process: Transfer
Input: Digital ‘transfer’ to TNA with verification of integrity.
Outcome: We fulfil our obligations to the Public Records Act, supporting transparency.
Our objective was to design and test a process for first
review and disposal of digital legacy content in volume
7
11 million data objects* (3.51 TB)
706 top-level folders/drives
● We selected a file share containing the oldest digital information in the department.
● The earliest created and last modified dates were 1 January 1970!
*Some of this (circa 0.3 million) is live data still being used by the business
8
Overview and programmable search interface
9
File paths are provided in a list for visual inspection
The first step was to accurately age the data and
understand the volume of file formats for each year
10
Year    Created (volume)    Modified (volume)
1970    42                  414
1980    826                 3,940
1996    407                 16,093
1997    14,334              33,818
1998    22,370              59,700
2008    786,583             820,314
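A minimal sketch of how per-year volumes and per-year file format counts like those above could be tallied. It assumes a hypothetical inventory of (path, created, last modified) rows exported from the file share; the function names and sample rows are illustrative and are not taken from the tool used in the talk.

```python
from collections import Counter
from datetime import datetime
from pathlib import PureWindowsPath


def volumes_by_year(records):
    """Count files per created year and per last-modified year."""
    created, modified = Counter(), Counter()
    for _path, created_at, modified_at in records:
        created[created_at.year] += 1
        modified[modified_at.year] += 1
    return created, modified


def formats_by_year(records):
    """Count file extensions per last-modified year."""
    by_year = {}
    for path, _created_at, modified_at in records:
        # Lower-case the extension so .DOC and .doc are counted together.
        ext = PureWindowsPath(path).suffix.lower() or "(none)"
        by_year.setdefault(modified_at.year, Counter())[ext] += 1
    return by_year


# Hypothetical inventory rows: (path, created, last modified).
sample = [
    (r"S:\Policy\Briefing.doc", datetime(1997, 3, 2), datetime(1997, 3, 9)),
    (r"S:\Policy\Notes.wpd", datetime(1997, 5, 1), datetime(1998, 1, 4)),
    (r"S:\Finance\Budget.xls", datetime(1998, 6, 1), datetime(1998, 6, 2)),
]
created, modified = volumes_by_year(sample)
formats = formats_by_year(sample)
print(sorted(created.items()))
print(sorted(modified.items()))
```

Counting the created and modified years separately keeps both columns of the table available for later comparison.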
● 01/01/1970 is the Unix default date.
● 01/01/1980 is the MS-DOS default date.
● The metadata in files subject to corruption is often absent, but two docs were created in 2038!
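The default dates above can be screened for automatically. The sketch below is one possible approach, assuming timestamps are available as naive Python datetimes and that a snapshot date for the collection is known; the labels and the function name are hypothetical.

```python
from datetime import date, datetime

UNIX_DEFAULT = date(1970, 1, 1)   # Unix default date noted on the slide
MSDOS_DEFAULT = date(1980, 1, 1)  # MS-DOS default date noted on the slide


def classify_timestamp(ts, snapshot):
    """Label a timestamp as a known default, a future date, or plausible."""
    day = ts.date()
    if day == UNIX_DEFAULT:
        return "unix-default"
    if day == MSDOS_DEFAULT:
        return "msdos-default"
    if day > snapshot:
        # Covers impossible dates such as documents 'created' in 2038.
        return "future"
    return "plausible"


# Assumed snapshot date for illustration only.
snapshot = date(2019, 4, 4)
labels = [
    classify_timestamp(datetime(1970, 1, 1, 0, 0), snapshot),
    classify_timestamp(datetime(1980, 1, 1, 0, 0), snapshot),
    classify_timestamp(datetime(2038, 1, 19, 3, 14), snapshot),
    classify_timestamp(datetime(2003, 6, 12, 9, 30), snapshot),
]
print(labels)
```

Files labelled with a default or future date are candidates for ageing by other evidence, such as dates in the file path.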
Visual inspection of document titles suggests creation dates ranging from 2002 to 2004. Files ‘modified’ in 1970 include:
- 132 .doc
- 117 .msg
- 103 .jpg
- 22 .xls
- 14 .ppt
- 11 .pdf
plus various unreadable/exotic formats
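Where the visual inspection relies on years written into document titles, a first pass could be automated. This is a rough sketch only, assuming four-digit years between 1980 and 2039 appear in the path; the pattern and the sample paths are hypothetical, and a human check of the results would still be needed.

```python
import re

# Four-digit years 1980-2039 that are not part of a longer run of digits.
# Compact stamps such as 20021105 are deliberately not matched.
YEAR_IN_PATH = re.compile(r"(?<!\d)(19[89]\d|20[0-3]\d)(?!\d)")


def years_in_path(path):
    """Return the distinct candidate years found in a file path, sorted."""
    return sorted({int(year) for year in YEAR_IN_PATH.findall(path)})


# Hypothetical paths for illustration.
print(years_in_path(r"S:\Projects\2003\Budget 2003-04 v2.doc"))
print(years_in_path("Minutes 12 May 2002 and plan for 2004.doc"))
print(years_in_path("untitled.doc"))
```

A file whose metadata says 1970 but whose path yields 2002 to 2004 can then be aged from the path instead.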
3,863 have a date of 01/01/1980
258 .doc files (two have the year 2013 in their title)
First significant volume of
.docs and .wpd files:
2,505 .doc
1,279 .wpd
1,149 .htm
502 .xls
10,805 .docs
1,827 .wpd
21,459 .doc
zero .wpd
Between 1995 and 1999, a near doubling of
volume year on year
Print to paper policy ends and first EDRM introduced
By 2008 the volume of new documents being created was thirty-five times the volume in 1998
The metadata corruption reveals a history of poor data
management by IT suppliers
18
Document growth and
metadata corruption in legacy
fileshare 1996 to 2016
Created and last modified dates normally arise from the
same calendar year and appear to follow the same
trend. Accurate aging was achieved by using a matrix
of the two, combined with an occasional visual sense
check of file paths.
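The matrix of created and last modified years could be built along the following lines. This is an illustrative sketch, not the method used in the talk: it assumes the two years have already been extracted per file, and the helper names are hypothetical.

```python
from collections import Counter


def year_matrix(year_pairs):
    """Count files for each (created year, modified year) combination."""
    return Counter(year_pairs)


def same_year_share(matrix):
    """Share of files whose created and modified years agree."""
    total = sum(matrix.values())
    same = sum(n for (created, modified), n in matrix.items() if created == modified)
    return same / total if total else 0.0


def off_diagonal(matrix):
    """Combinations where the two years disagree, for a visual sense check."""
    return sorted((pair, n) for pair, n in matrix.items() if pair[0] != pair[1])


# Hypothetical (created year, modified year) pairs.
pairs = [(1997, 1997), (1997, 1997), (1997, 1998), (1998, 1998), (2003, 1970)]
matrix = year_matrix(pairs)
print(same_year_share(matrix))
print(off_diagonal(matrix))
```

Cells on the diagonal can be aged directly; off-diagonal cells, such as a file created in 2003 but modified in 1970, are the ones worth checking against file paths.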
Chart annotation: New IT platforms deployed
Information was identified for deletion
through a multi-layered filtering process
20
Diagram labels: Retain in Corporate Memory (Archive); File format analysis; Data classification analysis; Human analysis; Operational Selection Policy; ROT
File format analysis identified a large volume of exotic and obsolete file formats that are unreadable remnants of old software. These make up:
● 87% of the pre-1996 data;
● 36% of the 1996-2008 data; and
● 49% of the 2008-2011 data.
● The average is 36%.
Further analysis excluded:
● Image files, adding a further 25% to the ROT pile.
● HTM and HTML files, adding a further 10%.
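A sketch of how format shares like those above might be computed from extension counts. The extension groupings are assumptions made for illustration (the talk does not list which formats were treated as exotic or obsolete), and the counts are invented.

```python
from collections import Counter

# Assumed groupings; the source does not define these sets.
OFFICE = {".doc", ".msg", ".pdf", ".xls", ".ppt"}
IMAGE = {".jpg", ".jpeg", ".gif", ".bmp", ".png", ".tif"}
WEB = {".htm", ".html"}


def categorise(ext):
    """Place an extension into one of four illustrative categories."""
    ext = ext.lower()
    if ext in OFFICE:
        return "office"
    if ext in IMAGE:
        return "image"
    if ext in WEB:
        return "web"
    return "exotic-or-obsolete"


def category_shares(ext_counts):
    """Percentage of files per category, rounded to one decimal place."""
    totals = Counter()
    for ext, n in ext_counts.items():
        totals[categorise(ext)] += n
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {cat: round(100 * n / grand_total, 1) for cat, n in totals.items()}


# Invented counts for illustration.
shares = category_shares(
    {".doc": 30, ".jpg": 25, ".htm": 6, ".HTML": 4, ".tmp": 20, ".$$$": 15}
)
print(shares)
```

Running the same calculation per date band would give figures comparable to the pre-1996, 1996-2008 and 2008-2011 percentages.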
● We were left with:
○ 2 million .doc,
○ 600,000 .msg, and
○ 500,000 .pdf files.
● There were also:
○ 422,000 .xls, and
○ 138,000 .ppt files.
● .msg and .pdf were retained.
● We used keyword search to identify:
○ .docs for destruction
○ .xls and .ppt for retention
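The keyword search step can be sketched as a case-insensitive match on file paths, followed by a breakdown of the hits by file format. This assumes the search runs on metadata (paths) only; the function names and sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PureWindowsPath


def keyword_hits(paths, keyword):
    """Return paths containing the keyword, ignoring case."""
    needle = keyword.lower()
    return [p for p in paths if needle in p.lower()]


def hits_by_extension(hits):
    """Count matching paths per file extension."""
    return Counter(PureWindowsPath(p).suffix.lower() for p in hits)


# Hypothetical paths for illustration.
paths = [
    r"S:\Facilities\Catering\order form March.xls",
    r"S:\Facilities\Catering\menu.doc",
    r"S:\Policy\Alcohol\Drinks industry strategy.doc",
    r"S:\Private Office\Submission to Minister.doc",
    r"S:\Private Office\Submission annex A costs.xls",
]
submission = keyword_hits(paths, "submission")
print(len(submission), dict(hits_by_extension(submission)))
```

The per-format breakdown shows how many hits are spreadsheets, which helps decide whether a keyword points to ROT or to material needing human review.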
23
‘Catering’ produced a 100% confidence rating for ROT
Mostly .xls but also order forms
24
‘Drinks’ was not so clear cut
Some of this relates to alcohol policy
25
‘Submission’ produced 25,000 hits, 670 of
which were spreadsheets
26
Most spreadsheets are simply financial but
some are an annex to a ministerial submission
Conclusions & Next Steps
27
● Enforcement of dates in the content and the file path (i.e. document title) is
critical for the accurate aging of information.
● Metadata corruption is a problem, the downstream effects of which (risks
and costs) need to be better understood.
● Data classification helps to identify smaller groups of files to zoom in
on, but further work is required to develop a reliable (standard) set of
classifications to facilitate automation.
● More research is required on the use of data classification:
○ Our work only analysed metadata - applying classifications to document
content may produce different results.
○ Results are often not clear cut and provide a likely proportion of ROT.
The desire to review documents before destruction will depend on an
individual department’s risk appetite and knowledge of its content.
Questions & Discussion
28