4.9.07
Seite 1
Information Life Cycle, Information Value and Data Management
Prof. Rudolf Bayer, Ph.D.
Institut für Informatik, Technische Universität München
DEXA 2007
Some basic facts about data volumes
Data volumes in industry are growing by a factor of 1.7 per year (generally accepted)
1 file server with 10 TB serves 1,000 users: 10 GB of space per user
Private data volumes growing much faster? Example: MyLifeBits, see http://research.microsoft.com/barc/mediapresence/MyLifeBits.aspx
Significant shifts of document types with changing technology, e.g. video, HDTV, 3D-HDTV with 6 cameras for fisheye effect?
Value of information growing at the same rate?
Some basic facts about storage
Cost of storage is falling dramatically:
  0.5 €/GB
  500 €/TB = MyLifeBits of Gordon Bell
  500 €/PB = in a few years (Jim Gray)
Bottom Line: capacity and prices of storage are moving faster than we can capture
storage is free!
Reclamation of storage?
To reclaim 1 GB = 1000 files = 0.5 €
Bottom Line: deletion of files is a tremendous waste of time and mental effort
Limiting the capacity of personal shares in industry (< 5 GB) is a bad idea
What about access?
Disk speed: 50 MB/s raw, 5 MB/s real
Network: 100 MB/s, no longer the bottleneck
Remote disk is feasible
GREP = brute-force search:
  1 MB in < 1 sec
  1 GB in ~ 200 sec = 3 min
  1 TB in 3,000 min = 2 days
  1 PB in 2,000 days = 5.5 years
  Parallelism?
Bottom Line: access is critical!!
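The scan times above follow from dividing the data volume by the effective disk throughput. A minimal sketch, assuming decimal units (1 GB = 1,000 MB) and the 5 MB/s real throughput quoted on the slide; the name `scan_seconds` is made up for illustration. The exact results come out slightly above the rounded figures on the slide.

```python
# Brute-force (GREP-style) scan time at a fixed effective throughput.
# Assumptions: decimal units and the 5 MB/s "real" disk speed from the slide.

REAL_MB_PER_S = 5.0
SECONDS_PER_DAY = 86_400

def scan_seconds(volume_mb: float, mb_per_s: float = REAL_MB_PER_S) -> float:
    """Seconds needed to read volume_mb megabytes sequentially."""
    return volume_mb / mb_per_s

VOLUMES_MB = {"1 MB": 1, "1 GB": 1e3, "1 TB": 1e6, "1 PB": 1e9}

if __name__ == "__main__":
    for label, mb in VOLUMES_MB.items():
        s = scan_seconds(mb)
        days = s / SECONDS_PER_DAY
        print(f"{label}: {s:,.1f} s = {days:,.2f} days = {days / 365:,.2f} years")
```

Doubling the number of disks scanned in parallel halves these times, but a petabyte still takes years on a handful of spindles.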
What is memory?
Memory = Storage + Access
Access:
  Remember where you put it in a directory
  Index it for near-perfect memory
  DB of metadata: file system is a large multidimensional data warehouse (DWH)
Find it fast
State of the art: file directories?
Hierarchical organization of data
Many criteria, e.g. by subject, by author, by doc type, by time
Old problem of libraries, never solved: physical organization is one-dimensional (shelves), but logical organization is multidimensional, i.e. databases
Cleanup and reorganization of file directories, to avoid complete chaos and lengthy search
Requires discipline and time
Solution: multidimensional meta-DB about data instead of file directories?
Does indexing help?
Index height grows logarithmically, e.g. a B-tree for 1 PB of data has height 4 to 5
Access to anything in < 100 ms, compared to 5 years for GREP
Indexing helps: full-text index like Google, Google Desktop, Apple Spotlight?
Gigantic result sets; Google's PageRank determines what the world is reading!
Full-text index is extremely helpful, but not the solution
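The height claim can be checked with a few lines of arithmetic. This is a sketch under assumptions that are not on the slide: 8 KB pages, one index entry per data page, a fan-out of 500 to 1,000 entries per node, and about 10 ms per level for a random disk access.

```python
# Height of a B-tree indexing 1 PB of data.
# Assumptions (not from the slides): 8 KB pages, one index entry per page,
# fan-out 500..1000, roughly 10 ms per random disk access.

PAGE_BYTES = 8_000
PETABYTE = 10**15

def btree_height(n_entries: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= n_entries."""
    height, reach = 0, 1
    while reach < n_entries:
        reach *= fanout
        height += 1
    return height

if __name__ == "__main__":
    entries = PETABYTE // PAGE_BYTES
    for fanout in (500, 1000):
        h = btree_height(entries, fanout)
        print(f"fan-out {fanout}: height {h}, about {h * 10} ms at 10 ms per level")
```

With these assumptions the height is 5 for fan-out 500 and 4 for fan-out 1,000, which matches the "4 to 5" on the slide and stays well under 100 ms per lookup.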
Meta-DB?
Solution: DB for multidimensional properties of data/files to substitute file-directories?
by subject, by author, by doc type, by time, by GPS, by …??
Metadata must be captured automatically!
Back to industry today
Fileservers:
  serve 1,000 users
  store 10 million files
  5 TB = 5 GB share per user
  take 17 hours to back up to tape robots with LTO2 technology: 5,000 GB / 80 MB/s = 17.4 h
  are backed up on weekends, >95% unchanged
Consequences of massive fileserver consolidation in industry, dead end!
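The backup window is plain volume over tape throughput. A small sketch, assuming decimal units; the 5% figure for the changed part is taken from the ">95% unchanged" remark and treated here as an upper bound, and `backup_hours` is an illustrative name.

```python
# Backup window = volume / tape throughput.
# Assumption: decimal units (1 GB = 1000 MB).

def backup_hours(volume_gb: float, mb_per_s: float) -> float:
    """Hours needed to stream volume_gb gigabytes at mb_per_s."""
    return volume_gb * 1000 / mb_per_s / 3600

if __name__ == "__main__":
    full = backup_hours(5000, 80)            # full 5 TB share at LTO2 speed
    changed = backup_hours(5000 * 0.05, 80)  # only the (at most) 5% that changed
    print(f"full backup: {full:.1f} h, changed data only: {changed:.1f} h")
```

Backing up only the changed files would take under an hour instead of most of a day, which is the motivation for separating hot and cold data.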
Some Hypotheses
1. Bandwidth of man without pictures and video:
   less than 10 files per day: doc, pdf, xls, ppt, …
   less than 3 MB/day
2. Data have a short life cycle!!! Largely ignored!
3. But they are stored for many years on premium storage; this is the intention of MyLifeBits, but does not make sense for industry
Hypotheses are plausible, checkable, quantifiable
Life Cycle of Files
Most files have a surprisingly short life from create to last access:
User directories: 71 % 2 days, 84 % 3 days, 58 % 4 days, 50 % 1 day (compare results of Meta Group)
Project directories: 91 % 7 days, 100 % 1 day, 91 % 1 day
Group directories: 76 % 1 day, 39 % 1 month, 84 % 1 day, 85 % 1 month
Life of files is comparable to daily newspaper?
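Percentages like these can be computed from the creation and last-access timestamps of each file. The following is only a sketch of one possible way to do it, not the measurement procedure used for the slides; the function name and the sample timestamps are invented for illustration.

```python
# Share of files whose life (creation to last access) is at most max_days.
# Illustrative only: names and sample timestamps are made up.
from typing import Iterable, Tuple

SECONDS_PER_DAY = 86_400

def share_short_lived(times: Iterable[Tuple[float, float]], max_days: float) -> float:
    """times holds (created, last_access) pairs in epoch seconds."""
    spans = [(last - created) / SECONDS_PER_DAY for created, last in times]
    if not spans:
        return 0.0
    return sum(1 for span in spans if span <= max_days) / len(spans)

if __name__ == "__main__":
    sample = [(0, 3_600), (0, 2 * SECONDS_PER_DAY), (0, 30 * SECONDS_PER_DAY), (0, 0)]
    print(f"{share_short_lived(sample, 2):.0%} of the sample files lived at most 2 days")
```

On a real file system the timestamps would come from the directory metadata; whether a reliable creation time is available depends on the platform.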
User5: Life Span of Files
[Figure: life span of files. x-axis: days (1 to 43); y-axis: number of files (0 to 2000)]
Project Directory: Accesses per days back
[Figure: accesses to 19,939 files. x-axis: days back from today (1 to 61); y-axis: files accessed per day (0 to 2500)]
Message: less than 30 % of files touched in last 60 days
User5: accesses by volume
[Figure: accesses by volume. x-axis: days back from 1.9.06 (1 to 43); y-axis: volume in KB (0 to 80000)]
Life Cycle of Files
Most files have a surprisingly short life of only one to three days from create time to last access:
Life of files is comparable to daily newspaper?
Value of information decreases rapidly, e.g. departure gate of a flight or this PPT presentation
Consequences for storage and data organization?
Fileservers today
[Diagram: clients A, B, N; LAN; File-Server F with block interface; SAN; Backup-System B]
Fileserver stores all files
Simple Idea: Split of Fileserver
[Diagram: clients A, B, N; LAN; SAN; Backup-System B; link over LAN or SAN]
Shared File-Cache C = Performance Disk
File-Store S = Capacity Disk
Multilevel Architecture
[Diagram: clients A, B, N with or without private File-Caches; several File-Caches; File-Store S; Backup-System B; LAN; SAN or LAN]
Performance disk is no longer a critical resource!
Properties of FileCache Architecture 1
Mirroring of all important data
True File-Cache: with classical cache-management algorithms; write-through replaces the backup system, e.g. Tivoli TSM
Backup: only for File-Store, as a background service; continuous backup is faster than File-Server backup by at least a factor of 10; backup windows disappear
Failure Modes: F-Cache and F-Store have independent failure modes
Properties of FileCache Architecture 2
Recovery of File-Cache: instant recovery, works as empty File-Cache
Recovery of File-Store: by volume, background, minimal impact only for old files
Storage Capacity: <10% of data volume for File-Caches (32 % Meta Group) and 1/2 for File-Store
Storage Classes: FC disks for F-Cache, SATA disks for F-Store
Cost: lower than File-Servers, modulo SW cost
Availability: extremely high, comparable to PLATIN system
No lost data!
Cache Size and Algorithms
My measurements show: very small FileCaches, <10% of stored data volume
LRU replacement should work perfectly:
  Only 5-10 days per year with high activity, e.g. collecting literature for a dissertation or a project
  Very short life cycle
  LRU could displace files depending on access patterns, e.g. PDF and ZIP different from XLS files
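How a small LRU File-Cache in front of a File-Store could behave is sketched below. This is an illustration under simplifying assumptions, not the architecture's actual implementation: the File-Store is a plain dictionary, capacity is counted in bytes of file content, and the class and method names are invented.

```python
# Write-through LRU file cache in front of a slower file store.
# Illustrative sketch: the store is a dict, names are hypothetical.
from collections import OrderedDict

class FileCache:
    def __init__(self, store: dict, capacity_bytes: int):
        self.store = store              # stands in for the File-Store
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()    # name -> bytes, least recently used first

    def _insert(self, name: str, data: bytes) -> None:
        if name in self.entries:
            self.used -= len(self.entries.pop(name))
        self.entries[name] = data
        self.used += len(data)
        # Eviction loses nothing: every file is already in the store.
        while self.used > self.capacity and self.entries:
            _, old = self.entries.popitem(last=False)
            self.used -= len(old)

    def write(self, name: str, data: bytes) -> None:
        self.store[name] = data         # write-through to the File-Store
        self._insert(name, data)

    def read(self, name: str) -> bytes:
        if name in self.entries:
            self.entries.move_to_end(name)   # mark as most recently used
            return self.entries[name]
        data = self.store[name]         # cache miss: fetch from the File-Store
        self._insert(name, data)
        return data
```

Because every write goes through to the store, an empty cache after a failure is still a correct cache, which is the "instant recovery" property claimed for the File-Cache.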
FileCache Architecture for Databases?
Split relation R into 2 disjoint tables R = R1 + R2, e.g. R1 = live data, R2 = stale data
R1 := φ(R), the tuples of R satisfying a predicate φ, e.g. create_date > last_archive_date
R2 := not φ(R), e.g. create_date <= last_archive_date
View R = R1 + R2
Archiving transaction as cron job to move tuples from R1 to R2
Example of Archiving Transaction 1
create table R1 (order_received datetime, …)   -- live data
create table R2 (order_received datetime, …)   -- stale data
create view R as
select * from R1 union select * from R2
create table Archive_Date (last_move datetime, …)
declare @move_date datetime
select @move_date = last_move from Archive_Date
Generalizes to an arbitrary number of table splits, e.g. for
1. orders received,
2. orders in production,
3. orders shipped,
4. orders under warranty,
5. orders in archive
Example of Archiving Transaction 2
Transaction to move stale data from R1 to R2:
begin tran
select @move_date = DATEADD (DAY, -13, GETDATE())
insert into R2
select * from R1 where order_received < @move_date
delete from R1 where order_received < @move_date
delete Archive_Date
insert into Archive_Date values (@move_date )
commit tran
User Interface?
User sees relation R
Query q(R) = select * from R where ψ(R)
Automatic rewrite of q(R), with φ the predicate that defines R1:
  if (ψ and φ)(R) = empty then ψ(R2)
  else if (ψ and not φ)(R) = empty then ψ(R1)
  else ψ(R)
Interesting query rewrite and query optimization problem, part of the optimizer, invisible at API!
e.g. ψ(R) is order_received < 'April 3, 2007'
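For the special case of a date range on order_received, the rewrite decision reduces to comparing the range bounds with the archive date. A minimal sketch, assuming that R2 holds exactly the rows with order_received before the move date and R1 the rest; the function name and the half-open range convention are choices made for this example, not part of the slides.

```python
# Which relation must a range query lo <= order_received < hi touch?
# Assumption: R2 holds rows with order_received < move_date, R1 all others.
from datetime import date
from typing import Optional

def target_relation(lo: Optional[date], hi: Optional[date], move_date: date) -> str:
    """Return "R1", "R2" or "R" (both); None means the bound is open."""
    if hi is not None and hi <= move_date:
        return "R2"   # no live row can satisfy the query
    if lo is not None and lo >= move_date:
        return "R1"   # no archived row can satisfy the query
    return "R"        # the range straddles the archive date

if __name__ == "__main__":
    move_date = date(2007, 8, 22)
    print(target_relation(None, date(2007, 4, 3), move_date))
```

A general optimizer would have to prove that the conjunction of the query predicate and the partition predicate is unsatisfiable; the interval test above is the simplest instance of that check.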
Integration of FileStore with ILM
FileStore has very low load
Stores all data permanently and secured via backup
Can manage versions
Has database of metadata = 0.1 % of data volume = 5 GB for a 5 TB file server
Can obey complex ILM rules according to Sarbanes-Oxley
Multidimensional database for metadata plus full text:
  Domain and user
  Directory path
  Filename
  Version number
  File extension
  Time of creation
  Time of last update
  GPS position
  Etc.
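A metadata database with these attributes could look roughly as follows. This is a sketch on SQLite with invented table and column names; it uses an ordinary composite index, which is only a stand-in for a true multidimensional index.

```python
# Metadata table for files, queried by several attributes at once.
# Assumptions: table and column names, ISO-8601 timestamps, SQLite as the DB.
import sqlite3

SCHEMA = """
create table file_meta (
    domain text, owner text, dir_path text, filename text,
    version integer, extension text,
    created text, updated text,
    gps_lat real, gps_lon real
);
create index file_meta_ext_created on file_meta (extension, created);
"""

def find(conn: sqlite3.Connection, extension: str, created_from: str, created_to: str):
    """Files of one type created in the half-open range [created_from, created_to)."""
    cur = conn.execute(
        "select dir_path, filename from file_meta "
        "where extension = ? and created >= ? and created < ? order by created",
        (extension, created_from, created_to),
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "insert into file_meta values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        ("example", "user5", "/talks", "ilm.ppt", 1, "ppt",
         "2007-09-01", "2007-09-03", None, None),
    )
    print(find(conn, "ppt", "2007-09-01", "2007-10-01"))
```

Queries that restrict several attributes at once replace the walk through a directory hierarchy; the user no longer needs to remember where a file was put.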
The Quest for Eternity
Companies are forced by law to preserve their data: tax regulations, Sarbanes-Oxley Act
What about people?
What are people doing with a complete, perfect life memory?
Spend the second half of your life to watch the first half on TV?
When do you stop recording?
Don't watch, record for your children or for alibi!
Quest for Eternity
Private Datavolumes
Human life = 100 years
  = 1,200 months = 1.2 TB/life with 1 GB/month (Gordon Bell's life)
  = 36,000 days = 36 TB/life with 1 GB/day (digitizing private videos)
Person Tracking (mobile phones)
Human life = 100 years = 1,200 months = 36,000 days
  * 60 positions/day = 2*10^6 positions/life = 50-100 MB/life
Size of DB for Metadata
Number of new objects: 10 per day = 360,000/life, peanuts for a DB
Automatic collection of metadata is easy:
  Field                      Bytes
  Date created               10
  Last change                10
  Last access                10
  Object name                50
  Title                      100
  Directory path or URI      200
  Object type                2
  Author                     50
  Location of creation       50
  People present             100
  Version number             2
  Total per document         < 1 KB
Total per personal life Meta-DB < 500 MB = 1 stick
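The totals can be verified by summing the field sizes. A short sketch, assuming decimal units and the 10 objects per day over 36,000 days used on the slides; the dictionary keys are just labels for the listed fields.

```python
# Size of a lifetime metadata DB from the per-field byte counts on the slide.
# Assumption: decimal units (1 MB = 10**6 bytes).

FIELD_BYTES = {
    "date_created": 10, "last_change": 10, "last_access": 10,
    "object_name": 50, "title": 100, "directory_path_or_uri": 200,
    "object_type": 2, "author": 50, "location_of_creation": 50,
    "people_present": 100, "version_number": 2,
}
OBJECTS_PER_DAY = 10
DAYS_PER_LIFE = 36_000

def record_bytes() -> int:
    """Bytes of metadata per document."""
    return sum(FIELD_BYTES.values())

def life_megabytes(bytes_per_record: float) -> float:
    """Metadata volume for one life, in MB."""
    return OBJECTS_PER_DAY * DAYS_PER_LIFE * bytes_per_record / 1e6

if __name__ == "__main__":
    print(f"{record_bytes()} bytes per document")
    print(f"{life_megabytes(record_bytes()):.0f} MB per life, "
          f"{life_megabytes(1000):.0f} MB at a full 1 KB per document")
```

Even at the 1 KB upper bound per document the lifetime total stays below the 500 MB quoted on the slide.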
Multidimensional Meta-DB
Meta-DB is a multidimensional DWH for all data objects
Multidimensional indexing allows high-precision recall
Use UB-Tree as index, works well up to that dimensionality
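A UB-Tree stores multidimensional keys in a B-tree by mapping each point to its position on the Z-order (Morton) curve. The sketch below shows only that address computation by bit interleaving, not the range-query algorithm; the function name and the fixed bit width per dimension are choices made for the example.

```python
# Z-address (Morton code) of a multidimensional point by bit interleaving.
# Sketch of the address computation only; names are illustrative.
from typing import Sequence

def z_address(coords: Sequence[int], bits: int) -> int:
    """Interleave the low `bits` bits of each non-negative coordinate."""
    dims = len(coords)
    z = 0
    for bit in range(bits):
        for d, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * dims + d)
    return z

if __name__ == "__main__":
    # Points that are close in every dimension tend to get nearby addresses.
    for point in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 3)]:
        print(point, z_address(point, bits=2))
```

Each metadata attribute (author, type, time, position, ...) would first have to be mapped to an integer coordinate before it can take part in such an address.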
Complete, perfect life memory
What will it be used for? (Schäuble)
How will it affect our lives?
Think about it!