TRANSCRIPT
BP 301: What’s your second most valuable asset and nearly doubles every year?
Henning Kunz, panagenda Consulting
Florian Vogler, panagenda
Introduction
Henning Kunz
– Services and consulting guy in the collaboration space for about 20 years
– More infrastructure than development
– With panagenda, more and more analytics as a basis for agile transformation projects
Florian Vogler
– Client management guru for almost all his life
– Development and infrastructure
– panagenda's visionary figurehead
Agenda
Speaking of the 2nd most valuable asset and introduction
Why are we doing this?
Where in the world are files?
Collecting BIG data – Basics
Statistics – Basics
Collecting from the file system
Collecting from IBM Notes & Domino
Sample reports
Possibilities are endless (this session is not)
Before we start with the introduction
Answer to 2nd most valuable asset
1st most valuable asset?
What can you expect from this session?
Thoughts on a company's file inventory
Some code snippets to gain inventory information
Demo is based on inventory information collected from our personal production notebooks (and a demo backend system) using the code snippets
– Visualization is prepared using a Visual Analytics Tool
Some ideas on how to use the outcome
FILES ARE EVERYWHERE
A file – from easy …
In the easiest sense, a file has
– a potentially mind-boggling number of attributes, e.g.
• folder structure
• filename
• size
– Content (which may result in attributes, too)
A file – … to complex
Content is king!
– Zip files – header vs. files vs. file
• Zipping the same files twice can produce two zip files with different hashes, even though the contained files are identical
– Office files (pptx, xlsx, …)
• Contain a lot of information “inside”
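The zip point can be illustrated with a minimal Python sketch (Python is used here as a stand-in, it is not one of the session's scripts). It assumes the two archives differ only in the timestamp stored in their headers; the file names and contents are made up for illustration. Hashing the members instead of the archive shows that the content is the same.

```python
import hashlib
import io
import zipfile


def build_zip(files, date_time):
    """Build an in-memory zip; date_time is the timestamp stored per member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in sorted(files.items()):
            info = zipfile.ZipInfo(name, date_time=date_time)
            archive.writestr(info, data)
    return buffer.getvalue()


def archive_hash(zip_bytes):
    """Hash of the zip file itself (headers included)."""
    return hashlib.md5(zip_bytes).hexdigest()


def member_hashes(zip_bytes):
    """Hash of each contained file, independent of the zip headers."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: hashlib.md5(archive.read(info)).hexdigest()
            for info in archive.infolist()
            if not info.is_dir()
        }


# Hypothetical sample content, zipped "twice" with different timestamps.
files = {"report.txt": b"quarterly numbers", "notes.txt": b"meeting notes"}
first = build_zip(files, (2015, 1, 27, 10, 0, 0))
second = build_zip(files, (2015, 1, 28, 10, 0, 0))

print(archive_hash(first) == archive_hash(second))    # archives differ
print(member_hashes(first) == member_hashes(second))  # contents match
```

For duplicate detection this suggests comparing member hashes rather than archive hashes, at the cost of opening every archive.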
Why are we doing this (=Why are files so important / interesting)?
Storage Amount = Storage (and backup!) Cost
– Increase free disk space, reduce cost
– Beware of DAOS, Centera, … before you get too excited
Understand which (types of) files are created (rather: originated), updated, …
… and by whom
– identify knowledge / working-together clusters
– Social Business
Going further (not covered in this session)
– Security & Compliance
– Content
– Beyond Windows (Linux, Mac, Mobile, …)
Mostly for French and German attendees
Some of the use cases and examples covered could be a problem with regard to works council regulations
Rethink the use case without end user information
– E.g. instead of “who all has (created) PowerPoint files”, ask “how many PowerPoint files do we have across how many users” (min/avg/max – without information about actual end users)
For everyone: Things to be aware of
The name of a file (or folder) can be a big problem on its own
– 2015-01-27_money_transfers_to_carribean_account_789XA3_PW_richmaker.xls
– Layoff_in_german_office_Q2_2015.docx
– Increase_salary_of_mr_jones_to_200000.txt
The mere existence of a file (or folder) can create (at least an ethical) problem on its own
– On someone's laptop you find confidential, unauthorized, inappropriate information
• e.g. internal DWG (CAD) files, a copy of the minutes from the last meeting of the board of management, customer data, performance figures, …
– And now?
Where files are stored
“Local” file system
– “Fixed” disks (C:, D:, …)
– Local removable disks – A:, B:, USB sticks, CD-ROM, …
Network file system
– Mounted / mapped / UNC / synched (offline files)
– File server
NSFs (Email / Applications)
– Local (with or without consistent ACL, with or without DB level encryption)
– Server
– Beware of reader fields, author fields, …
Connections Files, FileNet, Documentum, SharePoint, Dropbox, Teamdrive, …
How to collect: WYSIWYG or AYCE
“WYSIWYG” (what you see is what you get)
– Local execution = in context of current OS user
• Other users have to log in, too (may never happen)
– Network scanning in context of current OS user
• Shared network drives across departments/company
“AYCE” (all you can eat)
– Local execution as Admin (e.g. with SuRunAs)
• Includes Windows profiles from all users
– Batch network scanning
– Root mount scanning
What to collect
Simple file attributes
– Name, “extension”, size, created, last modified, … (dates and time zones!)
Complex (but much more useful) file attributes
– Office properties like Author, Subject, last printed, last whatever, …
– Zip / Rar / 7z / gzip / …
– (e.g. MD5) hash (same same vs. similar)
Very complex file attributes
– Security (R/W/…) – NSF & file system
– Fingerprints (“Linux magic numbers”)
Hilariously complex: Content (also: similar instead of just same)
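The “magic numbers” idea can be sketched in a few lines of Python (again a stand-in, not part of the session's scripts): the first bytes of a file often identify its real type, whatever the “extension” says. The signature table below is a small, deliberately incomplete selection of well-known signatures, and the labels are made up for this sketch.

```python
# A few well-known file signatures; real tools know many more.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also docx, xlsx, pptx, jar, ...
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy doc, xls, ppt, msi
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def fingerprint(header):
    """Return a type label for the leading bytes of a file."""
    for magic, label in MAGIC_NUMBERS:
        if header.startswith(magic):
            return label
    return "unknown"


def fingerprint_file(path):
    """Read only the first bytes of a file and fingerprint them."""
    with open(path, "rb") as handle:
        return fingerprint(handle.read(16))


print(fingerprint(b"%PDF-1.4\n"))        # pdf
print(fingerprint(b"PK\x03\x04\x14\x00"))  # zip
```

Comparing the fingerprint with the extension is one cheap way to spot renamed files, for example a zip container that claims to be a .txt.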
Mission impossible
“Impossible” file attributes
– Not accessible
– Not visible from viewpoint of scanner
– Not used (e.g. multiuser PCs where a user doesn't log on again)
– Encrypted (e.g. Zip with password)
Examples of what not to do
Do not harm human beings, animals, plants or goods with your findings
– Be good, do good, be a hero!
Do not analyze for files with the same filename
– Approx. 60-70% of all files on a single machine share their filename with another file
Do not just delete duplicates
Also: do not do nothing
A VERY SHORT STATISTICS PITCH
Frequency distribution
In statistics, a frequency distribution is a table that displays the frequency of various outcomes in a sample.
E.g. session survey feedback by 100 session participants
Answer                            Count
Speaker skill was brilliant       15
Speaker skill was good            60
Speaker skill was ok              12
Speaker skill was somewhat poor   8
Speaker skill was very poor       5
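As a small Python sketch, the survey table above is simply a count per distinct outcome. The list of raw answers is reconstructed from the counts on the slide; the relative frequency column is an addition for illustration.

```python
from collections import Counter

# Raw sample rebuilt from the slide's counts (100 participants).
answers = (
    ["brilliant"] * 15
    + ["good"] * 60
    + ["ok"] * 12
    + ["somewhat poor"] * 8
    + ["very poor"] * 5
)


def frequency_distribution(sample):
    """Map each outcome to (absolute count, relative frequency)."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {outcome: (count, count / total) for outcome, count in counts.items()}


for outcome, (count, share) in frequency_distribution(answers).items():
    print(f"Speaker skill was {outcome:<14} {count:>3}  {share:.0%}")
```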
Grouped data
A raw dataset can be organized by constructing a table showing the frequency distribution of the variable (whose values are given in the raw dataset). Such a frequency table is often referred to as grouped data.
E.g. time taken to answer a survey by 15 participants, sorted into equal-width intervals (bins) or qualitative characteristics
Time taken [s]: 10 11 9 10 14 20 11 9 14 10 9 13 12 21 24
Interval             Count
t < 5 s              0
5 s <= t < 10 s      3
10 s <= t < 15 s     9
15 s <= t < 20 s     0
20 s <= t < 25 s     3
Interval                     Count
Fast (t < 10 s)              3
Normal (10 s <= t < 20 s)    9
Slow (t >= 20 s)             3
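Both groupings can be reproduced from the 15 raw values with a short Python sketch. The function names and the label thresholds are taken from or modelled on the tables above; values outside the requested range are ignored, which is an assumption of this sketch.

```python
from collections import Counter

times = [10, 11, 9, 10, 14, 20, 11, 9, 14, 10, 9, 13, 12, 21, 24]


def bin_counts(values, width, start, stop):
    """Count values per half-open interval [low, low + width)."""
    counts = {low: 0 for low in range(start, stop, width)}
    for value in values:
        low = start + ((value - start) // width) * width
        if low in counts:  # values outside [start, stop) are skipped
            counts[low] += 1
    return counts


def speed_label(seconds):
    """Qualitative grouping as used in the second table."""
    if seconds < 10:
        return "Fast"
    if seconds < 20:
        return "Normal"
    return "Slow"


print(bin_counts(times, width=5, start=0, stop=25))
print(Counter(speed_label(t) for t in times))
```

The same binning step is what a histogram draws: one bar per interval, with the count as its height.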
Histogram
A histogram is a graphical representation of the distribution of data. To construct a histogram, the first step is to "bin" the range of values and then count how many values fall into each interval.
E.g. time needed in [s] to rush from Dolphin Southern Hemisphere 1 to Swan Mockingbird 1-2 (sample of 50 participants)
Collect / Measure
197 187 186 179 156 179 181 173 188 188 163 202 174 178 193 169 192 170 185 172 192 169 179 174 164 181 161 137 204 167 198 185 186 148 148 185 197 231 175 184 176 175 176 187 210 180 174 180 204 158
Bin and Count
Rushtime [s]   Count
140            1
150            2
160            5
170            10
180            13
190            11
200            6
210            1
220            0
230            1
[Figure: histogram of Count (y-axis, 0 to 14) over Rushtime [s] (x-axis, 140 to 230)]
SCAN FILESYSTEMS
Local
Scan local Windows based drives (locally mounted hard disks, portable drives or mounted drives)
Using PowerShell – Script 1
– Collect file system information with MD5 and SHA1 hashes
– Needs PowerShell V4
– Uses: Scripting.FileSystemObject, get-acl cmdlet, get-hash cmdlet
– Run locally with ‘super user’ rights
3 result files
– Folders (Folder Path, LastWriteTime, Size, FileCount, Depth, FolderName)
– ACLs (Folder Path, IdentityReference, AccessControlType)
– Files (Folder Path, FileName, CreationTime, LastWriteTime, Size, Extension, MD5, SHA1)
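The session's Script 1 is PowerShell and is not reproduced in this transcript. As a rough illustration of the “Files” result only, here is a minimal Python sketch that walks a folder tree and writes the same columns to CSV. The function names are made up, folders and ACLs are left out, and timestamps are normalized to UTC as one way to handle the time zone issue.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

FIELDS = ["FolderPath", "FileName", "CreationTime", "LastWriteTime",
          "Size", "Extension", "MD5", "SHA1"]


def hash_file(path, chunk_size=1024 * 1024):
    """Compute MD5 and SHA1 in one pass without loading the whole file."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def utc_iso(timestamp):
    """Render a file timestamp as an ISO string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def scan_files(root):
    """Yield one inventory row per readable file below root."""
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            try:
                info = os.stat(path)
                md5, sha1 = hash_file(path)
            except OSError:
                continue  # not accessible from the scanner's viewpoint
            yield {
                "FolderPath": folder,
                "FileName": filename,
                # st_ctime is the creation time on Windows only; on Unix
                # it is the time of the last metadata change.
                "CreationTime": utc_iso(info.st_ctime),
                "LastWriteTime": utc_iso(info.st_mtime),
                "Size": info.st_size,
                "Extension": os.path.splitext(filename)[1].lower(),
                "MD5": md5,
                "SHA1": sha1,
            }


def write_inventory(root, csv_path):
    """Write the file inventory of root to csv_path; return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for row in scan_files(root):
            writer.writerow(row)
            count += 1
    return count
```

A call such as write_inventory(some_folder, "files.csv") only sees what the current OS user can read, which is the “WYSIWYG” case described earlier.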
A short note on PowerShell Execution Policy
There is something like execution security in PowerShell
Execution Policy is set to Undefined by default, which is treated as Restricted
– Thus it permits individual commands from console, but will not run scripts
Policy types
– Restricted, AllSigned, RemoteSigned, Unrestricted, Bypass, Undefined
Scope
– LocalMachine, CurrentUser, Process
To see current settings
Get-ExecutionPolicy -List
To set
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
RemoteSigned allows execution of “own” unsigned scripts
– “own” means scripts written/edited/saved in PowerShell ISE on local machine
– we will not talk about signing PowerShell scripts in this session, it's not like “sign using current user's ID”
http://technet.microsoft.com/en-us/library/hh847748.aspx
PowerShell Snippet
Enhancement: Collecting Office attributes for .doc* files
Scan local Windows based drives (locally mounted hard disks, portable drives or mounted drives)
Using PowerShell – Script 2
– Collect file system information with MD5 and SHA1 hashes and .doc* attributes
– Uses: -ComObject Word.Application BuiltInDocumentProperties
3 result files
– Folders (Folder Path, LastWriteTime, Size, FileCount, Depth, FolderName)
– ACLs (Folder Path, IdentityReference, AccessControlType)
– Files (Folder Path, FileName, CreationTime, LastWriteTime, Size, Extension, MD5, SHA1, Created, Author, Title, Last print date)
Snippet 2 BuiltinDocumentProperties
1 Title
2 Subject
3 Author
4 Keywords
5 Comments
6 Template
7 Last author
8 Revision number
9 Application name
10 Last print date
11 Creation date
12 Last save time
13 Total editing time
14 Number of pages
15 Number of words
16 Number of characters
17 Security
18 Category
19 Format
20 Manager
21 Company
22 Number of bytes
23 Number of lines
24 Number of paragraphs
25 Number of slides
26 Number of notes
27 Number of hidden Slides
28 Number of multimedia clips
29 Hyperlink base
30 Number of characters (with spaces)
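The session's Script 2 reads these properties through the Word COM object. As a different technique, sketched here in Python, the same kind of metadata can be read from a .docx without Word, because a .docx is a zip package that usually carries its core properties in docProps/core.xml. This does not work for legacy .doc files, only a few of the properties above are mapped, and the sample package below is fabricated for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Property label (as in the list above) -> element in core.xml
PROPERTIES = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Creation date": "dcterms:created",
    "Last print date": "cp:lastPrinted",
}


def read_core_properties(docx):
    """Read selected core properties from a .docx path or file object."""
    with zipfile.ZipFile(docx) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}  # package has no core properties part at this path
    root = ET.fromstring(xml_bytes)
    result = {}
    for label, tag in PROPERTIES.items():
        element = root.find(tag, NAMESPACES)
        result[label] = element.text if element is not None else None
    return result


def sample_package():
    """Build a tiny, hypothetical docx-like package for demonstration."""
    core = (
        '<cp:coreProperties xmlns:cp="%(cp)s" xmlns:dc="%(dc)s" '
        'xmlns:dcterms="%(dcterms)s">'
        "<dc:title>Budget</dc:title><dc:creator>Jane Doe</dc:creator>"
        "<dcterms:created>2015-01-27T09:00:00Z</dcterms:created>"
        "</cp:coreProperties>" % NAMESPACES
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


print(read_core_properties(sample_package()))
```

The upside of this route is that it needs no Office installation on the scanning machine; the downside is that it covers only the newer XML-based formats.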
Collecting inventory from “Fileserver 2.0”
Scan SharePoint inventory using PowerShell
– Script 3. Collect item information from SharePoint Server
– Uses: SharePoint cmdlets
– Result: Web Application, Site, Web, List, Item ID, Item URL, Item Title, Item Created, Item Modified, File Size, Author, Versions, Filename
Snippet 3
SCAN FILES IN NSF CONTAINERS
IBM Notes & Domino
NSFs (Email / Applications)
– Local (with or without consistent ACL, with or without DB level encryption)
– Server
– ACL, reader fields, author fields, document / field encryption, …
– zip-file content
– Fields in general (Subject, from, to, cc:, bcc:, created, modified, Body, …)
• The Subject of a Notes document can be just as problematic as the name of a file (attachment)
• Actually this may apply to pretty much any field
• Note: Message Tracking ID
– ATTNQ# (today‘s *00#.*)
Fs_free_main.exe ConnectED 2015 Edition
Special stand-alone version to scan local file system and NSF files
Inspects zip file content (deliberately limited to filesystem)
Runs from command line with parameters
– Uses local notes.ini and user.id / server.id
– Therefore in security context of used ID file (ACLs, Reader Fields, DB/Document Encryption)
– Lists (unprotected) zip file content
– Based on C-API
Result: Path,Size,Modified,md5,sha-1
CHART TIME ….EXAMPLE RESULTS DEMO…
Script 1: 16,728 folders, 127,000 files
Script 2: 1,150 doc files
Script 3: 1,316 SP files
Fs.freemain: 1,200,000 records (250 MB)
POSSIBILITIES ARE ENDLESS….
Beyond the shown
Until now we just analyzed what's out there
How could we use that information?
Let's think about some interesting use cases
File Server Migrations – File Consolidations
Use the analysis to understand your file inventory with respect to
– File types: which files fit into the target system (e.g. office files, pdf, jpg, png, wav versus xml, properties, files from non-office applications)
– And their
• Volume distribution
• Count distribution
– Uniqueness of local files
– Time stamps (retention, usage hint)
And act/size based on that information
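The count and volume distribution per file type can be sketched in Python from inventory rows that carry an extension and a size, such as the “Files” result described earlier. The sample rows are hypothetical and the column names are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical inventory rows (extension and size in bytes).
inventory = [
    {"Extension": ".docx", "Size": 120_000},
    {"Extension": ".DOCX", "Size": 80_000},
    {"Extension": ".pdf", "Size": 500_000},
    {"Extension": ".xml", "Size": 2_000},
    {"Extension": "", "Size": 10},
]


def distribution_by_extension(rows):
    """Return {extension: {"count": n, "bytes": total}} for the rows."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for row in rows:
        extension = row["Extension"].lower() or "(none)"
        totals[extension]["count"] += 1
        totals[extension]["bytes"] += int(row["Size"])
    return dict(totals)


# Largest volume first: a first hint for sizing the target system.
ranked = sorted(distribution_by_extension(inventory).items(),
                key=lambda item: item[1]["bytes"], reverse=True)
for extension, stats in ranked:
    print(f"{extension:<8} {stats['count']:>3} files {stats['bytes']:>9} bytes")
```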
Suggest Community Clusters
Based on analysis outcomes
– Inventory overlap
– Same authors, editors
– Same access rights
– Metadata
Think of it as a one-time functionality to rearrange your files world in the first step
Could be used in the context of an attachment like SwiftFile* in the second step
– may require content analysis
*http://www-01.ibm.com/support/docview.wss?uid=swg24034409
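One possible way to quantify “inventory overlap” is the Jaccard similarity between the sets of file hashes found at two sources. This Python sketch is an assumption about how such a suggestion could be computed, not the presenters' method; the source names are anonymized placeholders and the 0.5 threshold is arbitrary.

```python
from itertools import combinations


def jaccard(first, second):
    """Share of hashes two inventories have in common (0.0 to 1.0)."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def suggest_clusters(hashes_by_source, threshold=0.5):
    """Return (source, source, score) pairs whose overlap meets the threshold."""
    pairs = []
    sources = sorted(hashes_by_source.items(), key=lambda item: item[0])
    for (name_a, hashes_a), (name_b, hashes_b) in combinations(sources, 2):
        score = jaccard(hashes_a, hashes_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical inventories: sets of MD5 values per scanned source.
hashes_by_source = {
    "laptop-A": {"h1", "h2", "h3"},
    "laptop-B": {"h2", "h3", "h4"},
    "laptop-C": {"h9"},
}

print(suggest_clusters(hashes_by_source))
```

Working with anonymized source labels rather than user names keeps this closer to the works council advice given earlier.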
Companies File Locations
You do not have to store this file again …
As a hint for a so far unknown collaboration cluster / community
Used in the context of an attachment inside Notes
– Shows all MD5 identical files found at formerly scanned locations inside the company
Biggest challenges
– Real time performance (needs ongoing periodic scanning of all sources)
– Security trimming (the accounts & groups of all scanned sources have to be resolved/mapped)
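The lookup itself is simple once the scans exist. The following Python sketch assumes an index from MD5 to previously scanned locations, where each location already carries the resolved set of accounts and groups allowed to read it. That resolution is the hard part named above and is not shown; all paths, hashes and group names are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Index scan records by MD5 so lookups avoid a full rescan."""
    index = defaultdict(list)
    for record in records:
        index[record["md5"]].append(record)
    return index


def known_locations(index, md5, principals):
    """Locations of MD5-identical files that the given principals may see."""
    return [
        record["path"]
        for record in index.get(md5, [])
        if principals & record["readers"]  # security trimming
    ]


# Hypothetical scan results with already resolved reader groups.
records = [
    {"md5": "abc", "path": r"\\fs01\sales\offer.pptx", "readers": {"sales"}},
    {"md5": "abc", "path": r"\\fs01\board\offer.pptx", "readers": {"board"}},
    {"md5": "def", "path": r"\\fs01\public\logo.png", "readers": {"everyone"}},
]
index = build_index(records)

# A sales user attaching a file with hash "abc" sees only the sales copy.
print(known_locations(index, "abc", {"sales", "everyone"}))
```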
THANK YOU
NOTE: POSSIBILITIES ARE ENDLESS – MORE SO BEYOND FILES
Come and visit us in the TechnOasis #PED G3 A-C!
Download the latest slide deck and code snippets: www.panagenda.com/connected2015files