bp301: q: what’s your second most valuable asset and nearly doubles every year?

Post on 18-Jul-2015


BP 301: What’s your second most valuable asset and nearly doubles every year?

Henning Kunz, panagenda Consulting
Florian Vogler, panagenda

Introduction

Henning Kunz
– For about 20 years a services and consulting guy in the collaboration space
– More infrastructure than development
– With panagenda, more and more analytics as a basis for agile transformation projects

Florian Vogler
– Client management guru for almost all his life
– Development and infrastructure
– panagenda's visionary figurehead

Agenda

Speaking of the 2nd most valuable asset and introduction
Why are we doing this?
Where in the world are files?
Collecting BIG data – Basics
Statistics – Basics
Collecting from the file system
Collecting from IBM Notes & Domino
Sample reports
Possibilities are endless (this session is not)

Before we start with the introduction

Answer to 2nd most valuable asset

1st most valuable asset?

What can you expect from this session?

Thoughts on a company's file inventory

Some code snippets to gain inventory information

Demo is based on inventory information collected from our personal production notebooks (and a demo backend system) using the code snippets

– Visualization is prepared using a Visual Analytics Tool

Some ideas on how to use the outcome

FILES ARE EVERYWHERE

A file – from easy …

In the easiest sense, a file has
– a potentially mind-boggling number of attributes, e.g.
• folder structure
• filename
• size
– content (which may result in attributes, too)

A file – … to complex

Content is king!
– Zip files – header vs. files vs. file
• Zipping the same files twice can give the two zip files different hashes (archive metadata such as timestamps may differ), although the contained files are identical …
– Office files (pptx, xlsx, …)
• Contain a lot of information “inside”
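The zip point can be illustrated with a short Python sketch. Python is used for illustration only; this is not one of the session's PowerShell snippets. It packs the same payload twice, with the stored timestamp deliberately varied to simulate two separate zip runs, and compares the hash of each archive with the hashes of the members inside. The helper names and the sample file are made up.

```python
import hashlib
import io
import zipfile


def make_zip(files, timestamp):
    """Build an in-memory zip; `timestamp` stands in for the time of the zip run."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            info = zipfile.ZipInfo(name, date_time=timestamp)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, data)
    return buffer.getvalue()


def member_hashes(zip_bytes):
    """Return {member name: MD5 of the uncompressed content}."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: hashlib.md5(archive.read(name)).hexdigest()
            for name in archive.namelist()
        }


# Hypothetical sample content, zipped twice five minutes apart
files = {"report.txt": b"quarterly figures\n" * 100}
first = make_zip(files, (2015, 1, 27, 10, 0, 0))
second = make_zip(files, (2015, 1, 27, 10, 5, 0))

archives_equal = hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest()
members_equal = member_hashes(first) == member_hashes(second)
print("archive hashes equal:", archives_equal)
print("member hashes equal: ", members_equal)
```

For duplicate detection this suggests hashing the members of an archive rather than the archive file itself, at the price of having to open every zip.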

Why are we doing this (=Why are files so important / interesting)?

Storage amount = storage (and backup!) cost
– Increase free disk space, reduce cost
– Beware of DAOS, Centera, … before you get too excited

Understand which (types of) files are created (rather: originated), updated, …
– … and by whom
– Identify knowledge / working-together clusters
– Social Business

Going further (not covered in this session)
– Security & Compliance
– Content
– Beyond Windows (Linux, Mac, Mobile, …)

Mostly for French and German attendees

Some of the use cases and examples covered could be a problem with regard to works council regulations

Rethink the use case without end user information
– E.g. instead of “who has (created) PowerPoint files”, ask “how many PowerPoint files do we have across how many users” (min/avg/max, without information about actual end users)
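A minimal Python sketch of such an anonymized question, assuming the inventory is available as (user, filename) rows. The function name, the row layout and the sample data are hypothetical. User names are only used for counting and never appear in the result.

```python
from collections import Counter
from statistics import mean


def anonymous_summary(records, extension):
    """Count files with `extension` per user, then keep only the aggregates."""
    per_user = Counter(
        user for user, filename in records
        if filename.lower().endswith(extension)
    )
    counts = list(per_user.values())
    if not counts:
        return {"files": 0, "users": 0, "min": 0, "avg": 0.0, "max": 0}
    return {
        "files": sum(counts),
        "users": len(counts),
        "min": min(counts),
        "avg": mean(counts),
        "max": max(counts),
    }


# Made-up inventory rows: (user, filename)
records = [
    ("user_a", "kickoff.pptx"),
    ("user_a", "roadmap.pptx"),
    ("user_a", "budget.xlsx"),
    ("user_b", "pitch.pptx"),
    ("user_c", "notes.txt"),
]
summary = anonymous_summary(records, ".pptx")
print(summary)
```

Note that min/avg/max here only cover users who have at least one matching file. With very few users even aggregates can point to individuals, so a minimum group size may be needed.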

For everyone: Things to be aware of

The name of a file (or folder) can be a big problem on its own
– 2015-01-27_money_transfers_to_carribean_account_789XA3_PW_richmaker.xls
– Layoff_in_german_office_Q2_2015.docx
– Increase_salary_of_mr_jones_to_200000.txt

The mere existence of a file (or folder) can create (at least an ethical) problem on its own
– On someone’s laptop you find confidential, unauthorized, inappropriate information
• e.g. internal DWG (CAD) files, a copy of the meeting minutes from the last meeting of the board of management, customer data, performance figures, …
– And now?

Where files are stored

“Local” file system
– “Fixed” disks (C:, D:, …)
– Local removable disks: A:, B:, USB sticks, CD-ROM, …

Network file system
– Mounted / mapped / UNC / synched (offline files)
– File server

NSFs (Email / Applications)
– Local (with or without consistent ACL, with or without DB level encryption)
– Server
– Beware of reader fields, author fields, …

Connections Files, FileNet, Documentum, SharePoint, Dropbox, Teamdrive, …

How to collect: WYSIWYG or AYCE

“WYSIWYG”
– Local execution = in context of current OS user
• Other users have to log in, too (may never happen)
– Network scanning in context of current OS user
• Shared network drives across departments/company

“AYCE”
– Local execution as Admin (e.g. with SuRunAs)
• Includes Windows profiles from all users
– Batch network scanning
– Root mount scanning

What to collect

Simple file attributes
– Name, “extension”, size, created, last modified, … (dates and time zones!)

Complex (but much more useful) file attributes
– Office properties like Author, Subject, last printed, last whatever, …
– Zip / Rar / 7z / gzip / …
– Hash, e.g. MD5 (same same vs. similar)

Very complex file attributes
– Security (R/W/…) – NSF & file system
– Fingerprints (“Linux magic numbers”)

Hilariously complex: Content (also: similar instead of just same)
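The fingerprint idea can be sketched in a few lines of Python: look at the first bytes of a file instead of trusting its “extension”. The signature table below is a small, hand-picked selection and the function names are made up for this illustration; a real scanner would need a far larger table.

```python
# A few well-known file signatures ("magic numbers"); deliberately incomplete.
MAGIC_NUMBERS = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (also docx, xlsx, pptx)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy doc, xls, ppt)"),
]


def sniff(header):
    """Guess a file type from its first bytes, ignoring the file name."""
    for magic, label in MAGIC_NUMBERS:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path):
    """Read only the first bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))


print(sniff(b"%PDF-1.4\n"))
print(sniff(b"PK\x03\x04\x14\x00"))
print(sniff(b"just some text"))
```

A renamed file (say, a zip archive saved as .txt) is still recognized by its signature, which is what makes this attribute more useful than the extension.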

Mission impossible

“Impossible” file attributes
– Not accessible
– Not visible from viewpoint of scanner
– Not used (e.g. multiuser PCs where a user doesn’t log on again)
– Encrypted (e.g. Zip with password)

Examples of what not to do

Do not harm human beings, animals, plants or goods with your findings
– Be good, do good, be a hero!

Do not analyze for files with same filename
– Approx. 60-70% of all files on a single machine

Do not just delete duplicates

Also: do not do nothing

A VERY SHORT STATISTICS PITCH

Frequency distribution

In statistics, a frequency distribution is a table that displays the frequency of various outcomes in a sample.

e.g. session survey feedback by 100 session participants

Answer                             Count
Speaker skill was brilliant        15
Speaker skill was good             60
Speaker skill was ok               12
Speaker skill was somewhat poor    8
Speaker skill was very poor        5

Grouped data

A raw dataset can be organized by constructing a table showing the frequency distribution of the variable (whose values are given in the raw dataset). Such a frequency table is often referred to as grouped data.

e.g. time taken to answer a survey by 15 participants, sorted into equal-width intervals (bins) or qualitative classes

Time taken [s]: 10 11 9 10 14 20 11 9 14 10 9 13 12 21 24

Interval             Count
t < 5 s              0
5 s <= t < 10 s      3
10 s <= t < 15 s     9
15 s <= t < 20 s     0
20 s <= t < 25 s     3

Class     Interval             Count
Fast      t < 10 s             3
Normal    10 s <= t < 20 s     9
Slow      t >= 20 s            3
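The grouping above can be reproduced with a small Python sketch. The raw values are the 15 survey times from the slide; the function names and the label format are chosen for this illustration only.

```python
from collections import Counter

# The 15 raw survey times in seconds
times = [10, 11, 9, 10, 14, 20, 11, 9, 14, 10, 9, 13, 12, 21, 24]


def bin_label(value, width=5):
    """Label the half-open interval [lower, lower + width) that contains value."""
    lower = (value // width) * width
    return f"{lower}s<=t<{lower + width}s"


def speed_class(value):
    """Map a time to one of the qualitative classes."""
    if value < 10:
        return "Fast"
    if value < 20:
        return "Normal"
    return "Slow"


by_interval = Counter(bin_label(t) for t in times)
by_class = Counter(speed_class(t) for t in times)
print(sorted(by_interval.items()))
print(dict(by_class))
```

Empty bins such as 15s<=t<20s do not appear as keys, but a Counter returns 0 when they are looked up, so the counts match the tables (3, 9, 0, 3 and 3, 9, 3).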

Histogram

A histogram is a graphical representation of the distribution of data. To construct a histogram, the first step is to "bin" the range of values and then count how many values fall into each interval.

e.g. time needed in [s] to rush from Dolphin Southern Hemisphere 1 to Swan Mockingbird 1-2 (sample of 50 participants)

Collect / Measure (raw rush times in [s]):
197 187 186 179 156 179 181 173 188 188 163 202 174 178 193 169 192 170 185 172 192 169 179 174 164 181 161 137 204 167 198 185 186 148 148 185 197 231 175 184 176 175 176 187 210 180 174 180 204 158

Bin and Count:

Rushtime [s]    Count
140             1
150             2
160             5
170             10
180             13
190             11
200             6
210             1
220             0
230             1

[Figure: histogram of Count (y-axis, 0 to 14) over Rushtime [s] (x-axis, 140 to 230)]

SCAN FILESYSTEMS

Local

Scan local Windows based drives (locally mounted hard disks, portable drives or mounted)

Using PowerShell
– Script 1. Collect file system information with MD5 and SHA1 hashes
– Needs PowerShell V4
– Uses: Scripting.FileSystemObject, get-acl cmdlet, get-hash cmdlet
– Run locally with ‘super user’ rights

3 Result files
– Folders (Folder Path, LastWriteTime, Size, FileCount, Depth, FolderName)
– ACLs (Folder Path, IdentityReference, AccessControlType)
– Files (Folder Path, FileName, CreationTime, LastWriteTime, Size, Extension, MD5, SHA1)
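The session's Script 1 is a PowerShell script that is not reproduced in this transcript. As a rough stand-in, the following Python sketch produces a file list with the same columns as the Files result. It is an assumption-laden illustration, not the presenters' code: all function names are made up, folder and ACL results are left out, and unreadable files are silently skipped.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

FIELDS = ["FolderPath", "FileName", "CreationTime", "LastWriteTime",
          "Size", "Extension", "MD5", "SHA1"]


def hash_file(path, chunk_size=1024 * 1024):
    """Compute MD5 and SHA1 in one pass without loading the whole file."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def utc_iso(epoch_seconds):
    """Store timestamps in UTC so scans from different time zones compare."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def scan(root):
    """Yield one record per readable file below root."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)
                md5, sha1 = hash_file(path)
            except OSError:
                continue  # not accessible from the scanner's point of view
            yield {
                "FolderPath": folder,
                "FileName": name,
                # st_ctime is the creation time on Windows; on Unix it is
                # the time of the last metadata change
                "CreationTime": utc_iso(info.st_ctime),
                "LastWriteTime": utc_iso(info.st_mtime),
                "Size": info.st_size,
                "Extension": os.path.splitext(name)[1].lower(),
                "MD5": md5,
                "SHA1": sha1,
            }


def write_inventory(root, stream):
    """Write the file inventory of root as CSV to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(scan(root))
```

Calling write_inventory with a drive root and an open CSV file would give a Files-style result. Hashing every file is the expensive part, so a first pass could collect only names and sizes and hash only files whose sizes collide.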

A short note on PowerShell Execution Policy

PowerShell has a form of execution security: the Execution Policy

The Execution Policy is Undefined by default
– With no policy defined in any scope, Windows clients fall back to Restricted: individual commands are permitted from the console, but scripts will not run

Policy types
– Restricted, AllSigned, RemoteSigned, Unrestricted, Bypass, Undefined

Scope
– LocalMachine, CurrentUser, Process

A short note on PowerShell Execution Policy

To see current settings
    Get-ExecutionPolicy -List

To set
    Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

RemoteSigned allows execution of “own” unsigned scripts
– “own” means scripts written/edited/saved in PowerShell ISE on the local machine
– we will not talk about signing PowerShell scripts in this session; it's not like “sign using current user's ID”

http://technet.microsoft.com/en-us/library/hh847748.aspx

PowerShell Snippet

Enhancement: Collecting Office attributes for .doc* files

Scan local Windows based drives (locally mounted hard disks, portable drives or mounted )

Using PowerShell
– Script 2. Collect file system information with MD5 and SHA1 hashes and .doc* attributes
– Uses: -ComObject Word.Application, BuiltInDocumentProperties

3 Result files
– Folders (Folder Path, LastWriteTime, Size, FileCount, Depth, FolderName)
– ACLs (Folder Path, IdentityReference, AccessControlType)
– Files (Folder Path, FileName, CreationTime, LastWriteTime, Size, Extension, MD5, SHA1, Created, Author, Title, Last print date)

Snippet 2 BuiltinDocumentProperties

1 Title

2 Subject

3 Author

4 Keywords

5 Comments

6 Template

7 Last author

8 Revision number

9 Application name

10 Last print date

11 Creation date

12 Last save time

13 Total editing time

14 Number of pages

15 Number of words

16 Number of characters

17 Security

18 Category

19 Format

20 Manager

21 Company

22 Number of bytes

23 Number of lines

24 Number of paragraphs

25 Number of slides

26 Number of notes

27 Number of hidden Slides

28 Number of multimedia clips

29 Hyperlink base

30 Number of characters (with spaces)

Collecting inventory from “Fileserver 2.0”

Scan SharePoint inventory using PowerShell
– Script 3. Collect item information from SharePoint Server
– Uses: SharePoint cmdlets
– Result: Web Application, Site, Web, List, Item ID, Item URL, Item Title, Item Created, Item Modified, File Size, Author, Versions, Filename

Snippet 3

SCAN FILES IN NSF CONTAINERS

IBM Notes & Domino

NSFs (Email / Applications)
– Local (with or without consistent ACL, with or without DB level encryption)
– Server
– ACL, reader fields, author fields, document / field encryption, …
– zip-file content
– Fields in general (Subject, from, to, cc:, bcc:, created, modified, Body, …)
• The Subject of a Notes document can be just as problematic as the name of a file (attachment)
• Actually this may apply to pretty much any field
• Note: Message Tracking ID
– ATTNQ# (today’s *00#.*)

Fs_free_main.exe ConnectED 2015 Edition

Special stand-alone version to scan local file system and NSF files
Inspects zip file content (deliberately limited to filesystem)
Runs from command line with parameters
– Uses local notes.ini and user.id / server.id
– Therefore in security context of used id-file (ACLs, Reader Fields, DB/Document Encryption)
– Lists (unprotected) zip file content
– Based on C-API

Result: Path,Size,Modified,md5,sha-1

CHART TIME ….EXAMPLE RESULTS DEMO…

Script 1: 16,728 folders, 127,000 files
Script 2: 1,150 doc files
Script 3: 1,316 SP files
Fs.freemain: 1,200,000 records (250 MB)

POSSIBILITIES ARE ENDLESS….

Beyond the shown

Until now we just analyzed what's out there

How could we use that information?

Let's think about some interesting use cases

File Server Migrations – File Consolidations

Use the analysis to understand your file inventory with respect to
– File types: which files fit into the target system (e.g. office files, pdf, jpg, png, wav versus xml, properties, files from non-office applications)
– And their
• Volume distribution
• Count distribution
– Uniqueness of local files
– Time stamps (retention, usage hint)

And act/size based on that information
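As a sketch of the count and volume distribution per file type, the following Python snippet aggregates (extension, size) rows, for example taken from a Files result. The function name, the row layout and the sample data are assumptions made for this illustration.

```python
from collections import defaultdict


def distribution_by_extension(records):
    """Return {extension: (file count, total bytes, share of total bytes)}."""
    counts = defaultdict(int)
    volumes = defaultdict(int)
    for extension, size in records:
        counts[extension] += 1
        volumes[extension] += size
    total = sum(volumes.values()) or 1  # avoid division by zero on empty input
    return {
        ext: (counts[ext], volumes[ext], volumes[ext] / total)
        for ext in sorted(counts, key=volumes.get, reverse=True)
    }


# Made-up rows: (extension, size in bytes)
records = [(".pptx", 4_000_000), (".pptx", 6_000_000),
           (".xml", 2_000), (".pdf", 500_000), (".xml", 3_000)]
for ext, (count, volume, share) in distribution_by_extension(records).items():
    print(f"{ext:6} {count:3d} files {volume:>12,d} bytes {share:6.1%}")
```

Sorting by volume rather than count puts the types that drive storage cost first; file types with many small files (here .xml) show up in the count column instead.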

Suggest Community Clusters

Based on analysis outcomes
– Inventory overlap
– Same authors, editors
– Same access rights
– Metadata

Think of it as a one-time functionality to rearrange your files world in the first step

Could be used in the context of an attachment, like SwiftFile*, in the second step
– may require content analysis

*http://www-01.ibm.com/support/docview.wss?uid=swg24034409

Company File Locations

You do not have to store this file again…
As a hint for a so far unknown collaboration cluster / community

Used in the context of an attachment inside Notes
– Shows all MD5-identical files found at formerly scanned locations inside the company

Biggest challenges
– Real-time performance (needs ongoing periodic scanning of all sources)
– Security trimming (the accounts & groups of all scanned sources have to be resolved/mapped)
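The lookup behind such a feature can be sketched in Python as a hash index plus a visibility filter. Everything here is hypothetical: the names, the sample locations and hash values, and above all the `visible_to_user` callback, which stands in for the real (and hard) mapping of accounts and groups across all scanned sources.

```python
from collections import defaultdict


def build_index(inventory):
    """Map each MD5 to every location where a file with that hash was seen."""
    index = defaultdict(set)
    for location, md5 in inventory:
        index[md5].add(location)
    return index


def known_locations(index, md5, visible_to_user):
    """Look up an attachment's hash and trim results to what the user may see."""
    return sorted(loc for loc in index.get(md5, ()) if visible_to_user(loc))


# Made-up scan results: (location, md5)
inventory = [
    ("//fileserver/sales/offer.pdf", "9e107d9d372bb6826bd81d3542a419d6"),
    ("C:/Users/demo/Desktop/offer_copy.pdf", "9e107d9d372bb6826bd81d3542a419d6"),
    ("//fileserver/hr/contract.docx", "e4d909c290d0fb1ca068ffaddf22cbd0"),
]
index = build_index(inventory)
hits = known_locations(
    index,
    "9e107d9d372bb6826bd81d3542a419d6",
    visible_to_user=lambda loc: loc.startswith("//fileserver"),
)
print(hits)
```

The index only answers the question for locations scanned so far, which is why the real-time requirement translates into ongoing periodic scanning.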

THANK YOU

NOTE: POSSIBILITIES ARE ENDLESS – MORE SO BEYOND FILES

florian.vogler@panagenda.com, henning.kunz@panagenda.com
Come and visit us in the TechnOasis #PED G3 A-C!
Download the latest slide deck and code snippets: www.panagenda.com/connected2015files
