

Upload: meghan-tucker

Post on 06-Jan-2018



Discovery of Access Patterns to Scientific Simulation Data

Ghaleb Abdulla
LLNL Center for Applied Scientific Computing

Team

- Ghaleb Abdulla (0.4)
- Tina Eliassi-Rad (0.4)
- Terence Critchlow (0.15)


Task Objective

- Identify data storage formats that minimize access times, using historical access patterns to the same or similar data sets
- Use the spatial and temporal locality that results from data accesses to format the data on the disk


Challenges

Data can be accessed using different tools and by different users.

[Diagram: User 1 ... User m, Tool 1 ... Tool n, Data set 1 ... Data set k]


Enabling Access Pattern Discovery

- Application area (astrophysics)
- Visualization tool (VisIt)
- Analyze history of access patterns on two levels:
  - System level
    - Disk references
    - Network overhead
    - Memory usage
  - Application level
    - Higher-level commands
    - User-level info


Enabling Access Pattern Discovery

[Diagram components: User 1 ... User n; VisIt; Astrophysics; Application Logging; Disk Logging; Log files; Unsupervised Learner (e.g., k-NN, k-means, etc.); Patterns; [Pattern, Hints] training data; Supervised Learner (e.g., neural net, DT, etc.); Hints; Djehuty]
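As an illustration of the unsupervised stage named in the diagram, sessions could be grouped by their logged features. The following is a minimal sketch of plain k-means, not the tool the team used. It assumes each session has already been reduced to a numeric feature vector; the two features shown (mean request size and fraction of sequential reads) and their values are hypothetical.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical session features (not from the slides):
# (mean request size in KB, fraction of sequential reads)
sessions = [(64, 0.9), (60, 0.95), (70, 0.85), (4, 0.1), (8, 0.2), (6, 0.15)]
centers, labels = kmeans(sessions, k=2)
```

In practice the features would need scaling so that one large-valued attribute does not dominate the distance.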


Log file collection

- Collect logs at the application and disk level
- Manage the log collection process
  - Start and stop collection sensors or agents based on demand
  - Keep log data in one central place
  - Detect any failure in the monitoring agents and restart them
  - Preferably work in a distributed environment
- JAMM from LBL meets our requirements
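The management requirements listed on this slide (start and stop sensors on demand, detect failed agents and restart them) can be illustrated with a small supervisor. This is a generic sketch and does not reflect JAMM's actual interface; the `Sensor` and `SensorManager` classes are hypothetical, and a boolean flag stands in for a live agent process.

```python
class Sensor:
    """Hypothetical monitoring agent; `running` stands in for a live process."""
    def __init__(self, name):
        self.name = name
        self.running = False
        self.restarts = 0

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class SensorManager:
    """Starts sensors on demand and restarts any that have failed."""
    def __init__(self):
        self.sensors = {}    # name -> Sensor
        self.wanted = set()  # names that should currently be running

    def request(self, name):
        sensor = self.sensors.setdefault(name, Sensor(name))
        self.wanted.add(name)
        if not sensor.running:
            sensor.start()
        return sensor

    def release(self, name):
        self.wanted.discard(name)
        if name in self.sensors:
            self.sensors[name].stop()

    def check(self):
        """Restart every wanted sensor that is not running; return their names."""
        restarted = []
        for name in sorted(self.wanted):
            sensor = self.sensors[name]
            if not sensor.running:
                sensor.start()
                sensor.restarts += 1
                restarted.append(name)
        return restarted


manager = SensorManager()
disk = manager.request("disk")
manager.request("application")
disk.running = False            # simulate a crashed agent
restarted = manager.check()     # the manager notices and restarts it
manager.release("application")  # demand is gone, so stop collecting
```

A real deployment would run `check()` periodically and would track agents on remote hosts rather than in-process objects.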


JAMM Architecture


What to Collect

- Application and user level:
  - Open
  - Zoom
  - Slice
  - etc.
- System level:
  - Network overhead
  - Disk block size
  - Buffer size
  - Disk location, etc.
- We need to add our own sensors to collect data
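One way an application-level sensor might record commands such as Open or Slice is to wrap each command so that every call appends an event to a log. The sketch below is an assumption about how this could look, not how VisIt or the project implemented it; the `logged` decorator, the `event_log` list, the wrapped functions, and the file name are all hypothetical.

```python
import functools
import time

event_log = []  # a real sensor would forward events to the central log store

def logged(command_name):
    """Wrap a tool command so each call is recorded as an application-level event."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            event_log.append({
                "time": time.time(),
                "command": command_name,
                "args": args,
                "kwargs": kwargs,
            })
            return func(*args, **kwargs)
        return wrapper
    return decorator

@logged("Open")
def open_dataset(path):
    return f"opened {path}"

@logged("Slice")
def slice_dataset(axis, position):
    return (axis, position)

open_dataset("astro.data")  # hypothetical file name
slice_dataset("z", 0.5)
```

System-level quantities such as disk block size or network overhead would need separate sensors at the operating-system level, which this sketch does not cover.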


Data format

The DTD for our XML files is as follows:

<!ELEMENT logfile (application+)>
<!ELEMENT application (user+)>
<!ATTLIST application name ID #REQUIRED>
<!ELEMENT user (dataset+)>
<!ATTLIST user name ID #REQUIRED>
<!ELEMENT dataset (session+)>
<!ATTLIST dataset name ID #REQUIRED>
<!ELEMENT session (metadata+)>
<!ATTLIST session time NMTOKENS #REQUIRED>
<!ELEMENT metadata (#PCDATA)>
<!ATTLIST metadata
    name ID       #REQUIRED
    time NMTOKENS #IMPLIED>
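The nesting the DTD describes (logfile, application, user, dataset, session, metadata) can be produced with Python's standard `xml.etree.ElementTree`. This is a sketch of the element structure only: the standard library does not validate against a DTD, so the output is not checked for conformance, and the `build_log` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def build_log(app, user, dataset, session_time, metadata):
    """Build one logfile element with a single session, following the DTD nesting."""
    logfile = ET.Element("logfile")
    application = ET.SubElement(logfile, "application", name=app)
    user_el = ET.SubElement(application, "user", name=user)
    dataset_el = ET.SubElement(user_el, "dataset", name=dataset)
    session = ET.SubElement(dataset_el, "session", time=session_time)
    for key, value in metadata.items():
        item = ET.SubElement(session, "metadata", name=key)
        item.text = str(value)  # metadata content is #PCDATA
    return logfile

root = build_log(
    "SimTracker", "Tina", "astro", "01/11/2002 13:45:00 PST",
    {"file_size": "210M", "access_patterns": "random"},
)
xml_text = ET.tostring(root, encoding="unicode")
```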


Log File, Example

<?xml version="1.0" ?>
<!DOCTYPE logfile (View Source for full doctype...)>
<logfile>
  <application name="SimTracker">
    <user name="Tina">
      <dataset name="astro">
        <session time="01/11/2002 13:45:00 PST">
          <metadata name="access_speed">100K</metadata>
          <metadata name="storage_utilization">0</metadata>
          <metadata name="cohesion">0</metadata>
          <metadata name="fault_tolerance">1</metadata>
          <metadata name="num_disks_to_strip">20</metadata>
          <metadata name="start_io_device">16</metadata>
          <metadata name="stripping_factor" time="01/11/2002 13:55:00">200</metadata>
          <metadata name="stripping_unit" time="01/11/2002 13:59:00">64K</metadata>
          <metadata name="file_permissions">write</metadata>
          <metadata name="access_patterns">random</metadata>
          <metadata name="file_size">210M</metadata>
          <metadata name="io_buffer_size">128K</metadata>
        </session>
      </dataset>
    </user>
  </application>
</logfile>
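Before any clustering, a log file of this shape has to be flattened into one record per session. The sketch below does that with the standard library on a shortened copy of the example. The `parse_value` and `session_features` helpers are hypothetical, and reading the `K` and `M` suffixes as powers of 1024 bytes is an assumption; the slides do not state the units.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example log (DOCTYPE line omitted).
LOG = """<logfile>
  <application name="SimTracker">
    <user name="Tina">
      <dataset name="astro">
        <session time="01/11/2002 13:45:00 PST">
          <metadata name="access_speed">100K</metadata>
          <metadata name="stripping_unit" time="01/11/2002 13:59:00">64K</metadata>
          <metadata name="access_patterns">random</metadata>
          <metadata name="file_size">210M</metadata>
          <metadata name="io_buffer_size">128K</metadata>
        </session>
      </dataset>
    </user>
  </application>
</logfile>"""

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumed binary units

def parse_value(text):
    """Turn '64K' into 65536 and '20' into 20; leave other text unchanged."""
    text = text.strip()
    if text and text[-1] in UNITS and text[:-1].isdigit():
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text) if text.isdigit() else text

def session_features(xml_text):
    """Return one dict per session, keyed by metadata name."""
    root = ET.fromstring(xml_text)
    rows = []
    for app in root.findall("application"):
        for user in app.findall("user"):
            for dataset in user.findall("dataset"):
                for session in dataset.findall("session"):
                    row = {
                        "application": app.get("name"),
                        "user": user.get("name"),
                        "dataset": dataset.get("name"),
                        "session_time": session.get("time"),
                    }
                    for item in session.findall("metadata"):
                        row[item.get("name")] = parse_value(item.text or "")
                    rows.append(row)
    return rows

rows = session_features(LOG)
```

Categorical values such as `access_patterns` are left as strings here and would need encoding before being passed to a numeric clustering method.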


Data Analysis

- Researched publicly available clustering tools
- Narrowed our choice to two:
  - CLUTO (University of Minnesota)
  - R (GNU)
- Testing data processing algorithms on randomly generated log files
- Hoping to get real log files in the near future:
  - Logging applications
  - We are currently looking at the "Flash" log files
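Randomly generated log files for testing could be produced along the following lines. This is a sketch under stated assumptions, not the team's generator: the metadata names come from the example log, but the value ranges, the `sequential` access pattern, the `session-N` time labels, and the `random_logfile` helper are made up for illustration, and the output is not validated against the DTD.

```python
import random
import xml.etree.ElementTree as ET

def random_logfile(n_sessions, seed=0):
    """Return an XML string with n_sessions randomly filled sessions."""
    rng = random.Random(seed)  # seeded so test runs are repeatable
    logfile = ET.Element("logfile")
    app = ET.SubElement(logfile, "application", name="SimTracker")
    user = ET.SubElement(app, "user", name="Tina")
    dataset = ET.SubElement(user, "dataset", name="astro")
    for i in range(n_sessions):
        # Placeholder time label; a real log would carry a timestamp.
        session = ET.SubElement(dataset, "session", time=f"session-{i}")
        values = {
            "access_patterns": rng.choice(["random", "sequential"]),
            "io_buffer_size": f"{rng.choice([32, 64, 128, 256])}K",
            "file_size": f"{rng.randint(1, 500)}M",
            "num_disks_to_strip": str(rng.randint(1, 32)),
        }
        for name, value in values.items():
            ET.SubElement(session, "metadata", name=name).text = value
    return ET.tostring(logfile, encoding="unicode")

sample = random_logfile(5, seed=42)
```

Fixing the seed makes a failing test case reproducible, which matters when the same synthetic logs are fed to more than one clustering tool for comparison.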


Questions


This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

UCRL-MI-xxxxxx