Post on 20-Dec-2015


David Walker, Princeton University

In Collaboration with AT&T Research

Pads: Simplified Data Processing for Scientists

Who: actress Jennifer Aniston and actor Brad Pitt

When: July 29, 2000

Where: The nuptials took place on the grounds of TV producer Marcy Carsey's Malibu estate

The Ceremony: As the sun sank low in the California sky, two hundred assembled guests watched as John Aniston, known to daytime television fans for his work on Days of Our Lives, walked his daughter down the aisle. Shielded by a flower-bedecked canopy, the bride and groom were able to say ....


Standard Data Formats

• Behind the scenes, much of this information is represented in standardized data formats

• Standardized data formats:

– Web pages in HTML

– Pictures in JPEG

– Movies in MPEG

– “Universal” information format XML

– Standard relational database formats

• A plethora of data processing tools:

– Visualizers (browsers display JPEG, HTML, ...)

– Query languages allow users to extract information (SQL, XQuery)

– Programmers get easy access through standard libraries

• Java XML libraries --- JAXP

– Many applications handle it natively and convert back and forth

• MS Word


Ad Hoc Data Formats

• Massive amounts of data are stored in XML, HTML or relational databases, but there’s even more data that isn’t

• An ad hoc data format is any nonstandard data format for which convenient parsing, querying, visualization and transformation tools are not available

– ad hoc data is everywhere.


Ad Hoc data from www.investors.com

Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ®
Stock List Name: DAVE

Stock   Company                            Price   Price     Volume    EPS     RS
Symbol  Name                      Price    Change  % Change  % Change  Rating  Rating
AET     Aetna Inc                 73.68    -0.22   0%        31%       64      93
GE      General Electric Co       36.01    0.13    0%        -8%       59      56
HD      Home Depot Inc            37.99    -0.89   -2%       63%       84      38
IBM     Intl Business Machines    89.51    0.23    0%        -13%      66      35
INTC    Intel Corp                23.50    0.09    0%        -47%      39      33

Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.
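To make the difficulty concrete, here is a minimal hand-written sketch in Python (not Pads, and not from the talk) that reads one data row of the stock list. It assumes each row is whitespace-separated, ends in six numeric fields, and that everything between the symbol and those fields is the company name; the `StockRow` type and its field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StockRow:
    # Field names are illustrative; they paraphrase the two-line column header.
    symbol: str
    company: str
    price: float
    price_change: float
    price_pct_change: int
    volume_pct_change: int
    eps_rating: int
    rs_rating: int


def parse_stock_row(line: str) -> StockRow:
    """Parse one data row. The company name may itself contain spaces."""
    tokens = line.split()
    if len(tokens) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(tokens)}: {line!r}")
    # The last six tokens are numeric; whatever sits between the symbol
    # and those six tokens is treated as the company name.
    price, change, pct, vol_pct, eps, rs = tokens[-6:]
    return StockRow(
        symbol=tokens[0],
        company=" ".join(tokens[1:-6]),
        price=float(price),
        price_change=float(change),
        price_pct_change=int(pct.rstrip("%")),
        volume_pct_change=int(vol_pct.rstrip("%")),
        eps_rating=int(eps),
        rs_rating=int(rs),
    )


row = parse_stock_row("IBM Intl Business Machines 89.51 0.23 0% -13% 66 35")
print(row)
```

Even this small parser has to guess where the company name ends, and it says nothing about the header or the legal footer; those are the gaps a declarative description is meant to close.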


Ad Hoc data from www.geneontology.org

!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
 <biological_process ; GO:0008150
  %behavior ; GO:0007610 ; synonym:behaviour
   %adult behavior ; GO:0030534 ; synonym:adult behaviour
    %adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631
    %adult locomotory behavior ; GO:0008344 ;

...
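As a rough illustration of what reading this format by hand involves, the Python sketch below parses a single term entry. It is not Pads and not an official Gene Ontology parser. The relation markers come from the `!type:` header lines above; treating `$` as the root marker is an assumption, and entries that name a second parent inline (such as the `% feeding behavior` suffix above) are deliberately rejected rather than guessed at.

```python
# Relation markers as declared by the "!type:" header lines.
RELATIONS = {"%": "is_a", "<": "part_of", "^": "inverse_of", "|": "disjoint_from"}


def parse_term(line):
    """Parse one simple term entry; return None for blank or header lines."""
    text = line.strip()
    if not text or text.startswith("!"):
        return None
    marker, rest = text[0], text[1:]
    if marker == "$":
        relation = "root"  # assumption: "$" marks the top-level entry
    elif marker in RELATIONS:
        relation = RELATIONS[marker]
    else:
        raise ValueError(f"unknown relation marker {marker!r} in {line!r}")

    fields = [field.strip() for field in rest.split(";")]
    name = fields[0]
    ids = [f for f in fields[1:] if f.startswith("GO:")]
    synonyms = [f[len("synonym:"):] for f in fields[1:] if f.startswith("synonym:")]
    if len(ids) != 1:
        # Entries with additional inline parents are outside this sketch.
        raise ValueError(f"expected exactly one GO id, found {len(ids)} in {line!r}")
    return {"relation": relation, "name": name, "id": ids[0], "synonyms": synonyms}


term = parse_term("%behavior ; GO:0007610 ; synonym:behaviour")
print(term)
```

Flagging the unsupported case explicitly, instead of silently returning the wrong identifier, is the kind of error handling that hand-written scripts tend to omit.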


Ad Hoc Data From An Immune Response Simulation

0 8 125 8 3 2 6 0(~6:0:0:0:0~1:0:0:0:1,1:1:0:0:0)

1 3 7 7 2 1 6 0(~6:0:0:0:0~1:1:0:0:0)

2 7 37 6 2 1 5 0(~5:0:0:0:0~1:1:0:0:0)

3 5 16 5 4 3 2 0(~2:0:0:0:0~1:1:0:0:0,1:1:0:0:0,1:0:0:1:0)

4 8 161 2 2 1 1 0(~1:0:0:0:0~1:0:0:1:0)

5 5 27 18 4 5 13 4(~13:0:0:0:0~2:0:0:0:1,1:0:0:1:0,2:0:0:1:0)

6 6 50 5 1 0 5 05:0:0:0:0

....


Ad Hoc Data in Chemistry

O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H](OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)C5=CC=CC=C5)=O)C1

[Figure: structural diagram of the molecule described by the string above]


Ad Hoc Data from Web Server Logs (CLF)

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941
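A typical hand-coded approach to records like these is a regular expression. The Python sketch below is one such attempt, not the Pads description; the field names are chosen for illustration, and the pattern accepts a `-` in the length position because, as the profiler slides later note, some servers emit it.

```python
import re

# One Common Log Format record: host, ident, user, [timestamp],
# "method path protocol", status code, response length (or "-").
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<length>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" length means the server did not report a byte count.
    record["length"] = None if record["length"] == "-" else int(record["length"])
    return record


record = parse_clf(
    '207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30'
)
print(record)
```

Returning `None` for a bad line tells the caller that something went wrong but not where or why, which is one of the weaknesses of ad hoc scripts that the talk goes on to criticize.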


Ad Hoc Data: DNS packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co
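Binary formats need a different style of hand-written code. The Python sketch below decodes the first 36 bytes of the dump above using the standard `struct` module. It rests on one assumption that the slide does not state: that the dump begins with the two-byte length prefix used when DNS is carried over TCP, followed by the usual 12-byte DNS header and one question. The function and key names are illustrative, and compression pointers (the `c00c` sequences later in the dump) are not handled.

```python
import struct

# The first 0x24 bytes of the hex dump, copied from its first three rows.
DUMP_PREFIX = bytes.fromhex(
    "9192 d8fb 8480 0001 05d8 0000 0000 0872"
    "6573 6561 7263 6803 6174 7403 636f 6d00"
    "00fc 0001"
)


def parse_dns_prefix(data):
    """Decode the length prefix, header, and first question of a DNS message."""
    # Assumed layout: 2-byte length, then id, flags, and four section counts,
    # all unsigned 16-bit integers in network byte order.
    length, ident, flags, qdcount, ancount, nscount, arcount = struct.unpack_from(
        "!HHHHHHH", data, 0
    )
    offset = 14
    labels = []
    while True:
        # Each label is a length byte followed by that many ASCII characters;
        # a zero length byte ends the name.
        size = data[offset]
        offset += 1
        if size == 0:
            break
        labels.append(data[offset:offset + size].decode("ascii"))
        offset += size
    qtype, qclass = struct.unpack_from("!HH", data, offset)
    return {
        "length": length,
        "id": ident,
        "flags": flags,
        "qdcount": qdcount,
        "ancount": ancount,
        "name": ".".join(labels),
        "qtype": qtype,
        "qclass": qclass,
    }


message = parse_dns_prefix(DUMP_PREFIX)
print(message)
```

Under that assumed layout the question name decodes to the same `research.att.com` text visible in the dump's ASCII column, and the question type is 252, which is the code DNS uses for a zone transfer request.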


Who uses ad hoc data?

• Ad hoc data sources are everywhere

– containing valuable information of all kinds

– everybody wants it:

• corporations: acquire data files from partners, mergers and acquisitions

• scientists: chemists, physicists, biologists, economists, computer scientists, network administrators, ...

• programmers: just about all of us at one time or another


The challenge of ad hoc data

• What can we do about ad hoc data?

– how do we read it into programs?

– how do we detect errors?

– how do we correct errors?

– how do we query it?

– how do we view it?

– how do we gather statistics on it?

– how do we load it into a database?

– how do we transform it into a standard format like XML?

– how do we combine multiple ad hoc data sources?

– how do we filter, normalize and transform it?

• In short: how do we do all the things we take for granted when dealing with standard formats in a reliable, fault-tolerant and efficient, yet effortless way?


Most people use C / Perl / Shell scripts

• But:

– Writing hand-coded parsers is time consuming & error prone.

– Reading and maintaining them in the face of even small format changes can be difficult.

– Such programs are often incomplete, particularly with respect to errors.

– Not all that efficient unless the author invests extra effort

• For reliable & efficient data processing, we can do better!


Why not use traditional parsers?

• Overall, a fairly heavy-weight solution

– specifying a lexer and parser separately can be a barrier

• data specs as Lex and Yacc files are relatively complicated

• regular expressions and context-free grammars, while good for programming language specs, aren’t necessarily the best languages for specifying ad hoc data

– lexing and parsing tools only solve a small part of the problem

• internal data structures built by hand

• printer by hand

• transforms by hand

• viewers by hand

• query engine by hand

– people just do not do it

• We can do better!


Enter Pads

• Pads: a system for Processing Ad hoc Data Sources

• Two main components:

– a data description language

• for concise and precise specifications of ad hoc data formats and properties

– a compiler that automatically generates a suite of data processing tools

• robust libraries for C programming

– parser that flags all errors and automatically recovers

– printing utilities

– constraint checking utilities

• converter to XML

• a statistical profiler

– collects stats on common values appearing in all parts of the data; records error stats

• visual interface & viewer (coming soon!)


Pads Tool Generation Architecture

[Diagram: the Pads compiler takes a Gene Ontology description and generates three tools. The statistical profiler tool turns gene data into a profile (e.g. "ACE 25%, BKJ 25%, ..."); the XML formatter tool turns gene data into XML; the viewer tool displays gene data.]


Pads Tool Generation Architecture

[Diagram: the Pads compiler turns the Gene Ontology description into a generated Gene Ontology parser. The generated parser, the Pads base library, and glue code for the statistical profile together form the Gene Ontology statistical profiler.]


Pads Programmer Tools

[Diagram: the Pads compiler turns the Gene Ontology description into a generated Gene Ontology parser. The generated parser and the Pads base library are combined with an ad hoc user program written in C.]


The Statistical Profiler Tool

• for each part of a data source, profiler reports errors & most common values.

• from example weblog data:

<top>.length : uint32
+++++++++++++++++++++++++++++++++++++++++++
good: 53544 bad: 3824 pcnt-bad: 6.666
min: 35 max: 248591 avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
val: 3082 count: 1254 %-of-good: 2.342
val: 170 count: 1148 %-of-good: 2.144
val: 43 count: 1018 %-of-good: 1.901
.....


The Statistical Profiler Tool

• ad hoc data formats are often poorly documented, or their documentation is out of date

• even the documentation of weblog data from our textbook was missing some information:

good: 53544 bad: 3824 pcnt-bad: 6.666

– web servers sometimes return a ‘-’ instead of the length in bytes, which wasn’t mentioned in the textbook

• data descriptions can be written in an iterative fashion

– use the profiler at each stage to uncover additional information about the data and refine the description
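The kind of report shown above can be approximated in a few lines. The Python sketch below is not the Pads-generated profiler; it only mimics the idea for a single field, counting values that parse as unsigned integers as good and everything else (such as `-`) as bad. The function name, the report keys, and the sample input are all made up for illustration.

```python
from collections import Counter


def profile_lengths(raw_values, top_n=3):
    """Summarize one field: good/bad counts, range, and most common values."""
    good = []
    bad = 0
    for raw in raw_values:
        # Only plain ASCII digit strings count as valid unsigned integers.
        if raw.isascii() and raw.isdigit():
            good.append(int(raw))
        else:
            bad += 1
    total = len(good) + bad
    report = {
        "good": len(good),
        "bad": bad,
        "pcnt_bad": 100.0 * bad / total if total else 0.0,
        "top": Counter(good).most_common(top_n),
    }
    if good:
        report.update(min=min(good), max=max(good), avg=sum(good) / len(good))
    return report


# Made-up sample values, including the undocumented "-" entries.
sample = ["3082", "170", "3082", "-", "43", "3082", "170", "-"]
report = profile_lengths(sample)
print(report)
```

A nonzero bad count on a field that the documentation claims is always numeric is exactly the signal that prompts another round of refining the description.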


PADS language

• Based on Type Theory

– in most modern programming languages, types (int, bool, struct, object ...) describe program data

• the source of most of my research

– in Pads, types describe

• physical data formats,

• semantic properties of data, and

• a mapping into an internal program representation (i.e., a parser)

– in Pads, types include

• base types for ints of different kinds, strings of different kinds, dates, urls, ...

• structs and arrays for reading sequences

• unions, switched unions and enums for alternatives

• parameterized types to express dependencies & constraints

• recursive types to express recursive hierarchies (coming soon!)

– Can describe ASCII, binary, and mixed data formats.
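The idea that a type can double as a parser can be sketched outside Pads. The Python below is a loose analogy, not Pads syntax and not how the Pads compiler works: each "type" is a function from text and an offset to a value and a new offset, with one combinator for base types, one for unions (alternatives), and one for structs (sequences with optional constraints). Every name here is invented for illustration, and the status-code range used as a constraint is an assumption.

```python
import re


class ParseError(Exception):
    """Raised when the input does not fit the described type."""


def base(pattern, convert):
    """Base type: a regular expression plus a conversion to a program value."""
    regex = re.compile(pattern)

    def parse(text, pos):
        match = regex.match(text, pos)
        if match is None:
            raise ParseError(f"no match for {pattern!r} at offset {pos}")
        return convert(match.group()), match.end()

    return parse


def literal(string):
    """A fixed piece of syntax, such as a separator."""
    return base(re.escape(string), lambda _: string)


def union(*alternatives):
    """Alternatives: the first one that parses wins."""
    def parse(text, pos):
        for alternative in alternatives:
            try:
                return alternative(text, pos)
            except ParseError:
                continue
        raise ParseError(f"no alternative matched at offset {pos}")

    return parse


def struct_type(*fields):
    """Sequence of (name, type, constraint); unnamed fields are not stored."""
    def parse(text, pos):
        record = {}
        for name, field_type, constraint in fields:
            value, pos = field_type(text, pos)
            if constraint is not None and not constraint(value):
                raise ParseError(f"constraint failed for {name}: {value!r}")
            if name is not None:
                record[name] = value
        return record, pos

    return parse


uint = base(r"\d+", int)
dash = base(r"-", lambda _: None)

# Description of the tail of a web log record: status code, space, length.
response = struct_type(
    ("status", uint, lambda code: 100 <= code < 600),  # assumed valid range
    (None, literal(" "), None),
    ("length", union(uint, dash), None),
)

record, end = response("200 941", 0)
print(record, end)
```

This sketch stops at the first error, whereas the generated Pads parser is described as flagging all errors and recovering; the point here is only that one declarative description yields both the physical format check and the in-memory representation.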


Future Work

• Ad Hoc Data Transformation & Integration

– language and compiler support for moving data from the format you are given to the format you really want

• specifying simple transforms: permuting, dropping, computing fields; normalizing representations of dates, times, places ...

• correcting errors

• integrating multiple sources

• Pads Applications

– genomics data (with Olga Troyanskaya, Princeton CS)

– networking and telephony data (AT&T)

– financial data (Richard Liao, Princeton ORFE)


Challenges of Ad Hoc Data Revisited

• Data arrives “as is”

– Format determined by data source, not consumers.

• The Pads language allows consumers to describe data in just about any format.

– Often has little documentation.

• A Pads description can serve as documentation for a data source.

• The statistical profiler helps analysts understand data.

– Some percentage of data is “buggy.”

• Constraints allow consumers to express expectations about data.

• Parsers check for errors and say where errors are located.

• Ad hoc data is a rich source of information for financial analysts, chemists, biologists, computer scientists, if they could only get at it.

– Pads generates a collection of useful tools automatically from data descriptions


Pads Summary

• The overarching goal of Pads is to make understanding, analyzing and transforming ad hoc data an effortless task.

• We do so with new programming language technology based on the principles of Type Theory.

AT&T Research: Kathleen Fisher, Mary Fernandez, Joel Gottlieb, Robert Gruber (now Google), Ricardo Medel (summer intern)

Princeton: Mark Daly (UGrad), Yitzhak Mandelbaum (Grad), David Walker

http://www.padsproj.org/

End!