xml for scientific applications. outline sq1 returned history manipulating xml xml for science
TRANSCRIPT
![Page 1: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/1.jpg)
XML for Scientific Applications
![Page 2: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/2.jpg)
Outline
SQ1 returned
History Manipulating XML XML for Science
![Page 3: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/3.jpg)
XML History
late 1960s “Generic Coding” for elect. manuscripts
'heading' rather than 'format-17' 1969
Goldfarb, Mosher, Lorie @ IBM Generalized Markup Language
1980 SGML; Standardized by ANSI IRS, DoD were early adopters
![Page 4: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/4.jpg)
SGML
Representation of documents Device-Independent, Software-
Independent Content
Tagged Text Structure
Document Type Definition (DTD) Style
Usually proprietary
![Page 5: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/5.jpg)
SGML: Content
Tagged Text start tags and end tags indentation not important
<section> <heading>SGML Example</heading> <paragraph> <line>first line of SGML</line> <line>second line of SGML</line> </paragraph></section>
![Page 6: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/6.jpg)
SGML: Structure
A section is an optional heading, followed by one or more paragraphs.
A paragraph is one or more lines or a quote.
A heading is just text.
A line is just text.
Dcoument Type Definition (DTD)
<!ELEMENT section - - (heading?, paragraph+)><!ELEMENT paragraph - O (line+ | quote)><!ELEMENT heading - O (#PCDATA) ><!ELEMENT line - O (#PCDATA) ><!ELEMENT quote - O (#PCDATA) >
![Page 7: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/7.jpg)
SGML: Benefits and Weaknesses
Benefits Human-readable Portable Structure-oriented
Weaknesses DTD difficult to define up-front Difficult to implement
syntax abbreviations/rules inference
![Page 8: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/8.jpg)
AN SGML Application: HTML
1990 Tim Berners-Lee @ CERN
a Scientific Data Management problem… writes a DTD for “Hypertext Markup Language”
1994-1996 Web takes off Page designers need guarantees about how
their page is going to look <font> tags, etc. introduced; purists despair MS and Netscape fight
![Page 9: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/9.jpg)
Data on the Web
<title>,<h1>,<font> are for humans
Back to SGML for data, but Stricter syntax (must use end tags)
good for implementors looser semantics (DTD not required)
good for users
Non-standard markup is better than no markup at all
![Page 10: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/10.jpg)
Growth
Sun introduces Java New hope for interoperability
Microsoft pushes XML because…it’s not Java
Individuals can publish data unilaterally Contrast: How do you publish relations? Contrast: Documents? (MS Word? as pdf?) A non-partisan, unimpeachable format
![Page 11: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/11.jpg)
XML File Sample
<?xml version="1.0"?> <dining-room>
<manufacturer>The Wood Shop</manufacturer>
<table type="round" wood="maple">
<price>$199.99</price> </table> <chair wood="maple">
<quantity>6</quantity> <price>$39.99</price>
</chair> </dining-room>
![Page 12: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/12.jpg)
Abstract View of XML
dining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” “$199.99” “$39.99” 6“maple”
price
![Page 13: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/13.jpg)
Manipulating XML
![Page 14: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/14.jpg)
XPath
dining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
1) //price
2) /*/*/wood
3) //wood/../
4) /dining-room/*[price>100]/type
5) //wood[1]/text()
![Page 15: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/15.jpg)
XPath
XML [Node] language is not “closed” cannot compose xpath expressions
Answers are references to nodes works for local memory…
![Page 16: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/16.jpg)
XSLTdining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
<xsl:stylesheet version="1.0" xmlns:xsl=“…"> <xsl:template match=“//wood">
<xsl:if test=“/price > 100”><material>
<xsl:value-of select=“.”/> </material>
</xsl:if></xsl:template> </xsl:stylesheet>
![Page 17: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/17.jpg)
XSLT
logically, a set of rules pattern1 -> result1 pattern2 -> result2 …
physically, an XML document why is this a good idea?
XML XML, via XML
![Page 18: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/18.jpg)
XQuerydining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
for $v in doc(“example.xml”)//woodwhere $v/../price > 100return <material>
$v </material>
![Page 19: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/19.jpg)
XQuerydining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
for $m in doc(“example.xml”)//table[wood=“maple”]for $o in doc(“example.xml”)//table[wood=“oak”]where $m/price > $o/pricereturn <pair> <maple>$m/type </maple>
<oak>$m/type </oak> </pair>
![Page 20: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/20.jpg)
XQuery
Feels more like a Query language Joins are natural to express
XML XML, via special syntax
![Page 21: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/21.jpg)
XSLT and XQuery
Today’s author sums up: XSLT is for transformation
document-centric view of XML XQuery is for query
data-centric view of world
![Page 22: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/22.jpg)
XML for Science
Exercise: Sensor Data again
![Page 23: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/23.jpg)
Sensor Data
Would Homework 1 be any easier with XML?
![Page 24: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/24.jpg)
Benefits for Science
“better, more precise searching of full-content;
automatic extraction of object metadata; better support for compound document
formats and dynamic formatting generally; and,
improved processes for scholarly dissemination tasks such as peer review, versioning, and archiving.”
-- from NSF / NSDL Workshop on Scientific Markup Languages
![Page 25: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/25.jpg)
XML for Science
Recall features of Science Data: Read-oriented access Provenance
who, what, when, where, why Interesting Data Types
timeseries spatial arrays images
Scale
![Page 26: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/26.jpg)
XML for Science
Read-oriented access? perfect!
Provenance requires some flexibility; no problem
Interesting Data Types …and special file formats
Scale could get ugly
![Page 27: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/27.jpg)
Interesting Data Types
Data locked in binary file formats Binary Format Description Language
[Myers, Chappell 2000] Data Format Description Language
[OpenGrid Project] Retrofitting Data Models
[Howe, Maier SSDBM 2005] PADX
[Fernandez et al, PLANX 2006] XDTM
[Foster, Voeckler et al. Global Grid Forum 2005]
![Page 28: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/28.jpg)
Binary Format Description Language
![Page 29: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/29.jpg)
PADX
![Page 30: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/30.jpg)
1. Precord Pstruct summary_header_t {2. "0|";3. Punixtime tstamp;4. };5. Pstruct no_ramp_t {6. "no_ii";7. Puint64 id;8. };9. Punion dib_ramp_t {10. Pint64 ramp;11. no_ramp_t genRamp;12. };13. Pstruct order_header_t {14. Puint32 order_num;15. ’|’; Puint32 att_order_num;16. ’|’; Puint32 ord_version;17. ’|’; Popt pn_t service_tn;18. ’|’; Popt pn_t billing_tn;19. ’|’; Popt pn_t nlp_service_tn;20. ’|’; Popt pn_t nlp_billing_tn;21. ’|’; Popt Pzip zip_code;22. ’|’; dib_ramp_t ramp;23. ’|’; Pstring(:’|’:) order_type;24. ’|’; Puint32 order_details;25. ’|’; Pstring(:’|’:) unused;26. ’|’; Pstring(:’|’:) stream;27. };28. Pstruct event_t {29. Pstring(:’|’:) state;30. ’|’; Punixtime tstamp;31. };32. Parray event_seq_t {33. event_t[] : Psep(’|’) && Pterm(Peor);34. };35. Precord Pstruct order_t {36. order_header_t order_header;37. ’|’; event_seq_t events;38. };39. Parray orders_t {40. order_t[];41. };42. Psource Pstruct summary_t{43. summary_header_t summary_header;44. orders_t orders;45. };
![Page 31: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/31.jpg)
XML as Data
Maier’s Maxim:
Every popular data exchange format eventually becomes a data storage format.
![Page 32: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/32.jpg)
XML Storage: Shredding
Use RDBMS as your storage engine Two approaches:
Schema-aware Schema-oblivious
dining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
![Page 33: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/33.jpg)
XML Storage: Schema-aware
Table(SKU, Wood, Type, Price)Chair(SKU, Wood, Price)
DiningRoom(Manufacturer, Chairs, Quantity, Table)
![Page 34: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/34.jpg)
XML Storage: Schema-oblivious
Remember fancy node-labeling schemes…
Edge(NodeId, Tag, Value, ParentNodeId)
![Page 35: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/35.jpg)
Left/Right Labeling
dining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
0
1
2 3
4 5
7
6
8
9
34
10 …
Which queries are easy and fast?
What did we say the problems were?
![Page 36: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/36.jpg)
Path Labeling
dining-room
table chairmanufacturer
type wood wood“The Wood Shop” price quantity
“round” “maple” 199.99 39.99 6“maple”
price
0
0.0
0.0.0 0.1.2
0.1
0.1.0.0
0.1.0
0.1.1.0
0.1.1
What queries are fast and/or easy?
What did we say the problems were?
0.1.2.0
![Page 37: XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science](https://reader037.vdocuments.us/reader037/viewer/2022110208/56649dbb5503460f94aad0f8/html5/thumbnails/37.jpg)
Next time
Geographic Information Systems Spatial Data
Study Questions 2 (I’ll post them tonight)