tdx: a high-performance table-driven xml parser

Post on 31-Dec-2015

TDX: a High-Performance Table-Driven XML Parser

Wei Zhang

Robert van Engelen

Department of Computer Science

Florida State University

Outline

• Motivation
• Introduction
• Recent Work
• Table-Driven XML Parsing – TDX
• TDX Construction Toolkit
• Results and Preliminary Conclusion

Motivation

• Enhance performance for XML-based Web Services
• Provide flexibility
• Offer high-level modularity

Introduction

Validating XML parsing
• Three stages
  • Well-formedness
  • Validation
  • Data conversion
• Separating the stages introduces overhead and requires frequent access to the schema

[Figure: XML input passing through three separate stages (well-formedness, validation, data conversion) on its way to the application]

Introduction (cont’d)

Schema-specific XML parsing (SSP)
• Merges well-formedness checking and validation
• Does not require frequent access to the schema
• Implemented SSP parsers still perform data conversion as a separate stage

[Figure: well-formedness and validation merged into one stage, with data conversion as a separate stage]

Recent Work

Chiu: “A compiler-based approach to schema-specific XML parsing”
• Merges parsing and validation by constructing a PDA
• No namespace support
• Conversion from NFA to DFA may result in exponentially growing space requirements

Recent Work (cont'd)

van Engelen: “Constructing finite automata for high-performance web services”
• Integrates parsing and validation into one stage, with parsing actions encoded in a DFA
• Cannot process cyclic XML schemas

Recent Work (cont'd)

van Engelen: “The gSOAP toolkit for web services and peer-to-peer computing networks”
• Namespace support
• Merges parsing and validation
• Implements a recursive-descent parser
• Disadvantages of recursive descent
  • Code size and function call overhead

Table-Driven XML Parsing (TDX)

• An LL(1) grammar can be derived from the schema
• XML documents can be parsed and validated using the LL(1) grammar
• Well-formedness (parsing) can be verified through grammar rules
• Validation can be accomplished using semantic actions
• Application-specific events can also be encoded as semantic actions

Illustrating Example

<schema>
  <element name="book" type="bookType"/>
  <complexType name="bookType">
    <sequence>
      <element name="title" type="string"/>
      <element name="author" type="string"/>
    </sequence>
  </complexType>
</schema>

LL(1) Grammar:
s  → ‘<book>’ t ‘</book>’
t  → t1 t2
t1 → ‘<title>’ DATA //imp_s(s.val) ‘</title>’
t2 → ‘<author>’ DATA //imp_s(s.val) ‘</author>’

Illustrating Example (cont'd)

(a) An XML Instance

<book>
  <title>XML Tech</title>
  <author>Bob</author>
</book>

(b) Predictive Parsing

[Figure: parse tree. s expands to ‘<book>’ t ‘</book>’; t expands to t1 t2; t1 expands to ‘<title>’ DATA ‘</title>’ with action imp_s(“XML Tech”); t2 expands to ‘<author>’ DATA ‘</author>’ with action imp_s(“Bob”)]
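
The predictive parse of the book example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TDX implementation: the names PRODUCTIONS, TABLE, and parse are hypothetical, tokens are modeled as (kind, value) pairs, and the imp_s action from the slide is modeled as simply collecting the string it receives.

```python
# Grammar of the book example, stored as data.
# Symbols starting with '#' are semantic actions; the rest are
# nonterminals (s, t, t1, t2) or terminals (tag tokens and DATA).
PRODUCTIONS = {
    0: ["<book>", "t", "</book>"],                   # s  -> '<book>' t '</book>'
    1: ["t1", "t2"],                                 # t  -> t1 t2
    2: ["<title>", "DATA", "#imp_s", "</title>"],    # t1 -> '<title>' DATA '</title>'
    3: ["<author>", "DATA", "#imp_s", "</author>"],  # t2 -> '<author>' DATA '</author>'
}
# Parsing table: (nonterminal, lookahead token) -> production index.
# A missing entry is an error entry.
TABLE = {
    ("s", "<book>"): 0,
    ("t", "<title>"): 1,
    ("t1", "<title>"): 2,
    ("t2", "<author>"): 3,
}
NONTERMINALS = {"s", "t", "t1", "t2"}


def parse(tokens):
    """Parse and validate a token list; return the strings passed to imp_s."""
    imported = []
    stack = ["s"]
    pos = 0
    last_value = None
    while stack:
        top = stack.pop()
        if top.startswith("#"):            # semantic action: runs when popped
            imported.append(last_value)
            continue
        if pos >= len(tokens):
            raise ValueError("invalid: unexpected end of input")
        kind, value = tokens[pos]
        if top in NONTERMINALS:            # expand using the parsing table
            index = TABLE.get((top, kind))
            if index is None:
                raise ValueError(f"invalid: {kind} not allowed in {top}")
            stack.extend(reversed(PRODUCTIONS[index]))
        elif top == kind:                  # terminal matches the lookahead
            last_value = value
            pos += 1
        else:
            raise ValueError(f"invalid: expected {top}, got {kind}")
    if pos != len(tokens):
        raise ValueError("invalid: trailing input")
    return imported


TOKENS = [
    ("<book>", None),
    ("<title>", None), ("DATA", "XML Tech"), ("</title>", None),
    ("<author>", None), ("DATA", "Bob"), ("</author>", None),
    ("</book>", None),
]
print(parse(TOKENS))
```

Well-formedness and structural validity both fall out of the table lookup: a document with the author before the title fails at the (t, ‘<author>’) entry, which is an error entry.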

Roadmap

• Recent Work
• Table-Driven XML parsing – TDX
  • Illustrating example
  • Architecture
  • Token generation
  • Mapping schema to LL(1)
  • Parsing table
  • Parsing engine
  • Scanner/tokenizer
• TDX construction Tool Kit
• Experiment Results and Preliminary Conclusion

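TDX - Architecture

[Figure: architecture diagram with these components: XML input; scanner/tokenizer (DFA) producing tokens and CDATA; parsing engine (TDX) driven by the LL(1) parsing table and the LL(1) grammar productions and actions; events delivered to the application modules; an error reported for invalid input]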

Token Generation

• Tokens are defined by <namespace, tag>
  • Element name (opening and closing)
  • Attribute name
• Some data types
  • Such as enumeration
• Namespace binding
  • Identical tag names under different namespaces are represented as different tokens
• Normalized tokens
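
One way to picture normalized tokens is a table that assigns a small integer to each distinct <namespace, tag> pair. The sketch below is an assumption about how such a table could look, not the TDX data structure; TokenTable, intern, and normalize are hypothetical names, and the namespace URIs are placeholders.

```python
class TokenTable:
    """Assigns one small integer to each distinct (namespace, tag) pair."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, namespace, tag):
        key = (namespace, tag)
        if key not in self._ids:
            self._ids[key] = len(self._names)
            self._names.append(key)
        return self._ids[key]

    def name(self, token_id):
        return self._names[token_id]


def normalize(qname, bindings, default_ns):
    """Resolve a possibly prefixed name such as 'y:title' to (namespace, tag)."""
    prefix, sep, local = qname.partition(":")
    if not sep:                       # no prefix: fall back to the default namespace
        return default_ns, qname
    return bindings[prefix], local


table = TokenTable()
bindings = {"x": "http://www.x.org", "y": "http://www.y.org"}
default_ns = "http://www.x.org"

x_title = table.intern(*normalize("title", bindings, default_ns))
y_title = table.intern(*normalize("y:title", bindings, default_ns))
x_again = table.intern(*normalize("x:title", bindings, default_ns))
print(x_title, y_title, x_again)
```

Here "title" and "x:title" normalize to the same token, while "y:title" gets a different one, so the parsing table can be indexed by plain integers without comparing strings during parsing.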

Mapping Schema to LL(1) Grammar

• Structural constraints are mapped to rules
• Validation constraints are mapped to semantic actions
• Note that many types of validation constraints are mapped to rules
  • Such as occurrence, enumeration

Mapping Example (1)

<simpleType name="state">
  <restriction base="string">
    <enumeration value="OFF"/>
    <enumeration value="ON"/>
  </restriction>
</simpleType>

state → “OFF” | “ON”

<simpleType name="value">
  <restriction base="integer">
    <minInclusive value="10"/>
    <maxInclusive value="250"/>
  </restriction>
</simpleType>

value → DATA //imp_i(char *s)
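
The two mappings differ in where the check happens: the enumeration becomes grammar alternatives, while the integer range becomes a semantic action. A minimal Python sketch of both follows. The slide's imp_i takes a char *s, which suggests the original is C; make_imp_i and matches_state are hypothetical helpers introduced only for illustration.

```python
def make_imp_i(min_inclusive, max_inclusive):
    """Build a semantic action that converts DATA to an int and checks the range."""
    def imp_i(text):
        value = int(text.strip())      # data conversion; raises ValueError on non-integers
        if not (min_inclusive <= value <= max_inclusive):
            raise ValueError(f"{value} is outside [{min_inclusive}, {max_inclusive}]")
        return value
    return imp_i


# value -> DATA //imp_i, with minInclusive=10 and maxInclusive=250
imp_value = make_imp_i(10, 250)

# state -> "OFF" | "ON": no action is needed, because each literal is its own
# alternative and any other text has no matching rule.
STATE_ALTERNATIVES = ("OFF", "ON")


def matches_state(text):
    return text in STATE_ALTERNATIVES


print(imp_value(" 42 "))
print(matches_state("ON"), matches_state("MAYBE"))
```

Conversion and validation happen in the same call, so the value does not need a separate data conversion pass after parsing.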

Mapping Example (2)

<complexType name="example">
  <choice>
    <element name="id" type="id_type" minOccurs="0"/>
    <element name="value" type="value_type" minOccurs="2" maxOccurs="unbounded"/>
  </choice>
</complexType>

example → c1 | c2
(with <sequence> instead of <choice>: example → c1 c2)

c1 → ‘<id>’ id_type ‘</id>’
c1 → ε

c2 → c’2 c’2 c’’2
c’2 → ‘<value>’ value_type ‘</value>’
c’’2 → c’2 c’’2
c’’2 → ε
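
The occurrence mapping used for c2 can be generated mechanically. The sketch below assumes productions are stored as lists of symbol names and that an empty list stands for ε; expand_occurrence and the generated nonterminal names (the _more and _opt suffixes) are hypothetical and not taken from the TDX toolkit.

```python
def expand_occurrence(name, item, min_occurs, max_occurs=None):
    """Return {nonterminal: [alternative, ...]} for `item` repeated between
    min_occurs and max_occurs times; max_occurs=None means "unbounded".
    An empty alternative [] stands for epsilon."""
    if max_occurs is not None and max_occurs < min_occurs:
        raise ValueError("maxOccurs must not be smaller than minOccurs")
    body = [item] * min_occurs              # the mandatory occurrences
    rules = {name: [body]}
    if max_occurs is None:
        tail = name + "_more"
        body.append(tail)
        rules[tail] = [[item, tail], []]    # tail -> item tail | epsilon
        return rules
    for i in range(max_occurs - min_occurs):
        opt = f"{name}_opt{i + 1}"
        body.append(opt)                    # chain one optional slot
        body = [item]
        rules[opt] = [body, []]             # opt -> item [next slot] | epsilon
    return rules


# minOccurs="2" maxOccurs="unbounded", as for the value element
for head, alternatives in expand_occurrence("c2", "value_item", 2).items():
    for alternative in alternatives:
        print(head, "->", " ".join(alternative) or "epsilon")
```

For the value element this yields two mandatory occurrences followed by a right-recursive tail, which has the same shape as the c2, c’2, c’’2 rules. Right recursion keeps the grammar LL(1).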

LL(1) Parsing Table

• Constructed from the LL(1) grammar
• Indexed by nonterminals and terminals
• Each entry contains either the index of a grammar production or an error entry
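
The table can be built with the textbook FIRST/FOLLOW construction. The Python sketch below uses that standard algorithm, which is not necessarily how the TDX toolkit builds its table. The grammar is a simplified, assumed variant of the sequence form of the mapping example, with DATA standing in for id_type and value_type. build_table and the -1 error marker are illustrative choices.

```python
EPS = ""    # marker for epsilon inside FIRST sets
END = "$"   # end-of-input marker


def first_of(seq, first, nonterminals):
    """FIRST set of a sequence of grammar symbols."""
    out = set()
    for sym in seq:
        f = first[sym] if sym in nonterminals else {sym}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                     # every symbol can derive epsilon
    return out


def build_table(productions, start):
    """productions: list of (head, body) pairs, body being a list of symbols.
    Returns table[nonterminal][terminal] = production index, or -1 for error."""
    nts = {head for head, _ in productions}
    terminals = sorted({s for _, body in productions for s in body if s not in nts})
    terminals.append(END)
    first = {n: set() for n in nts}
    follow = {n: set() for n in nts}
    follow[start].add(END)
    changed = True
    while changed:                   # iterate FIRST and FOLLOW to a fixed point
        changed = False
        for head, body in productions:
            new_first = first_of(body, first, nts)
            if not new_first <= first[head]:
                first[head] |= new_first
                changed = True
            for i, sym in enumerate(body):
                if sym not in nts:
                    continue
                rest = first_of(body[i + 1:], first, nts)
                add = rest - {EPS}
                if EPS in rest:
                    add |= follow[head]
                if not add <= follow[sym]:
                    follow[sym] |= add
                    changed = True
    table = {n: {t: -1 for t in terminals} for n in nts}
    for index, (head, body) in enumerate(productions):
        f = first_of(body, first, nts)
        lookahead = (f - {EPS}) | (follow[head] if EPS in f else set())
        for t in lookahead:
            if table[head][t] != -1:
                raise ValueError(f"grammar is not LL(1) at ({head}, {t})")
            table[head][t] = index
    return table


GRAMMAR = [
    ("example", ["c1", "c2"]),                   # 0
    ("c1", ["<id>", "DATA", "</id>"]),           # 1
    ("c1", []),                                  # 2: epsilon (minOccurs=0)
    ("c2", ["v", "v", "more"]),                  # 3: at least two values
    ("v", ["<value>", "DATA", "</value>"]),      # 4
    ("more", ["v", "more"]),                     # 5
    ("more", []),                                # 6: epsilon
]
table = build_table(GRAMMAR, "example")
print(table["c1"])
```

Every entry left at -1 is an error entry: seeing that token while expanding that nonterminal means the document is invalid.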

Parsing Engine

• Schema independent
• Maintains
  • Parsing table
  • Production table
  • Action table
  • Stack

Scanner/Tokenizer

• Constructed from the schema
• The schema provides DFA state information
  • Element name
    • Has attribute?
  • Attribute name
• The root element needs special care
  • Schema information

Scanner/Tokenizer example

<book xmlns:x="http://www.x.org"
      xmlns:y="http://www.y.org"
      targetnamespace="http://www.x.org">
  <title>XML Bible</title>
  <author>
    <name> Bob </name>
    <y:title> professor</y:title>
  </author>
</book>

Tokens for <title>XML Bible</title>: <"www.x.org", "title">, DATA, <"www.x.org", "/title">
Token for <y:title>: <"www.y.org", "title">

TDX Construction Toolkit

[Figure: toolkit workflow. wsdl2TDX takes Service.wsdl and generates Service_flex.l, Service_TDX.h, and Service_TDX.c; flex turns Service_flex.l into tab.yy.c]

Experiment Setup

• Compared with
  • DFA-based parser
  • gSOAP 2.7
  • eXpat 1.2
  • Xerces 2.7.0
• Memory-resident XML message
• Elapsed real time measured using gettimeofday()

Parsing Performance (1)

[Figure: bar chart of time in us (y-axis, 0 to 350) for TDX, TDX -Cfa, DFA, DFA -Cfa, eXpat, gSOAP, and Xerces; EchoString array size = 1024B; legend: validation, decoding+validation, parsing, parsing+validation]

Parsing Performance (2)

[Figure: time in us (y-axis, logarithmic, 1 to 100000) versus EchoString array size (x-axis, logarithmic, 1 to 10000) for Xerces, gSOAP, eXpat, TDX, and DFA]

Conclusion

• Enhanced parsing speed
• Flexible framework
  • Encodes value-based validation and application-specific events as semantic rules
  • Combines structural, syntactic, and semantic constraints in one pass
• High level of modularity
