
Procedia - Social and Behavioral Sciences 27 ( 2011 ) 226 – 232

Pacific Association for Computational Linguistics (PACLING 2011)

Content-Structure Correspondence: A Generic Representation for Heterogeneous Structured Document

Saravadee Sae Tan a,*, Enya Kong Tang a,*, Bali Ranaivo-Malancon a,*

a Faculty of Information Technology, Multimedia University, Selangor, Malaysia

Abstract

On the Web, most structured document collections consist of documents from different sources, marked up with different types of structures. This diversity of structures has led to the emergence of heterogeneous structured documents. The heterogeneity of structured documents poses new challenges for document representation in structured document retrieval. The representation model needs to handle various types of structures as well as multiple structures in a single document. Furthermore, the same information may be represented in different structures, and the information contained in different documents may be partial and inconsistent. Therefore, the linkage of semantically related elements in the document collections needs to be modelled in the representation model. In this paper, we introduce a generic and flexible structured document model to represent heterogeneous structured documents as well as the similar correspondences in the document collections.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of PACLING 2011

Keywords: Parsing; Subcategorization; PP attachment; Coordination attachment; Text understanding; Grammar writing

1. Problem

In recent years, there has been a rapid growth of structured documents on the Web [1]. These structured documents are generated by different parties, originate from different sources and are prepared to serve different purposes. As they are designed by different individuals, the same information can be represented in different structures, and the information contained in different documents may be partial and inconsistent. The structured document collections available on the Web are therefore highly heterogeneous. This heterogeneity poses new challenges to the retrieval process when information is retrieved from different sources, each adopting its own structure. In particular, the structural conditions of a query must be mapped to the different, heterogeneous structures of the documents. Here, we discuss some issues in representing structured documents for heterogeneous structured document retrieval (SDR).

* Corresponding author.



1877-0428 © 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of PACLING Organizing Committee. doi: 10.1016/j.sbspro.2011.10.602

http://creativecommons.org/licenses/by-nc-nd/3.0/



i) Heterogeneous types of structures. The first issue concerns the modelling of heterogeneous types of structures. Structured documents may vary in the types of structures marked up in the document, e.g. document logical structure [2], domain concepts [3, 4], named entities [5], etc. Markup can be added at different granularity levels of the document content, such as a whole section, an entity or a single word. Furthermore, some structured documents are represented as a tree in well-formed XML, whereas in other cases annotations are applied only to certain parts of the content and the documents are represented in non-well-formed XML.

ii) Multiple structures of a document. In many circumstances, the same document may be marked up with more than one type of structure. For instance, Wikipedia articles from the INEX Wikipedia 2009 collection are marked up with both logical and semantic markup. Different types of structures serve different purposes and may have different importance in an information retrieval task. Therefore, there is a need to distinguish and represent the various structures of a document in the representation model.

iii) Aggregation of XML elements. According to the SDR principle, a system should always retrieve the most specific part of a document answering the query [6]. Parallel to the issue of which document fragments to return is the issue of the indexing unit in structured document retrieval. The nested hierarchical structure poses a challenge in partitioning an XML document into meaningful XML fragments. Therefore, a more flexible and generic model is needed to represent aggregated document fragments.

iv) Correspondence of semantically related elements. Structure heterogeneity is yet another main challenge in heterogeneous SDR. In many circumstances, the same information may be represented by different structures due to differences in how the information is conceptualized. For instance, a collection of publication entities can be organized around authors, years, or the publications themselves. In addition, differences in the granularity of the content being marked up will result in different structures. For example, an author name can be represented as a single XML element, <author>Susan Dumais</author>, or can be further divided into first name and last name, <author><firstnm>Susan</firstnm><lastnm>Dumais</lastnm></author>. Furthermore, different parties may use different tag names to denote the same concept, e.g. <author> vs <writer> and <para> vs <p>. Conversely, the same tag name may be used to describe different concepts; for example, <name> can refer to a person name or a hotel name.

2. A Generic Model for Heterogeneous Structured Document

In order to address the problems mentioned above, we propose a generic and flexible structured document model to represent heterogeneous structured documents as well as the similar correspondences in the collection. The proposed model is flexible enough to represent structures of heterogeneous types. More importantly, it enables the representation of the correspondences between similar contents in the documents.

2.1. Content-Structure Correspondence

Content-Structure Correspondence is defined as a triple $(T, S, DCorr)$, where $T$ is the text content, $S$ is the structural context and $DCorr$ is a set of direct correspondence relations between $T$ and $S$.

1) Text Content. The text content of a document is a string $T$, which consists of a sequence of terms.

2) Structural Context. A structure $S = (V, E)$ is a labeled tree, where $V$ is a set of structural nodes and $E$ is a set of edges indicating the parent-child relations in the tree.

- $V = \{v_1, v_2, \ldots, v_n\}$, where $n$ is the total number of structural nodes in a structured document.
- $e(v_i, v_j) \in E$ if $v_i, v_j \in V$ and $v_i$ is the parent of $v_j$.
- $E$ can be $\emptyset$ for a non-well-formed structure.

3) Direct Correspondence. The relations between $T$ and $S$ are denoted by a set of direct correspondences $DCorr$. $DCorr$ is encoded on the structural context by attaching to each node $v$ an interval $I(v)$ indicating the start and end positions of the content enclosed by the node.

- An interval is denoted as $i\ j$, where $0 \le i \le j \le n$ and $n$ is the length of the string.
- A sequence of intervals $S$ is denoted as $S = i_1\ j_1 + i_2\ j_2 + \cdots + i_p\ j_p$, where
  - $S$ is ordered, such that $i_k < i_{k+1}$ for all $1 \le k < p$;
  - $S$ has no overlaps, such that $j_k \le i_{k+1}$ for all $1 \le k < p$.
- The union ($\cup$) of a sequence $S = i_1\ j_1 + i_2\ j_2 + \cdots + i_p\ j_p$ and an interval $I = i\ j$ is defined as
  - $S \cup I = S$, if there is a $k$ such that $i_k \le i$ and $j \le j_k$;
  - $S \cup I = S'$ otherwise, where $S'$ is $S$ augmented by inserting $i\ j$ at its proper location:
    - $S' = I + S$ if $j < i_1$, or
    - $S' = S + I$ if $i > j_p$, or
    - $S' = i_1\ j_1 + \cdots + i_k\ j_k + i\ j + i_{k+1}\ j_{k+1} + \cdots + i_p\ j_p$ if there is a $k < p$ such that $j_k < i$ and $j < i_{k+1}$.

2.2. Similar Correspondence

A similar correspondence relates similar contents in structured documents in order to make such relations explicit. Similarity between two structured contents is defined in terms of a similar correspondence between two Content-Structure Correspondences (CSCs).

- Let $X$ and $Y$ be CSCs, each a triple $(T, S, DCorr)$.
- A similar correspondence $SCorr$ between $X$ and $Y$ is a triple $(X, Y, C(V_X, V_Y))$, where
  - $C(V_X, V_Y)$ is a set of sub-correspondences $C = \{c_1, c_2, \ldots, c_{|C|}\}$ between $X$ and $Y$;
  - $V_X$ and $V_Y$ are the sets of structural nodes in $X$ and $Y$ respectively.
- A sub-correspondence $c = (v_X, v_Y, w)$ is denoted by a pair of structural nodes $v_X$ and $v_Y$ and a weight $w$, where
  - $v_X \in V_X$ and $v_Y \in V_Y$, and each structural node is denoted by an interval as defined in Section 2.1;
  - $w$ indicates the degree of similarity between $v_X$ and $v_Y$.
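The definitions of Sections 2.1 and 2.2 can be pictured as plain data structures. The following Python sketch is illustrative only and is not the authors' implementation. It assumes that interval positions are character offsets with Python slice semantics, and it refers to nodes by list index rather than by interval for brevity. All class names, field names and the toy data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) positions in the text T (assumed character offsets)


@dataclass
class Node:
    """A structural node v together with its direct correspondence I(v)."""
    label: str
    intervals: List[Interval]     # ordered, non-overlapping sequence of intervals
    parent: Optional[int] = None  # index of the parent node; None when there is no edge


@dataclass
class CSC:
    """Content-Structure Correspondence (T, S, DCorr).

    S is the node list plus parent links (the edge set E may be empty);
    DCorr is stored on each node as `intervals`.
    """
    text: str
    nodes: List[Node] = field(default_factory=list)

    def edges(self) -> List[Tuple[int, int]]:
        # E as (parent index, child index) pairs
        return [(n.parent, i) for i, n in enumerate(self.nodes) if n.parent is not None]

    def content(self, index: int) -> str:
        # Concatenate the text spans covered by the node's interval sequence.
        return " ".join(self.text[i:j] for i, j in self.nodes[index].intervals)


@dataclass
class SubCorrespondence:
    """c = (v_X, v_Y, w): a node of X, a node of Y and a similarity weight."""
    node_x: int
    node_y: int
    weight: float


@dataclass
class SCorr:
    """Similar correspondence (X, Y, C(V_X, V_Y))."""
    x: CSC
    y: CSC
    subs: List[SubCorrespondence]


# Toy data invented for illustration; offsets and weight are not from the paper's figures.
x = CSC("Susan Dumais", [Node("author", [(0, 12)])])
y = CSC("Susan Dumais", [
    Node("author", [(0, 12)]),
    Node("firstnm", [(0, 5)], parent=0),
    Node("lastnm", [(6, 12)], parent=0),
])
link = SCorr(x, y, [SubCorrespondence(0, 0, 1.0)])
```

Storing the intervals on the nodes keeps the text untouched, so a structure with no edges at all is still a valid `CSC` under this sketch.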

3. Representing Heterogeneous Structured Document

Let us now present, using illustrative cases, how the Content-Structure Correspondence (CSC) Model can solve some of the issues in modeling heterogeneous structured documents.

3.1. Heterogeneous Types of Structures

The main issue in modeling heterogeneous structured documents is the heterogeneous types of structures in the document collection. doc1 (Figure 1) shows a structured document marked up with its logical structure, and its Content-Structure Correspondence Model CSC1 is illustrated in Figure 2. The relations between the text content and the structural context are denoted by the direct correspondence encoded in each structural node. For instance, the structural node <header> is denoted by the interval ‘9 90’, which indicates the start and end positions of the content of <header>. The Content-Structure Correspondence Model is able to represent content that consists of a discontinuous string by using a sequence of intervals. In the example, the direct content of the structural node <p>, ‘To personalize or Not to Personalize: Modeling Queries with Variation in User Intent’ and ‘(Microsoft Research)’, is denoted by the sequence of intervals ‘91 177 + 227 250’.

Figure 3 shows an example of an XML document from the DBLP XML collection, which is marked up with semantic structure, together with its CSC Model. Figure 4 gives an example of a non-well-formed XML document. The CSC Model handles this case by allowing the set of edges in the structural context to be empty ($E = \emptyset$ in CSC3).

3.2. Multiple Structures of a Document

In some circumstances, the same document may be marked up with more than one type of structure. doc4 in Figure 5 shows an example of an XML document from the INEX Wikipedia 2009 collection, in which both logical and semantic markup appear in a single XML document. Such a structured document can be represented by more than one CSC, where each CSC represents a single type of structure. Thus, doc4 is represented by CSC4 (Figure 6) and CSC5 (Figure 7), where CSC4 denotes the logical structure and CSC5 denotes the semantic structure. The two structures are kept apart in different CSCs, and the original structure can be derived from the union of CSC4 and CSC5.
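As a rough illustration of keeping several structures over one text, the Python sketch below stores each structure type as a separate layer and merges the layers on demand. It is a sketch under stated assumptions, not the paper's procedure: the `Layer` type, the `merge_layers` helper, the tag names and the toy text are all hypothetical, and positions are assumed to be character offsets.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]
# A layer maps a tag name to the interval sequence of its content (one structure type).
Layer = Dict[str, List[Interval]]

# Toy text and offsets invented for illustration (this is not the paper's doc4).
text = "Susan Dumais works at Microsoft Research"

logical: Layer = {"p": [(0, 40)]}
semantic: Layer = {"person": [(0, 12)], "organization": [(22, 40)]}


def merge_layers(*layers: Layer) -> List[Tuple[str, Interval]]:
    """Combine several layers into one list of (tag, interval) annotations.

    Sorted by start ascending, then end descending, so that an enclosing
    element comes before the elements it contains.
    """
    merged = [(tag, iv) for layer in layers for tag, ivs in layer.items() for iv in ivs]
    return sorted(merged, key=lambda item: (item[1][0], -item[1][1]))


combined = merge_layers(logical, semantic)
```

Because both layers index into the same text, either one can be used alone at retrieval time, and the merged list recovers the combined markup when needed.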


3.3. Flexible Aggregation of XML elements

Figure 8 shows an XML document describing the list of tutorials at SIGIR 2010. The XML document has a root element <tutorials>, a <header> element describing the conference title and a list of <tutorial> elements providing the details of the tutorials. The nested structure poses a challenge in partitioning the XML document into meaningful XML fragments. A common approach is to partition the XML document into non-overlapping XML fragments (Figure 9). However, the XML fragment dominated by the structural node <tutorial> contains only the details of the tutorial ‘Learning to Rank for Information Retrieval’; it lacks the conference name and may not be a meaningful information unit to the user.

The CSC Model can represent more meaningful document fragments because it allows any XML elements to be aggregated. For instance, doc5 can be represented by two CSCs, CSC6 and CSC7 (Figure 10 and Figure 11), where each CSC consists of the conference name and the detailed information about one tutorial.
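One way to picture this aggregation is with the interval-sequence union of Section 2.1: the interval of the conference header is combined with the interval of a single tutorial. The Python sketch below is illustrative only. It assumes character offsets, and the function names, the flattened toy text and the second tutorial title ("Tutorial B") are hypothetical. Inputs that match none of the insertion cases of Section 2.1 are rejected.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def union(seq: List[Interval], iv: Interval) -> List[Interval]:
    """Union of an ordered, non-overlapping interval sequence with one interval.

    Mirrors the cases of Section 2.1: the sequence is returned unchanged when
    iv is already covered; otherwise iv is inserted at its sorted position.
    """
    i, j = iv
    if any(ik <= i and j <= jk for ik, jk in seq):
        return list(seq)
    for pos in range(len(seq) + 1):
        left_ok = pos == 0 or seq[pos - 1][1] < i      # j_k < i (or iv goes first)
        right_ok = pos == len(seq) or j < seq[pos][0]  # j < i_{k+1} (or iv goes last)
        if left_ok and right_ok:
            return seq[:pos] + [iv] + seq[pos:]
    raise ValueError("interval matches none of the insertion cases")


def span(text: str, piece: str) -> Interval:
    """Interval of the first occurrence of `piece` in `text`."""
    start = text.index(piece)
    return (start, start + len(piece))


def content(text: str, seq: List[Interval]) -> str:
    """Text covered by an interval sequence, joined with single spaces."""
    return " ".join(text[i:j] for i, j in seq)


# Toy flattened text; "Tutorial B" is a placeholder, not a title from the paper.
text = "SIGIR 2010|Learning to Rank for Information Retrieval|Tutorial B"
header = span(text, "SIGIR 2010")
tutorial_1 = span(text, "Learning to Rank for Information Retrieval")
tutorial_2 = span(text, "Tutorial B")

# Each fragment aggregates the conference name with one tutorial.
fragment_1 = union([header], tutorial_1)
fragment_2 = union([header], tutorial_2)
```

Here `content(text, fragment_2)` skips the first tutorial entirely, which is the kind of discontinuous fragment a purely tree-based partition cannot express.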


3.4. Correspondence of Similar Elements

The Content-Structure Correspondence Model allows the linkage of semantically related elements between structured documents via the similar correspondence relation. From the previous examples, doc1 (Figure 1) and doc2 (Figure 3) are elements describing a paper by Susan Dumais. Both documents describe the same information but present it in different structures (Figure 12).

The similarity between doc1 and doc2 is represented by the similar correspondence $C_{CSC1,CSC2}$ between CSC1 and CSC2, as below:


$\{c_1, c_2, c_3, c_4, c_5, c_6\}$ is the set of sub-correspondences between CSC1 and CSC2. $c_1 = (0\ 260,\ 6\ 298,\ 0.42)$ is the sub-correspondence between the structural node <article> in CSC1 and <inproceedings> in CSC2. Each structural node is denoted by an interval (‘0 260’ for <article> and ‘6 298’ for <inproceedings>), and the weight (‘0.42’) indicates the degree of similarity between <article> and <inproceedings>.

4. Conclusion

In this paper, we proposed a generic and flexible structured document representation model and illustrated how it can represent structured documents of heterogeneous types and address some of the issues in modeling heterogeneous structured documents. More importantly, the proposed model provides the flexibility to represent similar correspondences between contents in the documents. Many aspects of the model were not formally presented in this paper. In future work, we will define all aspects formally and aim to provide a complete and sound structured document representation model. We will also implement our model using real data to evaluate its effectiveness and efficiency compared to other structured document representation models.

References

[1] D. Barbosa, L. Mignet, P. Veltri, Studying the XML web: Gathering statistics from an XML sample, World Wide Web 9 (2) (2006) 187–212.

[2] L. Denoyer, P. Gallinari, The Wikipedia XML corpus, SIGIR Forum 40 (1) (2006) 64–69.

[3] G. Kazai, A. Doucet, M. Landoni, Overview of the INEX 2008 book track, in: Advances in Focused Retrieval: 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX, Springer-Verlag, Dagstuhl Castle, Germany, 2008, pp. 106–123.

[4] R. Schenkel, F. Suchanek, G. Kasneci, YAWN: A semantically annotated Wikipedia XML corpus, in: Proceedings of Datenbanksysteme in Business, Technologie und Web, BTW, Aachen, Germany, 2008, pp. 277–291.

[5] R. van Zwol, T. van Loosbroek, Effective use of semantic structure in XML retrieval, in: Proceedings of the 29th European Conference on IR Research, ECIR, 2007, pp. 621–628.

[6] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, NY, USA, 2008.
