
Tip: Data scoring: Convert data with XQuery

Do quality analysis on conversion results

Skill Level: Intermediate

Geert Josten
Consultant/Content Engineer
Daidalos BV

29 Sep 2009

The process of converting data is one of migrating information from an unsuitable source or format to a suitable one, which is often not an exact science. Data scoring is a way to measure the accuracy of your conversion. Discover a simple scoring technique in XQuery that you can apply to the result of a small text-to-XML conversion.

Frequently used acronyms

• HTML: Hypertext Markup Language

• W3C: World Wide Web Consortium

• URL: Uniform Resource Locator

• XML: Extensible Markup Language

• XSLT: Extensible Stylesheet Language Transformations

Scoring converted data is all about analyzing the quality of the conversion. Quality can mean different things, and converting data from a database carries with it different problems than converting data from documents with more natural language. The technique that this tip presents makes no assumptions: You can apply it to any XML code of interest. To see the technique in practice, you will convert plain text (not comma-separated files, but plain text from news items grabbed from the Internet).

Plain text input

Data scoring: Convert data with XQuery. © Copyright IBM Corporation 2009. All rights reserved.


For the source, I grabbed text from the Google news Web site using the URL http://news.google.com/news/section?pz=1&topic=w&ict=ln. Figure 1 shows the resulting page.

Figure 1. Example news item on Google news

These news items have a basic structure: They start with a heading followed by source details, the news message itself, and some additional information from different sources. In this tip, you will extract the headline, location, text, related headline, and the source of the related headline. Figure 2 shows these elements.

Figure 2. Analysis of the text of an example Google news item

developerWorks® ibm.com/developerWorks


For this example, I selected the top three news items for that moment, copied the text, and removed the lines this tip doesn't discuss, just to save space. I also separated the text into the individual news items to give you a head start.

Line ends

Line ends are always represented by the line feed character (numeric character 10) in memory when working with XQuery, regardless of the operating system.

You will use the input data in Listing 1. (I added paragraph characters [¶] to visualize line ends. The entity &quot; represents the double quotation mark character.)


Listing 1. The plain text

let $news-items := (
"Afghans go to polls under threat of Taliban violence¶
KABUL, Afghanistan (CNN) - Under the menacing threat of violence from the Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's second-ever national election.¶
Afghan people cast votes in hope of better future RT",

"Security stepped up after Baghdad bombings¶
BAGHDAD, Iraq (CNN) - The Iraqi government implemented new security measures a day after a string of bombings in Baghdad killed at least 100 people and wounded hundreds more, an interior ministry official told CNN on Thursday.¶
Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera",

"Will Democrats Go At It Alone on the Health Care Bill?¶
This is a rush transcript from &quot;On the Record,&quot; August 19, 2009. This copy may not be in its final form and may be updated.¶
New Health Care Strategy CBS")

To see the scoring technique in practice, you will define patterns to extract information, convert to XML, and apply the technique on the result. The big issue is not finding patterns but converting those patterns to reliable extraction rules. This technique is a useful tool to help you analyze the reliability of your own rules.

Before you start to apply patterns, however, first break down the structure of each news item. Each news item has three lines of data:

• Headline

• Message

• Headline of a related news item

You iterate over all news items and separate the lines by tokenizing on line ends:

let $result :=
    for $news-item in $news-items
    let $lines := tokenize($news-item, '&#10;')

The result variable captures the XML result.
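To see the tokenizing step in isolation, here is a minimal sketch that you can run on its own. The two-line sample string and the lines and line wrapper elements are assumptions made for illustration; only tokenize, normalize-space, and the line feed reference &#10; come from the tip itself.

```xquery
xquery version "1.0";

(: Hypothetical two-line sample; &#10; is the line feed character (code point 10). :)
let $news-item := "First headline&#10;CITY, Country (SRC) - Message text"

(: The second argument of tokenize is a regular expression; here it is a
   single literal line feed. :)
let $lines := tokenize($news-item, '&#10;')

return
    <lines count="{count($lines)}">{
        for $line at $pos in $lines
        return <line n="{$pos}">{normalize-space($line)}</line>
    }</lines>
```

The sample should yield two line elements, one per line of input. Applied to the items in Listing 1, the same call should give you three lines per news item.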

Applying patterns

Analyzing conversion quality is only interesting when information is missing or joined together and must be separated, which is what this tip covers. Look at each line to search for patterns to separate combined information.

The first line


The first line contains only the headline. Apply normalize-space to trim redundant spaces:

let $headline := normalize-space($lines[1])

The second line

The second line (the one with the message) obviously contains the most information, but has no reliable pattern except for the location of the event. You can find this location at the start of the line, followed by a dash. Use the dash to separate the location from the text:

let $location := normalize-space(substring-before($lines[2], '-'))
let $text := normalize-space(substring-after($lines[2], '-'))
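The following sketch shows how this split behaves on a line with a dash and on a line without one. Both sample lines and the split wrapper element are hypothetical; the point is that substring-before and substring-after both return an empty string when the separator is absent.

```xquery
xquery version "1.0";

(: Two hypothetical message lines: one with a dash separator, one without. :)
let $samples := (
    "KABUL, Afghanistan (CNN) - Afghans headed to the polls.",
    "This is a rush transcript with no separator."
)
for $line in $samples
let $location := normalize-space(substring-before($line, '-'))
let $text := normalize-space(substring-after($line, '-'))
return
    (: The found attribute records whether the line contained a dash at all. :)
    <split found="{contains($line, '-')}">
        <location>{$location}</location>
        <text>{$text}</text>
    </split>
```

For the second sample, both location and text come out empty, so a line without a dash loses its text as well as its location.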

The third line

The third line is the most challenging: It contains both the headline and the name of the source without a marker to separate them visually. You can't know all the names, so you can't match them literally. Names typically start with a capital letter, which you can accommodate using regular expressions:

let $related-headline :=
    normalize-space(replace($lines[3], '^(.*?) [A-Z].*$', '$1'))

let $related-headline-source :=
    normalize-space(replace($lines[3], '^.*? ([A-Z].*)$', '$1'))

Basically, you match the string from start (^) to end ($) in these regular expressions. The .*? matches up to the first space-capital combination and should capture the headline text. The [A-Z].* should capture the source of that headline. By putting opening and closing parentheses [( and )] around the part you're interested in and using $1 as the replacement, you should end up with either the headline or the source, depending on where you put the parentheses.
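To check what these two expressions capture, you can run them against the three related-headline lines from Listing 1 outside the main script. This is only a sketch; the match wrapper element is an assumption added for illustration.

```xquery
xquery version "1.0";

(: The three related-headline lines from Listing 1. :)
let $third-lines := (
    "Afghan people cast votes in hope of better future RT",
    "Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera",
    "New Health Care Strategy CBS"
)
for $line in $third-lines
return
    <match>
        <related-headline>{
            (: Keep everything before the first space-capital combination. :)
            normalize-space(replace($line, '^(.*?) [A-Z].*$', '$1'))
        }</related-headline>
        <related-headline-source>{
            (: Keep everything from the first capital that follows a space. :)
            normalize-space(replace($line, '^.*? ([A-Z].*)$', '$1'))
        }</related-headline-source>
    </match>
```

Because .*? is reluctant, the split happens at the first capital letter that follows a space, not the last. By my reading of the expressions, that gives RT for the first line, but the second line is cut before Baghdad's and the third before Health. This is the kind of mismatch that value scoring helps you notice.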

Conversion result

Add the lines in Listing 2 to tag the extracted information.

Listing 2. Converting the extracted information to XML

return
    <news-item>{
        if ($headline) then
            <headline>{$headline}</headline>
        else (),

        if ($location) then
            <location>{$location}</location>
        else (),

        if ($text) then
            <text>{$text}</text>
        else (),

        if ($related-headline) then
            <related-headline>{$related-headline}</related-headline>
        else (),

        if ($related-headline-source) then
            <related-headline-source>{
                $related-headline-source
            }</related-headline-source>
        else ()
    }</news-item>

The if statements around the tags ensure that only those tags are written that contain a non-empty value. You can gather all expressions so far and append the following lines to make the script output the first news item:

return
    $result[1]

The lines in Listing 3 show the expected output.

Listing 3. XML output of the first news item

<news-item>
    <headline>Afghans go to polls under threat of Taliban violence</headline>
    <location>KABUL, Afghanistan (CNN)</location>
    <text>Under the menacing threat of violence from the Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's second-ever national election.</text>
    <related-headline>Afghan people cast votes in hope of better future</related-headline>
    <related-headline-source>RT</related-headline-source>
</news-item>

Now, convert all three items. Then, you can start to score how well you did.

Element scoring of the result

To analyze the quality of conversion results, you can choose from several methods based on your needs. The technique presented here is basic and you can use it in various ways. It consists of showing statistics at two detail levels:

• Scoring of each element of interest


• Scoring of each distinct element value for a particular element of interest

The elements of interest in this conversion result are all elements containing text:

• headline

• location

• text

• related-headline

• related-headline-source

The first detail level is merely a matter of counting all elements and calculating the ratio of elements to the total number of items. You can calculate the total using XQuery. (Note that the XML output of the news items was captured in a variable named $result.)

let $element-name := 'headline'
let $element-score := count($result//*[local-name() = $element-name])
let $element-ratio := round(100 * $element-score div count($news-items))

Matching elements on their local name is rather slow but saves code and makes your script more dynamic. In Listing 4, you wrap the calculation results in a small HTML report.
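As a worked example of the score and ratio arithmetic, the sketch below runs the same calculation on a hand-made result. The three news-item elements and their values are hypothetical stand-ins for $result, and $total replaces count($news-items).

```xquery
xquery version "1.0";

(: Hypothetical conversion result: three items, one of which lacks a location. :)
let $result := (
    <news-item><headline>A</headline><location>X</location></news-item>,
    <news-item><headline>B</headline><location>Y</location></news-item>,
    <news-item><headline>C</headline></news-item>
)
let $total := count($result)
for $element-name in ('headline', 'location')
let $element-score := count($result//*[local-name() = $element-name])
let $element-ratio := round(100 * $element-score div $total)
return
    concat($element-name, ': ', $element-score, ' of ', $total,
           ' (', $element-ratio, '%)')
```

With three items, a fully populated element scores 3 of 3 (100%), and an element missing from one item scores 2 of 3, which rounds to 67%.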

Listing 4. Creating an element scoring result

let $total-number-of-items := count($news-items)

let $elements-of-interest :=
    ('headline', 'location', 'text', 'related-headline', 'related-headline-source')

return
    <html>
        <body>
            <p><b>Total number of items: </b>{$total-number-of-items}</p>
            <table border="1">
                <tr>
                    <th>element name</th>
                    <th>score</th>
                    <th>ratio</th>
                </tr>{
                    for $element-name in $elements-of-interest

                    let $elements :=
                        $result//*[local-name() = $element-name]

                    let $element-score :=
                        count($elements)

                    let $element-ratio :=
                        round(100 * $element-score div $total-number-of-items)

                    return
                        <tr>
                            <td>{ $element-name }</td>
                            <td>{ $element-score }</td>
                            <td>{ concat($element-ratio,'%') }</td>
                        </tr>
                }</table>
        </body>
    </html>

To start, the code in Listing 4 displays the number of items being analyzed to give meaning to the ratios. Then, it loops over all elements of interest, calculates score and ratio, and writes a table row for each. Figure 3 shows the report of the conversion result.

Figure 3. Element scoring of the result

The scores look high, but there are some drop-outs on location and text. Continue with the scoring of element values and investigate.

Value scoring of the result

Value scoring on large datasets

When you apply this scoring technique on larger datasets, it can result in long calculation times on the reports. Consider testing this code on a small set first, then optimize the code to use the full capabilities of your XQuery processor as soon as the calculation time exceeds one second. If you use an XQuery-capable XML database, you should be able to use indices to make things even quicker.

Scoring the values of elements is almost as straightforward as scoring elements. Just determine the distinct values of each element and calculate a score and ratio for each value.
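The sketch below shows value scoring on its own. The three related-headline-source elements and the total of 3 are made-up data; one value is repeated on purpose so that the scores differ. The order by clause and the value wrapper element are additions for readability and are not part of the report in this tip.

```xquery
xquery version "1.0";

(: Hypothetical extracted values; 'RT' occurs twice to make the distribution visible. :)
let $elements := (
    <related-headline-source>RT</related-headline-source>,
    <related-headline-source>CBS</related-headline-source>,
    <related-headline-source>RT</related-headline-source>
)
(: Assumed number of converted items, standing in for count($news-items). :)
let $total-number-of-items := 3

for $distinct-value in distinct-values($elements)
let $value-score := count($elements[. = $distinct-value])
let $value-ratio := round(100 * $value-score div $total-number-of-items)
order by $value-score descending
return
    <value score="{$value-score}" ratio="{$value-ratio}%">{$distinct-value}</value>
```

Here RT should score 2 (67%) and CBS 1 (33%). On real data, a value that scores far higher or lower than you expect is worth a closer look.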

For additional information, you can calculate the score and ratio of missing elements. Replace the code in Listing 4 with the code in Listing 5.

Listing 5. Creating an element and value scoring report

<html>
    <body>
        <p><b>Total number of items: </b>{$total-number-of-items}</p>
        <table border="1" width="50%">
            <tr>
                <th colspan="2">element name</th>
                <th colspan="2">score</th>
                <th colspan="2">ratio</th>
            </tr>{
                for $element-name in $elements-of-interest

                let $elements :=
                    $result//*[local-name() = $element-name]

                let $element-score :=
                    count($elements)

                let $element-ratio :=
                    round(100 * $element-score div $total-number-of-items)

                return (
                    <tr>
                        <td colspan="2">{ $element-name }</td>
                        <td colspan="2">{ $element-score }</td>
                        <td colspan="2">{ concat($element-ratio,'%') }</td>
                    </tr>,

                    let $distinct-values :=
                        distinct-values($elements)

                    for $distinct-value in $distinct-values

                    let $value-score :=
                        count($elements[. = $distinct-value])

                    let $value-ratio :=
                        round(100 * $value-score div $total-number-of-items)

                    return
                        <tr>
                            <td>&#160;</td>
                            <td><i>{ $distinct-value }</i></td>
                            <td>&#160;</td>
                            <td><i>{ $value-score }</i></td>
                            <td>&#160;</td>
                            <td><i>{ concat($value-ratio,'%') }</i></td>
                        </tr>,

                    let $missing-score :=
                        $total-number-of-items - $element-score

                    let $missing-ratio :=
                        round(100 * $missing-score div $total-number-of-items)

                    where $missing-score > 0
                    return
                        <tr>
                            <td>&#160;</td>
                            <td><i><b>(not found)</b></i></td>
                            <td>&#160;</td>
                            <td><i>{ $missing-score }</i></td>
                            <td>&#160;</td>
                            <td><i>{ concat($missing-ratio,'%') }</i></td>
                        </tr>
                )
            }</table>
    </body>
</html>

The HTML table has six columns in this version of the report, but otherwise, the previous code hasn't changed. Listing 5 only extends the code to perform a sub-loop for each element of interest that iterates over all its distinct values, outputting additional table rows with a score and ratio for each value. It also adds a table row when the appropriate element is missing in at least one of the results.

Analyzing the scores

The scoring report is helpful for at least three use cases:

• Find drop-outs

• Inspect value distributions

• Search for anomalies in values

Drop-outs

In the element score report (Figure 3), you saw a low score for location and text. Figure 4 shows the part of the value scoring report for these elements.

Figure 4. Value scoring of location and text


The report shows one news item from which a location could not be extracted and one from which the text could not be extracted. These absences likely concern the same news item. Unfortunately, this report doesn't provide the information necessary to determine why the conversion failed for those items, but it does at least show that it failed. Now, check the content and take a second look at the patterns behind these elements.

Remember that location and text were separated from each other based on the presence of a dash. Return to Listing 1 and notice one news item in which the second line does not begin with a location. There is no dash, so the pattern to separate the items fails.
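One possible way to soften this drop-out is sketched below. It is not part of the original tip: the function local:split-message is a hypothetical helper, and the choice to split on a dash surrounded by spaces is an assumption. When no separator is found, the sketch keeps the whole line as text instead of discarding it.

```xquery
xquery version "1.0";

(: Hypothetical helper, not part of the original tip: splits a message line
   on the first ' - ' and falls back to text only when no separator exists. :)
declare function local:split-message($line as xs:string) as element()* {
    if (contains($line, ' - ')) then (
        <location>{normalize-space(substring-before($line, ' - '))}</location>,
        <text>{normalize-space(substring-after($line, ' - '))}</text>
    )
    else
        <text>{normalize-space($line)}</text>
};

<news-item>{
    local:split-message("This is a rush transcript with no location prefix.")
}</news-item>
```

With a rule like this, location would still show up as a drop-out in the report, but text would not. The trade-off is that a message containing ' - ' without a leading location would be split in the wrong place, so rerun the scoring report after any change like this.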

Value distributions

Apart from spotting drop-outs, you can also use these reports to analyze value distributions. Having converted only the three items, each value occurs only once, as you will see if you run the whole script and look at the report. Use the report on a larger dataset to experiment with this use case.

Anomalies

A third use case is to inspect the element values to spot anomalies. Figure 5 shows the part of the value scoring report that reveals the related-headline and related-headline-source elements.

Figure 5. Value scoring of related headline and source


Notice that your extraction rule did succeed but produced incorrect values. The values of the related-headline-source should have been:

• CBS

• RT

• Al Jazeera

Producing correct values here is difficult, but at least this report can help you spot such mistakes. If a lexicon of valid source values is available, you can use it to validate the captured values, then signal for incorrect values or rule them out as false positives.
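A lexicon check could look roughly like the sketch below. The $known-sources sequence is a hypothetical lexicon built from the three expected values, the captured strings are what the capital-letter rule appears to produce for Listing 1, and the source and suggestion elements are invented for illustration.

```xquery
xquery version "1.0";

(: Hypothetical lexicon of known sources; a real one would come from your own data. :)
let $known-sources := ('CBS', 'RT', 'Al Jazeera')

(: Assumed captured values, as the capital-letter rule seems to produce them. :)
let $captured := (
    "RT",
    "Baghdad's Green Zone - 19 Aug 09 Al Jazeera",
    "Health Care Strategy CBS"
)
for $value in $captured
(: A value is valid when it equals one of the lexicon entries. :)
let $exact := ($value = $known-sources)
(: Otherwise, look for a lexicon entry at the end of the captured value. :)
let $suffix := $known-sources[ends-with($value, concat(' ', .))]
return
    <source captured="{$value}" valid="{$exact}">{
        if (not($exact) and exists($suffix)) then
            <suggestion>{$suffix[1]}</suggestion>
        else ()
    }</source>
```

Values flagged as invalid can then be reported separately. The suffix match is only a heuristic for proposing a correction, so treat its suggestions as candidates rather than as fixes.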

Conclusion

With simple calculations, you produced small HTML reports of element and value scores on the result of a small text-to-XML conversion. You successfully used these reports to analyze the quality of this conversion, spot drop-outs, look at value distributions, and find anomalies. This scoring technique can be a useful tool for quality analysis of conversion results.


Downloads

Description: Source file for this tip
Name: data-scoring.xqy.zip
Size: 2KB
Download method: HTTP



About the author

Geert Josten

Geert Josten has been a content engineer at Daidalos for nearly 10 years, applying his knowledge of XSLT and XQuery as well as other, proprietary transformation languages. He also works as a Web and Java developer at Daidalos and has consulted for dozens of customers in a wide range of areas.


Trademarks

IBM, the IBM logo, ibm.com, DB2, developerWorks, Lotus, Rational, Tivoli, and WebSphere are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. See the current list of IBM trademarks.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
