XML and content strategy
Why and how to “future-proof” your content

SPi Global, 2807 North Parham Road, Suite 350, Richmond, VA 23294 | T 1 804 262 4219 | www.spi-global.com

Publishers and other information providers increasingly use multiple media to deliver their content for various applications. Books become e-books, journal articles are published online first and in print later, and figures are aggregated into image databases. Users request chunks of content, or the publisher assembles pieces of content from multiple publications into a new publication. Information users want what they want, when they want it, in the form they want it. As publishers work to respond to the changing needs of their constituencies, the challenge is: how can publishers “future-proof” their content?

Even today, content takes many forms and has many uses. Publishers find that they need to adapt their content in various ways (figure 1).

If today’s situation is not complicated enough, the future is likely to be even more complex. How can information providers respond to the changing needs of customers and new technologies with greater facility in terms of time, cost, and effort?


Figure 1: Sampling of data output
• Web-ready HTML on proprietary platforms
• HTML for web previewing
• PDF for printing/viewing/downloading
• Distribution by third-party aggregators (Ovid, EBSCO)
• Abstract & indexing services (Scopus)
• Mobile devices (iPad, smartphones)
• Archival solutions (Portico)


By far the most practical and versatile tool for manipulating content, for both current formats and formats not yet invented, is XML. XML (eXtensible Markup Language) is an open standard; its power derives from the fact that it has been adopted by entire industries, many government agencies, and platform developers. When new standards emerge, such as EPUB for e-book readers, they are derived from generic XML, allowing even files created years before a standard existed to flow readily into it.

Some organizations think of XML as simply a different set of tags. While XML tags do differ from those used in other systems such as SGML or HTML, XML is really a different way of thinking about content. Karen Colson, director of publishing and communications at the Association for Research in Vision and Ophthalmology (ARVO), explains it simply:

XML describes content, not appearance.

An XML tag (actually, a pair of tags—one at the beginning and one at the end of an element) might indicate that a section of copy is a first-level heading inside a book chapter. The actual appearance of the heading, however, is determined by a different style sheet for each application. The typeface and size that appears in the book might be completely different if the book is available on an e-reader, and it might be different still if the book is included on the electronic platform of a third-party aggregator.
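For example, a first-level heading inside a book chapter might be tagged along the following lines (a hypothetical sketch; the element names and sample titles are invented for illustration, and the actual names depend on the DTD in use):

    <chapter>
      <title>Principles of Content Reuse</title>
      <sec level="1">
        <title>Why Structure Matters</title>
        <p>Body text of the section goes here.</p>
      </sec>
    </chapter>

Nothing in the markup says “18-point bold.” Each output channel applies its own style sheet to decide how a first-level title is rendered.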


The tag for a first-level heading can also function as metadata. For instance, a book’s table of contents might be constructed by copying chapter titles and first-level headings. Or an aggregator’s general search function might look primarily at first-level headings. In either case, a pair of tags that starts out identifying a heading can have multiple programmatic applications as well.
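As a hypothetical sketch of that programmatic reuse, an XSLT style sheet could walk the same markup (assuming the chapter/sec/title structure shown earlier) and emit a table of contents:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="yes"/>

      <!-- Build a table of contents from chapter titles and first-level headings -->
      <xsl:template match="/">
        <toc>
          <xsl:for-each select="//chapter">
            <toc-entry><xsl:value-of select="title"/></toc-entry>
            <xsl:for-each select="sec[@level='1']">
              <toc-subentry><xsl:value-of select="title"/></toc-subentry>
            </xsl:for-each>
          </xsl:for-each>
        </toc>
      </xsl:template>
    </xsl:stylesheet>

The same tags could just as easily feed an aggregator’s search index; the markup is written once, and the applications vary.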

Organizations that want to get the most out of XML apply it consistently and as early as possible in the content development process. When this happens, editing changes are captured within a single, authoritative XML file, all XML files are built according to the same rules, and the final XML file is the source for all types of output. Creating this capability requires thoughtful planning and technically astute implementation.

Planning for an end-to-end XML workflow

The most reliable and powerful way to apply XML to documents is to do so at the very beginning of the production cycle. In organizations where content is created by employees, the content creator may enter tags, often using shortcuts or templates. For most publishers, however, tags are applied by skilled markup operators based on the list of tags available to them (more on this below). Most markup operators work for compositors, so their function sometimes overlaps with typesetting. But markup is a distinct function in the production process. Once the tags are applied, production can proceed (figure 2).

When an error occurs, the correction is made in the native XML file so that the error can be corrected in every product that flows from the content. Making corrections in the native XML file represents the industry’s best practice, but practical challenges exist even with this approach.

Julia Sawabini, director of e-commerce at Elsevier, explains that to build the web page for a particular product, Elsevier pulls content from a database containing fields variously populated by editorial, production, and marketing people. The information is organized via style sheets but no content is created at this point. “If there’s something wrong on the website, it’s wrong someplace along the way. I can’t change it.”

Once a correction is made, the change may not appear immediately, as the website is updated in batches at specified intervals. The incorrect product information will appear on the site until the update takes place. Also, the incorrect material will remain on the servers of distributors, e-bookstores, and other outlets for the information unless corrected files are sent and uploaded.

An analogous challenge occurs in publishing printed materials. Sometimes a production person spots an error while processing a PDF for the printer. The temptation, and often the reality, is that the production person corrects the PDF and sends it on to the printer, breathing a sigh of relief. Unless the production manager remembers to go back to make the same correction, the error still exists in the XML file.

Implicit in this discussion is the notion that an XML workflow includes an element that is rarely critical in a single-medium product—what Phil Schafer, director of journal production at Elsevier, describes as “a central content repository with full functionality.” It is not enough to save all content to a particular server. Ideally, the content flows into a database-like structure that enables the owner or other authorized users to find specific content and manipulate it for specific publishing applications.


Figure 2: Production process using XML
XML markup → Copyediting → Typesetting → Page layout → Proofreading → Content repository → Multiple outputs


Content repositories can be critical in highly regulated areas such as medicine. Larry McGrew, head of content and editorial operations at Aetna, relies on multiple content management systems, stocked with carefully approved material, to populate the Aetna sites that are central to members’ experience. Data in those content management systems are heavily tagged with metadata so that users get optimal search results despite the multiple original sources of the material. McGrew admits that this has been “extremely challenging” to implement.

The DTD

The Document Type Definition (DTD)—the very rough equivalent of type specifications for print products—specifies both how an element will look in print, on the web, on e-book readers, etc., and, to some extent, what the element means. DTDs need to code both data and metadata.

To explain how a DTD functions, consider the different tagging possibilities for genus and species, depending on the medium and application. For instance, we assume that readers of this white paper belong to the species Homo sapiens. It is probably sufficient, therefore, to surround Homo sapiens with XML tags that mean “put these words in italics no matter what other appearance specifications you have.” But in a zoology book, you might want to put each genus/species into the index. In that case, you could put tags around the phrase Homo sapiens that indicate “these words are genus and species; put them in italics, and remember to make an index entry for this term.” In an anthropology book, you might want to distinguish between Homo sapiens and other species such as Homo erectus, and treat both species as index sub-entries under the genus Homo. In that case, you’d put a pair of tags around Homo indicating “this is a genus,” and a tag around either sapiens or erectus indicating “this is a species.” Instructions for constructing the index would complete the picture.
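A minimal, hypothetical sketch of the richer treatment (the element and attribute names here are invented for illustration, not taken from any published DTD) might pair a DTD fragment with an instance like this:

    <!-- DTD fragment: genus and species captured as separate elements -->
    <!ELEMENT taxon   (genus, species)>
    <!ELEMENT genus   (#PCDATA)>
    <!ELEMENT species (#PCDATA)>
    <!ATTLIST taxon index (yes | no) "yes">

    <!-- In the content itself -->
    <p>Fossil evidence distinguishes
    <taxon><genus>Homo</genus> <species>sapiens</species></taxon> from
    <taxon><genus>Homo</genus> <species>erectus</species></taxon>.</p>

The italics and the index entries come from downstream processing keyed to these tags, not from anything in the markup itself.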

The prose explanation above took nearly 200 words to cover how to treat genus and species in a DTD. Multiply that by the many editorial, functional, design, and marketing considerations in any one publication, and then multiply it again by the range of publications you hope to represent with a single DTD. The considerations become massive, and the temptation might be to skimp on the detail of the DTD (for instance, coding genus and species together rather than separately). That would likely be a false economy, though. Nina Chang, senior publisher of online journals at Lippincott Williams & Wilkins (LWW), points out:

Richly tagged data allow for more precise searching.

In STM and scholarly publishing, searchers want to retrieve the information that really matters, so the detail of the DTD is important to the perception of quality. It’s helpful to refine the DTD as much as possible before implementation.


As Schafer points out, “If we choose to introduce a new element, we have to take it to a supplier support data team to ensure that it’s implemented across all of our journals.” And Chang of LWW notes that changing the DTD has implications for archival data as well. For instance, do you go back and insert new tags to bring older material up to the functionality of new material? This requires a business decision: what are the changes worth to the users, compared with the inevitable costs?


At large publishing organizations, developing a sufficiently powerful and flexible DTD is a challenge. As we discussed earlier, it is not enough to catalog all of the type specifications that might be needed. A team building the DTD also needs to consider which kinds of information to define and to what degree of detail, and to define the metadata required for the organization’s own use and for the use of current and future third parties.

One approach is to start with a DTD that is already in the public domain. For instance, Colson of ARVO has twice used the DTD developed by the National Library of Medicine as the basis for an organizational DTD:

[The DTD from the National Library of Medicine] is comprehensive—it works for books, Annual Meeting abstracts, and all of our other publications.

Colson used this DTD even when she worked at the American Geophysical Union (AGU), whose content had little if any relationship to medicine, because the structure worked effectively for other types of scholarly content.
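To give a sense of what such a public DTD provides, here is a highly simplified skeleton of the kind of article structure the NLM tag suite (now maintained as JATS) defines; a real instance requires considerably more metadata, and the title and author shown are placeholders:

    <article article-type="research-article">
      <front>
        <article-meta>
          <title-group>
            <article-title>Retinal Imaging in Early Glaucoma</article-title>
          </title-group>
          <contrib-group>
            <contrib contrib-type="author">
              <name><surname>Doe</surname><given-names>Jane</given-names></name>
            </contrib>
          </contrib-group>
          <abstract><p>Summary of the study.</p></abstract>
        </article-meta>
      </front>
      <body>
        <sec><title>Methods</title><p>How the study was conducted.</p></sec>
      </body>
    </article>

Because the tag set describes generic scholarly structure (titles, contributors, abstracts, sections) rather than medical content, it transfers readily to fields such as geophysics.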

Another approach is to contract with a trusted vendor. Vendors that have developed and worked with DTDs in the past have a pragmatic knowledge of what works well for their customers, and they have staff with the backgrounds to steer skillfully through the complexities. Outside vendors can do this future-oriented work, freeing up in-house staff to manage day-to-day operations. And a good outside vendor can also help train staff to understand the new DTD and/or a new, XML-oriented workflow.

A large proportion of scholarly journals, with their tightly structured, relatively brief units of copy, have migrated with reasonable success to XML. Books have been harder because they are more varied, and book authors often don’t face the kind of career-oriented pressures that impel authors of journal articles to accept such constraints. Still, over time, elementary/high school and higher education publishers have begun to implement DTDs, which in turn offer them flexibility. Not only can they put content on multiple platforms to meet student and school district needs, but they can also customize the content of publications. This may be one reason why most educational publishers seem fairly confident of their ability to meet the idiosyncratic social science requirements of their largest single market (i.e., the Texas School Board) while continuing to publish their books for the rest of the country.

Custom publishers have also found XML to be an invaluable asset to their business, as the case study below shows.

ONIX: A specialized DTD for book metadata

For people in the publishing industry, ONIX (ONline Information eXchange) is perhaps the most familiar example of a DTD for metadata.

ONIX is used extensively in the book trade as a standardized means of communicating information about books—from author and title to weight per copy, minimum order quantity, subject classification, and so forth. These data then populate everything from the publisher’s own Website (for instance, the one maintained by Elsevier’s Sawabini) to industry giants such as Amazon and Barnes & Noble.
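A heavily abridged sketch, loosely modeled on ONIX 3.0 reference tags, gives the flavor; a conforming record requires many more elements, most values come from ONIX controlled code lists, and the ISBN and names below are placeholders:

    <Product>
      <RecordReference>com.examplepress.9780000000001</RecordReference>
      <ProductIdentifier>
        <ProductIDType>15</ProductIDType> <!-- code for ISBN-13 -->
        <IDValue>9780000000001</IDValue>
      </ProductIdentifier>
      <DescriptiveDetail>
        <TitleDetail>
          <TitleType>01</TitleType>
          <TitleElement>
            <TitleElementLevel>01</TitleElementLevel>
            <TitleText>XML and Content Strategy</TitleText>
          </TitleElement>
        </TitleDetail>
        <Contributor>
          <ContributorRole>A01</ContributorRole> <!-- code for author -->
          <PersonName>Jane Doe</PersonName>
        </Contributor>
      </DescriptiveDetail>
    </Product>

Retailers and the publisher’s own site read the same record and render it however their platforms require.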


Case Study

Triangle Publishing Services, Inc., prepares publications for technology companies like Microsoft, Cisco, and Hewlett-Packard. In some cases, Triangle has prepared all the content in a book so that it can be repurposed.

For example, a book with chapters on applications in a dozen different industries can be disaggregated into a dozen different white papers for distribution online. Or, by searching on XML tags, the book’s case studies can be extracted and used in other settings.
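A hypothetical sketch of that kind of tag-based extraction, assuming the book’s case studies are marked up as case-study elements inside chapters (names invented for illustration):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Collect every tagged case study from the book file;
           each copied unit can then be restyled as a standalone piece. -->
      <xsl:template match="/">
        <extracted-case-studies>
          <xsl:copy-of select="//chapter//case-study"/>
        </extracted-case-studies>
      </xsl:template>
    </xsl:stylesheet>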

Larry Marion, CEO and Editorial Director at Triangle, says this about taking advantage of the power of XML:

Think about how you want to repurpose content; be as creative and granular as possible. Extra work at the beginning can save you pain down the road.


In fact, to see how XML describes types of content rather than their appearance, look at the display of any particular title on Amazon and then on Barnes & Noble. Author, title, publisher’s description, and the like look entirely different, yet they contain precisely the same information.

Other industries and disciplines have their own specialized metadata sets, as well.

Implementation

In some parallel universe, management might be able to send out a memo one Friday afternoon announcing a new production workflow that starts the following Monday morning. In this world, however, it isn’t that simple. Employees may need to perform different tasks, or they may perform the same tasks in a different sequence. Managers need to assess performance using different metrics. Suppliers need to accept input that looks different and generate different kinds of output, with possible changes in schedules, prices, and quality management. For a publisher, all of this needs to take place while products already in the pipeline move through the previous workflow, or through some hybrid of old and new.

Building capacity for end-to-end XML requires an organization to commit staff resources, time on the calendar, and financial resources. Realistically, not every publisher can muster all three kinds of resources conveniently.

XML on the fly

Sometimes an information provider needs to produce XML hastily. For instance, a content provider may be switching publishers, may wish to digitize backfile content, or may be starting to work with a new third-party aggregator.

In these situations, publishers need to convert existing data. Data conversions are typically done by production vendors, with their in-depth knowledge of publishing workflows and outputs. With typesetting files in hand, a conversion vendor can read the typesetting codes (for instance, “Heading 1”) and change them to XML tags, for the most part programmatically. The programmatic approach, however, can miss or misinterpret improvised or last-minute changes. If, for example, someone sees at the last minute that a “1” head really should have been a “2” head, that person might not change the typesetting code but might simply alter the type characteristics to look like a “2” head. The XML coding will continue to treat the heading as a “1” head, with potential implications for the quality of applications such as web display, search, and the like.
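A hypothetical sketch of such a style-to-tag conversion, assuming Word-format typesetting files (WordprocessingML); real conversions handle far more structures, but the key point is that the mapping is driven entirely by the style code, not by how the paragraph was made to look:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

      <!-- A paragraph coded "Heading1" becomes a first-level heading,
           even if it was manually reformatted to look like a second-level one. -->
      <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='Heading1']">
        <heading level="1"><xsl:value-of select="."/></heading>
      </xsl:template>

      <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='Heading2']">
        <heading level="2"><xsl:value-of select="."/></heading>
      </xsl:template>

      <!-- Everything else becomes an ordinary paragraph -->
      <xsl:template match="w:p">
        <p><xsl:value-of select="."/></p>
      </xsl:template>
    </xsl:stylesheet>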

Similarly, links to tables and illustrations might or might not be captured. Another challenge is that conversions may not capture important metadata (“this is a chapter, not a scholarly paper”) because the metadata simply don’t exist in the original material. Either the original publisher provides the metadata retrospectively, or the new party provides it using its best, potentially fallible judgment.

Another approach is to leave file conversions to the aggregator, e-book platform, or other party that wants to use the data. These companies typically do a good job of ensuring that the XML they generate is effective for their own application, but if another vendor approaches the publisher, the process needs to be repeated, at the cost of more money and more time.

Time for XML?

For the foreseeable future, information is going to flow into and through multiple platforms— from books, magazines, and newspapers to websites, e-book readers, mobile devices, and inventions that are only sketches on a white board right now. Authorities agree that XML provides the most effective way to cope with the multiple and shifting demands. Colson of ARVO says it well:

Don’t be afraid of XML. Using XML will give you more versatility than any scheme I’m aware of.



The Authors

Rich Lampert
The Lampert Consultancy
www.lampert-consultancy.net

Rich Lampert is owner of The Lampert Consultancy, LLC, established in 2004 to provide strategic, editorial, and marketing services to publishers in STM, professional, and scholarly publishing. Rich is also Principal, Publishing Services Division, at Doody Enterprises, Inc., which focuses on not-for-profit publishers.

Cara Kaufman
Kaufman-Wills Group
www.kaufmanwills.com

Cara Kaufman is co-founder of Kaufman-Wills Group, LLC, which was created in 2000 to offer STM and other scholarly publishers a full range of professional publishing services in the areas of strategic planning, business development, electronic publishing strategy, RFP and self-publishing projects, editorial services, and marketing and market research.

SPi sought the help of Kaufman-Wills Group in developing this white paper.

The Contributors

Special thanks to the following individual contributors:

• Nina Chang, Senior Publisher, Online Journals, Lippincott Williams & Wilkins
• Karen Colson, Director, Publishing and Communications, Association for Research in Vision and Ophthalmology
• Mark Gaertner, Senior Web Producer, Team Lead, BMStudio at Bristol-Myers Squibb
• Larry Marion, CEO/Editor-in-Chief, Triangle Publishing Services
• Larry McGrew, Head, Content/Editorial Operations, Aetna
• Julia Sawabini, Web Marketing Director, Elsevier
• Phil Schafer, Director, Journal Production, Elsevier