Essays: Taxonomy, Terminology and Juan Sager

A set of essays written in 2002/2003 on how a taxonomy "works" - its four functional elements - and how the discipline of terminology, as described by Juan Sager, can be integrated into taxonomy development.



© rennie walker 2002, 2003, 2010 www.renniewalker.com


TABLE OF CONTENTS

Introduction
A Model of Taxonomy and its 4 Functional Elements
A Model of Communication
A Model of Special Languages – Thing, Definition, Term
An OntoLexical Layers Model
Taking the Notion – An Introduction to the Set of Perspectives on Paradigms
February 2003 - A Brief History of Dictionaries: Meanings and Their Words
May 2003 - A Brief History of Newspapers: All the News that's Fit to Find
Review Essays - Introduction
A Practical Course in Terminology Processing by Juan C. Sager. Part 1: Why Terminology?
A Practical Course in Terminology Processing by Juan C. Sager. Part 2: Models Let us Devise Methodologies
About
Affirmation


Introduction to Essays on Taxonomy, Terminology and the Approach of Juan Sager

History of These Essays

In 2002 and 2003, after I had left Sageware Inc., an early thought leader in the field of content categorization by software into rules-based taxonomies, I wrote a short series of essays and published them on my then website semanthink.com. semanthink.com is long gone. (There is a good archive of semanthink.com in the Wayback Machine – but without the images; more on that later.) To this day I still occasionally get asked about one or other of these essays, or about themes and points of interest covered in them. So, I have decided to re-publish them "permanently".

They were fun to write. I was working on exciting engagements, and there was then, in 2002, much less guidance of a holistic theoretical nature which wove together all of what I considered to be the relevant paradigms affecting how we should think about, and thus how we should approach analyzing, business problems of content description, modeling of subject domains etc. No map … some map fragments … and some fragments that most definitely should be part of the overall map. So, I wrote my own map! And this map became important, indispensable, to my work. My work in solving business problems surfacing at various points in the content supply chain was theoretically based in large part on the kinds of ideas contained in these essays (and also in some other essays that just never made it to completion due to work pressures).

When I re-read them today, I still "like" them. And my current work as a consultant is still to an extent predicated on these ideas, even though the information ecosystem today is overlay upon overlay of applications and content types that did not exist back in 2002/2003. Their durability, to me, comes from the fact that I attempted to write at an abstract level, at a level of first principles – with a strong view towards business analysis and project design. However, they are old. They were never written, truly, with an eye and mind to "publishing" – they were "just web content" to me, then. So … Caveat Emptor … Happy Trails … and enjoy.

Editorial Changes

Many of the original essays had graphics and witty (at least to me) comix interspersed through the content to make a point. Alas, the originals of these are lost to me, and as mentioned, they are not available in the Wayback Machine archive. So these are omitted, along with any references in the text to them.

On re-reading the essays and re-formatting the content, I was tempted to make some minor changes. It's a natural reaction in a way. But I resisted and have made no editorial changes of any kind. However, I did do my best to re-find URLs referenced in the footnotes where a URL had changed. Mostly I succeeded in this.

The Essays Themselves

The essays are of a couple of different types.


I was then, and still am today, forceful about the need for us to have models that we can use as the basis for business analysis and data gathering to fulfill business requirements. So, for example, several of these essays present models for conceptual understanding and as the basis of analysis – e.g. the model of taxonomy (the functional elements of taxonomy) that I wrote for myself, and the model of communication that I abstracted from the work of the terminology thought leader Juan Sager.

Some others were part of a series that I called “Perspectives on Paradigms”. These were short, high-level and strategic thought pieces about paradigms that were changing, before my very eyes, in the world that I worked in. A couple of these are included because the themes, which cover the history of the particular paradigms, e.g. newspapers, dictionaries, are still intrinsically interesting. Some of the others don’t wear so well over time since the paradigm change has been resolved – these are omitted.

There was also a series of review essays on key books that were informing my work. Only one of these review essays was completed – a review of “A Practical Course in Terminology Processing” by Juan Sager, which had a profound impact on how I strategized, communicated to teams, wrote business requirements and approached gathering data to solve business problems. Several other review essays were begun (and nearly finished), but these are not included.

semanthink.com

The color scheme used here is the original semanthink.com color scheme. ☺ I've also included the content item about my approach which was in the About directory of the site – it's old, but accurately reflects how I described myself back then. It is included for completeness' sake only.

rennie walker

clayton, california

june 2010


A Model of Taxonomy and its 4 Functional Elements

Understanding how Taxonomies do what they Do

The primary purpose of a taxonomy is to represent a domain. It does this by defining all the concepts of the domain (i.e. giving boundaries) and relating the concepts to each other. The representation is always in service of a particular user community. In fact, the representation is how this community "sees" this part of the world, and aligns with how this community does their "work" in this part of the world. The term "represent" is key. Just as any map is not the territory itself, so any taxonomy is not the domain itself. The concept of taxonomy is not expressly or intrinsically linked to a purpose of "organizing" documents (or "leading to" documents). This independence from any documentary purpose is shared by ontologies; classification schemes developed within the library science world, such as Dewey and UDC, by contrast, are expressly purposed to organize information resources. 1

Of course, taxonomies are extremely powerful tools for organizing content. But that is because they have been designed to model a user domain, not because they are designed to capture content characteristics. Documents that include discussion of domain concepts can be associated with nodes from the taxonomy model, and so, using the communication tool of taxonomy, we can organize documents for a user community. But in fact, what we are really doing is using the taxonomy as a specification of the concepts that need to be recognized (by an application or through editorial workflow). Successful subject recognition then allows the recognized subjects to be arranged in the map of the taxonomy that the user community requires. The fact that these discussions of subjects are wrapped in documents is purely artifactual.

There is often a great deal of semantic variance around the concept of "taxonomy". We'll just briefly touch on this here, since the foreground task is to isolate functionality. But we'll note in passing that while the word "taxonomy" is often used as a technical term, the model of special languages tells us that terms have consensual definitions in technical communities. And while Alice did say "... at least I mean what I say - that's the same thing (as saying what I mean), you know" during her adventures in Wonderland, she's both right and wrong. Poor Alice! She's right in that the model of special languages sees terms and definitions as interchangeable. But she's wrong the moment she steps outside the context of special subjects and the special languages that are used by communities to communicate their special subject interests.

How Taxonomies "Work": Their Functional Space and Functional Elements

On the surface, taxonomies seem to be simple tools. Here is a general statement: the better we understand how any tool is "supposed" to work, the better we can design it. Taxonomies are no exception. The more we understand what taxonomies do, and how, the better we can manage project teams to design, apply and maintain them. There are two sets of parameters that we need to carry foreground, at all times, about taxonomies. One set of parameters gives us information about the "functional space" within which taxonomies operate. The other set of parameters is the set of individual functional elements that do the "work", i.e. carry the information that makes the taxonomy useful (and usable) to users. The functional space describes the set of parameters within which the taxonomy does its work. The set of functional elements describes the set of parameters by which the taxonomy does its work.

Defining Taxonomy Usability

These two sets of parameters are not decoupled, either in design or in use. The functional space is where the taxonomy operates as a communication tool. The individual functional elements are the elements that actually carry the information that needs to be communicated. Taxonomy usability, in a nutshell, is about measuring the information that a taxonomy carries while it performs as a communication tool for a particular user community.

Understanding Taxonomies: Their Functional Space

Taxonomies function on different levels and in different ways. These "levels" and "ways" are the functional space within which the taxonomy does its work. In the early stages of designing information access projects a number of important characteristics of taxonomies, at the level of being a tool, are often overlooked completely. Every taxonomy designed to be used within a total content management/information access environment is a -

- Cognitive tool
- Communication tool
- Culturally predicated tool
- Model (and a way, or template, to model)
- Special language variant
- Terminology

Taxonomies are cognitive tools. They are cognition-facing. A taxonomy is a cognitive tool in the sense that it speaks to the intrinsic component of human cognition in each of us whose purpose is classificatory. A taxonomy offers one arrangement, plus a set of definitions, of a conceptual domain that is useful to at least one user community. This representation aligns with the result of the intrinsic human classification function, which has in turn been socialized within that technical community. As designers, it is our task to elicit the shared conceptual map of the community of practice and re-engineer it into a taxonomy.

Taxonomies are communication tools. Within any model of communication there are senders of communication and recipients, and the purpose of the sender is to change the knowledge state of the recipient. Taxonomies play a variety of roles as a communication tool. Taxonomies communicate what is related to what. They communicate the definitions of the concepts. They communicate using the language of the community of practice. Etc. Taxonomies play a role in users changing their state of knowledge when they are seeking information in an information access environment such as an enterprise intranet. Any model of communication should be able to explain the parameters of what degrades effective communication. In a very general sense, this whole article is about what can degrade taxonomic communication.

Taxonomies are also tools that reflect a "cultural" view of domains. Every user community in every technical domain is a culture, whether it is capital markets, insurance or manufacturing in-flight entertainment systems. This culture manifests itself in its desired attribution of domain scope and granularity, the selection of terminology to label concepts, and the definitions of the concepts (which can in turn, depending on the definition, impact preferred relationships). And so on.

Taxonomies are models. They are also ways to model. To build a taxonomy is to follow one way of building particular models that serve particular purposes. In our case we want ways to model sets of concepts, so we can apply subject recognition of those concepts to sets of documents, and so, finally, give knowledge workers access to documents, based upon the known (i.e. recognized) subjects that they discuss.

Any taxonomy is a particular typed variant of a special language. Special languages are technical communication tools. The lexicon of any special language substitutes for the full definitions of the concepts of the particular domain and additionally carries information about the organization of the discipline. Taxonomies are closely aligned with the model of special languages. They achieve this by formalizing the elements of any special language that they include. If a taxonomy includes any elements of a special language, then the element in the taxonomy carries (and has to carry) the same definition as in that special language. It has to, otherwise the community of users will have two definitions for one concept. When a taxonomy includes elements of a special language it also has to take and carry the same relationships as the terms that it takes.

Taxonomies are terminologies. Remove the definitions and the relational information and we have a technical lexicon.

These six characteristics of the functional space are all integrated into designing the technical elements of the taxonomy and their components.

Understanding Taxonomies: Their Key Functional Features

Technically, a taxonomy is an intersection of four different sets of data. These are the four functional elements of any taxonomy -

- Domain Scope
- Concept Definitions
- Concept Relationships
- Concept Labels
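Here, purely as an illustration (the class and field names are our own hypothetical choices, not part of any product or of the original essays), is a minimal Python sketch of how these four sets of data hang together:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaxonomyNode:
    """One concept, carrying three of the four elements on the node itself."""
    label: str                     # Concept Label: the community's own term
    definition: str                # Concept Definition: explicit and agreed
    parent: Optional["TaxonomyNode"] = None   # Concept Relationship: "kind of"
    children: list["TaxonomyNode"] = field(default_factory=list)

    def add_child(self, child: "TaxonomyNode") -> "TaxonomyNode":
        """Attach a child: every child is a "kind of" its parent."""
        child.parent = self
        self.children.append(child)
        return child

@dataclass
class Taxonomy:
    """The fourth element, Domain Scope, belongs to the node set as a whole."""
    domain: str
    root: TaxonomyNode

    def scope(self) -> list[str]:
        """Enumerate the Domain Scope: every concept the taxonomy includes."""
        labels, stack = [], [self.root]
        while stack:
            node = stack.pop()
            labels.append(node.label)
            stack.extend(node.children)
        return labels
```

The design point of the sketch: label, definition and relationships live on each node, while scope is a property of the node set as a whole.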

It is these four functional elements that "do the work" of a taxonomy being a communication tool. These four sets of data have to be collected in any taxonomy design project for there to be any "taxonomy design".

Understanding Taxonomies: The Functional Feature of Domain Scope

The set of concepts that comprises the taxonomy is the scope of the taxonomy. Scope can be thought of as having two aspects -

- Extent (its boundaries)
- Granularity

Both of these are equally important in designing taxonomies that are usable, and so successful in business terms and in terms of project completion. They are like two dimensions of the conceptual space they aim to model - a dimension of extent and a dimension of depth.

One part of the scope is the extent of the domain; its "boundaries" in "conceptual space". At a pragmatic level these boundaries are defined in a binary way: "is this concept in or out?" These binary decisions are always in the service of user requirements. In the very early stages of taxonomy design in particular, sets of these kinds of binary decisions get queued and are dealt with in the methodology of the project - for example, by methods of designated expert review, which is less resource intensive than involving numbers of the user community.

The second part of the scope is to discover and designate the level of granularity of the taxonomy: how 'discrete' are our concepts to be? We can formalize this design task as: when does a child node become a leaf node, the end of the branch? Are these leaf nodes granular enough? Or do they bundle "kinds of" that the users require to be disaggregated?

These two aspects of taxonomy scope are orthogonal in terms of the workflow of the information access design project, at least to a substantial extent. We can certainly define the boundaries of the domain and move on, if project milestones or time management issues are pivotal, and deal with issues of granularity later. But we cannot really settle on the level of granularity and then move on, because of the lingering question: "move on to what exactly?" The extent of the domain is still undefined. Both aspects of scope are always iterative in the design stages. And both can be, and should be expected to be, iterative in the ongoing maintenance stages after the taxonomy is implemented.

In its function as a cognitive tool, the taxonomy scope tells the user community what things, and their kinds, are to be found. In its function as a communication tool, the feature of scope communicates boundary - what is to be found and what is not, and at what level of granularity. In its function as a cultural translation, the feature of scope is a mapping of the user community's view of the domain or world. Scope is language-independent, so it plays no role as a variety of special language, nor does scope indicate terminology.

Understanding Taxonomies: The Functional Feature of Node Relationships

Each concept within the taxonomy is one node in a set of relationships. There are two kinds of relationships -

- Subsumption
- Siblingship

The subsumption relationship binds the parent and child together. The relationship between siblings of the same parent - what makes each child a child, and what differentiates each child from the other siblings - is siblingship.

The subsumption relationship is one of "kind of". Any child is a "kind of" its parent. For example, Pinot Noir is a kind of grape (it is also a kind of wine). And a grape is a kind of … And a Persian cat (like our pet cat Baby Cakes) is a kind of cat (Felis catus). And Felis catus is a variety (or "kind of") of ... This "kind of" relationship is one of the key definitional parameters of any taxonomy. This categorically differentiates taxonomy from classification. If the representation is not based on the "kind of" relationship being the sole subsumption relationship, then we do not have a taxonomy. We have some other kind of representation, which can be both highly required and utilitarian, but it is not a taxonomy.

"Kind of" is user defined. This can sound strange at first. In fact "kinds" are always defined to meet user needs. (Because we are not in the business of seeking out abstract Platonic ideals, but engineer and deliver usable cognitive maps.) "Kind of", from the point of view of the child, is technically defined as the intersection of what is the key characteristic of its parent together with what differentiates it from each of its siblings. The relationship between a node and its siblings is one of differentiation, or discrimination. Each of the siblings is a "kind of" the parent, but each is differentiated from the others by its position along some dimension that is the differentiating dimension. This basis for the differentiation is also user defined. This is a Linnaean model of taxonomy design. The model was originally designed (by Linnaeus) to type biology into classes.

Understanding Taxonomies: Node Relationships: an Example

"Kind of" typing has near universal applicability. Here is an example of typing events into a Linnaean hierarchy to show both the utility of the underlying model and the fact that the definition of "kind of" is user driven and not absolute. Business news carried by newswires and business newspapers is a description of the occurrence of events in the marketplace, and their impacts, implications and consequences. We can create a taxonomy of these events. Let us say we have a user community of asset managers and they want to taxonomize the universe of business events into a number of parents. This community decides that two of these parents are the set of events that impact/change a company's capital structure and the set of events that are changes in a company's behavior at the operational level. An IPO, a stock split and a bond issue all impact a company's capital structure. And a facility closure or a round of layoffs impacts a company at the operational level. For this user community we can create a taxonomy that looks something like this -

Marketplace Events
    ...
    Capital Structure Events
        Debt Capital Events
            ...
        Equity Capital Events
            IPOs
            Rights Issues
            Secondary Offerings
            Stock Buy-Backs
            Stock Splits
            Stock Options Plans Introductions & Revisions
            ...
    Operational Events
        ...
        Facilities Openings and Closures
            Facility Closures
            Facility Openings
        Labor Force Events
            Layoffs and Furloughs
            Short-Time Working Arrangements
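Continuing the Python sketch from earlier - and assuming its hypothetical TaxonomyNode and Taxonomy classes - the asset managers' fragment could be assembled like this. The definitions are abbreviated placeholders, not the community's real, agreed definitions:

```python
# Assembling a fragment of the example above with the earlier sketch.
events = TaxonomyNode("Marketplace Events",
                      "Events occurring in the marketplace, as carried in business news")
capital = events.add_child(TaxonomyNode(
    "Capital Structure Events",
    "Events that impact or change a company's capital structure"))
equity = capital.add_child(TaxonomyNode(
    "Equity Capital Events", "Capital structure events involving equity"))
for label in ("IPOs", "Rights Issues", "Secondary Offerings",
              "Stock Buy-Backs", "Stock Splits"):
    equity.add_child(TaxonomyNode(label, f"A kind of equity capital event: {label}"))

taxonomy = Taxonomy(domain="Business events, asset manager view", root=events)
print(taxonomy.scope())   # the Domain Scope of this fragment
```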

For a different user community practicing a different special subject we might very well arrange all the instances under the root of "Marketplace Events" quite differently. But we would still make it be a taxonomy, where the subsumption relationship is "kind of". And the set of relationships would reflect the special subject point of view (i.e. the cultural view) of this other user community.

Understanding Taxonomies: Node Relationships: Usability Vectors

This "kind of" relationship instantiates a usability vector that is often critical to user communities: the ability to browse from broad to narrower, or narrow to broader. This parameter is a semantic parameter - the semantics of moving from parent to child, or child to parent. If our users don't understand the semantics of this relationship, then they are not going to know exactly what they are moving to. This raises a whole set of issues around user expectations. Without the formality of the "kind of" relationship there can be no meaningfulness to traversing from broad to narrower, or narrow to broader. In fact, without the formality of "kind of" we (and our users) cannot move from broad to narrower, and vice versa. Sure, we can move from "something" to "something", but that is all we can say and know. The semantics of traversing a classification scheme (as we define it here) are very different from the semantics of traversing a Linnaean taxonomy. The semantics of traversing a classification scheme are "fuzzier". Ultimately, it is the semantics of traversing classification scheme relationships that allow us to differentiate classification from taxonomy in a formal sense.

Because we are native taxonomists in our psychology we understand what broad to narrower and narrow to broader means. We understand the cognitive linearity of it all. The allowance to do this is highly utilitarian for knowledge workers in seek mode working with information interfaces that represent conceptual spaces. But this relationship between "broad" and "narrow", and the utility to traverse conceptual space from broad to narrower, is completely dependent on the subsumption relationship that binds parent to child being of the one formal "kind of" type. If the subsumption relationship is not "kind of", then what does it mean to navigate the hierarchical tree - what are the semantics of this?

"Kind of" relationships are intuitively understood because we are cognitive beings. We group every thing in the world into kinds of things: trees, buildings, animals, movies etc. This means that any taxonomy, as a thing, is usually easily learnable/teachable with reference to those who need to know about them to build them and maintain them. This is because we all have the innate capacity to do this cognitive operation, and we do it both continuously and continually (as well as universally). Whilst all user communities, from our point of view as information access designers, are cultures (and so their point of view is "cultural"), and whilst the results of cognitive grouping into "kinds" are always cultural, the fact that we do group things into kinds is not cultural. 2

In its function as a cognitive tool, the set of taxonomy node relationships tells the user community how things, and their kinds, are grouped. In its function as a cognitive tool it also communicates the underlying principles of exactly what parameters relate and differentiate concepts. In its function as a communication tool, the feature of relationship kinds and their grouping principle is explicit, and removes any variance that there might be about where to find what. For it to function as a communication tool with high clarity, it must communicate clearly the semantics of the relationships. In its function as a cultural translation, the feature of relationships is a mapping of the user community's view of the domain or world of their special subject. The feature of node relationships is aligned with special languages. Special languages communicate relationships between concepts. The feature of node relationships is also aligned with a terminology point of view. Many special subjects have rules (sometimes formal) or heuristics that guide types of lexical generation to create terms so that the terms carry, i.e. indicate, the subsumption relationship.

Understanding Taxonomies: The Functional Feature of Definition

A step that is often overlooked in taxonomy design is the step of defining the nodes. Without definitions the taxonomy is incomplete and unusable (because how can we consistently, through either editorial workflow or a categorization application, associate documents with nodes if there is no definition of the node?). Pretty much every taxonomy we build includes substantial proportions of the special language that is used to communicate discourse about the domain. All the users of a special language know, and agree on, the definitions of the concepts. Of course, definitions evolve as domains evolve, but this gives rise to a maintenance issue and not a pure-play ontological issue. Because all users of a special language know their own language definitions it is our task as designers to follow their lead (and needs).

Definition of a concept directly impacts precision/recall measures from the user's point of view. Mismatches between the definition of a concept in the taxonomy and the definition of the same concept in the subject recognition application or workflow layer directly impact the user. Definitions must be explicit. If they are implicit, then who knows what? At a user experience level, if the definition of the concept is broader than the definition used by the user community then, with a perfectly working categorization platform or workflow, the user community will experience the document set attached to the taxonomy node as exhibiting low precision. The categorization topic (or workflow) will execute correctly and consistently return a set of documents that covers a broader range than the user requires. The converse also leads to a predictable outcome. If the definition of the concept is narrower than the definition used by the community of practice, the users will experience the document set attached to the taxonomy node as exhibiting low recall.

Definitions also impact user "satisfaction". If the user says "I know what I mean, and I name what I know", then the user in question is not going to want labels, which should be referencing the same definition as the user community, including or excluding other aspects of the definition. Just as precision/recall is the basis for a formal, non-emotional metric, so this dilution/distillation is the basis for an emotional metric.

With subject recognition applications that are lexically based (as opposed to statistically based) the definition is, in a very real sense, distinct from the code. In a semantic network model of subject recognition there will be sets of synonyms combined together plus vectors of weighting. In a rules-based approach there will be sets of lexical elements, again usually synonyms, plus the rules. But this code has to be shaped by the definition of the node - it does not shape the definition. Additionally, if we strip out everything from this code except the lexical elements themselves then we are not left with grammatical, syntactically useful or indeed semantically meaningful expressions that can pass for definitions. We need the definition so that we can generate sets of appropriate lexical elements and configure the rules of the application to give us the correct semantics of combining different sets of lexical elements.

In its function as a cognitive tool, the functional feature of definition tells precisely why any thing is this particular kind of thing. In its function as a communication tool, the feature of definition makes this set of defining rules explicit (so open to discussion and modification). In its function as a cultural translation, the feature of definition is a mapping of the user community's view of each of the things of its domain or world of focus. In its function as a special language variant, the definition will have a corresponding special language term - if the definition exists within the special subject.

Understanding Taxonomies: The Functional Feature of the Node Label

"Exactly so," said Alice.
"Then you should say what you mean," the March Hare went on.
"I do," Alice hastily replied; "at least - at least I mean what I say - that's the same thing, you know."
"Not the same thing a bit!" said the Hatter. "You might just as well say that 'I see what I eat' is the same thing as 'I eat what I see'!"
"You might just as well say," added the March Hare, "that 'I like what I get' is the same thing as 'I get what I like'!" 3

Poor Alice! Temporarily disoriented (and wouldn't we terminographers and taxonomy designers also be) after following the White Rabbit, she is muxing ip communication intent, labels and definitions all in one mad brew. Not her fault. But confusing facets is much worse, from a usability point of view, than mixing metaphors. Luckily for her (and us) the Mad Hatter (and surprisingly also the March Hare) appears to have attended at least one first class taxonomy workshop, and gives all evidence of having read Juan Sager's classic on terminology. He is certainly putting her straight!

The node labels are the names for the concepts. We directly incorporate two aspects of terminology work in assigning labels to taxonomy nodes. Firstly, taxonomies share with special languages the fact that the names substitute for the definitions. In special languages terms substitute for definitions through the property of special reference. In information access design work taxonomy designers rarely or never use the technical term (which we could borrow from "classical" terminology work) "special reference". Perhaps we could. Taxonomy node labels do carry the property of special reference. The design issue does get complicated when we are taking substantial proportions of the lexicon of special languages and incorporating them alongside that which is not. In this case, as the taxonomy becomes socialized and accepted we have actually created special reference. In a very real sense our taxonomy becomes an expression of a special language that we have created.

Secondly, special languages and terminology work pay a great deal of attention to creating lexical derivations and compounding. There are often many "rules" or heuristics within a special subject that as a set constrain a typology of forms of derivation and compounding that the practitioners of that special subject can use. This is particularly so in many of the sciences. Leveraging the ability to create lexical element types is partly formal in its application and partly based upon a set of heuristics. Lexical derivation, lexical compounding and shared lexical elements actively carry information that conveys the cognitive notion of relationship type. Here is an example of using terminology selection to drive level identification. These kinds of rules are common in any well-formed taxonomy and indicate (and should indicate) that we are in the genus-specific-varietal model of Linnaean type taxonomy -

Marketing Channels
    Advertising
        Newspaper Advertising
            Business Newspaper Advertising
            General Newspaper Advertising
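Purely as an illustration of how shared lexical elements can carry the subsumption relationship, here is a small heuristic sketch (our invention, not a rule from any special subject) that infers parentage from label compounding:

```python
from typing import Optional

# A toy heuristic: a label that ends with another label is taken as its child.
LABELS = ["Advertising", "Newspaper Advertising",
          "Business Newspaper Advertising", "General Newspaper Advertising"]

def infer_parent(label: str, labels: list[str]) -> Optional[str]:
    """Pick the longest other label that this label ends with."""
    candidates = [other for other in labels
                  if other != label and label.endswith(other)]
    return max(candidates, key=len) if candidates else None

for label in LABELS:
    print(f"{label!r} -> parent {infer_parent(label, LABELS)!r}")
# 'Advertising' -> parent None
# 'Newspaper Advertising' -> parent 'Advertising'
# 'Business Newspaper Advertising' -> parent 'Newspaper Advertising'
# 'General Newspaper Advertising' -> parent 'Newspaper Advertising'
```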

These two core communication properties of definition and shared special reference contribute to the success (or otherwise) of the taxonomy as a communication tool. These two properties are user-centric and user-critical. We can sum all this up by asserting that the taxonomy has to speak the same (special) language to the user community as it speaks to itself, and has to be designed to do this.

In its function as a cognitive tool, the functional feature of node label operates semantically. In its function as a communication tool, the feature of node label carries those semantics clearly and precisely. In its function as a cultural translation, we want to be talking the same special language as our user community; where we can, we use their rules of lexical derivation and compounding. Node labels align with the model of special languages and terminology theory. Node labels either are terms, or become terms because they have no choice.


Notes and References

1. For example, a particular "treatment" of a topic is implicit in the world of documents. The Universal Decimal Classification includes such document forms as bibliographies, dictionaries, "general reference works" and serial publications, whereas the concepts of any domain are independent of the notion of being communicated in any documentary form. For a brief introduction to the UDC see the summary classification at the UDC Consortium website.

2. For a high-level exposition of the cultural dimensions of categorizing things and topics see - Women, Fire and Dangerous Things: What Categories Reveal about the Mind / George Lakoff. - University of Chicago Press, 1990. and The Geography of Thought: How Asians and Westerners Think Differently … and Why / Richard Nisbett. - Free Press, 2003. Richard Nisbett is the Theodore M. Newcomb Distinguished University Professor at the University of Michigan. His personal website gives more context to his work. For an implicit alignment of anthropology with technically-based communities see - English Special Languages: Principles and Practice in Science and Technology / Juan Sager et al. - Oscar Brandstetter Verlag, 1980.

3. Both the Alice books are available online through Project Gutenberg. This quote comes from Chapter VII, A Mad Tea-Party, of "Alice's Adventures in Wonderland". Saying what is meant and the meaning of what is said are congruent in a special languages environment. (Remember, terms are substitutes for definitions.) We never want the presenting layers of knowledge organization to break this congruency.


A Model of Communication

A Focus on Communication

The semanthink point of view is that the enterprise intranet is both a holistic information access/delivery environment and a holistic communication tool. What makes the intranet function well as a communication tool are the ways it represents and organizes concepts and the ways it successfully recognizes what subjects documents discuss.

Ultimately, information finding happens because of good communication.

Information environments, such as enterprise intranets, are supposed to be communication environments. In fact, we would say that the environment is the communication tool. Intranets are communication tools. Communication tools need communication expertise to build them. As a communication tool, each intranet is built up from a number of different functional elements, each a tool in its own right.

A taxonomy, for instance, is a communication tool. What does it communicate? It communicates a representation of a domain. How does it communicate this? Any taxonomy is composed of four separate functional elements. These four functional elements together communicate the representation. Integrating any user community's special language (i.e. its terminology) into the taxonomies they need is a core requirement of taxonomy design. You can read more in the semanthink Model of Taxonomy essay in this set of essays.

Similarly, a content categorization application is a communication tool. Although it may not seem like that to the knowledge worker end user. Partly this is because the end results of its categorization functionality are usually delivered through intermediary tools such as search engines or browsable classification schemes. What do categorization applications communicate? They communicate the fact that a certain document discusses particular subjects. How do they do this? They recognize meanings, not words or strings. The core functional elements of lexically-based categorization platforms are subject recognition objects consisting of words and phrases.
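Here is a toy sketch of what such a subject recognition object might look like. The synonym sets, the threshold rule and the crude substring counting are invented for illustration; no real categorization platform is this simple:

```python
# A toy subject recognition object: synonym sets for a concept, plus a rule.
SUBJECT_LEXICON = {
    "Layoffs and Furloughs": {"layoff", "layoffs", "furlough", "downsizing",
                              "redundancies", "job cuts"},
    "Stock Splits": {"stock split", "share split", "split its stock"},
}

def recognize_subjects(document: str, min_hits: int = 2) -> list[str]:
    """Return subjects whose lexical evidence in the document meets the rule."""
    text = document.lower()
    recognized = []
    for subject, synonyms in SUBJECT_LEXICON.items():
        # Deliberately crude: substring counts stand in for real phrase matching.
        hits = sum(text.count(synonym) for synonym in synonyms)
        if hits >= min_hits:   # the "rule": enough evidence of the meaning
            recognized.append(subject)
    return recognized

print(recognize_subjects("The company announced job cuts; the layoffs affect 500 staff."))
# -> ['Layoffs and Furloughs']
```

The point of the sketch is the essay's point: the object recognizes a meaning through many surface words, so it has to be shaped by the node's definition.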

How Individuals in Business Communities Communicate

Every community of interest is technical: baseball fans, Americana music lovers, insurance salesforce personnel, research biologists, anthropologists, and investment bankers specializing in M&A advisory services are all technical communities and all technical cultures. Every technical community speaks in, and uses, a "special language". Because every business subject is "special". In fact, peer communities invent or generate their own special language. This development is largely "organic", but scientific communities can have formal rules and formal standardization procedures that constrain (in a positive way) the development of particular special languages. Special languages are built around sets of terminology. But they are "more" than the lexicon of the terminology.


The Business Requirement: Why Do We Want a Model like This?

This model of the communication dynamic is abstracted from Juan Sager's classic of terminology work 1. You can read about different aspects of it, and its wider context, in the semanthink Review Essay on Sager's book.

The model sets up the transaction of information in a context of purposes and motivations, using terminology as the medium. These purposes and motivations revolve around states of knowledge that can be changed in a number of different ways. As information access designers, our knowledge organization schemes need to reflect this typology of knowledge state changes.

The model is "formal", but science as a method is "formal". And so is business, in the sense that there are constraints imposed in pursuit of profitability. Formal is good! Formality gives every parameter its place and weight for the particular task at hand.

As information access designers, we need a model that tells us useful information about the "act" or event of communication. What kinds of information do we need to know about the event? We need a purpose for why people want to "know" or "find", and why people want to communicate or tell. In terms of users "wanting to know", they obviously have tasks to perform. But there is a meta level above the level of particular tasks. Similarly with the motivation to communicate.

Basically, what is the purpose of communication? We need to have a way to understand that users have states of knowledge. These knowledge states vary over time, and vary according to information "taken in" and processed. If knowledge states vary between peers, then they certainly vary between any individual user and the total information access environment as one gestalt. If users are in states of knowing that they want to change, one way or the other (usually not to get more ignorant, but more informed!), we need to know what parameters support that transformation and what inhibit or destroy it. So, we could usefully use a total model of communication that explicitly -

- makes a change of a personal knowledge state foreground and central
- defines a common typology of the ways personal knowledge states are changed
- exposes the requirements that make any information transaction effective, from the recipient's point of view
- factors in the self-evident truth that assumptions and pre-suppositions about information recipients exist, from the communicator, and that these should be dealt with

The Model: Communicator, Recipient and Knowledge States

This is how Sager describes his model.

"In a model of specialist communication we assume the existence of at least two specialists in the same discipline, who are jointly involved in a particular situation where the sender (speaker or writer) is motivated to transmit a linguistic message which concerns the topic of his choice and which he expects a recipient (reader or listener) to

  17

Page 18: Essays: Taxonomy, Terminology and Juan Sager

 

receive. We assume that the sender's motivation arises from a need or desire to affect in some way the current state of the knowledge of the recipient." [page 99]

So, we have actors, a sender and a recipient. We have motivations, a sender's motivation that is to inform, and a recipient's current non-optimal knowledge state about an aspect of the domain. Sager calls the sender's motivation the "intention … to transmit information which will have an effect on the current knowledge configuration of the intended recipient". [page 99]

Now we have the notion of a "knowledge configuration" or state of knowledge. This gives us three communication parameters so far: actors, intentions, knowledge states.

The knowledge state of the communication recipient can be changed in different ways. Here are some types of knowledge configuration change (there are others, but these ones serve well to give the flavor of it all). The sender's communication can -

- augment the recipient's knowledge
- confirm the recipient's knowledge
- modify the recipient's knowledge

The Four Parameters of the Model

We now have four parameters in play: actors, intentions, knowledge states, and ways the knowledge states can be changed. With the idealized parties in place, and the parameters in play, the communication event then happens. The "parts" of the model are now made to "work" by the participants.
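A minimal, illustrative Python rendering of these four parameters - our sketch, not Sager's; the set-based success tests are crude stand-ins for real knowledge configurations:

```python
from dataclasses import dataclass
from enum import Enum

class KnowledgeChange(Enum):
    """The ways a message can change a recipient's knowledge state."""
    AUGMENT = "augment"   # the recipient comes to know more
    CONFIRM = "confirm"   # the recipient learns that what they know is what there is
    MODIFY = "modify"     # the recipient revises something they knew

@dataclass
class CommunicationEvent:
    """Actors, intention and knowledge states - the four parameters in play."""
    sender: str
    recipient: str
    intention: KnowledgeChange
    state_before: set[str]   # a crude stand-in for a "knowledge configuration"
    state_after: set[str]

    def succeeded(self) -> bool:
        """Did the recipient's state change the way the sender intended?"""
        if self.intention is KnowledgeChange.AUGMENT:
            return self.state_after > self.state_before      # strictly more is known
        if self.intention is KnowledgeChange.CONFIRM:
            return self.state_after == self.state_before     # unchanged, now confirmed
        # MODIFY: something previously known has been revised or replaced
        return not (self.state_after >= self.state_before)

event = CommunicationEvent(
    sender="newsfeed", recipient="analyst", intention=KnowledgeChange.AUGMENT,
    state_before={"IPO announced"}, state_after={"IPO announced", "IPO priced"},
)
print(event.succeeded())   # True - the message augmented the analyst's knowledge
```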

"Basing himself on presuppositions about the recipient's knowledge and assumptions about his expectations, the sender decides his own intentions, and then selects items from his own store of knowledge, chooses a suitable language, encodes the items into a text and transmits the entire message toward the intended recipient ... Perfect communication can be said to have occurred when the recipient's state of knowledge after the reception of the text corresponds exactly to the sender's intention in originating the message." [page 100]

Enterprise intranets can be modeled as exactly conforming to all the elements of this quotation. There is a difference between augmentation (learning "more"), confirmation (knowing that what you know is what there is) and modification ("changing"). Sometimes these knowledge state changes can be related to specific document types - new regulations supersede old, as an instance of necessarily required modification; or, today's newsfeed will be different from yesterday's (hopefully), and so augmentation begins again (as if it would ever stop!). Senders have intent - and in an enterprise intranet environment "senders" includes both authors, obviously, and those who design the knowledge representations and navigations. The intent of this second group may not be as obvious as the clear role we can ascribe to authors, but intention is built into representation and navigation layers, whether we are aware of it or not.

We might not want to use the label "perfect communication" to describe what we would consider to be optimal communication outcomes, because the word "perfect" carries bothersome connotations, but we do aim for intranet users to get what they need for task resolution.

Intention, Knowledge Selection and Choice of Language

The communication event has happened. But our model is not complete to our satisfaction. We still want to understand what makes the difference between "good" communication and "bad" communication. Sager tells us that: "The achievement of successful communication is fundamentally dependent on the three choices the sender has to make in formulating his message". [page 102]

These choices are -

- intention
- selection of knowledge
- choice of language

Intention

Intention can, at times, be clear from the type of document. A checklist is usually used to confirm knowledge. Instructions are intended to augment. User manuals can augment, confirm or modify.

More commonly, though, intention is expressed through terminology choices and choices from general vocabulary. Here senders have to be careful to align their choice of terms with what they intend the impact to be on the recipient's knowledge state. For instance, an undefined or vague term will not augment a knowledge worker's knowledge state. We can sum all this up in Sager's words.

"The sender must choose an intention that is commensurate with the recipient's expectation. For the communication to succeed, the recipient must capture the sender's intention either from the message or from the situation, and his interpretation of the intention must be accurate." [page 102]

Knowledge States

There are a number of important parameters around the sender's selection of knowledge. For instance, the sender must either have some kind of prior knowledge about a recipient’s current state of knowledge or make correct presuppositions. If the intent is to augment or modify the recipient's knowledge state, then, obviously, the sender needs to know more about the topic than the recipient does. Otherwise, no transfer is going to happen.


Language Selection

The use of accepted standardized terms makes the communication economical, precise and appropriate; the failure to use them does the opposite. In the area of choice of language, the sender must choose an appropriate special subject language, or general language.

Taking the Model Into Project Work

In technical communication there are many variations on the theme of modeling the dynamics between those who want to send messages and the intended recipients, and what kinds of outcomes, or end states, arise. We like the communication model used in the terminology processing work of Sager. It is simple and formal, and it speaks to the heart of technical communication: the purpose is to change someone's state of knowledge of defined concepts. Information access systems exist to rebalance unbalanced user knowledge states. Knowledge workers operate in a constant flux of knowledge states. This flux repeats itself continually - users seek, find, use. We explicitly design these kinds of systems, at the semantic and ontological levels, to ensure this rebalancing occurs effectively.

Terms and general vocabulary operate differently from each other in a model of communication. As information access designers we want to use terminology when possible, general vocabulary when we have to. Our focus is on designing around the pitfalls that general vocabulary usage creates in enterprise intranet environments.

We make the model work for us at the granular level of the user making lexical choices at each of the intermediate decision-making points they face. Not only do we know that we use a tool called a special language (because our users do), but we now have enough information to apply the tools of special languages and terminologies. If the choices to be made are presented in the special language they are used to using, then choices will be easier (and faster). Their special language will convey the augmentation, confirmation and modification.

At the level of the link to the individual document we want, as much as is possible, our users to be able to tell whether any document is going to augment, confirm or modify their current state of knowledge. This augmentation, confirmation or modification is not just needed at the level of links to individual documents. When taxonomies, and more commonly classification schemes, are browsable they guide users to sets of documents discussing the same subject. In this way users choose where in the domain space they want to augment, confirm or modify. Keeping up to date is just another, shorthand, way of augmenting, confirming or modifying one's knowledge state.

The model lets us think about enterprise intranets, or any information access environment, in a different way. Because the user seems to always be doing all the "work", of searching, navigating, browsing and so on, we can forget that we need to think about the environment doing much of the work. We can also reframe the environment as being "active". It is a sender or communicator. To be sure, it's only a sender when the user wants it to be (but maybe that's the best kind of sender!). When the environment is a sender it had better follow all the good practices that all good communicators practice. This is the "how" of communicating. These good practices include differentiating between general language and special languages, understanding that terms are the objects and that a special language is the "wrapper", that intent counts, and that the assumptions that the sender makes are in a mirrored relationship to the expectations that recipients have and that both sets of assumptions and expectations have to be worked with. And all this is lexical and ontological work.

We use this model in conjunction with the models of special languages and concept/definition/term. Together they form a powerful triad. We use the OntoLexical Layers Model for both strategic and project management purposes, but we really put this triad to work at the project design and activity definitions levels. We use all of them to keep us focused "on message" and to design projects that deliver.

Which is all good. But the real purpose of these models, as all models, is to make us think before we practice.

Re-Purposing the Model: the User as Recipient

In Sager's model the communication is direct. In some ways his model is a sender-centric model. It focuses on the intent, choices and actions of the sender. Working with designing and re-designing intranets we derive much insight from pivoting the model to be recipient-centric, because intranet users carry many qualities of Sager's recipients. We can focus on the users as information seekers - active recipients rather than active senders. But the model still helps point us towards what we should do as we work to provide information access to documents for users. Intranet users require all that Sager's recipients require for successful information transactions to take place.

The situational factor of the "inequality of knowledge" dynamic interests us. The sender is expected to have a greater knowledge of the subject than the recipient. The document collection holds the "answer", and the user knows, to a greater or lesser degree of precision, that the "answer" is out there. But the document set is not the sender. All the presentation tools of navigation, coupled with the semantic and ontological back-end, together are the sender. And the navigation is the means to approach and retrieve the answer.

Re-Purposing the Model: The Intranet as Sender

Another aspect of the communication dynamic interests us. In Sager's model the communication event is a one-to-one, direct, personal communication. But from the point of view of the individual intranet user the intranet communication process can appear as an instance of a many-to-one, indirect, impersonal process. And the "many" carries multiple connotations: many documents, a variety of navigation, multiple ways to search etc. The "lucky" recipient in Sager's model is given one communications deliverable, one document (or speech). This kind of communication was always push, until the internet/intranet era. Now users spend much time (and effort) in trying to "pull" information from out of the complexity. Users actively want, and require, to be communicated to. The information access system is the sender. It could usefully conform to all the constraints that Sager's senders conform to.

It is this indirectness of the intranet environment, coupled with the scale of publishing, which causes and compounds some of the problems of "hard-to-locate" content. There is certainly intention aplenty in an intranet environment. Authors, navigation designers, taxonomy engineers are all intending that best-fit documents be communicated to users. It just depends how explicit this intention is, and how crafted. We try to integrate all these parameters of communication. We integrate by working with concept identification and concept naming. We build this intermediate layer to speak the language of the user, regardless of the language of the author. And in doing so we work with the parameters that Sager's sender has to work with -

- intention (we know a great deal about our users and their requirements)
- knowledge (concept identification and definition is our key deliverable)
- language (terminologies and vocabularies, special languages and general languages are our tools)

We should always remember that we are designing information access layers to impact the user's current knowledge state that the user herself wants to change: to augment, confirm or modify.

Terminology, and lexical normalization against terminology, clarify this whole process of communication. The information transaction breaks when there is a mismatch in either the sender's or recipient's lexical choices or knowledge. In an intranet, the mismatch is between the lexicon of the user and the system. Not only is the overall model similar enough that we can work with it effectively in our information access design; the outcomes it aims for are also identical with the outcomes that we would like to occur in intranet information transactions.
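Here is one small, illustrative sketch of lexical normalization against terminology; the mapping table is hypothetical and the substring replacement deliberately naive:

```python
# A toy normalization table: general vocabulary -> the community's preferred term.
NORMALIZATION = {
    "ipo": "IPOs",
    "going public": "IPOs",
    "flotation": "IPOs",
    "job cuts": "Layoffs and Furloughs",
    "downsizing": "Layoffs and Furloughs",
}

def normalize_query(query: str) -> str:
    """Replace general-vocabulary phrases with the controlled term."""
    normalized = query.lower()
    for variant, term in NORMALIZATION.items():
        normalized = normalized.replace(variant, term)
    return normalized

print(normalize_query("news about the flotation and expected job cuts"))
# -> "news about the IPOs and expected Layoffs and Furloughs"
```

The sketch rebalances the lexical mismatch in one direction only - from the user's general vocabulary towards the system's terminology - which is exactly where the intranet, as sender, carries the responsibility.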

What Degrades Effective Communication?

Last, but definitely not least, we are rarely allowed to forget that transmission of the message can fail. The type of failure that concerns us as information access designers is the instance of incompatibility between the sender's and receiver's lexical and knowledge structures. The chosen language must be appropriate. Lexical elements carry ontological and semantic information. There are responsibilities on the sender, which accrue to (or are inherited by) the holistic intranet in our use of the model.

"The sender must choose a language and sublanguage which he assumes the recipient to have command of; the recipient must be able to recognize and understand the linguistic forms chosen in order to analyze the message." [page 104]

Sager is clear how failure of communication can occur.

"The linguistic forms to be used in a message must therefore be planned at every level in such a way that the greatest degree of clarity can be achieved by means of systematic and transparent designations of concepts and by the avoidance of ambiguity at every level of expression." [page 104]

And again.

"Accurate transmission can be impeded by incompatibility of the sender's or recipient's

  22

Page 23: Essays: Taxonomy, Terminology and Juan Sager

 

lexical and knowledge structure ... the analytical use of language, i.e. the application of inference rules in the process of comprehension, works only if designations are regularly patterned and if both interlocutors know the rules of designation. The linguistic forms to be used in a message must therefore be planned at every level in such a way that the greatest degree of clarity can be achieved by means of systematic and transparent designations of concepts and by the avoidance of ambiguity at every level of expression." [page 104]

Notes and References

1. A Practical Course in Terminology Processing / Juan C. Sager. - John Benjamins B.V., 1990.

At time of writing you can buy through the John Benjamins site, which normally has stock to hand and is weeks quicker in fulfillment than Amazon.


A Model of Special Languages – Thing, Definition, Term

The Business Requirement: Why Do We Want a Model like This?

Effective taxonomy use, in enterprise intranet environments, is one link in a chain of communication. Effective taxonomy use is also part of a larger process, but the metaphor of "chain" immediately gives us a context of a sequence of dependencies. Obviously, what goes "before" is then critical. What goes "before" as we set in motion a "Taxonomy First" approach to information access?

An understanding, acknowledgement, and re-engineering of the special languages that your communities of interest use, to communicate peer-to-peer about their shared special subjects, is one of these "before" parameters.

Why? It is the semanthink point of view that enterprise intranets are communication tools. To understand communication in enterprise intranet environments holistically we have to understand each of the parameters that together make up the process. Which, of course, is the baseline purpose in devising and using models. One of the key elements of communication is the medium of communication. Since we are focused on designing information environments that deliver ontological and semantic information, we need first to understand how communities convey, to their peers, these two kinds of information. We need to understand this medium first before we can move on to understanding two different perspectives. One of these perspectives is that, if we understand how ontological and semantic information is mediated, then we will begin to know where to look for the lexical elements that we want to build into our information access environments. The other perspective is that, if we understand how ontological and semantic information is conveyed, we can build into our communication tools, such as taxonomies, a re-engineered version of the special language in question, to convey the very same ontology and semantics. In doing both of these tasks successfully we will mediate between members of a community and the documents they need in the moment for the task at hand, rather than disintermediate them.

The answer to the implied question above is that members of a community that share a practice of a special subject communicate by utilizing a "special language".

The Environment that Mediates, or Disintermediates, Resources and Users

Today's business documents are written for specialized communication. Every community of practice is a technical environment from a lexical point of view; investment banking, catering, human resources, logistics, project management are all technical environments. Today's business documents use terminology to convey facts and analyses to communities that practice the same special subject. This means that today's business documents are essentially written in numerous special languages.

It's important to remember that the concept of "special language" is a model. In fact, it is a working model. It is a model to be used when doing so seems to work for the project at hand. It's also important to remember that users in technical communities are highly unlikely to be aware that they are using a special language, for two reasons. Obviously, this is because it is only a model, and not necessarily "true". Additionally, being self-reflective about what makes one good at what one does is very different from simply being good at it.

Special languages are really part of Juan Sager's larger model of communication 1. We've extracted the concept from there and wrapped it in its own model. This makes the overall set of models easier to both teach and apply (and think about). We discuss some aspects of Sager's work in the Review Essay of his key book.

It's hard to overestimate the importance of the concept of a special language. How do peers in a technical environment communicate? They communicate through the medium of a special language. The special language of any community of interest is both the lexical wrapper for the special concepts of that particular domain and the set of related definitions of these concepts. This is non-dynamic, in the sense that the metamodel of a special language - its concept of definition, term, and concept itself - remains the same. A special language is a tool, not an event. In the act or event of communication there is a dynamic interplay of actors and parameters. It is within this dynamic that special languages play their role as a communication tool. We will come to this shortly.

Defining a Special Language

We would say that all business communities of interest are cultural, in the widest sense of the word, and so technical. And we would constrain "cultural" to characterize the combination of shared subject of interest, way of conceptualizing it and way of communicating about it. Sager comes to the same conclusion, but says this in his own way. In short, the language used in any business, scientific or technical community is a special subject language. Sager uses the expression "special languages" to indicate that communication occurs in a "technical" environment. This is how he defines a special language -

"Special languages have been defined as semi-autonomous, complex, semiotic systems based on and derived from general language; their effective use is restricted to people who have received a special education and who use these languages for communication with their professional peers and associates in the same or related fields of knowledge." [page 105]

All of these parameters characterize corporate intranet environments. Communities that work together receive a "special education", whether it is highly formal or so informal that it passes unnoticed as a phenomenon. Special languages are semi-autonomous. Large proportions of the lexical elements, the "rules" for compounding phrases, and the meanings that lexical elements convey, are different from the general vocabulary we all use at the periphery of our special subject and outside our special subject. Special languages are complex to outsiders. Special languages are semiotic - the item of terminology in any special language "substitutes" completely for the definition of the concept, regardless of the complexity of the concept (to insiders or outsiders). And "effective use" only comes from an alignment of knowing what you are talking about together with knowing how to talk about it.

"The lexicon of a special subject language reflects the organizational characteristics of the discipline by tending to provide as many lexical units as there are concepts

  25

Page 26: Essays: Taxonomy, Terminology and Juan Sager

 

conventionally established in the subspace and by restricting the reference of each such lexical unit to a well-defined region. Besides containing a large number of items which are endowed with the property of special reference the lexicon of a special language also contains items of general reference which do not usually seem to be specific to any discipline or disciplines and whose referential properties are uniformly vague or generalized. The items which are characterized by special reference within a discipline are the 'terms' of that discipline, and collectively they form its 'terminology'; those which function in general reference over a variety of sublanguages are simply called 'words', and their totality the 'vocabulary'." [page 19]

The lexicon carries an ontology, or a representation of the domain. Polysemy is highly and explicitly constrained - one concept to one lexical unit is a highly efficient way of communicating.

Special languages differ from general languages in that they use terms in addition to words. Terms as linguistic expressions, and hence special languages, carry two core properties of communication in general - the property of definition and the property of shared special reference. These two communication properties, of course, carry over into information access environments.

"In special communication terms are considered substitute labels for definitions because only a full and precise definition is the proper linguistic representation of a concept." [page 109]

"Only if both interlocutors in a speech act know the special reference of a term, and, by implication, that they are using terms rather than words, can special communication succeed." [page 105]

"Special reference" refers to the structure of concepts in any special subject, as opposed to general knowledge. Special subjects have a need for delineating the relationships between concepts more strictly, so with much less fuzziness or flexibility, than in general knowledge. But this delineation is one of difference in degree only. This is how Sager puts it.

"Within a subject field, some or all of the included dimensions may assume increased importance, with a greater need for distinction between a larger number of concepts along a given axis: at the same time, the necessity to avoid overlap between concepts, i.e. intersecting regions, will tend to reduce the degree of flexibility admissible in the delimitation of the bounds of any given concept. There is thus a difference of degree between the intradisciplinary structure of concepts in the bounded subspace of a special subject or discipline and the less well-defined, less "disciplined" structure of "general knowledge". This does not mean that general knowledge cannot contain well-defined facts; but only that disciplines have a greater need for more rigorous constraints on the overall delineation of items of knowledge. The referential function at the extremes of this distinction is classified as "special reference" and "general reference" respectively." [page 19]

How Special Languages Communicate

"In special communication terms and standardized terms make a critical contribution to

  26

Page 27: Essays: Taxonomy, Terminology and Juan Sager

 

achieving complete and effective communication. This they do by making the choice of language, knowledge and intention more systematic and hence easier. In order to establish criteria for evaluating the effectiveness of communication in special languages we can postulate three objectives or properties:" [page 105] -

Economy Precision Appropriateness

From our point of view of project design these "three objectives or properties" are a way for us to model the use of lexical choices, and to model how each type of choice contributes to successful or failed communication.

The Lexical Expression of Economy

Economy really has two dimensions -

- conciseness
- co-ordination of content and intent

In term formation there is economy in simply juxtaposing lexical elements to create terms rather than using neologisms (which require traversing a learning curve).

The Lexical Expression of Precision

Precision in working with terminology (as opposed to its technical sense of a metric used in information retrieval) is a measure of the accuracy with which knowledge and intent are conveyed by a set of lexical elements. Precision works with two parameters: precision of reference and precision of syntactic relationships between referents.

The Lexical Expression of Appropriateness

Appropriateness is a measure of the effectiveness of the intention as it is expressed and understood in the message. Economy and precision are in a dynamic relationship, and appropriateness is the property that balances the two.

The Functional Role of Terms in Communication

"Terms can, of course, only be used as such if the user already possesses the

  27

Page 28: Essays: Taxonomy, Terminology and Juan Sager

 

configuration of knowledge which determines the role of the term in a structured system. The limiting case of this restriction is the requirement that a new term be learned contemporaneously with new knowledge, e.g. through text books; a term acquired without awareness of the conventional configuration of knowledge to which it relates is communicatively useless." [page 20]

"Within any given language, a wide range of phonological, grammatical and lexical variation is available, but, within the range of possible variations, the social norm operates to determine criteria for the selection of codes, whose phonological, grammatical and lexical properties may be functions of the situation in which communication takes place. In general, diversification at the level of phonology and grammar is most evident in regional and social variation and is therefore of marginal interest for terminology. Variation on the lexical level is most characteristic of special languages, the linguistic subsystem selected by an individual whose discourse is to be centered on a particular subject field." [page 18]

Of course, we don't need enterprise knowledge workers to explicitly know (or even suspect) that they are constantly using special languages when they communicate with peers or customers. Or when they use an intranet. But those of us who design for knowledge workers do need to know this. We also need to understand that when peers in a community of interest communicate they "know" that they are using terms. Another way to approach this point is to think of it as a process of default values. Technical peers "default" to using particular lexical elements (terms) whenever they are in technical communication. This default takes precedence over natural language usage. This default is communication using special languages.

The Business Requirement: Why Do We Want a Model like This?

We needed to know what the actual tool is that peers in a community of interest use to communicate. Now that we do know, we see that it is a language tool called a special language. This tool of a special language combines two different types of lexical elements, "terms" and "general vocabulary". The terms of the special language carry two particular functional properties. Terms are substitutes, or a shorthand, for conceptual definitions. Terms carry information about the conceptual or ontological structure of the particular domain, including "special reference", which is information about the precision of delimitation between "adjacent" concepts.

The Model of Special Languages tells us the communication tool that our users use. This in turn tells us that the information access environment that we are designing should also use special languages, equally effectively as our users do. This means that we will use terms where terms are required and general vocabulary when it is called for. And that we will know how to decide which situation is which. Every type of presentation that the user interacts with will be an opportunity to use a special language, or not - navigation schemes, browsable classifications and taxonomies, search results rendering and so on.

Now that we know the core ways in which this tool functions, it is not a large leap of intuition, or pragmatic project management sense, to see where we are headed, in working with lexical information. This model is linked with our other models in a synergistic way.


The model of term/concept/definition tells us that unless we have these three inter-related functional elements in perfect alignment for every concept that the user community requires, we are going to build unwanted outcomes into the environment. From an information retrieval point of view this will mean issues with precision and recall. From a usability point of view this will mean all kinds of user behaviors that usually indicate frustration, lostness and so on. From a business point of view there are going to be efficiency consequences. These will, as always, be hard to measure.

The Model of Communication tells us both about intent of communicators, and states of knowledge that need rebalancing. We want to understand the total information access environment as a communicator. Basically, "how" it communicates to the user. Once we understand this we can begin to harvest the implications of how the information access environment works. Once we can isolate these implications clearly, and since the communication medium is words as representations of concepts, we can then explicitly design our lexical layers (and the workflows that they depend upon).

These three models taken together give us a model that -

- distinguishes terminology from general vocabulary
- distinguishes a special language from a general language
- defines how a word becomes a term
- places the definition of concepts (named by terms) at the interface of the communication transaction

When we add the fourth, the OntoLexical Layers Model, we can see where words, terminology, and representations of domains occur, and how they should function. Taken together (and used together) they put at our fingertips a map of four models that explains to us the where, how and why of optimally using lexical elements in an information access environment such as an enterprise intranet.

Bringing the terminology dimension to our ontological and knowledge engineering work is one of the bases of our information access design. Particularly, we want to open up the black box of the process of terminology and vocabulary choices at a granular level. Deciding that a word is an item of terminology is an ultra-granular decision-making process. And this decision-making process is reprised thousands of times in designing for information access. So, we need a model to support the intent of our decision-making processes, and, in truth, to keep us focused (and "on message") on the importance of selecting, every time, the optimal lexical elements that are to be communication tools. And lastly, if we have a model that makes sense, then we can teach (or socialize) the skill that the model describes.

Are there Exceptions?

Are there exceptions to special languages being used? The answer is "no". If you want to use a different model, then you can model the total information access environment without using the concept of special languages. But you may find that you will need to accommodate certain ideas that the concept of special languages gives you.

Notes and References

1. A Practical Course in Terminology Processing by Juan C. Sager. 1990, John Benjamins B.V.

At time of writing you can buy through the John Benjamins site, which normally has stock to hand and is weeks quicker in fulfillment than Amazon.


The OntoLexical Layers Model

Semantics and Ontology: the Core Communication Problem

People, knowledge workers included, are native ontologists. We all classify our worlds in highly skillful and nuanced ways. There are associated costs when this native ability is translated into a human language. People, knowledge workers included, use the semantically rich tool of human languages to communicate definitions, precision and differentiation, and how concepts relate each to the other. We pay some costs for this semantic richness.

Many of these costs are a result of polysemy. In the world of numbers "247" always "means" 247 (by and large). In the world of words, lexical objects such as "promote", "coupon" and "board" don't unambiguously declare, in their isolation, where we are and what we are about. Fortunately (although it could not be otherwise) we rarely use words in such strict isolation. There are always contexts that give us our ontological place and our task at hand.

Meanings and conceptual relationships are what we work with in information access environments. Returning to the ever-present background issue of "information overload" as an angle that never goes away: we don't have "too much" semantic and ontological richness. Actually, we have too little of that which produces semantic and ontological richness. Too little of this means too much of that - the overload stuff, that is. Too little, in short, of effective knowledge organization and effective subject recognition.

How can we process and organize this ontological and semantic richness? We have two types of tools. To work with ontological richness and complexity we use knowledge organization tools. A taxonomy is an instance of a knowledge organization tool. To work with semantic richness and complexity we use subject recognition tools. People are excellent subject recognition tools. To work at scale we can use a categorization application from one of numerous vendors.

Of course, there is much "more" to an optimal information access environment than just knowledge organization and subject recognition. Choice of technology applications does count. As does their conceptual/semantic implementation. Training, the transfer of knowhow, is key. Workflow processes have to be defined and managed. Maintenance never goes away: because language lives, and so constantly develops, and because subject domains change their scope and sets of relationships, maintenance is key also. And, finally, project design makes the difference between success and some halfway house between failure and the need to keep a low profile.

What Models will Help Us?

It's the semanthink thesis that, when working with document sets or corpora at scale, two of the pivotal functions to "get right" are knowledge organization and subject recognition. In a massive document environment, failure to do this creates a non-negotiable bottleneck. We can define "massive" as that quantity of documents that will always increase in number, day by day, while an inordinately large group of people are processing them manually! Sounds like an entropic nightmare with a sly attitude.


Words are everywhere in intranet environments. Our general vocabulary, and a sub-class of this, the set of technical terms used to communicate about any special subject, act as communication media. We want to use these terminologies to connect users to documents in information access environments. Since lexical elements are the intermediary between users and documents, we need a set of models to help us understand how words operate as carriers of both semantic and ontological information.

And then there are the actors - authors, knowledge workers, taxonomy designers, and so on. Each of those has particular skills, needs and responsibilities.

The interaction of documents, words and people needs an overall model that gives each its "place" in the overall scheme of things. The OntoLexical Layers Model shows us "where" words and terms "are", performing different communication functions within different functional tools.

The OntoLexical Layers Model

Our task as information access designers is to build systems that deliver meaning to the user. "Delivering meaning" means that the user finds what is required and expected, with little or no "noise" - the noise being whatever is irrelevant or unexpected.

What are the layers that lie "behind" the user interface layer? Let's borrow (or, re-use in knowledge management terms) a story from the semanthink Review Essay on Sager's book.

"Once (and still), upon a time, the user sits in front of the screen. The user knows the task at hand. There is a chance that one of the links currently showing on the screen will lead to a document that resolves this task. The user is focused on two or three links in particular. Actually, the user is focused on the words contained in these links. This small moment is a moment of decision; to click through, or not."

In this little story, if the user could look behind the pixels of the screen, an informationscape something like the one outlined below would be seen. These layers each play an individual function in connecting users to content. The layers are sometimes process and sometimes application.

Author Layer

This is where information life begins.

The role of terminology in the Author Layer: Authors know what they are writing about. They know the special languages of their audience(s).

Content Layer

The Content Layer is the universe of content: the repositories of the intranet, the newsfeeds and their processing etc. The intranet document set, or corpus, is a container of (lexically described) concepts. Documents are wrappers for information on concepts. Documents are also of different kinds that approach the discussion of subjects in different ways. The Content Layer is also a lexicon of (as yet) undifferentiated and unprocessed vocabulary. This lexicon contains terminology (such as names of things), potential terminology (with choices of variance of preferred term) and lexical elements that are expressions of concepts that will require a term to label them. It is within the Content Layer that what the user seeks resides - it may be a number of documents, a single document or a part of a document.

The role of terminology in the Content Layer: Terminology exists in this layer, but it is undifferentiated from general vocabulary. We are at a pre-processing stage with regard to terminology compilation.

Subject Recognition Layer

Subject recognition is the process of recognizing the lexical elements (words and phrases) in texts that are used to represent (and discuss) any particular concept.

There are different ways to characterize the small number of different and competing technologies used within categorization platforms to recognize the occurrence of particular subjects in documents. For our purposes here we can simply divide them into two families.

One set of technologies explicitly programs human semantics and syntax into the categorization application. An example of this approach is semantic networks. In a semantic network, words and phrases are grouped into sets of synonyms. Sets of synonyms are associated together to provide the elements required to build (and so match in the texts to be categorized) complex, compound concepts that require more than one synonym set or semantic type (noun, verb etc). We can characterize these kinds of approaches as knowledgebase-based approaches. The product model is to build a knowledgebase of meanings and their words, and to program this knowledgebase into the application.
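As a rough illustration of this family of approaches, here is a toy sketch of synonym sets combined into a compound concept. The concept, the synonym sets and the matching logic are all invented for illustration; no vendor's actual semantic network works this simply.

```python
# A toy sketch of the knowledgebase-based family: synonym sets are
# associated to build a compound concept, which is then matched
# against a text. All names and synonyms are invented examples.

SYNSETS = {
    "acquire": {"acquire", "buy", "purchase", "take over"},
    "company": {"company", "firm", "business", "corporation"},
}

# A compound concept needs evidence from more than one synonym set.
MERGERS_AND_ACQUISITIONS = ["acquire", "company"]

def discusses(text: str, synset_names) -> bool:
    lowered = text.lower()
    # The concept is recognized only if every constituent synonym set
    # contributes at least one surface form to the text. (A real
    # implementation would tokenize and handle morphology; naive
    # substring matching is enough to show the shape of the idea.)
    return all(
        any(form in lowered for form in SYNSETS[name])
        for name in synset_names
    )

doc = "MegaCorp agreed to buy the smaller firm for $2bn."
print(discusses(doc, MERGERS_AND_ACQUISITIONS))  # -> True
```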

When people "do" subject recognition against documents, they do it lexically only. It's important to note that people are also subject recognition experts - ask any cataloguer!

The other set of subject recognition technologies, from our terminology-based information access point of view, do not program human semantics and syntax into applications. Instead they rely on building mathematical or statistical models that characterize the content of documents. Examples of these technologies include Bayesian probabilistic methods, Latent Semantic Indexing and other vector approaches. This approach does not treat words as lexical elements but as strings of characters, with no semantic meaning or syntactic purpose. Complex sets of relationships between character strings are computed for documents, sets of documents and corpora. Models are derived from the products of these computations that characterize the "conceptual signature" of any document - what it is, or might be, about. These document classifications have to be validated or accepted by content managers or knowledge workers to ascribe "meaning". This set of approaches is terminology-free. Character strings are not accounted for semantically or syntactically, and so cannot be terms. Concepts and their definitions exist only insofar as knowledge workers pragmatically accept or reject classifications based on the underlying mathematical or statistical models that have been generated.
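For contrast, here is an equally toy sketch of the terminology-free family, in which words are opaque character strings and documents are compared as vectors. It is a deliberately naive stand-in for Bayesian, LSI and other vector methods, not an implementation of any of them.

```python
# A toy sketch of the terminology-free family: a document's
# "conceptual signature" is a vector compared mathematically, with
# no semantics or syntax programmed in.
import math
from collections import Counter

def signature(text: str) -> Counter:
    # Strings of characters, not lexical elements: no synonymy, no
    # definitions, no parts of speech.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc1 = "quarterly earnings rose as revenue grew"
doc2 = "revenue and earnings grew in the quarter"
# A human still has to decide whether this similarity "means" the
# two documents discuss the same concept.
print(round(cosine(signature(doc1), signature(doc2)), 2))
```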

While it is true to say this classification process is terminology-free, there are important requirements for terminology in the overall information access design that incorporates statistical modeling methods. Sets of documents that are classified as discussing the same concept require labels. Sets of classified documents need to be related to other sets of documents, in one meaningful way or another, to create a taxonomy or classification scheme. The labels of the taxonomy or classification will need to show or indicate the nature of these hierarchical relationships.

The Subject Recognition Layer is shown "deeper" in the model than the Metadata Scheme Layer. While the metadata scheme contains the elements that contain the instances of, for example, the labels of a domain taxonomy, the documents in the Content Layer are not identified with their "metadata signature" until the subject recognition processing step takes place. So in this sense the Subject Recognition Layer is "deeper", in terms of process, than the Metadata Scheme Layer.

The role of terminology in the Subject Recognition Layer: If the underlying technology of the subject recognition application is not lexically-based, but involves building statistical or mathematical models, then terminology as a class of lexical elements is not a parameter in this environment. If the underlying technology is lexically-based, such as approaches founded on semantic networks or rules-based topics, then both terminology and terminology variance can be re-engineered into the subject recognition knowledgebase.

Subject recognition is a process. But when you have created subject recognition executables, based on any categorization technology, and loaded them into a categorization application, these executables are part of the platform layer. The deliverable of subject recognition is sets of meta-tags associated with individual documents. The subject recognition layer is required to recognize the concepts needed by the user communities when they are discussed in particular documents. It is not required to recognize "all" the concepts in the document set. Where do these concepts that require recognition come from?
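A minimal sketch of that deliverable might look like this; the taxonomy node names and the predicates standing in for subject recognition executables are hypothetical.

```python
# A minimal sketch of the subject recognition deliverable: documents
# leave the process carrying sets of meta-tags. Node names and
# predicates are invented stand-ins for real executables.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    subject_tags: set = field(default_factory=set)

def run_subject_recognition(doc: Document, executables) -> None:
    # Each "executable" pairs a taxonomy node name with a predicate
    # built from that node's definition and scope.
    for node_name, predicate in executables:
        if predicate(doc.text.lower()):
            doc.subject_tags.add(node_name)

executables = [
    ("Mergers & Acquisitions", lambda t: "merger" in t or "takeover" in t),
    ("Regulation", lambda t: "regulator" in t),
]
d = Document("doc-001", "The merger was approved by regulators.")
run_subject_recognition(d, executables)
print(d.subject_tags)  # -> both tags attached
```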

Metadata Scheme Layer

The metadata scheme contains all the kinds of descriptions to be applied to documents to support both user requirements for information access and content management requirements for maintenance and management reporting information. Taxonomies, ontologies, topic maps and other Knowledge Organization schemes belong in the metadata scheme, as elements of the scheme. Designing the metadata scheme is a process. The metadata scheme itself is a platform layer.

The role of terminology in the Metadata Scheme Layer: Any controlled vocabularies in the metadata scheme are terminology, and require design, management and maintenance. Uncontrolled vocabularies, such as author names or search engine input misspellings, still need to be managed by rules and require variance to be normalized.
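As an illustration of rule-based normalization of an uncontrolled vocabulary, here is a small sketch that normalizes author-name variance; the rules and the variants are invented for the example.

```python
# A minimal sketch of rule-based normalization of an uncontrolled
# metadata vocabulary (here, author names).
import re

def normalize_author(raw: str) -> str:
    name = re.sub(r"\s+", " ", raw).strip()  # collapse whitespace
    name = name.rstrip(".")                  # drop trailing punctuation
    if "," not in name and " " in name:      # "Jane Doe" -> "Doe, Jane"
        first, _, last = name.rpartition(" ")
        name = f"{last}, {first}"
    return name.title()

for raw in ["jane  doe", "Doe, Jane", "JANE DOE."]:
    print(normalize_author(raw))  # all three -> "Doe, Jane"
```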


Knowledge Organization Layer

The process of engineering a knowledge organization scheme includes how we identify concepts required by the user communities. Note that we say "identify" here, whilst it is the function of the subject recognition process to "recognize" the concept so identified. The Knowledge Organization Layer is where the Subject Recognition Layer "obtains" the concepts that need to be recognized. Within the knowledge organization scheme the identified concepts are arranged in relational-hierarchical relationships to each other. We arrange these identified concepts in knowledge organization schemes such as ontologies and taxonomies. Ontologies, taxonomies and other knowledge organization systems are integrated into the enterprise metadata schema and become part of it. There, they are part of the platform layer.

The Knowledge Organization Layer represents the users' world. It does not, and should not, represent the total abstract world of the collective of concepts lexically described in the document set. This is why the Knowledge Organization Layer and the subject recognition process have to be tightly coupled. Apart from identification (and so required recognition) of concepts, it is important to be clear about the definitions of these concepts. Definitional tensions in the Knowledge Organization Layer lead inevitably to application tensions in the Subject Recognition Layer - if we don't define the scope, inclusions and exclusions of a taxonomy node, how can we build a subject recognition executable to correctly identify discussion of the concept in texts?
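One way to picture this tight coupling is a taxonomy node that carries its definition and scope with it, since the scope is what the subject recognition executable is built from. A minimal sketch, with a hypothetical node:

```python
# A minimal sketch of a taxonomy node whose term, definition and
# scope notes travel together. The node content is invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaxonomyNode:
    term: str        # the name that communicates the concept
    definition: str  # without this, the concept "does not exist"
    includes: List[str] = field(default_factory=list)
    excludes: List[str] = field(default_factory=list)
    children: List["TaxonomyNode"] = field(default_factory=list)

node = TaxonomyNode(
    term="Outsourcing",
    definition="Contracting an internal business function "
               "to an external provider.",
    includes=["offshoring", "business process outsourcing"],
    excludes=["procurement of goods"],
)
```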

The three fundamental steps of terminology processing - concept identification, concept definition and concept naming - are also three of the key stages in any development methodology of a knowledge organization scheme.

The role of terminology in the Knowledge Organization Layer: The Knowledge Organization Layer requires terminology to name the concepts. Without a name the concept cannot be communicated. Without a definition the concept does not exist.

Terminology Layer

Terminology identification and standardization is a process. The process of terminology identification and application allows concepts to be named. Terms are stored in the metadata scheme. Taxonomy node names are usually terminology. Other metadata element types, such as document object types and audience characterization types, are terminology.

Navigation Presentation Layer

The presentation of concepts to users occurs in navigation schemes and search results rendering. This presentation layer requires terminology, usually in the form of terminologies that describe different kinds of concepts, to enable users to connect to particular concepts discussed in particular documents. Concepts are then displayed to user communities in the Navigation Presentation Layer, using terminology, where appropriate, and general vocabulary.

User

The user in a sense is a "container" of tasks, just as the content is a container of concepts that are discussed. In the context of this model of layers the user performs tasks that require the locating of information. So, information finding is a precursor task to the task of using it. That means we must design information access solutions focused on user seeking becoming user finding, as quickly and efficiently as possible. In the context of Sager's model of communication, we pivot the metaphor of communicator and recipient so that the information environment that we design is the "communicator" and the user is the recipient of communication.

Why Do We Want a Model like This?

We want a model like the OntoLexical Layers Model to be able to take the total information access environment apart conceptually, so we can analyze both the purposes and the functions of each of the layers. We want to do this for two kinds of reasons - to make our work more precise at the strategic level and at the project design and management level.

We use the model in strategic planning to provide a placeholder for all the elements, outside budget and politics, required to deliver information access strategies. Strategically, we need to look at the requirements of all the information-using communities on the one hand. We also want to understand and audit what applications, workflow, outputs and quality measures we have on the other hand. Putting these two sets of facts together lets us plan and prioritize.

At the project design and project management level we want to be able to deliver workflow processes and configured/customized content handling applications. When we look at each of the layers individually we discover all the variable parameters that are configurable for our information access design. Each of the layers carries its own set of parameters, which we identify in order to -

- ascertain status
- define maintenance and management issues
- identify tasks (that may later become projects)
- specify required application functionalities
- specify required workflow processes

There are many other ways to model this environment. But any model is only ever devised to serve the total process of creating deliverables that are effective, and this model gives us all the parameter placeholders that we need at both an abstract level and a granular level.


Perspectives on Paradigms

Project Management as a Paradigm-free Environment

The emerging space of bringing organization to enterprise content management is overflowing with paradigms in transition. There is a complexity of paradigms emerging and fading, each impacting single elements of the overall problems/solutions space that we all work with. This makes our tasks exciting and hopeful, but also ultra-complex. We need to ground ourselves in understanding this inevitable flux of paradigms. Then, we will be well placed to deliver solutions that are successful. There is a risk - using "risk" as a technical project management term - in leaving a perspective on paradigms out of our understanding of the overall information access environment.

Although semanthink is about delivering information access solutions that convey the semantic and ontological information required by user communities, semanthink is really (between you and me) about project design. If we get project design right, then we own a deserved momentum. And project design incorporates either solutions to, or at least awareness of, all the parameters we need to figure with and configure. These include, amongst many, redefining skillsets, understanding the functional elements of tools such as taxonomies, deciding what quality is in relation to semantic and ontological information (and how to assess it), and understanding above all that information organization is communication. All of these are impacted by paradigm changes currently in play.

semanthink is about designing information access projects that deliver success. Don't think for a minute that project and program management is a paradigm-free environment. It's not. Knowing which paradigms are operating in a project space is one of the keys to designing successful projects. Somewhere just after strategic vision ends and just before implementation begins is a cusp. Sometimes our awareness of operating paradigms slips through this gap. It's the moment where we want strategy and implementation to intersect, not diverge. These early moments are when we frame our project design parameters, knowingly or not. The more explicit this framing is, the better our chance that our projects will kick off on a robust intellectual footing, and that we can make the difference between success and struggling to make a success. This series of perspectives is about bringing some operating information industry paradigms from background to foreground.

There are two sets of content, of course. Content originated by the enterprise itself, and content from publishers that serve the enterprise. Perspectives on Paradigms also aims to discuss some of these paradigm dynamics from the point of view of publishers. After all, they are part of the problem of information access. The more alignment there is between publishers-for-the-enterprise and the intranets that need to incorporate publisher content, the better. A problem shared, in this context so far, is not a problem solved/halved. But some solutions that publishers could adopt, over time, will remove some of the problem areas that intranet managers face. Publishing is one of the industries that has a huge amount to offer (and a huge amount of gain to accrue) if it were to align product development with solutions to the endemic enterprise problems of information organization and categorization.

 

Taking the Notion – An Introduction to the Set of Perspectives on Paradigms

I've taken the notion of "paradigm" from Thomas Kuhn's 1 famous (paradigmatic even) work on the history of science. While the practice of science and business are markedly different in all kinds of ways, the concept of paradigm, and paradigm shifts or changes, is now commonly applied by analysts, business thought leaders and CEOs in understanding (and being best responder to) the dynamic of business shifts.

The Problem with Paradigms: Paradigms Influence Projects

For instance, you want to implement an enterprise document categorization solution where the underlying subject recognition technology is lexically-based, as opposed to statistically-based. One emerging paradigm here is that the future is semantic. One of the differences, then, between success and still reaching for success will be the lexical skills available to your project team. Or, suppose you want to investigate the feasibility of implementing an application to help you build your enterprise taxonomy, or parts of it. The re-emerging paradigm here is that of the re-centralization of active human intelligence at the heart of the decision-making process. But the type of human intelligence that the project will require is a skill with building ontological representations of areas of interest. Another useful skillset for the project team here might well be a familiarity with terminology practices. The era of buying the software, booting up the software CD and sitting back to leisurely browse your new enterprise taxonomy is over. (In fact, that era never really arrived.)

What is a Paradigm?

What is a paradigm? It's a shared worldview of a way to understand a slice of the larger world, and one that provides ways to model problems and solutions, at all levels.

Paradigms become Foreground and Background

Paradigms emerge and fade. To be replaced, of course, by others. Some paradigms are not true, but are conceits, or marketing hype. For instance, software that can automatically "understand" text was certainly marketed as a paradigm shift. But the underlying paradigm just wasn't there. Desire does not a paradigm make.

The business model of content categorization companies has changed markedly since the late 90s. And amongst all the growing pains that new and emerging business segments face, and the ebb and flow of business advantage in the marketplace, there was one ... quite definite ... paradigm change.


Categorization Vendors

Then, vendors such as Verity and Convera sold powerful "functional shells" that could, if programmed lexically, recognize in documents any subjects that you would wish to define. By "functional shell" I mean that the application contained all the subject recognition functionality ... except the lexical content, the missing and essential "magic sauce". So, where was this lexical content, to build subject recognition rules and semantic networks, going to come from? Or more importantly, from the point of view of project and program managers: how was this mission-critical lexical content going to be delivered, what project activities needed to be defined, what resources and timelines allocated, and so on?

Now, Verity, Convera, and others including InXight, Entrieva and nStein bundle lexical content with their subject recognition applications. How is this all going to play out in the marketplace? Well, not all lexical content is created equal. That's a telling early point we can make. There are vast gaps in domain coverage when we knit together all the available good domain vocabularies that can be re-engineered into first-class business-facing taxonomies and subject recognition objects. Perhaps from the history of technical dictionaries we can intuit something useful about the coming play of forces. And all of this means? ... that there is a whole host of project design issues emerging anew. Exactly.

The Publishing Industry

Similarly, publishers of must-have trade, professional and business news face interesting options if (or when) paradigms emerge that suggest taxonomies and semantic networks, or other sets of subject recognition objects, may be of huge value in cementing a different kind of relationship between their content and their customers. (Cemented business relationships are good, because cement is hard to break to let in competitor publishers.) The days of business magazines just "telling it like it is" are a long-gone paradigmatic past, of course. Content available in ontological organization that represents, and so fits with, the ways customer communities of users think and work, and categorized precisely and finely ... is that part of the new publishing model, or not? So, there is a whole host of issues emerging around this dynamic of publishers being part of the problem or part of the solution.

What Lies Between? ... Interoperability

Ontological interoperability is a paradigm waiting to happen. Collaboration requires ontological and semantic interoperability. But how can the global enterprise achieve ontological and semantic interoperability (internally even) if the origination of its knowledge organization and subject recognition layers is fragmented? But, enough ...

Point of View

 

Perspectives on Paradigms aims to provide a take on some of the different vibrant paradigms that come into play in developing projects to deliver a total information access environment. Particularly those that impact the two key dependencies for creating an optimized enterprise information access environment: the design of the knowledge organization and subject recognition layers.

References

1. The Structure of Scientific Revolutions, 3rd edition / Kuhn, Thomas S. University of Chicago Press, 1996.

An excellent outline of and study guide to "The Structure of Scientific Revolutions" is available from Professor Frank Pajares of Emory University.

It's only fair to point out that there is genuine controversy in history of science and philosophy of science communities about Kuhn's concept of paradigm. Begin with a Scientific American review of Steve Fuller's book on Thomas Kuhn - Thomas Kuhn: a Philosophical History for Our Times / Steve Fuller. University of Chicago Press, 2000.

And Steve Fuller's book is available from the University of Chicago Press.


Perspectives on Paradigms

February 2003

A Brief History of Dictionaries: Meanings and Their Words

We are, right now, inside the transition point between two paradigms that define the knowledge economy in a truly major way. The transition impacts every company playing in the global economy. One of these paradigms is fading from the foreground, and one is emerging. As they do. The cycle of the one fading has been long. Centuries.

But first, let's visit the business problem.

The Current Business Problem

Recognizing the occurrence of concepts in texts automatically 1 is a key dependency in creating an information access environment at scale. But, more than this, it is also a "bottleneck problem" - solve it, and we can move on; fail to solve it, and we can't. What does the unsolved subject recognition issue prevent us moving on to?

If we cannot recognize concepts in texts, we cannot automatically associate documents with any domain taxonomy, even if it is the best taxonomy in the world, nor can we group and relate them ontologically. If we cannot recognize concepts in texts, we cannot automatically leverage the definitions of users' ways of thinking into any application's way to retrieve and differentiate documents based upon the subjects that they discuss.

That's the business problem from the "outside". "Inside" the business problem, solving it at the project level, we realize that there are implementation issues, hence project design issues. These project design issues are all centered around working with words. Using words to find words. Or, to say it another way, putting words into a categorization application to find words, and so the concepts they signify, in documents. All this at a scale that matches the requirements of the global knowledge economy.

At the project level we are going to need to be focused on skillsets (lexical skillsets) and deliverables (lexical deliverables). Ways with words, and ways through words.

The Paradigm Transition Defined

I've often been in presentations where the presenter goes back to the Stone Age, or some other ancient time, and stakes out the beginning of a paradigmatic vector. For instance, cave paintings as the emergence of semiotics, or the ancient library at Alexandria as a beginning benchmark for information overload etc. And I've smiled.

So ... my version goes something like this ...

There are three extreme paradigm shift milestones in human culture becoming dependent on text as an integral part of human culture itself. These are -

- invention of the printing press
- invention of the dictionary
- invention of lexically-based subject recognition models such as semantic networks

The defining tipping point of this transition is the move away from a focus on words and their meanings towards a focus on meanings and their words.

The Invention of the Printing Press

The invention of the printing press, amongst other effects, drove fundamental changes in literacy (and the concept of literacy), authoring, mass communication, and what subjects were deemed candidates for writing about.

The Invention of the Dictionary

The invention of the dictionary 2 allowed absolutely essential normalization at the level of the individual printed word. The functionality of the word as a conveyer of meaning, and a communication device between writer and reader, was enhanced through this kind of normalization. Spellings were normalized. Pronunciation was normalized, through the adoption of a standard code that translated phonetic symbols coupled with morphological units. Meanings, at the level of the word, were normalized, and polysemous senses pointed out. The invention of the dictionary allowed translation (i.e. interoperability) between languages. All this created manifest cultural impacts, such as providing a platform for education and science, which are not really relevant in an information access design context.

But, all this was at the level of the word, each word largely in isolation from the others. How is the concept of the dictionary as a tool limited (and limiting) in an information access design context? It is true that texts are words. But what we really want to do with texts is to recognize the concepts that they discuss, regardless of the ways the same concept may be worded.

As the scale of information publishing increased, new tools were needed. Because scale created issues around finding. Dictionaries were not enough on several dimensions. Dictionaries specify usage, how we can write. But they don't help in finding ways to word concepts. Dictionaries help us to normalize what we communicate, "push" out. But they don't overly help us with what we want, as knowledge workers, to "pull" in. Lexical knowledgebases, such as semantic networks and other sets of subject recognition objects, are designed around the notion of meanings and their words, and so help us find.

Where we are on the paradigm change curve today: dependent upon text(s), but not yet able to leverage their full content exhaustively and precisely.


The Invention of Lexically-Based Subject Recognition

The invention of lexically-based categorization models is the last piece of the puzzle. It is the emergent paradigm that allows finding in large document sets and corpora. Just as dictionaries allowed normalization at the level of the word, so lexically-based subject recognition knowledgebases offer, essentially, normalization at the level of the individual meaning. They give us the ability to work with meanings and their words. We put sets of words together and so build models that recognize subjects in texts.

There are a whole range of platforms that allow us to work with meanings and their words. Verity delivers quality subject recognition results using a rules-based platform. Convera, with its RetrievalWare 8.0 product, delivers quality subject recognition results using a platform based upon a semantic networks model. Although the approach is slightly different in each case, both allow the combination of sets of lexical elements that together model the way to recognize the occurrence of a complex topic in a text. Both, when implemented, are from the point of view of this paradigm a lexical knowledgebase whose semantic/syntactic objects can be combined.
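To show the shape of the idea - and only the shape - here is a toy sketch of lexical elements combined by operators into a model of a complex topic. The operators and the topic are invented; this is not Verity's or Convera's actual rule syntax.

```python
# A toy sketch of a rules-based topic: lexical elements combined by
# operators into a model of a complex topic. Operators and topic are
# invented for illustration.

def ANY(*words):
    return lambda text: any(w in text for w in words)

def ALL(*rules):
    return lambda text: all(rule(text) for rule in rules)

# Topic "bank mergers" = evidence of banking AND evidence of merging.
bank_mergers = ALL(
    ANY("bank", "banking", "lender"),
    ANY("merger", "acquisition", "takeover"),
)

text = "The regional lender announced a takeover bid on Monday."
print(bank_mergers(text.lower()))  # -> True
```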

We can also look at this transition as being from origination-focused to reuse-focused. The dictionary specifies usage parameters for "correctly" (culturally) applying words in content and a lexically-based categorization application allows reuse of content by enabling finding.

How Far Away is the Future-is-Semantic?

The emerging paradigm is that the Future-is-Semantic. It has been emerging for a while, but paradigms do emerge over time.

Where are we really positioned time-wise, from a global economic point of view, in the emerging Future-is-Semantic paradigm? Are we at the Dr Johnson 3 stage in dictionaries, or better, or worse? Dr Samuel Johnson is often considered (including his own self-promotion) to have written the "first" dictionary (of the English language). He didn't. But it was the first widely adopted and widely modeled one.

From my perspective, I see the global economy as being pre- the Dr Johnson stage in semantic knowledgebases. Not so pre-, but definitely pre-.

Language is a precise and utilitarian tool. We don't necessarily need to make it "better", more precise and more utilitarian. Nor could we, since the imposition of rules and regulations to govern language use is impossible at best. The communication function of language is basically OK. So the "push" side of language works. Where we do need to focus is in making finding, the "pull", better. We are going to have to become better at using language to find language.


How to Spend Project Dollars

It's not even that the global economy needs more precise and more utilitarian applications for finding subjects discussed in texts. We have these (and they are getting manifestly better all the time). What we have is fine. We just need to make them work to their full potential.

So, at this stage in the emerging paradigm how can you wisely spend your dollars allocated to taxonomy and subject recognition in your information access environment?

Really the question should be more along the lines of - what kinds of project activities do I need to explicitly design and incorporate into the implementation? Suppose you have chosen an application whose functionality fits in with what you want to deliver to users. What now? Well, now it's all lexical, not IT. Taxonomy/categorization software is largely purchased through IT departments and budgets, but the skillsets required to create lexical deliverables are … lexical skillsets. Understanding the lexical nature of the task to implement Verity, Convera and their peers is a mind-set issue. Once the application is integrated into the overall enterprise platform, we are immediately in a language paradigm transition, not a silicon wafer paradigm transition nor a software paradigm transition.

If we want a mantra to guide the scope of these project activities how about "Meanings and their Words"? This is precisely the mantra that indicates we are no longer in the dictionary paradigm, because the dictionary paradigm mantra is "Words and their Meanings". It is also the mantra that indicates we are no longer in the information technology paradigm, because the IT paradigm is all about APIs and repositories and network protocols - words are numbers, bits and bytes.

Knowing that you are in a language paradigm allows you to focus on, and make explicit, the lexical tasks that have to be carried out at the project level. Knowing that you are in a language paradigm means that you will have to allocate dollars, by either internal or external resource, to working with words, not just APIs. This means lexical expertise has to be incorporated into the project, whether it is in-house or consultancy-based.

Postludes

Some of the emerging normalization at the level of the meaning is unique and some aims to be standard. Intra-corporate is (largely) unique. For companies to interoperate within a domain, we require standardized taxonomies and standardized subject recognition.

Interestingly, we could assert that taxonomy is a tool that is necessarily required at this third shift in the paradigm, where we have reached the need for lexically-based subject recognition models. We need the tool of taxonomy to give us the ontological architecture that underlies our classifications of documents. Taxonomies built as faceted taxonomies are ideally suited to taxonomic indexing and allow for the post-coordination of complex subjects. The functional taxonomy elements of scope and node definition, together with the set of nodes and their relationships, gives the specification for the semantic knowledgebase we need to build for our subject recognition application.
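A small sketch of post-coordination in a faceted taxonomy; the facets and nodes are hypothetical:

```python
# A minimal sketch of post-coordination in a faceted taxonomy:
# simple nodes from independent facets are combined at retrieval
# time into a compound subject. Facets and nodes are invented.

FACETS = {
    "industry": {"banking", "catering", "logistics"},
    "activity": {"mergers", "recruitment", "compliance"},
    "region":   {"europe", "north america"},
}

def post_coordinate(**chosen) -> set:
    """Combine one node per facet into a compound subject."""
    for facet, node in chosen.items():
        assert node in FACETS[facet], f"{node!r} not in facet {facet!r}"
    return set(chosen.values())

# "Mergers in European banking" is never pre-built as a single node;
# it is coordinated from three simple nodes at query time.
print(post_coordinate(industry="banking",
                      activity="mergers",
                      region="europe"))
```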


Notes and References

1. The word "automatically" is often incredibly vague as to its actual meaning. The word "automatically" is often used as a term - i.e. given the kind of constrained meaning that occurs in peer groups working within a special subject area. But in fact, vendors and customers often use "automatically" in quite different ways. And this, in its turn, is part of the larger dynamic of fudging the line between genuine terminology and general vocabulary (a fudging which always fails). All content categorization applications require the expenditure of the resource of human time to configure them lexically and ontologically. What varies, from the customer's point of view, is where in the workflow this happens and exactly what tasks or activities it involves. We will go into "automatically", in the depth that it deserves, elsewhere and at a different time.

2. For both a complete and "proper" history of the dictionary as a functional tool see - The English Dictionary from Cawdrey to Johnson 1604-1755 / De Witt T. Starnes and Gertrude E. Noyes. John Benjamins Publishing Company, 1991.

3. For more on Dr Johnson and his dictionary, and more on him as a literary and lexicographical genius begin at the homepage of Jack Lynch, Assistant Professor at Rutgers University.


Perspectives on Paradigms

May 2003

A Brief History of Newspapers: All the News that's Fit to Find

The news industry 1, particularly that which sells business and industry news, entered a distinct paradigm shift a while back. It's been sitting in this paradigm shift for some time, and adoption of that which creates real and progressive information utility for readers, and so value for originators, is patchy.

As always and usual, it's all about those two core layers of information access - knowledge organization and subject recognition. And this particular paradigm shift that publishers are sitting through revolves around what part they want to play in making content easier to find. Knowledge organization and subject recognition are, or are going to be (depending on when you are looking from), two of the non-negotiable functionalities of all the text information handling products that together provide the platform for the knowledge economy.

The future is always a set of what ifs. But it's never too early to think about the possibilities, which lead to probabilities, which lead to business models or business decisions. What eventually happens in the news publishing space will impact every enterprise that employs knowledge workers whose job it is to re-engineer the facts of news content.

We can enter the paradigm by asking ourselves a simple question. Is a taxonomy of any monetary value to a business or professional news publisher? Maybe. It depends how you model the business opportunity. Monetizing it may not be direct.

The Current Business Problem

The business problem for knowledge workers (aka readers) always remains the same. First they have to find. The business problem, from the point of view of news publishers, is a bit more complex. Do news publishers want to make it (a bit of) their business to make individual items of content easier to find, or do they want to give that business away to KO/SR intermediaries?

What if …

What if value propositions change as paradigms shift?

If we strip out advertising, editorializing and opinion promotion, and entertainment, and focus only on news collection and publishing - the pure roots of the original communication tool - we return to news finding, selection and publishing.

Traditionally, advertising apart, all the money was made on selling the day's news once, although certain parameters of the news (its audience, which created its market and so its circulation, for instance) drove the rate card for advertisers. Then came, in addition, online databases, which were re-sellers and aggregators. And then a different generation of aggregators. So, there are plenty of channels for any news originator to sell its content multiple times.

But channel is only one part of the value proposition. What if other factors were going to be important? What kinds of factors might these be? Knowledge workers spend a lot of time seeking, and not necessarily finding (which is far worse than spending time and finding what one needs, though even that is costly). They also spend a lot of time reading what they don't need, or want, to read. This kind of reading is just another form of seeking, of sifting through. So knowledge worker behavior, and its largely unmeasured cost, will be a factor. We all know that information overload is a complaint, but what if it became a factor that drove content purchasing decisions?

What if … Taxonomies were Important?

Let's make up a scenario. Here are the actors 2 -

- A publisher which specializes in one, or more, special subjects
- Its set of corporate customers
- The domain of the publisher's customers (hence the domain of the publisher)
- A taxonomy of the domain

And here's what is happening in the business space. The publisher has competitors. The publisher has multiple channels of delivery to the customers - print, web, multiple flavors of aggregators. The customers are frustrated. But this frustration is generalized against information hassle in general and is not especially targeted at our publisher. There are a number of emergent taxonomy/categorization applications vendors. While knowledge organization and subject recognition have been with us since the great library at Alexandria, these vendors are literally whizzing through paradigm shift upon paradigm shift 3 of their own. All in all, everybody's a bit unhappy, if they are aware of their situation. And, then, there is the thing about interoperability. Interoperability is still a Great Problem to come. The thing about interoperability hangs in the answer to questions like this one: "Do we (whoever we are) want each and every domain to have more than one taxonomy, or not?" This is the publishing problem, or opportunity.

What can history tell us?

A Brief History of Text Information Tools

Throughout the history of the printed word as a carrier of information we see the emergence of new information-communicating tools as new information-finding needs become foreground enough for some entrepreneur or visionary to step in. This is not a new phenomenon. This has always been the case since the invention of the printing press ushered in the era of text becoming intrinsic to human culture. In no order of any timeline some of these are newspapers, dictionaries, thesauri, encyclopaedias, classification schemes, cataloguing schemes, catalogues, abstracts, indexes, citation indexes and so on. These are all tools. Specifically they are all communication tools. More are on the way today - markup languages, interchange formats, ontologies to support the semantic web, and so on.

As a set, built up over time and each addressing specific then-unmet requirements, their functional elements are all to do with semantics, ontology and finding, because the user problems are all to do with finding semantics and ontology.

A Brief History of Newspapers: All the News that's Fit to Find

The worth of business news to users has long since moved on from the traditional value proposition of finding stories, making a selection and working this into journalism.

News users' requirements of news are different now from what they were at the time of the first corantos of the early 1600s. In fact, news users' requirements are additional, not entirely new. It's often forgotten that with the information explosion the weight of ways of working with information has changed. A lot of information use by knowledge workers is collating and analyzing material that is not today's news, but historical. This way of knowledge working makes knowledge organization and subject recognition foreground. Without these two functions knowledge work is hard to do.

The question for business and professional news publishers is "what part of this business do I want?"

Fit to Find

So, to find and remain focused on a more contemporary value proposition, we need a new mantra. The business news model now is not just "all the news that's fit to print" but has to include "all the news that's fit to find". (The play on words is intentional.) News originators know this. But it's all the parameters that make the dynamic complex that interest those of us who look from the point of view of categorization and taxonomy.

What if … a Taxonomy could have a Monetizable Value?

If you are a vertically-focused business news publisher what does a taxonomy buy you? Well, if you give this technical community 4 a domain representation that works for them, and those you consider to be your competitive peers don't, this is worth something, possibly.

Let's return to our little scenario and see what might have happened since last we looked.

Well … the publisher made the business decision to contract an information access design consultancy to build a domain taxonomy. It was a faceted taxonomy, to enable post-coordination of elementary concepts to form complex topics. They then tagged their news (and several years of legacy news) against the taxonomy. They used a categorization vendor for this. And built a workflow around their digital asset management (DAM) system to categorize and tag every day's news. They then had the ability to design new information content products, based on classifications, and potentially personalized classifications, derived from the faceted taxonomy.

The line of reasoning behind the business decision went something like this.

Firstly, the publisher decided that one of the antidotes to information hassle, from their customers' point of view, is the availability of precision and recall. This "availability" doesn't just happen like business magic, of course. Particularly since precision and recall lie in the eyes of customers. But if you give a set of customers, who all work in the same general special subject area, their representation of their business world then you have given them precision and recall. All the publisher had to do was insist that their categorization vendor could recognize the subjects of the taxonomy, up to the level of whatever the acceptability criteria were. Which opens up another interesting vista for content originators to think about information overload; if you are not part of the solution, then … are you part of the problem? In a more distant future than the next couple of years, will the information vending model be to sell information with knowledge organization, or information without knowledge organization?

Secondly, the publisher realized that monetizing a product enhancement can be more than just charging for it, more than just being able to measure a new and differentiable cash flow attributable precisely to the enhancement. If the new taxonomy-based approach to product development took off, and they had every belief in this, then it would impact their competitors, particularly the two that they collectively lost the most sleep over. And so this competitive differentiation was how they would monetize the benefits of the taxonomy. Their customers would value the classifications of the content that they had to work with every day as knowledge workers. The customers could port the taxonomy into their news portal. The taxonomy becomes a lock to lock in customers.

So far it looks like win/win/win, aka win³. Everybody could be happy. Publisher, categorization vendor, knowledge worker customers. Everybody sticks to what they do best.

But there's more. More happiness all round in prospect. The happiness that interoperability brings.

Community or Domain?

We correctly separate the community, which is a set of users, from the domain, which is a "thing", the field that the users operate within. However, this separation should disappear at the taxonomy layer. It is taxonomy that brings both together, so that all user tasks that begin with ontological and semantic information are fulfilled.

Every technical community does have a shared view of its business world. Most likely, this representation of the business world is not (yet) in the form of a taxonomy. There are remarkably few "good" special subject taxonomies out there. Which fact is an opportunity for whoever thinks they can take it successfully.


The background to all of this is the notion that each technical community (or domain) only really requires one "good" representation scheme to model the domain. One gives interoperability, unambiguously. Two, or more, may, or may not, give true interoperability.

A taxonomy, such as the one our publisher above built and integrated into product development, tells us something about the business of the enterprises whose knowledge workers use it. Maybe it would be licensable to categorization vendors that port knowledge organization schemes formatted to their technology. In which business case it would become even more ubiquitous.

Text is Different from Data?

Of course it is. It is the differences that create the semantic and ontological information with many-to-many references. But it's instructive to look at what is happening in the world of data. Data ontologies are all going to be standardized and interoperable 5. A lot of business taxonomies are going to include many of these concepts that data elements quantify, because documents are written about these concepts, not just data crunched. So, if it's data our data structures are interoperable, but if it's text then our taxonomies and subject recognition objects are not? That seems hard to argue for, from this point of view.

Who Does What when Paradigms Change?

Paradigm changes are all about "what's going on?" They are also all about "who's going to do what?"

Vendors

A number of categorization vendors are successfully working the publisher space. This is an interesting model that is different from addressing vertical markets, i.e. domains. Publishers can publish in any domain.

Convera has accrued a long list of publisher customers. Convera is actively marketing "cartridges" to all its customers, each of which covers a different domain. A Convera cartridge is a bundle of a taxonomy and a semantic network to recognize the concepts in the taxonomy.

nStein is actively targeting the publisher space. In their nserver suite of products they bundle the IPTC news taxonomy. A publisher can publish in any domain, of course, but publishers' business needs are all identical: to make their content objects more findable and their information products more usable, to take part in the revolution against information overload by building the potential for precision and recall into their content objects, and to be able to develop new kinds of content products.


Postlude

News publishers currently have an opportunity to print all the news that's fit to find. Do they want to be in the taxonomy business, or not? Opportunities don't last forever, of course. Content categorization applications can do the finding, if some other party does the knowledge organization building. The enterprise will be able to look after itself, by purchasing the appropriate categorization application and then using (internal or bought-in) lexical and ontological skillsets to integrate publisher content with intranet environments.

Notes and References

1. In the continuing spirit of these PoP histories being histories with an ultra-short recollection span you can concurrently begin with a timeline of early milestones in the UK from the British Library.

... and Mitchell Stephens, Professor of Journalism and Mass Communication at New York University has posted a fine introductory encyclopaedia article.

2. This isn't based upon a strict use case reading of the scenario. Is the domain an actor? More likely not. But the domain gives rise to the creation, and design, of the taxonomy, and is part of the model of the dynamics.

3. What is a good collective noun for a set of paradigm shifts?

4. "Technical Community" is used in a technical sense. All communities of users that share a common business task or role are technical. See the semanthink Review Essay of Juan Sager's classic text.

5. As examples of what is happening in the world of data interoperability see some of the following -

- XBRL - eXtensible Business Reporting Language
- FpML - Financial products Markup Language
- MDDL - Market Data Definition Language


Review Essays

Project Management and Information Over-and-Under-Load

One of the persistent tasks of semanthink is "information overload" and how to solve it. Information overload is a symptom, and like many symptoms it doesn't show its underlying causes clearly.

In fact, the term "information overload" has probably run the course of its short but very useful life. Information overload is not "just" about "too much". True, there is too much, but what causes us and users problems with information overload is really a combination of two powerful and different processes; entropic 1 information "disorganization" and active information "misorganization".

We think that "information overload" for knowledge workers is really a symptom of the absence or inadequacy of information access design. From this point of view knowledge workers are at the literal mercy of either entropy or (non-optimized) design. Or both. In corporate intranet environments we solve information overload by working with the underlying structural causes. One of these structural causes is lack of effective metadata design. How does this in turn cause business problems? It is within the metadata element set that we want to place information about the concepts that documents discuss. Without those concepts being defined, named, recognized and associated with individual documents, through some application of subject recognition or categorization upstream, it is hard, downstream, to enable information access, at the level of the concept, to document users.
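As a rough sketch of the point, here is what a fragment of such a metadata element set might look like once subject recognition has run upstream. The element names and values below are invented for illustration and follow no particular metadata standard.

# A document's metadata record, with subject elements filled upstream
# by categorization against the taxonomy. Everything here is invented.
document_metadata = {
    "title": "Claims workflow review, Q3",
    "description": "Committee minutes on proposed claims-handling changes.",
    "document_type": "committee minutes",  # from a controlled vocabulary
    "audience": ["claims adjusters"],      # from an audience vocabulary
    "subjects": ["p1", "i1"],              # taxonomy node ids assigned by
                                           # the subject recognition layer
}

def discusses(metadata: dict, node_id: str) -> bool:
    # Downstream, access at the level of the concept becomes a simple
    # filter over the subject element rather than a raw-text hunt.
    return node_id in metadata.get("subjects", [])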

If information overload is bad for knowledge workers, it can be even worse for those tasked with solving information access problems for groups of users in enterprise intranets. There is too much in the world to read for information access project and program managers, just as there is for all of us. "Too much to read" is a very different problem from too much to sift through to find what you really need for the task at hand. Additionally, and more problematic, project and program managers working with taxonomy and content categorization projects suffer from information "underload" - too little of what might be helpful to them in project design. Hence the semanthink site.

Project Design Based upon Models and Methodologies

What are the basic dynamics of the business problem of "information overload"? How can we open this black box so we can find variable parameters that we can work with and optimally configure? One of the persistent thematic solutions of semanthink is the advocacy of models, methodologies and project design, all in synergistic combination. It is this synergistic combination that is too little written about. Which is only to be expected, because of our place in the larger business cycle regarding which business management problems are adopted as problems to solve. And while it is only to be expected, it is also time to remedy this state.


Why a Review Essay?

The books reviewed here are largely "old", i.e. they are not new releases into the information flux. For instance, "A Practical Course in Terminology Processing" 2 by Juan C. Sager was published in 1990. The question "why these reviews, then?" obviously needs answering. semanthink review essays aim to take some key but largely under-appreciated, or put aside and never returned to, (because of overload) books that have important things to say about models and methodologies that can be incorporated into information access project design. By choice, these won't be the best sellers.

There are a number of goals for these review essays. These goals are interlinked. One set of goals is to do with improving the overall economic contribution that taxonomy and content categorization applications make to corporate economic performance; i.e. let's implement the knowledge economy and create economic benefits. These goals are aimed to appeal at the level of the richness of ideas.

More importantly, another set is concerned with improving the performance of our information access deliverables; i.e. let's implement the knowledge economy and create efficiency and effectiveness benefits. These goals are aimed to appeal at the level of practical design and implementation.

Business Contexts and Project Design

Is a "review essay" different from a vanilla book review? It's more applicability-based. A review essay allows the writer to move outside the boundaries of the concepts of the book to apply what it has to teach to areas it was not originally conceived for, or that were not possible at the time. The books reviewed here all share one aspect in common, that is non-negotiable. They all contain at least one (or one part of a) model or methodology that we can usefully re-engineer and use to enrich our design of taxonomy and content categorization projects.

These review essays are written from the point of view of working within any enterprise intranet or newsfeed environment, but much of the modeling is equally applicable to e-commerce and e-learning environments, and information transaction environments in general.

The Economic Justification of Information Access Design

This is what Juan C. Sager says about terminology improving business performance -

"The primary function [of terminology] is the collection of terminological information which is undertaken in order to improve communications and its economic justification lies in this objective".[page 208]

Improved communication and an economic payback is a compelling juxtaposition that is worth working towards. Improved communication and an economic payback are what actually implementing the knowledge economy is all about.


The Current Emergent (and Paradigmatic) Phenomenon of Lists

Lists of things are everywhere. From a sales point of view, they can be a highly effective way to cross-sell. All credit to Amazon!

Lists are classification in the raw. Anyone who doubts that classifying aspects of the world is highly personal regarding what is utilitarian only has to either read "Women, Fire and Dangerous Things" 3, or visit Amazon. What makes a list? A theme. (What makes a good list is a different question entirely.)

So, posing a question along the lines of "what are the most important under-appreciated books that can enrich our models of information access design?" we start to hit against the parameters of what makes a good list. Size of the list is one parameter. Some numbers are just too small. "The ten most important jazz albums of all time"? Hmm. That would not work for a lot of people. Ten is too small a number and the span of jazz in years and genres is too vast. Which might indicate that time span covered is an important parameter.

And so it is that size and time span are two ways to turn a fine idea into a poor deliverable. There is one final parametric trick or tweak. The deliverable may well be a list of books, but the impetus is actually a list of ideas. semanthink review essays aim to take ideas that are vital to understand, re-engineer and integrate into information access design projects. Under-appreciating useful books is one thing. Under-appreciating useful ideas is quite another.

Once there is more than one review essay, we will certainly have a list. The final size of the list is not known. And as far as time span is concerned they will likely be "old", as we have mentioned (or at least non-current). And as far as genre is concerned they most likely will not be mainstream information architecture, usability etc. So, we can safely say that semanthink review essays are not ("rarely", may be safer) going to cover recent hit books. Current and mainstream presumably have a certain momentum of attention, and they won't really need more of the same from here.

Books or ideas? Now there is an interesting dynamic. Books are only wrappers, after all. Wrapping ideas in a book "review" is a way to review ideas, and bring certain ideas from the background to the foreground. Foreground/background is a key notion in the gestalt psychology understanding of perception and the need of the moment. From background to foreground is one of the key dynamics that we use in our work, whether it is project design, training or evangelizing. That which is important but overlooked or under-appreciated always needs to be brought to the foreground for attention and decision.

And lastly, synthesis and re-use. Knowledge management is all about synthesis and re-use. Under-used ideas that help us with our varied lexical and ontological work need to be brought to the foreground.

Happy and engaging reading!


Notes and References

1. There is little descriptive research on the natural entropy of non-organized information asset sets. (But we do know what it looks like when we experience it!) Research carried out by IBM, Alta Vista and Compaq, on the web as the subject (which is a hyper-linked environment, but otherwise non-organized), shows the web as "bow tie" shaped and essentially partitioned. The study can be accessed at the IBM Almaden Research Center.

It is our assumption at semanthink that this model will apply to most corporate intranets before a project program to optimize information access, and that the partitions would tend to be descriptively comparable.

(Incidentally, this type of study tends to disprove the "six degrees of separation" metaphor.)

2. A Practical Course in Terminology Processing by Juan C. Sager. 1990, John Benjamins B.V.

At time of writing you can buy through the John Benjamins site, which normally has stock to hand and is weeks quicker in fulfillment than Amazon.

3. Women, Fire, and Dangerous Things: what categories reveal about the mind by George Lakoff. 1987, University of Chicago Press.


Review Essay May 2003

A Practical Course in Terminology Processing by Juan C. Sager

Part 1: Why Terminology?

Essay Theme

This review essay is written primarily for project and program managers responsible for the scoping and design of taxonomy and content categorization projects for enterprise intranets. These tasks of scoping and design require models of different aspects of the communication that takes place in enterprise intranets. In this essay we extract three core models from Juan Sager's classic of terminology work. 1 We re-engineer each of these models and use them to understand a different aspect of the communication that needs to happen, between user and information environment, to ultimately connect users to the documents they really require.

In this essay we focus on terminology. Terms and words from general vocabulary occur in different functional tools, in different communication roles, in different layers of the total information access environment. The persistent question, from a project design point of view, is "how do they get there?" For instance, how are taxonomy nodes and concepts in controlled vocabularies named? Through what process, and using what methodologies? We do it by incorporating the methods, models and theories of "classic" terminology work into the information access design process. More precisely, we base our methodologies and project activity definitions upon a set of models, including the three highlighted here.

A Focus on Communication

One of the key functional communication elements in the information access environment, that connects users to documents, is the choice of term for each and every concept in the domain. What makes this key? Users use words to find words; they use words in information access schemes (such as taxonomies and classification schemes) to find words in documents that indicate or denote particular concepts. Or, looking the other way, the total information access environment uses words (in information access schemes) to communicate indications of concepts in documents to users. But, users are more than "just" users - they are also members of a community of technical peers. These communities create, and re-create, their own terminologies continually. Terms in the information access environment, in a taxonomy or classification for instance, are required to be their terms.

This is how it all began ...

Once (and still), upon a time, the user sits in front of the screen. The user knows the task at hand. There is a chance that one of the links currently showing on the screen will lead to a document that resolves this task. The user is focused on two or three links in particular. Actually, the user is focused on the words contained in these links. This small moment is a moment of decision; to click through, or not.


The Business Problem

Whenever the user genuinely does not know what the document is "behind" a link, then our user is effectively asking a question of the form: "What (kind of document/information) will I find if I click on this link? Will it help me?" It's a question that is asked hundreds of thousands of times each day in every corporate intranet in the world.

We, however, being information access designers, are focusing on the words in the links and asking different questions. "How did these words get there?" And, "are they working well?" We ask these questions because much, but not all, of the user's cognitive decision-making process is dependent upon the words in these links, and how they work.

There is a whole host of sub-questions to both those questions of ours. This first question is to do with the process of identifying and selecting words for this important communication role. That is, what methods (or heuristics), project design processes and theories (or ideas) were used to design the workflow that delivered those words? The second question is to do with the success (or otherwise) of these words as communication tools for particular communities of users with particular tasks to resolve. That is, are these words successfully mediating between user and topic, by both correctly identifying the concept and labeling it in a way most useful to the user? In this little scenario we will never know. But, this review essay is an introduction to how words that carry key communication functionalities should get into any information access layer. And also about how words selected as terminology can be made to "work".

"Words" and "Terms"

One of these sets is obviously a subset of the other, one way or the other. What is the difference between "a word" and "a term"?

Every community of interest is technical; baseball fans, americana music lovers, insurance salesforce personnel, research biologists, workplace anthropologists, investment bankers specializing in M&A advisory services are all technical communities and all technical cultures. Every technical community speaks in, and uses, a "special language". 2 A special language is a set of terms, each of which carries a precise and consensual definition, and which "substitutes" exactly for the concept in communication between technical peers. Terms are one of the differentiators between special languages and general language.

Terms differ from general vocabulary, even though both are "words". Terms are "technical". They declare concepts in technical communities. And they declare those concepts monosemously. In most technical communities terms carry only one meaning; they are not polysemous in the way that much natural language is. The meaning of any term has to be "defined" and the use of the term "standardized" if it is to be an item of terminology. This standardization is not necessarily formal, although in some scientific areas it is. It is defined and standardized within the technical community and culture that uses it to declare a concept. This cultural consensual definition and standardization of technical communication does not apply to general vocabulary.

 

Both words and terms are used in the layers of the total information environment that sit between users and documents. In some layers they are completely separate, or excluded, from each other. But this is rare, and situational. More often, there is not a complete separation between terms and general language. For instance, some thesauri are "just" terminology, but even here they have to reference the non-preferred (non-terminological) lexical elements. Usually, the functional element of the taxonomy that is the node label is terminology, but not always. Document titles and descriptions mix and match terminology and natural vocabulary. As do documents themselves.

And precisely here is the genesis of what makes "good" automatic subject recognition over "poor" subject recognition. We need to know both the special language and general language lexical variance to be able to combine them for accurate subject recognition. Similarly, here is the genesis of "good" communication or "poor" communication. If we mix and mismatch terms and general vocabulary in our presentations to users we can create communication breakdowns at the semantic and ontological levels.

Why an Understanding of Terminology is Important in Project Design and Development

Because of these characteristics of special languages, deciding that a lexical element is an item of terminology carries consequences in the way that it is managed, maintained and applied. An understanding of terminology enables us to effectively deploy terminology in applications. But, before that, before we can apply terminology, we have to collect it. Knowing what terminology does, or what it is "supposed" to do, clarifies what we need to do to collect it. We are going to get the understanding of what terminology is "supposed" to do by looking at three models that together model how terminology makes successful communication. These models are at the heart of Sager's book and will shape how we define our project activities. These are a -

- model of the relationship between concept, definition and its term
- model of special languages
- model of communication

To work at the project design level we do need to begin with models and methodologies. Models are based on theories and methodologies are based on the inputs we have to work with and the deliverables we want to produce. We derive our inputs from what our models tell us should be happening.

Without these models (or any models) project design in our kind of work sometimes carries a similarity to alchemy - in this case the alchemy of vocabulary into terminology - rather than applied communication science. This alchemy is a black box we want to open, so as to discover processes and hence methods. Building models is one way to open any black box.


Where is Terminology?

Terminology is everywhere in the intranet environment. Content managers and content publishers use terminology in many of the layers of the total information access environment. Here are some of the layers that connect users to documents (and documents to users) where we can apply terminology -

- Classification Schemes
- Document Titles and Descriptions
- Navigation Systems
- Subject Recognition Executables
- Taxonomies
- Thesauri and Controlled Vocabularies

Some of those deliverables that consume high-quality terminology are elements of the overall metadata scheme, such as taxonomy or ontology. Other elements of the metadata scheme that are controlled vocabularies will require terminology. For instance, if we carry out an audience characterization study we will need terminology to name the audience types. Or, if we want to type documents as kinds of wrappers of content (such as FAQ, glossary entry, analyst report, forms, committee minutes etc) we will need controlled terminology for this. Or, common workflow tasks that can be unambiguously associated with particular documents or document types will also need terminology design.

Thesauri and controlled vocabularies, used for search engine results optimization and other processes, need to be based on terminology principles. Here the idea is to collect, as exhaustively as is still cost effective, the lexical variance of the ways to denote the concept alongside the "technical" term and normalize against the technical term.
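A minimal sketch of that normalization principle, in Python, with invented variants and an invented preferred term:

# Collected lexical variance for one concept, normalized against the
# community's technical term. All entries are invented for illustration.
synonym_ring = {
    "m&a": "mergers and acquisitions",
    "takeovers": "mergers and acquisitions",
    "corporate acquisitions": "mergers and acquisitions",
}

def normalize(phrase: str) -> str:
    # Map a query or index phrase to its preferred term, if one exists.
    return synonym_ring.get(phrase.lower(), phrase)

print(normalize("M&A"))        # -> "mergers and acquisitions"
print(normalize("dividends"))  # -> "dividends" (no entry, unchanged)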

Our work in terminology, in part, strongly influences our promotion of best practice technical communication in the authoring of these two fundamental user-facing information architecture elements, document descriptions and titles. Titles and descriptions of documents that use appropriately chosen terminology items will communicate a certain clarity or precision, in support of user click-on decisions, compared to those that use general vocabulary alone.

Other uses of terminology occur in functional elements of the total information access environment that are outside the metadata scheme. For example, terminology is also a key aspect we take into account when naming navigation and classification labels so that they conform to the consensus of the community of users. Contextual links can also benefit from having terminology designed into them. Lexically-based and knowledgebase-based approaches to subject recognition in documents, whether these are rules-based or aim for a semantic network model, benefit from the assimilation of terminology principles.

Outside information access design work terminologies support interoperability through translation from one language to another. In the global economy customers of global companies (and the companies themselves) are caught in a dynamic of globalization/localization. Terminologies are used to bridge the kinds of gaps that occur because of this dynamic.

What is Terminology?

 

Let's now begin at the beginning and understand what terminology is, from Sager himself. There are a number of senses of the lexical element "terminology", and most of us will be familiar with this one -

"A terminology, i.e. a coherent, structured collection of terms, is a representation of an equally coherent, but possibly differently structured system of concepts."[page 114]

We would all recognize this definition - the noun meaning a collection of words that together form a particular way of describing a part of the world at large for a particular community that works within that world. However, there are two other senses not so commonly recognized or expected. This is how Sager defines these meanings.

"Terminology is the study of and the field of activity concerned with the collection, description, processing and presentation of terms, i.e. lexical items belonging to specialized areas of usage of one or more languages."[page 2]

When we separate out the three meanings of the word "terminology" this is what we get 3 -

- It is an activity: the set of methods and methodologies used to collect, describe and present terms. The activity is what terminologists carry out.
- It is a theory to explain the relationships between concepts and terms.
- It is a noun meaning a vocabulary of a special subject area.

These three meanings are all used in project work to create information access environments. As an activity we use it to help us set the tasks of project activity definitions. As a theory we use it to robustly design our projects at the project level. As the noun, it faces two ways. To project managers, a "terminology" is a deliverable. To a particular user community it is their own communication tool - the set (or sets) of terms that they use to name and relate its concepts.

Life Lived One Click at a Time

Links and their labels sit between users and documents. The user is in front and the document is behind. In "front" of what, and "behind" what? Terminology and general vocabulary. A cognitive decision by the user and a click are all it takes to move from the link to the document. Because it's a cognitive decision, we have to ask ourselves "what supports the user's decision to click through or not?"

Let's go back to the beginning ...

... The user sits in front of the screen. The user knows the task at hand. There is a chance that one of the links currently showing on the screen will lead to a document that resolves this task. The user is focused on two or three links in particular. Actually, the user is focused on the words contained in these links ...

 

This is a small moment in time (but repeated vast numbers of times in the corporate world each day). The decision of this moment is binary, no or yes. The user has expectations of the link. "Will it … or won't it?" We see the task of all text labels as making this user decision effortlessly binary. "Yes, it will!" Or, "No, it won't!"


As project managers or designers, we can now add a set of additional questions, requiring answers, to our earlier ones. "How do I define and collect terminology?" "What instances of general vocabulary do I need to collect?" "And why?" "Where can I mix and match terms and general vocabulary?" "And how will this impact the functionality of the tool where I do this?" "Where shouldn't I mix and match terms and general vocabulary?"

Life lived one click at a time is a highly existential, binary life fraught with stress because of lack of information about information. Robust concept-focused information access design schemes give decision support to the user at this very granular level. Every time. Users do have expectations. Users do have information tasks to resolve. The quality of terminology and vocabulary in the layer that presents information choices to users is a key driver of the effectiveness, or not, of the total information access environment.

Notes and References

1. A Practical Course in Terminology Processing by Juan C. Sager. 1990, John Benjamins B.V.

At time of writing you can buy through the John Benjamins site, which normally has stock to hand and is weeks quicker in fulfillment than Amazon.

2. See also English Special Languages by Juan C. Sager, David Dungworth and Peter F McDonald. Oscar Brandstetter Verlag.

3. When we talk about special languages and terminology we are ourselves using a special language and a set of terms. There are some other distinctions that are worth noting.

One particular area of confusion highlighted by the POINTER Project is that of the differences between these terms - terminology and lexicology, and terminography and lexicography.

"While lexicology is the study of words in general, terminology is the study of special-language words or terms associated with particular areas of specialist knowledge. Neither lexicology nor terminology is directly concerned with any application. Lexicography, however, is the process of making dictionaries, most commonly of general-language words, but occasionally of special-language words (i.e. terms). Most general-purpose dictionaries also contain a number of specialist terms, often embedded within entries together with general-language words. Terminography (or often misleadingly "terminology"), on the other hand, is concerned exclusively with compiling collections of the vocabulary of special languages. The outputs of this work may be known by a number of different names - often used inconsistently - including "terminology", "specialized vocabulary", "glossary", and so on."[section 1.2, page 31]

POINTER Final Report / POINTER (Proposals for an Operational Infrastructure for Terminology in Europe)

It is also available online at Citeseer, in a variety of formats including pdf.


Review Essay May 2003

A Practical Course in Terminology Processing by Juan C. Sager

Part 2: Models Let us Devise Methodologies

An Introduction to Models: Redefining the Business Problem

Models let us devise methodologies, methodologies enable us to properly scope project activities, and through projects we implement solutions.

Models of processes are highly useful. They define what elements, actors and parameters are essential to building the process, whatever the process is. They include all of these, and show how they inter-operate. This inter-operation requires communication. Communication requires a medium. The medium, in our case of organizing enterprise intranet content, is lexical. Understanding what the medium should contain lets us design activities to engineer precisely what needs to go exactly where. We take these kinds of models, and their parameters, and build methodologies and design project activities from them.

What Models will Help Us?

We are going to be looking at how to leverage "naturally occurring" terminologies to connect users to documents in information environments, particularly enterprise intranet environments. Since lexical elements are the intermediary between users and documents, we need a set of models to help us understand how words and terms operate as carriers of both semantic and ontological information.

We are going to walk through three models extracted from "A Practical Course in Terminology Processing". We are going to re-engineer these to our purposes of project design and project activity definition. These are a -

- model of the relationship between concept, definition and its term
- model of special languages
- model of communication

Why do we want to build these particular kinds of models? Here are some of the kinds of questions that need answers if we are to build effectively communicating information environments. And the projects to deliver them. Users and the total information environment interact. To what purposes? How? What do users "carry" within (a bit like programming) that governs their use of semantics and their classifications of the objects and events of their world? How do they use these in a responsive and interactive information environment? What do users share in common with their peers? How is this mediated? How does the information environment get its semantic and ontological programming? How do we ensure that these are mapped to the users' views? What causes breakdowns in communication of precision and recall between users and the environment? How does this happen? And so on.

 

These three models are only part of the overall set of models we use in designing information access systems. We never, or rarely, use these three models in isolation. For instance, there is a semanthink model called the OntoLexical Layers Model. It models the layers in the information access environment that carry conceptual information through using lexical elements as signs. It shows us "where" words and terms "are", performing different communication functions within different functional tools. We will need to bear this model in mind as part of the bigger overall context. This is described at length in the Models part of the semanthink site. Other models that we habitually use we will not even touch on here.

Model 1: A Model of Concepts, their Definitions and their Terms

The Business Requirement: Why Do We Want a Model like This?

This model is so simple that it is sometimes overlooked precisely because it is so simple. But its purpose is crucial.

The label for any concept is always cultural, because any and every community of users is a culture. This means that labels are both consensual and shared.

Any concept, its definition, and its possible labels must be de-coupled so that the term can be chosen to be most user-centric in any layer of the content access platform. Without having access to this kind of model we might not realize the importance of this de-coupling. How any concept is defined is different from how it is labeled, and what delivers these are different project activities.
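As a minimal sketch of what that de-coupling might look like as a data structure - the field names and layers below are hypothetical, not Sager's:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Concept:
    # The definition fixes the concept; labels are chosen separately,
    # per layer and per community of users.
    concept_id: str
    definition: str
    labels: Dict[str, str] = field(default_factory=dict)  # layer -> label

risk_transfer = Concept(
    "c42",
    "The shifting of the financial consequences of a risk from one party to another ...",
    labels={
        "taxonomy_node": "Risk transfer",  # the community's technical term
        "navigation": "Passing risk on",   # a friendlier natural-language label
    },
)

# Same concept, same definition, a different label per layer.
print(risk_transfer.labels["navigation"])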

The Model

"The primary objects of terminology, the terms, are perceived as symbols which represent concepts. Concepts must therefore be created and come to exist before terms can be formed to represent them. In fact, the naming of a concept may be considered the first step in the consolidation of a concept as a socially useful or usable entity."[page 22]

Sager is telling us that although we all know "terminology" to be about "words", in one way or another, terminology is actually concerned with compiling and labeling concepts. Concepts need names of course, if we are to communicate about them. But the name of the concept is not the concept itself, it only represents the concept to the user. Words are simply what we use to label concepts. This means that the terminology work we do is concept-oriented work.

"Through the activity of definition we fix the precise reference of a term to a concept, albeit by linguistic means only; at the same time it creates and thereby declares relationships to other concepts inside a knowledge structure … We expand the knowledge structure of a subject field by the addition of new concepts for which we have to create linguistic forms before they can be used in special subject discourse."[page 21] More precisely, terminology work is concerned with concepts, their names and their definitions. Terms are precise references to defined concepts, i.e. they "substitute" for

  63

Page 64: Essays: Taxonomy, Terminology and Juan Sager

 

defined concepts. We isolate concepts, collect and compile them, define their boundaries and represent them lexically, through the medium of terms. The term connects users to concepts, whether they are communicating or seeking, pushing information or pulling information. Additionally, Sager points out to us that no concept exists in isolation from others. The cognitive world is made up of domains - concepts in relationship sets. Concepts declare relationships to other concepts.

"A theory of terminology is usually considered as having three basic tasks: it has to account for sets of concepts as discrete entities of the knowledge structure; it has to account for sets of interrelated linguistic entities which are somehow associated with concepts grouped and structured according to cognitive principles; it has, lastly, to establish a link between concepts and terms, which is traditionally done by definitions."[page 21]

Taking the Model into Project Design

There are other ways to model this set of relationships. The semiotic triangle would model this as word, concept and reference. But this simple theory of reference works for us, especially in the highly lexical environment of an intranet.

Naming is one issue for information access projects. We need to know that much of what users look for is information discussing concepts. These concepts can be named in two ways. We can either use a technical term in a terminology, or we can use natural language labels. What counts is the way that users are naming particular concepts. We need to come to know this.

Defining is another key issue for both knowledge organization and subject recognition. The importance of definitions is often overlooked. After the process of making the definition explicit we have a deliverable, a definition. This deliverable is one of the four functional elements of a taxonomy. These kinds of definitions are linguistic descriptions of concepts. These kinds of definitions interface between users and content. Their business purpose, within our applications, is to support users in making decisions about which kinds or which items of content to retrieve, or where to go within a knowledge scheme to find that particular concept-predicated content.

Apart from taxonomies, we will use definitions of concepts in subject recognition applications. For instance, if our categorization platform is based upon a semantic network model, these definitions will specify the synonym sets we will intersect for the subject recognition to be accurate. Without definitions we cannot build accurate subject recognition models that categorize against individual taxonomy nodes, for without definitions what are we trying to recognize? Our goal in content categorization is to make sure that the concepts we categorize documents against are defined the same way that users are expecting taxonomy nodes to be defined.
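As a sketch of that intersection idea - with an invented node and invented synonym sets, and no claim about how any particular vendor's engine works:

# The definition of a node tells us which sets of evidence must all be
# present in a document before the node's subject is assigned.
subject_rules = {
    "marine_underwriting": [
        {"underwriting", "underwriter", "underwriters"},
        {"marine", "hull", "cargo"},
    ],
}

def recognize(doc_tokens: set, rules: dict) -> list:
    # Assign every subject for which each synonym set has at least one
    # hit in the document - the intersection of evidence.
    return [subject for subject, synonym_sets in rules.items()
            if all(s & doc_tokens for s in synonym_sets)]

tokens = {"the", "hull", "survey", "supports", "underwriting", "terms"}
print(recognize(tokens, subject_rules))  # ['marine_underwriting']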

For these two reasons of knowledge organization and subject recognition we have to make explicit these definitions for the application and the environment. It is also important to remember that these definitions belong to the users within a particular technical community, even if they have only been made explicit by our project work.


Model 2: A Model of Special Languages

Special languages are really part of Sager's larger model of communication. We've extracted the concept from there and wrapped it in its own model. This makes the overall set of models easier to both teach and apply (and think about).

It's hard to overestimate the importance of the concept of a special language. How do peers in any technical environment communicate? They communicate through the medium of a special language. The special language of any community of interest is both the lexical wrapper for the special concepts of that particular domain and the set of related definitions of these concepts along with their terms. Terms and their definitions are interchangeable in a special language. In the act or event of communication there is a dynamic interplay of actors and parameters. It is within this dynamic that special languages play their role. A special language is a tool, not an event.

The Model

We would say that all business communities of interest are cultural, in the widest sense of the word, and so technical. And we would constrain "cultural" to characterize the combination of shared subject of interest, way of conceptualizing it and way of communicating about it. Sager comes to the same conclusion, but says this in his own way. In short, the language used in any business, scientific or technical community is a special subject language.

This is how Sager defines a special language -

"Special languages have been defined as semi-autonomous, complex, semiotic systems based on and derived from general language; their effective use is restricted to people who have received a special education and who use these languages for communication with their professional peers and associates in the same or related fields of knowledge."[page 105]

All of these parameters characterize corporate intranet environments.

"The lexicon of a special subject language reflects the organizational characteristics of the discipline by tending to provide as many lexical units as there are concepts conventionally established in the subspace and by restricting the reference of each such lexical unit to a well-defined region. Besides containing a large number of items which are endowed with the property of special reference the lexicon of a special language also contains items of general reference which do not usually seem to be specific to any discipline or disciplines and whose referential properties are uniformly vague or generalized. The items which are characterized by special reference within a discipline are the 'terms' of that discipline, and collectively they form its 'terminology'; those which function in general reference over a variety of sublanguages are simply called 'words', and their totality the 'vocabulary'."[page 19]


Special languages differ from general languages in that they use terms in addition to words. Terms as linguistic expressions, and hence special languages, carry two core properties of communication in general - the property of definition and the property of shared special reference. These two communication properties, of course, carry over into information access environments.

"In special communication terms are considered substitute labels for definitions because only a full and precise definition is the proper linguistic representation of a concept."[page 109]

This property of "special reference" effectively means that terms are substitute definitions. Terms are obviously more efficient, in terms of lexical overhead, than the full definition required to capture any particular concept.
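To make the term/word distinction concrete, here is a minimal Python sketch; the class names, the example entries and the definition text are invented for illustration, not drawn from Sager.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Term:
    """A lexical unit with special reference: a substitute label for a full definition."""
    label: str
    definition: str   # the full and precise definition the term stands in for
    domain: str       # the special subject field the term belongs to

@dataclass
class Word:
    """A lexical unit of general reference; its meaning stays vague or generalized."""
    label: str

# The lexicon of a special language mixes both kinds of element.
lexicon: dict[str, Union[Term, Word]] = {
    "tranche": Term(
        label="tranche",
        definition="one of several related securities offered as part of the same "
                   "transaction, differentiated by risk, return and maturity",
        domain="structured finance",
    ),
    "discuss": Word(label="discuss"),
}

def expand(label: str) -> str:
    """Terms are substitute definitions: swap a term for the definition it labels."""
    entry = lexicon.get(label)
    return entry.definition if isinstance(entry, Term) else label

print(expand("tranche"))  # prints the full definition, not just the label
print(expand("discuss"))  # a general word passes through unchanged
```

The design point is simply that a term resolves to a full definition while a word does not.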

"Only if both interlocutors in a speech act know the special reference of a term, and, by implication, that they are using terms rather than words, can special communication succeed."[page 105]

"Special reference" also refers to the structure of concepts in any special subject, as opposed to general knowledge. Special subjects have a need for delineating the relationships between concepts more strictly, so with much less fuzziness or flexibility, than in general knowledge. But this delineation is one of difference in degree only. This is how Sager puts it.

"Within a subject field, some or all of the included dimensions may assume increased importance, with a greater need for distinction between a larger number of concepts along a given axis: at the same time, the necessity to avoid overlap between concepts, i.e. intersecting regions, will tend to reduce the degree of flexibility admissible in the delimitation of the bounds of any given concept. There is thus a difference of degree between the intradisciplinary structure of concepts in the bounded subspace of a special subject or discipline and the less well-defined, less "disciplined" structure of "general knowledge". This does not mean that general knowledge cannot contain well-defined facts; but only that disciplines have a greater need for more rigorous constraints on the overall delineation of items of knowledge. The referential function at the extremes of this distinction is classified as "special reference" and "general reference" respectively."[page 19]

Special reference ensures complete communication.

"In special communication terms and standardized terms make a critical contribution to achieving complete and effective communication."[page 105]


Taking the Model into Project Design

We need to understand that users in technical communities know two things about their special language. Firstly, they "know" when they are using their special language. This knowledge is intuitive, since the concept of "special language" is a concept belonging to terminologists and communication experts - i.e. users almost certainly don't know that others have labeled what they are doing "using a special language". (The term "special language" is part of our special language as information access designers.) Nor do we need enterprise knowledge workers to explicitly know (or even suspect) that they are constantly using special languages when they communicate with peers or customers. Or when they use an intranet. But those of us who design for knowledge workers do need to know this.

Secondly, users in technical communities also know how much ambiguity or polysemy exists in their special language (very little). But those of us who design for knowledge workers need to know this also. Why should knowledge workers experience so little ambiguity and imprecision when they communicate directly, but (much) more of both when there is an intermediate system?

Another way to approach this point is to think of it as a process of default values. Technical peers "default" to using particular lexical elements (terms) whenever they are in technical communication. This default takes precedence over natural language usage. This default is communication using special languages.

The Business Requirement: Why Do We Want a Model like This?

Models have to be synergistic with their peer models.

We needed to know what the actual tool is that peers in a community of interest use to communicate. Without knowing what the tool was, our project design options were both limited and unfocused. Now that we do know, we see that it is a language tool called a special language. This tool of a special language combines two different types of lexical elements, "terms" and "general vocabulary". The terms of the special language carry two particular functional properties. Terms are substitutes, or a shorthand, for conceptual definitions. Terms carry information about the conceptual or ontological structure of the particular domain, including "special reference", which is information about the precision of delimitation between "adjacent" concepts.

Now that we know the core ways in which this tool functions, it is not a large leap of intuition, or pragmatic project management sense, to see where we have come from and where we are headed, in working with lexical information.

The OntoLexical Layers model tells us where instances of any special language lexical elements can, or should, exist.

The model of term/concept/definition tells us that unless we have these three inter-related functional elements in perfect alignment for every concept that the user community requires, we are going to build unwanted outcomes into the environment. From an information retrieval point of view these will be issues with precision and recall. From a usability point of view these will be all kinds of user behaviors that usually indicate frustration, lostness and so on. From a business point of view there are going to be efficiency consequences. These will, as always, be hard to measure.

The model of special languages tells us the communication tool that our users use. This in turn tells us that the information access environment that we are designing should also use our users' special languages, as effectively as they do. This means that we will use terms where terms are required and general vocabulary when it is called for. And that we will know how to decide which situation is which. Every type of presentation that the user interacts with will be an opportunity to use a special language, or not - navigation schemes, browsable classifications and taxonomies, search results rendering and so on.

Model 3: A Model of Communication

The Business Requirement: Why Do We Want a Model like This?

Let's now look at Sager's model of the communication dynamic. It is idealistic, or "perfect", but that does not necessarily count against this model, or models in general. Its idealistic point of view is communication between scientific/technical peers. But that is fine, because every intranet environment uses its own enterprise-specific terminology. From our point of view, the intranets of insurance companies and investment banks are "as technical" as a scientific/technical environment. The model sets up the transaction of information in a context of purposes and motivations, using terminology as the medium. It is "formal", but science as a method is "formal". So is business, in the sense that there are constraints imposed in pursuit of profitability.

We need a model that tells us useful information about the "act" or event of communication. What kinds of information do we need to know about the event? We lack a statement of purpose: why do people want to "know" or "find", and why do people want to communicate or tell? Basically, what is the purpose of communication? We need a way to understand that users have states of knowledge. These knowledge states vary over time, and vary according to the information "taken in" and processed. If knowledge states vary between peers, then they certainly vary between any individual user and the total information access environment as one gestalt. If users are in states of knowing that they want to change, one way or the other (usually not to get more ignorant, but more informed), we need to know what parameters support that transformation and which inhibit or destroy it. So, we could usefully use a total model of communication that -

- explicitly puts a change of a personal knowledge state at the center
- defines common different ways that personal knowledge states can be impacted
- exposes the requirements of the information recipient (or seeker) that must be met for any information transaction to be effective
- explicitly points out that assumptions and pre-suppositions about information recipients exist, and so these should be dealt with

All of these elements are included in Sager's model of communication.


The Model: Communicator, Recipient and Knowledge States

This is how Sager describes his model.

"In a model of specialist communication we assume the existence of at least two specialists in the same discipline, who are jointly involved in a particular situation where the sender (speaker or writer) is motivated to transmit a linguistic message which concerns the topic of his choice and which he expects a recipient (reader or listener) to receive. We assume that the sender's motivation arises from a need or desire to affect in some way the current state of the knowledge of the recipient."[page 99]

So, we have actors: a sender and a recipient. We have motivations: a sender's motivation to inform, and a recipient's current non-optimal knowledge state about an aspect of the domain. Sager calls the sender's motivation the "intention … to transmit information which will have an effect on the current knowledge configuration of the intended recipient".[page 99]

Now we have the notion of a "knowledge configuration" or state of knowledge. This gives us three communication parameters so far: actors, intentions, states.

These knowledge states of the person who is the recipient can be changed in different ways. Here are some of the "effects" on the recipient's current knowledge configuration (there are others, but these ones serve well to give the flavor of it all). The sender's communication can -

- augment the recipient's knowledge
- confirm the recipient's knowledge
- modify the recipient's knowledge

We now have four parameters in play: actors, intentions, states, and ways the states can be changed. With the idealized parties in place, and the parameters in play, the communication event then happens. The "parts" of the model are now made to "work" by the participants.

"Basing himself on presuppositions about the recipient's knowledge and assumptions about his expectations, the sender decides his own intentions, and then selects items from his own store of knowledge, chooses a suitable language, encodes the items into a text and transmits the entire message toward the intended recipient … Perfect communication can be said to have occurred when the recipient's state of knowledge after the reception of the text corresponds exactly to the sender's intention in originating the message."[page 100]

The communication event has happened. But our model is not complete to our satisfaction. We still want to understand what makes the difference between "good" communication and "bad" communication. Sager tells us that: "The achievement of successful communication is fundamentally dependent on the three choices the sender has to make in formulating his message".[page 102]

These choices are -

- intention
- selection of knowledge
- choice of language

Intention can, at times, be clear from the type of document. Sager gives examples of some document types. A check-list is usually used to confirm knowledge. Instructions are intended to augment. User manuals can augment, confirm or modify.

More commonly, though, intention is expressed through the use of terms. Here senders have to be careful to align their choice of terms with what they intend the impact to be on the recipient's knowledge state. For instance, an undefined term will not easily serve to augment someone's knowledge state. We can sum all this up in Sager's words.

"The sender must choose an intention that is commensurate with the recipient's expectation. For the communication to succeed, the recipient must capture the sender's intention either from the message or from the situation, and his interpretation of the intention must be accurate."[page 102]

There are a number of important parameters around the sender's selection of knowledge. For instance, the sender must either have some kind of prior knowledge about a recipient’s current state of knowledge or make correct presuppositions. If the intent is to augment or modify the recipient's knowledge state, then, obviously, the sender needs to know more about the topic than the recipient does. Otherwise, no transfer is going to happen.

The use (or not) of accepted standardized terms determines whether the communication is economic, precise and appropriate. In the area of choice of language, the sender must choose an appropriate special subject language, or general language.

The Business Requirement: Why Do We Want a Model like This?

In technical communication there are many variations on the theme of modeling the dynamics between those who want to send messages and the intended recipients, and what kinds of outcomes, or end states, arise. We like the communication model used in the terminology processing work of Sager. It is simple, formal, and speaks to the heart of technical communication, namely the purpose of changing someone's state of knowledge of defined concepts. Information access systems exist to rebalance unbalanced user knowledge states. Knowledge workers operate in a constant flux of knowledge states. This flux repeats itself continually - users seek, find, use. We explicitly design these kinds of systems, at the semantic and ontological levels, to ensure this rebalancing occurs effectively. We use lexical elements as the intermediary.

Sager observes how terms, distinct from general vocabulary, operate in a model of communication. We, of course, don't want to forbid general vocabulary. We want to use terminology when possible, general vocabulary when we have to. Our focus is on designing around the pitfalls that general vocabulary usage creates in an environment such as the enterprise intranet, which from our point of view of conceptual communication is quintessentially a technical environment.

We make the model work for us at the granular level of the user's lexical choices at each of the intermediate decision-making points that are faced. Not only do we know that we use a tool called a special language (because our users do), but we now have enough information to apply the tools of special languages and terminologies. As much as possible we want to support our users in achieving the change of knowledge state that they require.

At the level of the link to the individual document we want, as much as is possible, that our users be able to tell if any document is going to augment, confirm or modify their current state of knowledge. Do you remember the little story of our user at the beginning, scanning links to documents? This augmentation, confirmation or modification is not just needed at the level of links to individual documents. When taxonomies and, more commonly, classification schemes are browsable, they guide users to sets of documents discussing the same subject. In this way users choose where in the domain space they want to augment, confirm or modify. Keeping up to date is just another, shorthand, way of augmenting, confirming or modifying one's knowledge state.

The model lets us think about enterprise intranets, or any information access environment, in a different way. Because the user seems to always be doing all the "work" of searching, navigating, browsing and so on, we can forget that we need to think about the environment doing much of the work. We can also reframe the environment as being "active". It is a sender or communicator. To be sure, it's only a sender when the user wants it to be (but maybe that's the best kind of sender!). When the environment is a sender it had better follow all the good practices that all communicators practice. This is the "how" of communicating. These good practices include differentiating between general language and special languages, and understanding that terms are the objects and a special language is the "wrapper", that intent counts, and that the assumptions the sender makes are in a mirrored relationship to the expectations that recipients have; both sets of assumptions and expectations have to be worked with. And all this is lexical and ontological work.

Lexical and ontological work is granular work. Graphic designers work with one pixel to the left or right, or one point or em here or there. We work with a word or the turn of a phrase. Or with how a concept relates to another in an ontological space: is it parent or child or sibling, are we sure the relationship type is "kind of", and is its label reified correctly? And then, can we collect the lexical variance to build a subject recognition object that recognizes with high precision and high recall? This granular work is reprised thousands of times in designing for information access.
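Here is a deliberately small Python sketch of what such a subject recognition object might look like; the concept, variant list and evidence threshold are invented for the example. The breadth of the variant list works for recall; requiring several distinct variants before asserting the subject works for precision.

```python
import re
from dataclasses import dataclass

@dataclass
class SubjectRecognizer:
    """A lexically-based recognition object for one taxonomy node."""
    concept: str
    variants: list[str]  # the collected lexical variance for the concept
    min_hits: int = 2    # distinct variants required before asserting the subject

    def recognizes(self, text: str) -> bool:
        hits = {v for v in self.variants
                if re.search(r"\b" + re.escape(v) + r"\b", text, re.IGNORECASE)}
        return len(hits) >= self.min_hits

node = SubjectRecognizer(
    concept="Collateralized Debt Obligations",
    variants=["collateralized debt obligation", "collateralised debt obligation",
              "CDO", "CDOs"],
)

doc = ("The bank repackaged the loans into a collateralized debt obligation; "
       "the CDO was then sold in tranches.")
print(node.recognizes(doc))  # True: two distinct variants found in the text
```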

While we use the OntoLexical Layers model for both strategic and project management purposes, we only really want the other three models, together, to work for us at the project design and activity definition levels. We use them to keep us focused "on message" and to design projects that deliver.

Which is all good. But the real purpose of these models, as all models, is to make us think before we practice.


Re-Purposing the Model: The User as Recipient

In Sager's model the communication is direct. In some ways his model is a sender-centric model. It focuses on the intent, choices and actions of the sender. In designing and re-designing intranets, we derive much insight from pivoting the model to be recipient-centric, because intranet users carry many qualities of Sager's recipients. We can focus on the users as information seekers - active recipients rather than active senders. But the model still helps point us towards what we should do as we work to provide information access to documents for users. Intranet users require all that Sager's recipients require for successful information transactions to take place.

The situational factor of the "inequality of knowledge" dynamic interests us. The sender is expected to have a greater knowledge of the subject than the recipient. The document collection holds the "answer"; the user knows that the "answer" is out there, to a greater or lesser degree of precision. But the document set is not the sender. All the presentation tools of navigation, coupled with the semantic and ontological back-end, together are the sender. And the navigation is the means to approach and retrieve it. The user wants augmentation, confirmation, modification etc.

Re-Purposing the Model: The Intranet as Sender

Another aspect of the communication dynamic interests us. In Sager's model the communication event is a one-to-one, direct, personal communication. But from the point of view of the individual intranet user the intranet communication process can appear as a many-to-one, indirect, impersonal process. And the "many" carries multiple connotations: many documents, a variety of navigation, multiple ways to search etc. The "lucky" recipient in Sager's model is given one communications deliverable, one document (or speech). This kind of communication was always push, until the internet/intranet era. Now users spend much time (and effort) in trying to "pull" information from out of the complexity. Users actively want, and require, to be communicated to. The information access system is the sender. It could usefully conform to all the constraints that Sager's senders conform to.

It is this indirectness of the intranet environment, coupled with the scale of publishing, which causes and compounds some of the problems of "hard-to-locate" content. There is certainly intention aplenty in an intranet environment. Authors, navigation designers, taxonomy engineers are all intending that best-fit documents be communicated to users. We try to integrate all these parameters of communication. We integrate by working with concept identification and concept naming. We build this intermediate layer to speak the language of the user. Regardless of the language of the author. And in doing so we work with the parameters that Sager's sender has to work with -

- intention (we know a great deal about our users and their requirements)
- knowledge (concept identification and definition is our key deliverable)
- language (terminologies and vocabularies, special languages and general languages are our tools)

 

We should always remember that we are designing information access layers to impact the user's current knowledge state, which the user herself wants to change: to augment, confirm or modify.

Terminology, and lexical normalization against terminology, clarify this whole process of communication. The information transaction breaks when there is a mismatch in either the sender's or recipient's lexical choices or knowledge. In an intranet, the mismatch is between the lexicon of the user and that of the system. Not only is the overall model similar, in ways that we can work with effectively in our information access design; its desired outcome is also identical with the outcomes that we would like to occur in intranet information transactions.
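A minimal sketch of lexical normalization against a terminology, with invented table entries: variant user phrasings are mapped onto the preferred terms that the system indexes by, repairing the user/system lexicon mismatch before matching takes place.

```python
# Map the lexical variance users actually produce onto preferred terms.
PREFERRED_TERM = {
    "cdo": "collateralized debt obligation",
    "cdos": "collateralized debt obligation",
    "collateralised debt obligation": "collateralized debt obligation",
    "holiday policy": "leave policy",
    "vacation policy": "leave policy",
}

def normalize(user_phrase: str) -> str:
    """Return the preferred term for a phrase, or the phrase itself if unknown."""
    key = user_phrase.strip().lower()
    return PREFERRED_TERM.get(key, key)

print(normalize("CDOs"))            # -> collateralized debt obligation
print(normalize("Holiday Policy"))  # -> leave policy
print(normalize("org chart"))       # -> org chart (no mapping; passes through)
```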

What Degrades Effective Communication?

Last, but definitely not least, we are rarely allowed to forget that successful transmission of the message can fail. The type of failure that concerns us as information access designers is the instance of incompatibility between sender's and receiver's lexical and knowledge structures. First, the chosen language must be appropriate.

"The sender must choose a language and sublanguage which he assumes the recipient to have command of; the recipient must be able to recognize and understand the linguistic forms chosen in order to analyze the message." [page 104]

Sager is clear how failure of communication can occur.

"The linguistic forms to be used in a message must therefore be planned at every level in such a way that the greatest degree of clarity can be achieved by means of systematic and transparent designations of concepts and by the avoidance of ambiguity at every level of expression." [page 104]

And again.

"Accurate transmission can be impeded by incompatibility of the sender's or recipient's lexical and knowledge structure … the analytical use of language, i.e. the application of inference rules in the process of comprehension, works only if designations are regularly patterned and if both interlocutors know the rules of designation. The linguistic forms to be used in a message must therefore be planned at every level in such a way that the greatest degree of clarity can be achieved by means of systematic and transparent designations of concepts and by the avoidance of ambiguity at every level of expression." [page 104]

Conclusion

Terminology is a "big" subject. And an exhaustive discussion of how to apply terminology know-how to enterprise intranets is too big a subject for one essay. Juan Sager's book is one of the fundamental texts of "classic" terminology work. Because of this, assimilating the ideas it contains, and re-purposing them, is a solid foundation for beginning any project to build the deep layers of information access design.

Lots of book reviews conclude with a recommendation that the book in question is "essential reading" for such and such an audience. This review concludes differently. "A Practical Course in Terminology Processing" really is essential reading for project and program managers who are going to have to implement lexical projects in information access environments. More than just essential reading, there are models here that can be re-engineered and applied to lexical projects. Their application will make the difference between success and still (after all those months of) not reaching that success. For those who need to come to terms with terms, this is the one.

It is also more than just a book to read. It is a resource that you can refer to again and again. Who knows how many expertly designed taxonomy and categorization projects are currently successfully underway? But this book is how to begin, how to create one of those, and how to stay successful to the end.


About semanthink.com

About Me

[This last section is the content item about my approach which was in the About directory of the site – it’s old, 2002, but accurately reflects how I described myself back then. It is included for completeness’ sake only. I describe myself a bit differently today on renniewalker.com ]

I'm Rennie Walker. I'm a consultant with both KAPS Group and KCurve. I specialize in information access design within two particular layers of the total knowledge management problems/solutions environment. I prefer to call these two issues "knowledge organization" and "subject recognition", and I use these expressions constantly. Knowledge organization is the design of conceptual maps, or representations, of domains. These knowledge organization schemes are usually formalized as taxonomies or ontologies. Subject recognition is the ability of applications (or people) to recognize the concepts that documents discuss. These concepts are specified by, and organized in, the taxonomy.

KAPS Group is a knowledge architecture consulting company. KCurve is a consulting group that designs information access solutions for internet and corporate intranet content. I work with the clients of these two companies to design and implement information access solutions. I also work with the taxonomy/categorization and search technology partners of KAPS Group and KCurve to implement, lexically and ontologically, their taxonomy, classification and categorization applications.

I am a Convera Certified Taxonomy Developer. I have taken the InXight Technical Partner SmartDiscovery program. I am both a psychologist and information scientist. I have a Masters in psychology from Edinburgh University, in Scotland, and following that I completed a post-graduate Diploma in Librarianship, from what is now a unit of the University of Wales.

About semanthink.com

The aim of semanthink.com is that it will contain all that I know and use in my client work, about

- what taxonomies are, what they actually "do", and how they do this
- the project design process for building taxonomies
- using a taxonomy as a specification for creating subject recognition executables (with a variety of categorization applications) so that we can associate documents with taxonomy nodes
- writing and modeling sets of lexically-based subject recognition executables for content categorization applications that can use them
- teaching all of the above

 

semanthink.com is my personal professional site. Here I take the space to write in depth about subjects that can be very "small". I personally see a requirement for this focus on the "small" sometimes with the clients I work with. Though it is my personal professional site, I am hugely indebted to the colleagues that I discuss ideas and client solutions with nearly every day. And obviously, to our clients as well, who talk us through their information access design issues and who give us professional inspiration.

The semanthink.com Audience

Semanthink.com is written for those tasked with making strategic decisions about enterprise content organization and categorization, and those tasked with implementing such solutions. The intent is that CIOs, content managers, IT managers, project and program managers, and technical groups involved in enterprise knowledge organization and subject recognition, will find something useful at some time in the semanthink.com site. semanthink.com is not meant to be hip, or new, or catalytic. It is designed as a straightforward resource for these people. "These people", you, may have a requirement to find the kinds of questions you need to ask of yourself and your applications and professional services vendors, so you can implement the knowledge economy. Or, if you already have the questions you may be looking for some answers.

Consolidation and Memory in an Information-Rich World

One of the issues that interests me as a psychologist is how we deal with (and can possibly deal with) this information-rich world that we are constantly adding to and building on. The psychological tasks that press to the foreground for me in my own everyday work are consolidation of personal knowledge and remembering what we do actually know. How can we continually consolidate what we know so that we become steadily "more" expert on our chosen domain? How can we remember the breadth and depth of what we "know" about our chosen domain?

With this in mind, the semanthink.com site is quite consciously designed around the theme of consolidation that the individual needs. In a very real sense, the semanthink.com site is part of my professional memory. And, because we live in a wired world, semanthink.com becomes a memory object that you can plug into, if you wish. Part of remembering and knowing what we know is organization of what we want to know. Which applies to every knowledge worker in the world. So, semanthink.com also focuses on consolidation of and around some of the key ideas critical to connecting users to the documents (and only the documents) that they really need in their moment of need. So many of us across the knowledge economy are working together on an antidote to information disorder and information misorder. This is intended to be part of the antidote. So the focus will remain always on the same small number of themes.

Themes

A number of themes inform the way that I work.

 

From psychology comes a focus on cognition. People communicate, categorize, create ontological structures and semantics, and build languages. These are all native human skills. Hence my focus on taxonomies as communication tools. And on subject recognition giving knowledge workers in special subject areas the semantics (and accuracy) that they themselves use every day in peer communications.

From information science comes a focus on what information is, and how to organize its origination and description to align it with human cognitive structures - and so create a process. In fact, I see information science as a discipline in service to human cognition.

From my time as an information management professional in the strategy consulting and investment banking communities comes a focus on the business problems of finding information. These years gave me an acute familiarity with the ease with which facts become lost in information disorder and information misorder. These years also gave me an acute sense of the role that lexical variance plays in communicating or trying to find information, particularly in English, the language I am most proficient in.

My years at Sageware Inc., a Mountain View, CA-based early thought leader in the content categorization space, reprised a great deal of what had gone on before. At Sageware we worked to meld a lexical approach to subject recognition with the logical functional "shell" of our application, together with a professional services practice that gathered knowledge organization requirements from our customers. My work since then, and today, is putting aspects of this three-part solution model together to create knowledge organization and subject recognition tools that work for special subject communities.

Project management design became a focus out of necessity, after a while. There's an old story. I heard it as an Irish joke, but here's a (slightly) Scottish version. I've also come across variants in collections of Sufi and Zen stories, which is probably where its true origin lies.

"A youngster is lost in the countryside, and is wandering hither and thither, trying to get to the destination agreed yesterday with friends. The youngster scrambles through yet one more hedgerow and sees an oldster sitting on a stile. 'Oh, I hope you can help me', says the youngster. 'I'm lost and want to get to Such-and-Such a place'. The oldster sits chewing and musing for a bit. Then turns to the youngster. 'Aye, well if Ah were you, laddie/lassie (you choose the interactive bit here) Ah wuldnie start frae here.'"

The semanthink.com Purpose

It's always good to know where you're going. It is equally important to know exactly where you begin from. I see project design, when the project aims to deliver knowledge organization or subject recognition deliverables, as beginning with a precise and total perspective. This precise and total perspective includes, particularly, models and methodologies, and knowledge we can take to design working models and methodologies. Otherwise we start from the wrong place. All the semanthink.com content aims to make some applicable contribution to this precise and total perspective that informs the design of these kinds of projects. Some of this content will be more obviously useful than the rest, but what is required always depends on the need and the moment.


The semanthink.com Rationale

My focus on information access design is with two particular layers of the total information access problems/solutions environment -

- The Knowledge Organization Layer
- The Subject Recognition Layer

I see the issues that arise from the lack of these two functioning layers as creating an information access bottleneck. There is only so far we can go in creating an information access environment without addressing and solving the issues created by the absence of these two layers. Until we do, we hit a glass ceiling, to do with precision and recall, and the absolute extremes of these where finding particular information is actually systemically impossible.
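For concreteness, these are the standard textbook measures (not Sager's formulation), sketched in Python:

```python
def precision(retrieved: set, relevant: set) -> float:
    """Of the documents the system returned, what fraction was on-subject?"""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    """Of the on-subject documents in the collection, what fraction was returned?"""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc3", "doc4", "doc5", "doc6", "doc7"}
print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))     # 0.4
```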

The functional elements of the knowledge organization layer are all metadata. Taxonomies, faceted taxonomies, ontologies and other knowledge organization metaphors are all ways to represent a domain to users who work within that domain. Other knowledge organization tools include vocabularies, terminologies and classification schemes. The individual functional components of each of these tools must meet the total set of user requirements. The set of user requirements can be both wide and complex, but will normally include such parameters as domain scope, formal conceptual relationships in the representation model, the use of terminology in labeling, etc.
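As one concrete illustration of such a functional component, here is a minimal Python sketch of a taxonomy node; the labels and relationship types are invented, and the point is only that the relationship type is explicit and the label is a preferred term:

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One concept in a knowledge organization scheme."""
    preferred_label: str                  # drawn from the community's terminology
    relation_to_parent: str = "kind_of"   # explicit relationship type, e.g. kind_of, part_of
    children: list["TaxonomyNode"] = field(default_factory=list)

    def add(self, child: "TaxonomyNode") -> "TaxonomyNode":
        self.children.append(child)
        return child

root = TaxonomyNode("Financial Instruments", relation_to_parent="root")
debt = root.add(TaxonomyNode("Debt Instruments"))
debt.add(TaxonomyNode("Collateralized Debt Obligations"))
```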

The functional elements of the subject recognition layer are essentially sets of lexical elements (often in combination) that categorization platforms, of one kind or another, use to identify the subjects that any document discusses. Through subject recognition we can associate documents with taxonomy nodes and with classification schemes.

semanthink.com tries quite consciously to take, and to give, a different perspective on taxonomies (and knowledge organization schemes in general). Similarly with optimizing subject recognition platforms. We also work hard to make any paradigms that seem to be at play in early stages of project design explicit (so that we can discuss them).

Why Focus on Project Design?

We absolutely require robust project design. We require information access project designs that tightly couple a highly analytical methodology with, particularly, project activity definitions and project risk mitigation. For instance, if we don't have an analysis of how taxonomies "work", then how can we adequately (or at all) define project activities to re-engineer data to make them work? If we don't have a roadmap for what we are doing, then risk lies around every corner (or in every email and meeting). Even in "simple" taxonomy development projects risk arises from different types of variables. Conceptual variance, arising from words (like "taxonomy") that carry different meanings for different participants, leads to risk of confusion. The "Black Box Effect", of not breaking problems down into the optimal elements to work with, leads to the risk of being unable to define exactly what to do (and to do it).

Implementing the Knowledge Economy: the Business Thesis

As we resolve the business problems associated with implementing integrated knowledge organization and subject recognition solutions, we will be in a position to effectively implement the Knowledge Economy. It's my deep perception that enterprise content that is not optimally organized and categorized, coupled with the scale of content creation and content purchasing within the enterprise, causes systemic business problems around the effective use and, equally paramount, re-use of unstructured information resources.

Perspective: The Metaphor of the Last Mile

Information access design, as a core business function, has its own "Last Mile" issue. Within telecoms/media the Last Mile verbal shorthand allowed us, at the time, to discuss and "issue-ize" the problems of getting data from the efficient backbone pipes into customers' homes - this distance being the Last Mile.

Within information access design the issue is analogous. We call our equivalent of pipes, "shells". So, search/indexing engines, categorization applications and portal and content/document management applications are all shells in that they provide the enterprise-wide infrastructure platform for many different kinds of information transactions, information workflow processes and information presentation outputs and metaphors, all based upon a robust features and functions set. But, they are purchased "empty". Empty of the content to immediately, off-the-shelf, organize and recognize the topics and concepts in your content that are required by your user communities. The information access Last Mile is robust knowledge organization and subject recognition functionality. And these predicated upon considered methodologies and project designs.

Perspective: The Motivation of the Last Mile

It is only when we can associate, with a high degree of precision, documents to concepts, or concepts to documents, that we then have the ability to connect information users to documents. This is our business motivation.

These tasks of knowledge organization and subject recognition, which a knowledgeable person can personally execute easily and precisely on small numbers of documents, become a major implementation effort when a platform of applications is required to execute the same kinds of knowledge organization and subject recognition tasks at enterprise scale. And "automatically".

 

The complication in all of this, of course, from the point of view of knowledge organization, is knowing what the subjects are, how to relate them and what to label them. And the corollary complication, from the point of view of subject recognition, is the syntactic and semantic nature of the languages in which we write our textual communications to each other. This is no small matter when we want to implement robust, precise and effective subject recognition capabilities.

semanthink.com, most categorically, takes a The-Future-is-Semantic perspective on what will lead to enterprise knowledge management success. Which is why I focus on what absolutely needs to be incorporated into our working lives (as information access designers) in the way of models of communication, cognitive categorization, terminology and special languages. We need to either program our semantics into our solutions, or implement solutions that implicitly leverage our semantics (accurately and usefully).

Competitive Advantage

My colleagues and I work with companies to creatively and powerfully solve business problems. We consider all of what we do to be in the service of competitive necessity, competitive differentiation or competitive advantage for those that we work for.

We take the view that, in the knowledge economy, managing companies for ongoing success will mean that management in general and management of knowledge will align more closely than they do at present, and at increasingly senior levels within the enterprise. We are aware of the distinction between vision and evangelism, but we like the thesis stated this simply.

We view ourselves as business problem solving experts. We just happen to specialize in knowledge organization and subject recognition. One of my favorite business mantras currently is Knowledge Organization is Power.

Easy to Publish, Hard to Find

In the Knowledge Economy's current state, it is both simple and easy to publish/send any document. But it is not nearly as simple, often, given the current status of information and its management in most enterprises, to find and retrieve any particular item of content, or any particular related set of content items.

Knowledge workers require and want to find documents for what the document contains - facts, ideas, theories, opinions, analyses, predictions, forecasts and so on. Documents are containers of these. Documents discuss concepts. And the code that these concepts are written in is text. So, knowledge workers who seek discussion of concepts need a means to retrieve the document, or set of documents, that discuss whatever the subject of their current interest and requirement is.

There is a manifest decoupling in the corporate knowledge economy between the two information tasks of publishing and finding. This decoupled supply and demand cycle is where the high-level systemic breakdown in the knowledge economy takes place. That publishing is easy, is good. That finding is hard, is bad.


Affirmation

This Breath

In this Moment

I dedicate to the Benefit of All Creation.

This Moment

In this day

I dedicate to the Benefit of All Creation.

This Day

In this Life

I dedicate to the Benefit of All Creation.

This Life

Of this Thread of Eternal Creation I Am

I dedicate to the Benefit of All Creation.

Acknowledging that Gratitude is my Sustenance,

Acknowledging that Giving is my Receiving,

Acknowledging that “my” Stillness

Is “my” Expansion,

Acknowledging that my Heart is the One Heart,

And that my “mind” is of the One Mind,

I dedicate “myself” to All Creation.

Om Shanti Shanti Shanti Om Om Shanti
