Content Categorization Tools: Taxonomies & Technologies for Infrastructure Solutions
Tom Reamy, Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
KAPS Group & Categorization Research
The Answer is Taxonomy. What is the Problem?
Machine Categorization
– Companies, Methods, Directions
The Place of Taxonomy in the Enterprise
– Taxonomy as an infrastructure activity
– Foundation for Content Management, Search, Portals, Smart Applications
KAPS Group
KAPS Background
– Knowledge Architecture Consultants
– Organize and contextualize content, communities, and tasks
– Professional Services partner to Categorization Companies
Categorization research
– Evaluated 20+ companies
– More companies, more new technologies
– The answer is categorization, not Google
The Answer is Taxonomy. What is the Problem?
Professionals spend more time looking for information than using it
Professionals spend up to 2 hours a day searching
Corporate Intranets Survey
– Can’t find anything
– Search Stinks - Can’t find good content
– No good content
The Answer is Taxonomy. What is the Problem?
Infoglut: More information is being generated every day in modern companies than our entire corpus from the Athenian golden age
Quantity of information overwhelms our ability to present and classify it.
Search is not enough.
– Humans search concepts, not strings
A Modest Proposal: A Solution to Infoglut
Bury all new content for 2,500 years
Lose most new content in a library fire
Unless you can convince a group of monks that your content is worth copying, it gets tossed
Dark Ages Solution: Stop writing for a thousand years
Infoglut: A Really Radical Solution
Hire librarians, editors, information architects to categorize your content
OR
Develop technologies that:
– support and enhance the ability of authors and editors to characterize content
– enhance the ability of users to find content
AND
Create a hybrid human/automatic solution
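One way to picture a hybrid human/automatic solution is a simple routing step: software proposes a category with a confidence score, and only the low-confidence suggestions go to an editor. The sketch below is illustrative only. The `Suggestion` type, the `route` function, and the 0.8 threshold are assumptions made for this example, not features of any product named in this deck.

```python
from dataclasses import dataclass

# Hypothetical threshold: suggestions scoring below it go to a human editor.
REVIEW_THRESHOLD = 0.8


@dataclass
class Suggestion:
    doc_id: str
    category: str
    confidence: float  # 0.0 to 1.0, as reported by some automatic categorizer


def route(suggestions, threshold=REVIEW_THRESHOLD):
    """Split machine suggestions into auto-accepted and human-review lists."""
    accepted, needs_review = [], []
    for s in suggestions:
        if s.confidence >= threshold:
            accepted.append(s)
        else:
            needs_review.append(s)
    return accepted, needs_review


# Made-up suggestions, standing in for the output of an automatic categorizer.
suggestions = [
    Suggestion("doc-1", "HR / Benefits", 0.95),
    Suggestion("doc-2", "Legal / Policy", 0.55),
    Suggestion("doc-3", "Products / Features", 0.80),
]
accepted, needs_review = route(suggestions)
print("auto-accepted:", [s.doc_id for s in accepted])
print("human review:", [s.doc_id for s in needs_review])
```

Raising the threshold sends more documents to people and fewer to automatic acceptance, which is the trade-off between quality and editor time.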
New Technologies: Categorization Explosion
Autonomy, Semio, Verity, Inxight, Topical Net, Mohomine, LingoMotors, H5 Technologies, YellowBrix, Entopia,
Bridgewell, MetaTagger, Applied Semantics, Sageware, SmartLogik, Inktomi/Quiver, Stratify, Vivisimo, Textology, Other - Tacit
04/19/23 Inxight Confidential
Auto-Categorization: Methods
– Semi-Automatic: Rules, If-Then
• Maximum precision & flexibility
– Catalog by Example: Bayesian, SVM, Neural
• Training Sets (5-500)
• Speed, Learning
– Statistical Clustering
• Set of Documents & Taxonomy Level
– Semantic Analysis & World Knowledge
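To make the first two methods concrete, here is a minimal sketch of a rule-based (if-then) categorizer next to a "catalog by example" categorizer built as a small multinomial naive Bayes model. It uses only the Python standard library. The rule terms, category names, and five-document training set are invented for illustration, and a real product would use far richer rules and training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Semi-automatic: hand-written if-then rules (hypothetical rule set).
RULES = {
    "Legal / Policy": {"policy", "compliance", "contract"},
    "HR / Benefits": {"vacation", "benefits", "payroll"},
}


def categorize_by_rules(text):
    """Return every category that has at least one rule term in the text."""
    words = set(tokenize(text))
    return sorted(cat for cat, terms in RULES.items() if words & terms)


# Catalog by example: a tiny multinomial naive Bayes classifier.
def train(examples):
    """examples: list of (text, category). Returns the counts the model needs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, cat in examples:
        doc_counts[cat] += 1
        word_counts[cat].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        score = math.log(doc_counts[cat] / total_docs)
        for w in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


# Invented training set, at the low end of the 5-500 range on the slide.
training_set = [
    ("vacation and benefits enrollment form", "HR / Benefits"),
    ("payroll schedule and benefits summary", "HR / Benefits"),
    ("travel policy and compliance rules", "Legal / Policy"),
    ("contract review policy", "Legal / Policy"),
    ("product feature comparison sheet", "Products"),
]
model = train(training_set)

print(categorize_by_rules("New vacation policy for contractors"))
print(classify("benefits enrollment deadline", model))
```

The rules give precise, explainable matches but need someone to write and maintain them; the trained model needs only labeled examples but can only be as good as its training set.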
Origins of Auto-Categorization
News Feeds and Content providers
• uniform content, size and structure
• professional writers
• Simple or standard vocabulary
Corporate intranet
• Wildly varied content
• Mix of good, bad, and ugly writers
• Tower of Babel: Acronyms, special meanings
New Technologies: The Human Element
Automatic Categorization is Not
Humans are better, but not as consistent
– Bring outside contexts to the document
• Purpose, similar documents, common sense
– Understandable mistakes
Computers are faster and cheaper
– Faster yes, Cheaper?
– Cost of poorer quality categorization
• Intranet: 20,000 users taking 60 seconds longer = $20,000 a week
The Best Answer is Hybrid or Cyborg Categorization
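The intranet cost figure above can be checked with a few lines of arithmetic. The slide gives only the 20,000 users, the 60 seconds, and the $20,000 result. The once-per-week frequency and the $60 per hour labor cost in this sketch are assumptions, chosen because they are consistent with that result.

```python
def weekly_search_cost(users, extra_seconds_per_user, hourly_cost):
    """Dollar cost per week of the extra time users spend searching."""
    extra_hours = users * extra_seconds_per_user / 3600
    return extra_hours * hourly_cost


# 20,000 users and 60 seconds come from the slide. The $60/hour cost and the
# assumption that each user loses the 60 seconds once per week are NOT in the
# source; they are picked to be consistent with the $20,000 figure.
cost = weekly_search_cost(users=20_000, extra_seconds_per_user=60, hourly_cost=60.0)
print(round(cost))  # prints 20000
```

If users lose 60 seconds every working day rather than once a week, the same assumptions give five times the weekly cost.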
Summary
No clear leader in categorization
No one has it all.
Immature industry and pent up demand
No out of the box solutions: Support Distributed Hybrid
Look for
• Advanced Algorithms
• Clustering, Auto-Summarization, noun phrase extraction
• World Knowledge, import public & custom taxonomies
• Integration – rules, metadata, components & product
• CM, Search, Portals, Expertise, Collaboration, Applications
Location of Taxonomy in the Enterprise: An Infrastructure Activity
Technology
• $Millions and 1,000’s of people
Organizational
• Recognized Value
• fundamental to business activity
Intellectual
• A couple of librarians
• No budget
• First to be laid off
3 Infrastructures: Technological, Organizational, Intellectual
Creating an Intellectual Infrastructure
Knowledge Audit / Knowledge Map
Knowledge Creating
– Innovation, Content Management, E-learning
Knowledge Sharing / Transmission
– Collaboration, Retrieval - content, experts
Knowledge Using
– Smart Applications, CRM, Portals
Knowledge Architecture People
Content Management and Taxonomy
Taxonomic Publishing Model
– Publish by Category, not web site
– Web Site the wrong unit of organization
Distributed Work Flow
• Collaborative Categorization and keywords by Subject Matter Experts, aided by software
Content Re-Organization
– Rich Web of Related Content
• Basic information + contexts
Content Re-Organization: Next Steps
– Document can be wrong unit of organization
Taxonomy and Search
Knowledge Retrieval: Information + Contexts
Information Retrieval: ProductName
– List of Documents, ranked by frequency of keyword
Knowledge Retrieval: ProductName
– Personal & Community & Historical Filters
– List of Documents – about product
– Categorized list:
• Features of Product
• Comparisons of Products
• Legal / Policy documents
• Activities associated with product
– Background Resources
• Glossaries, Communities
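The contrast between the two kinds of retrieval can be sketched in a few lines: the same keyword hits, returned once as a flat list ranked by keyword frequency and once grouped under taxonomy categories. The mini-corpus, the "widget" product name, and both function names are made up for illustration. Personal, community, and historical filters are not modeled here.

```python
from collections import defaultdict

# Hypothetical mini-corpus; each document carries one taxonomy category.
DOCS = [
    {"title": "Widget spec sheet", "category": "Features of Product",
     "text": "widget widget size weight widget"},
    {"title": "Widget vs Gadget", "category": "Comparisons of Products",
     "text": "widget gadget comparison"},
    {"title": "Widget warranty terms", "category": "Legal / Policy documents",
     "text": "widget warranty widget"},
    {"title": "Cafeteria menu", "category": "Facilities",
     "text": "lunch soup"},
]


def information_retrieval(keyword, docs):
    """Flat list of titles, ranked by how often the keyword occurs."""
    key = keyword.lower()
    scored = [(d["text"].lower().split().count(key), d["title"]) for d in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [title for count, title in ranked if count > 0]


def knowledge_retrieval(keyword, docs):
    """The same hits, grouped under their taxonomy categories."""
    hits = set(information_retrieval(keyword, docs))
    grouped = defaultdict(list)
    for d in docs:
        if d["title"] in hits:
            grouped[d["category"]].append(d["title"])
    return dict(grouped)


print(information_retrieval("widget", DOCS))
for category, titles in knowledge_retrieval("widget", DOCS).items():
    print(category, "->", titles)
```

The grouped view only works if each document already carries a category, which is where the categorization effort described earlier pays off.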