Looking beyond plain text for document representation in the enterprise
DESCRIPTION
In many real-life scenarios, searching for information is not the user's end goal. In this presentation I look into the specific example of corporate strategy and business development in a university setting. In today's academic institutions, strategic questions concern the dependency on funding instruments, the public-private partnerships that exist (and those that should be extended!), and the match between the topic areas addressed by the research staff and those deemed important by policy makers. The professional search tasks that arise in this domain are usually addressed by business intelligence (BI) tools, not by search engines. However, professionals are known to be busy people inspired by their own research interests, and not particularly fond of keeping the customer relationship management (CRM) or knowledge management systems up to date for the organisation's strategic interest. The result is incomplete and inaccurate data. Instead of requiring research staff (or their administrative support) to provide this management information, I will illustrate by example how the desired information usually already exists in the documents inherent to the academic work process. Information retrieval could thus play an important role in the computer systems that support the business analytics involved, and could significantly improve the coverage of entities of interest - that is, reduce the effort involved in achieving good recall in business analytics. The ranking functionality over the enterprise's (textual) content should, however, not be an isolated component. Our example setting integrates the information derived from research proposals, research publications and the financial systems, providing an excellent motivation for a more unified approach to structured and unstructured data.
TRANSCRIPT
![Page 1: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/1.jpg)
May 31st, 2013 First SICSA MMI Information Retrieval Workshop
Looking beyond plain text for document representation in
the enterprise
Arjen P. de Vries
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
![Page 2: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/2.jpg)
Outline
- Motivation
- Mixed structured and unstructured sources
- Search by strategy
- Equip
- Open ends
![Page 3: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/3.jpg)
Enterprise Information Needs
Hang Li et al. A new approach to intranet search based on information extraction. CIKM’05
![Page 4: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/4.jpg)
Strategic and business development needs
- What funding schemes are the primary source of income?
  - E.g., can we move to Europe when Dutch funding dries up?
- Who has active relations with partner X?
  - “Valorisation”; new national funding requirements
- What industry sectors do we depend upon?
  - E.g., how many projects in smart cities? Green energy? Cloud computing? Etc.
- How are strategic decisions implemented?
  - E.g., has objective “move from Telecom toward ICT” been achieved, and how does it develop over time?
![Page 5: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/5.jpg)
A week in the life
![Page 6: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/6.jpg)
Date: Wed, 15 May 2013 15:14:49 +0200
From: Theme Coordinator “INFORMATION”
To: Group Leaders Information Theme
Subject: List of company relations for internal CWI distribution
Dear Information Theme Group Leaders, The theme coordinators have been asked whether they "could make a short list of the company contacts, indicating the nature of each contact".
Could you send me the names of Dutch companies you are currently working with or have worked with in the recent past by the end of Friday 17th May?
The Theme Coordinator
![Page 7: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/7.jpg)
Date: Fri, 24 May 2013 11:33:04 +0200
From: Theme Coordinator Life Sciences
To: Group Leaders Life Sciences Team
Subject: Life Sciences: contacts with NL companies?
Dear all,
The CWI themes are currently collecting all contacts we have with Dutch industry and companies (but also hospitals, TNO, etc.) in order to get an overview. I am doing this for the theme "Life Sciences". Can you please send me a list of your contacts with a short description?
Life Sciences Theme Coordinator
![Page 8: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/8.jpg)
From: Project Leader Project X
Date: Sun, 26 May 2013 17:34:15 +0200
To: Project X
Subject: [Project X: 33] @WP leaders
X-BeenThere: Project X @ Y.org
Dear WP leaders,
I received the following request from Programme Management:
> May I ask you to send me a list of the EU research and the international research currently running at the partners that is related to Project X (international embedding).
This is my most urgent item. Could you send me, as soon as possible, a list covering the following points:
- running EU projects in which people from your WP are involved; please indicate who the partners are, the funding source, whether it is a STREP (or NoE or ...), and whether your WP provides a participant or the coordinator;
- EU projects that have been applied for, with the same details;
- any other international collaborations that are not covered by a formal project.
Please send me the lists as soon as possible, but no later than Tuesday 18:00. Thanks for your help. The Project Leader
![Page 9: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/9.jpg)
Surely, academia is not like…
![Page 10: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/10.jpg)
The High Cost of Not Finding Info
- If you employ 1000 knowledge workers:
  - 50% of content unindexed: $2.5 million/year
  - 6.25% of effort is spent reproducing information that already exists: $5 million/year
- Knowledge workers spend 15-25% of their time on non-productive information-related activities

Feldman and Sherman. IDC Technical Report #29127, 2003

Butler Group Report: Enterprise Search and Retrieval. Oct-2006: “many organisations are frittering away up to 10% of their staff costs on wasted effort because employees simply can’t find the right information to do their jobs.”
![Page 11: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/11.jpg)
So… “the real world”
“Real” companies (as opposed to academic institutions) attempt to address these information needs a priori, by setting up a Customer Relationship Management system (CRM)
Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view of the customer", Communications of the ACM 46(4) (2003): 95-99
![Page 12: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/12.jpg)
![Page 13: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/13.jpg)
However…
So-called “Professionals” are well known to focus on their own expertise
They do not have (or take) the time to maintain adequate descriptions of their network, skills, projects etc., nor to deal with most other types of “management overhead”
![Page 14: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/14.jpg)
We only need to organize ourselves!!
![Page 15: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/15.jpg)
Funding Proposals
- Proposals submitted (are supposed to) pass by the faculty’s (TUD) “contract managers” or the institute’s (CWI) “project bureau”
  - E.g., checks for liability, IPR and valid budget
- Proposal and (partial) metadata are added to a content management system (CMS)
  - The CMS used at my faculty at TUD is DECOS; a few other faculties plan to use Microsoft SharePoint; CWI deploys BSCW
![Page 16: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/16.jpg)
![Page 17: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/17.jpg)
Step 1
Index all the proposals submitted with your favourite IR system
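
As an illustration of what "index all the proposals" amounts to, here is a minimal sketch of an inverted index with a simple TF-IDF ranking. The proposal texts and identifiers are made up for the example, and the scoring is a deliberately basic stand-in for whatever IR system is actually used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a {doc_id: term frequency} postings dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return doc ids ranked by a simple TF-IDF score, best first."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


# Hypothetical proposal texts, invented for illustration only.
proposals = {
    "P1": "Smart cities sensor networks for green energy",
    "P2": "Cloud computing infrastructure for big data",
    "P3": "Green energy forecasting with cloud models",
}
index = build_index(proposals)
print(search(index, len(proposals), "cloud computing"))
```

Note that "cloud" matches both the cloud computing proposal and the environmental one, which is exactly the kind of ambiguity that ranking has to cope with.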
![Page 18: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/18.jpg)
Incompleteness
- The DECOS metadata entered is usually incomplete from the start
  - For many projects, for example, only the coordinator is entered as partner
- Also, a proposal’s metadata does not reflect subsequent change; e.g., as in PuppyIR:
  - People hired after funding secured
  - Partner change when key person moved job
  - Teams evolved
  - Priorities shifted
  - New tasks introduced and tasks (re-)assigned
  - …
![Page 19: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/19.jpg)
Incompleteness
In general: A project’s proposal or even the contract seldom represents the project’s exact future
![Page 20: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/20.jpg)
Inaccuracy
- Key information necessary for strategy & business development scenarios is missing
- Adding it is error-prone
  - Infer domain (big data, green energy, cloud computing, …) from keywords or content
  - Extract names automatically
  - Copy amounts manually; inconsistencies in tables in proposal text are not uncommon
![Page 21: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/21.jpg)
Incomplete & inaccurate Data
- Ambiguity
  - When describing domain, e.g., cloud computing vs. clouds in environmental models
- Names of people and companies involved
  - Typos & OCR mistakes
  - Entity resolution
- Amounts of funding per partner, own contribution
  - Funding request may not equal funding received
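
To make the entity resolution problem concrete, the following sketch matches noisy name mentions (typos, OCR errors) against a list of known entities using string similarity from Python's standard `difflib` module. The list of legal suffixes, the threshold of 0.85 and the misspelled mentions are assumptions chosen for the example, not part of the system described in the talk.

```python
import difflib
import re

# Assumed list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"bv", "nv", "ltd", "inc", "gmbh"}


def normalise(name):
    """Lowercase, drop punctuation and strip common legal suffixes."""
    name = name.lower().replace(".", "")
    name = re.sub(r"[^a-z0-9]+", " ", name)
    return " ".join(t for t in name.split() if t not in LEGAL_SUFFIXES)


def resolve(mention, known_entities, threshold=0.85):
    """Return the most similar known entity, or None below the threshold."""
    target = normalise(mention)
    best, best_score = None, 0.0
    for entity in known_entities:
        score = difflib.SequenceMatcher(None, target, normalise(entity)).ratio()
        if score > best_score:
            best, best_score = entity, score
    return best if best_score >= threshold else None


known = [
    "Spinque B.V.",
    "Centrum Wiskunde & Informatica",
    "Delft University of Technology",
]
print(resolve("Spinqe BV", known))                       # typo in the name
print(resolve("Delft Univcrsity of Technology", known))  # OCR-style error
print(resolve("Unknown Company", known))                 # no good match
```

A threshold keeps unrelated names apart, at the cost of missing heavily garbled mentions; in practice an information professional would still review borderline matches.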
![Page 22: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/22.jpg)
The real world to the rescue (1)
Not much work gets done without payments…
![Page 23: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/23.jpg)
ERP
- All large organisations deploy Enterprise Resource Planning (ERP) systems
  - Typical modules include accounting, human resources, manufacturing, and logistics
  - ERP integrates the modules, data storing/retrieving processes, and management and analysis functionalities
- Baan, Oracle, PeopleSoft, SAP, …
![Page 24: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/24.jpg)
More complete and more accurate data from ERP
- Financial details of each project as executed
- Project leader
- People who are reimbursed from the project
- Exact duration of project activities
- ...
![Page 25: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/25.jpg)
Step 2
- Index all the ERP data with your favourite DB + IR system
- Link the ERP project identifiers to the CMS proposal identifiers
  - Surprisingly, an n:m relationship…
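
Because the relationship between ERP projects and CMS proposals is n:m, a single foreign key on either side is not enough; a separate link table is needed. The sketch below indexes such a link table in both directions. All identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical link table: one proposal may back several ERP projects,
# and one ERP project may be backed by several proposals.
links = [
    ("ERP-100", "CMS-7"),
    ("ERP-101", "CMS-7"),
    ("ERP-101", "CMS-9"),
]


def build_link_maps(pairs):
    """Index the n:m link table in both directions."""
    erp_to_cms = defaultdict(set)
    cms_to_erp = defaultdict(set)
    for erp_id, cms_id in pairs:
        erp_to_cms[erp_id].add(cms_id)
        cms_to_erp[cms_id].add(erp_id)
    return erp_to_cms, cms_to_erp


erp_to_cms, cms_to_erp = build_link_maps(links)
print(sorted(cms_to_erp["CMS-7"]))    # ERP projects behind one proposal
print(sorted(erp_to_cms["ERP-101"]))  # proposals behind one ERP project
```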
![Page 26: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/26.jpg)
The real world to the rescue (2)
![Page 27: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/27.jpg)
Institutional Repository
- Publication metadata helps validate (and may even extend) the existing management info required:
  - Authors
  - Author affiliations
  - Projects and funding schemes (from acknowledgements)?
- Again incomplete data though…
  - My faculty in particular is notoriously bad at maintaining its part of the institutional repository
![Page 28: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/28.jpg)
Step 3
Crawl the Institutional Repository using the Open Archives Initiative (OAI) harvesting protocol
Index all the publications data with your favourite DB + IR system
Relate projects to publications by author name, similar title, etc.
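
The harvesting step can be sketched as follows, assuming the repository exposes Dublin Core records (`metadataPrefix=oai_dc`) over OAI-PMH. The base URL and the sample response are invented for the example; the code only builds request URLs and parses a response, so it runs without network access. A real harvester would fetch each URL and repeat until no resumption token is returned.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # A resumption token is an exclusive argument in OAI-PMH.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract (title, creators) pairs and the resumption token, if any."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        creators = [c.text for c in record.iterfind(".//dc:creator", NS)]
        records.append((title, creators))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, token or None


# Hypothetical response, shortened to the elements used above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example publication</dc:title>
          <dc:creator>Doe, J.</dc:creator>
          <dc:creator>Roe, R.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

BASE_URL = "https://repository.example.org/oai"  # placeholder endpoint
records, token = parse_records(SAMPLE_RESPONSE)
print(records)
print(list_records_url(BASE_URL, token))
```

The extracted author names and titles are the fields that can then be matched against project data.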
![Page 29: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/29.jpg)
Result: Unified Access
Proposals from an XML dump of the CMS
Actual project administration from CSVs extracted from ERP
Publications crawled using OAI, from the IRP
![Page 30: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/30.jpg)
Schema
![Page 31: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/31.jpg)
Heterogeneous content!
- BAAN-project (ERP)
- Decos-project (CMS)
- Decos-document (CMS attachments)
- Publication (Institutional Repository)
- Publication-document (Institutional Repository PDFs)
- Person (address lists, ERP + CMS mentions)
- Company (CMS + ERP + document mentions)
- Subsidy (CMS)
- Department (address lists, CMS)
- Web addresses (extracted from documents)
- Topic (assigned to publications)
- Research programme (dependent on funding scheme)
![Page 32: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/32.jpg)
Schema V2
![Page 33: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/33.jpg)
How to search that graph???!
Rank (un-/semi-)structured data to deal with incompleteness & inaccuracies
Structured data representation for attributes including project revenue, people’s names, starting dates, etc.
Use cases varying from “expert search” to “data cleaning” and “visual analytics”
![Page 34: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/34.jpg)
Search by Strategy
First, visually construct search strategies by connecting “building blocks”
![Page 35: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/35.jpg)
Search by Strategy
First, visually construct search strategies by connecting “building blocks”
Next, generate the search engine specified by that search strategy
![Page 36: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/36.jpg)
Strategies: DB+IR query plans
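- Data flow. Spinque: strategy
- Query: strategy made operational. Spinque: PRA
- Database. Spinque: RDBMS (MonetDB)

[Figure: a strategy connects building blocks such as BB1(in1, in2, in3, u1, u2) and BB2(in1), each with its inputs and a single output "out". On top of the relational DB, the strategy is expressed as a sequence of views: CREATE VIEW a AS SELECT ..; CREATE VIEW b AS SELECT ..; CREATE VIEW c AS SELECT ..]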
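
The idea of a strategy becoming a chain of views can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for the RDBMS named on the slide, and the tables, view definitions and data are invented for illustration; each view plays the role of one building block whose output feeds the next.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id TEXT, title TEXT);
    CREATE TABLE funding (project_id TEXT, partner TEXT, revenue REAL);
    INSERT INTO project VALUES
        ('p1', 'Smart cities pilot'),
        ('p2', 'Green energy grid'),
        ('p3', 'Smart cities data');
    INSERT INTO funding VALUES
        ('p1', 'Company A', 100.0),
        ('p2', 'Company B', 250.0),
        ('p3', 'Company B', 50.0),
        ('p3', 'Company A', 25.0);

    -- Block a: select the projects that match a topic.
    CREATE VIEW a AS
        SELECT id FROM project WHERE title LIKE '%smart cities%';
    -- Block b: traverse from matching projects to their partners.
    CREATE VIEW b AS
        SELECT f.partner, f.revenue
        FROM funding f JOIN a ON f.project_id = a.id;
    -- Block c: aggregate revenue per partner.
    CREATE VIEW c AS
        SELECT partner, SUM(revenue) AS revenue FROM b GROUP BY partner;
""")

rows = conn.execute(
    "SELECT partner, revenue FROM c ORDER BY revenue DESC"
).fetchall()
print(rows)
```

Because every block is a view, replacing one block (for example the topic filter) changes the whole search engine without touching the other steps.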
![Page 37: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/37.jpg)
Probabilistic Relational Algebra
- PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
  x = Project DISTINCT [$1,$3](y);
- SQL with explicit probabilities
  CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM y GROUP BY a1, a3;
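
The probability computation in the SQL view, `1-prod(1-prob)` per group, can be checked with a small Python sketch. It assumes the tuples of `y` are independent events; the relation contents are made up. Columns are 0-based in the code, so PRA's `[$1,$3]` becomes `(0, 2)`.

```python
from collections import defaultdict


def project_distinct(rows, columns):
    """Probabilistic projection with duplicate elimination.

    rows: iterable of (values, prob) pairs; columns: 0-based positions kept.
    Tuples that collapse onto the same projected key are combined as
    1 - prod(1 - prob), i.e. the probability that at least one holds,
    assuming the underlying tuples are independent.
    """
    none_holds = defaultdict(lambda: 1.0)
    for values, prob in rows:
        key = tuple(values[i] for i in columns)
        none_holds[key] *= 1.0 - prob
    return {key: 1.0 - q for key, q in none_holds.items()}


# Hypothetical relation y(a1, a2, a3) with a probability per tuple.
y = [
    (("alice", "doc1", "ir"), 0.5),
    (("alice", "doc2", "ir"), 0.5),
    (("bob", "doc1", "db"), 0.2),
]
x = project_distinct(y, columns=(0, 2))
print(x)
```

Two pieces of evidence of 0.5 each for the same projected tuple combine to 0.75, so additional evidence raises the probability without ever exceeding 1.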
![Page 38: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/38.jpg)
Rank by Text
![Page 39: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/39.jpg)
Expert Finding
![Page 40: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/40.jpg)
![Page 41: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/41.jpg)
Search User Interface
![Page 42: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/42.jpg)
Search results
![Page 43: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/43.jpg)
Result List Interactions
- Zoom in on item using “+”:
  - Open item in left pane
  - Shows results of item as query, using a result-type specific search strategy
  - Goal to provide contextually most related nodes from underlying graph
- Marking any item red/yellow/green for later usage
![Page 44: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/44.jpg)
![Page 45: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/45.jpg)
![Page 46: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/46.jpg)
Browse by facet
![Page 47: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/47.jpg)
![Page 48: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/48.jpg)
Strategic and business development needs
- What are our industry relations?
  - Which of these partners collaborate with more than one group?
  - What funding schemes support these collaborations?
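
Once the sources are linked, questions like these reduce to simple aggregations. The sketch below, over invented (partner, department, funding scheme) triples, finds the partners that collaborate with more than one group and lists the funding schemes behind those collaborations.

```python
from collections import defaultdict

# Hypothetical (partner, department, funding scheme) triples.
collaborations = [
    ("Company A", "Dept 1", "EU project"),
    ("Company A", "Dept 2", "National programme"),
    ("Company B", "Dept 1", "EU project"),
]


def multi_group_partners(rows, min_groups=2):
    """Map each partner working with >= min_groups departments to its schemes."""
    groups = defaultdict(set)
    schemes = defaultdict(set)
    for partner, department, scheme in rows:
        groups[partner].add(department)
        schemes[partner].add(scheme)
    return {
        partner: sorted(schemes[partner])
        for partner, departments in groups.items()
        if len(departments) >= min_groups
    }


print(multi_group_partners(collaborations))
```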
![Page 49: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/49.jpg)
Note: relations between partners and departments, edge strength represents revenue
![Page 50: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/50.jpg)
Note: relations between partners and departments, edge strength represents revenue
![Page 51: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/51.jpg)
Multi-party relations
Grouping of external relations: Foreign univ., NL univ., Funding agency, Public NL, Public foreign, Private sector
Note: External relations with at least two departments; node size w.r.t. number of relations
![Page 52: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/52.jpg)
Initial Findings
The integrated search helps improve recall, reducing the effort involved and leading to higher quality analyses
- Many things that could be done even more automatically (albeit not perfectly) seem less important than expected
  - We use very simple rules to extract URIs and companies; no information extraction yet
  - Information professional will always look into results in detail
![Page 53: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/53.jpg)
Open issues
- Integrate visualization
  - Idea: select result list and facet
- Too many facets
  - Idea: group facets
- Result explanations
  - Idea: describe path through graph
- Entity support ++
![Page 54: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/54.jpg)
Open issues
- What strategy is good? Why?
  - Idea: test using past usage data
- What are the right user roles?
  - Who should do the searches?
  - Who should write strategies?
  - ~ who writes the SQL queries in traditional DB?
- Human in the loop for retrieval, but not yet for indexing…
![Page 55: Looking beyond plain text for document representation in the enterprise](https://reader033.vdocuments.us/reader033/viewer/2022051820/55381e624a79598f768b4682/html5/thumbnails/55.jpg)
Questions?