

Page 1: 0 · Web view‘Seeing the wood for the TREEs’ Phil Turner , Alex R. Rogers‡, Susan Turner and Jeremy Ellman in Proc 3rd ERCIM Workshop “User Interfaces for All” , Obernai

LE 1182 TREE

DELIVERABLE IDENTIFICATION

Identification number: LE 1182-D-20
Type: Management Report
Title: The TREE Project - Final Report
Status: Draft
Deliverable: D20
Date: February 3, 1999
Version: 1.0
Number of pages: 21
Author(s):
Jeremy Ellman, Alan Goddard; MARI Group Ltd
Anders Green, Joakim Nivre; Göteborg University
Luca Gilardoni; Quinary
Alan Wallington; UMIST
Siobhan Walsh; Newcastle City Council
WP/Task responsible: WP0
Project contact point:
Jeremy Ellman
MARI Computer Systems
Wansbeck Business Park
Rotary Parkway
Ashington
Northumberland
NE63 8QZ
Tel: +44 191 402 0191
Fax: +44 191 402 1112
E-Mail: [email protected]
EC project officer: Pierre-Paul Sondag
Status: Public
Actual distribution: Consortium / EC
Supplementary notes:
Key words:
Abstract: This is the project’s Final Report.
Status of the abstract: Public
Received on:
Recipient's catalogue number:

DOCUMENT EVOLUTION

Version | Date | Status | Authors | Notes
1.0 | 11/11/98 | first draft | AG |
1.1 | 3/2/99 | Revised | JE | Circulated for partner feedback
1.2 | 9/2/99 | Revised | AG | NCC Contribution included for partner feedback


Language Engineering Project LE-1182 TREE

1. Executive Summary
1.1 Technical achievements
1.2 Validation results
1.3 Impact and future prospects
2. Project Timetable
2.1 Contractual matters
2.2 Stages of work
2.3 External reviews
2.4 Conferences, exhibitions, user group meetings
2.5 Other important events
3. Achievements
3.1 Demonstrator system or service
6.5 Contact list of Project User Group. Names


LE-1182 TREE Final Report

1. Executive Summary

1.1 Technical achievements

The technical objective of the TREE project has been to develop a web-based service offering multilingual access to job vacancies, allowing users not only to search for jobs in their own language but also to get automatically generated descriptions of the jobs in the same language.

The TREE system designed to meet this aim has evolved throughout the course of the project. Three successive prototypes have been built: the first (P01) based on requirements from user groups, and the others (P02 and P03) based on user feedback on the previous version. The scope of the job areas covered has increased dramatically, from the Tourism and Leisure domain (P01) to all job sectors (P02 and P03). The architecture also changed significantly between P02 and P03 to counter the instability caused by some elements of the underlying technology.

In its current version (P03), the system consists of three main modules: the user interface (UI), the database system (including both a vacancy database and a terminology database), and the generator. (In addition, there are various tools for importing data into the databases.)

The UI is based on standard Web technologies (HTML and JavaScript), and is accessible through any standard (frames-supporting) browser. It has been tested with Microsoft Internet Explorer (version 4) and Netscape Communicator (version 4). The UI now (i.e. P03) supports five languages (Flemish, French, English, Finnish, Swedish), as against three languages in previous versions.

The core of the UI is the vacancy search mechanism. This uses a hierarchy of list-boxes for the selection of job titles, and clickable maps to select the desired locations. The search results in a list of hyper-text links (of the format ‘jobtype in location’) to jobs which fulfil the search criteria entered by the user. If there are no appropriate jobs in the location specified, the geographical scope of the search is progressively widened.

The TREE system is supported by a database system designed to hold vacancy data in a language-neutral format. This is achieved by storing references to language-dependent items, such as the job description, by means of codes which refer to a terminology database containing equivalent terms in all the languages supported. The database system, implemented upon an Oracle RDBMS, is therefore split into two main components, one for job vacancies and the other for terminology.

The vacancy database has been designed with the aim of being generic enough to enable storage of any kind of job vacancies. The schema is normalised and attributes fall into two main categories: (i) terminology data: attributes whose values are expected to be term codes that enable translation to different languages (e.g. job title, qualifications required); (ii) value data: attributes whose meaning is language independent (such as numbers or street addresses). An API is provided to enable the analyser/loader module to take incoming data and to store vacancies in the correct format, while access from the search engine is via normal SQL queries.

The terminology database stores terminology deemed relevant to the TREE domain. For each of the vacancy attributes containing terminology data, a hierarchy of terms is maintained. For each term, it is possible to define synonyms (including multi-word expressions) in all the supported languages.

The generator module in the TREE system has been developed with the aim of using a natural language grammar with a simple semantic component to produce job adverts in several languages from a language-independent database representation. The generator produces HTML-coded texts to be displayed as the result of searching the job database. The input to the generator is a string representing the general structure of the database entry for a specific job advert, together with the terminology that applies to that particular job. From this internal structure the generator builds a syntactically well-formed text in the current language using the terminology and grammar. The generator itself is written in Prolog, with some I/O code written in C. In the latest prototype (P03) the generator runs as a standalone application compiled into binary code from a Warren Abstract Machine representation of the generator constructed using BinProlog 5.75.

The aim for the generator module was to make it possible to use well-known linguistic technology to generate job adverts in natural language. Neither the generator engine itself nor the environment in which it was implemented is domain dependent. The database interface, terminology and language grammars need to be developed for a new domain, but the other modules can be reused without much effort. This makes it possible to transfer the technology to other parts of the software industry.

Originally, the TREE system was supposed to contain an analysis module, using information extraction techniques to gather vacancy data from free text job ads. Throughout the course of the project, however, the emphasis has changed, becoming more focused on the input of vacancy data from large existing databases. In order to prove the viability of this approach, two data conversion packages have been implemented, one for vacancy data from VDAB (Belgium), the other for vacancy data from AMS (Sweden). Another kind of input module has been developed to allow NCC job vacancies, input via a Lotus Notes form, to be automatically inserted into the database.

1.2 Validation results

To ensure that the TREE application works as intended, it is not enough that the linguistic search mechanism functions logically, e.g. that definitions for interpreting search terms and presenting the results to the user are established. The system must also meet the demands of usability from a psychological point of view. If the user does not understand the functionality of TREE, it does not matter how sophisticated the techniques behind the search mechanism are; the user will still not find any relevant vacancies. The purpose of this study is to investigate the usability of the TREE user interface from the psychological perspective of human-computer interaction.

The research tasks undertaken in the evaluation of P03 reflect both quantitative and qualitative approaches. Together, the methods served to:

· Monitor (via the system log) and assess user interaction with P03 in user test beds across Europe.

· Establish user views and attitudes towards TREE (usability issues and information provision).

· Validate the development of P03 as a user-driven product (compared to the status of P02 and other employment-based websites).

For the P03 user trials, a cross-site approach was adopted. Each TREE language with the exception of French was tested. The users therefore include English, Swedish, Flemish and Finnish speakers. Three sites, Newcastle, Antwerp and Göteborg, represent the sample. In Göteborg, the trials will include a sample population of both twenty Swedish users and twenty Finnish users. This cross-site approach will facilitate the requirement for a comparative assessment of user feedback and will allow for a pan-European application of these findings.

The WAMMI (Website Analysis and Measurement Inventory) questionnaire was used as an aid to assessing user reaction to TREE. WAMMI, which was developed by Nomos, Sweden, is a visitor-orientated evaluation tool for designing better web sites, as seen through the eyes of the end-user. It is based on a questionnaire that aims to measure users’ subjective opinions of site attractiveness, control, efficiency, helpfulness and learnability. WAMMI enables users to tell site developers how useful and usable they found the site.

The cross-site analyses of the results of the P03 TREE user trials are not clear-cut. In the UK, the results of the pre-set tasks, WAMMI and interviews were in the main supportive of one another; this was not the case with Flemish and Swedish users. In principle, if users score higher on system navigation performance, one would expect them to be more receptive to a system. It is therefore surprising that Flemish users, who on average took twice as long to navigate TREE and, as the system log data suggested, encountered more problems, were more receptive to TREE than Swedish users.

According to the Nomos website measurement inventory, UK users (a sample that contained proportionally more inexperienced internet users than other sites) rated TREE well above average for all criteria, assigning a figure of 74/100 for global usability. Similarly, Flemish users attributed an above average rating for global usability, which indicates that the site speaks the user language, although for other criteria, negative values were allocated. In comparison, Swedish users gave a rating to TREE that scored below the general average for attractiveness, controllability, efficiency, learnability and global usability. Many of these users did however provide feedback that contained some valid constructive criticism.

1.3 Impact and future prospects

TREE would appear to be well placed for exploitation, particularly in Europe. The continuing expansion of the EU means that more and more European citizens have the right to move freely between neighbouring EU countries, and find work there. Indeed it is one of the EU’s goals to facilitate the ability of EU citizens to live and work in any other EU country. Compared to some other parts of the globe (e.g. North or South America), Europe has a great diversity of languages within a relatively small geographical area. Owing to the highly developed transport system, EU citizens can move easily and quickly between EU member states, in the process often moving from one linguistic zone to another. It is even possible to move between linguistic zones whilst travelling inside the same European country (e.g. Belgium, Ireland, Spain, and the UK).

However, inability to speak the language(s) in a given country can inhibit prospective employees from looking for jobs there, which likewise reduces the pool of suitable employees for employers in that country. There is thus a need for a system such as TREE, which can be accessed from anywhere in the world, and provides multiple linguistic interfaces to the same group of jobs spread across a number of countries.


There is obviously a need for such a system, but there do not currently seem to be systems other than TREE with this profile. There are a number of systems offering jobs in one or more countries, but with a UI only in a single language. There are thus a number of possibilities for exploiting TREE. Partners could host a TREE system and charge large employers, national employment services, commercial employment agencies, etc. to carry their vacancies. The software could equally be sold or franchised to a third party to host their own site. Outside Europe, any other multi-lingual country could be a target for exploitation of TREE, Canada being a strong possibility.

2. Project Timetable

2.1 Contractual matters

The TREE contract commenced in late November 1995, and terminated on 20 November 1998. The main change to the composition of the consortium occurred at the end of 1997, when VDAB withdrew from the project. This created a problem, as VDAB (the Belgian employment service) was a major provider of job vacancy information. Fortunately, talks between GU and AMS (the Swedish employment service) subsequently resulted in AMS’s participation in the project as a provider of job vacancy information.

2.2 Stages of work

The progression of the TREE project (as defined in the Technical Annexe) is based upon a three-iteration cycle of:

· implementing a prototype

· testing the prototype

· validating the prototype by means of user-trials, and feeding the results back into the next iteration

The main milestones of the project are the production of the three prototype systems, and their associated user trials (see table below).

WP Number | WP Name | From - To | Comments
WP 0 | Project Management | Nov. 95 – Nov. 98 | Co-ordinate and manage consortium. Report to CEC.
WP 1 | Requirements Baseline | Nov. 95 – May 96 | Study user requirements; produce TREE functional specification. Risk analysis.
WP 2 | Specification, Implementation and Assessment | May 96 – May 97 | First iteration of implement-test-validate cycle.
WP 3 | Specification, Implementation and Assessment | Feb. 97 – Oct. 97 | Second iteration of implement-test-validate cycle.
WP 4 | Intermediate Assessment | Aug. 97 – Nov. 97 | Investigate possible TREE products; alter work, exploitation plans accordingly.
WP 5 | System Refinement | Jan. 98 – Aug. 98 | Third and final iteration of implement-test-validate cycle.
WP 6 | Wider Validation and Dissemination | Sept. 97 – Dec. 98 | Publicise TREE. Validate with additional users and feedback to WP 5.
WP 7 | Exploitation, Promotion | April 97 – July 98 | Reassess user requirements, dissemination, and promotion. Feedback to WP 5.

2.3 External reviews

Two external reviews took place during the course of the project. The first was held on 14-15 January 1997 at the MARI site in Ashington, Northumberland, UK. The second was held on 18 March in Luxembourg.

The results of these reviews are summarised in the ‘Evaluation and Assessment’ section (below).

2.4 Conferences, exhibitions, user group meetings

TREE was presented at the Association for Computational Linguistics' Fifth Conference on Applied Natural Language Processing, held in Washington DC, USA, in March/April 1997 (proceedings pp 269-276):

'Multilingual Generation and Summarization of Job Adverts: the TREE Project' (H. Somers, B. Black, J. Nivre, T. Lager, A. Multari, L. Gilardoni, J. Ellman, A. Rogers)


The project was also presented informally by Somers at the Australian Natural Language Postgraduate Workshop held at the University of Melbourne, January 1998, as part of the Australian Natural Language Processing Fortnight.

Also, see the table under the heading ‘project-level dissemination, awareness, publications, etc’.

2.5 Other important events

See the table under the heading ‘project-level dissemination, awareness, publications, etc’.

3. Achievements

3.1 Demonstrator system or service

3.1.1 main functions supported

The purpose of the work has been to produce a system in which the job seekers can look for job vacancies, and get job adverts, presented in a language of choice, irrespective of what language the advert originated from. As a part of this, an on-line search function for European job adverts has been developed, intended for use on the Internet.

In addition, TREE provides users with links to many different sources of information of particular relevance to those who are looking for work in another EU country. This information covers a range of topics such as employment law, housing, legal status, etc.

3.1.2 technologies and components

3.1.2.1 Introduction

The TREE system has evolved throughout the course of the project. Three successive prototypes have been built: the first (P01) based on requirements from user groups, and the others (P02 and P03) based on user feedback on the previous version. The scope of the job areas covered has increased dramatically, from the Tourism and Leisure domain (P01) to all job sectors (P02 and P03). The technological base has changed from mSQL (P01) to Oracle and the Oracle Web Server (P02 and P03). The architecture also changed significantly between P02 and P03 to counter the instability caused by some elements of the underlying technology.

Also, the UI was changed for each version; initially in response to user opinion, and subsequently when it was found that the resulting UI made it very difficult for users to actually select the jobs they wanted. Initial versions (P01 and P02) were in three languages (French, Flemish, English), whereas P03 also supported Finnish and Swedish.

3.1.2.2 User Interface

The UI is based on standard Web technologies (HTML and JavaScript), and is accessible through any standard (frames-supporting) browser. It has been tested with Microsoft Internet Explorer 4 and Netscape Communicator 4.05.

The UI now (i.e. P03) supports five languages (Flemish, French, English, Finnish, Swedish), as against three languages in previous versions.

The core of the UI is the vacancy search mechanism. This uses a hierarchy of list-boxes for the selection of job titles, and clickable maps to select the desired locations. The search results in a list of hyper-text links (of the format ‘jobtype in location’) to jobs that fulfil the search criteria entered by the user. If there are no appropriate jobs in the location specified, the geographical scope of the search is progressively widened.
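The widening behaviour described above can be illustrated with a minimal Python sketch. It is not the TREE implementation (which queries an Oracle database from the Web UI); the `Vacancy` record, the `PARENT_AREA` hierarchy and the job codes are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vacancy:
    vacancy_id: int
    job_code: str   # hypothetical language-neutral job title code
    location: str   # most specific location of the job


# Hypothetical containment hierarchy: each area maps to the wider area around it.
PARENT_AREA = {
    "Ashington": "Northumberland",
    "Newcastle": "Tyne and Wear",
    "Northumberland": "North East England",
    "Tyne and Wear": "North East England",
    "North East England": "UK",
}


def areas_outwards(location):
    """Yield the location itself, then each successively wider area."""
    while location is not None:
        yield location
        location = PARENT_AREA.get(location)


def is_within(location, area):
    """True if `location` lies inside `area` (or is the same place)."""
    return area in areas_outwards(location)


def search(vacancies, job_code, location):
    """Return (area_used, matches), widening the area until a job is found."""
    for area in areas_outwards(location):
        matches = [v for v in vacancies
                   if v.job_code == job_code and is_within(v.location, area)]
        if matches:
            return area, matches
    return None, []


def link_text(vacancy, job_titles):
    """Build the 'jobtype in location' text shown for each result link."""
    return f"{job_titles[vacancy.job_code]} in {vacancy.location}"


VACANCIES = [Vacancy(1, "J_CHEF", "Newcastle"), Vacancy(2, "J_WAITER", "Ashington")]
TITLES_EN = {"J_CHEF": "Chef", "J_WAITER": "Waiter"}
```

With this data, a search for `J_CHEF` in Ashington finds nothing locally or in Northumberland, and succeeds only once the scope has widened to North East England.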

Clicking on one of the hyper-text links causes details of the job to be displayed, along with the option of viewing the original text of the advert.

The UI has evolved considerably over the course of the project. User groups who tried P01 requested a free-text interface for the entry of search criteria in P02. As a result P02 was developed with a hybrid interface; a free-text interface as the default, with the option of using a list-based Java applet. However, in practice the free-text approach was found to be unsatisfactory, as most users who initiated searches failed to find any vacancies matching the job-titles they had entered. As a result, the P03 interface adopted the strategy described above.

3.1.2.3 DB Structure and Design

The TREE system is supported by a repository designed to hold job ad data in a language-neutral format. This is done by storing references to language-dependent items, such as the job description, by means of codes which refer to a terminology repository holding linguistic variants and the hypernym and hyponym relations between terms.

The repository, implemented upon an Oracle RDBMS, is therefore split into two main components, one for job ads and the other for terminology.

The job ads schema has been designed with the aim of being generic enough to enable storage of any kind of job ad. The schema is normalised (in a relational sense) and attributes fall into two categories:

· terminology data: attributes whose values are expected to be term codes (fillers) to enable translation to different languages; examples are the job description, or qualifications needed

· value data: i.e. attributes whose meaning is language independent (such as numbers, for number of jobs posted, or street addresses).

An API is provided to enable the analyser/loader module to take incoming data and to store ads in the correct format, while access from the search engine is via normal SQL queries.
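The distinction between terminology data and value data can be sketched in a few lines of Python. This is an illustration only: the real store is a normalised Oracle schema, and the attribute names, term codes and translations below are hypothetical.

```python
# Hypothetical terminology table: term code (filler) -> expression per language.
TERMS = {
    "JOB_LINGUIST": {"en": "Linguist", "sv": "Lingvist", "fr": "Linguiste"},
    "EDU_UNIVERSITY": {
        "en": "University education",
        "sv": "Universitetsutbildning",
        "fr": "Formation universitaire",
    },
}

# Attributes holding term codes; every other attribute is language-independent value data.
TERMINOLOGY_ATTRIBUTES = {"job_title", "education"}

# One vacancy in language-neutral form (hypothetical attribute names).
VACANCY = {
    "job_title": "JOB_LINGUIST",      # terminology data
    "education": "EDU_UNIVERSITY",    # terminology data
    "number_of_jobs": 2,              # value data
    "street_address": "Box 2402",     # value data
}


def localise(vacancy, language):
    """Resolve term codes into the requested language; copy value data unchanged."""
    result = {}
    for attribute, value in vacancy.items():
        if attribute in TERMINOLOGY_ATTRIBUTES:
            result[attribute] = TERMS[value][language]
        else:
            result[attribute] = value
    return result
```

The same stored record can then be presented in any supported language, e.g. `localise(VACANCY, "sv")`, without the vacancy itself containing any language-specific text.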

The terminology module provides storage services for terminology entries deemed relevant to the TREE domain. For each of the job schema attributes deemed to contain linguistic expressions, a hierarchy (really a DAG) allowing the expression of taxonomies of terms is maintained. Each term is uniquely identified by a code (the filler), which is used to fill the schema database, and for each term it is possible to define synonyms (including multi-word linguistic expressions) in all the TREE-supported languages.
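As a rough illustration of such a term hierarchy, the following Python sketch keeps, for each term code, its broader terms (a term may have more than one, hence a DAG rather than a tree) and its synonyms per language. The class, its method names and the example terms are assumptions made for this sketch, not part of the TREE API.

```python
from collections import defaultdict


class Terminology:
    """Toy in-memory terminology store; the class and method names are hypothetical."""

    def __init__(self):
        self.parents = defaultdict(set)                        # code -> broader term codes
        self.synonyms = defaultdict(lambda: defaultdict(set))  # code -> language -> expressions

    def add_term(self, code, broader=()):
        self.parents[code].update(broader)

    def add_synonym(self, code, language, expression):
        self.synonyms[code][language].add(expression.lower())

    def find_code(self, language, expression):
        """Return the term code for a (possibly multi-word) expression, or None."""
        wanted = expression.lower()
        for code, by_language in self.synonyms.items():
            if wanted in by_language.get(language, ()):
                return code
        return None

    def broader_terms(self, code):
        """Collect every hypernym reachable from `code` by walking the DAG upwards."""
        seen = set()
        stack = list(self.parents.get(code, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen


terminology = Terminology()
terminology.add_term("HOSPITALITY")
terminology.add_term("SEASONAL")
terminology.add_term("KITCHEN_STAFF", broader=["HOSPITALITY"])
terminology.add_term("CHEF", broader=["KITCHEN_STAFF"])
terminology.add_term("SUMMER_CHEF", broader=["CHEF", "SEASONAL"])  # two parents: a DAG
terminology.add_synonym("CHEF", "en", "chef")
terminology.add_synonym("CHEF", "en", "cook")
terminology.add_synonym("CHEF", "fr", "chef de cuisine")
```

Looking a user's expression up via `find_code` yields the filler stored in the job schema, while `broader_terms` gives the more general categories under which the same vacancy could also be found.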

3.1.2.4 NL Generation

The generator module in the TREE system was developed with the aim of using a natural language grammar with a simple semantic component to produce job adverts in several languages from a language-independent database representation. This approach enables flexibility when defining the output from the system, both in terms of how the output is formatted and in terms of consistency within the generated text, depending on the database content.

[Figure: boxes labelled Database entry string, Core Generator Engine, Text Grammars, HTML formatting, Output HTML Text]

From database entry to HTML. The picture shows a schematic overview of the Natural Language Generator of the TREE system.

The generator module consists of the following parts: a core generator engine and a module for HTML encoding, which are language independent, and a grammar, terminology lexicon and postprocessor, specific to each language. The generator as such is written in Prolog with some I/O code written in C. In the current version of the prototype the generator is run as a standalone application compiled into binary code using a Warren Abstract Machine representation of the generator constructed using BinProlog 5.75.

Text Grammars

The grammars developed for the TREE project provide HTML-coded texts to be displayed as the result of searching a job database. The input to the generator is a string representing the general structure of the database entry for a specific job advert, together with the terminology that applies to that particular job. From this internal structure the generator builds a syntactically well-formed text in the current language using the terminology and grammar.

The input string reflects the relational structure of the database. The data from the input string is stored internally in the Prolog engine, which operates on its own internal database. Only minor consistency checking is performed during the generation process, to ensure that there is enough data to generate at least some vital information such as the job title and the location. The generator suppresses parts of the advert text where there is no string data or where the language-specific term cannot be found. In this way, generated adverts show only information that adds something relevant to the advert, or that is at least explicitly specified.
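The suppression behaviour can be pictured with a short Python sketch. The actual generator is a Prolog grammar; the function, the lexicon and the attribute names here are hypothetical and cover only a few of the advert fields.

```python
# Hypothetical English lexicon: term code -> expression.
LEXICON_EN = {
    "JOB_LINGUIST": "Linguist",
    "EDU_UNIVERSITY": "University education",
}


def generate_advert(entry, lexicon, long_version=True):
    """Build advert text, silently dropping parts that have no data or no known term."""
    title = lexicon.get(entry.get("job_title"))
    location = entry.get("location")
    if not title or not location:
        return None  # the vital information (job title and location) is missing

    lines = [f"{title} in {location}"]
    if not long_version:
        return "\n".join(lines)  # the short version stops at title and location

    education = lexicon.get(entry.get("education"))
    if education:  # suppressed when the term code cannot be resolved
        lines.append(f"* Education requirements: {education}")

    salary = entry.get("salary")
    if salary:  # suppressed when the slot is empty
        lines.append(f"* Salary: {salary}")

    return "\n".join(lines)


ENTRY = {
    "job_title": "JOB_LINGUIST",
    "location": "Stockholm",
    "education": "EDU_UNKNOWN",  # no English term available, so this part is left out
    "salary": "per month, 12000 - 27000 SEK",
}
```

Here the education line is omitted rather than printed with a raw code, which is the effect described in the paragraph above.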

The generator starts by identifying the type of advert to generate. In this version of the system a short version and a longer version can be generated. The short version only contains information about the job title and the location of the job. Using the English generator a longer advert might appear as below:

Linguist in Stockholm (Number of jobs: 2)

* Description:
Linguist

* Education requirements:
+ University education

* Specifications:
Work time: full-time day - 40 hours per week - 5 days per week
Duration: full-time
Age: from 22 to 28 years
Experience: some/extensive experience
Salary: per month, 12000 - 27000 SEK
Language skills: English (excellent), Dutch (excellent), Danish (good)

* Contact:
in writing with CV for the attention of
J. Doe
Foo AB
Box 2402
12345 STOCKHOLM
http://www.foo.se
[email protected]

The grammar rules have the following form, where B is a grammatical category and C is a constraint on this particular rule (in effect a Prolog call to the database).

Head ---> B1, ..., Bn # C1, ..., Cn.

Optionally the body statement can be a list of words.

For instance, the grammar rule to present the URL for the company looks like this:

text:sem(url, Vac) ---> ['<A HREF="'], [Url], ['">'], [Url], ['</A>']
    # url(Vac, sem(url, Url)).

The last clause has the form url(ID, Sem) and is a call to one of the interface predicates to the internal database. When a value is specified in the URL slot in the database, the string is returned, and the HTML code linking to the employer is generated and shown in the advert:

<A HREF="http://www.foo.se">http://www.foo.se</A>
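For readers unfamiliar with Prolog, the effect of this rule can be mimicked in Python. The sketch below is only an analogy: a rule is a body of literal strings and variables plus a list of constraints, and the rule contributes text only if every constraint succeeds. The dictionary layout and function names are assumptions, and a failing rule is modelled as an empty string rather than as Prolog failure.

```python
def url_constraint(database, vacancy_id):
    """Constraint: succeed (returning bindings) only if the URL slot holds a value."""
    url = database.get(vacancy_id, {}).get("url")
    return {"Url": url} if url else None


# Python rendering of: Head ---> body # constraints
URL_RULE = {
    "head": ("text", "url"),
    "body": ['<A HREF="', "Url", '">', "Url", "</A>"],
    "variables": {"Url"},
    "constraints": [url_constraint],
}


def apply_rule(rule, database, vacancy_id):
    """Evaluate the constraints, then emit the body with variables substituted."""
    bindings = {}
    for constraint in rule["constraints"]:
        result = constraint(database, vacancy_id)
        if result is None:
            return ""  # a constraint failed, so the rule produces no text
        bindings.update(result)
    return "".join(
        bindings[item] if item in rule["variables"] else item
        for item in rule["body"]
    )


# Hypothetical internal database: vacancy 42 has a URL, vacancy 43 does not.
DATABASE = {42: {"url": "http://www.foo.se"}, 43: {}}
```

Applying `URL_RULE` to vacancy 42 yields the anchor element shown above, while vacancy 43, which has no URL, yields nothing.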

By combining the special functionality of the core generator with the full strength of the Prolog interpreter it is possible to use Prolog debugging features and control predicates to support the development of grammars.

Concluding Remarks on Generator

The process of grammar development is highly dependent on the target language. For Swedish, the grammar was constructed using the English generation grammar as a model, and only minor changes were needed. Finnish, however, needed great consideration, e.g. in order to handle case properly. The fact that there was no Finnish Prolog programmer available required very close co-operation with a Finnish translator. However, provided that language experts are available for the language in which a grammar is to be written, it is not strictly necessary to use programmers fluent in that language for each new grammar.

The aim for the generator module was to make it possible to use well-known linguistic technology to generate job adverts in natural language. Neither the generator engine itself nor the environment in which it was implemented is domain dependent. The database interface, terminology and language grammars need to be developed for a new domain, but the other modules can be reused without much effort. This makes it possible to transfer the technology to other parts of the software industry.

3.1.2.5 Analysis and NLP Issues

The TREE system is designed to deal with large databases of job advertisements, specifically those of AMS and VDAB. These databases represent job adverts in a structured form, with a number of fields containing different types of information (e.g. job titles, location, minimum age, qualifications) and with a number of different coding conventions used to represent the information (e.g. plain text, numerical codes).

For each database a program reads the adverts from a text file dump. Fields will be skipped if the information is not to be used by TREE. Otherwise the information will either be converted into TREE codes if there are language independent terms to represent the information, or left as it is, if the information will be the same in all supported languages e.g. numbers, or text that will not be translated such as addresses.

Finally, this "TREE format" information is input into the job schema database via a 'C' API.
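The three-way treatment of fields (skip, convert to a TREE code, or keep as is) can be sketched as follows. The actual converters and the 'C' API are not reproduced here: the semicolon-separated dump layout, the field names and the code table are all assumptions made for the sake of a runnable Python example.

```python
import csv
import io

# Hypothetical per-source configuration: what to do with each field of the dump.
FIELD_ACTIONS = {
    "title_code": ("convert", "job_title"),  # source code -> TREE term code
    "min_age": ("keep", "min_age"),          # numbers are language independent
    "address": ("keep", "address"),          # addresses are not translated
    "internal_ref": ("skip", None),          # not used by TREE
}

# Hypothetical mapping from (TREE attribute, source code) to TREE term code.
CODE_TABLE = {("job_title", "0815"): "JOB_LINGUIST"}


def convert_record(record):
    """Turn one source record into a 'TREE format' record."""
    tree_record = {}
    for field, value in record.items():
        action, target = FIELD_ACTIONS.get(field, ("skip", None))
        if action == "skip" or not value:
            continue
        if action == "convert":
            code = CODE_TABLE.get((target, value))
            if code is not None:
                tree_record[target] = code
        else:
            tree_record[target] = value
    return tree_record


def convert_dump(text):
    """Read a semicolon-separated text dump (with header row) and convert every advert."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [convert_record(row) for row in reader]


SAMPLE_DUMP = "title_code;min_age;address;internal_ref\n0815;22;Box 2402;X1\n"
```

In the real system the resulting records would then be passed to the loader API rather than returned as Python dictionaries.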

A further version of the analyser was developed to allow NCC job vacancies, input via a Lotus Notes form developed as part of the project, to be automatically sent by e-mail, and inserted into the database.

As the scope of TREE became more ambitious with the transition from P01 to P03, the role of analysis in TREE changed, and with it the nature of the analysis module. Paradoxically, given the expansion in TREE's technical ambitions, the demands made on the analysis module became fewer. In P02 and P03 in particular, the new technology of Information Extraction was not used to anything like the extent envisaged at the start of the project. The original assumptions concerning the role of analysis in TREE are described briefly below, followed by the reasons why they were not followed through in P02 and P03.

Information Extraction

It was originally assumed that TREE would cover just one employment area, namely the hotel and catering industry. The number of job titles would correspondingly be limited, as would likely types of employers, e.g. hotels, restaurants etc., the likely qualifications and skills required, and probably the nature of the benefits. Indeed, it would not be unreasonable to talk about typical catering jobs as opposed to typical jobs in the computing industry. It was also assumed when TREE was first started that the system design would permit users offering jobs to submit job advertisements via an e-mail feed more or less without restrictions.

To deal with this restricted range of job advertisements, the analysis technique chosen fell into the relatively new paradigm of analogy or example-based processing. In the following paragraphs we explain the analysis process and discuss our reasons for preferring this to a more traditional string matching or parsing approach.

Example-Based Processing

The input that the TREE system will accept is partially structured, but with much scope for free-text input. One possible way of analysing this would be to employ a straightforward pattern-matching approach, searching for "trigger phrases" such as employer:name is seeking job-title, with special processors for analysing the slot-filler portions of the text. This simple approach has certain advantages over a more complex approach based on traditional phrase-structure parsing, especially since we are not particularly interested in phrase-structure as such. Furthermore, there is a clear requirement that our analysis technique be quite robust: since the input is not controlled in any way, our analysis procedure must be able to extract as much information as possible from the text, but seamlessly ignore - or at least allocate to the appropriate "unanalysable input" slot - the text which it cannot interpret.
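
The trigger-phrase idea can be illustrated with a single pattern. The sketch below is an assumption-laden Python example rather than anything built in the project: the regular expression, the optional article and the `UNANALYSED` key are invented, while the slot names follow the employer:name and job-title labels used above.

```python
import re

# One hypothetical trigger phrase: "<employer> is seeking <job title>".
TRIGGER = re.compile(
    r"^(?P<employer>.+?)\s+is seeking\s+(?:an?\s+)?(?P<job_title>.+?)\.?$",
    re.IGNORECASE,
)


def extract(line):
    """Fill two slots from one line, or park the line as unanalysable input."""
    text = line.strip()
    match = TRIGGER.match(text)
    if match is None:
        # Robustness requirement: never fail, just set the text aside.
        return {"UNANALYSED": text}
    return {
        "employer:name": match.group("employer"),
        "job-title": match.group("job_title"),
    }
```

Every further phrasing ("requires", "has a vacancy for", and so on) would need its own explicit pattern, which is the cost of the rule-based route.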

However, both these procedures can be identified as essentially "rule-based", in the sense that linguistic data used to match, whether fixed patterns or syntactic rules, must be explicitly listed in a kind of grammar, which implies a number of disadvantages, which we will mention shortly. An alternative is suggested by the paradigm of "example-based" processing (Jones, 1996) now becoming quite prevalent in MT (Sumita et al., 1990; Somers, 1993) though in fact the techniques are very much like those of the longer established paradigm of case-based reasoning.

A flexible approach

In the example-based approach, the "patterns" are listed in the form of model examples, such as typical hotel and catering job advertisements from a database that would be used by TREE. Semi-fixed phrases are not identified as such, nor are there any explicit linguistic rules. Instead, a matcher matches new input against the database of already (correctly) analysed models, and interprets the new input on the basis of a best match (possibly out of several candidates). Robustness is inherent in the system, since "failure" to analyse is relative.

The main advantage of the example-based approach is that we do not need to make explicit what the linguistic patterns look like. Instead, the common patterns will be implicit in the database. To see how this works to our advantage, consider the following. Let us assume that our database of already analysed examples contains an advertisement which includes the following: Knowledge of Dutch an advantage, and which is linked to a schema with slots filled roughly as follows:

SKILLS:LANGUAGE:LANG:nl

SKILLS:LANGUAGE:REQ:"an advantage"

Now suppose we want to process advertisements containing the following texts:

Knowledge of the English language needed. (1)

Some knowledge of Spanish would be helpful. (2)

Very good knowledge of English. (3)

In the rule-based approach, we would probably have to have a "rule" which specifies the range of (redundant) modifiers (assuming our schema does not store explicitly the level of language skill specified), that fillers for the REQ slot can be a past participle, a predicative adjective or a noun, that they are optional, and so on. Such rules carry with them a lot of baggage, such as optional elements, alternatives, restrictions and so on. The biggest baggage is that someone has to write them.

In the example-based approach, we do not need to be explicit about the structure of the stored example or the inputs. We need to recognise Dutch, English and Spanish as being names of languages, but these words have "terminological status" in our system. If the system does not know "would be helpful", it will guess that it is a clarification of the language requirement, even if it may not be able to translate it. Furthermore, we can extend the "knowledge" of the system simply by adding more examples: if they contain "new" structures, the knowledge base is extended; if they mirror existing examples, the system still benefits since the evidence for one interpretation or another is thereby strengthened.

The matching algorithm

The matcher, which has been developed from one first used in the MEG project (Somers et al., 1994), processes the new text in a linear fashion, having first divided it into manageable portions, on the basis of punctuation, lay-out, formatting and so on. The input is tagged, using a standard tagger, e.g. (Brill, 1992). There is no need to train the tagger on our text type, because the actual tags do not matter, as long as tagging is consistent.

The matching process then involves "sliding" one phrase past the other, identifying "strong" matches (word and tag) or "weak" (tag only) matches, and allowing for gaps in the match, in a method not unlike dynamic programming. The matches are then scored accordingly. The result is a set of possible matches linked to correctly filled schemas, so that even previously unseen words can normally be correctly assigned to the appropriate slot.
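
The scoring idea can be sketched with a small dynamic-programming alignment. This Python example is an illustration only and does not reproduce the MEG or TREE matcher: the weights for strong matches, weak matches and gaps, the hand-assigned tags and the `MIN-AGE` slot name are all assumptions, and the step that copies unseen words into slots is left out.

```python
# Assumed weights; the report does not give the real scoring scheme.
STRONG, WEAK, GAP = 2.0, 1.0, -0.5


def pair_score(a, b):
    """Score two (word, tag) tokens: strong, weak, or None for no match."""
    (word_a, tag_a), (word_b, tag_b) = a, b
    if tag_a != tag_b:
        return None
    return STRONG if word_a.lower() == word_b.lower() else WEAK


def align_score(new, model):
    """Best total score for aligning two tagged phrases, allowing gaps."""
    rows, cols = len(new) + 1, len(model) + 1
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = best[i - 1][0] + GAP
    for j in range(1, cols):
        best[0][j] = best[0][j - 1] + GAP
    for i in range(1, rows):
        for j in range(1, cols):
            options = [best[i - 1][j] + GAP, best[i][j - 1] + GAP]
            score = pair_score(new[i - 1], model[j - 1])
            if score is not None:
                options.append(best[i - 1][j - 1] + score)
            best[i][j] = max(options)
    return best[-1][-1]


def best_match(new, examples):
    """Return (score, schema) for the stored example that aligns best."""
    scored = ((align_score(new, tokens), schema) for tokens, schema in examples)
    return max(scored, key=lambda pair: pair[0])


# Two already-analysed models with their filled schemas (tags assigned by hand).
EXAMPLES = [
    ([("Knowledge", "NN"), ("of", "IN"), ("Dutch", "NNP"), ("an", "DT"), ("advantage", "NN")],
     {"SKILLS:LANGUAGE:LANG": "nl", "SKILLS:LANGUAGE:REQ": "an advantage"}),
    ([("Minimum", "JJ"), ("age", "NN"), ("18", "CD")],
     {"MIN-AGE": "18"}),  # invented slot name
]

NEW = [("Some", "DT"), ("knowledge", "NN"), ("of", "IN"), ("Spanish", "NNP"),
       ("would", "MD"), ("be", "VB"), ("helpful", "JJ")]
```

Calling `best_match(NEW, EXAMPLES)` selects the language-skill model for the Spanish sentence, because "knowledge of" matches strongly and "Spanish" matches "Dutch" on tag alone. Because every alignment receives some score, there is no hard failure, which is the sense in which robustness is inherent.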

The approach is not without its problems. For example, some slots and their fillers can be quite ambiguous: cf. moderate German required vs. tall German required (!), while other text portions serve a dual purpose, for example when the name of the employer also indicates the location. However, the matcher is extremely flexible, and if on-line or e-mail feedback to the user submitting the job advertisement were to be assumed, this should mean that the analysis module can degrade gracefully in the face of such problems.

Information Extraction in PO2 and PO3

The change from PO1 to PO2 saw a major change in the nature of job input and in the nature of the jobs themselves. The principal source of jobs would no longer be individuals or companies e-mailing in their job advertisements; it would be unlikely that the prototypes would accumulate a sufficiently impressive database of job advertisements if this had remained the case. Instead, a large existing database of job advertisements was to be used. Furthermore, it was decided to allow job advertisements in all job sectors. Both changes could be accommodated by bringing in the Flemish Employment Exchange VDAB.

The VDAB job database is structured with different fields containing different types of information, e.g. job titles, location, minimum age, salary, qualifications and so on. One field contains up to 270 characters of free-text for additional information that the employer might wish to give.

As far as information extraction was concerned there were a number of consequences of this change. Firstly, for almost all the information that we would wish to extract from a job advertisement, the task had already been done, and the task for analysis was one of converting a VDAB code or representation into the TREE code. To this end, a program was written that would either "skip", "convert" or "copy" the information in the VDAB fields into the TREE Schema.

There remained the free-text fields in the VDAB database, and using the MEG analyser to extract information from these was considered. This proved unsuccessful for two reasons:

Firstly, and most importantly, there was no information in the free-text fields that corresponded to information slots in the TREE schema; these could all be filled using the information in the structured part of the VDAB database.

Secondly, example-based information extraction requires a corpus of typical examples. However, the point about the VDAB free-text fields was that information that didn't fit in anywhere else went in these fields. All the standard aspects of the job advertisement had already been placed in their own fields. Consequently what was typical of the free-text fields was their very lack of typicality. The option of adding slots to the TREE schema for the sort of information that could be extracted from the free-text fields did not arise.

The work on example-based information extraction was not totally abandoned at this stage, although the effort expended on this particular task was reduced. Instead, work continued on converting the MEG analyser, and a search was made for job databases that would be broadly comparable to VDAB in terms of size and type of job covered, but which would be predominantly free-text.

It must be stressed that compared to the original aims, this was an ambitious undertaking. The move to all job sectors requires a very much larger corpus of model job advertisements than would be the case if we were only dealing with the hotel and catering trades. Although some aspects of a job advertisement are likely to be common to all job advertisements, one would not expect an advertisement for the chief executive of a large company to be formatted in the same way as an advertisement for casual bar staff. Then, once the model examples have been collected, they need to be analysed correctly by a human. This involves taking the list of schema slots, which represents the information that TREE considers worth extracting, and then determining what portion of each job advertisement corresponds to any of the schema slots.

Job Hunter

A search for large databases of job advertisements in predominantly free-text form was made. Surprisingly few instances were found. Job Hunter <http://www.jobhunter.co.uk>, however, was one such case. This is a compilation of the job advertisements that appear in most of the principal regional newspapers in the UK.

Unfortunately, using Job Hunter proved much harder than expected for a number of reasons:

· There appeared to be "house styles" with a job from one newspaper having much in common in terms of style and layout with other advertisements from the same newspaper. This meant that model examples ideally needed to be found for the different newspapers as well as for many of the different job sectors and so a large corpus of examples was required. Consequently, a long time was spent associating advertisements with their correct TREE slots. A total of 300 jobs were so analysed for a corpus, but the results when used were very poor, suggesting that the corpus wasn't large enough.

· Much of the information, in particular geographical and contact information, could only be understood by someone with local knowledge. This reflects the fact that the advertisements originated in local newspapers. For example, the first part of a telephone number representing the local area may be omitted if telephoning within the same local area.

These problems, together with the fact that many large databases of jobs were already structured, led us to abandon example-based information extraction for PO3.

3.2 software and hardware requirements

The delivered TREE prototype is based on an Oracle RDBMS, which is fed or accessed for maintenance by application programs mostly written in C/C++/PL/SQL/Java/Prolog (integrated into C) and accessed by Web-based services implemented on an Oracle Web server with HTML/JavaScript.

The current prototype runs on Sun workstations (Solaris 2.5 or higher) with Oracle 7.3/8. A Java 1.1 compliant virtual machine is needed to run a tool for inspecting terminology, while modifying the software modules could require a Java development environment (Symantec Visual Café 2.5 was used, but others would serve as well), C/C++ compilers (gcc 2.7.2 or higher) and the Prolog compiler BinProlog 5.75.

The system could be made available on different Unix architectures, provided that Oracle and the software compilers are available. Moving to a different RDBMS (e.g. Sybase) or HTTP server (e.g. Apache) does not pose architectural problems. However, major changes would be necessary, as another mechanism would have to be written to replace the PL/SQL and its supporting infrastructure.

3.3 prerequisite software and portability

The software produced by the TREE consortium is strongly tied to the particular software packages on which it is based: principally the Oracle database server (version 7.3), the Oracle Web Server and the PL/SQL Agent. Although all major database products are based on ANSI-standard SQL, they all offer their own non-standard SQL extensions. Likewise they all have their own customised API and/or other access mechanism (PL/SQL in Oracle’s case). This means that portability is not always that simple a goal to achieve.

Oracle was chosen because it is an industrial-strength product, which offered an integrated Web-database solution. PL/SQL is a procedural extension of SQL (itself a non-procedural language). PL/SQL procedures can be compiled and stored in the Oracle database. They can then be run using the PL/SQL Agent. The Oracle Web Server can be configured to recognise which cgi-type requests are to PL/SQL procedures/packages. It then channels these requests to the PL/SQL Agent.

It is thus possible to write PL/SQL procedures/packages which accept input data from an HTML form, use this data to compose SQL queries to the database, and return a fully formatted HTML page to the browser to display the result.
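
The form-to-query-to-page flow can be illustrated outside Oracle. The sketch below is a Python analogue using an in-memory SQLite database, not the project's PL/SQL: the table, its columns, the sample rows and the function names are all invented for the example.

```python
import sqlite3
from html import escape


def make_demo_db():
    """Build a tiny in-memory vacancy table (hypothetical schema and data)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE vacancy (job_code TEXT, location TEXT, display TEXT)")
    db.executemany(
        "INSERT INTO vacancy VALUES (?, ?, ?)",
        [
            ("COOK", "Newcastle", "Cook wanted, Newcastle"),
            ("WAITER", "Gent", "Waiter wanted, Gent"),
        ],
    )
    return db


def search_page(db, form):
    """Turn submitted form fields into a query and return a complete HTML page."""
    rows = db.execute(
        # Bound parameter: form input is never pasted into the SQL text.
        "SELECT display FROM vacancy WHERE job_code = ? ORDER BY display",
        (form.get("job_code", ""),),
    ).fetchall()
    items = "".join(f"<li>{escape(text)}</li>" for (text,) in rows)
    body = f"<ul>{items}</ul>" if rows else "<p>No vacancies found.</p>"
    return f"<html><body><h1>Search results</h1>{body}</body></html>"
```

In the delivered system the equivalent logic lives in stored procedures inside the database, so the Web server only has to route the request and pass the finished page back to the browser.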

The above functionality was found to be a powerful tool in the development of the TREE software (and certainly preferable to the immensely convoluted Oracle ‘C’ API which was another alternative). However, the use of these technologies does mean that moving to another web server and/or database would require a significant portion of the TREE code to be rewritten in some other language or script.

On the other hand, a number of elements are wholly or largely portable:

· the HTML (in five linguistic variations)

· the scripts for the creation of the database schema

· the hierarchies of job-codes (and other terms) translated into five languages

3.4 Research results

3.4.1 key research areas addressed and advances made

There was considerable pressure from users (VDAB, NCC) to cover all employment areas and build a useful and viable system for multilingual access to job vacancy information. The project was consequently less innovative and experimental than planned with regard to research objectives. However, two areas deserve to be mentioned.

The first is the area of text generation, where the approach developed in the TREE project has some interesting features. First of all, it is an integrated approach, combining not only grammar rules, templates and canned text into a single formalism, but also text planning, sentence planning and sentence realisation into a single efficient process of text realisation. Secondly, it is a structure-driven approach, which means that the generation process is guided primarily by the aim of generating a well-formed text, with semantic content being instantiated and refined as a “side effect”, using semantic database constraints as restrictions on the applicability of (text) grammar rules. We believe this to be a very useful and efficient strategy in applications where the generated texts have a predictable and fairly rigid (though not completely fixed) structure. The approach is currently being tested on other domains, and a lengthy article presenting the results is under preparation (Nivre, J. & Lager, T. Constraint-based text generation: A structure-based approach. Manuscript).

The second area is that of information extraction, which was originally intended to play a major role in the TREE system. However, for reasons already outlined above, the emphasis of the project shifted in such a way that very little effort has in fact been devoted to this problem (cf. section 3.1.2.5).

3.5 Other results

3.5.1 project-level dissemination, awareness, publications, etc

Consortium members participated in a number of activities to disseminate information on the TREE project, and to promote the exploitation of the product:

Meeting | Date | Attendee(s) | Location
Project Line Conference | 11-12 January 96 | MARI | Luxembourg
LE Concertation Meetings | November 1995–November 1998 | MARI | Luxembourg, etc.
UK IGC | 6-7 June 96 | NCC | Newcastle
NCC: Exploitation Meeting with Siemens Electronics | Sept 96 | NCC | Newcastle
presentation to LE symposium “Communication or Cacophony? Opportunities for Language Engineering in Information Society” | October 1996 | GU | Göteborg
presentation at national conference at the Ministry of Telecommunications | January 1997 | Quinary | Rome
EURHOTEC Trade Show, the pan-European Hotel Technology Exhibition and Conference | February 1997 | NCC | Amsterdam
Reed International | 25 February 97 | MARI | Newcastle
presentation (including demonstration) at LE Concertation Meeting | March 1997 | MARI |
presentation at the UNICOM Conference “Natural Language Processing: Extracting Information for Business Needs” (Published paper.) | March 1997 | MARI | London
presentation at the Association for Computational Linguistics Fifth Conference on Applied Natural Language Processing | 3-4 April 97 | UMIST | Washington DC, USA
participation at the ICT Conference, IEE (Institute of Electrical Engineers) | April 1997 | NCC | Birmingham
presentation at exhibition on business information services (Published paper.) | June 1997 | Quinary | Rome
Gateway to Europe Conference | October 1997 | NCC | Newcastle
presentation at the 3rd ERCIM Workshop “User Interfaces for All” | November 1997 | MARI | Obernai
presentation at the Australian Natural Language Postgraduate Workshop | January 1998 | UMIST | Melbourne
London Enterprise Agency | 2 February 98 | YN | London
Centrepoint Streets Ahead Job Agency | 3 February 98 | YN | London
St Basil’s Foyer ENTA Training Centre | 11 February 98 | YN | Birmingham
London Connection | 26 February 98 | YN | London
Joint Meeting – Anti-Poverty Advisory WG | 27 February 98 | NCC | London
Telematics Applications Conference | March 1998 | NCC, Quinary | Barcelona
Stamford Foyer | 6 March 98 | YN | Stamford
YMCA England Training for Life | 18 March 98 | YN | London
Centrepoint Foyers | 19 March 98 | YN | Peterborough
NACRO | 1 April 98 | YN | London
Newcastle Foyer | 9 April 98 | NCC, YN, MARI | Newcastle
GOSIP | 29 April 98 | NCC | Newcastle
demonstration at language technology exhibition | May 1998 | GU | Göteborg
Foyer Federation | 6 May 98 | YN | London
SAHA | 7 May 98 | YN | London
Europe for Youth Conference | 8/9 May 98 | YN | London
Offenders Employment Forum | 18 May 98 | YN | London
Centrepoint – New Deal | 26 May 98 | YN | London
presentation at Transnational EVS Seminar “On the Move” with French and UK Foyers, German Jugendwohnheim and youth projects and Danish projects, and SOS (DGXXII) | June 1998 | NCC |
Gateshead Youth Organisations Council | 11 June 98 | NCC | Gateshead
Tyneside Foyer | 11 June 98 | NCC | Newcastle
National Foyer Federation Conference | July 1998 | NCC | London
Promotion of TREE at meeting with UK cabinet minister | July 1998 | NCC | Newcastle
Presentation to Youth Exchange Centre | September 1998 | NCC | Newcastle

3.5.2 Contribution to standards making, licenses and patents generated.

TREE has conformed to the standards of publishing on the World Wide Web. No patents have been generated.

4. Evaluation and Assessment

4.1 Validation

The results of the user trials were encouraging, although not uniform across all language groups.

The following feedback came, for example, from the UK user trials:

User's First Impressions: 90% reacted positively to the TREE front-end. In their accounts, most commented that TREE appeared simple and basic to use. Only 10% said that TREE appeared daunting at first glance.

Information Content; Language and Terms: All users understood the language and terms used in TREE. Many said that the language and terminology were straightforward and catered for an open audience.

Vacancy Details: Whilst the majority felt that the details provided on TREE were well presented and met their basic information requirements, most of these went on to say that more information was needed, particularly if the job required relocation to another EC country.

Navigation: 95% said that, in general terms, they did not find TREE difficult to navigate.

Functionality of Icons: All users fully understood the purpose of the icons. Some of these said that this reflected the demonstration provided by the reviewer.

Interactive Map: Over one half of the sample thought that the map was very user friendly. 45% of users were less receptive to the map mechanism. Most of these made references to the difficulty associated with identifying towns and specific areas within a city.

Search: User Friendliness: All respondents said that the structured search tool was a helpful method for retrieving relevant job vacancy details. 95% indicated that they found it very easy to use.

Search Tool Preference: 70% said that they preferred the existing structured search facility to a free text entry tool. 25% of respondents expressed a preference for a free text entry search tool, and one user recommended that TREE should have both methods available to users, as the two are complementary.

Belgian users were significantly more sophisticated, and gave specific advice on fonts and use of icons. For example it was reported that the Dutch flag does not represent Flemish to Belgians. Their reaction to TREE was not as positive as that of UK users (although still acceptable), possibly because they suffered slow access to the system.

Swedish users were less positive about TREE.

4.2 Feedback

Feedback was obtained from both formal reviews (of which there were two), and user groups.

Review reports

Review Jan 97

Following the two-day review the reviewers agreed that the project should continue but with the proviso that a number of suggestions should be followed. These proposals were as follows:

· Revisit User Requirements Survey

· Revision of the TREE business plan

· Revision of the TREE project plan

· To find a second ‘user’ partner to replace ManPower.

The reviewers expressed a generally favourable attitude towards the aims of the project, the cohesion and technical abilities of the consortium and the technical progress made. One specific comment concerning the user-interface of the demonstrator was that users should not be allowed to enter free-text for a given form field if there were only a limited number of valid options and that such form fields should be menu driven (thus limiting users’ options to those available in the menu).

Review March 98

The reviewers noted that since the last review there had been: ‘Evident commitment of the consortium to implement the reviewers’ recommendations’. The overall view of the reviewers was favourable, although there were reservations on some points.

Of the twelve headings given specific scores, two were rated ‘poor’, seven as ‘satisfactory’ and three as ‘good’. The areas criticised were ‘User involvement and commitment’ and ‘Ability and commitment to exploit the results’. The points praised were ‘Promotion and dissemination activities’, ‘Level of European added value’ and ‘Analysis of market sectors for application of the results, and exploitation plans’.

The reviewers recommended that:

‘In the remainder of the project lifetime, the consortium is to take vigorous effort to stimulate broader user interest in the project, for instance through dedicated workshops with national, as opposed to local authorities, agencies or other institutions related to the employment market. Validation should start ASAP, and details on this should be communicated to the EC (validation plan). Exploitation plans from the public and private users should likewise be submitted to the EC well before the project is over.’

User group meetings

Throughout the duration of the project, MARI Group Ltd, Newcastle City Council and (in the final year) its subcontractor YouthNet have publicised the TREE service in an attempt to exploit its potential to employers, job seekers, careers advisors, existing job vacancy/placement networks and other organisations, and voluntary agencies working with young people.

Audience reaction and interest has generally been good from the non-governmental organisation (NGO) sector. The resulting action has been varied, since levels of understanding, availability of IT resources and involvement in international work also differ enormously between the different agencies. The lack of IT resources among certain agencies hinders dissemination, since one method is to send information about TREE to projects using e-mail.

The educational sector, on the other hand, tends to be well equipped with IT, to possess a good understanding of the possibilities of IT and to react positively. The presentation at the International Youth for Europe Conference, with delegates from 14 countries, was received positively and led directly to contacts with Careers. A group of delegates is also planning to use TREE to facilitate trans-national work experience exchanges.

Those NGOs with a clearly defined purpose that are part of a trans-national network in turn create comprehensive Internet services; so the European Youth Hostel Federation has a web site, which has sites for its 14 country members, which represent 1681 hostels. In turn they are part of the International Youth Hostel Federation, which has had 300,000 hits on its web site. Bookings can be made on-line. The European Alliance of YMCA (EAY) represents 3 million users and also has a well set-up network, right across Europe, with links to sites in 18 countries and information about YMCA facilities in another 15 countries. These agencies have tended to be receptive to the concept of TREE and both these networks have agreed to host a link to TREE on their sites.

There was a positive reaction from the delegates at the Trans-national EVS Seminar “On the Move”, but most delegates were not equipped with Internet facilities to enable them to participate within the deadlines of the EVS programme.

Those agencies that are starting out and are still at the stage of “conscious ignorance” are trapped in a vicious circle. Because they are not well equipped with Internet facilities, staff cannot access the web, send e-mail or participate in information exchange, and therefore undervalue the Internet; this opinion was voiced by some agencies at the Foyer Federation conference.

There are also agencies which have set up web sites but have not maintained them, allowing them to atrophy, perhaps when there is a management change: invariably their web site remains under construction for long periods of time, which in turn means that the number of users does not increase.

Likewise there are variations in the perceived need for TREE, even with those agencies that have IT resources and expertise. The spread of the use of English within trans-national agencies has restricted the need for TREE: with one EC Department there has been a move to require all trans-national placements to be posted on a database only in English, rather than in more EC languages, as would be expected.

Finally, Göteborg University has had close contacts with the Swedish national employment organisation, AMS, a large potential user of the TREE technology. This has resulted in the use of AMS terminology resources and vacancy data for the third TREE prototype, as well as support from AMS for user trials of the Swedish version, which were carried out with real job seekers at AMS offices. During 1998 a series of bilateral meetings between Göteborg University and AMS has taken place, both in Göteborg and Stockholm, where one of the items discussed has been the future exploitation of TREE. AMS seems to be interested in an extension of TREE, on the one hand for Swedish immigrant languages, and on the other hand for the matching of job seekers as well as vacancies.

4.3 Internal collaboration

The members of the Consortium collaborated well together (in the 1998 Project Review, the comment under the heading ‘Project cohesion and synergy’ is ‘Research-industry partnership seems to work well’).

Given that the whole basis of the TREE project was its trans-national nature, it would have been difficult to achieve the same results with a development team drawn from any one country. Being able to specify how the system should work depended on having viewpoints on a number of different facets of life in a variety of EU countries. Educational systems, qualifications, unemployment/benefits systems, post codes and other location identifiers were just some of the subjects which needed a wide overview, to avoid adopting a convention which would map well onto conditions in one country, but would be wholly inappropriate in another.

Inevitably, collaboration between a number of different organisations adds an overhead compared to working within a single organisation. However, few individual organisations could furnish the diversity of experience and expertise needed for this project. As with all collaborative projects, a certain spirit of compromise and collaboration is needed, given the lack of a hierarchical command structure. The TREE consortium managed to collaborate successfully.

The areas of expertise necessary for the project necessitated organisations from radically different backgrounds. NLP expertise could probably only have been obtained from an academic organisation. Vacancy data required major employment services and other large employers, and Web and database skills were most appropriately sought from commercial companies.

The number of technical partners was not large. However, even with only four partners producing code and software modules, the iterative cycles of integration become significantly more involved and complex. Experience of partners has shown that even a slightly higher number of active partners (say six or seven) makes even the best managed integration exercise a major headache.

5. Conclusions and future prospects

5.1 Synthesis and conclusion

5.1.1 Technical feasibility

The TREE prototype has been through three iterations, and the architecture is now considered to be stable and reliable. The run-time section of TREE is built from standard components: a Web server, a database and a CGI-style script mechanism. In P03 the implementations of these components were Oracle Web Server, Oracle database and PL/SQL, respectively. However, with a certain amount of work it would be quite possible to port the current TREE code onto any suitable combination of Web server, database and scripting language.

Thus, while the current implementation of TREE is closely linked to the Oracle environment, the software could be adapted to a different environment.

There are a number of specific issues affecting the technical feasibility of TREE. These are discussed briefly below.

Scalability

TREE is based upon standard technologies and an industrial-scale database. Given an adequate platform and infrastructure there is no obstacle to TREE's scalability. The major areas where scalability is an issue are:

· quantity of data in the database. Growth here will come mainly from the number of vacancies held. As well as the vacancy data itself, each vacancy has a number of pre-generated display strings stored in the database. For each vacancy there are n*2 database rows of stored strings (where n is the number of languages supported, and ‘2’ refers to the fact that both long and short format display strings are stored). Oracle should be capable of coping with any foreseeable increases in this area.

· the number of hits on the Web server. Again standard Web solutions should be able to cope with this problem.
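The n*2 rule above can be turned into a quick sizing estimate. The following is a minimal sketch only: the function name and the example figures (10,000 vacancies, 4 languages) are hypothetical and are not taken from the project.

```python
def cached_display_rows(vacancies: int, languages: int, formats: int = 2) -> int:
    """Rows of stored display strings: one per vacancy, language and format.

    formats defaults to 2 because both long and short strings are stored.
    """
    if vacancies < 0 or languages < 0 or formats < 0:
        raise ValueError("counts must be non-negative")
    return vacancies * languages * formats


# Hypothetical figures, for illustration only.
print(cached_display_rows(vacancies=10_000, languages=4))
```

The estimate grows linearly in both the number of vacancies and the number of languages, so adding a language adds two rows per vacancy held.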

Stability

The stability of the TREE system was greatly improved during the development of P03. In P02 there were two different methods of searching. Each of these searches used its own Oracle Web Request Broker (WRB) Cartridge. These two cartridges were written in ‘C’. They used an API to access the database via PL/SQL routines that had been compiled and stored in the database. The output from these PL/SQL routines was then fed into the generator (written in Prolog). The output from the generator (in the appropriate language: French, Flemish, etc.) was written to the screen: it constituted the results of the user’s search.

This method worked but was unstable; any run-time error in one of the cartridges could crash the corresponding Web Listener, making the site unavailable to the outside world.

For P03 the (WRB) Cartridges were dispensed with altogether. Also the generator was only run offline. Software was written to run the generator for each vacancy in the database, and generate the corresponding output in each language for both long and short formats. This output was stored in a new table (generator_cache) in the database.

When a user launched a search the output for the screen could then be obtained by a series of relatively simple database accesses, using PL/SQL procedures.
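The P03 approach (generate offline, look up at search time) can be sketched as follows. This is an illustration under stated assumptions, not the project’s code: TREE used Oracle, PL/SQL and a Prolog generator, whereas the sketch uses Python with an in-memory SQLite database. Only the table name generator_cache comes from the report; the column names, the language codes and the generate() stand-in are hypothetical.

```python
import sqlite3

LANGUAGES = ("en", "fr")      # hypothetical language codes
FORMATS = ("short", "long")   # the two display formats named in the report


def generate(vacancy: dict, language: str, fmt: str) -> str:
    """Stand-in for the offline generator (the real one was written in Prolog)."""
    text = f"[{language}] {vacancy['title']}"
    if fmt == "long":
        text += f" ({vacancy['location']})"
    return text


def build_cache(conn: sqlite3.Connection, vacancies: list) -> None:
    """Offline step: pre-generate every display string and store it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS generator_cache ("
        "vacancy_id INTEGER, language TEXT, format TEXT, display TEXT, "
        "PRIMARY KEY (vacancy_id, language, format))"
    )
    rows = [
        (v["id"], lang, fmt, generate(v, lang, fmt))
        for v in vacancies
        for lang in LANGUAGES
        for fmt in FORMATS
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO generator_cache VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def display_strings(conn, vacancy_ids, language, fmt):
    """Search-time step: a plain database lookup, with no generator call."""
    if not vacancy_ids:
        return []
    marks = ",".join("?" for _ in vacancy_ids)
    cur = conn.execute(
        "SELECT display FROM generator_cache "
        "WHERE language = ? AND format = ? "
        f"AND vacancy_id IN ({marks}) ORDER BY vacancy_id",
        (language, fmt, *vacancy_ids),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
build_cache(conn, [
    {"id": 1, "title": "Welder", "location": "Newcastle"},
    {"id": 2, "title": "Nurse", "location": "Solna"},
])
print(display_strings(conn, [1, 2], "fr", "long"))
```

The design trade-off is the one described above: a failure in the generator can only affect the offline batch run, while the code path exercised by a user’s search is reduced to simple reads.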

Conclusion

The overall conclusion is that the TREE architecture is realistic and practical. TREE is scalable and, given a certain amount of effort, portable. The system is easily extensible to different languages, assuming that the necessary linguistic expertise is available. The user interface has undergone considerable modification, but the user trials showed that it could still be made more comprehensible and intuitive for the user. In short, the system is technically feasible.

5.1.2 Economic viability

An initial break-even analysis for a TREE service based at ACE can be guided by current costs at ACE and some proposed charging structures and presumed distributions of adverts and bulk space contracts.

Starting with the simplest charging scheme of 1 ECU per month per job advert, the annual break-even would be 74,000 adverts. This is based on three not unreasonable assumptions:


Language Engineering Project LE-1182 TREE

· All adverts are only live for one month.

· All adverts are single sales, not purchased through bulk ‘media space’ contracts.

· TREE has no other sources of income.
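The break-even arithmetic under these three assumptions can be sketched as follows. The report does not state ACE’s annual costs; the figure of 74,000 ECU used here is an assumption obtained by working backwards from the quoted break-even (74,000 adverts at 1 ECU each), and the function name and the two-month variant are hypothetical.

```python
import math


def break_even_adverts(annual_costs_ecu: float,
                       price_per_advert_month_ecu: float = 1.0,
                       months_live: int = 1) -> int:
    """Adverts per year needed to cover costs when adverts are the only income."""
    revenue_per_advert = price_per_advert_month_ecu * months_live
    if revenue_per_advert <= 0:
        raise ValueError("revenue per advert must be positive")
    return math.ceil(annual_costs_ecu / revenue_per_advert)


# Assumption: implied by the report's break-even figure, not stated in it.
IMPLIED_ANNUAL_COSTS_ECU = 74_000

# The report's scenario: 1 ECU per month, each advert live for one month.
print(break_even_adverts(IMPLIED_ANNUAL_COSTS_ECU))

# Hypothetical variant: each advert stays live (and is paid for) for two months.
print(break_even_adverts(IMPLIED_ANNUAL_COSTS_ECU, months_live=2))
```

Relaxing any one of the three assumptions lowers the number of single adverts that must be sold, which is why the distribution of contract types matters so much to the result.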

In addition, if ACE hosted a free CV/resume database, then recruiters could be charged for searches. Links to professional and trade publications should also be considered.

A TREE service is clearly viable, but much work will be needed to ensure its success. In particular, a shift in sales towards bulk contracts substantially changes the break-even point in terms of the number of contracts that need to be sold. Most of the scenarios below break even after general costs are added.

The conclusion must be that the distribution of contract types will be critical to TREE’s success, and that efforts should be concentrated on the larger contracts (100 advert-weeks and above). Basing contracts on number of adverts rather than advert-weeks should also be considered.

5.2 Business perspectives

5.2.1 Anticipated development of the market sector addressed

The market sector addressed has undergone major changes during the course of the project. These changes have been mainly in the area of Internet-based job offer and job search services, but also stem from changes in regulations across Europe. The first factor, the Web explosion, was anticipated by the project’s aim of exploiting the medium. However, the increase in Web usage has far exceeded the expectations of most observers. As a consequence, generic job search services have appeared, and at the same time a number of companies now post jobs on the Web directly. On the regulation side, the European job market has opened up new possibilities, e.g. by allowing the activities of temporary job placement agencies, which were restricted or even prohibited in the past.

Two main business models were identified. The first is exemplified by search services, much along the lines of Web search engines and news watchers but focused on job postings: they gather job ads on the net and make them available for search, and are generally supported by advertising or fees for related services. The second model focuses on the needs of a single organisation, which could be a commercial company, a job agency, a public organisation or some other agency (the TREE participants NCC, YN and AMS mostly fall into this latter category).

Two major capabilities are of relevance in the TREE project: firstly, being able to generate multilingual descriptions, and secondly, being able to search ads by means of a terminologically rich and sound structure. At present no existing service known to the authors combines these two elements. Both could play a significant, albeit perhaps different, role in both of the emerging business models above. The ability to generate multilingual descriptions could significantly strengthen services whose targets explicitly include users of different languages, as in the case of Swedish/Finnish for AMS or French/Flemish for VDAB (not to mention potentially the EC itself!). The ability to sort ads on the basis of a semantically meaningful description could prove a major boost to search services when coupled with suitable analysis mechanisms. This would be the case whether or not it is applied to multiple languages: the feature could prove a major improvement even for single-language services.

5.2.2 Importance attached to the project in-house; interest shown by prospective customers

The different TREE partners attach importance to different aspects of the TREE product.

The training arm of the MARI Group has a major involvement in government-sponsored training initiatives for unemployed young people. A number of presentations of TREE have been given to government representatives and civil servants, who were very interested in the possibilities offered by TREE. Discussions are continuing over the possibility of funding to use TREE for the dissemination of job vacancies.

UMIST's main interest in TREE has been in information extraction technologies and in the use and availability of terminological resources. A number of outside bodies have expressed an interest in these technologies. For example, UMIST was recently invited to exhibit the TREE system at the EUROMAP seminar being held on January 19th in Cambridge. At this seminar there will be a number of invited speakers from a range of fields, including research, the EU and business.

Quinary’s main interest in the TREE project has been in refining technology for managing interactions between structured (database) and unstructured (text) information in a multilingual environment. This technology could be used in fields completely different from the labour market, which was the primary objective of TREE.


Within Göteborg University numerous presentations of the project at internal seminars and workshops have resulted in collaboration with other internal projects. There is also considerable interest in reusing results of the project internally; in particular developing a generation package based on the generator developed for the TREE system. As far as commercial exploitation is concerned, there is potential interest from AMS and the Swedish Immigrants’ Institute to develop a customised version of TREE to support Swedish immigrant languages.

Newcastle City Council have exploited the work done during TREE to develop public access to career opportunities within the council, which is itself a major local employer.

5.3 Exploitation planning

5.3.1 Benefits to the partners and consortium

The consortium believes that TREE is a product which has real marketing potential. Demonstrations to major employers and employment services (including the UK and Swedish Employment Services) have provoked a lot of interest, and contacts are continuing with a view to finalising contracts. The option discussed with these organisations was the hosting of job vacancies on the current TREE site. However, other TREE partners are now in the process of installing TREE and the associated Oracle software. They will then be in a position to host job vacancies for third-party organisations themselves.

Another option being examined is selling or franchising TREE software to private employment agencies who want to publicise their vacancies in different countries: at present there appears to be no other system available which has TREE’s multilingual capacity.

UMIST has submitted a proposal under the title of 'ELM Employment for Linguistic Minorities' to the Engineering and Physical Sciences Research Council (EPSRC) for funding to investigate the possibility of creating a TREE-like product for 'Non-Indigenous Minority Languages' (NIML) in the UK. It is expected that many of the TREE results and technologies will be carried over to ELM, and contact will be maintained with other consortium members.

5.3.2 Business plan

MARI

The following options have been identified as being appropriate for the market exploitation of TREE:

Hosted Service

In this configuration ACE, as a consortium member, would undertake to run and manage a TREE service, whilst MARI would be responsible for selling the ‘TREE service’ into the market, and would undertake all aspects of vacancy and service management.

Alternatively, MARI could integrate TREE into a customer’s existing Internet/intranet service, or supply it as a stand-alone element alongside an existing service. The TREE service would then be ‘branded’ with the customer’s name.

Licence Software

In this configuration MARI (and other partners) would ‘sell’:

· A complete solution comprising hardware and a software licence for TREE. Such a solution may well require the configuration of the customer’s systems, including all Internet links.

· A licence for the TREE service to run with the customer’s existing service or systems, or as a new service to augment their current services. The University of Gothenburg is likely to follow this route with the Swedish National Employment Service, AMS.

Technology Skills

MARI and other consortium members would offer the skills they have developed during the TREE project, for example in development of the database job descriptions and translation into target languages.

Linking of TREE ‘engine’ to other management information systems such as Lotus Notes. This is the route that will be followed by Newcastle City Council.

NCC

Newcastle City Council is preparing to go live with TREE in April 1999. Over 100 employers will use the system, including every area of the City Council and many of its associated bodies, including all schools in the Newcastle area. The TREE system will be fully integrated within Newcastle City Council’s Intranet and Internet sites. Via the Intranet site TREE will be accessible to over 2,000 employees of the Council.

Newcastle City Council’s Internet site is currently visited over 3,000 times per week. According to web site statistics, our main visitors are from universities and independent sector companies. The Internet site is also available to the general public in all 19 libraries in the City and through five 24-hour on-street information terminals located across the City.

The TREE system should enable the City Council to save a minimum of £15,000 in its first year of operation through reduced operating costs. Longer-term benefits have not been determined at present. Use of the TREE system will, however, help the City Council to meet its commitments to ensuring equality of opportunity and making the Council more accessible to all of its citizens.

Quinary

Although there is no current plan to directly exploit a TREE-based job search service in Italy, one of Quinary’s three current main directions for business development concerns language-enabled information management systems, where Quinary expects to get a significant share of next year’s revenue from both consulting and system integration services. In this area, the TREE project has already been presented as a clear example of a potential area of exploitation for Web-related, language-enabled services, and the technology and experience developed are a key element in Quinary’s marketing activities in this business area.

6. Appendices

6.1 Digitised audio-visual record of project achievements, runnable under Windows 95, 98 or NT

6.1.1 Screencam of TREE

A screencam of the TREE site was produced, and is included in the TREE CD-ROM.

<<MARI to produce a screencam : this is at final_report/screencam stored as both .exe and non-exe versions. Backup versions also in PC in conference room>>

6.1.3 Slide show

A PowerPoint slide show of TREE was produced, and is included in the TREE CD-ROM.

<<MARI to produce a PowerPoint slide show (has NO embedded screencam) : this is at final_report/slides/FINAL_REP.ppt>>

6.2 List of public deliverables and reports

A list of public deliverables and reports has been compiled, and can be accessed from the TREE WWW site.

<<this has been done, but the link (from the top level index.html page) is commented out until all the deliverables are finished. File containing the list is …./P03/ui/public_deliverables.htm >>

6.3 Project leaflet and/or brochure

A copy of the TREE brochure is enclosed with this report.

6.4 Papers presented at conferences, published articles, etc.

Rome, January 1997. Gilardoni L. presented a paper for Quinary at the Conferenza sul Trattamento Automatico Delle Lingue Nella Società dell’Informazione - Ministero delle Poste ("Text Classification For the Financial and Business World"). It was published in a special issue of ‘La Comunicazione’ – Istituto Superiore Poste e Telecomunicazioni. The paper dealt in general terms with LE techniques and their potential, and included a section on TREE.

The TREE project was presented at the Association for Computational Linguistics’ Fifth Conference on Applied Natural Language Processing, held in March/April 1997 in Washington DC, USA, and at several other conferences:

· “Multilingual Generation and Summarization of Job Adverts: the TREE Project”, Somers H, Nivre J, Multari A, Lager T, Gilardoni L, Ellman J and Black W, in Proc ANLP 1997 (pp. 269-276).

· “Foreign Language Information Extraction: An Application in the Employment Domain”, Ellman J, Somers H, Nivre J, Multari A, Lager T, Gilardoni L, Rogers A and Black W, in Proc UNICOM Workshop: Natural Language Processing: Extracting Information for Business Needs, March 1997.

· ‘Seeing the wood for the TREEs’, Phil Turner, Alex R. Rogers, Susan Turner and Jeremy Ellman, in Proc 3rd ERCIM Workshop “User Interfaces for All”, Obernai.

The project was also presented informally by Somers at the Australian Natural Language Postgraduate Workshop held at the University of Melbourne, January 1998, as part of the Australian Natural Language Processing Fortnight.

6.5 Contact list of Project User Group.Names

User groups (NCC, GU)

Arbetsmarknadsstyrelsen (AMS), SE 171 99 Solna, Sweden. Tel. +46 8 7306000 (Contact: Clas Almén.)
