name managment strategy for database integration …web.mit.edu/smadnick/www/wp2/1990-15.pdf ·...

Name Managment Strategy forDatabase Integration

Almudena Arcelus

May 1990 WP # CIS-90-15

NAME MANAGEMENT STRATEGY FOR DATABASE INTEGRATION

Almudena Arcelus

Master of Science Thesis in ManagementMassachusetts Institute of Technology

Cambridge, MA 02139

ABSTRACT

In this thesis, two cases of database integration are analyzed to study the problem of namemanagement. Name management is defined as the capability to recognize that two names are referring tothe same entity, even if they are not spelled in the same way. The study is done in a homogenous databaseenvironment and in two heterogeneous databases. The focus of the thesis is to establish a strategy to analyzethe name management problem in corporate environments.

The thesis studies the problem of name management among company names and determines thatthe most common problems are: a) organizational matching and b) canonical matching. This work studiestechnologies available to help solve this problem and its shortcomings and suggests a strategy to define andsolve the name management problem for corporate databases.

The most important conclusion of this work is that to solve a name management problem isnecessary to define the purpose of the database integration and its links with the strategic goals of thecompany. This step immediately suggests the path to follow to solve the name management problem. TheComposite Information System strategy, that considers the relationship between information technologies,organizational, and strategic goals of a company to solve database integration problems, is suggested to solvethis important step.

KEY WORD: name management, database integration, corporate database, composite information systems.

ACKNOWLEDGEMENTS: Thesis advisor: Stuart E. Madnick, Reader: Richard Wang, Prof. Michael Siegel,Prof. Robert Goldberg, Butch Norckauer, Bertrand Rigaldies, Natasha Collins and the assistance from theCIS/TK Research Group.

Table of ContentPAGE

INTRODUCTION

1. COMPOSITE INFORMATION1.1 Introduction

SYSTEM

1.2 Composite Information Systems1.3 Problems Encountered .....1.4 Diversity of Situations ......

1.4.1 Inter-organizational .1.4.2 Inter-divisional ....1.4.3 Inter-product .....1.4.4 Inter-model ......

1.5 Connectivity Issues and CIS . .1.5.1 Physical Connectivity1.5.2 Logical Connectivity

1.6 Purpose of the Thesis ......

Model (CIS)

2. CORPORATE DATABASES ............................2.1 Database Definition ............................2.2 Corporate Databases ..............................

2.2.1 MIT Sloan School Alumni Database ...........2.2.2 Placement Database .....................2.2.3 Reuter's Database IP/Sharp ................

2.3 A pplications ...........................2.3.1 Create an MIT Sloan School Alumni Directory ....2.3.2 Create a List of Companies Interviewing on Campus

3. ANALYSIS OF APPLICATIONS WITH CORPORATE DATABASES3.1 MIT Sloan School Directory3.2 List of Company Interviewing .......................

4. NAME MANAGEMENT RESEARCH .......................4.1 Name Management on Corporate Databases .............

4.1.1 CIS/TK ................................4.1.2 Naming Standards ..........................

4.2 Related Studies ..................................4.2.1 Spelling Correction ........................4.2.2 Approximate String matching Classification .......4.2.3 Soundex Algorithm ........................4.2.4 Direct M ailing .........................

4.3 Conclusion .................................

5. RECOGNITION PROCESS .............................5.1 Knowledge and Understanding .......................

5.1.1 Process of Understanding ..................5.1.2 Knowledge a Computer Needs to Understand .....

5.2 Bounding the Knowledge ............................

. . . . . .

. . . . . . . . . . . . . . . . . . . . . . ..

.

6. MODEL6.16.2

PRO PO SED ........................................... 40Conclusions from the Applications on Chapter 3 ......................... 40Proposed Methodology .............................................. 41

7. APPLICATIONS OF THE MODEL ....................................... 457.1 MIT Sloan School Alumni Directory ..................................... 45

7.1.1 Methodology ................................................ 457.1.2 Conclusions ................................................ 46

7.2 List of Companies Interviewing in Campus ............................ 467.2.1 Methodology ................................................ 467.2.2 Conclusions ................................................ 48

8. CONCLUSIONS ........................................... 48

BIBLIOGRAPHY .................................................... 50

APPENDIX A. Software Filters ........................................... 52

APPENDIX B. Soundex Algorithm ....................

APPENDIX C. Acronyms Used on this Paper ............

. . . . . . . . . . . . . . . . . . . . . 53

. . . . . . . . . . . . . . . . . . . . . 57

List of Figures

1 Name Management Problem on a Bank's Credit Dep1.1 Need for a Process to Link Strategic Applications, In

artment ......formation Techno

Organizational Context ......................................1.2 Composite Information Systems Model ........................2.1 MIT Sloan School Alumni Directory per Company ..............2.2 List of Companies Interviewing at Sloan ....................3.1 Create a Directory of MIT Sloan School Alumni per Company .....3.2 First Letter Distribution on the Alumni Database ...............3.3 Problems Distribution on the Alumni Database ...............3.4 Create a List of Companies Interviewing at MIT ..............3.5 First Letter Distribution on the Placement Database .............3.5 First Letter Distribution on the AD and PD .................3.7 Problems Distribution when integrating two Heterogeneous Databases6.1 Summary of the Methodology Suggested ....................

. . . . . . . . . . . 4logy, and. . . . . . . . . . . 6. . . . . . . . . . . 7. . . . . . . . . . . 11. . . . . . . . . . . 12. . . . . . . . . . . 13. . . . . . . . . . . 15. . . . . . . . . . . 20. . . . . . . . . . . 21.. . . ... .. .. 23. . .. .. .. ... 24... . .. .. ... 30.. . . ... .. .. 44

NAME MANAGEMENT STRATEGY FOR DATABASE INTEGRATION

INTRODUCTIONOn a meeting at the International Financial Services Research Center at the Sloan School of

Management one of its members says: "The number one priority in my bank is to have a system thathandles name management".

Name management is becoming one of the main problems in the information technology arena. Itis a term that is used to refer to the problem of identifying that two names on the same or on differentdatabases are referring to the same entity. For example, to recognize when and how "GTE Products Corp"and "GTE Spacenet Inc" can be considered as the same company.

The term name management refers to the need to deal with [201:

1. Canonical problems, e.g.:* Abbreviation: "IBM" vs. "International Business Machine"* Punctuation: "Teradyne Inc" vs. "Teradyne, Inc."* Capitalize: "Johnson & Johnson" vs. "JOHNSON & JOHNSON"* Termination: "IBM Corp" vs. "IBM"

2. Organizational problems, e.g.:" Changing names: "International Harvester" to "Navistar". Parent-Subsidiary-Division: When to consider that "IBM Corp" and "IBM Europe" are the same.

One example of the strategic application of name management was used at a bank in New YorkCity. The bank's credit department at the different branches had access to a database where the name ofthe same customer was written in a different way (see figure 1). So, for example, when the customerRobert Smith will ask for credit it was given immediately, but when credit was asked for the same customerunder the R. Smith name the credit was denied. This fact was affecting enormously the image of the bank.To solve the problem this bank sent every month, a tape listing 8 million names to a company specializedon reducing mailing lists. The company would get back a list of 6 million names. For the bank this becamean on-going process that lasted several months.

The company that performed this consolidation work for the bank claims a success rate of 90%for persons names and 40% for company names.

BalancePobert Smith 100,0001 Laurel St.

R. Smith 201 Laurel St.

Figure 1. Name management problems on the Credit Department of a Bank.

This thesis studies the name management problem on corporate databases and concentrates on theproblem of company names. The analysis presented here is based on two databases located at MIT: theAlumni Database and the Placement Database. Two experiments were performed. One was the reductionof the Alumni Database and the other one was the concatenation of the Alumni Database and thePlacement Database.

The results of the analysis of the applications drove me to design a model of possible approachesto consider toward the analysis of the name management problem. The model was tested for bothapplications.

The first chapter explains the model of Composite Information Systems used in the thesis to evaluatethe strategic advantage and implications of name management.Chapters 2 and 3 go more carefully over examples of name management on corporate databases. Chapters4 and 5 give theoretical foundations for the model. Chapter 6 presents the model and Chapter seven goesover the results of using the model.

The main conclusion of the thesis is that the first step to have a successful name managementapplication is to define the purpose of the database integration and its links with the strategic goals of thecompany.

1. COMPOSITE INFORMATION SYSTEMS

1.1 Introduction

Significant advances in computer and telecommunications technologies have created a wide rangeof opportunities to meet information needs and to gain strategic advantages. It has become increasinglyclear that the identification of strategic applications alone do not result in success for an organization.Madnick and Wang [8] suggest that a careful and delicate interplay between choice of strategic applications,appropriate technology, and appropriate organizational responses must be made to attain success.

Porter [14] found that information technology is changing the rules of competition by: (1) changingthe industry structure; (2) creating competitive advantage; and (3) spawning completely new businesses.These strategic goals as explained by Madnick and Wang [8], can be achieved only if the underlying technicaland organizational infrastructure can support the required deployment. (figure 1.1).

An effective corporation is one that successfully reconciles the problems and opportunities across

these three domains. It is important to recognize that no single pattern of interconnection among thesethree domains is likely to be consistently successful.

Thus, one corporation may wish to lead from its technological domain and reconcile the other two

domains accordingly. In contrast, another corporation may wish to develop its strategic applications from its

product/market choice and develop its technological and organizational capabilities accordingly. It is how

the corporation successfully matches its internal capabilities with external requirements that determines its

level of success in the marketplace.

Figure 1.1 Need for a Process to Link Strategic Applications, Information Technology, andOrganizational Context.

1.2 Composite Information Systems Model (CIS)

The Composite Information System model has been developed by the Composite InformationSystems Laboratory (CISL) at the Massachusetts Institute of Technology to try to reconcile the technicaland organizational obstacles that constrain the ability to connect systems that have been developedindependently.(figure 1.2).

In this CIS process model, strategic goals are specified based on visions, experiences, and/or theories.Using this specifications, appropriate characteristics of a CIS that meet the strategic goals can be identified.A set of technological and organization problems associated with each type of CIS is identified and matchedwith appropriate solutions.

In order to perform the matching task more methodologically, it is necessary that [81:. The critical strategic goals of an organization be clearly articulated.* The major characteristics of CIS be identified.. Technological problems and solutions be investigated.. Organizational problems and solutions be examined.. Inter-relationships among strategy, technology, and organization be studied.

TechnicalObstacles

TechnicalSolutions

OrganizationalObstacles

Organizational

Solutions

A Composite Information Systems Model

Strategic

Goals

Figure 1.2

1.3 Problems Encountered

Some of the technical problems that have been encountered [8] are: Heterogeneous hardware thatdo not communicate with each other, different software environments, independent databases, and differentdatabase management systems.

An example of organizational problems that can be identified are: Lack of standards, great degreeof divisional or departmental autonomy, and lack of communication among groups.

The interconnection of these systems can reduce costs by exploiting economies of scales throughthe sharing of resources, eliminating human intervention and tape hands-off, reducing the number of errors,and eliminating duplication of functions, thus reducing operational costs.

1.4 Diversity of Situations.

Madnick and Wang [9] have identified four categories of situations requiring such integration:

1.4.1 Inter-organizational: which involves two or more separate organizations. For example, direct

connection between production planning system in one company and order entry systems in another

company.

1.4.2 Inter-divisional: which involves two or more divisions within a firm. For example, corporate-

wide coordinated purchasing.

1.4.3 Inter-product: which involves the development of sophisticated information services by

combining simples services. For example, a cash management account that combines brokerage services,check, credit cards, and saving account features.

1.4.4 Inter-model: which involves combining separate models to make more comprehensive models.

For example, combine economic forecasting model with optimal distribution account model to analyze the

impact of economic conditions on distribution.

1.5 Connectivity Issues and Composite Information Systems.

Madnick and Wang [9] have categorized connectivity problems as first- and second-order issues.

The first-order issues, which have been subject of most of the research on distributed heterogeneous

database management system can be thought of as annoying and inconvenient problems caused bydifferences in physical connections and syntax of commands. The second-order issues refer to the difficult

problems of reconciling semantic differences between information provided from multiple sources and

inferring information that is not explicitly represented in the underlying databases.

1.5.1 Physical Connectivity:

The first-order issues are encountered immediately when attempting to provide access to and

integration of multiple information resources:. multi-vendor machines: IBM PC/RT, VAX 11/750, AT&T 3B2, etc.. physical connection: Ethernet, token-ring, etc.. different database accesses: Oracle/SQL, IBM's SQL, etc. information composition (formatting)

1.5.2 Logical Connectivity:

The second-order issues are the next step after the physical connectivity has been solved. Theseissues are harder to solve:

. database navigation: How to move through the different databases.

. attribute naming: company attribute vs. comp_name attribute.. simple domain value mapping: Dollars, Yens, Pounds.. name management: IBM Corp in one database vs. IBM in the same or in another database.

At the CISL, a system has been designed and a working prototype build to address these issues.The system is implemented in the UNIX environment to take advantage of its multiprocessingcommunications capabilities in order to provide the user with simultaneous access to multiple remote datasources.

In order to provide the physical and logical capabilities the research team has utilized artificialintelligence as well as Database Management Systems (DBMS) technologies.

1.6 Purpose of the Thesis

The purpose of this thesis is to explore one of the logical connectivity problems: name matching.I explore this problem in two corporate databases with more than one name for the same company. Ipropose a model to find the appropriate name matching strategy for a particular database.

2. CORPORATE DATABASES

Databases are playing a major role in keeping track of customers and their relevant information.These new "record keeping" systems have made possible a more sophisticated way of doing business.Marketing has the need to use the information to create "micro-worlds" to target more specific markets.The Finance Department has the need for more accurate information for making better forecasts on sales.The Human Resource department can provide better services and motivate better the employees if they areable to access information, like the company's profits forecast.

2.1 Database Definition:

According to Elmasri & Navathe [4], a database is a collection of related data. By data, they meanknown facts that can be recorded and that have implicit meaning. For example, we can consider the names

of companies with their respective addresses and sales. This information can be stored in a notebook or on

a computer.

The databases that I am studying have the following properties:1.- To be logically coherent2.- To have a specific purpose3.- To represent some aspect of the real world4.- To be computer based

2.2 Corporate Databases:

The definition of a corporate database is a set of records that contain information regarding a

company, such as address, telephone, name of employees working there, sales, return on investment, etc.

A good example of corporate databases are the ones studied at the Massachusetts Institute of

Technology (MIT) Composite Information System Laboratory (CISL), the MIT Alumni Database, thePlacement Database, and Reuter's Disclosure Database.

2.2.1 MIT Sloan School Alumni Database:

It contains records of information regarding the different companies where MIT Sloan Schoolalumni graduated from 1%8-1980 are working for. Is a database loaded in a UNIX machine running underINFORMIX-SQL.

It contains the following tables:

1.- alumnitb table (13sequence num:

prefix:first-name:

lastname:degree:birthdate:countrycode:

positioncode:

companyname:

address_11:

address_12:address 13:zipcode:

columns, 1442 rows)Number assigned by the MIT Alumni Office to each alumni (exactly 10digits).Mr., Mrs, Ms, Dr, etc (up to 10 characters)First name of alumnus as listed in Alumni directory ( up to 30 characters

Last name as listed (up to 30 characters).Degree title and graduation year (SM 1%8)Date of birth (exactly 6 digits, month/day/year, example 120356)3 letter code for each country, assigned by the MIT Alumni office (a blankis left for american nationals); example FRA for France5 digits code, where the third digit corresponds to a professional field andthe last two digits correspond to a position.Name of the company as entered by the MIT Alumni Office Staff, onecompany may appear under different names (up to 30 characters).First line of the company address, as it is entered by the staff (in somecases it is a division name); up to 30 charactersSecond line of the company address (up to 30 characters).Third line of the company address (up to 30 characters).5 digit zipcode

2.- positiontb table (2 columns, 70 rows).positioncode: See above in alumnitbposition name: Name of the position (may be repeated through the table for different

professional fields), up to 20 characters. The first letter only is capitalized(e.g. President)

3.- countrytb table (2 columns, 228 rows)countrycode: See above in alumnitbcountryname: Name of the country, all letters capitalized (e.g. FRANCE)

4.- siccodetb table (2 columns, 95 rows)sic code: Standard Industry Code (5 digits code)industry: Name of the industry

5.- sicnumtb table (2 columns, 1413 rows)sequence num: See above in alumnitbsic code: Standard Industry Code for the Industry where the alumni is involved

2.2.2 Placement Database

It contains information about the companies that are currently recruiting at Sloan School. It runs

under a UNIX computer under INFORMIX SQL. Some of the tables are:

1.- company tablecompanyname:companyid:siccode:

(3 columns, 131 rows)Company name. Ex: Rohm & Hass Co.Defined by the placement office, to match their folders. Ex: rohmStandard Industry Code. Ex: 89

2.- schedule table (4 columns, 375 rows)sdate: Date of the interview. Ex: 01/11/90schednumber: Number of schedule given by the PD. Ex: 1posnumber: Number of position offered, assigned by PD. Ex: 30intvlength: Length of the interview (min). Ex: 45

3.- position table (2 columns, 230 rows)posnumber: See above in schedule tablecompanyid: See above in company table

2.2.3 Reuter's Database IP/Sharps

Located in Toronto Canada. It serves 34 different databases. I am concentrating incalled: Disclosure.

one of them

Disclosure contains current and historical corporate time series statistics and textual informationfor over 12,000 publicly owned companies that have filed with the U.S. Security and Exchange Commission.Annual data includes balance sheets, income statements, funds flow, and key financial ratios. The databasealso provides a full corporate resume for each company as well as the latest (quarterly) price, earnings, anddividend information. The textual data includes names, ages, titles, and annual remuneration for officers anddirectors; ownership, subsidiaries, filing tables, auditor's reports, exhibits, other corporate events, managementdiscussions and president's letter for companies. This database won't be examined in here.

2.3 Applications

In this thesis I use the first two databases to analyze the problem of name management in corporatedatabases.

2.3.1 Create a Directory of the MIT Sloan School Alumni that are working for the same company.

For this application the input is going to be the records on the Alumni Databasewould look like the one on figure 2.1:

Company:Address:

Alumni:

and the output

International Paper Co77 W 45th StNew York NY 10036

Henry AsareYoonsuh Kim

Figure 2.1 MIT Sloan School's alumni directory per company.

2.3.2 Create a List of the Companies Interviewing on Campus with the date of the interview and

the name of the alumni that work for that company.

The inputs are the Alumni Database for the name of the alumni and the Placement Database forthe date of the interviews. The output should look like the one on figure 2.2.

Company: Johnson & JohnsonAlumni: Andrea Portante

Carlos Torres

Date: 01/11/1990

List of companies interviewing at Sloan, the datework for them.

of the interview and the alumni that

The Alumni Database (AD) and the Placement Database (PD) were chosen because their sizemade it easy to verify the results of the methodology suggested in this thesis.

3. ANALYSIS OF APPLICATIONS WITH CORPORATE DATABASES.

In order to better understand the process of doing name matching on corporate databases, I decidedto do the procedure manually and to record the steps. This in turn will become the basic name managementalgorithm.

3.1 MIT Sloan School Directory

The goal is to create a list of all the alumni that are working for the same company, so MIT cancreate a directory per company. See figure 3.1

This process implies to consolidate 1442 records into 1027 records. I need to find the 415 matches.I used the company name as the matching attribute to determine which companies were the same.

Manual Process

Step 1: Retrieve unique company names using the local DBMS.

Process:select unique companynamefrom alumnitb

Analysis:

a) A list of 1442 is reduced to 1089. I found 353 out of the 415 that I needed. This means 85%

reduction just using the machine's DBMS.

Figure 2.2

Henry AsareInternational Paper Co.Internationa Paper PIz77 W 45th StNew York NY 10036

L - .--.. - - J

Yoonsuh KimIntl Paper Co77 W45th StNew York NY 10036

L------ J

Alumni Database: 1442 Records

Man

N aa g

m0nt

Alumni Directory: 1027 Records

Figure 3.1 Create a directory of MIT Sloan School alumni per company.

Step 2: Print the list and analyze it.

Process:Look at the different names and find a pattern.

Analysis:

International Paper Co

77 W 45th StNew York NY 10036

Henry AsareYoonsuh Kim

a) I got the distribution of the different letters of the alphabet for the first position in the firstword:

A 8.2%B 7.5%C 9.9D 3.2E 4.6F 3.3G 3.6H 2.81 4.9J 2.3K 1.4L 1.7M 7.7N 4.040 1.2P 4.7Q 0R 3.5S 8.7T 6.2U 4.2V 1.1W 2.9X 0Y 0z 0.5

On figure 3.2 this distribution is shown.

b) On the list there were three kinds of errors:

companyname =

companyname = ts (no more information regarding this name)companyname = companyname

c) The first letter of the names was typed correctly in this database.

Percentage

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Alphabet Letters

Figure 3.2 First letter distribution on the Alumni Database

Step 3: Match the names manually.

Process: Using common knowledge go through the list and match the companies that I think arethe same:

In all the list I found 46 groups, they are:

Group Matches I made: A priori knowledge:Number Companies Problem

1 A T Kearny Inc Diff. DivisionsA T Kearny Technology Inc

2 AT&TAT&T Communications IncAT&T Technologies Inc

Diff. Divisions

3 Air Products & Chemicals IncAir Products and Chemicals Inc

4 Allied-Signal CorpAllied-Signal Inc

5 Amoco Canada Petroleum Co LtdAmoco Corp

6 Arco Chemical CoArco Intl Oil & Gas Co

7 American Express Bank LtdAmerican Express TRS

8 Arthur D Little IncArthur D Little Intl

9 Arthur YoungArthur Young & Co

10 Boeing Commercial Airplanes CoBoeing Computer Services Co

11 Bull CPBull Groupe

12 Chase Manhattan Bank NAChase Manhattan Overseas Corp

13 CitibankCitibank FranceCitibank IntlCitibank NA

14 CiticorpCiticorp Investment BankCiticorp Investment Mgmt Inc

15 Continental BankContinental Illinois Natl Bank

16 Coopers & LybrandCoopers & Lybrand Consltng Grp

17 E F HuttonE F Hutton & Co Inc

18 Exxon Biomedical SciencesExxon Central ServicesExxon Chemical

& = and

= Inc

Diff. Divisions

Diff. Divisions

Diff. Divisions

Diff. Divisions

& Co at the end

Diff. Divisions

= Groupe ??

Diff. Divisions

Diff. DivisionsCitibank = Citicorp?

Diff. DivisionsCitibank = Citicorp?

Same Banks ??

Diff. Divisions

& Co Inc at end

Diff. DivisionsTypo Chemical, Chemicals

Chemical AmericasChemical USAChemicals USACompany USACorporationIntl CoOffice Systems CoRsch & Engineering Co

Americas = USA??

19 FordFordFord

Espafia SAMotor CoMotor Credit Co

Diff. Divisions

20 GTE Products CorpGTE Spacenet CorpGTE Telenet IncGTE-General Telephone of Calif

21 General Motors CorpGeni Motors Corp

22 Georgetown Consulting GrpGeorgetown Consulting Grp Inc

23 Harvard Business SchoolHarvard Sch of Public HealthHarvard University

24 HoneywellHoneywell Inc

25 IBM Canada LTDIBM CorpIBM Europe

26 ITP Boston IncITP Italiana SLR

27 Index Group IncIndex Systems Iac

28 International Paper CoIntl Paper Co

29 Kendall CoKendall Square Associates

30 Mass Gen HospMassachusetts General Hospital

31 McKinsey & CoMcKinsey & Co Inc

Diff. Divisions

Equivalence

Inc at the end

Diff. Divisions

Inc at the end

Diff. Divisions

Diff. Divisions

Diff. Divisions

Intl = International

Diff. Companies??

Acronym

Inc at the end

ExxonExxonExxonExxonExxonExxonExxonExxon

32 Merrill Lynch & CoMerrill Lynch Capital MarketsMerrill Lynch PierceMerrill Lynch Pierce Fenner

33 Mgmt Sci for Health MSH/NepalMgmt Science for Health

34 Mitsubishi Bank LtdMitsubishi Bk LtdMitsubishi Corp

35 New York UnivNew York Univ Med Ctr

36 Nomura Securities Co LtdNomura Securities IntlNomura Securities Intl Inc

37 Ryder System IncRyder Truck Rental Inc

38 SRI InternationalSRI Intl

39 Searhson Lehman Bros IncShearson Lehman Bros Inc

40 Shell ChimieShell Intl Petroleum CoShell Kosan KKShell Oil Co

41 Siemens AGSiemens Communication Sys Inc

42 Standard Oil CoStandard Oil Co of Ohio

43 Sumitomo Metal Inds LtdSumitomo Special Metals Co Ltd

44 Technical Marketing Assocs IncTechnical Mktg Assocs Inc

45 Wells Fargo BankWells Fargo Bank NA

46 Westinghouse Elec CorpWestinghouse Electric Co

Diff. Divisions

Diff. Divisions

Bk = BankDiff. Divisions??

Diff. Divisions

Inc at the endDiff. Divisions

Diff. Divisions??

Intl =International

Misspelled

Diff. Divisions

Diff. Divisions

Diff. Divisions

Metal = SpecialMetals??Diff. Divisions

Mktg = Marketing

NA at the end

Elec = Electric

Analysis

a) I found theclassification:

Problem

Organizational

Equivalence

Termination

Acronym

Misspelling

Not sure of

reasons why the local SQL hadn't find them in the first place. I made the following

Recurrence

52%

20%

17%

4.5%

4.5%

2%

Example

Ex: AT&T CommunicationsAT&T

Ex: Intl = Internationaland = &

Ex: & company& co inc

Ex: Mass Gen HospMassachusetts General Hospital

Ex: Searhson Lehman Bros

Ex: Mgmt Sci for Health NSH/NepalMgmt Science for HealthGroup = Systems

See figure 3.3 for a graphic description of the problems

Organizational: Means all the companies that by legal or internal organization reasons have calledtheir companies or divisions with different names or qualifiers. But by no means the fact that they are calleddifferently is a mistake. For example, Ford Espafia S A and Ford Motor Co, the first one is a subsidiaryof the Ford Motor Co and they should be named differently.

Equivalence: There are certain words that arelist of equivalence that I have found more commonly

and & (not in the middle of a word)Assocs = Association = AssocietsBk = BankCo = Corp = Corporation = LTD = CompaGnl = Genl = Gen = GeneralGpe = Grp = GroupInc = IncorporatedInfo = InformationInst = InstituteIntl = InternationalMgmt = ManagementMktg = MarketingNatl = NationalSys = SystemSoc = Societe = SocietyTech = TechnologyUniv = University

used as synonyms in the context of a company. Theused are:

Organizat! 315 141

Not-sure-of2%Misspelling

5%Acronym

5%Equivalence20%

Terminatior

Problem -distribution on the Alumni Database

Termination: There is a high tendency to end the names of the companies differently. The mostcommon ones were:

& CoIncNA

Acronym: This is an interesting problem, because the acronyms can mean many things. In thisdatabase I had an easy example: Mass Gen Hosp = Massachusetts General Hospital.

Misspelling: This is a classification where I considered letter transposition, and typographical errors.

Figure 3.3

Not sure of: Was a classification used for the names of the companies that I was not sure werethe same ones, just looking at the name. Examples of these were:

Mitsubishi Bank = Mitsubishi CorpKendall Co = Kendall Square AssociatesMgmt Sci for Health MSH/Nepal = Mgmt Science for Health

3.2 List of company interviewing. the date and the alumni that work for that company

The goal is to create a list with the date that a company is interviewing on campus and the namesof Sloan School alumni that work for that company. See figure 3.4.

Alumni Database

1442 Records

Placement Assitance Service

Johnson & Johnson Intemational01/11/1989

/ 131 Records

Company and Alumni Information: 69 Records

List of companies interviewing at MIT, the date of the interview, and the name of thealumni working for those companies.

Johnson & JohnsonAndrea Portante

Johnson & Johnson CoCarlos Torres

Figure 3.4

I now need to create a list of 69 records that will be the result of integrating the Alumni Database(AD) of 1442 records and the Placement Database (PD) with 131 records.

Step 1: Retrieve unique company names using the DBMS in place from both databases:

Process:select unique companynamefrom alumnitb

select unique companynamefrom company

Analysis:

a) After reduction, from the Alumni Database I got the same 1089 records as with the

first application.b) From the Placement Database I didn't get a reduction. I got the same 131 records.

Step 2: Print the lists and analyze them.

Process:Look at the different names and analyze them.

Analysis:

a) I got the distribution of the different letters of the alphabet for PD:

Company Name Percentage(%)

Number 0.8A 13B 12C 10D 3E 3F -3G 3H 1.5I 5J 1.5K 2L 0.8M 9N 40 2P 8Q 0.8R 2

See figure 3.5 for a graphic distribution of the first letter and figure 3.6 for a graphic distributionof AD and PD together.

b) One errorcompanyname = companyname

c) The first letter of the name was always typed correctly.

14-

12

10

8Percentage

6

4

2

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Alphabet Letters

Figure 3.5 First letter distribution of the company name on the Placement Database

Percentage

IABCDEFGHIJKLMNOPQR

Alphabet Letters

ALUMNI

0 PLACEMENT OFFiCE

Figure 3.6 First letter distribution of the company name

Step 3: Match the two lists manually.

Process: Use common knowledge and match them. Icorrect and 62 didn't match.

GroupNumber

I

Matches I made:

AD A T Kearny IncA T Kearny Technology Inc

PD A. T. Kearny

on the AD and PD databases.

found 69 matches, from which just 7 were

A priori knowledge:

Diff. DivisionsPunctuationTermination

.n.lI.iST UVWX Y Z

I 11-1. 1111filk,

2 AD Air Products & Chemicals IncAir Products and Chemicals Inc

PD Air Products & Chemicals

3 AD AmericanAmerican

PD American

4 AD AmericanAmerican

PD American

Express Bank Ltd Diff. DivisionsExpress TRS AcronymExpress Travel Related Services

Management Sys Inc Sys = SystemManagement System Inc Sys = SystemsManagement Systems, Inc Punctuation

5 AD Amoco Canada Petroleum Co LtdAmoco Corp

PD Amoco Production Company

6 AD Analog Devices IncPD Analog Devices, Inc.

7 AD Analysis Grp IncPD Analysis Group

8 AD Apple Computer IncPD Apple Computer, Inc.

9 AD AT&T

Diff. Divisions

Punctuation

Grp = GroupTermination

Punctuation

Diff. Divisions

AT&T Communications IncAT&T Technologies Inc

PD AT&T

10 AD Bain & CoPD Bain & Company, Inc

11 AD Barra ConsultingPD Barra

12 AD Baxter Travenol Labs IncPD Baxter Healthcare Corporation

13 AD BayBanks IncPD BayBanks Systems, Inc.

14 AD Bear Steams & Co IncPD Bear Steams & Co., Inc.

15 AD Boise Cascade CorpPD Boise Cascade Corporation

16 AD Booz Allen & Hamilton Inc

Co = CompanyPunctuationTermination

Termination

Diff. Divisions

TerminationPunctuation

Punctuation

Corp = Corporation

Punctuation

& = and

Termination

PD Booz-Allen & Hamilton, Inc.

17 AD Boston Consulting Group IncPD Boston Consulting Group

18 AD British Petroleum Co PLCPD BP America

19 AD Braxton Associates IncPD Braxton Associates

20 AD Canaan GroupPD Canaan Group Ltd. (The)

21 AD Chase Manhattan Bank NAChase Manhattan Overseas Corp

PD Chase Manhattan Bank

22 AD CitibankCitibank FranceCitibank IntlCitibank NACiticorpCiticorp Investment BankCiticorp Investment Mgmt Inc

PD Citicorp/Citibank

23 AD Codex CorpPD Codex Corporation

24 AD Conoco IncPD Conoco, Inc.

25 AD Drexel Burnham Lambert IncPD Drexel Burnham Lambert

26 AD Dun & Bradstreet CorpPD Dun & Bradstreet Corporation

27 AD ExxonExxonExxonExxonExxonExxonExxonExxonExxonExxonExxon

PD Exxon

Biomedical SciencesCentral ServicesChemicalChemical AmericasChemical USAChemicals USACompany USACorporationIntl CoOffice Systems CoRsch & Engineering CoCorporation

Termination

Diff DivisionAcronym

Termination

TerminationPrefix

Diff. DivisionsTermination

Diff. Divisions

Corp = Corporation

Punctuation

Termination

Corp = Corporation

Diff. DivisionsTypo Chemical,ChemicalsAmericas = USA??

28 AD First Boston CorpPD First Boston Corporation

29 AD Ford Espada SAFord Motor CoFord Motor Credit Co

PD Ford Motor Company

30 AD General Electric CoPD General Electric Company

31 AD General Motors CorpGenI Motors Corp

PD General Motors Corporation

32 AD Goldman Sachs & CoPD Goldman, Sachs & Co.

33 AD Hewlett-Packard CoPD Hewlett-Packard Company

34 AD Hughes Aircraft CoPD Hughes Aircraft Company

35 AD IBM Canada LTDIBM CorpIBM Europe

PD IBM Corporation

36 AD Index Group IncIndex Systems Inc

PD Index Group, Inc.

37 AD Information Resource IncPD Information Resource Inc.

38 AD Intl Finance CorpPD International Finance Corporation

39 AD J P Morgan Investment LtdPD J.P. Morgan & Co., Inc.

40 AD Johnson & JohnsonPD Johnson & Johnson International

41 AD Manufacturers Hanover CorpPD Manufacturers Hanover Trust Company

42 AD Mars & CoPD Mars, Inc.

Corp = Corporation

Diff. DivisionsCo = Company

Co = Company

EquivalenceCorp = Corporation

Punctuation

Co = Company

Co = Company

Diff. DivisionsCorp = Corporation

Diff. DivisionsPunctuation

Punctuation

Equivalence

PunctuationDiff. Divisions

Diff. Divisions

Diff. Divisions

Equivalence

43 AD McKinsey & Co Inc at the end

McKinsey & Co IncPD McKinsey & Company, Inc.

44 AD MerrillMerrillMerrillMerrill

PD MerrillMerrill

Lynch & CoLynch Capital MarketsLynch PierceLynch Pierce FennerLynch & CompanyLynch Capital Markets

Punctuation

Diff. Divisions

45 AD Morgan Stanley & Co IncPD Morgan Stanley & Company, Inc.

46 AD Nomura Securities Co LtdNomura Securities IntlNomura Securities Intl Inc

PD Nomura Securities International, Inc.

47 AD Northern Telecom IncPD Northern Telecom, Inc.

48 AD Norton CoPD Norton Company

PunctuationCo = Company

Inc at the endDiff. DivisionsPunctuationEquivalence

Punctuation

Co = Company

49 AD Oracle Corp Corp = CorporationPD Oracle Corporation

50 AD Owens-Corning Fiberglas Corp PunctuationPD Owens/Corning Fiberglas Corporation Equivalence

51 AD Pfizer Taito Co Ltd Diff. DivisionsPD Pfizer, Inc.

52 AD Procter & Gamble Co Co = CompanyPD Procter & Gamble Company

53 AD Prudential Ins Co of America Diff. DivisionsPD Prudential Capital Corporation

Prudential Corp. Finance/Prudential CapitalPrudential Portfolio Management GroupPrudential Realty GroupPrudential-Bache Capital Funding

54 AD Putnam Hayes & Bartlett Inc Diff. Divisions

PD Putnam Companies PunctuationPutnam, Hayes & Bartlett, Inc.

55 AD Salomon Brothers Inc Punctuation

PD Salomon Brothers, Inc.

56 AD Shawmut Corp Diff. Divisions

PD Shawmut National Corporation

57 AD Smith Barney Harris Upham & CoPD Smith Barney, Harris Upham & Co., Inc

58 AD Temple Barker & Sloane IncPD Temple, Barker & Sloane, Inc

59 ADPD

Teradyne IncTeradyne, Inc.

Punctuation

Termination

Punctuation

Punctuation

60 AD United TechnologiesPD United Technologies

61 AD World BankPD World Bank (The)

62 AD Xerox CorpPD Xerox Corporation

Corp OrganizationalCorp./Otis Elevators Co.

Punctuation

TerminationPrefix

Corp = Corporation

Analysis

a) Find the reasonsas follows:

why a program could not match them in the first place. The classification was

Recurrence Example

Organizational

Punctuation

Equivalence

Termination

Prefix

Acronym

Misspelling

28%

28%

25%

14%

2%

2%

1%

Ex: AT&T CommunicationsAT&T

Ex: Apple Computer IncApple Computer, Inc.

Ex: Intl = Internationaland=&

Ex: & company& co inc

Ex: World BankWorld Bank (The)

Ex: British Petroleum Co PLCBP America

Ex: Exxon Chemical USAExxon Chemicals USA

See figure 3.7 for a graphic representation of the problems.

Problem

Orgaruzational 28%

Punctuation 28%

MissperingPrefix 2%

Acronym 2%

Termination 14%

:E,,, I a

Figure 3.7 Problems distribution when integrating two heterogeneous databases.

I added two more classifications:

Punctuation: All the problems that come from commas, slashes, periods, etc.

Prefir. For the problems of the articles before the names such as: The Canaan Group.

With these two examples of name management problems in mind I am going to enter chapter 4

and chapter 5 to snmmarize the research that has been done previously and related to name management

and to give an overview of learning and understanding theory. On chapter 6 1 will present a model of the

problem of name management in corporate databases based on the observations made in this chapter.

4. NAME MANAGEMENT RESEARCH

The problem of name matching has been an important concern in the research done at theComposite Information System Laboratory (CISL) at MIT. The first part of this chapter concentrates onthese efforts and on the second part I will review name management fields other than database integration.

4.1 Name Management on Corporate Databases

4.1.1 Composite Information System Tool Kit (CIS/TK)

The research done at CISL is reflected in different implementations done with a prototype systemcalled CIS/TK (Composite Information System Tool Kit). CIS/TK is a combination of artificial intelligence,networking and database management systems (DBMS) technology that solves the problem of communicatingheterogeneous databases. These databases can be inter- or intra-organizational.

To solve the problem of logical connectivity and name management, Paget [131 suggests the useof synonym tables that can facilitate joining databases which follow different data conventions: for example,indicating that "US", "USA", "United States", and "United States of America" are all the same within thecompanyname columns. Translation tables and facilities can determine more complex translations: forexample, a marketing database which rates a product from "poor" to "excellent" may be mapped to aproduction division's quality test in which products are ranked on a scale from 1 to 10.

Wang, Madnick, and Horton [181, designed an interesting technique called inter-database instanceidentification. It employs a combination of database management systems and artificial intelligence techniquesto solve the instance matching problem. Common attributes in the disparate databases are applied first toreduce the number of potential candidates for the same instance. Other attributes in these databases,auxiliary databases, and inference rules are exploited to identify the same instance. This work provides asolid base for further optimization and extension of the identification problem.

Godes [51 makes an interesting use of the CIS methodology and goes through three financial cases.He defines more broadly the concept of logical connectivity. He presents a list of some of the few problemsthat a CIS designer can expect to face in solving the name matching problem:

. Inter-Database Instance Identification: How does one ensure that, for example, company-leveldata for the same company is retrieved from databases that use different formats (and values) for theircompany identifiers? While General Motors may be known as "General Motors, Inc." in one database,another might represent it as "General Motors Incorporated, USA". While a person has little difficultyresolving that the two are the same, the normal computer has no analog to this reasoning capability.

. Levels of Granularity: This can be seen at several levels. For example, at the company level,one database may provide information for General Electric desegregating all of its operating groups, suchas NBC and Kidder Peabody, while others may simply assume all financial divisions under "GE". Further,at the attribute level, one company may provide detailed financial data through on-line databases, whileothers provide annual-report-like highly aggregated information. Clearly, comparison between these twocompanies would be extremely difficult given these different levels of granularity.

. Concept Inferencing: Often, the specific attribute that the user is seeking is not explicitly in anyof the data sources. However, by using several of them in concert, that attribute might be inferred. Thegoal is for the CIS to be able to acquire enough information to be able to perform certain levels ofinferencing on its own.

A suggested method was developed by Madnick and Wang et al. 1191. To solve the problem theysuggest a combination of three techniques: key semantic matching, attribute semantic matching, andorganizational affinity.

. Key Semantic Matching: To match the two IBM's, they use rules such as "Corp." and "Inc."suffixes as equivalents. These rules can be context dependent. As an example, they don't want"Inc Magazine" to be matched with "Corp Magazine".

. Attribute Semantic Matching: The two MIT's are harder cases since it is unlikely that one wouldhave to rule that "M" and "Mass." were equivalent. Instead one can identify the match based uponthe attributes (both have CEO "Paul Gray" and HO city "Cambridge").After semantic instance matching (both key and attribute)) has been performed, the results can besaved in the Inter-Database Instance Identification Table. Thus, subsequent join operations ca beaccomplished rapidly through automatic use of this table.

- Organizational Affinity: In many cases there exists an affinity between two distinct organizations.For example, the "IBM Federal Systems Division" is distinct from "IBM Corp" (i.e. it is a divisionof IBM Corp). Also, although "Continental Airlines" is a separate corporation from "Texas AirCorp" Texas Air Corp owns Continental. Thus, depending on the purpose of the query, it may bedesirable to treat two distinct entities as being the same (e.g. from marketing perspective Continentaland Texas Air may be merged, from a legal liability perspective they should be kept separate). Theorganizational affinity facility requires two steps: First, external knowledge must be supplied todescribe the nature of the affinity between the organizations -- this may be supplied manually orthrough processing of other organizational database. This information is held within the Inter-Database Instance Affinity Table. Second, a context can be provided for any query or session (e.g., "if entity A owns more than 51% of entity B, then treat them as matching entities") which willbe used to define generalization the instance matching process.CISL's technique is a non-invasive technique, i.e. it leaves the systems the way they are now.

4.1.2 Naming Standards

Another approach to solve name management problems on corporate databases is by starting fromscratch. [22,16] suggest a software that creates a data dictionary, so each time a name is entered into thedatabase it will have to look first on the data dictionary for its existence. If it exists it will entered themnemonic into the database, if it doesn't exist the program will create the new mnemonic asking for certaininformation.

This is an interesting approach that assumes that the customer can implement this data dictionary,by creating a new database and data dictionary from the beginning. This is called referential integrity andis difficult to do between separate databases.

As not much research has been done on corporate databases to solve the name managementproblem, the study was extended to other fields with a related problem to explore different approaches tothe problem.

4.2 Related Studies

4.2.1 Spelling Correction

As explained by Roach, Berghel and Talburt [151 a string is defined as a linear sequences ofsymbols defined over some alphabet. Such strings, and operations involving them are frequently encountered

in word processing and text editing applications. Global search and spelling correction features are typicalstring-oriented procedures.

In the case of search operations, an attempt is made to exactly match the target string with a tokenwithin an electronic document. The matter is more complicated with spelling correction.

Berghel, Roach and Talburt [11, give an interesting category table of spelling errors:

I. Typographical Errors (In decreasing order of occurrence)

1. character transposition2. one extra character3. one missing character4. one wrong character5. two extra characters6. two missing characters7. two characters transposed around a third

II. Knowledge Errors

A. orthographical

1. consonantsa. doubled consonants

i. unnecessary doubling (e.g., gallactic)ii. failure to double (e.g., mispelling)

b. mistaken consonant order (e.g., nmemonic)

2. vowelsa. dropped vowels (e.g., mathmatics)b. added vowels (e.g., mayorality)c. mistaken vowel order (e.g., heirarchy)

B. phonological

1. errors from pronunciation (e.g., numonia)2. homonymal errors ( e.g., boar for bore)

C. syntactic

1. prefix confusion (e.g., priempt)2. suffix confusion (e.g., policys)3. gender confusion (e.g., fianc6 vs fianc6e)

D. semantic

1. confusing unrelated words with similar spellings (e.g.,diary and dairy)2. confusion based upon plausible substitutions (e.g., cold slaw)

In spelling correction, one attempts to match a target with a set of similar string in a lexicon. Thereare then two kinds of algorithms: a) Exact string matching like: Knuth-Morris-Pratt, Boyer-Moore, and

Rabin-Karp which achieve sub-linear complexity; b) Approximate string matching like: Levenstein's metricand non-metric n-gram analysis.

In a related article Hall and Dowling [7] present a summary of studies done regarding the kindsof spelling errors, they report that in computer-based information systems, errors of typing and spellingconstitute a very common source of variation between strings. The most common errors while typing isthe substitution of one letter for another, followed by the omission of a letter and then the insertion of aletter. In the investigation of typing errors in a number of bibliographic databases, research has found asmany as 22.8 % of its index terms to be misspellings in one database and as low as 0.4% in another, withan average of 10.8% over all the databases sampled [7,23]. Usually over 80% of all typing errors are singlecharacter omission and insertions, substitutions,and the reversal of adjacent characters [7,24]. Spelling errors,by contrast, may be phonetic in origin. Phonetic variations are particularly common in transliterations, asin the example "Tchebysheff" and "Chebyshev" [7].

The motives of error investigation studied for Hall and Dowling are:

* concern with the quality of information retrieval and advise better controls to reduce theproportion of these errors.. concern with error recovery and error correction in compilers.. relations of errors as an indicator of programming competence

Optical character recognizers and other automatic reading devices introduce similar errors ofsubstitutions , deletions, and insertions, but not reversal. The frequency and type of errors are characteristicsof the particular device. Pattern recognition researches seek to "correct" these errors using "context"information, either by finding the best match among repertoire of possible inputs or by using a generallinguistic structure.

Many approaches to speech recognition deal with strings of phonemes or symbols representingsounds, and attempt to match a spoken utterance with a directory of known utterances. Variations instrings can be due to "noise" when one phoneme is substituted for another similar to it, or phonemes areomitted or inserted, but again not transposed. Another source of variation in phoneme strings is theduration of the spoken word. While words and phrases can be spoken at various speeds, speech to phonemetransducers often work at fixed time intervals, and thus slow speakers produce longer sequences of the sameor similar phonemes.

4.2.2 Approximate string matching classification

Hall and Dowling [7] review the approximate string matching algorithms available when there maybe a spelling mistake or other error in the keyword. The methods that they found are classified as either:

a) Equivalence: If two strings which are superficially different can be substituted for each other inall context without making any difference in meaning, they are equivalent.

Common examples of equivalence are alternate spelling of the same word, the use of spaces asformatting characters, optional use of upper- or lowercase letters, and alternative scripts. For example, allthe following strings might be considered equivalent:

Data Base, data-base, data base, DATABASE

database, d a t a b a s e, Database

A very important example in keyword searching in information retrieval is the treatment of allgrammatical variances of a word as far as retrieval is concerned. Normally, mechanisms here attempt toreduce words to their stem or root and then to treat all words that can be reduced to the same stem asequivalent.

Synonyms are more properly considered as similarities.

Abbreviations can be viewed as alternative spelling and thus as equivalences. For example, Ltdfor Limited. In general this is not possible since several words may have the same abbreviation. Forexample, St. for both Saint and Street.

Hall and Dowling suggest that the use of a set of rules can solve the equivalence problem. Forexample, they use the following set of rules for producing a canonical form for English and Americanspelling:

Rule 1: Change internal z's to s's when preceded and followed by a vowel or y.

Examples: razor, analyze, realizeCounterexamples: hazard, squeeze

Rule 2: Replace all internal occurrences of 'ph' by 'f'.

Examples: sulphur, peripheral, symphonyCounterexamples: uphill, haphazard

Rule 3: For words of at least six letters, replace a word ending 'our' by 'or'.

Examples: flavour, humourCounterexamples: devour

Rule 4: After removing endings such as 'e', 'ate', and 'ation', replace the endings 'tr' by 'ter'

Examples: centr(e), filtr(ate)

b) Similarity: By far the most usual understanding of "approximate" or "like" is that of similaritybetween two strings. By some inspection process, two strings can be determined to be similar or not. Theimportant property of similarity which makes it very different from equivalences is that similarity is notnecessarily transitive.

Synonyms are considered as similarities. In all languages there are many words which mean moreor less the same things. Hall and Dowling consider the following example taken from Roget's Thesaurus:

Gun Rifle Cannon Revolver

There is a tendency to think of synonyms as equivalences, but if one looks at the example

Hot Warm Cool Coldthey notice that the synonymity is not transitive. Hot is no synonymous with Cold. However, whensynonyms are controlled by a thesaurus, they are often treated as equivalent, often referring to the variousalternative words as denoting a particular concept.

One of the solutions that Hall and Dowling suggest to measure the similarity between two stringsis the Soundex algorithm, which reduces all strings to a "Soundex code" of one letter and three digitsdeclaring as similar all those with the same code. However, the relationship of having the same code is anequivalence relation, but the string matching problem this proposes to solve is a similarity problem. Notsurprisingly, the Soundex method and other methods like it can sometimes go very wrong. Yet theseapproaches can provide significant extra flexibility to systems that use them. They give two references ofsuccessful application of the Soundex method in a hospital patient index and in airline reservations.

4.2.3 Soundex Algorithm.

The idea is to transform the name into a Soundex code of four character in such a way that like-sounding names end up as the same four characters. The first character is the first letter of the name.Thereafter numbers are assigned to the letters as follows:

0 A EIOU H W Y1 BFPV2 CGJKQSXZ3 D T4 L5 MN6 R

Zeros are removed, then runs of the same digit are reduced to a single digit, and finally the codeis truncated to one letter followed by three digits. For example, while DICKSON and DIXON are assignedthe same code of D25, RODGERS and ROGERS are not assigned the same code.

Variations of this Soundex algorithm are common to solve these problems. Some of them attemptto estimate the likelihood that a word is correctly spelled by appeal to linguistic regularities.

4.2.4 Direct Mailing

Another application of string matching algorithm is in the area of direct mailing businesses. It isgrowing as more information is available on targeting specific problem. There are two kinds of companiesthat deal with direct marketing lists:

a) Companies that provide the service of making lists.b) Companies that write the software to make direct mailing list. ( some of them also provide theservice).

The companies that just provide the service usually use the software developed by other vendorssuch as: Group 1 Software Company or for mailing lists any other software approved by the U.S. PostalOffice.

One example of a company that provides the service and develop their own software is: HarteHank. The process that they use to solve the problem is:

i) Normalization: Parsing the lines of a file, so they can identify which line contains: customername, title, company name, street name, city, phone number. This process is crucial, because this

information is piped into the next filter.

ii) Standardization: Once you have identify the address lines then they change all the Street atthe end for St. and the Boulevard for Blvd.

iii) Zip Code Validation: Every record is checked for the validity of its address.

It seems that the household databases are easier to match than the corporate ones. For examplethe rate of success for households according to their statistics is 98% and for corporate databases is just40%. This is due to the fact that usually the records on the corporate databases have more than one lineto refer to one person including titles, divisions, company names, or P.O. Box. These kinds of records areharder to parse, and that is the reason for their failure. The way Harte Hank solves the problem is bycustomizing the software to a specific application. For example they once had to match the same personsin a Canadian database, they customized the software so the computer would recognize that married womenin Canada add their husband's name before their last name and add a hyphen, so Juliet Smith is the sameperson than Juliet Perez-Smith. After the customization of the software they identified problems and passedthe "doubtful" data through a human interface.

Working for the CISL, Norckauer [121 makes a deeper study of the implications and problems ofduplication entry detection in mailing and participation lists.

43 Conclusion

There are many applications and algorithms that one can consider when solving a related namemanagement problem, but with every method the analyst has the risk of both Type 1 error (where a recordthat should have been declared as a duplicate is not),and Type 2 error, (where a record that should not havebeen declared as a duplicate was) occurred. So in addition of establishing the definition for a duplicate, onemust also assess the effect of the Type 1 and 2 errors.

5. RECOGNITION PROCESS

From Schank's [17] studies, one can get some ideas of what is the process by which a person readsand understands what is reading. This information is relevant, because it helps to analyze the manual namemanagement recognition process and to determine some possible heuristic rules.

5.1 Knowledge and Understanding

Knowledge: The crucial determinant in what can be read by anyone, is the amount of backgroundknowledge they possess to help them determine the meaning of what they are reading.

A large part of the understanding process is to attempt to make explicit what is implicit in asentence, or a situation in general. Language is a means of conveying information. But frequently whatis expressed in an actual sentence is only a small part of what the speaker wishes to convey. Often, muchof the intended information is left implicit. It is thus to job of our memories to fill in what has been leftout. Ex: when one sees MIT, one has to remember that it is referring to Massachusetts Institute ofTechnology.

One learns to make guesses or predictions about what one hears, in order to help us fill in thedetails of what one has been told.

5.1.1 Process of Understanding

Schank divides the process of understanding in four parts:

1.- Analysis, that has several subprocess:i) Lexical look-up. To assign a meaning the words a person reads, the person must determine

the meanings of them individually. To do this people employ a kind of mental dictionary.ii) Event Identification: Is the process of fitting the concepts that have been obtained lexical

look-up into an event or state description. Example: MIT, once identified the name as Massachusetts Instituteof Technology, the persons thinks that it is likely that MIT is a University.

2.- Inference-Making: Is the process of making best guesses about what a speaker must have meant,apart from he said explicitly. For example: When I read American Express Travel Related Services in thePD database and then I saw the company name on American Express TRS. I inferred that TRS stands forTravel Related Services and that both names were the same.

3.- Memory: Is not a separate enclave, divorced from the processes of inference-making or analysis.In fact, each of those processes needs to consult memory as a repository of information. Memory providesthe answer to: What does American or Express mean for the lexical look-up process? Memory is thus agiver of information to process rather than a process itself.

One of the kinds of information present in memory is called a belief. A belief is an attitude or afeeling a person has towards an event or state of the world. People have beliefs about many kinds ofthings. As understanders, one uses or consults these beliefs when one decides what is proper to do, aswell as when one attempts to decide why someone is doing something.

Other kinds of information available in memory include: facts, knowledge about episodes in one'sown life, knowledge about history and current events, etc. In other words, one understands everythingone hears in terms of what one already knows and has stored in our memories.

4.- Event Connection: The last process in understanding is called event-connection. In event-connection, one attempts to find out if the new input one has received fits in with any beliefs, contradictoryfacts, or other information that will help to explain, or connect together, the new event or states of theworld which one has just been informed.

For example, I read the company name on the PD database: Amoco Production Company. I tryto match it to the AD database, and I don't find an identical entry, so I infer that the two entries withthe same first name can be related to it:

Amoco Canada Petroleum Co LtdAmoco Corp

I know that:

* Oil companies incorporate every new division as an independent entity, so the corporation as awhole can't be sued.

. Organizationally speaking it is useful sometimes to divide the companies.

So I determine that all the Amoco companies can be grouped together, because in fact they belongto the same corporation.

5.1.2 Knowledge that a computer needs to understand

Schank also relates the knowledge that an adult has with the knowledge that a computer needs inorder to understand stories. The summary of his results in the Artificial Intelligence field are:

1.- Make Simple Inferences2.- Establish Casual Connections3.- Recognize Stereotyped Situations4.- Predict and Generate Plans5.- Track People's Goals6.- Recognize Thematic Relationships between individuals and Society.7.- Access and Utilize Raw Facts.

The seven kinds of knowledge roughly categorize what a persons knows about the world, theseseven types referred above help a reader to interpret what she/he hears, sees, and reads, which in turnhelp her/him determine the import of what sh/he has read. To see how this works, Schank goes throughthe following example:

"John hated his boss. He went to the bank and got twenty dollars. He bought a gun. The nextday at work he decided to ask his boss for a raise. But John was so upset by his own plan that he toldhis boss he was sick and went home and cried."

This is a simple story, on the surface. Beneath the surface of simple words are some complicatedideas that require and adult's understanding of the world to interpret.

1.- Inference: In order to understand a sentence fully, it is necessary to draw conclusions from thatsentence about the things that were not explicitly stated but which nevertheless are true. People are rarelyaware that they have made one when that inference is violated or in error for some reason.

Some inferences necessary for understanding the above story are:

a) After buying the gun, John has the gun, i.e., buying implies having.b) The gun cost twenty dollars, i.e., buying requires money.

2.- Casual Connections: Adults have an understanding of how one event relates to another. Adultsattempt to determine the casual relationships inherent in what they are trying to understand. One Casualrelationship in the above story is:

Going to the bank enabled John to get moneyi.e., "going" can enable actions that ordinarily takeplace at the location arrived at.

3.- Stereotyped -Situation (Scripts). People have a great deal of information about stereotypedsituations. In this story those that are referred to are banks, stores, and offices.

4.- Plan Prediction and Generation: To understand this story fully, it is necessary to postulate aset of possible plans under which John is likely to be operating. In order to postulate such plans, however,one has to be able to generate them oneself. Even the simplest of plans is hard to follow if you have notlearned how to follow another person's plan.

5.- Goal Tracking: What is John going to do and why is he doing it? these are the questions thatoccur as one reads this story. However, answering such questions requires knowing about goals such as beingwell treated, respected, well paid or whatever complex set of goals are reasonable to postulate inunderstanding this story.

6.- Thematic Relationship: Understanding this story requires a good assessment of how an employeemight feel towards a boss, and an interpretation of hatred in this context. Another important thematicrelationship is the fear of being an outcast, immoral person, or a criminal, all of which one can imagine tobe something going through John's mind.

7.- Beliefs: A person, as a reader, believes certain things about what is right and wrong. If Johnis going to threaten his boss for a raise, or possible kill him, one views it as misguided, and probablyterribly wrong. These beliefs about what is a correct course of action in the world are very much a partof how one understands and thus of how one reads.

Raw Facts: Banks have money. Bosses give raises. Crying releases tensions. Stores sell guns.All these are simple facts about the world, without which it would be hard to understand this story.

5.2 Bounding the Knowledge

Another important idea given by the research of Artificial Intelligence is a methodology to findout all the possible space of solutions.

First step: Understand the problem and build all the possible solutions.

Second step: Limit the possible set of variation of the solutions by defining the feasible set of thesevariations. This last step is done by the knowledge that one has acquired on the problem.

With these concepts in mind, in the next chapter I will try to build a model of the applicationsexplained on chapter 3 and the review of the research that has been done on name management explainedon chapter 4.

6. MODEL PROPOSED

6.1 Conclusions from the Applications on Chapter 3

The following conclusions can be drawn from the two applications that were studied on chapter3:

1.- The reasons for performing a database integration have to be justified and clearly analyzedbefore trying to solve the different name management problems. For example, on the first case thejustification for using name management techniques is to create a unique directory per company that willsave money, and provide information per company to the students.

2.- It can be expensive to match all the possibilities of name management without a purpose thatwill help bound the problem. The name matching process could be substantially reduced if the characteristicsare known. For example, I can limit the problem by specifying in my system that the grouped alumni haveto be working for the same company on the same city. In this figurative example the attribute address willhelp me bound the problem even further.

3.- The name management problem is a greater concern when integrating heterogeneous databases.For example, for the second application, creating a list per company with the alumni that work for it andthe date of the interview, I used the local DBMS and I just got 10% of the matches I should have obtained.

4.- The name management problem is not as serious in homogeneous database application. Forexample, for the AD application, the DBMS picked by itself 85% of the correct matches.

5.- The errors of why the regular DBMS didn't obtain all the right matches can be classified andgrouped. In this case I found eight classifications: organizational, punctuation, equivalence, termination,acronym, misspelling, prefix and not-sure-of.

6.- The most common problem is the organizational (52% for the first case and 28% for the secondone). It can be divided in two:

a) companies with similar names: like the case of Merrill Lynch on the AD database that hasfour different names: Merrill Lynch & Co, Merrill Lynch Capital Markets, Merrill Lynch Pierce, MerrillLynch Pierce Fenner.

b) companies with different names: like the case mentioned by Madnick and Wang [19] ofContinental Air Lines and Texas Air International.

7.- At the AD and PD databases, all the company names with organizational problems had thefirst token of the string equal. For example, AT&T, AT&T Communications Inc, AT&T Technologies Inc;all have AT&T as a common denominator.

8.- One can establish a certain pattern per database. For example, in the PD database the problemwith punctuation was very distinctive: the use of commas at the end of the company name and periods withthe abbreviations.

9.- On both databases the first letter of the first word of the name was always typed correctly.

10.- The database can be divided in files containing names of companies grouped by their startingletter. e.g. Amoco and AT&T can be placed in the file that keeps all the As.

11.- For establishing event connections, one needs to know the history of the database. For example,the AD database has information concerning alumni from 1968 until 1980, while the PD database wascreated in 1989. That could explain differences among the names of the companies for example: UnitedTechnologies in AD and United Technologies/Otis Elevators in PD. The analyst can group them togetherif he/she knows that United Technologies took over Otis Elevators in 1973, and that may explain thedifference on the two names. One can use this information to possibly draw heuristic rules.

12.- To analyze the database one needs to know the initial purpose of the database. For example,the PD database was created to centralize and gather recruiting companies for the MIT Career DevelopmentOffice, so they could have more control and information during the recruiting process.

13.- The attributes and other tables on the database can be source of valuable information. Forexample, the sic code can give us information if the companies belong to the same industry.

6.2 Proposed Methodology

Based on the literature and experimental research done on name management on corporatedatabases, this thesis proposes the following methodology':

1.- Objective determination. This step is the most important one. A management team that thinksthat there is a name management conflict in their logical connectivity, should first use the CIS modelexplained on chapter 1 to find the organizational and technical issues that will make the use of namemanagement techniques strategic to the future of the company. Financial justification may not be needed,but it needs to be considered.

2.- First Database Reduction. Using the DBMS installed on the computer, one can reduce thenumber of matches needed.

3.- Sample Pattern Studied: Once the strategy has been established on step one, it can be helpfulto find out about the background of the database. For example, the analyst should start by asking:

. which were the initial and current purpose of the database

. how many records it has

. which are the attributes that are available

The analyst should then get a sample data of different parts of the database so a pattern may besuggested. For example, punctuation pattern, letter distribution, error numbers on the first letter entered, getthe most common equivalences, obtain the most commonly used acronyms in the database.

4.- Second Database Reduction: From the research, it seems that a reduction on the number ofrecords is a main concern with database managers on financial institutions. Their main concern was thefact that they all have very large databases ( > 1,000,000 records). One of the banks even said that itcosts them $1 to access a record.

The solution suggested here is to identify the first letter of the keyword and retrieve all the firstwords that start with the same letter. For example, if I want to ask for AT&T, I should just start to lookthe match of all the company names that start by an A. Assuming an As distribution of 10%, I can reducethe numbers of records to be analyzed from 1,000,000 to 100,000 approximately depending on thedistribution of the letters in the database.

5.- Canonical Form Conversion: Define the rules that are going to be applied to the list from step4. A suggested list follows.

- translate into caps* define the equivalence rules: For example: co = company. ( A more extended list can be foundon chapter 3.) -. eliminate the punctuation at the end of words. eliminate the terminations like: and company, etc

With these rules the equivalence, punctuation, different caps, and termination problems should beeliminated

6.- Third Database Reduction: A new match with the new list should be attempted to furtherreduce the list.

Please note that the numbers used in this description should not be interpreted as having to befollowed on the same order. I number them in the way the steps were used during this research.

7.- Misspelling Correction: Using an approximate string matching algorithm like the Soundexalgorithm explained on chapter 4, I can obtain a list of the most likely possibilities. In this step a humaninterface is needed.

8.- Matching of Acronyms and Special Organizational Problems: Through the use of Inter-Database-Instance-Identification-Table (IDIIT) suggested by Wang and Madnick on chapter 4, 1 can find the companieswith special links like: Mass Gen Hosp and Massachusetts General Hospital, and Continental Air Lines andTexas Air International.

The IDIIT can be fixed or can use heuristic rules to determine the companies. Wang, Madnickand Horton's work [18] can be a good example of how to use heuristic rules to find name matching.

Also Schank's work on chapter 5 can be useful to help determine the rules to be used.This step can be batch or interactive, depending on the amount of information and the degree of

information that is required.

9.- Fourth Database Reduction: To enter in the last phase of name matching: Organizationalmatching. I suggest to reduce the list further, by making a file with the list of the companies that have thesame first word that the key word. Example: Siemens AG and Siemens Communication Sys Inc.

10.- Organizational Matching: With this reduced list on hand, I can use the following techniquesto approach the last problem:

a) Using heuristic rules, to let the system determine which companies can match the key words.For this step is important to have a good knowledge of attributes and possible links among them. Todetermine these rules, I suggest to refer to Schank's work.

b) Interactive, with the help of a human interface, to let the person determine if the companiesshould match or not. This interface can also be used to help determine and create the heuristic rules forthat database.

A summary of the model can be seen on figure 6.1.

Please note that not all the steps are required and not even in the same order, however, thisheuristic procedure seems to guarantee a good degree of efficiency (reduce cost and time) and effectiveness(reduce type 1, 2 errors), as well as a framework for the research.

On the next chapter, I will apply the model explained above to help us to analyze the applicationsI studied on chapter 3. -

METHODOLOGY METHOD PROBLEM SOLVEDS UGGESTED I_______

I_________

Objectve Composite Informadon Find purpose of databaseDetermination System Methodology Integration

First Database Use of DBMS: SQL Reduce the database byReduction using its own dbms

Sample Pattern Observation Create rules for the canonicStudied conversion

Second Database File creation with Wirst Reduction of large databaseReduction leter

Canonical Form Software titers, Termination, Equivalence,Conversion Soundex, Punctuation, Misspelling

Other algorithms Prefix

Third Database Consolidation Further ReductionReduction

Matching Acronyms IDIIT. heuristic rules Special cases: Acronyms,&Special Org. OrganizationsFourth Database File creadon by first ReductionReduction word

Organizational Heuristic rules, hurnan OrganizationalMatching Interface (Name matching completed)

Figure 6.1 Summary of the Methodology suggested.

7. APPLICATION OF THE MODEL

The purpose of this chapter is to provide an example of a methodology that can be used to solvename management problems. This is just a suggested approach and the order of the steps can be changeddepending on the problem. The applications considered here correspond to those analyzed on chapter 3.

7.1 MIT Sloan School Alumni Directory

Creation of a directory of MIT Sloan School alumni per company. For this example I used theAlumni Database (AD). I needed to reduce the current 1442 records to 1027.

7.1.1 Methodology

Step 1: Purpose of the creation of a directory. In this case the objective is simple: To reduceexpenses to MIT by sending one package of information per company. This information can also helpMIT to make a more effective raising-fund drive, by giving them more information on the alumni and theworking relationships among them.

Step 2: First Database Reduction: Make an SQL query for the unique list of names. I reduced thelist from 1442 to 1089.

Step 3: Sample Pattern Studied: In order to get a pattern of the AD I did the following:

. interviewed people that have used the database

. used some SQL queries to find the size of the database and its letter distribution (Chapter 3)

. got a sample of the first 200 company names.

From there I obtained the following pattern:. no punctuation problem. no spelling errors. no errors on the first letter entered. a list of the most common equivalences: given on chapter 3.

Step 4: Second Database reduction: For the application of database reduction I created a file perletter of the alphabet and worked file by file through the following steps.

Step 5: Canonical Form Conversion: Using software programs written in Shell2 (Appendix A) Isolved the problems of equivalence, punctuation, different caps, and termination problems.

Step 6: Third Database Reduction: The software reduced my total list to 1074.

Step 7: Misspelling Correction: Using a Soundex algorithm written in C (Appendix B) the onlyproblem of misspelling was solved. Searhson Lehman Brothers. My list now was 1073. The rest of theproblems to be solved is one acronym and the rest are organizational problems.

Step 8: Matching Acronyms and Special Organizational: I define some heuristic, by followingShank's step of the knowledge that a computer needs in order to identify two names as the same. Shank'ssteps on chapter 5 are applied to understand text, but I think that to understand how I found that two wordsare the same should be simpler.

2 Bourne Shell, command line interpreter in many UNIX based systems

a) Inference: I tried to understand how did I reach the conclusion that Mass Gen Hosp andMassachusetts General Hospital were the same. For example, I realized that it was because Mass togetherwith Gen and Hosp it is more likely to mean Massachusetts that a catholic celebration.

b) Establish loose connections & recognize stereotyped situations step: By living in the areaI have seen many times that MGH, Mass Gen Hosp, Massachusetts General Hospital usually refer to thesame entity and that a hospital can be a working place for an MIT graduate.

c) Plan Prediction and Generation: I recognized that if the two companyname were referringto the same working place, both names should have the same address or phone. In this case the attributeaddress_13 were the same: Boston MA 02114, but address11 and address_12 were different. )dGoal tracking: What was the typist thinking when decided to write the abbreviation?. In this case the typistdidn't know if the whole name would fit, because in address_11 I read: Ambulatory Care Div, instead ofan address. So I can't treat this acronym as a general acronym for the Massachusetts General Hospital, Ithink the must common will be MGH.

e) Thematic relationship: Both companynames are on the same context, i.e. same database.

Using Shank's seven steps I realized that the acronym in this database was not one that could be

determined by a general rule, so at this stage the analyst can have the option to put it on an IDIIT tableor change the companyname. For the case of the AD, I couldn't find any general rule or attributes thatcould help me identify this case of acronym as the same one. So for this case I would use a synonym table.At this point I am left with 1073 records. On the AD I didn't find a case of Special Organization.

Step 9: Fourth Database Reduction: From the individual files from step 2, using Shell I made a

program that will ask me if the companies were the same. This program grouped together companies withthe same first name.

Step 10: Organizational Matching: At this step I did a manual review of the different files, I realized

that for finding if two companies were the same, all the files have fulfilled that purpose. It was easy to findinside the files which companies could be grouped together.

7.1.2 Conclusions

1. The name management problem for database reduction or consolidation is a time consuming

event, because I need to find all the possible alternatives for every single case on the database.

2. This kind of methodology solves the problem, but it involves a skilled person to analyzed it,

define the strategy and the need, set rules and review that the process is done.

3. The rules developed here are database specific, I couldn't transfer them without rethinking thestrategy to another database.

7.2 List of Companies Interviewing in Campus

Create a list with the date that a company is interviewing on campus and with the names of the

alumni that work for that company. For this example I used the AD and the Placement Database (PD).The objective is to create a directory of 69 companies with those characteristics.

7.2.1 Methodolg

Step 1: The purpose of the directory is to give the students the opportunity to contact persons that

work for a certain company that can provide more information regarding the working experiences in the

company. The MIT Career Development Office increases its services and its visibility.

Step 2: First Database Reduction: By using the SQL query on both databases I got a reductionon the AD database but I didn't get a reduction on the PD database, even if there were redundancies onit.

Step 3: Sample Pattern Studied:i) AD database: I did the same steps described on the first case.ii) PD database: For this case I:. got a sample of the first 50 company names.. interviewed people that have used the database. used some SQL queries to find the size of the database and its letter distribution (Chapter3)

From the sample I observed the following pattern:* punctuation problems: comas at the end of the names and dots at the end of abbreviations.. no spelling errors. no errors on the first letter entered. a list of the most common equivalences were given on chapter 3.

iii) When put the data together, I found that as described on chapter 3:- punctuation problems were the major concern. equivalence problems were similar that the ones in Case 1.. misspelling wasn't a major problem.. prefix problem was something to notice, but not to worry because it was at the end of thecompany name and not at the beginning.

Step 4: Second Database Reduction: I started by choosing one companyname from PD and lookfor it in on AD. I chose as an example: AT&T. So I reduced my 1089 records to 90 (8.2% of databaseAD are As).

Step 5: Canonical Form Conversion: I converted the 90 records on the "A-file" using the rules definedfor Case 1, as well as the keyword. Just to follow the example, before the canonical process the names thatthe program should find at the end were like:

Keyword: AT&TAD: AT&T

AT&T Communications IncAT&T Technology Inc

After the canonical conversion, the names were:

Keyword: AT&T

AD: AT&TAT&T COMMUNICATIONSAT&T TECHNOLOGY

Step 6: Third Database Reduction: I matched all the 90 names and just found one: AT&T byitself, I was still missing:

AT&T COMMUNICATIONSAT&T TECHNOLOGY

Step 7: Misspelling correction: The rest of the 89 words were subject to the Soundex algorithm,that in this case didn't provide us extra information.

Step 8: Matching Acronyms and special organization: In this case I don't have AT&T on a specialIDIIT list yet, but a hypothetical case could be where one needs to match AT&T and Pacific Bell. It doesn'tapply in this case.

Step 9: Fourth database reduction: I reduced the 89 records to a file with two records: AT&TTECHNOLOGY and AT&T COMMUNICATIONS.

Step 10: Organizational Matching: By visual examination I find the right answer to my problem,but in this step the analyst should analyze the attributes, and apply Shank's or Wang, Madnick and Horton's[18] techniques described on chapters 4 and 5 to give a more general solution to the problem.

7.2.2 Conclusions

1. In order to find the keyword on the AD database, using this methodology it was helpful toreduce the possibilities from 1089 to 90 to 2.

2. The conclusions for the MIT directory are also valid here: The need for a reasonably skilledperson to follow the steps to solve the name management problem, and the specificity of this methodologyto the AD and PD databases in the last stages of the name-matching process.

8. CONCLUSIONS

Name management on corporate databases is a common problem in all modern institutions. Thegrowth and dynamism of the economy are forcing information technologies to develop better techniquesto adapt themselves to the rapid changes of the companies.

The key to a successful use of name management techniques is based on the management abilityto tie the integration problems with organizational strategic goals.

No other bibliographical references were found, where the problem of name management had beenstudied for integrating database while leaving the original databases or applications intact.

This thesis suggests a methodology divided in ten non-sequential steps, that should be regarded asa series of guidelines that can be used for the analysis and solution of name management problems.

The study presented here was based on a limited research: two databases that belong to the sameinstitution where their record number is small compared with the databases of financial institutions. Themethodology of this thesis should be expanded with larger databases.

It is important to determine the scope of the problem before attempting to use any namemanagement technique. Based on the definition and depth of the problem a cost should be estimated toconsider for type 1 and type 2 errors. The knowledge and cost of the problem are useful to bound thepossibilities of using name management techniques.

The name management problem is a greater concern when integrating heterogeneous databasesthan when integrating homogeneous databases. Factors like same DBMS and company entry-policytechniques can account for these differences.

One important conclusion of this thesis is the fact that an analyst can classify the non-matchingproblems and can group them. This step helps determine the software techniques needed to solve theproblem. In this work eight classifications were found: organizational, punctuation, equivalence, termination,acronym, misspelling, prefix and not-sure-of.

The most common and complex problem encountered was the organizational, that can be dividedin two: companies with similar names and companies with different name. This problem is interesting anddifferent from the rest of the problems because it doesn't depend exclusively on the current databaseapplication, but it also depends on the new application that needs to be built. So for example, I would liketo consider together Continental Air Lines and Texas Air for consolidated reports, but separate for operatingratios.

During the research done with two databases certain patterns were found: all the company nameswith organizational problems (similar names) had the first token of the string equal, the first letter of thefirst word of all the company names was entered correctly, and the punctuation problems of a specificdatabase could be determined easily by looking at a sample of the database. The fact that certain patternscould be found on all databases is hard to establish, given the limited research sample.

Knowledge of the origin of the database, initial purpose and information contained on the attributescan be useful to determine heuristic rules to solve organizational and acronym problems.

The methodology suggested on this thesis solved all the problems presented in the creation of twoapplications with two different databases.

For future research I would suggest to continue the research on name management techniques tosolve organization problems. It is the least studied, the more complex, and the more interesting.

BIBLIOGRAPHY

[1] Berghel H., Roach D., Talburt J. "The Logic of Spelling: Applications of Approximate StringMatching." PC Al, January/February 1990.

[2] Berghel H., Roach D., Talburt J. "The Mechanical Cruciverbalist: New Directions in ApproximateString Matching." PC Al, November/December 1989.

[3] Berghel H.L. "A Logical Framework for the Correction of Spelling Errors in Electronic Documents"Information Processing & Management Vol.23, No. 5, pp. 477-494, 1987.

[4] Elmasri R., Navathe S.B. "Fundamentals of Database Systems" The Benjamin/Cummings PublishingCompany Inc. California, 1989.

[5] Godes D.B. "Use of Heterogeneous Data Sources: Three Case Studies" Sloan School of Management,working paper, June 1989. WP # CIS-89-02.

[6] Guida G., Tasso C. "An Expert Intermediary System for Interactive Document Retrieval"Automatica, Vol. 19, No.6, pp: 759-66, Nov. 1983.

[7] Hall P., Dowling G. "Approximate String Matching" Computer Surveys, Vol. 12, No. 4, December1980.

[8] Madnick S.E., Wang Y.R. "A Framework of Composite Information Systems for Strategic Advantage"Connectivity Among Information Systems, MIT, CIS Volume 1, 1988.

[9] Madnick S.E., Wang Y.R. "Logical Connectivity: Applications, Requirements, and An Architecture"Connectivity Among Information Systems, MIT, CIS Volume 1, 1988.

[10] Madnick S.E., Wang Y.R. "Facilitating Connectivity in Composite Information Systems" ConnectivityAmong Information Systems, MIT, CIS Volume 1, 1988.

[11] Marcus, R.S. "An Experimental Comparison of the Effectiveness of Computers and Humans asSearch Intermediaries" Journal of the American Society for Information Science, 34(6):381-404;1983.

[12] Norckauer H.R. "Duplicate Entry Detection in Mailing and Participation Lists" Sloan School ofManagement, Master's thesis, May 1990.

[131 Paget M.L. "A Knowledge-Based Approach toward Integrating International On-line Databases"Sloan School of Management, working paper, March 1989.

WP # CIS-89-01.

[141 Porter M.E., Millar V.E. "How Information gives you Competitive Advantage" Harvard BusinessReview, August 1985.

[15] Roach D., Hal B., Talburt J. " Intelligent Problem Solving with Strings: New Directions inApproximate String Matching." PC Al, September/October 1989.

[161 Salin, Tim. "What's in a Name ?" Database Programming & Design, March 1990.

[171 Schank R. "Reading and Understanding" Lawrence Erlbaum Associates Inc., Publishers, New Jersey.

[181 Wang Y.R., Madnick S.E., Horton D.C. "Inter-Database Instance Identification in CompositeInformation Systems" Connectivity Among Information Systems, MIT CIS Volume 1, 1988.

[19] Wang Y.R., Madnick S.E.,et al. "CISL: Composing Answers from Disparate Information Systems"Extended abstract, IEEE Workshop on Heterogeneous Database Systems, September 1989.

[201 Weinberg B.H., Cunningham J.A. "Word Frequency Data in Full Text Database Searching" NationalOn-line Meeting Proceedings, 1984.

[211 "CIS/TK Open Issues" CISL Sloan School of Management, October 17, 1989.

[221 "$NAME* Naming Standards" Global Software, Inc.

[23] Bourne, C.P. "Frequency and Impact of Spelling Errors in Bibliographic Databases" Inf. ProcessManagement 13,1 (1977) 1-12.

[24] Damerau, F.J. "A Technique for Computer Detection and Correction of Spelling Errors"Communication ACM 7,3 (March 1964) 171-176.

APPENDIX A

SOFTWARE FILTERS

This program will give put in canonical form the list of companies contained on a file calledfilename.

Command line: all filename

Input looks like:

Amoco Canada Petroleum Co LtdGenI Motors Corp

Output looks like:

AMOCO CANADA PETROLEUMGENERAL MOTORS

all:

cat '$1" 1 tr "[a-z]" "[A-Z]" I sed -f translate I sed -f translate2 | sed -f translate3

translate:

s/$/ /gs/ ASSOCS / ASSOCIATES /gs/BK /BANK /gs/ CO / COMPANY /gs/ COMP / COMPANY /gs/ CORP / INCORPORATED /gs/ ELEC / ELECTRIC /gs/ GNL / GENERAL /gs/^GENL /GENERAL /gs/ GENL / GENERAL /gs/ GPE / GROUP /gs/ GRP / GROUP /gs/ INC / INCORPORATED /gs/ INFO / INFORMATION /gs/ INTL / INTERNATIONAL /gs/^INTL /INTERNATIONAL /gs/ INST / INSTITUTE /gs/ MGT / MANAGEMENT /gs/ MGTM / MANAGEMENT /gs/ MKTG / MARKETING /gs/ NATL / NATIONAL /gs/ PLC / INCORPORATED /gs/ SYS / SYSTEM /gs/ SOC / SOCIETE /gs/ SOC / SOCIETY /gs/ TECH / TECHNOLOGY /g

s/ UNIV / UNIVERSITY /gs/ & / AND /gs/-/ /gs/\./ /gs/,/ /g

translate2:

s/ CORPORATION / COMPANY /gs/ INCORPORATED / COMPANY /g

translate3:

AND COMPANY//gAND COMPANY//gCOMPANY//gNA//g

APPENDIX B

SOUNDEX ALGORITHM

This program will translate a keyword and a list of company names using the algorithm explained

on chapter 4.

Command line: soundex keyword filename

Input on filename looks like:

SEARHSON LEHMAN BROS INCSHEARSON LEHMAN BROS INC

Output looks like:

S625 455 162 52S625 455 162 52

soundex.c program:

#include <stdio.h>#include <string.h>

main(argc,argv)int argc;

char *argv[];{

int i, j, letter;char query[801, bufferl[80], buffer2[80], translateo;FILE *fp;

if (argc < 3){printf("USE: %s keyword file name\n",argv[O]);exit(1);}

query[O] = *argv[l];i=1;

while(*(argv[1] +i))query[iJ = translate(*(argv[11 +i));i+ +;}

query[i] = '\0';

reduce(query);

printf("= = =%s %s= ==\n",argv[1],query);

if ( (fp=fopen(argv[2],"r")) = = NULL){printf("Can't open file %s\n",argv[2]);exit(1);}

i = 0;j = 0;while ( (letter=fgetc(fp)) != EOF ){

if(letter != '\n'){bufferl[j] = (char)letter;buffer2[j] = (char)letter;j+ +;

}else continue;

while( (letter= fgetc(fp)) != '\n'){buffer1[j = (char)Ietter;buffer2[j] = translate((char)letter);j+ +;}

buffer1j] = '\0';buffer2[j] = '\0';j=0;

reduce(buffer2);printf("[%d] %s\t%s\n",i,buffer1,buffer2);i+ +;}

char translate(let)char let;{

char num;

switch(let){case 'A':case 'E':case 'I':case '0':case 'U':case 'H':case 'W':case 'Y':

num = '0';break;

case 'B':case 'F':case 'P':case 'V':

num = '1';break;

case 'C':case 'G':case ':case 'K':case 'Q':case 'S':case 'X':case 'Z':

num = '2';break;

case 'D':case 'T':

num = '3';break;

case 'L':num = ;break;

case 'M':case 'N':

num = '5';break;

case 'R':num = '6';break;

default:

num = let;break;

}

return(num);

}

reduce(string)char *string;

{int i = 0;char *base, line[80];

base = string;

while(*string){

while ( *string == *(string+ 1)) string+ +;

if (*string = = '0');else {

line[i] = *string;i++;}

string++;

}

line[i] = '\0';

strcpy(base,line);

}

APPENDIX C

ACRONYMS USED ON THE THESIS

AD Alumni Database

CIS Composite Information System

CISL Composite Information System Laboratory

CIS/TK Composite Information Systems Tool Kit

DBMS Database Management System

IDIIT Inter-Database Instance Identification Table

MIT Massachusetts Institute of Technology

PD Placement Database

SQL Structured Query Language

: on KBIISE related activities was performed under the aegis of the Composite

)n Systems Laboratory Project. Dr. Gupta and Professor Madnick were the Co-Principal icy (RF)ors. mericanITE INFORMATION SYSTEMS LABORATORY PROJECT :nefits of

tive of our Composite Information Systems Laboratory (CISL) Project (forerunner of the silities of

ontext Interchange Initiative) at MIT was to investigate issues concerning the integration )ads had

of disparate information systems that characterize virtually all large organizations. ;erved as

during the eighties led to the SATORI framework, and work since then has included ilize this

; case studies involving industry partners (e.g., Salomon Bros., Citibank, Merck), :spective

ient of theories for all three levels of connectivity with particular emphasis on technical

-ity, and delineation of approaches on technical connectivity that can be utilized across National)plication scenarios. ion with

ble toolse work was conducted as part of the CISL project at MIT's Sloan School of Management from thection with MIT's Center for Transportation Systems, International Financial Services ; further,

Center, Leaders for Manufacturing Program, Intelligent Engineering Systems gnificant

y, Center for Coordination Sciences, Laboratory for Computer Science, and a number of al with.)

T research centers. Other projects related to these collaborations range from neural

techniques for image processing of handwritten checks to integration of factory-level

nformation for automated parametric release of parenteral drugs and these are being

I as part of the Productivity from Information Technology Initiative at MIT.

liduals leading the CISL research activity also served as technical advisors on the Total

ion System (TDS) concept of the U.S. Army and the U.S. Department of Transportation

The KBIISE and TDS work has been done as joint government, academia, and industry

tive efforts under the aegis of the MIT Center for Transportation Studies (CTS), and its

ation industry affiliates, with government funding facilitated via a multi-year broad

agreement between MIT's Center for Transportation Systems (CTS) and DOT's Volpe

Transportation Systems Center (VNTSC).

)MMUNICATIONS INFORMATION MANAGEMENT SYSTEM (TIMS) PROJECT

al telephone bill of the Federal Aviation Administration (FAA) exceeds $300 million per

itil the early eighties, all the voice, data, and radar circuits were procured from AT&T.

breakup of AT&T, FAA became legally obliged to obtain goods and services from

telecommunication companies. A number of different technologies were adopted byach typically accompanied by a separate information system purchased from the respective

By the late eighties, FAA had set up a separate group, the Telecommunications

Tient and Operations (ASM-300) Group, to coordinate the procurement and the

ient of the telecommunications services and the more than one dozen information systems

o them; each of the latter systems involved a different hardware make/model. At the

)f Department of Transportation, Dr. Amar Gupta served as technical advisor to identify

:w different options available to FAA; subsequently, he led the MIT team which assisted in

-mentation of the selected option. In all, MIT personnel and consultants performed work

g $2 million over a 5 year period. Dr. Gupta received a federal achievement award for his

TIMS.

Edit 3

I.H.2. Facilities

The proposed work will be performed using the facilities of MIT (in Cambridge, MA) and DoDcontractors such as AmerInd Inc., (in Alexandria, VA; Dayton, OH; and other locations). WithinMIT, this initiative will be conducted as part of the Productivity From Information Technology(PROFIT) Initiative which concentrates on the use of contemporary information technology toenhance productivity in private and public enterprises. Current research relates to four areas: (i)redesign of business processes; (ii) measurement of productivity in the context of informationtechnology; (iii) conversion of data from paper and other media to computer-accessible media; and(iv) dissemination of information held on heterogeneous computing resources. Problems areaddressed at three levels: strategic, organizational, and technical. Technical contribution includecase studies involving government agencies and industry partners and development of approachesthat can be utilized across various application scenarios.PROFIT was established by MIT's President Charles Vest and Provost Mark Wrighton in 1992; ithas been designated as an interdepartmental research initiative, coordinated by MIT's Sloan Schoolof Management. The emphasis in this school is to develop technologies that can be utilized on acommercial basis within 3-5 years. In addition, there are a number of other centers that typicallycollaborate in various research efforts including the Laboratory for Computer Sciences, the Centerfor Transportation Studies, the International Financial Services Research Center, and the IntelligentEngineering Systems Laboratory. The facilities of these research centers/labs are utilized on acase-by-case basis, and will be available for work proposed in this document.Commencing in 1993, ARPA has supported the Context Interchange Network (CIN) project atPROFIT. The current CIN experimental prototype provides unified access to several major datasources, and will serve as one of the foundations for DLA-related work. It was developed in ourleading edge facility that includes a fully networked three-tiered client-server environment withconnectivity to the Internet. We operate on several types of machines (e.g., Sequent, SUN, DEC,PCs, Macs), multiple operating systems, and a mixture of COTS software and researchprototypes. The Context Interchange Network runs on middleware, and utilizes COTS softwarewhere possible. Research prototypes are used for knowledge representation to build ontologies andlocal context, and an object-oriented approach has been used in the conversion-based queryprocessing module. Significant capabilities exist within PROFIT in the areas of imaging, video-conferencing, and other areas of information integration; for example, applied research in the areaof imaging has led to a-prototype which can read handwritten information at very high speed andaccuracy. MIT has applied for patent rights on this technology, and the International Patent Officehas recently supported this application.

Several of the individuals from MIT, whose participation is proposed, have previously served asadvisors on large governmental initiatives, such as CALS (of DoD), Total Asset Visibility (of USArmy), TIMS (of Federal Aviation Administration), and IDS (of US Air Force). Almost all ofthese efforts involved heterogeneity in terms of codes, part numbers, and organizations. Thefacilities and equipment acquired for these activities, coupled with experience on industry-sponsored projects, will be available for use in the proposed effort.

6/29/95, Edit 3

I.H.3 Qualifications and Experience of PersonnelAmar Gupta is a the first and only person to attain the position of a Senior Research

Scientist at MIT's Sloan School of Management; this is the highest position on the research track atMIT, and is granted to individuals who have demonstrated research competence equal to that of asenior faculty member. He performs research and manages large research projects on a full-timebasis, with total exemption from classroom teaching responsibilities. He has written more than100 technical articles and papers and has produced eight books, the most recent one entitled,Integration of Information Systems: Bridging Heterogeneous Databases, (IEEE Press). Dr. Guptawas involved in the development of the split transaction bus protocol which significantly increasesthe performance of multi-processor-based systems and the use of neural networks for accessingsemistructured sets of information. He has served as principal investigator for more than 20research projects at MIT, including projects relating to Computer-Aided Logistics Systems(CALS), the Air Force Tech Order Management Systems (AFTOMS), and FAA'sTelecommunications Information Management System (TIMS); for his work on TIMS, he receiveda federal award from the US Department of Transportation for the notable achievements of hisgroup. Dr. Gupta and Professor Madnick jointly led the Knowledge-Based Integrated InformationSystems Engineering (KBIISE) project and the Context Interchange project. He holds a bachelor'sdegree in electrical engineering, a master's degree in management, and a doctorate in decisionsupport systems; he joined MIT in 1979, and is the co-founder and co-director of the Productivityfrom Information Technology (PROFIT) Initiative at MIT since 1992.

Stuart Madnick is the John Norris Maguire Professor of Information Technology andLeaders for Manufacturing Professor of Management Science at the MIT Sloan School ofManagement. He is also an affiliate member of the MIT Laboratory for Computer Science and amember of the Executive Committee of the MIT Center for Information Systems Research. Hiscurrent research interests include connectivity among disparate distributed information systems,database technology, and software project management. He is the author or co-author of over 200books, articles, or reports on these subjects, including the classic textbook, Operating Systems(McGraw-Hill, 1974), and the recent book, The Dynamics of Software Development (Prentice-Hall, 1991). He has been active in industry, making significant contributions as one of the keydesigners and developers of projects such as IBM's VM/370 operating system and Lockheed'sDIALOG information retrieval system.

Amedeo R. Odoni is Professor of Aeronautics and Astronautics and of CivilEngineering and Head of Systems Division of the Aeronautics and Astronautics Department. From1986 to 1991 he served as Co-Director of the Operations Research Center of MIT. ProfessorOdoni is the author or co-author of three books and has served for four years as Editor-in-Chief ofthe scientific journal Transportation Science. He has served as consultant for many public andprivate organizations, primarily on issues related to air traffic and to airport development. In 1991he received the Federal Aviation Administration's National Award for Excellence in AviationEducation.

Michael Siegel is a Principal Research Scientist and the Associate Director for theInternational Financial Services Research Center at MIT's Sloan School of Management. He hasbeen conducting research in the areas of heterogeneous database systems, query optimization, and

6/29/95, Edit 3

context representation and management. He has published numerous works in these areas and hassupervised the development of the Context Interchange Network Prototype as part of the ARPA I3

initiative.James B. Rice, Jr. is Research Scientist at Center for Transportation Studies and

Director of Integrated Supply Chain Management Program, both at MIT. He holds a mastersdegree in business administration from Harvard University, and possesses 12 years experience inthe area of logistics, including seven years at Proctor and Gamble Company.

Dr.Joseph M. Sussman is JR East Professor and Professor of Civil and EnvironmentalEngineering and is Director of the Association of American Railroads Affiliated Research Lab atMIT. From 1986-1991, he served as Director of the Center for Transportation Studies. Further, hehas worked on Intelligent Vehicle/Highway Systems (IVHS), helping to build both a majorprogram at MIT as co-director of the MIT/IVHS Program and a national program while serving asthe first Distinguished University Scholar at IVHS AMERICA.

Gio Wiederhold is Professor of Computer Science at Stanford University, with courtesyfaculty appointments in Medicine and Electrical Engineering. He is known for his work in thedevelopment and application of knowledge-based techniques to database management, informationsystems, and software construction and maintenance. Current research projects addressHeterogeneity of Information within Commercenet and Trusted Interoperation of HealthcareInformation. From 1991 to 1994 he was the Program Manager for Knowledge-Based Systems atARPA, initiating programs in Intelligent Integration of Information and in Persistent-Object Bases.He participated in establishing directions for a Software Foundations program, specifically incomposition technology for large, multi-site systems.

6/29/95, Edit 3