
DISAMBIGUATION OF USPTO INVENTORS

The Institute for Quantitative Social Science at Harvard University

Name Game Workshop – Madrid 9-10 December 2010

Presenter: Amy Yu

Coauthors: Ronald Lai, Alex D’Amour, Lee Fleming

Technical Collaborator: Edward Sun

We would like to thank the NSF for supporting this research. Errors and omissions remain ours (though we ask that you bring them to our attention).






Agenda

• Introduction
• Methodology
  - Torvik-Smalheiser Algorithm (PubMed)
• Results and Analysis
  - Descriptive Statistics
• DVN platform

Introduction

Background
• Patent data made available by the USPTO enables further research into technology and innovation
• NBER database includes authorship, firm, and state level data but has not completed the effort to disambiguate unique inventors (Hall, Jaffe, and Trajtenberg, 2001)
• Inventor disambiguation is non-trivial: the USPTO does not require consistent and unique identifiers for inventors

Motivation
• Inventor disambiguation allows for construction of inventor collaboration networks
• Open new avenues of study:
  - Which inventors are most central in their field?
  - How does connectedness affect inventor productivity?
  - What corporate structures are conducive to innovation?
  - How do legal changes impact idea flow?
• Build a scalable, automated system for tracking and analyzing developments in the inventor community

Methodology

Overview
• Previous methodology (2008)
  - Linear, unsupervised – more intuitive
  - Similarity between records is a weighted average of element-wise similarity scores
  - Weights are not optimized
  - Strong results for US: Lai, D’Amour, and Fleming (2008) showed recall of 97.3% and precision of 96.1%
• Current methodology (2010)
  - Variation of Torvik-Smalheiser algorithm (Torvik et al, 2005; Torvik and Smalheiser, 2009)
  - Multi-dimensional similarity profiles
  - Semi-supervised with automatically generated training sets
  - Optimal weighting, non-linear interactions
  - Easier to scale

Disambiguation Process

[Process diagram]
• Inputs: weekly USPTO patent data (1998 – 2010); public databases
• HBS scripts, data preparation: load and validate; clean and format; generate datasets
• Primary datasets: Assignee, Classes, Inventors, Patents
• Consolidated inventor dataset
• HBS scripts: inventor disambiguation algorithm
• Consolidated inventor matched dataset

Data preparation

• Create inventor, assignee, patent and classification datasets from primary and secondary data sources
  - USPTO: weekly patent data in XML files
  - NBER Patent Data Project: assignee data
  - National Geospatial-Intelligence Agency: location data
• Standardize and reformat
  - Removal of excess whitespace, removal of tags, and translation of Unicode characters
• Construct the inventor-patent database
  - Consolidate inventor, assignee, patent, and classification datasets
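The standardization step can be illustrated with a minimal sketch. The function name `clean_field`, the regular expression used for tags, the ASCII folding, and the final upper-casing (suggested by the upper-case sample records in this deck) are all assumptions for illustration; the HBS scripts themselves are not shown in the slides.

```python
import re
import unicodedata

# Hypothetical helper: a rough take on "removal of excess whitespace,
# removal of tags, and translation of Unicode characters".
TAG_RE = re.compile(r"<[^>]+>")

def clean_field(raw):
    text = TAG_RE.sub(" ", raw)                    # drop markup tags
    text = unicodedata.normalize("NFKD", text)     # split accents from letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.encode("ascii", "ignore").decode("ascii")  # assumption: fold to ASCII
    text = re.sub(r"\s+", " ", text).strip()       # collapse excess whitespace
    return text.upper()                            # assumption: upper-case output

print(clean_field(" <i>Jos\u00e9</i>   M\u00fcller "))
```

Characters with no ASCII decomposition are simply dropped here; a real pipeline would more likely use an explicit translation table.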

Patent Data: Base Datasets

INVENTOR
Invnum_N     Disambiguated inventor number
Invnum       Initial inventor number: Patent + InvSeq
Firstname    Inventor first name
Lastname     Inventor last name
InvSeq       Inventor number on patent
Street       Inventor’s street address
City         Inventor’s city
State        (US only) State
Zipcode      (US only) Zipcode
Lat          Latitude
Long         Longitude

PATENT
Patent       USPTO assigned patent number
AppDate      Patent application date
GDate        Patent grant date
AppYear      Patent application year

ASSIGNEE
Assignee     Primary firm associated with patent
Asgnum       Generated assignee number

CLASSES
Class        Main patent classification
Subclass     Patent subclassification

*HBS algorithm generated variables.


Consolidated Dataset

Firstname | Lastname | City | State | Country | Zipcode | Lat | Lng | InvSeq | Patent | AppYear | GYear | AppDate | Assignee | AsgNum | Class | Invnum | Invnum_N
GAROLD LEE | FLEMING | NEWTON | KS | US | 67117 | 38.13 | -97.32 | 3 | 4091724 | 1977 | 1978 | 3/16/1977 | HESSTON CORPORATION | H000000002441 | 100-21 | 04091724-3 | 04091724-3
LEE | FLEMING | FREMONT | CA | US | 94555 | 37.57 | -122.05 | 2 | 5029133 | 1990 | 1991 | 8/30/1990 | HEWLETT PACKARD COMPANY | A000010088678 | 365-189.02/365-201/714-703/714-731 | 05029133-2 | 05029133-2
LEE O | FLEMING | FREMONT | CA | US | 94555 | 37.57 | -122.05 | 1 | 5136185 | 1991 | 1992 | 9/20/1991 | HEWLETT PACKARD COMPANY | A000010088678 | 326-16/326-31/326-56/326-82 | 05136185-1 | 05136185-1
CATHLEEN M | FLEMING | FOREST HILL | MD | US | 21050 | 39.57 | -76.40 | 2 | 5799675 | 1997 | 1998 | 3/3/1997 | COLOR PRELUDE INC | A000011790130 | 132-333/132-317 | 05799675-2 | 05799675-2
EILEEN | FLEMING | SANDIA | TX | US | 78383 | 28.09 | -97.94 | 1 | 7066218 | 2003 | 2006 | 10/29/2003 | TMC SYSTEMS L P | H000000134163 | 141-198/239-63/239-64/239-67 | 07066218-1 | 07066218-1
EILEEN | FLEMING | SANDIA | TX | US | 78382 | 30.09 | -100.94 | 1 | 7540433 | 2006 | 2009 | 4/27/2006 | TMC SYSTEMS L P | H000000134163 | 239-69/141-198/239-63/239-64 | 07540433-1 | 07540433-1
ELENA | FLEMING | NORTH VANCOUVER | | CA | | 49.32 | -123.07 | 5 | 7521240 | 2004 | 2009 | 12/6/2004 | SMITHKLINE BEECHAM CORPORATION | A000011538118 | 435 | 07521240-5 | 07521240-5
ELIZABETH A | FLEMING | HOUSTON | TX | US | 77299 | 29.77 | -95.41 | 1 | 5164591 | 1991 | 1992 | 9/9/1991 | SHELL OIL COMPANY | A000010266734 | 250 | 05164591-1 | 05164591-1
ELIZABETH S | FLEMING | BELMONT | MA | US | 2478 | 42.39 | -71.18 | 3 | 6335339 | 1999 | 2002 | 1/13/1999 | SCRIPTGEN PHARMACEUTICALS INC | H000000014253 | 514 | 06335339-3 | 06335339-3
ELLEN L | FLEMING | BIRMINGHAM | MI | US | 48012 | 42.54 | -83.21 | 2 | 5683090 | 1995 | 1997 | 6/7/1995 | | | 273/463 | 05683090-2 | 05683090-2
ERIC MICHAEL | FLEMING | SELKIRK | | CA | | 50.15 | -96.88 | 2 | 5906799 | 1992 | 1999 | 6/1/1992 | HEMLOCK SEMICONDUCTOR CORPORATION | A000011501872 | 422/501 | 05906799-2 | 05906799-2
FRANCIS WILLIAM | FLEMING | GLASGOW | | GB | | 55.83 | -4.25 | 1 | 4118646 | 1977 | 1978 | 6/17/1977 | MARKON ENGINEERING COMPANY LIMITED | H000000234058 | 310/165 | 04118646-1 | 04118646-1
FRANK ALBERT | FLEMING | WARRENSBURG | MO | US | 64093 | 38.76 | -93.73 | 1 | 6555265 | 2000 | 2003 | 11/6/2000 | HAWKER ENERGY PRODUCTS INC | H000000026219 | 429 | 06555265-1 | 06555265-1
FRANK ALBERT | FLEMING | OZARK | MO | US | 65721 | 37.01 | -93.20 | 1 | 6815118 | 2003 | 2004 | 1/4/2003 | HAWKER ENERGY PRODUCTS INC | H000000026219 | 429 | 06815118-1 | 06555265-1
FRANK ALBERT | FLEMING | WARRENSBURG | MO | US | 64093 | 38.76 | -93.73 | 0 | 7601456 | 2005 | 2009 | 2/23/2005 | HAWKWER ENERGY PRODUCTS INC | H000000202360 | 429 | 07601456-0 | 06555265-1
FRANK JOSEPH | FLEMING | MELBOURNE | FL | US | 32951 | 28.02 | -80.54 | 1 | 6931132 | 2002 | 2005 | 5/10/2002 | HARRIS CORPORATION | A000010045633 | 380/713 | 06931132-1 | 06931132-1

Column groups called out on the slide: inventor first and last name; location data; patent number; patent application and grant dates; assignee data; patent class; Invnum; Invnum_N.


Disambiguation Algorithm

Blocking

Training Sets

Ratios

Disambiguation

Consolidation

Blocking

Run # | Type | Block1 | Block2
1 | Consolidated | First name | Last name
2 | Consolidated | First 5 characters of first name | First 8 characters of last name
3 | Consolidated | First 3 characters of first name | First 5 characters of last name
4 | Consolidated | Initials of first and middle names |
5 | Consolidated | First initial | First 5 characters of last name
6 | Consolidated | | Last 5 characters of last name, reversed
7 | Consolidated | First initial | Last 5 characters of last name, reversed

Firstname | Lastname | City | State | Country | Zipcode
GAROLD LEE | FLEMING | NEWTON | KS | US | 67117 | …
LEE | FLEMING | FREMONT | CA | US | 94555 | …
LEE O | FLEMING | FREMONT | CA | US | 94555 | …
CATHLEEN M | FLEMING | FOREST HILL | MD | US | 21050 | …
EILEEN | FLEMING | SANDIA | TX | US | 78383 | …
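To make the blocking rules concrete, here is a minimal sketch that partitions the five sample records by a block key. The names `BLOCK_RULES` and `block_records` are hypothetical, and only the runs whose two criteria are both legible in the table (1, 2, 3, 5 and 7) are included. It assumes the Firstname field carries any middle name or initial, as the sample rows suggest.

```python
from collections import defaultdict

# Hypothetical block-key rules; each returns the (Block1, Block2) pair for a record.
BLOCK_RULES = {
    1: lambda first, last: (first, last),
    2: lambda first, last: (first[:5], last[:8]),
    3: lambda first, last: (first[:3], last[:5]),
    5: lambda first, last: (first[:1], last[:5]),
    7: lambda first, last: (first[:1], last[-5:][::-1]),  # last 5 chars, reversed
}

SAMPLE = [
    {"first": "GAROLD LEE", "last": "FLEMING", "invnum": "04091724-3"},
    {"first": "LEE", "last": "FLEMING", "invnum": "05029133-2"},
    {"first": "LEE O", "last": "FLEMING", "invnum": "05136185-1"},
    {"first": "CATHLEEN M", "last": "FLEMING", "invnum": "05799675-2"},
    {"first": "EILEEN", "last": "FLEMING", "invnum": "07066218-1"},
]

def block_records(records, run):
    """Group record ids by the block key of the given run."""
    blocks = defaultdict(list)
    for rec in records:
        key = BLOCK_RULES[run](rec["first"], rec["last"])
        blocks[key].append(rec["invnum"])
    return dict(blocks)

for run in (1, 3):
    print(run, block_records(SAMPLE, run))
```

Under run 1 every sample record sits in its own block; under run 3 the two Fremont records (LEE and LEE O) share the block ("LEE", "FLEMI") and so become candidates for comparison.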



Training Sets

Similarity profile: x = [x1, x2, x3, x4, x5, x6, x7] = (α, β), where α covers the name attributes and β the patent attributes

            Name Attributes    Patent Attributes
Match       P(α|M)             P(β|M)
Nonmatch    P(α|N)             P(β|N)

P(x|M) = P(α|M) * P(β|M) = probability of seeing similarity profile x given a match

P(x|N) = P(α|N) * P(β|N) = probability of seeing similarity profile x given a nonmatch

Ratios

Similarity Profile | Match Probability P(M|x)*
[2, 4, 3, 4, 2, 1, 4] | 0.3439485
[3, 4, 3, 5, 3, 2, 5] | 0.5872638
[4, 5, 3, 7, 3, 4, 6] | 0.7936452
[6, 6, 4, 8, 3, 8, 7] | 0.9828447
… | …

• Likelihood ratio: r = P(x|M)/P(x|N), generated from training sets
• Probability of match given similarity profile x: P(M|x) = 1 / (1 + (1 − P(M)) / (P(M) · r))
• P(M) is empirically determined
• Smoothing: enforce monotonicity
• r is interpolated/extrapolated for unobserved x

* approximated probabilities for demonstration
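As a small worked example, the conversion from a likelihood ratio r and a prior P(M) to a match probability follows directly from Bayes' rule. The function name `match_probability` and the numeric inputs below are illustrative assumptions, not values from the study.

```python
def match_probability(r, prior_match):
    """P(M|x) from the likelihood ratio r = P(x|M)/P(x|N) and the prior P(M).

    Bayes' rule gives P(M|x) = 1 / (1 + (1 - P(M)) / (P(M) * r)).
    """
    if not 0.0 < prior_match < 1.0:
        raise ValueError("prior_match must lie strictly between 0 and 1")
    if r < 0:
        raise ValueError("a likelihood ratio cannot be negative")
    if r == 0:
        return 0.0  # profile never seen among matches
    return 1.0 / (1.0 + (1.0 - prior_match) / (prior_match * r))

# Illustrative inputs only: a ratio of 1 leaves the prior unchanged,
# while a large ratio pushes the probability toward 1.
print(match_probability(1.0, 0.2))
print(match_probability(500.0, 0.2))
```

The function is monotone in r, so enforcing monotonicity on the smoothed ratios carries over to the match probabilities.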

Disambiguation


[Worked example from the slide]
• Similarity profile of the compared record pair: [6, 6, 4, 8, 3, 8, 7]
• Lookup in the ratio table: match probability 0.9828447
• 0.9828447 > 0.95, so Invnum_N is updated

Before:
Invnum | Invnum_N
07066218-1 | 07066218-1
07540433-1 | 07540433-1

After:
Invnum | Invnum_N
07066218-1 | 07066218-1
07540433-1 | 07066218-1

Consolidation

Before consolidation:

Firstname | Lastname | City | State | Country | Zipcode | … | Invnum | Invnum_N
GAROLD LEE | FLEMING | NEWTON | KS | US | 67117 | … | 04091724-3 | 04091724-3
LEE | FLEMING | FREMONT | CA | US | 94555 | … | 05029133-2 | 05029133-2
LEE O | FLEMING | FREMONT | CA | US | 94555 | … | 05136185-1 | 05136185-1
CATHLEEN M | FLEMING | FOREST HILL | MD | US | 21050 | … | 05799675-2 | 05799675-2
EILEEN | FLEMING | SANDIA | TX | US | 78383 | … | 07066218-1 | 07066218-1
EILEEN | FLEMING | SANDIA | TX | US | 78382 | … | 07540433-1 | 07066218-1

After consolidation:

Firstname | Lastname | City | State | Country | Zipcode | … | Invnum | Invnum_N
GAROLD LEE | FLEMING | NEWTON | KS | US | 67117 | … | 04091724-3 | 04091724-3
LEE | FLEMING | FREMONT | CA | US | 94555 | … | 05029133-2 | 05029133-2
LEE O | FLEMING | FREMONT | CA | US | 94555 | … | 05136185-1 | 05136185-1
CATHLEEN M | FLEMING | FOREST HILL | MD | US | 21050 | … | 05799675-2 | 05799675-2
EILEEN~2 | FLEMING~2 | SANDIA~2 | TX~2 | US~2 | 78383~1/78382~1 | … | 07066218-1 | 07066218-1
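The consolidated row for EILEEN FLEMING suggests a `value~count` notation, with distinct values joined by `/`. A minimal sketch of that merge, assuming this reading of the notation, might look like the following; `consolidate` and its field names are hypothetical.

```python
from collections import Counter, defaultdict

def consolidate(records, fields, key="invnum_n"):
    """Merge records sharing the same invnum_N into one row per inventor."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    merged = {}
    for invnum_n, recs in groups.items():
        row = {}
        for field in fields:
            if len(recs) == 1:
                row[field] = recs[0][field]  # single records are left as-is
            else:
                counts = Counter(rec[field] for rec in recs)
                # most_common() keeps first-seen order among equal counts
                row[field] = "/".join(f"{v}~{n}" for v, n in counts.most_common())
        merged[invnum_n] = row
    return merged

records = [
    {"first": "CATHLEEN M", "city": "FOREST HILL", "zipcode": "21050", "invnum_n": "05799675-2"},
    {"first": "EILEEN", "city": "SANDIA", "zipcode": "78383", "invnum_n": "07066218-1"},
    {"first": "EILEEN", "city": "SANDIA", "zipcode": "78382", "invnum_n": "07066218-1"},
]
result = consolidate(records, ["first", "city", "zipcode"])
print(result["07066218-1"])
```

Keeping the counts lets later runs compare one consolidated record per candidate inventor instead of every underlying patent record.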

Process Map: Consolidated Steps

[Process diagram: for each consolidated run i = 1 to 7, a training set (tsetC1 to tsetC7) produces a ratio database (ratio1 to ratio7). Starting from the inventor-patent dataset, disambiguation steps D1 to D7 alternate with consolidation steps C1 to C7, ending in the lower bound result.]

Final Step: Splitting

[Process diagram: the inventor-patent dataset is blocked on the invnum_N produced by D7 and disambiguated once more (D8) using ratio7, giving the upper bound result.]

Results and Analysis

Patents and Inventors*, 1975 – 2010

[Line chart: count (0 to 250,000) by year (1975 to 2009). Series: Unique Inventors (upper bound), Unique Inventors (lower bound), Total Patents.]

* excluding East Asian inventors

Patents Per Inventor*

Patents per inventor | Share of inventors
1 | 66.78%
2 thru 5 | 26.86%
6 thru 10 | 3.94%
11 thru 50 | 2.29%
50+ | 0.14%

* based on lower bound disambiguation

Top 10 Inventors*

Firstname | Lastname | Country | Assignee | Number of Patents
KIA | SILVERBROOK | AU | SILVERBROOK RESEARCH PTY LTD | 3382
DONALD E | WEDER | US | WANDA M WEDER AND WILLIAM F STAETER | 1001
LEONARD | FORBES | US | MICRON TECHNOLOGY INC | 925
GURTEJ S | SANDHU | US | MICRON TECHNOLOGY INC | 832
PAUL | LAPSTUN | AU | SILVERBROOK RESEARCH PTY LTD | 803
WARREN M | FARNWORTH | US | MICRON TECHNOLOGY INC | 729
GEORGE | SPECTOR | US | THE RUIZ LAW FIRM | 715
SALMAN | AKRAM | US | MICRON TECHNOLOGY INC | 670
WILLIAM I | WOOD | US | GENENTECH INC | 646
AUSTIN L | GURNEY | US | GENENTECH INC | 618

* based on lower bound disambiguation, excluding East Asian inventors

Unique Coauthors by Patent Grant Year

[Line chart: count by grant year (1975 to 2010). Series: Average # Coauthors LB, Average # Coauthors UB.]

Largest Component per Year

[Line chart: component size in number of vertices (0 to 45,000) by grant year (1975 to 2009). Series: Lower bound disambig, Upper bound disambig.]

Analysis
• Benchmark dataset from Jerry Marschke, NBER
  - manually edited, data derived from inventor CVs
  - patent history of ~100 US inventors, mainly research scientists in university engineering and biochemistry depts

Verification Measures:

Verification statistics

Run # | Type | # of records | Underclumping | Overclumping | Recall | Precision
0 | Base Dataset | 9.17 million | n/a | | |
1 | Consolidated | 4.61 million | 74.6% | 1.7% | 25.40% | 93.73%
2 | Consolidated | 2.20 million | 12.3% | 4.8% | 87.70% | 94.81%
3 | Consolidated | 2.08 million | 6.8% | 10.1% | 93.20% | 90.22%
4 | Consolidated | 2.05 million | 4.6% | 10.3% | 95.40% | 90.26%
5 | Consolidated | 2.02 million | 4.1% | 10.3% | 95.90% | 90.30%
6 | Consolidated | 2.01 million | 2.8% | 19.2%** | 97.20% | 83.51%
7 | Consolidated | 1.99 million | 2.7% | 19.2%** | 97.30% | 83.52%
8 | Splitting | 2.26 million | 15.9% | 15.3% | 84.10% | 84.61%

** due to “blackhole” names
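The slide defining the verification measures did not survive in this copy. The reported figures are, however, numerically consistent with two simple relations: recall = 1 − underclumping, and precision = recall / (recall + overclumping). The sketch below checks that reading against every row of the table; the relations are an inference from the numbers, not the authors' stated definitions, and the function names are hypothetical.

```python
def recall_from(underclumping):
    # Inferred relation: recall and underclumping sum to 100% in every row.
    return 1.0 - underclumping

def precision_from(underclumping, overclumping):
    # Inferred relation: reproduces the reported precision to rounding.
    recall = recall_from(underclumping)
    return recall / (recall + overclumping)

# run: (underclumping, overclumping, reported recall, reported precision)
ROWS = {
    1: (0.746, 0.017, 0.2540, 0.9373),
    2: (0.123, 0.048, 0.8770, 0.9481),
    3: (0.068, 0.101, 0.9320, 0.9022),
    4: (0.046, 0.103, 0.9540, 0.9026),
    5: (0.041, 0.103, 0.9590, 0.9030),
    6: (0.028, 0.192, 0.9720, 0.8351),
    7: (0.027, 0.192, 0.9730, 0.8352),
    8: (0.159, 0.153, 0.8410, 0.8461),
}

for run, (under, over, recall, precision) in ROWS.items():
    print(run, round(recall_from(under), 4), round(precision_from(under, over), 4),
          "reported:", recall, precision)
```

Read this way, each extra consolidated run trades a gain in recall for a rise in overclumping, which is what pulls precision down in runs 6 and 7.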

Encouraging results

Challenges and Improvements
• Disambiguation of East Asian names is difficult
  - Current algorithm is well-suited for European names
• Systematic improvements required to handle correlations between fields
• Overclumping for common names: frequency adjustment using stop listing
  - Removing David Johnson, Eric Anderson, and Stephen Smith from our analysis improves the overclumping metric from 19.2% to 5.1% for the last two consolidated runs
• Computation time v. algorithmic accuracy
• Benchmark datasets for results analysis

Research applications

Origin of breakthroughs

Impact of legislation on innovation

Organizational influence on innovation

Inventor careers and collaboration networks

Dataverse Network Platform

http://dvn.iq.harvard.edu/dvn/dv/patent


Questions?

Appendix

Patent Data

• Prof Fleming, Amy, and Ron collaborate on patent 9999999.
• Data are organized in unique inventor-patent pairs.
• Unique inventor number (HBS disambiguation algorithm), constant between patents.
  - Invnum = Patent Num + inventor sequence
  - Invnum_N = disambiguated inventor identifier
• Patent is assigned to one entity (usually inventors’ employer or self if blank), constant over a patent.
• Location data are personal addresses (at the city level) of inventors, vary over a patent.

Invnum_N | Name | Patent | Assignee | City | State | …
12345 | Fleming, Lee | 5029133 | HP | Fremont | CA | …
12345 | Fleming, Lee | 9999999 | Harvard | Cambridge | MA | …
45678 | Yu, Amy | 9999999 | Harvard | Boston | MA | …
67890 | Lai, Ronald | 9999999 | Harvard | Randolph | MA | …
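For illustration, the initial identifier can be built from the patent number and inventor sequence. Judging from the sample rows earlier in the deck (patent 4091724 with InvSeq 3 gives Invnum 04091724-3), the patent number appears to be zero-padded to eight digits and joined to the sequence with a hyphen; that format is inferred from the examples, and `make_invnum` is a hypothetical helper.

```python
def make_invnum(patent, invseq):
    """Initial inventor id: zero-padded patent number, hyphen, inventor sequence."""
    # Assumption: eight-digit padding, as seen in the sample records.
    return f"{int(patent):08d}-{int(invseq)}"

print(make_invnum(4091724, 3))
print(make_invnum("7601456", 0))
```

Before any disambiguation run, Invnum_N simply starts out equal to this Invnum.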


Disambiguation Algorithm

Blocking

• Partition the inventor-patent dataset

• Based on seven different criteria

Training Sets

• Build a training set for each set of block criteria

• Each set is a database that contains four different tables, each with ~ 10 million pairs of record ids.

Ratios

• One ratio database is created for each training set

• Similarity profiles are paired with match probabilities.

Disambiguation

• Starts from invpat or previously disambiguated and consolidated database

• Within each block, we compare each record

• Output is invnum_N

Consolidation

• Based on the disambiguated invnum_N, update invnum_N within invpat

• Consolidates records with the same invnum_N

Summary of Data Passes

Run # | Type | Block1 | Block2
1 | Consolidated | First name | Last name
2 | Consolidated | First 5 characters of first name | First 8 characters of last name
3 | Consolidated | First 3 characters of first name | First 5 characters of last name
4 | Consolidated | Initials of first and middle names |
5 | Consolidated | First initial | First 5 characters of last name
6 | Consolidated | | Last 5 characters of last name, reversed
7 | Consolidated | First initial | Last 5 characters of last name, reversed
8 | Splitting | Invnum_N from step 7 |

Patent Similarity Profiles

• Seven-dimensional
• Fields used: name attributes (first name, middle initials, and last name) and patent attributes (author address, assignee, technology class, and coauthors)
• Each element is a discrete similarity score determined by a fieldwise comparison between two records
  - Jaro-Winkler string comparison
• Monotonicity assumption: if one profile dominates another profile (that is, each of its elements is greater than or equal to the corresponding element of the other similarity profile), then it must map to a higher match probability.
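The dominance relation behind the monotonicity assumption can be sketched as a check over a table of profiles and probabilities. The helper names are hypothetical, and the table reuses the demonstration values from the Ratios slide; treating equal probabilities as acceptable is an assumption of this sketch.

```python
def dominates(a, b):
    """True if profile a is element-wise greater than or equal to profile b."""
    return len(a) == len(b) and all(x >= y for x, y in zip(a, b))

def monotonicity_violations(table):
    """Pairs (a, b) where a dominates b but maps to a lower probability."""
    items = list(table.items())
    return [
        (a, b)
        for a, p_a in items
        for b, p_b in items
        if a != b and dominates(a, b) and p_a < p_b
    ]

# Demonstration probabilities taken from the Ratios slide.
TABLE = {
    (2, 4, 3, 4, 2, 1, 4): 0.3439485,
    (3, 4, 3, 5, 3, 2, 5): 0.5872638,
    (4, 5, 3, 7, 3, 4, 6): 0.7936452,
    (6, 6, 4, 8, 3, 8, 7): 0.9828447,
}
print(monotonicity_violations(TABLE))

# A deliberately broken table: the dominating profile gets the lower value.
BROKEN = {(1, 1): 0.9, (2, 2): 0.4}
print(monotonicity_violations(BROKEN))
```

Dominance is only a partial order, so two profiles such as (1, 3) and (3, 1) are not comparable and place no constraint on each other.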



Similarity Scores

Comparison function scoring: LEFT/RIGHT = LEFT VS RIGHT

1) Firstname: 0-6. Factors: # of token and similarity between tokens

0: Totally different: THOMAS ERIC/RICHARD JACK EVAN

1: ONE NAME MISSING: THOMAS ERIC/(NONE)

2: THOMAS ERIC/ THOMAS JOHN ALEX

3: LEE RON ERIC/LEE ALEX ERIC

4: Names match once spaces are removed, but the raw names do not: JOHNERIC/JOHN ERIC. Short name vs long name: ERIC/ERIC THOMAS

5: ALEX NICHOLAS/ALEX NICHOLAS TAKASHI

6: ALEX NICHOLAS/ALEX NICHOLA (might not be exactly the same, but identified as the same by Jaro-Winkler)

2) Lastname: 0-6 Factors: # of token and similarity between tokens

0: Totally different: ANDERSON/DAVIDSON

1: ONE NAME MISSING: ANDERSON/(NONE)

2: First part non-match: DE AMOUR/DA AMOUR

3: VAN DE WAALS/VAN DES WAALS

4: DE AMOUR/DEAMOUR

5: JOHNSTON/JOHNSON

6: DE AMOUR/DE AMOURS

3) Midname: 0-4 (THE FOLLOWING EXAMPLES ARE FROM THE COLUMN FIRSTNAME, SO FIRSTNAME IS INCLUDED)

0: THOMAS ERIC/JOHN THOMAS

1: JOHN ERIC/JOHN (MISSING)

2: THOMAS ERIC ALEX/JACK ERIC RONALD

3: THOMAS ERIC RON ALEX EDWARD/JACK ERIC RON ALEX LEE

4: THOMAS ERIC/THOMAS ERIC LEE

4) Assignee: 0-8

0: DIFFERENT ASGNUM, TOTALLY DIFFERENT NAMES (no single common word)

1: DIFFERENT ASGNUM, One name missing

2: DIFFERENT ASGNUM, Harvard University Longwood Medical School / Dartmouth Hitchcock Medical Center

3: DIFFERENT ASGNUM, Harvard University President and Fellows / Presidents and Fellow of Harvard

4: DIFFERENT ASGNUM, Harvard University / Harvard University Medical School

5: DIFFERENT ASGNUM, Microsoft Corporation/Microsoft Corporated

6: SAME ASGNUM, COMPANY SIZE>1000

7: SAME ASGNUM, 1000>SIZE>100

8: SAME ASGNUM, SIZE<100

5) CLASS: 0-4

# OF COMMON CLASSES. MISSING=1

6) COAUTHORS: 0-10

# OF COMMON COAUTHORS

7) DISTANCE: 0-7 FACTORS: LONGITUDE/LATITUDE, STREET ADDRESS

0: TOTALLY DIFFERENT

1: ONE IS MISSING

2: 75<DISTANCE < 100KM

3: 50<DISTANCE < 75

4: 10<DISTANCE < 50

5: DISTANCE < 10

6: DISTANCE < 10 AND STREET MATCH BUT NOT IN US, OR DISTANCE < 10 AND IN US BUT STREET NOT MATCH

7: STREET MATCH AND IN US
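As one example of a discrete field score, the distance scale above can be sketched with a great-circle distance. Several details are assumptions here: the haversine formula as the distance measure, the handling of exact boundary values, reading "totally different" as 100 km or more, and requiring a distance under 10 km for score 7. `distance_score` and its parameters are hypothetical names.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distance_score(loc_a, loc_b, street_match=False, in_us=False):
    """Map two (lat, lon) pairs to the 0-7 distance score."""
    if loc_a is None or loc_b is None:
        return 1                      # one location is missing
    d = haversine_km(*loc_a, *loc_b)
    if d >= 100:
        return 0                      # assumption: "totally different"
    if d >= 75:
        return 2
    if d >= 50:
        return 3
    if d >= 10:
        return 4
    if street_match and in_us:
        return 7
    if street_match or in_us:
        return 6                      # exactly one of the two conditions holds
    return 5

fremont = (37.57, -122.05)
print(distance_score(fremont, fremont, in_us=True))
print(distance_score((38.76, -93.73), (37.01, -93.20), in_us=True))  # Warrensburg vs Ozark
```

With the coordinates from the sample records, the two Fremont rows score 6 (same city, in the US, no street match), while Warrensburg and Ozark are roughly 200 km apart and score 0.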



Probabilistic Matching Model
• Name and patent attributes are assumed to be independent
• Unbiased training sets are created by conditioning on one set of features to create a sample of obvious matches or non-matches, then using that sample to learn about the other set of features without bias
• Count the frequency of each similarity profile x in the match and nonmatch sets to calculate P(x|M) and P(x|N)
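The counting step can be sketched as follows, using four toy training tables (name and patent sub-profiles, each for matches and nonmatches) and the independence assumption P(x|M) = P(α|M) · P(β|M). All data and names here are made up for illustration; the real tables hold on the order of 10 million record pairs each.

```python
from collections import Counter

def frequencies(sub_profiles):
    """Relative frequency of each sub-profile in one training table."""
    counts = Counter(tuple(p) for p in sub_profiles)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def likelihood_ratio(alpha, beta, tables):
    """r = P(x|M) / P(x|N) with x = (alpha, beta), assuming independence."""
    p_x_match = tables["name_match"][alpha] * tables["patent_match"][beta]
    p_x_nonmatch = tables["name_nonmatch"][alpha] * tables["patent_nonmatch"][beta]
    return p_x_match / p_x_nonmatch

# Toy training tables (hypothetical counts).
hi_name, lo_name = (6, 6, 4), (2, 4, 3)
hi_pat, lo_pat = (8, 3, 8, 7), (4, 2, 1, 4)
tables = {
    "name_match": frequencies([hi_name] * 3 + [lo_name]),
    "name_nonmatch": frequencies([hi_name] + [lo_name] * 3),
    "patent_match": frequencies([hi_pat] * 4 + [lo_pat]),
    "patent_nonmatch": frequencies([hi_pat] + [lo_pat] * 4),
}
print(likelihood_ratio(hi_name, hi_pat, tables))
print(likelihood_ratio(lo_name, lo_pat, tables))
```

This sketch raises a `KeyError` for a sub-profile absent from a table; the slides handle such unobserved profiles by interpolating or extrapolating r instead.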


Training Set Criteria

Name attributes (condition on patent attributes to train name attributes)
• Match: choose all the record pairs that have at least two common coauthors within each predefined block.
• Nonmatch: choose all record pairs that have the same appyear, different assignee, no common coauthors and no common classes within each predefined block.

Patent attributes (condition on name attributes to train patent attributes)
• Match: choose all the record pairs that share the same rare name (calculate statistics on unique full names, choose those whose first or last name appears only once). Not necessary to check each block.
• Nonmatch: choose all record pairs that have different last names, from a subset of the whole database in which the number of records is proportional to the original one in terms of grant year.

Probabilistic Matching Model
• Likelihood ratio: r = P(x|M)/P(x|N)
• Probability of match given similarity profile x: P(M|x) = 1 / (1 + (1 − P(M)) / (P(M) · r)), where P(M) is empirically determined
• Smoothing: enforce monotonicity
• r is interpolated/extrapolated for unobserved x

Disambiguation & Consolidation

• Generate a similarity profile for each pair of records within each block
• Look up the similarity profile in the ratio database to find the match probability
• Based on a given probability threshold, we determine if invnum_N (algorithmically generated unique inventor identifier) should be updated
• Records with the same invnum_N are consolidated
  - Improves algorithm efficiency for subsequent runs
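The comparison loop within one block can be sketched as below. The pairwise scan, the relabelling rule (a matched record takes the invnum_N of the earlier record, as in the 07540433-1 to 07066218-1 example), and all names are assumptions for illustration; the 0.95 threshold is the value shown on the Disambiguation slide.

```python
from itertools import combinations

def disambiguate_block(block, similarity_profile, ratio_table, threshold=0.95):
    """Return updated invnum_N labels for the records of one block."""
    labels = [rec["invnum_n"] for rec in block]
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue  # already treated as the same inventor
        x = tuple(similarity_profile(block[i], block[j]))
        if ratio_table.get(x, 0.0) > threshold:
            old, new = labels[j], labels[i]
            # Relabel every record carrying the old id so merges stay transitive.
            labels = [new if label == old else label for label in labels]
    return labels

block = [
    {"name": "EILEEN FLEMING", "zipcode": "78383", "invnum_n": "07066218-1"},
    {"name": "EILEEN FLEMING", "zipcode": "78382", "invnum_n": "07540433-1"},
]
ratio_table = {(6, 6, 4, 8, 3, 8, 7): 0.9828447}  # demonstration value from the slides

def toy_profile(a, b):
    # Stand-in for the real fieldwise comparison functions.
    return (6, 6, 4, 8, 3, 8, 7)

print(disambiguate_block(block, toy_profile, ratio_table))
print(disambiguate_block(block, toy_profile, ratio_table, threshold=0.99))
```

Profiles missing from the ratio table default to probability 0 in this sketch, whereas the slides interpolate or extrapolate r for unobserved profiles.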



Verification Measures

References

Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001). The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools. NBER.

Torvik, V., M. Weeber, D. Swanson, and N. Smalheiser (2005). “A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation.” Journal of the American Society for Information Science and Technology, 56(2):140–158.

Torvik, V. and N. Smalheiser (2009). “Author Name Disambiguation in MEDLINE.” ACM Transactions on Knowledge Discovery from Data, Vol. 3., No. 3, Article 11.
