DISAMBIGUATION OF USPTO INVENTORS
The Institute for Quantitative Social Science at Harvard University
Name Game Workshop – Madrid 9-10 December 2010
Presenter: Amy Yu
Coauthors: Ronald Lai, Alex D’Amour, Lee Fleming
Technical Collaborator: Edward Sun
We would like to thank the NSF for supporting this research. Errors and omissions remain ours (though we ask that you bring them to our attention).
Agenda
• Introduction
• Methodology
• Torvik-Smalheiser Algorithm (PubMed)
• Results and Analysis
• Descriptive Statistics
• DVN platform
Introduction
Background
• Patent data made available by the USPTO enables further research into technology and innovation
• The NBER database includes authorship, firm, and state-level data but has not completed the effort to disambiguate unique inventors (Hall, Trajtenberg, and Jaffe, 2001)
• Inventor disambiguation is non-trivial: the USPTO does not require consistent and unique identifiers for inventors
Motivation
• Inventor disambiguation allows for construction of inventor collaboration networks
• Opens new avenues of study:
    • Which inventors are most central in their field?
    • How does connectedness affect inventor productivity?
    • What corporate structures are conducive to innovation?
    • How do legal changes impact idea flow?
• Build a scalable, automated system for tracking and analyzing developments in the inventor community
Methodology
Overview
Previous methodology (2008)
• Linear, unsupervised, more intuitive
• Similarity between records is a weighted average of element-wise similarity scores
• Weights are not optimized
• Strong results for the US: Lai, D’Amour, and Fleming (2008) showed recall of 97.3% and precision of 96.1%
Current methodology (2010)
• Variation of the Torvik-Smalheiser algorithm (Torvik et al., 2005; Torvik and Smalheiser, 2009)
• Multi-dimensional similarity profiles
• Semi-supervised with automatically generated training sets
• Optimal weighting, non-linear interactions
• Easier to scale
Disambiguation Process
[Process diagram: weekly USPTO patent data (1998 – 2010) and public databases → HBS scripts for data preparation (load and validate; clean and format; generate datasets) → primary datasets (Inventors, Assignee, Classes, Patents) → consolidated inventor dataset → HBS scripts for the inventor disambiguation algorithm → consolidated inventor matched dataset]
Data preparation
• Create inventor, assignee, patent, and classification datasets from primary and secondary data sources
    • USPTO: weekly patent data in XML files
    • NBER Patent Data Project: assignee data
    • National Geospatial-Intelligence Agency: location data
• Standardize and reformat
    • Removal of excess whitespace, removal of tags, and translation of Unicode characters
• Construct the inventor-patent database
    • Consolidate inventor, assignee, patent, and classification datasets
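The standardization step (whitespace, tags, Unicode) can be sketched in Python as below. This is a minimal illustration and not the HBS scripts themselves: the function name `clean_field`, the choice to upper-case the result (suggested only by the upper-case sample records), and the decision to reduce accented characters to ASCII are all assumptions.

```python
import html
import re
import unicodedata

# Matches anything that looks like a markup tag, e.g. <b> or </i>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_field(raw: str) -> str:
    """Standardize one text field (hypothetical helper, not the original script)."""
    text = html.unescape(raw)                    # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)                 # remove tags
    text = unicodedata.normalize("NFKD", text)   # split accents from base letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.encode("ascii", "ignore").decode("ascii")  # drop what is left
    text = re.sub(r"\s+", " ", text).strip()     # collapse excess whitespace
    return text.upper()                          # assumption: upper-case output
```

Dropping characters that have no ASCII equivalent is the bluntest possible "translation"; a production pipeline would more likely use an explicit transliteration table.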
Patent Data: Base Datasets

INVENTOR
Field     | Description
Invnum_N  | Disambiguated inventor number
Invnum    | Initial inventor number: Patent + InvSeq
Firstname | Inventor first name
Lastname  | Inventor last name
InvSeq    | Inventor number on patent
Street    | Inventor’s street address
City      | Inventor’s city
State     | State (US only)
Zipcode   | Zipcode (US only)
Lat       | Latitude
Long      | Longitude

PATENT
Field     | Description
Patent    | USPTO-assigned patent number
AppDate   | Patent application date
GDate     | Patent grant date
AppYear   | Patent application year

ASSIGNEE
Field     | Description
Assignee  | Primary firm associated with patent
Asgnum    | Generated assignee number

CLASSES
Field     | Description
Class     | Main patent classification
Subclass  | Patent subclassification

* HBS algorithm generated variables.
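As an illustration of "Invnum = Patent + InvSeq", the sketch below builds the initial identifier in the form seen in the sample records (for example 04091724-3). The eight-digit zero padding and the hyphen are inferred from those samples; the function names are hypothetical, and only purely numeric patent numbers are handled.

```python
def make_invnum(patent: str, invseq: int) -> str:
    """Build the initial inventor number from patent number and inventor sequence.

    Assumes a purely numeric patent number, zero-padded to eight digits.
    """
    return f"{int(patent):08d}-{invseq}"


def split_invnum(invnum: str) -> tuple[str, int]:
    """Inverse of make_invnum: recover (patent, invseq)."""
    patent, invseq = invnum.rsplit("-", 1)
    return str(int(patent)), int(invseq)
```

Before disambiguation every record starts with Invnum_N equal to its own Invnum, so each inventor-patent pair is initially treated as a distinct inventor.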
Consolidated Dataset
Firstname | Lastname | City | State | Country | Zipcode | Lat | Lng | InvSeq | Patent | AppYear | GYear | AppDate | Assignee | AsgNum | Class | Invnum | Invnum_N
GAROLD LEE | FLEMING | NEWTON | KS | US | 67117 | 38.13 | -97.32 | 3 | 4091724 | 1977 | 1978 | 3/16/1977 | HESSTON CORPORATION | H000000002441 | 100-21 | 04091724-3 | 04091724-3
LEE | FLEMING | FREMONT | CA | US | 94555 | 37.57 | -122.05 | 2 | 5029133 | 1990 | 1991 | 8/30/1990 | HEWLETT PACKARD COMPANY | A000010088678 | 365-189.02/365-201/714-703/714-731 | 05029133-2 | 05029133-2
LEE O | FLEMING | FREMONT | CA | US | 94555 | 37.57 | -122.05 | 1 | 5136185 | 1991 | 1992 | 9/20/1991 | HEWLETT PACKARD COMPANY | A000010088678 | 326-16/326-31/326-56/326-82 | 05136185-1 | 05136185-1
CATHLEEN M | FLEMING | FOREST HILL | MD | US | 21050 | 39.57 | -76.40 | 2 | 5799675 | 1997 | 1998 | 3/3/1997 | COLOR PRELUDE INC | A000011790130 | 132-333/132-317 | 05799675-2 | 05799675-2
EILEEN | FLEMING | SANDIA | TX | US | 78383 | 28.09 | -97.94 | 1 | 7066218 | 2003 | 2006 | 10/29/2003 | TMC SYSTEMS L P | H000000134163 | 141-198/239-63/239-64/239-67 | 07066218-1 | 07066218-1
EILEEN | FLEMING | SANDIA | TX | US | 78382 | 30.09 | -100.94 | 1 | 7540433 | 2006 | 2009 | 4/27/2006 | TMC SYSTEMS L P | H000000134163 | 239-69/141-198/239-63/239-64 | 07540433-1 | 07540433-1
ELENA | FLEMING | NORTH VANCOUVER | | CA | | 49.32 | -123.07 | 5 | 7521240 | 2004 | 2009 | 12/6/2004 | SMITHKLINE BEECHAM CORPORATION | A000011538118 | 435 | 07521240-5 | 07521240-5
ELIZABETH A | FLEMING | HOUSTON | TX | US | 77299 | 29.77 | -95.41 | 1 | 5164591 | 1991 | 1992 | 9/9/1991 | SHELL OIL COMPANY | A000010266734 | 250 | 05164591-1 | 05164591-1
ELIZABETH S | FLEMING | BELMONT | MA | US | 2478 | 42.39 | -71.18 | 3 | 6335339 | 1999 | 2002 | 1/13/1999 | SCRIPTGEN PHARMACEUTICALS INC | H000000014253 | 514 | 06335339-3 | 06335339-3
ELLEN L | FLEMING | BIRMINGHAM | MI | US | 48012 | 42.54 | -83.21 | 2 | 5683090 | 1995 | 1997 | 6/7/1995 | | | 273/463 | 05683090-2 | 05683090-2
ERIC MICHAEL | FLEMING | SELKIRK | | CA | | 50.15 | -96.88 | 2 | 5906799 | 1992 | 1999 | 6/1/1992 | HEMLOCK SEMICONDUCTOR CORPORATION | A000011501872 | 422/501 | 05906799-2 | 05906799-2
FRANCIS WILLIAM | FLEMING | GLASGOW | | GB | | 55.83 | -4.25 | 1 | 4118646 | 1977 | 1978 | 6/17/1977 | MARKON ENGINEERING COMPANY LIMITED | H000000234058 | 310/165 | 04118646-1 | 04118646-1
FRANK ALBERT | FLEMING | WARRENSBURG | MO | US | 64093 | 38.76 | -93.73 | 1 | 6555265 | 2000 | 2003 | 11/6/2000 | HAWKER ENERGY PRODUCTS INC | H000000026219 | 429 | 06555265-1 | 06555265-1
FRANK ALBERT | FLEMING | OZARK | MO | US | 65721 | 37.01 | -93.20 | 1 | 6815118 | 2003 | 2004 | 1/4/2003 | HAWKER ENERGY PRODUCTS INC | H000000026219 | 429 | 06815118-1 | 06555265-1
FRANK ALBERT | FLEMING | WARRENSBURG | MO | US | 64093 | 38.76 | -93.73 | 0 | 7601456 | 2005 | 2009 | 2/23/2005 | HAWKWER ENERGY PRODUCTS INC | H000000202360 | 429 | 07601456-0 | 06555265-1
FRANK JOSEPH | FLEMING | MELBOURNE | FL | US | 32951 | 28.02 | -80.54 | 1 | 6931132 | 2002 | 2005 | 5/10/2002 | HARRIS CORPORATION | A000010045633 | 380/713 | 06931132-1 | 06931132-1

Field groups called out on the slide: inventor first and last name; location data; patent application and grant dates; patent number; assignee data; patent class; Invnum; Invnum_N.
Disambiguation Algorithm
Blocking
Training Sets
Ratios
Disambiguation
Consolidation
Blocking
Run # | Type         | Block1                             | Block2
1     | Consolidated | First name                         | Last name
2     | Consolidated | First 5 characters of first name   | First 8 characters of last name
3     | Consolidated | First 3 characters of first name   | First 5 characters of last name
4     | Consolidated | Initials of first and middle names | First 5 characters of last name
5     | Consolidated | First initial                      | First 5 characters of last name
6     | Consolidated | Initials of first and middle names | Last 5 characters of last name, reversed
7     | Consolidated | First initial                      | Last 5 characters of last name, reversed

Example records:
Firstname  | Lastname | City        | State | Country | Zipcode | …
GAROLD LEE | FLEMING  | NEWTON      | KS    | US      | 67117   | …
LEE        | FLEMING  | FREMONT     | CA    | US      | 94555   | …
LEE O      | FLEMING  | FREMONT     | CA    | US      | 94555   | …
CATHLEEN M | FLEMING  | FOREST HILL | MD    | US      | 21050   | …
EILEEN     | FLEMING  | SANDIA      | TX    | US      | 78383   | …
EILEEN     | FLEMING  | SANDIA      | TX    | US      | 78382   | …
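The seven blocking rules can be expressed as pairs of key functions, one for each of Block1 and Block2. The sketch below is a hedged illustration only: it assumes that the Firstname field carries first and middle names together (as in "LEE O"), that run 1 uses the whole field, and that names are already cleaned and upper-cased. `block_key` and `block_records` are hypothetical names.

```python
from collections import defaultdict


def initials(first_and_middle: str) -> str:
    """Initials of first and middle names, e.g. 'LEE O' -> 'LO'."""
    return "".join(token[0] for token in first_and_middle.split())


def last5_reversed(last: str) -> str:
    """Last 5 characters of the last name, reversed, e.g. 'FLEMING' -> 'GNIME'."""
    return last[-5:][::-1]


# run number -> (Block1 function on Firstname, Block2 function on Lastname)
BLOCKING_RULES = {
    1: (lambda f: f, lambda l: l),
    2: (lambda f: f[:5], lambda l: l[:8]),
    3: (lambda f: f[:3], lambda l: l[:5]),
    4: (initials, lambda l: l[:5]),
    5: (lambda f: f[:1], lambda l: l[:5]),
    6: (initials, last5_reversed),
    7: (lambda f: f[:1], last5_reversed),
}


def block_key(firstname: str, lastname: str, run: int) -> tuple[str, str]:
    first_fn, last_fn = BLOCKING_RULES[run]
    return first_fn(firstname), last_fn(lastname)


def block_records(records, run):
    """Partition records (dicts with 'firstname' and 'lastname') into blocks."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["firstname"], rec["lastname"], run)].append(rec)
    return dict(blocks)
```

Under these assumptions "LEE FLEMING" and "LEE O FLEMING" fall into different blocks in runs 1 and 2 but share a block from run 3 onward, which is how successively looser passes give near-duplicate names a chance to be compared.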
Training Sets
Similarity profile: x = [x1, x2, x3, x4, x5, x6, x7] = (α, β), where α holds the name-attribute scores and β holds the patent-attribute scores.

         | Name Attributes | Patent Attributes
Match    | P(α|M)          | P(β|M)
Nonmatch | P(α|N)          | P(β|N)

P(x|M) = P(α|M) * P(β|M) = probability of seeing similarity profile x given a match
P(x|N) = P(α|N) * P(β|N) = probability of seeing similarity profile x given a nonmatch
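A small sketch of how the four training tables could be turned into P(x|M) and P(x|N) by counting, using the independence of name and patent attributes. The function names and the toy profiles in the usage note are illustrative assumptions, not the project's code or data.

```python
from collections import Counter


def profile_distribution(profiles):
    """Relative frequency of each partial similarity profile in one training table."""
    counts = Counter(tuple(p) for p in profiles)
    total = sum(counts.values())
    return {profile: n / total for profile, n in counts.items()}


def joint_probability(alpha, beta, p_alpha, p_beta):
    """P(x | class) = P(alpha | class) * P(beta | class) under independence.

    Profiles never seen in training get probability 0.0 here; the slides
    instead interpolate or extrapolate the ratio for unobserved profiles.
    """
    return p_alpha.get(tuple(alpha), 0.0) * p_beta.get(tuple(beta), 0.0)
```

For example, if the name part (6, 6, 4) accounts for half of the match pairs in the name table and the patent part (8, 3, 8, 7) for three quarters of the match pairs in the patent table, then P(x|M) for the full profile is 0.5 * 0.75 = 0.375.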
Ratios
• Likelihood ratio: r = P(x|M)/P(x|N), generated from training sets
• Probability of match given similarity profile x: P(M|x) = 1 / (1 + ((1 − P(M)) / P(M)) * (1/r))
    • P(M) is empirically determined
    • Smoothing: enforce monotonicity
    • r is interpolated/extrapolated for unobserved profiles x

Similarity Profile    | Match Probability P(M|x)*
[2, 4, 3, 4, 2, 1, 4] | 0.3439485
[3, 4, 3, 5, 3, 2, 5] | 0.5872638
[4, 5, 3, 7, 3, 4, 6] | 0.7936452
[6, 6, 4, 8, 3, 8, 7] | 0.9828447
…                     | …

* approximated probabilities for demonstration
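The step from the likelihood ratio r to the match probability P(M|x) is Bayes' rule written in odds form: posterior odds = r × prior odds. A minimal sketch follows; the function name is hypothetical and the prior P(M) is simply passed in, since the slides only say it is determined empirically.

```python
def match_probability(r: float, prior_match: float) -> float:
    """P(M|x) from the likelihood ratio r = P(x|M)/P(x|N) and the prior P(M)."""
    if r < 0:
        raise ValueError("likelihood ratio must be non-negative")
    if not 0.0 < prior_match < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior_match / (1.0 - prior_match)
    posterior_odds = r * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```

This is algebraically the same as 1 / (1 + ((1 − P(M)) / P(M)) * (1/r)), but it avoids dividing by r when r is zero.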
Disambiguation
Example records:
Firstname  | Lastname | City        | State | Country | Zipcode | …
GAROLD LEE | FLEMING  | NEWTON      | KS    | US      | 67117   | …
LEE        | FLEMING  | FREMONT     | CA    | US      | 94555   | …
LEE O      | FLEMING  | FREMONT     | CA    | US      | 94555   | …
CATHLEEN M | FLEMING  | FOREST HILL | MD    | US      | 21050   | …
EILEEN     | FLEMING  | SANDIA      | TX    | US      | 78383   | …
EILEEN     | FLEMING  | SANDIA      | TX    | US      | 78382   | …

Pair compared:
Firstname | Lastname | City   | State | Country | Zipcode | …
EILEEN    | FLEMING  | SANDIA | TX    | US      | 78383   | …
EILEEN    | FLEMING  | SANDIA | TX    | US      | 78382   | …

Similarity profile of the pair: [6, 6, 4, 8, 3, 8, 7]
Match probability looked up for this profile: 0.9828447, which is > 0.95 (the threshold)

Before:
Invnum     | Invnum_N
07066218-1 | 07066218-1
07540433-1 | 07540433-1

After:
Invnum     | Invnum_N
07066218-1 | 07066218-1
07540433-1 | 07066218-1
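The update shown on this slide can be sketched as a pairwise loop over one block: look up the pair's similarity profile, and if the match probability exceeds the threshold, give both records the same invnum_N. The function and argument names are hypothetical, and the simple relabeling strategy is an assumption rather than the project's actual implementation.

```python
def disambiguate_block(records, profile_fn, match_prob, threshold=0.95):
    """Merge inventor identifiers within one block.

    records    : list of dicts with at least 'invnum' and 'invnum_n'
    profile_fn : function (left, right) -> similarity profile as a tuple
    match_prob : dict mapping similarity profile -> P(M|x)
    """
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if left["invnum_n"] == right["invnum_n"]:
                continue  # already identified as the same inventor
            prob = match_prob.get(profile_fn(left, right), 0.0)
            if prob > threshold:
                old_id, new_id = right["invnum_n"], left["invnum_n"]
                # Relabel every record carrying the old identifier.
                for rec in records:
                    if rec["invnum_n"] == old_id:
                        rec["invnum_n"] = new_id
    return records
```

With the two EILEEN FLEMING records and a lookup that returns 0.9828447 for their profile, the second record's invnum_N changes from 07540433-1 to 07066218-1, while records whose pairwise probability stays below 0.95 keep their own identifiers.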
Consolidation
Before consolidation:
Firstname  | Lastname | City        | State | Country | Zipcode | … | Invnum     | Invnum_N
GAROLD LEE | FLEMING  | NEWTON      | KS    | US      | 67117   | … | 04091724-3 | 04091724-3
LEE        | FLEMING  | FREMONT     | CA    | US      | 94555   | … | 05029133-2 | 05029133-2
LEE O      | FLEMING  | FREMONT     | CA    | US      | 94555   | … | 05136185-1 | 05136185-1
CATHLEEN M | FLEMING  | FOREST HILL | MD    | US      | 21050   | … | 05799675-2 | 05799675-2
EILEEN     | FLEMING  | SANDIA      | TX    | US      | 78383   | … | 07066218-1 | 07066218-1
EILEEN     | FLEMING  | SANDIA      | TX    | US      | 78382   | … | 07540433-1 | 07066218-1

After consolidation (the two EILEEN FLEMING records share an Invnum_N and are merged into one; the ~n suffix appears to count the source records carrying each value):
Firstname  | Lastname  | City        | State | Country | Zipcode         | … | Invnum     | Invnum_N
GAROLD LEE | FLEMING   | NEWTON      | KS    | US      | 67117           | … | 04091724-3 | 04091724-3
LEE        | FLEMING   | FREMONT     | CA    | US      | 94555           | … | 05029133-2 | 05029133-2
LEE O      | FLEMING   | FREMONT     | CA    | US      | 94555           | … | 05136185-1 | 05136185-1
CATHLEEN M | FLEMING   | FOREST HILL | MD    | US      | 21050           | … | 05799675-2 | 05799675-2
EILEEN~2   | FLEMING~2 | SANDIA~2    | TX~2  | US~2    | 78383~1/78382~1 | … | 07066218-1 | 07066218-1
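A sketch of the consolidation step: records sharing an invnum_N are collapsed into one, and each field keeps its distinct values with occurrence counts in the `VALUE~count/VALUE~count` form seen in the EILEEN FLEMING example. Reading `~n` as an occurrence count is an interpretation of that example, and `consolidate` is a hypothetical name.

```python
from collections import Counter, defaultdict


def consolidate(records, fields):
    """Collapse records that share 'invnum_n' into one record per inventor."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["invnum_n"]].append(rec)

    consolidated = []
    for invnum_n, group in groups.items():
        merged = {"invnum_n": invnum_n}
        for field in fields:
            if len(group) == 1:
                # Unmerged records keep their plain values, as on the slide.
                merged[field] = group[0][field]
            else:
                counts = Counter(rec[field] for rec in group)
                merged[field] = "/".join(
                    f"{value}~{n}" for value, n in counts.most_common()
                )
        consolidated.append(merged)
    return consolidated
```

Keeping the counts preserves the evidence from every source record while shrinking the dataset that later runs have to compare.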
Process Map: Consolidated Steps
[Process diagram: starting from the inventor-patent dataset, each consolidated run i = 1 … 7 uses training set tsetCi to produce ratio database ratioi, which drives disambiguation step Di followed by consolidation step Ci; each run feeds the next. The inventor disambiguation algorithm's output after run 7 is the lower bound result.]
Final Step: Splitting
[Process diagram: the inventor-patent dataset after D7 is blocked on invnum_N and disambiguated once more (D8) using ratio7; the output is the upper bound result.]
Results and Analysis
Patents and Inventors* 1975 – 2010
[Line chart: count (0 to 250,000) by year (1975 to 2009) for three series: Unique Inventors (upper bound), Unique Inventors (lower bound), Total Patents]
* excluding East Asian inventors
Patents Per Inventor*
Patents per inventor | Share
1                    | 66.78%
2 thru 5             | 26.86%
6 thru 10            | 3.94%
11 thru 50           | 2.29%
50+                  | 0.14%
* based on lower bound disambiguation
Top 10 Inventors*
Firstname | Lastname    | Country | Assignee                            | Number of Patents
KIA       | SILVERBROOK | AU      | SILVERBROOK RESEARCH PTY LTD        | 3382
DONALD E  | WEDER       | US      | WANDA M WEDER AND WILLIAM F STAETER | 1001
LEONARD   | FORBES      | US      | MICRON TECHNOLOGY INC               | 925
GURTEJ S  | SANDHU      | US      | MICRON TECHNOLOGY INC               | 832
PAUL      | LAPSTUN     | AU      | SILVERBROOK RESEARCH PTY LTD        | 803
WARREN M  | FARNWORTH   | US      | MICRON TECHNOLOGY INC               | 729
GEORGE    | SPECTOR     | US      | THE RUIZ LAW FIRM                   | 715
SALMAN    | AKRAM       | US      | MICRON TECHNOLOGY INC               | 670
WILLIAM I | WOOD        | US      | GENENTECH INC                       | 646
AUSTIN L  | GURNEY      | US      | GENENTECH INC                       | 618
* based on lower bound disambiguation, excluding East Asian inventors
Unique Coauthors by Patent Grant Year
[Line chart: count by grant year (1975 to 2010) for two series: Average # Coauthors LB, Average # Coauthors UB]
Largest Component per Year
[Line chart: component size (number of vertices, 0 to 45,000) by grant year (1975 to 2009) for two series: lower bound disambiguation, upper bound disambiguation]
Analysis
• Benchmark dataset from Jerry Marschke, NBER
    • manually edited, data derived from inventor CVs
    • patent history of ~100 US inventors, mainly research scientists in university engineering and biochemistry departments
• Verification Measures:
Verification statistics
Run # | Type         | # of records | Underclumping | Overclumping | Recall | Precision
0     | Base Dataset | 9.17 million | n/a           | n/a          | n/a    | n/a
1     | Consolidated | 4.61 million | 74.6%         | 1.7%         | 25.40% | 93.73%
2     | Consolidated | 2.20 million | 12.3%         | 4.8%         | 87.70% | 94.81%
3     | Consolidated | 2.08 million | 6.8%          | 10.1%        | 93.20% | 90.22%
4     | Consolidated | 2.05 million | 4.6%          | 10.3%        | 95.40% | 90.26%
5     | Consolidated | 2.02 million | 4.1%          | 10.3%        | 95.90% | 90.30%
6     | Consolidated | 2.01 million | 2.8%          | 19.2%**      | 97.20% | 83.51%
7     | Consolidated | 1.99 million | 2.7%          | 19.2%**      | 97.30% | 83.52%
8     | Splitting    | 2.26 million | 15.9%         | 15.3%        | 84.10% | 84.61%
** due to “blackhole” names
Encouraging results
Challenges and Improvements
• Disambiguation of East Asian names is difficult
    • Current algorithm is well-suited for European names
    • Systematic improvements required to handle correlations between fields
• Overclumping for common names: frequency adjustment using stop listing
    • Removing David Johnson, Eric Anderson, and Stephen Smith from our analysis improves the overclumping metric from 19.2% to 5.1% for the last two consolidated runs
• Computation time v. algorithmic accuracy
• Benchmark datasets for results analysis
Research applications
Origin of breakthroughs
Impact of legislation on innovation
Organizational influence on innovation
Inventor careers and collaboration networks
Dataverse Network Platform
Questions?
Appendix
Patent Data
• Prof Fleming, Amy, and Ron collaborate on patent 9999999.
• Data are organized in unique inventor-patent pairs.
• Unique inventor number (HBS disambiguation algorithm), constant between patents.
    • Invnum = Patent Num + inventor sequence
    • Invnum_N = disambiguated inventor identifier
• Patent is assigned to one entity (usually inventors’ employer or self if blank), constant over a patent.
• Location data are personal addresses (at the city level) of inventors, vary over a patent.

Invnum_N | Name         | Patent  | Assignee | City      | State | …
12345    | Fleming, Lee | 5029133 | HP       | Fremont   | CA    | …
12345    | Fleming, Lee | 9999999 | Harvard  | Cambridge | MA    | …
45678    | Yu, Amy      | 9999999 | Harvard  | Boston    | MA    | …
67890    | Lai, Ronald  | 9999999 | Harvard  | Randolph  | MA    | …
Disambiguation Algorithm
Blocking
• Partition the inventor-patent dataset
• Based on seven different criteria
Training Sets
• Build a training set for each set of block criteria
• Each set is a database that contains four different tables, each with ~ 10 million pairs of record ids.
Ratios
• One ratio database is created for each training set
• Similarity profiles are paired with match probabilities.
Disambiguation
• Starts from invpat or previously disambiguated and consolidated database
• Within each block, we compare each pair of records
• Output is invnum_N
Consolidation
• Based on the disambiguated invnum_N, update invnum_N within invpat
• Consolidates records with the same invnum_N
Summary of Data Passes
Run # | Type         | Block1                             | Block2
1     | Consolidated | First name                         | Last name
2     | Consolidated | First 5 characters of first name   | First 8 characters of last name
3     | Consolidated | First 3 characters of first name   | First 5 characters of last name
4     | Consolidated | Initials of first and middle names | First 5 characters of last name
5     | Consolidated | First initial                      | First 5 characters of last name
6     | Consolidated | Initials of first and middle names | Last 5 characters of last name, reversed
7     | Consolidated | First initial                      | Last 5 characters of last name, reversed
8     | Splitting    | Invnum_N from step 7               |
Patent Similarity Profiles
• Seven-dimensional
• Fields used: name attributes (first name, middle initials, and last name) and patent attributes (author address, assignee, technology class, and coauthors)
• Each element is a discrete similarity score determined by a fieldwise comparison between two records
    • Jaro-Winkler string comparison
• Monotonicity assumption: if one profile dominates another profile (that is, each of its elements is greater than or equal to the corresponding element of the other profile), then it must map to a match probability that is at least as high.
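The monotonicity assumption can be checked mechanically. The sketch below tests dominance between two profiles and lists pairs in a profile-to-probability table that violate the assumption; it only detects violations and does not perform the smoothing itself. Function names are hypothetical.

```python
def dominates(a, b):
    """True if every element of profile a is >= the matching element of b."""
    return len(a) == len(b) and all(x >= y for x, y in zip(a, b))


def monotonicity_violations(prob_table):
    """Pairs (dominating, dominated) where the dominating profile has the
    strictly lower match probability.

    prob_table maps similarity profiles (tuples) to match probabilities.
    """
    items = list(prob_table.items())
    violations = []
    for profile_a, prob_a in items:
        for profile_b, prob_b in items:
            if profile_a != profile_b and dominates(profile_a, profile_b) and prob_a < prob_b:
                violations.append((profile_a, profile_b))
    return violations
```

Note that dominance is only a partial order: profiles such as (6, 0) and (0, 6) are not comparable, so the assumption places no constraint on their relative probabilities.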
Similarity Scores
Comparison function scoring: LEFT/RIGHT = LEFT VS RIGHT
1) Firstname: 0-6. Factors: # of tokens and similarity between tokens
0: Totally different: THOMAS ERIC/RICHARD JACK EVAN
1: ONE NAME MISSING: THOMAS ERIC/(NONE)
2: THOMAS ERIC/ THOMAS JOHN ALEX
3: LEE RON ERIC/LEE ALEX ERIC
4: Names match once spaces are removed, but the raw names don't: JOHNERIC/JOHN ERIC. Short name vs long name: ERIC/ERIC THOMAS
5: ALEX NICHOLAS/ALEX NICHOLAS TAKASHI
6: ALEX NICHOLAS/ALEX NICHOLA (might not be exactly the same, but identified as the same by Jaro-Winkler)
2) Lastname: 0-6. Factors: # of tokens and similarity between tokens
0: Totally different: ANDERSON/DAVIDSON
1: ONE NAME MISSING: ANDERSON/(NONE)
2: First part non-match: DE AMOUR/DA AMOUR
3: VAN DE WAALS/VAN DES WAALS
4: DE AMOUR/DEAMOUR
5: JOHNSTON/JOHNSON
6: DE AMOUR/DE AMOURS
3) Midname: 0-4 (THE FOLLOWING EXAMPLES ARE FROM THE COLUMN FIRSTNAME, SO FIRSTNAME IS INCLUDED)
0: THOMAS ERIC/JOHN THOMAS
1: JOHN ERIC/JOHN (MISSING)
2: THOMAS ERIC ALEX/JACK ERIC RONALD
3: THOMAS ERIC RON ALEX EDWARD/JACK ERIC RON ALEX LEE
4: THOMAS ERIC/THOMAS ERIC LEE
4) Assignee: 0-8
0: DIFFERENT ASGNUM, TOTALLY DIFFERENT NAMES (no single common word)
1: DIFFERENT ASGNUM, One name missing
2: DIFFERENT ASGNUM, Harvard University Longwood Medical School / Dartmouth Hitchcock Medical Center
3: DIFFERENT ASGNUM, Harvard University President and Fellows / Presidents and Fellow of Harvard
4: DIFFERENT ASGNUM, Harvard University / Harvard University Medical School
5: DIFFERENT ASGNUM, Microsoft Corporation/Microsoft Corporated
6: SAME ASGNUM, COMPANY SIZE>1000
7: SAME ASGNUM, 1000>SIZE>100
8: SAME ASGNUM, SIZE<100
5) CLASS: 0-4
# OF COMMON CLASSES. MISSING=1
6) COAUTHORS: 0-10
# OF COMMON COAUTHORS
7) DISTANCE: 0-7. FACTORS: LONGITUDE/LATITUDE, STREET ADDRESS
0: TOTALLY DIFFERENT
1: ONE IS MISSING
2: 75 KM < DISTANCE < 100 KM
3: 50 KM < DISTANCE < 75 KM
4: 10 KM < DISTANCE < 50 KM
5: DISTANCE < 10 KM
6: DISTANCE < 10 KM AND STREET MATCH BUT NOT IN US, OR DISTANCE < 10 KM AND IN US BUT STREET NOT MATCH
7: STREET MATCH AND IN US
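The count-based and distance-based scores above can be sketched as follows. Several details are assumptions, since the slide does not specify them: great-circle (haversine) distance is used, boundary values (exactly 10, 50, or 75 km) fall into the lower-scoring bin, "totally different" is read as 100 km or more, level 5 is read as "within 10 km, street does not match, not in the US", and the class and coauthor counts are capped at the top of their ranges. Field and function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def class_score(left, right):
    """Number of common classes (0-4); a missing class list scores 1."""
    if not left or not right:
        return 1
    return min(len(set(left) & set(right)), 4)


def coauthor_score(left, right):
    """Number of common coauthors, capped at 10."""
    return min(len(set(left) & set(right)), 10)


def distance_score(left, right):
    """Distance score (0-7) for two records with lat, lon, street, country."""
    if None in (left.get("lat"), left.get("lon"), right.get("lat"), right.get("lon")):
        return 1  # one location is missing
    street_match = bool(left.get("street")) and left.get("street") == right.get("street")
    in_us = left.get("country") == "US" and right.get("country") == "US"
    if street_match and in_us:
        return 7
    km = haversine_km(left["lat"], left["lon"], right["lat"], right["lon"])
    if km < 10:
        return 6 if (street_match or in_us) else 5
    if km < 50:
        return 4
    if km < 75:
        return 3
    if km < 100:
        return 2
    return 0  # assumption: 100 km or more counts as totally different
```

As a sanity check on the distance helper, one degree of latitude comes out at roughly 111 km.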
Probabilistic Matching Model
• Name and patent attributes are assumed to be independent
• Unbiased training sets are created by conditioning on one set of features to create a sample of obvious matches or non-matches, then learning about the other set of features without bias
• Count frequency of each similarity profile x in match and nonmatch sets to calculate P(x|M) and P(x|N)
Training Set Criteria
Name attributes (condition on patent attributes to train name attributes)
• Match: choose all the record pairs that have at least two common coauthors within each predefined block.
• Nonmatch: choose all record pairs that have the same appyear, different assignee, no common coauthors, and no common classes within each predefined block.
Patent attributes (condition on name attributes to train patent attributes)
• Match: choose all the record pairs that share the same rare name (calculate statistics on unique full names, and choose those whose first or last name appears only once). Not necessary to check each block.
• Nonmatch: choose all record pairs that have different last names, drawn from a subset of the whole database in which the number of records per grant year is proportional to that of the original database.
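The criteria for the name-attribute training sets can be sketched as a scan over the record pairs of one block. The record layout (sets of coauthor identifiers and classes, plus `appyear` and `asgnum` fields) and the function name are assumptions made for illustration.

```python
from itertools import combinations


def name_training_pairs(block):
    """Select obvious match / nonmatch pairs using patent attributes only.

    Each record is a dict with 'invnum', 'appyear', 'asgnum', and the sets
    'coauthors' and 'classes'. Name fields are deliberately not consulted,
    so the selected pairs can train the name-attribute probabilities
    without bias.
    """
    matches, nonmatches = [], []
    for left, right in combinations(block, 2):
        common_coauthors = left["coauthors"] & right["coauthors"]
        common_classes = left["classes"] & right["classes"]
        if len(common_coauthors) >= 2:
            matches.append((left["invnum"], right["invnum"]))
        elif (
            left["appyear"] == right["appyear"]
            and left["asgnum"] != right["asgnum"]
            and not common_coauthors
            and not common_classes
        ):
            nonmatches.append((left["invnum"], right["invnum"]))
    return matches, nonmatches
```

Pairs that satisfy neither rule are simply left out of the training sets; only the clear-cut cases are used.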
Probabilistic Matching Model
• Likelihood ratio: r = P(x|M)/P(x|N)
• Probability of match given similarity profile x: P(M|x) = 1 / (1 + ((1 − P(M)) / P(M)) * (1/r)), where P(M) is empirically determined
• Smoothing: enforce monotonicity
• r is interpolated/extrapolated for unobserved profiles x
Disambiguation & Consolidation
• Generate a similarity profile for each pair of records within each block
• Look up the similarity profile in the ratio database to find the match probability
• Based on a given probability threshold, we determine if invnum_N (algorithmically generated unique inventor identifier) should be updated
• Records with the same invnum_N are consolidated
    • Improves algorithm efficiency for subsequent runs
Verification Measures
References
Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001). The NBER Patent Citation Data File: Lessons, Insights and Methodological Tools, NBER.
Torvik, V., M. Weeber, D. Swanson, and N. Smalheiser (2005). “A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation,” Journal of the American Society for Information Science and Technology, 56(2):140–158.
Torvik, V. and N. Smalheiser (2009). “Author Name Disambiguation in MEDLINE.” ACM Transactions on Knowledge Discovery from Data, Vol. 3., No. 3, Article 11.