Designing Linkage between Patents and Business Registers: the Italian Experience
Daniela Ichim, Giulio Perani, Giovanni Seri
Italian National Statistical Institute (Istat)
{ichim,perani,seri}@istat.it
EESW European Establishment Statistics Workshop 2011
Neuchatel, 12-14 September 2011
Outline
Project description
Data sets
Linkage approach
Pre-processing of the input files
Choice of the matching variables
Choice of the similarity function
Creation of the search space of link candidate pairs
Choice of the decision model and selection of unique links
Record linkage evaluation
Preliminary results
Future work
Project description
Aim: profiling the Italian patenting enterprises
Linking economic data and technological information on patenting enterprises in order to identify the key drivers of patenting propensity
Evaluating the economic impact of the patenting activity
Identifying and collecting additional information on enterprises to be surveyed as R&D performers
Investigating specific sub-populations of enterprises (e.g. biotech enterprises)
Project description
Source of data: PATSTAT - EPO Worldwide Patent Statistical Database
Target data: applicants based in Italy
Period: patent applications from 1985 to 2010
Subject classification criterion:
A) individuals
B) establishments (business enterprises, public institutions, non-profit institutions, universities)
Data sets: patents
PATSTAT (1): 299769 applications
• Application number (by year)
• International Patent Classification (IPC) code (each application can be classified under several IPC codes)
PATSTAT (2): 72034 applications
• Application number (by year)
• Applicant name
• Applicant code
• Postal/Zip Code
• Applicant country (= IT)
Additional information to be retrieved from the above database:
• Year of first/last application by applicant
• Number of patent applications filed by applicant
• Region of residence of the applicants
Data sets: enterprises
Italian business register: ASIA (Archivio Statistico Imprese Attive). It is the frame for Istat surveys, built as a logical and physical combination of data from both surveys and administrative sources (Tax Register, Register of Enterprises and Local Units, Social Security Register, Work Accident Insurance Register, Register of the Electric Power Board).
ASIA variables:
• Enterprise identification number
• Enterprise name
• Postal/Zip Code
• NACE code
• Address, municipality, province, region
• Legal form
• Fiscal code
Enterprise size variables:
• Number of employees
• Turnover
ASIA 1998-2008 (size 2008 ~ 4.5 million records)
Data sets: linkage output
• Applicant identification number
• Enterprise identification number
• Surveys
Shared variables: Name, Postal/Zip Code
Pre-processing of the input files
Standardisation:
• Accents
• Symbols & special characters
• Double spaces
• Dots (e.g. L.T.D. becomes LTD), punctuation
• Known abbreviations (about 150 ways to say “in short”)
• Most frequent words (more than 1000 and 100)
• Lower/upper letters
• Deduplication of words
• Known legal forms (reduced to 6 main categories)
• Universities/public administrations dropped
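The standardisation steps above can be sketched in a few lines of Python. This is only an illustration: the function name `standardise` and the `ABBREVIATIONS` table are assumptions, the order of the steps is a guess, and the handling of frequent words, legal forms and dropped units is left out because the slides give no detail on them. The final sort reflects the "alphabetical order" of the standardised name mentioned on the matching-variables slide.

```python
import re
import unicodedata

# Hypothetical abbreviation table; the slides mention about 150 entries
# but do not list them.
ABBREVIATIONS = {"F.LLI": "FRATELLI", "SOC.": "SOCIETA"}


def standardise(name: str) -> str:
    """Clean a raw name and return its unique words in alphabetical order."""
    # Accents: decompose characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lower/upper letters: work in upper case throughout.
    tokens = text.upper().split()
    # Known abbreviations are expanded while the dots are still present.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    text = " ".join(tokens)
    # Dots are removed without leaving a gap, so "L.T.D." becomes "LTD".
    text = text.replace(".", "")
    # Remaining symbols and punctuation become spaces.
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    # split() also collapses double spaces; set() deduplicates words.
    return " ".join(sorted(set(text.split())))
```

For instance, `standardise("Officine  Società Rossi & Rossi S.p.A.")` gives `"OFFICINE ROSSI SOCIETA SPA"`: the accent, the ampersand, the double space, the dots and the repeated word are all gone.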
Choice of the matching variables
1. Std name in upper case, words in alphabetical order
2. Postal/Zip code
3. Legal form

                     Freq
Application number   72037
Applicant code       26509
Std Applicants       23833

Legal forms   (blank)   COOP    SAS     SNC     SPA     SRL     Total
Freq          8979      63      501     756     6164    7370    23833
%             37.7%     0.3%    2.1%    3.2%    25.9%   30.9%   100%
% ASIA        >70%      ~1%     5/7%    6/8%    ~0.5%   7/13%
Search space reduction
Patent applicants: establishments (enterprises) vs. individuals
- several words in a name (OK only for enterprises, not for individuals)
Individuals: Std applicant name does not contain
- a legal form
- a name not included in the database of Italian first names (“List of Italian first names”*)
- special terms: “enterprise”, “construction”, “hotel”, “systems”, “group”, … (63 values)
Std applicant enterprises: 16132 (~65% of Std applicants)
* http://www.nomix.it/nomi-italiani-maschili-e-femminili.php
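A rough sketch of the individual/enterprise split described above, under stated assumptions. The legal forms are those of the matching-variables slide. `SPECIAL_TERMS` and `FIRST_NAMES` are short hypothetical excerpts standing in for the 63 special terms and the external list of first names. The first-name rule is read here as "the name contains at least one known first name", which is one possible reading of the slide, not a confirmed one.

```python
LEGAL_FORMS = {"COOP", "SAS", "SNC", "SPA", "SRL"}
# Hypothetical excerpts: the slides cite 63 special terms and an external
# list of Italian first names, neither of which is reproduced here.
SPECIAL_TERMS = {"ENTERPRISE", "CONSTRUCTION", "HOTEL", "SYSTEMS", "GROUP"}
FIRST_NAMES = {"MARIO", "GIOVANNI", "DANIELA", "GIULIO"}


def is_individual(std_name: str) -> bool:
    """Guess whether a standardised applicant name denotes a person."""
    words = set(std_name.split())
    if words & LEGAL_FORMS:
        return False  # a legal form marks an enterprise
    if words & SPECIAL_TERMS:
        return False  # so does a typical business term
    # Assumed reading of the slide: a person's name carries at least one
    # known first name.
    return bool(words & FIRST_NAMES)
```

With these toy lists, "MARIO ROSSI" is kept as an individual, while "MARIO ROSSI SRL" and "GROUP MARIO" are treated as enterprises.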
Search space reduction
• Blocking by year of application (reduces only the size of the patent applicants archive: ineffective)
• Blocking by Postal/Zip Code-Region (ineffective)
• Partition of ASIA 2008 (more than 10 employees, 1 employee with legal form)
• ASIA 2007-1998 (recursively removing the enterprises included in more recent ASIA archives)
• R&D survey frame (as a subset of ASIA archive)
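The recursive removal used for the older ASIA archives can be illustrated as follows. It is a minimal sketch: the function name and the representation of each archive as a set of enterprise identifiers keyed by year are assumptions.

```python
def partition_archives(archives: dict) -> dict:
    """Keep, for each year, only enterprises absent from all later years."""
    seen = set()
    result = {}
    # Walk from the most recent archive back to the oldest one.
    for year in sorted(archives, reverse=True):
        result[year] = archives[year] - seen
        seen |= archives[year]
    return result
```

Each enterprise then appears only in the most recent archive that contains it, so the older archives add only units that no longer exist in later years.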
Search space reduction
Neighbourhoods of words: the set of ASIA enterprises having at least one word in common with the patent applicant name
A huge number of small problems!
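One way to build such neighbourhoods is an inverted index from words to enterprises. The sketch below assumes standardised names held in a dictionary keyed by enterprise identifier; the function names are illustrative and the slides do not say how the search was actually implemented.

```python
from collections import defaultdict


def build_word_index(enterprises: dict) -> dict:
    """Map each word to the ids of the enterprises whose name contains it."""
    index = defaultdict(set)
    for ent_id, std_name in enterprises.items():
        for word in std_name.split():
            index[word].add(ent_id)
    return index


def neighbourhood(applicant_name: str, index: dict) -> set:
    """Enterprises sharing at least one word with the applicant name."""
    candidates = set()
    for word in applicant_name.split():
        candidates |= index.get(word, set())
    return candidates
```

Each applicant is then compared only with the enterprises in its own neighbourhood, which is what turns one large linkage into many small ones.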
Search space reduction
Neighbourhoods of words:
Hypothesis:
- at least one word of a name is registered in the same way in both registers
Problems:
- very short words (1-2 letters) generate huge neighbourhoods
- very common words generate huge neighbourhoods
- names without neighbourhood
- not applicable in a probabilistic approach
* 23338 patent applicants ~ ASIA 2008 (10+ employees)
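The first three problem cases can be spotted with a simple diagnostic over a word index (a mapping from each word to the set of enterprises containing it). The function name and the `max_size` threshold are hypothetical; the slides give no cut-off for what counts as a huge neighbourhood.

```python
def diagnose(applicant_names: list, index: dict, max_size: int = 1000) -> dict:
    """Report short words, very common words and names with no neighbourhood."""
    # Very short words: 1-2 letters.
    short = {word for word in index if len(word) <= 2}
    # Very common words: more enterprises than the (assumed) threshold.
    common = {word for word, ids in index.items() if len(ids) > max_size}
    # Names without neighbourhood: no word of the name occurs in the index.
    orphans = [
        name
        for name in applicant_names
        if not any(word in index for word in name.split())
    ]
    return {
        "short_words": short,
        "common_words": common,
        "no_neighbourhood": orphans,
    }
```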
Preliminary results
• Still under expert clerical check (~hundreds)
• No duplicated enterprise codes
• Clerical review on 190 randomly chosen links: 97.5%

Classes of      First phase       Second phase
employees       Freq     %        Freq     %
1               793      8.1      2459     19.3
(1-10)          1995     20.4     2502     19.6
[10,            6985     71.5     7809     61.2
Total           9773              12770
Preliminary results
[Figure] Patent applicants by year: lost (black) and found (red)
Preliminary results
Patenting enterprises in ASIA 2008 by economic activity (NACE 2007)
The 5 most frequent NACE divisions

NACE  NACE Description                                                          Freq   %      Cum    Cum%
28    Manufacture of machinery and equipment n.e.c.                             2203   22.0   2203   22.0
25    Manufacture of fabricated metal products, except machinery and equipment  899    9.0    3102   31.0
46    Wholesale trade, except of motor vehicles and motorcycles                 736    7.3    3838   38.3
22    Manufacture of rubber and plastic products                                607    6.1    4445   44.3
27    Manufacture of electrical equipment                                       465    4.7    4910   49.0
Future Work
Methods
• Neighbourhood based on similarity instead of equality
• Probabilistic approach (using the R&D survey frame)
Units
• Names containing only 2-letter words
• Individuals (names without legal form)
  - List of companies’ owners and partners
  - List of University Professors/Researchers
• Names without neighbourhood
Analyses
• Produce analytical evidence on specific technological areas (e.g. Biotech) using IPC codes
• Overall classification of patent applicants
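The first method listed for future work, a neighbourhood based on similarity instead of equality, might look like the sketch below. The slides do not name a similarity function, so `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in, and the 0.85 threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when two words are close enough under a generic string ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def similarity_neighbourhood(applicant_name, enterprises, threshold=0.85):
    """Ids of enterprises with at least one word similar to an applicant word."""
    words = applicant_name.split()
    return {
        ent_id
        for ent_id, std_name in enterprises.items()
        if any(similar(w, v, threshold) for w in words for v in std_name.split())
    }
```

A misspelt applicant such as "MECANICA VERDI" shares no exact word with "MECCANICA ROSSI SPA", yet the two first words are similar enough to put that enterprise in the neighbourhood. This version compares every pair of words, so it is far slower than an exact index and would need some form of pre-filtering on archives of ASIA's size.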
Thank you for your attention!