PATSTAT 7 deadly sins and how to solve them

PATSTAT 7 DEADLY SINS (and how to solve them) Gianluca Tarasconi – Crios Università Bocconi Blog: http://rawpatentdata.blogspot.com

Upload: gianluca-tarasconi

Post on 06-Jul-2015


DESCRIPTION

List of 7 major issues with PATSTAT (EPO patent database) data and possible solutions

TRANSCRIPT

Page 1: Patstat 7 deadly sin and how to solve them

PATSTAT 7 DEADLY SINS (and how to solve them)

Gianluca Tarasconi – Crios Università Bocconi
Blog: http://rawpatentdata.blogspot.com

Page 2: Patstat 7 deadly sin and how to solve them

When investigating EP / PCT citations, some of them disappear, as if ‘eaten up’.

For instance: EP1103560, equivalent to WO0006594.

From the citations table we would conclude it has only 2 NPL citations (and one of them is "See also references of WO 0006594A1").

APPLN_ID  PUBLN_AUTH + NR  PUBLN_ID  NPL_CITN_SEQ_NR  NPL_PUBLN_ID  NPL_BIBLIO
347305    EP1103560        511640    1                950236893     No further relevant documents disclosed
347305    EP1103560        511640    3                950236894     See also references of WO 0006594A1

Page 3: Patstat 7 deadly sin and how to solve them

But looking at the patent's search report:
http://worldwide.espacenet.com/publicationDetails/originalDocument?CC=EP&NR=1103560A1&KC=A1&FT=D&ND=3&date=20010530&DB=EPODOC&locale=en_EP

we can find 1 NPL and 3 patents that are in reality listed as backward citations.

Page 4: Patstat 7 deadly sin and how to solve them

As a matter of fact, looking up the corresponding WO in Espacenet we find:
http://worldwide.espacenet.com/publicationDetails/citedDocuments?CC=WO&NR=0006594A1&KC=A1&FT=D&ND=4&date=20000210&DB=EPODOC&locale=en_EP

Page 5: Patstat 7 deadly sin and how to solve them

An issue inherited from the REFI dataset: when a PCT / EP patent has an equivalent we may find, instead of the list of citations, an NPL entry along the lines of "SEE ALSO CITATIONS OF…"

(REFI is the original EPO dataset for citations)

PUBLN_AUTH  publn_id affected  publn_id tot  %
'EP'        334140             4505456       7.42%
'WO'        334135             2682654       12.46%

Page 6: Patstat 7 deadly sin and how to solve them

A possible solution is to integrate the citations of the equivalent WO/EP patent wherever an NPL record contains the string "SEE ALSO".

This means adding patent and NPL citations from the corresponding INTERNAT_APPLN_ID in TLS201. In our example we would find them all (and something more).

APPLN_ID  PUBLN_AUTH + NR  PUBLN_ID  PAT_CITN_SEQ_NR  CITED_PAT_PUBLN_ID  NPL_CITN_SEQ_NR  NPL_PUBLN_ID  NPL_BIBLIO / CITED PUBNR
30241523  WO0006594        38723126  1                45098451            0                0             JPH09124691
30241523  WO0006594        38723126  2                46832598            0                0             JPH01143897
30241523  WO0006594        38723126  3                34575918            0                0             JPS56139455
30241523  WO0006594        38723126  0                0                   1                2925184       GILBERT M. RISHTON ET AL: 'A beta-turn Mimic...Study of Cyclic Peptide RGD and RCD Cell-adhension Inhibitors' LETTERS IN PEPTIDE SCIENCE vol. 3
30241523  WO0006594        38723126  0                0                   2                952900902     See also references of EP 1103560A1
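The integration step can be sketched with a small in-memory database. This is only an illustration under assumptions: the table names (appl, pub, pat_cit, npl_cit) are simplified stand-ins rather than the real PATSTAT tables, the Rishton reference is abbreviated, and we assume the EP application's INTERNAT_APPLN_ID points at the WO application 30241523.

```python
import sqlite3

# Toy in-memory tables: simplified stand-ins, NOT the real PATSTAT schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE appl    (appln_id INTEGER, internat_appln_id INTEGER);
CREATE TABLE pub     (publn_id INTEGER, appln_id INTEGER, publn_nr TEXT);
CREATE TABLE pat_cit (publn_id INTEGER, cited_pubnr TEXT);
CREATE TABLE npl_cit (publn_id INTEGER, npl_biblio TEXT);
""")
# Assumption: the EP application links to the WO one via internat_appln_id.
con.executemany("INSERT INTO appl VALUES (?, ?)",
                [(347305, 30241523), (30241523, 0)])
con.executemany("INSERT INTO pub VALUES (?, ?, ?)",
                [(511640, 347305, "EP1103560"),
                 (38723126, 30241523, "WO0006594")])
con.executemany("INSERT INTO pat_cit VALUES (?, ?)",
                [(38723126, "JPH09124691"),
                 (38723126, "JPH01143897"),
                 (38723126, "JPS56139455")])
con.executemany("INSERT INTO npl_cit VALUES (?, ?)", [
    (511640, "No further relevant documents disclosed"),
    (511640, "See also references of WO 0006594A1"),
    (38723126, "GILBERT M. RISHTON ET AL: ... LETTERS IN PEPTIDE SCIENCE vol. 3"),
    (38723126, "See also references of EP 1103560A1"),
])

# Shared join: from a "SEE ALSO" placeholder to the equivalent publication.
EQUIVALENT = """
FROM npl_cit AS flag
JOIN pub  AS citing ON citing.publn_id = flag.publn_id
JOIN appl AS a      ON a.appln_id = citing.appln_id
JOIN pub  AS equiv  ON equiv.appln_id = a.internat_appln_id
"""

# Patent citations inherited from the equivalent.
inherited_patents = con.execute(
    "SELECT citing.publn_nr, c.cited_pubnr" + EQUIVALENT +
    "JOIN pat_cit AS c ON c.publn_id = equiv.publn_id "
    "WHERE UPPER(flag.npl_biblio) LIKE 'SEE ALSO%' "
    "ORDER BY c.cited_pubnr"
).fetchall()

# NPL citations inherited from the equivalent, skipping its own placeholder.
inherited_npl = con.execute(
    "SELECT citing.publn_nr, n.npl_biblio" + EQUIVALENT +
    "JOIN npl_cit AS n ON n.publn_id = equiv.publn_id "
    "WHERE UPPER(flag.npl_biblio) LIKE 'SEE ALSO%' "
    "AND UPPER(n.npl_biblio) NOT LIKE 'SEE ALSO%'"
).fetchall()

for row in inherited_patents + inherited_npl:
    print(row)
```

In this toy data EP1103560 inherits the three JP patents and the Rishton reference from WO0006594. The sketch only follows the EP to WO direction; a WO record pointing at its EP equivalent would need the inverse lookup.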

Page 7: Patstat 7 deadly sin and how to solve them

Some applications in the DB turn out to have no priority although they should (the greedy EPO kept them for itself…).

They are easy to identify: EP applications with a corresponding PCT application and no priority.

Looking up both APPLN_ID and INTERNAT_APPLN_ID in TLS204 we find they have no priorities (no records returned).

Surprisingly, they are in the same INPADOC family (TLS219).

APPLN_ID  APPLN_AUTH  APPLN_NR     APPLN_KIND  APPLN_FILING_DATE  INTERNAT_APPLN_ID
20701     'EP'        ' 92917913'  'A'         '1992-08-10'       11643479
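A minimal sketch of that identification step, run against toy tables in SQLite. The table and column names follow the TLS201 / TLS204 naming used here but the layout is reduced to the columns needed; the control rows (APPLN_ID 1 and 2, priority 99) and the 'WO' authority of the PCT row are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tls201_appln (appln_id INTEGER, appln_auth TEXT,
                           internat_appln_id INTEGER);
CREATE TABLE tls204_appln_prior (appln_id INTEGER, prior_appln_id INTEGER);
""")
con.executemany("INSERT INTO tls201_appln VALUES (?, ?, ?)", [
    (20701, "EP", 11643479),  # the example from the slide
    (11643479, "WO", 0),      # its PCT application (authority assumed)
    (1, "EP", 2),             # hypothetical control: priority is recorded
    (2, "WO", 0),
])
con.executemany("INSERT INTO tls204_appln_prior VALUES (?, ?)",
                [(1, 99), (2, 99)])

# EP applications with a PCT counterpart where neither side has a priority.
suspects = con.execute("""
SELECT a.appln_id, a.internat_appln_id
FROM tls201_appln AS a
WHERE a.appln_auth = 'EP'
  AND a.internat_appln_id > 0
  AND NOT EXISTS (
        SELECT 1 FROM tls204_appln_prior AS p
        WHERE p.appln_id IN (a.appln_id, a.internat_appln_id))
""").fetchall()

print(suspects)
```

Only the pair (20701, 11643479) is returned here; the control pair is filtered out because it has priority records.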

Page 8: Patstat 7 deadly sin and how to solve them

Looking in Espacenet, we see that 2 more priorities are lost:

48,945 cases (1% of PCT applications) are involved, but the effect may be bigger if counted by missing priorities…

Page 9: Patstat 7 deadly sin and how to solve them

Partial solution:

Add APPLN_ID and INTERNAT_APPLN_ID from TLS201 to a customized priorities list, as application and priority respectively.

This issue also affects the calculation of the priority year…
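The partial solution could look roughly like the following. The function name extend_priorities and the plain tuple layout are illustrative choices, not part of PATSTAT, and the control rows are hypothetical.

```python
def extend_priorities(priorities, applications):
    """Return a customized priorities set.

    priorities:   set of (appln_id, prior_appln_id) pairs, as in TLS204
    applications: iterable of (appln_id, internat_appln_id) pairs, as in TLS201
    """
    has_priority = {appln_id for appln_id, _ in priorities}
    extra = {
        (appln_id, internat_id)      # application, then its PCT as priority
        for appln_id, internat_id in applications
        if internat_id               # skip applications without a PCT link
        and appln_id not in has_priority
        and internat_id not in has_priority
    }
    return set(priorities) | extra


known = {(1, 99), (2, 99)}           # hypothetical existing priorities
apps = [(20701, 11643479),           # the example above: no priority at all
        (1, 2),                      # already covered by TLS204
        (5, 0)]                      # no PCT counterpart
custom = extend_priorities(known, apps)
print(sorted(custom))
```

Only (20701, 11643479) is added, so the PCT application acts as the priority of the EP application; pairs that already have priorities are left untouched.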

Page 10: Patstat 7 deadly sin and how to solve them

As the table alongside shows, address coverage for some application authorities is very poor, enraging many users…

(top 15 authorities by distinct person id; Oct 2012 data)

Page 11: Patstat 7 deadly sin and how to solve them

Two safe ways to recover address data:

1) For PCT applications: rescue data from regional-phase persons (about 7-8% of data).

2) Rescue data from homonyms within the same application (i.e. applicant = inventor): 3% of data (especially effective for USPTO).

One more risky method:

Rescue data from homonyms in patent priorities (10-15%, but less safe; see the example below).

DOCDB_FAMILY_ID  APPLN_ID  APPLN_ID1  person_name   PERSON_CTRY_CODE  PERSON_CTRY_CODE1
22857305         24074575  6621661    'A. AGRAWAL'  'US'              'IN'
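The homonym-based recovery, and why widening the scope is riskier, can be illustrated with a short sketch. The record layout, the function name fill_country and the rule "fill only when exactly one candidate country exists" are assumptions made for this example; the J. SMITH rows and application 777 are hypothetical.

```python
from collections import defaultdict


def fill_country(records, scope):
    """Fill empty country codes from namesakes sharing the same `scope` key.

    scope is "appln_id" (safe method) or "family_id" (riskier method).
    A value is copied only when the namesakes agree on a single country.
    """
    known = defaultdict(set)
    for r in records:
        if r["ctry"]:
            known[(r[scope], r["name"].strip().upper())].add(r["ctry"])
    filled = []
    for r in records:
        candidates = known.get((r[scope], r["name"].strip().upper()), set())
        ctry = r["ctry"]
        if not ctry and len(candidates) == 1:
            ctry = next(iter(candidates))
        filled.append({**r, "ctry": ctry})
    return filled


records = [
    # hypothetical: applicant without address, same person listed as inventor
    {"family_id": 1, "appln_id": 10, "name": "J. SMITH", "ctry": ""},
    {"family_id": 1, "appln_id": 10, "name": "J. Smith", "ctry": "US"},
    # the A. AGRAWAL case from the table: same family, two countries
    {"family_id": 22857305, "appln_id": 24074575, "name": "A. AGRAWAL", "ctry": "US"},
    {"family_id": 22857305, "appln_id": 6621661, "name": "A. AGRAWAL", "ctry": "IN"},
    # hypothetical third family member with no address
    {"family_id": 22857305, "appln_id": 777, "name": "A. AGRAWAL", "ctry": ""},
]

by_application = fill_country(records, "appln_id")
by_family = fill_country(records, "family_id")
print(by_application[0]["ctry"], repr(by_family[4]["ctry"]))
```

Within the same application the J. SMITH address is recovered. Across the family, A. AGRAWAL has both US and IN as candidates, so the sketch leaves the field empty rather than guess, which is the kind of ambiguity that makes the priority-based method less safe.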

Page 12: Patstat 7 deadly sin and how to solve them

TLS221 offers a way, through PRS code RAP1, to track ownership changes.

For example: for EP application id 15706726 we can track a double ownership change.

If we look at the PATSTAT TLS206/207 tables we find the current owner (SENTEX CHEMNITZ), but we cannot get the first owner (removed by the envious new owner…).

APPLN_ID  DATE        PRS CODE  NAME                         EXPLANATION
15706726  18/10/2006  RAP1      NEUMANN ELEKTROTECHNIK GMBH  TRANSFER OF RIGHTS OF AN EP APPLICATION
15706726  24/11/2010  RAP1      SENTEX CHEMNITZ GMBH         TRANSFER OF RIGHTS OF AN EP APPLICATION
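As a rough illustration, the RAP1 events can be ordered by date to rebuild the chain of later owners. The tuple layout and the function name ownership_changes are assumptions for this sketch, not the TLS221 schema.

```python
from datetime import date, datetime

# (appln_id, event date as dd/mm/yyyy, PRS code, new owner), as in the table
events = [
    (15706726, "24/11/2010", "RAP1", "SENTEX CHEMNITZ GMBH"),
    (15706726, "18/10/2006", "RAP1", "NEUMANN ELEKTROTECHNIK GMBH"),
]


def ownership_changes(events, appln_id):
    """Return the RAP1 transfers of one application in chronological order."""
    transfers = [
        (datetime.strptime(day, "%d/%m/%Y").date(), owner)
        for a_id, day, code, owner in events
        if a_id == appln_id and code == "RAP1"
    ]
    return sorted(transfers)


chain = ownership_changes(events, 15706726)
for when, owner in chain:
    print(when.isoformat(), owner)
```

The chain starts with the first recorded transfer, so the original owner (the one before NEUMANN ELEKTROTECHNIK GMBH) is still missing from it.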

Page 13: Patstat 7 deadly sin and how to solve them

The only solution is to look in Espacenet and find, on the bibliographic data web page, the name of the first owner (in this case Univ Dresden).

Note: the Oct 2013 PATSTAT release includes major changes to persons management, so this issue may have new solutions…

Page 14: Patstat 7 deadly sin and how to solve them

PATSTAT covers about 100 patent authorities, but with unequal coverage and publication lags.

Good coverage and short lags for EU countries; less good and less regular for national patent authorities outside the EU (except big players, i.e. US, JP…).

On the next page, an example using GB as baseline for BR and IN (applications count):

Page 15: Patstat 7 deadly sin and how to solve them

Two possible errors:

1) Different transmission timeframes (the decay of the patent count in BR starts before that of GB).

2) Partial data transmission: counts differ from the official data of the patent office (see IN on the next page).

Year  BR     GB     IN
1990  10851  30055  2209
1991  10122  29991  2002
1992  9103   30089  1958
1993  10272  29901  2032
1994  10992  29560  2529
1995  13557  29909  2554
1996  15580  30448  1679
1997  18589  31219  1383
1998  19032  32828  1026
1999  21019  35222  750
2000  20725  36996  690
2001  20626  36884  705
2002  19265  36318  757
2003  20909  35452  1049
2004  22816  33794  1113
2005  23973  31066  1691
2006  23472  30495  1973
2007  16078  30848  2215
2008  10088  28816  2541
2009  8843   27103  2507
2010  5028   25363  2988
2011  539    24010  872
2012  7      7955   28
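One way to make the "decay starts earlier" observation measurable is to look for the first year after the peak in which the count falls below half of the peak. The 50% threshold and the function name are arbitrary choices for this sketch, and only the 2000-2012 rows of the table are used.

```python
# Application counts copied from the table above (2000-2012 only).
counts = {
    "BR": {2000: 20725, 2001: 20626, 2002: 19265, 2003: 20909, 2004: 22816,
           2005: 23973, 2006: 23472, 2007: 16078, 2008: 10088, 2009: 8843,
           2010: 5028, 2011: 539, 2012: 7},
    "GB": {2000: 36996, 2001: 36884, 2002: 36318, 2003: 35452, 2004: 33794,
           2005: 31066, 2006: 30495, 2007: 30848, 2008: 28816, 2009: 27103,
           2010: 25363, 2011: 24010, 2012: 7955},
}


def first_year_below_half_peak(series):
    """First year after the peak year whose count is under 50% of the peak."""
    peak_year = max(series, key=series.get)
    for year in sorted(series):
        if year > peak_year and series[year] < series[peak_year] / 2:
            return year
    return None


for authority, series in counts.items():
    print(authority, first_year_below_half_peak(series))
```

With this rule BR drops below half of its peak in 2008, while GB does so only in 2012, which is consistent with a longer transmission lag for BR.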

Page 16: Patstat 7 deadly sin and how to solve them

Official data for India applications:

On the EPO website, at https://data.epo.org/data/data.html, coverage by authority is listed (available in absolute numbers, not in %).

Year  IN
2007  2215
2008  2541
2009  2507
2010  2988
2011  872
2012  28

Page 17: Patstat 7 deadly sin and how to solve them

Consider all records contained in TLS212 for PAT_PUBLN_ID = 3:

We note that all records are duplicated, since the citation origin is recorded twice (both 0/1: applicant and examiner).

This may lead to an overestimation of citations received.

About 750K records out of 100M (0.75%) suffer from this issue, but…

PAT_PUBLN_ID  CITN_ID  CITED_PAT_PUBLN_ID  PAT_CITN_SEQ_NR  CITN_ORIGIN
3             1        20433311            1                '0 '
3             2        20473739            2                '0 '
3             3        15421766            3                '0 '
3             4        20433311            4                '1 '
3             5        20473739            5                '1 '
3             6        15421766            6                '1 '

Page 18: Patstat 7 deadly sin and how to solve them

The distribution of the error is very concentrated:

EPO: 218,000 (about 2.5%)

WIPO: 318,000 (about 3%)

This also makes the origin of the citation unclear…

Solution: count distinct citations; move citation origin data to a separate table.
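A minimal sketch of this solution on the six records shown for PAT_PUBLN_ID = 3, using an in-memory SQLite table. The table name tls212_citation and the separate citation_origin table are naming assumptions for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE tls212_citation (
    pat_publn_id INTEGER, citn_id INTEGER, cited_pat_publn_id INTEGER,
    pat_citn_seq_nr INTEGER, citn_origin TEXT)
""")
con.executemany("INSERT INTO tls212_citation VALUES (?, ?, ?, ?, ?)", [
    (3, 1, 20433311, 1, "0 "),
    (3, 2, 20473739, 2, "0 "),
    (3, 3, 15421766, 3, "0 "),
    (3, 4, 20433311, 4, "1 "),
    (3, 5, 20473739, 5, "1 "),
    (3, 6, 15421766, 6, "1 "),
])

# Raw row count versus distinct cited publications.
raw_count, distinct_count = con.execute("""
SELECT COUNT(*), COUNT(DISTINCT cited_pat_publn_id)
FROM tls212_citation
WHERE pat_publn_id = 3
""").fetchone()

# Keep the origin information, but in its own table (one row per origin).
con.execute("""
CREATE TABLE citation_origin AS
SELECT DISTINCT pat_publn_id, cited_pat_publn_id,
       TRIM(citn_origin) AS citn_origin
FROM tls212_citation
""")
origins = [row[0] for row in con.execute("""
SELECT citn_origin FROM citation_origin
WHERE cited_pat_publn_id = 20433311
ORDER BY citn_origin
""")]

print(raw_count, distinct_count, origins)
```

Counting rows gives 6 citations, while counting distinct cited publications gives 3; the origin codes survive in citation_origin without inflating the counts.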

Page 19: Patstat 7 deadly sin and how to solve them

Some applications in TLS207 have the same person_id repeated twice. For instance, APPLN_ID 2055 in the table below.

This ‘reproductive act’ affects only 0.1% of TLS207, but it probably originates from a legal event such as a change of ownership or a data correction, since all the affected records list one of those events (3.4% of such applications suffer from duplications).

Suggestion: data from TLS207 should be treated using a DISTINCT clause.

APPLN_ID  PERSON_ID  APPLT_SEQ_NR  INVT_SEQ_NR
2055      15868134   0             2
2055      15868134   0             5
2055      27024905   0             1
2055      27024905   0             4
2055      31219618   2             0
2055      31219618   1             0
2055      40555313   0             3
2055      40555313   0             6
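The DISTINCT suggestion, applied to the eight rows above in a toy SQLite table (the table name tls207_pers_appln is an assumption for this example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE tls207_pers_appln (
    appln_id INTEGER, person_id INTEGER,
    applt_seq_nr INTEGER, invt_seq_nr INTEGER)
""")
con.executemany("INSERT INTO tls207_pers_appln VALUES (?, ?, ?, ?)", [
    (2055, 15868134, 0, 2), (2055, 15868134, 0, 5),
    (2055, 27024905, 0, 1), (2055, 27024905, 0, 4),
    (2055, 31219618, 2, 0), (2055, 31219618, 1, 0),
    (2055, 40555313, 0, 3), (2055, 40555313, 0, 6),
])

# One row per (application, person), ignoring the duplicated sequence numbers.
persons = con.execute("""
SELECT DISTINCT appln_id, person_id
FROM tls207_pers_appln
ORDER BY person_id
""").fetchall()

print(len(persons), persons)
```

The eight raw rows collapse to four persons for application 2055. Note that DISTINCT only helps when the sequence-number columns are left out of the select list, since those differ between the duplicated rows.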