

Exploring emerging technologies using patent data and patent classification

Suraj Ankam
Computer Science, UNC Charlotte

Wenwen Dou
Computer Science, UNC Charlotte
[email protected]

Debbie Strumsky
Geography, UNC Charlotte

Derek Xiaoyu Wang
Computer Science, UNC Charlotte
[email protected]

Terry Rabinowitz
Computer Science, UNC Charlotte

Wlodek Zadrozny
Computer Science, UNC Charlotte
[email protected]

       

ABSTRACT

Scientific investments should have impact: scientific, technical, economic, and social. How to assess emerging technologies from this perspective is still an open question among policy makers and researchers. In this paper, we report our research effort to identify the core techniques in emerging technology assessment based on a data-driven visual analytics approach. We report preliminary results on discovering emerging technologies. We use the corpus of US patents and an “ontology” implicit in the patent examiners' classification manual. We use topic modeling and interactive visualization techniques to find emerging technology trends, and we can validate such discoveries by interacting with patent metadata and text.

Author Keywords

US patent data; innovation; patent classification manual; visualization; topic models; ontology; interactive visualization

ACM Classification Keywords

Information extraction; Information visualization.

General Terms

Information extraction; Information visualization

INTRODUCTION    

Innovation management, the emergence of new technologies, and their societal impact are of great interest to economists, politicians, and research sponsors such as industry partners. In our previous research, we worked closely with project managers at NSF to identify their research management needs. Several advanced visual analytics approaches for helping funding agencies make program-funding decisions have been developed [4]. During these activities, the essential topic of identifying and assessing emerging technologies in scientific outcomes (especially patents) attracted our attention.

Better understanding of emerging technologies is desired by decision makers at every stage of the research cycle, including research topic identification, research selection, research management and evaluation, and research termination/transition and retrospective analysis. It is therefore crucial for decision makers to understand the trends and patterns that occurred in existing patents, and utilize those insights to envision future technology innovations.

In this paper, we present our research effort in this direction and demonstrate two preliminary results. We have developed an interactive visual analytics system that integrates automated topic modeling, natural language processing, and visualization of patent documents to facilitate the identification of emerging technologies in a massive collection of patents. We report the following results: (a) the emergence of a new class of applications can be deduced from patent data using text mining and visualization; (b) we can see temporal changes in a class of patents, and the loci of innovation.

DATA, DATA ANALYTICS, AND VISUAL ANALYTICS

Since we aim to understand the long-term socioeconomic impacts that result from patents, we have been processing a significant amount of U.S. patent data from the USPTO [1]. We indexed all US patents from 1977 until 1Q 2013, resulting in over 5 million patent documents. These documents serve as unstructured inputs to our data modeling and analysis. For this particular study we used 50,000 telecommunication patents. More specifically, we used the abstract text and patent meta-information (altogether about 1.5 Gb of text).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CHI’12, May 5–10, 2012, Austin, Texas, USA. Copyright 2012 ACM 978-1-4503-1015-4/12/05...$10.00.

We have further converted the US Patent Classification Manual [2], which has descriptions of all patent classes, into a JSON file. This data gives us a basis for comparing the actual invention, as represented in the abstract or claims, with the broad topic represented by the class definition.
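As a rough illustration of such a conversion, the sketch below stores class and subclass definitions in a nested JSON structure and looks one up. The field names (`title`, `definition`, `subclasses`), the helper `definition_text`, and all definition strings are hypothetical placeholders; the actual schema of our JSON file is not described here.

```python
import json

# Hypothetical schema: class number -> title, definition, and subclasses.
# The definition strings are placeholders, not text from the actual manual.
manual = {
    "455": {
        "title": "Telecommunications",
        "definition": "placeholder text of the class definition",
        "subclasses": {
            "1": {
                "title": "placeholder subclass title",
                "definition": "placeholder text of the subclass definition",
            },
        },
    },
}

def definition_text(manual, class_id, subclass_id=None):
    """Return the class definition, plus the subclass definition if requested."""
    entry = manual[class_id]
    text = entry["definition"]
    if subclass_id is not None:
        text += " " + entry["subclasses"][subclass_id]["definition"]
    return text

# Round-trip through JSON to show that the structure serializes cleanly.
restored = json.loads(json.dumps(manual))
print(definition_text(restored, "455", "1"))
```

Keeping the class and subclass text in one lookup makes it straightforward to compare a patent against the combined class plus subclass definition.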

RESULTS: DISCOVERING AN EMERGING TECHNOLOGY

We applied topic modeling and visualization to see how patents change over time. We found a significant change in the topic of “software and storage” in telecommunication patents around 2007 (possibly corresponding to the Apple iPhone). We are currently investigating whether such patterns repeat for other invention classes.

Modeling patents as a collection of topics

We apply a variant of standard topic modeling techniques. We model the set of patents with 100 topics, where each topic is a distribution over words and each patent abstract is a combination of topics.
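To make the representation concrete, the following is a minimal sketch of standard LDA fitted by collapsed Gibbs sampling. It is not the variant used in our system. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for illustration. It returns one distribution over words per topic and one mixture of topics per document.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for standard LDA on tokenized documents.

    Returns (theta, phi): theta[d][k] is the weight of topic k in document d,
    phi[k][w] is the probability of word w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                        # n(k)
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
         for w in vocab}
        for k in range(n_topics)
    ]
    return theta, phi

# Toy "abstracts" invented for illustration, not real patent text.
docs = [
    "transistor circuit amplifier transistor signal".split(),
    "software storage memory software application".split(),
    "signal amplifier circuit transistor".split(),
    "application storage software memory".split(),
]
theta, phi = lda_gibbs(docs, n_topics=2)
for d, weights in enumerate(theta):
    print(d, [round(x, 2) for x in weights])
```

Each row of `theta` sums to 1, so a patent abstract is described by its share of each topic. In the study itself the number of topics is 100 rather than 2.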

Fig. 1 shows a set of telecommunication patents as a collection of topics. The horizontal axis represents time. The vertical axis shows the strength of the signal for the 100 topics that were derived from the 50K telecommunication patents. The increase in the width of the bands comes both from changes in the strength of a particular topic and from the increase in the number of telecommunication patents in that period. For example, the number of class 455 patents grew from 2234 in 2005 to 7647 in 2012. Details of each topic ribbon are explained and visualized in Figure 3. The topical analysis view and the word cloud representation are coordinated to provide the user with an interactive analysis environment.
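The two sources of band growth can be illustrated with a small sketch. It assumes, as a simplification not stated above, that the strength of a topic in a given year is the sum of that topic's weight over all patents granted in that year. The function name and the toy data are hypothetical.

```python
from collections import defaultdict

def topic_strength_by_year(patents, n_topics):
    """Sum per-patent topic weights within each year.

    `patents` is an iterable of (year, topic_weights) pairs, where
    topic_weights sums to 1 for each patent. The yearly totals therefore
    grow with the number of patents as well as with a topic's share.
    """
    totals = defaultdict(lambda: [0.0] * n_topics)
    for year, weights in patents:
        for k, w in enumerate(weights):
            totals[year][k] += w
    return dict(totals)

# Invented toy data: two topics, twice as many patents in the later year.
patents = [
    (2005, [0.5, 0.5]),
    (2005, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
]
strength = topic_strength_by_year(patents, n_topics=2)
print(strength[2005])  # [1.0, 1.0]
print(strength[2012])  # [2.0, 2.0]
```

In this toy case every band doubles in width between the two years even though no topic's share changed, which is why band width alone does not indicate an emerging topic.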

Discovering an emerging trend using visualization

Fig. 2 shows the visual difference between two topics: a stable topic (“transistor, …”) and an emerging topic (“storage, software, …”). We are interested in finding such emerging topics and linking them to specific patents. Our visualization techniques allow us to perform such explorations interactively.

Notice that the “transistor” topic grows roughly in proportion to the number of telecommunication patents, while “storage, software” (in telecommunication) suddenly emerges in the mid-2000s.
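One way to separate the two cases numerically is to divide a topic's yearly strength by the number of patents in that year and compare the resulting shares. The sketch below is only an assumption about how this could be done. The function names, the threshold `ratio`, and the topic strengths are invented, and only the class 455 patent counts come from the text.

```python
def topic_share(strength_by_year, patent_counts):
    """Divide yearly topic strength by yearly patent count to get a share."""
    return {year: strength_by_year[year] / patent_counts[year]
            for year in strength_by_year}

def is_emerging(shares, ratio=2.0):
    """Flag a topic whose share in the last year is at least `ratio` times
    its share in the first year. The threshold is an arbitrary assumption."""
    years = sorted(shares)
    first, last = shares[years[0]], shares[years[-1]]
    return first > 0 and last / first >= ratio

# Patent counts for class 455 are from the text; topic strengths are invented.
patent_counts = {2005: 2234, 2012: 7647}
stable = {2005: 223.4, 2012: 764.7}     # grows in step with the patent count
emerging = {2005: 22.34, 2012: 764.7}   # grows much faster than the count

print(is_emerging(topic_share(stable, patent_counts)))    # False
print(is_emerging(topic_share(emerging, patent_counts)))  # True
```

A topic that merely tracks the growth of the class keeps a roughly constant share, while an emerging topic shows a rising share.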

This is obviously a suggestive and preliminary result. We need to validate it for other technologies and technology classes. However, if the method is successful, it can give economists, policy makers, and business people a better understanding of the changing landscape of technologies, in advance of their appearance in the market.

Notice that if the same method were applied to US patent applications, the results would likely show up about a year earlier, because patent applications are published from six months to several years earlier than granted patents.

Figure 1. Patents in class 455 (telecommunication) in 2001-2012 represented as a combination of 100 topics. The corresponding topics, modeled using LDA, are presented in Figure 3.

We plan to replicate this result for other innovative technologies and technology classes, including the Internet, jet engines, flexible transistors, and solar panels. In addition, we have already moved from purely word-based topics to topics based on n-grams (1-grams and 2-grams are done; 3-grams are in progress). In the coming months we will investigate basing our topic models for patent data on concepts (i.e., normalized phrases), not only on words. Arguably, technology descriptions are better accounted for by complex phrases than by single words.
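For readers unfamiliar with n-gram features, the sketch below shows how 1-, 2-, and 3-grams can be extracted from a token list before topic modeling. The helper `ngrams` is a hypothetical name, and the tokens are taken from the title of the example patent discussed later in the paper.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "support frame for radiation shield garment".split()

# Combine unigrams, bigrams, and trigrams into one list of terms.
terms = ngrams(tokens, 1) + ngrams(tokens, 2) + ngrams(tokens, 3)

print(ngrams(tokens, 2))
# ['support frame', 'frame for', 'for radiation', 'radiation shield', 'shield garment']
```

Phrases such as "radiation shield" carry technical meaning that the single words "radiation" and "shield" do not, which is the motivation for moving beyond word-based topics.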

In addition, we see an opportunity to describe the evolution of a class, as in Fig. 1, in terms of how much new topics differ from typical patents in the class at the time the class was created. (The USPTO adds new classes periodically and reclassifies previously granted patents.)

Interactive visualization will remain part of the process, since topic modeling is not foolproof and the real value comes from interacting with new data.

 

Fig. 2. The difference between a stable topic such as “transistor” and an emerging topic “storage, software…” in patent class 455 (telecommunication).

 

SPOTTING THE NOVELTY WITHIN EXISTING PATENTS

In this experiment we looked at quantifying the novelty of patent claims based on how much they differ from patent class definitions. The class definitions are contained in the patent examiner manual, which we downloaded and converted into a JSON file.

The focus of our attention was on patent claims. In a preliminary experiment we took a random sample of 40 patents from several classes (with a focus on class 455, telecommunication). We compared the words in the claims with the words in the class and subclass definitions (patents are classified by class and subclass). We discovered that the words and phrases in patent claims differ substantially from the words in the relevant class definitions.

For example, a patent on an astronaut’s suit, “Support frame for radiation shield garment and methods of use thereof”, is classified under Class 002 (Apparel), Subclass 2.12, and under Class 250 (Radiant Energy), Subclass 516.1.

The difference between the text of the claims and the patent class definitions is very substantial, as shown in Table 1 and Table 2.

Word            Count    Word            Count
Support         23       Bottom          20
Frame           18       Slideably       5
support frame   18       Shoulder        19
Elongated       23       Shaped          6
Upper           42       Configuration   6
Vertical        39       Projecting      3
comprise(ing)   17       Attach          17
Back            47       Member          61
Top             20

Table 1: Words in patent claims that are not in the patent class and subclass definitions. The numbers are word counts in the claims.

Term               Count    Term               Count    Term               Count
Relatively         0        atmosphere         0        radiation          3
Rotatable          0        device             0        nuclear            0
Coaxial            0        worn/wear          9        shield             3
Coupling           4        unusual            0        absorb             0
Astronaut          0        condition          0        radiant            0
body cover         0        force              0        energy             0
Trunk              0        high temperature   0        emissions          0
Appendage          0        apparel            0        invisible          0
Tubular            0        garment            3        eliminate          0
connection         2        adorn              0        method             0
common axis        0        cover              0        apparatus          0
guard              0        process            0        generate           0
protect            0        manufacture        0        control            0
body suit          0        harmful            0        detect             0
earth              0        electromagnetic    0        emanations         0

Table 2: Only some words in the class and subclass definitions appear in the patent claims. Each number is how many times the word or phrase from the class definition appears in the claims.
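Counts of the kind shown in Table 1 and Table 2 can be produced with simple token matching. The sketch below is an assumed procedure, not the exact one behind the tables: it does no stemming (so it would not merge forms such as "worn/wear"), the function names are hypothetical, and the claim and definition snippets are invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_terms_in_claims(terms, claims_text):
    """Count how often each definition term (word or phrase) occurs in claims."""
    tokens = tokenize(claims_text)
    counts = {}
    for term in terms:
        parts = tokenize(term)
        n = len(parts)
        counts[term] = sum(
            tokens[i:i + n] == parts for i in range(len(tokens) - n + 1)
        )
    return counts

def claim_words_missing_from_definition(claims_text, definition_text):
    """Count claim words that never occur in the definition text."""
    definition_vocab = set(tokenize(definition_text))
    return Counter(w for w in tokenize(claims_text) if w not in definition_vocab)

# Invented snippets standing in for real claims and definitions.
claims = "A support frame comprising an upper member and a garment shield."
definition = "Apparel: a garment worn on the body; a body cover."

print(count_terms_in_claims(["garment", "body cover", "shield"], claims))
# {'garment': 1, 'body cover': 0, 'shield': 1}
print(claim_words_missing_from_definition(claims, definition).most_common(3))
```

The first function corresponds to the direction of Table 2 (definition terms looked up in claims) and the second to the direction of Table 1 (claim words absent from the definitions).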

Fig. 4 shows the word-to-subclass distribution. In this star graph we compare patents in different subclasses with respect to the amount of overlap between the abstract and the subclass/class definition. As in the case of the manual analysis, there are substantial differences between the broad technology themes in the definitions and in the claims.
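The amount of overlap can be quantified in several ways. The sketch below assumes one simple choice, the fraction of definition words that also occur in the patent text. The measure, the function names, the subclass labels, and the texts are all illustrative assumptions rather than the computation behind Fig. 4.

```python
import re

def vocab(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def definition_overlap(patent_text, definition_text):
    """Fraction of definition words that also appear in the patent text."""
    definition_words = vocab(definition_text)
    if not definition_words:
        return 0.0
    return len(definition_words & vocab(patent_text)) / len(definition_words)

# Invented placeholder texts keyed by hypothetical subclass labels.
definitions = {
    "A": "radiation shield garment",
    "B": "antenna receiver tuner circuit",
}
abstracts = {
    "A": "a frame supporting a shield garment",
    "B": "a mobile software application",
}

overlap = {s: definition_overlap(abstracts[s], definitions[s]) for s in definitions}
print(overlap)
```

A low overlap value for a subclass suggests that the patents filed there use vocabulary that the class definition did not anticipate.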

CONCLUSION

We have shown that combining visualization techniques with text mining and ontological information can provide insights into the emergence of new technologies. We used the text of patents and the patent examiner manual, converted into an online “ontology”, as our data set. The results are preliminary but highly suggestive. We expect to have much stronger results by the time of the workshop.

Expanding on previously developed topic-based text visualization [4], we are currently working to incorporate multiple sources of patent-related information, such as NSF-funded project abstracts, news related to Federal R&D spending, and new programs that NSF has launched over the years, to identify critical events that may cause changes in patterns of emerging technologies. In summary, we see an opportunity to combine text mining of different types of information about patents with visualization techniques to better understand emerging trends, as well as to quantify some of the impacts of policy decisions.

ACKNOWLEDGMENTS  

REFERENCES

[1] US Patent Data, hosted by Google. http://www.google.com/googlebooks/uspto.html

[2] US Patent Classification Manual. http://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm

[3] W. Dou, L. Yu, X. Wang, Z. Ma, and W. Ribarsky. “Hierarchical Topics: Visually Exploring Large Text Collections Using Topic Hierarchies”. IEEE Transactions on Visualization and Computer Graphics (IEEE VAST 2013).

[4] W. Dou, X. Wang, D. Skau, W. Ribarsky, and M. X. Zhou. “Leadline: Interactive visual analysis of text data through event identification and exploration”. In 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), Oct. 2012.

Fig 3. Example patent topics.

Figure 4: Visualization of the overlap between patent keywords and subclasses.