
Exploring emerging technologies using patent data and patent classification

Suraj Ankam Computer Science UNC Charlotte

Wenwen Dou Computer Science UNC Charlotte


Debbie Strumsky Geography UNC Charlotte

Derek Xiaoyu Wang Computer Science UNC Charlotte


Terry Rabinowitz Computer Science UNC Charlotte

Wlodek Zadrozny Computer Science UNC Charlotte


ABSTRACT

Scientific investments should have impact: scientific, technical, economic, and social. How to assess emerging technologies from this perspective is still an open question among policy makers and researchers. In this paper, we report our research effort to identify the core techniques in emerging technology assessment based on a data-driven visual analytic approach. We report preliminary results on discovering emerging technologies. We use the corpus of US patents and an “ontology” implicit in the patent examiners' classification manual. We use topic modeling and interactive visualization techniques to find emerging technology trends, and we can validate such discoveries by interacting with patent metadata and text.

Author Keywords US patent data; innovation; patent classification manual; visualization; topic models; ontology; interactive visualization.

ACM Classification Keywords Information extraction; Information visualization.

General Terms Information extraction; Information visualization

INTRODUCTION

Innovation management, the emergence of new technologies, and their societal impact are of great interest to economists, politicians, and research sponsors such as industry partners. In our previous research, we worked closely with project managers at NSF to identify their research management needs. Several advanced visual analytics approaches for helping funding agencies make program-funding decisions have been developed [4]. During these activities, the essential topic of identifying and assessing emerging technologies in scientific outcomes (especially patents) attracted our attention.

Better understanding of emerging technologies is desired by decision makers at every stage of the research cycle, including research topic identification, research selection, research management and evaluation, and research termination/transition and retrospective analysis. It is therefore crucial for decision makers to understand the trends and patterns that occurred in existing patents, and utilize those insights to envision future technology innovations.

In this paper, we present our research effort in this direction and demonstrate two preliminary results. We have developed an interactive visual analytics system that integrates automated topic modeling, natural language processing, and visualization of patent documents to facilitate the identification of emerging technologies in a massive collection of patents. Here, we report the following results: (a) the emergence of a new class of applications can be deduced from patent data using text mining and visualization; (b) we can see temporal changes in a class of patents, and the loci of innovation.

DATA, DATA ANALYTICS, AND VISUAL ANALYTICS

Since we aim to understand the long-term socio-economic impacts that result from patents, we have been processing a significant amount of U.S. patent data from the USPTO [1]. We indexed all US patents from 1977 until the first quarter of 2013, resulting in over 5 million patent documents. These documents serve as unstructured inputs to our data modeling and analysis. For this particular study we used 50,000 telecommunication patents. More specifically, we used the abstract text and patent meta-information (altogether about 1.5 GB of text).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CHI’12, May 5–10, 2012, Austin, Texas, USA. Copyright 2012 ACM 978-1-4503-1015-4/12/05...$10.00.

We have further converted the US Patent classification manual [2], which contains descriptions of all patent classes, into a JSON file. This data gives us a basis for comparing the actual invention, as represented in the abstract or claims, with the broad topic represented by the class definition.
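As an illustration only, the converted manual could be organized as below. The paper does not describe the layout of the JSON file, so the schema, the field names, the subclass identifier, the placeholder definition strings, and the helper `definition_text` are all hypothetical.

```python
import json

# Hypothetical schema: the actual layout of the converted manual is not
# specified in the paper. Definition strings are placeholders, not USPTO text.
MANUAL_JSON = """
{
  "455": {
    "title": "Telecommunications",
    "definition": "placeholder text of the class definition",
    "subclasses": {
      "1": {
        "title": "placeholder subclass title",
        "definition": "placeholder text of the subclass definition"
      }
    }
  }
}
"""


def definition_text(manual, class_id, subclass_id=None):
    """Return the class definition, optionally followed by a subclass definition."""
    entry = manual[class_id]
    parts = [entry["definition"]]
    if subclass_id is not None:
        parts.append(entry["subclasses"][subclass_id]["definition"])
    return " ".join(parts)


manual = json.loads(MANUAL_JSON)
print(definition_text(manual, "455", "1"))
```

Keying the file by class and then by subclass mirrors how patents are classified, so the text to compare against a patent can be fetched with one lookup.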

RESULTS: DISCOVERING AN EMERGING TECHNOLOGY

We applied topic modeling and visualization to see how patents change over time. We observed a significant change in the topic of “software and storage” in telecommunication patents around 2007, possibly corresponding to the introduction of the Apple iPhone. We are currently investigating whether such patterns repeat for other invention classes.

Modeling patents as a collection of topics

We apply a variant of standard topic modeling techniques. We model the set of patents with 100 topics, where each topic is a distribution over words and each patent abstract is a combination of topics.
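The following is a minimal sketch of this kind of model, assuming plain LDA fitted by collapsed Gibbs sampling. The paper only says that a variant of standard topic modeling was used, so the sampler, the hyperparameters `alpha` and `beta`, the function name `lda_gibbs`, and the toy documents are illustrative assumptions; two topics stand in for the 100 used in the study.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # theta: per-document topic mixture; phi: per-topic word distribution.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


# Toy "abstracts", invented for illustration.
docs = [
    "transistor circuit amplifier signal".split(),
    "storage software memory application".split(),
    "transistor amplifier circuit antenna".split(),
    "software application storage device".split(),
]
theta, phi, vocab = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is one abstract expressed as a combination of topics, and each row of `phi` is one topic expressed as a distribution over words, which are the two objects the text describes.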

Fig. 1 shows a collection of telecommunication patents as a collection of topics. The horizontal axis represents time. The vertical axis shows the strength of the signal for the 100 topics that were derived from the 50K telecommunication patents. The increase in the width of the bands comes both from changes in the strength of a particular topic and from the increase in the number of telecommunication patents in that period. For example, the number of class 455 patents grew from 2234 in 2005 to 7647 in 2012. Details of each topic ribbon are explained and visualized in Figure 3. The topical analysis view and the word cloud representation are coordinated to provide the user with an interactive analysis environment.
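The two sources of band width can be separated with a small calculation. This is a sketch under assumptions: each patent is taken to be a pair of grant year and topic mixture, the function names are hypothetical, and the numbers are invented rather than taken from the study.

```python
from collections import defaultdict


def topic_strength_by_year(patents, n_topics):
    """Sum topic weights per year: the absolute strength a stacked ribbon shows."""
    strength = defaultdict(lambda: [0.0] * n_topics)
    counts = defaultdict(int)
    for year, theta in patents:
        counts[year] += 1
        for k, weight in enumerate(theta):
            strength[year][k] += weight
    return dict(strength), dict(counts)


def topic_share_by_year(strength, counts):
    """Divide by the yearly patent count to remove the effect of patent volume."""
    return {year: [s / counts[year] for s in row] for year, row in strength.items()}


# Invented data: (grant year, topic mixture over two topics).
patents = [
    (2005, [0.8, 0.2]),
    (2005, [0.6, 0.4]),
    (2012, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
    (2012, [0.5, 0.5]),
]
strength, counts = topic_strength_by_year(patents, n_topics=2)
share = topic_share_by_year(strength, counts)
```

In this toy data the band for topic 0 widens between 2005 and 2012 even though its share of each patent falls, purely because more patents were granted in the later year.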

Discovering an emerging trend using visualization

Fig. 2 shows the visual difference between two topics: a stable topic (“transistor, …”) and an emerging topic (“storage, software, …”). We are interested in finding such emerging topics and linking them to specific patents. Our visualization techniques allow us to perform such explorations interactively.

Notice that the “transistor” topic grows roughly in proportion to the number of telecommunication patents, while “storage, software” (in telecommunication) emerges suddenly in the mid-2000s.
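One simple way to express this contrast numerically is to compare a topic's average share of patents before and after a chosen year. The paper identifies emerging topics visually, so this ratio, the function name `emergence_ratio`, the split year, and the share values below are assumptions made for illustration.

```python
def emergence_ratio(shares, split_year, eps=1e-9):
    """Ratio of a topic's mean share from split_year onward to its mean share before."""
    early = [s for year, s in shares.items() if year < split_year]
    late = [s for year, s in shares.items() if year >= split_year]
    if not early or not late:
        raise ValueError("need data on both sides of split_year")
    return (sum(late) / len(late)) / (sum(early) / len(early) + eps)


# Invented yearly shares (fraction of topic weight per patent), not study data.
stable_topic = {2004: 0.05, 2005: 0.05, 2008: 0.05, 2009: 0.05}
emerging_topic = {2004: 0.01, 2005: 0.01, 2008: 0.04, 2009: 0.06}

stable_ratio = emergence_ratio(stable_topic, split_year=2007)
emerging_ratio = emergence_ratio(emerging_topic, split_year=2007)
```

Because shares are already normalized by patent volume, a topic that merely tracks the growth in patents gets a ratio near 1, while a topic that gains share gets a ratio well above 1.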

This is a suggestive but preliminary result. We need to validate it for other technologies and technology classes. However, if the method is successful, it can provide economists, policy makers, and business people with a better understanding of the changing landscape of technologies, in advance of their appearance in the market.

Figure 1. Patents in class 455 (telecommunication) in 2001-2012 represented as a combination of 100 topics. The corresponding topic model, built using LDA, is presented in Figure 3.

Notice that if the same method were applied to US patent applications, the results would likely show up about a year earlier, because patent applications are published from six months to several years earlier than granted patents.

We plan to replicate this result for other innovative technologies and technology classes, including the Internet, jet engines, flexible transistors, and solar panels. In addition, we have already moved from purely word-based topics to topics based on n-grams (unigrams and bigrams are done; trigrams are in progress). In the coming months we will investigate basing our topic models for patent data on concepts (i.e., normalized phrases), not only on words. Arguably, technology descriptions are better accounted for by complex phrases than by single words.
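The move from single words to n-grams can be illustrated with a short sketch. The tokenization rule (lowercased runs of letters and digits) and the function name `ngrams` are assumptions; the paper does not describe its preprocessing.

```python
import re


def ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max, shortest first."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


print(ngrams("Flexible solar panel", n_max=2))
```

Feeding such n-grams to the topic model instead of single words lets a phrase like "solar panel" count as one vocabulary item.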

In addition, we see an opportunity to describe the evolution of a class, as in Fig. 1, in terms of how much new topics differ from typical patents in the class at the time the class was created. (The USPTO adds new classes periodically and reclassifies previously granted patents.)

Interactive visualization will remain part of the process, since topic modeling is not foolproof and the real value comes from interacting with new data.

Fig. 2. The difference between a stable topic such as “transistor” and an emerging topic “storage, software…” in patent class 455 (telecommunication).

SPOTTING THE NOVELTY WITHIN EXISTING PATENTS

In this experiment we looked at quantifying the novelty of patent claims based on how much they differ from patent class definitions. The class definitions are contained in the patent examiner manual, which we downloaded and converted into a JSON file.

The focus of our attention was on patent claims. In a preliminary experiment we took a random sample of 40 patents from several classes (but with a focus on class 455, telecommunication). We compared the words in the claims with the words in the class and subclass definitions (patents are classified by class and subclass). We found that the words and phrases in patent claims differ substantially from the words in the relevant class definitions.

For example, a patent on an astronaut’s suit, “Support frame for radiation shield garment and methods of use thereof”, is classified under Class 002 (Apparel), Subclass 2.12, and under Class 250 (Radiant Energy), Subclass 516.1.

The difference between the text of the claims and the patent class definitions is very substantial, as shown in Table 1 and Table 2.

Word or phrase    Count
Support           23
Bottom            20
Frame             18
Slideably         5
support frame     18
Shoulder          19
Elongated         23
Shaped            6
Upper             42
Configuration     6
Vertical          39
Projecting        3
comprise(ing)     17
Attach            17
Back              47
Member            61
Top               20

Table 1: Words in patent claims that are not in patent class and subclass definitions. The numbers are word counts in the claims.

Word or phrase  Count    Word or phrase    Count    Word or phrase  Count
Relatively      0        atmosphere        0        radiation       3
Rotatable       0        device            0        nuclear         0
Coaxial         0        worn/wear         9        shield          3
Coupling        4        unusual           0        absorb          0
Astronaut       0        condition         0        radiant         0
body cover      0        force             0        energy          0
Trunk           0        high temperature  0        emissions       0
Appendage       0        apparel           0        invisible       0
Tubular         0        garment           3        eliminate       0
connection      2        adorn             0        method          0
common axis     0        cover             0        apparatus       0
guard           0        process           0        generate        0
protect         0        manufacture       0        control         0
body suit       0        harmful           0        detect          0
earth           0        electromagnetic   0        emanations      0

Table 2: Only some words in class and subclass definitions appear in patent claims. The number says how many times the word or phrase from the class definition appears in the claims.
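Counts of the kind reported in Tables 1 and 2 can be produced with a few lines of code. This is a sketch under assumptions: the tokenizer, the small stopword list, and the function names are hypothetical, no stemming is applied, and the two text snippets are invented for illustration rather than quoted from the patent or the classification manual.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would need a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "or", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def compare_claims_to_definition(claims, definition):
    """Return (claim words absent from the definition, definition words counted in claims)."""
    claim_counts = Counter(tokenize(claims))
    definition_words = set(tokenize(definition))
    only_in_claims = {
        w: c for w, c in claim_counts.items() if w not in definition_words
    }
    definition_in_claims = {w: claim_counts.get(w, 0) for w in sorted(definition_words)}
    return only_in_claims, definition_in_claims


# Invented snippets, not the actual claim or definition text.
claims = (
    "A support frame comprising an upper member and a back member "
    "attached to the garment"
)
definition = "Apparel: garment worn to cover or protect the body"

only_in_claims, definition_in_claims = compare_claims_to_definition(claims, definition)
```

The first result corresponds to the layout of Table 1 (claim vocabulary missing from the definitions) and the second to Table 2 (definition vocabulary and how often it occurs in the claims).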

Fig. 4 shows the word-to-subclass distribution. In this star graph we compare patents in different subclasses with respect to the amount of overlap between the abstract and the subclass/class definition. As in the case of the manual analysis, there are substantial differences between the broad technology themes in the definitions and in the claims.

CONCLUSION

We have shown that combining visualization techniques with text mining and ontological information can provide insights into the emergence of new technologies. We used the text of patents and the patent examiner manual, converted into an online “ontology”, as our data set. The results are preliminary but highly suggestive. We expect to have much stronger results by the time of the workshop.

Expanding on previously developed topic-based text visualization [4], we are currently working to incorporate multiple sources of patent-related information, such as NSF-funded project abstracts, news related to Federal R&D spending, and new programs that NSF has launched over the years, to identify critical events that may cause changes in patterns of emerging technologies. In summary, we see an opportunity to combine text mining of different types of information about patents with visualization techniques to better understand emerging trends, as well as to quantify some of the impacts of policy decisions.

ACKNOWLEDGMENTS

REFERENCES

[1] US Patent Data, hosted by Google. http://www.google.com/googlebooks/uspto.html

[2] US Patent Classification Manual. http://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm

[3] W. Dou, L. Yu, X. Wang, Z. Ma, and W. Ribarsky. “Hierarchical Topics: Visually Exploring Large Text Collections Using Topic Hierarchies”. IEEE Transactions on Visualization and Computer Graphics (IEEE VAST 2013).

[4] W. Dou, X. Wang, D. Skau, W. Ribarsky, and M. X. Zhou. “Leadline: Interactive visual analysis of text data through event identification and exploration”. In 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), Oct. 2012.

Fig. 3. Example patent topics.

Figure 4: Visualization of the overlap between patent keywords and subclasses.
