Post on 29-Mar-2015
Study in Spatial Distribution Analysis of Science Research Activities Based on Toponym Resolution in Text
Jianxia Ma1, Guodong Cheng2, Shaoxiong Liu1, Hanqing Ma1, Jinhui Ma3, Na Li1
1. The Lanzhou Branch of the National Science Library, Chinese Academy of Sciences, Lanzhou 730000, China;
2. Cold and Arid Regions Environmental and Engineering Research Institute, Chinese Academy of Sciences, Lanzhou 730000, China;
3. College of Earth and Environmental Science, Lanzhou University, Lanzhou 730000, China
COLLNET 2012, Seoul, Korea
Outline
Background
Introduction to Related Study
Framework of the Analysis Tool
Spatial Analysis of Research Activity in Sporopollen in China
Conclusion
Background
Recently, many scholars and applications have begun to present analyses of scientific papers visually by combining them with GIS.
Most of these studies rely on the author addresses supplied by the authors themselves.
Few reports analyze the distribution of research areas by text-mining the papers themselves, especially papers written in Chinese.
Katy Börner, Shashikant Penumarthy, Mark Meiss, et al. Mapping the Diffusion of Information Among Major U.S. Research Institutions. Scientometrics, 2006, 68(3): 415-426.
Xuemei Wang, Mingguo Ma. Spatial information mining and visualization for Qinghai-Tibet Plateau’s literature based on GIS [A]. In: Yaolin Liu, Xinming Tang. International Symposium on Spatial Analysis, Spatial-Temporal Data Mining [C]. Wuhan, Proc. of SPIE, 2009, 1-8.
Lutz Bornmann, Ludo Waltman. The detection of “hot regions” in the geography of science: A visualization approach by using density maps. arXiv:1102.3862v2.
Lutz Bornmann, Loet Leydesdorff, Christiane Walch-Solimena, Christoph Ettl. Mapping excellence in the geography of science: An approach based on Scopus data.
Background
In earth science and in resource and environment related fields, research is closely tied to specific locations.
Reading articles one by one and annotating the research area by hand is an inefficient way to understand how research areas are distributed. Working this way, it is hard to see where the research gaps and hot spots are.
Background
Through automatic recognition and annotation of the geographical names mentioned in research papers, we can analyze the spatial distribution of research activities in a field and identify its hot areas and blank areas.
This helps decision makers and researchers adjust research strategy and optimize the allocation of research resources. It also adds a new spatial dimension to traditional information analysis.
Background
Possibility: Can we mine hidden geographical knowledge from large-scale collections of research papers to support spatial analysis of research activity?
How? How can we analyze geographical features in massive textual collections and mine the hidden knowledge efficiently?
Key: Toponym resolution in research articles, which includes two tasks, namely Geo-Parsing and Geo-Coding.
Introduction to Related Study
Geo-parsing
Geo-parsing consists of detecting and extracting the geographic names referred to in the unstructured text of an article or a Web page using Named Entity Recognition (NER) techniques.
Gazetteer-based extraction: simple and allows efficient implementations, but with a loss of precision in toponym extraction. Building a gazetteer with full coverage is a tedious job.
Natural language processing, generally based on statistical models: Hidden Markov Models (HMMs), Maximum Entropy Models (MEMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) have been discussed in many papers for the extraction of geographic names. These models require a lot of training data and are corpus dependent.
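As a rough illustration of the gazetteer-based approach, the sketch below scans a text for the longest phrase found in a small place-name list. The gazetteer contents and the function name `extract_toponyms` are hypothetical, chosen only for this example; it is not the extraction code used in the study.

```python
import re

# Hypothetical toy gazetteer; a real one would hold many thousands of entries.
GAZETTEER = {"Lanzhou", "Qinghai-Tibet Plateau", "Junggar Basin", "Yangtze River"}

def extract_toponyms(text, gazetteer=GAZETTEER, max_words=4):
    """Return (toponym, start_token_index) pairs using longest-match lookup."""
    tokens = re.findall(r"[\w-]+", text)
    found = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        # Try the longest candidate phrase first so "Junggar Basin" beats "Junggar".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                found.append((candidate, i))
                matched_len = n
                break
        # Skip past a match, otherwise move on by one token.
        i += matched_len if matched_len else 1
    return found
```

A pure lookup like this finds nothing that is missing from the list and cannot tell a place name from an identical ordinary word, which is the loss of precision mentioned above.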
Geo-coding
Geo-coding is the key step in linking textual information to maps.
A gazetteer or geographical knowledge base is the key component.
A well-designed digital gazetteer can support geo-entity identification, toponym disambiguation and geo-coding.
Well-known digital gazetteers include the ADL Gazetteer, Getty TGN and GeoNames.
Some digital map services, including Google Maps, Microsoft Bing Maps, Yahoo PlaceFinder and Baidu Maps, provide APIs for geo-coding.
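The core of geo-coding is a lookup from a resolved place name to a coordinate that a map can plot. The sketch below assumes a hypothetical in-memory gazetteer (`COORDS`, with rounded, illustrative coordinates) and returns a GeoJSON-style point, which most GIS tools can read. It does not call any of the map services named above.

```python
# Hypothetical gazetteer records: formal name -> (latitude, longitude).
# Coordinates are rounded, illustrative values only.
COORDS = {
    "Lanzhou": (36.06, 103.83),
    "Beijing": (39.90, 116.40),
}

def geocode(name, coords=COORDS):
    """Return a GeoJSON-style Point feature for a place name, or None if unknown."""
    hit = coords.get(name.strip())
    if hit is None:
        return None
    lat, lon = hit
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name.strip()},
    }
```

Returning `None` for unknown names keeps failures visible, so a geo-coding rate can be computed from the results.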
Chinese Toponym Extraction
Unlike English, Chinese text has no spaces to mark word boundaries.
Previous research focused on syntax rules and word segmentation. Statistical models have been used to identify unknown geographical names in Chinese text.
This research was mainly carried out on web pages and news; little of it addressed research papers.
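Because there are no spaces to split on, a dictionary lookup in Chinese has to work on characters. One classic segmentation idea is forward maximum matching, sketched below against a hypothetical four-entry gazetteer. It is shown only to illustrate the boundary problem, not as the method used in the study.

```python
# Hypothetical toy gazetteer of Chinese place names:
# Lanzhou, Qinghai-Tibet Plateau, Junggar Basin, Yangtze River.
GAZETTEER_ZH = {"兰州", "青藏高原", "准噶尔盆地", "长江"}

def forward_max_match(text, gazetteer=GAZETTEER_ZH):
    """Return (toponym, start_char_index) pairs found by forward maximum matching."""
    max_len = max(len(name) for name in gazetteer)
    found = []
    i = 0
    while i < len(text):
        step = 1
        # Try the longest character window first, then shrink it.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in gazetteer:
                found.append((text[i:i + n], i))
                step = n
                break
        i += step
    return found
```

For the sentence "青藏高原和准噶尔盆地的孢粉研究" (sporopollen research on the Qinghai-Tibet Plateau and the Junggar Basin), the function finds both place names. Names missing from the gazetteer are not found at all, which is why statistical models are used for unknown names.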
Framework of The Analysis Tool
Documentary Database Preparation
Geo-parsing in Text
Geo-extraction from authors’ affiliation and address fields
Geo-recognition from unstructured text
CRF++ Based Toponym Identification
Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation
GeoFocus
Geo-Coding
Spatial Analysis of Research Activity Based on Toponym Resolution from Documents with ArcGIS
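CRF++ reads training data as one token per line with whitespace-separated columns, the last column being the label, and an empty line between sentences. The sketch below converts annotated sentences into that layout. The character-level tokens and the `B-LOC`/`I-LOC`/`O` tag set are assumptions made for illustration; the slides do not state which tokens or tags were used.

```python
def to_crfpp_lines(sentence, spans):
    """Tag each character of a sentence.

    spans: list of (start, end) character offsets of toponyms, end exclusive.
    Assumes the sentence contains no whitespace, since whitespace separates columns.
    """
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B-LOC"           # first character of a toponym
        for k in range(start + 1, end):
            tags[k] = "I-LOC"           # remaining characters of the toponym
    return [f"{ch}\t{tag}" for ch, tag in zip(sentence, tags)]

def to_crfpp_text(examples):
    """Join several (sentence, spans) examples, one blank line between sentences."""
    blocks = ["\n".join(to_crfpp_lines(s, sp)) for s, sp in examples]
    return "\n\n".join(blocks) + "\n"
```

In practice more feature columns (for example part of speech or a gazetteer flag) would sit between the token and the label, and the feature template would refer to them.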
Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation
Geographical KB
Abbreviation-Alias-Formal toponym transformation
Toponym-Footprint/Coordinate mapping
Combined with toponym rules to support toponym annotation.
Combines the administrative relationships, spatial relationships and feature types of geo-entities to support disambiguation.
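A minimal sketch of how such a knowledge base could support disambiguation: aliases are first mapped to formal names, then an ambiguous name is resolved by checking whether the parent administrative unit of one candidate also appears in the same text. The record layout, identifiers and function names are hypothetical. The example relies on the fact that "Chaoyang" names both a district of Beijing and a city in Liaoning.

```python
# Hypothetical alias table: alias or abbreviation -> formal name.
ALIASES = {"Changjiang": "Yangtze River"}

# Hypothetical candidate records with administrative parent and feature type.
CANDIDATES = {
    "Chaoyang": [
        {"id": "chaoyang-beijing", "parent": "Beijing", "feature": "district"},
        {"id": "chaoyang-liaoning", "parent": "Liaoning", "feature": "city"},
    ],
}

def normalize(name):
    """Map an alias to its formal name; unknown names pass through unchanged."""
    return ALIASES.get(name, name)

def disambiguate(name, context_toponyms):
    """Pick the candidate whose parent unit is mentioned in the same text."""
    candidates = CANDIDATES.get(normalize(name), [])
    context = {normalize(t) for t in context_toponyms}
    for cand in candidates:
        if cand["parent"] in context:
            return cand
    # Without supporting context, accept only an unambiguous name.
    return candidates[0] if len(candidates) == 1 else None
```

Spatial proximity and feature type could be added as further tie-breakers in the same way.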
Geo-coding
Spatial Analysis of Research Activity in sporopollen in China
The authors’ distribution
Data: 1490 papers from CNKI (2000-2010), of which 1402 items have clear author affiliations and addresses.
The tool identified 97.08% of the authors’ affiliations and addresses.
In combination with Google Earth and Google Maps, the geo-coding rate reached 96.9%.
As the figure shows, most palynology authors come from Beijing, Jiangsu and Shanghai, followed by Gansu and Shanxi. Few authors are from Xizang and Ningxia.
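Once each author address has been geo-coded to a province, the distribution and the geo-coding rate reduce to simple counting. The sketch below assumes hypothetical records with a `province` field set to `None` when geo-coding failed; it does not reproduce the figures reported above.

```python
from collections import Counter

def province_counts(records):
    """Count geo-coded records per province, most frequent first."""
    counts = Counter(r["province"] for r in records if r.get("province"))
    return counts.most_common()

def geocoding_rate(records):
    """Share of records that were successfully assigned a province."""
    records = list(records)
    if not records:
        return 0.0
    coded = sum(1 for r in records if r.get("province"))
    return coded / len(records)
```

The per-province counts are the values a GIS layer would join to province polygons to draw the author distribution map.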
Distribution of the research area in sporopollen in China
According to manual annotation of the abstracts, 1112 papers refer to geographical names.
Distribution of research area in sporopollen
The hottest research areas for sporopollen in China are the Yangtze River estuary, the Shandong inland area, Beijing, the Qinling mountain area and the Junggar Basin.
Sampling points are sparse south of the Changjiang (Yangtze River), in the mountainous border area of Heilongjiang, Jilin and Inner Mongolia, in the northwest desert and in the southwest tropical regions.
These places deserve more attention in the future. That is, in addition to research significance, the geographic representativeness of an area and the filling of blank research areas are also worth considering.
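Hot and blank areas can be approximated by binning the geo-coded study sites into a regular latitude/longitude grid and counting points per cell. This is a simplified stand-in for the density analysis done in ArcGIS; the cell size, threshold and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def grid_counts(points, cell_deg=1.0):
    """Count (lat, lon) points per grid cell of cell_deg degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

def hot_and_blank(points, bbox, cell_deg=1.0, hot_threshold=2):
    """Return (hot cells, blank cells inside bbox).

    bbox = (min_lat, min_lon, max_lat, max_lon) of the study region.
    """
    counts = grid_counts(points, cell_deg)
    min_lat, min_lon, max_lat, max_lon = bbox
    rows = range(math.floor(min_lat / cell_deg), math.ceil(max_lat / cell_deg))
    cols = range(math.floor(min_lon / cell_deg), math.ceil(max_lon / cell_deg))
    hot = sorted(cell for cell, n in counts.items() if n >= hot_threshold)
    # A Counter returns 0 for cells that received no points.
    blank = sorted((r, c) for r in rows for c in cols if counts[(r, c)] == 0)
    return hot, blank
```

Blank cells are only candidates for attention: some may be empty because the area holds nothing to sample, which is why research significance still has to be weighed.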
Conclusion
The experiment shows that it is possible to analyze distribution of research activities based on automatic identification and annotation of the geo-entity in large-scale textual collections.
The method is useful for the science decision maker to allocate research resources.
Further research
Further research and experimentation are needed, and are already under way, to improve the geo-parsing and geo-coding rates.
We need a much larger training corpus and need to adjust the feature template to get better performance.
We also need to consider other heuristics to improve toponym resolution.
A systematic evaluation of the method should be carried out as well.
Thanks for your attention!