http://www.iaeme.com/IJMET/index.asp 623 [email protected]
International Journal of Mechanical Engineering and Technology (IJMET) Volume 8, Issue 8, August 2017, pp. 623–630, Article ID: IJMET_08_08_068
Available online at http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=8
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication Scopus Indexed
CONVERSION OF UNSTRUCTURED DATA TO
STRUCTURED DATA WITH A PROFILE
HANDLING APPLICATION
Sivaramakrishnan N, Vandana V, Vishali M, Dharshana S G, Subramaniyaswamy V
and Umamakeswari A
School of Computing, SASTRA University, Thanjavur, India
ABSTRACT
Resumes play a vital role in the recruitment processes of institutions. A resume showcases the qualifications and abilities of a candidate. However, there is no international standard for the core blueprint of a resume: a plethora of formats and structures is in use, which makes structuring the data incredibly involved. Yet it is imperative that this be done for improved readability. In this digital era, keeping things simple is the motto. Structuring a resume aids its classification and makes it easier for employers to read through. The proposed method exhibits a simple solution that serves this purpose. Using Natural Language Processing (NLP) methods, a readable resume can be constructed successfully. Here, Python and its language-processing libraries are used to detect the data, extract them and structure them.
Keywords: Natural Language Processing, Resume Handling, Structured Data,
Unstructured Data, Data Mining
Cite this Article: Sivaramakrishnan N, Vandana V, Vishali M, Dharshana S G,
Subramaniyaswamy V and Umamakeswari A, Conversion of Unstructured Data to
Structured Data with a Profile Handling Application, International Journal of
Mechanical Engineering and Technology 8(8), 2017, pp. 623–630.
http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=8
1. INTRODUCTION
Reading through many resumes is difficult because they do not follow a standard format. Each resume has its own fields, placed in different positions. One candidate may refer to technical qualifications as academic achievements; another may segregate them into categories. This brings in disparities. Since there is no fixed blueprint for a resume, extracting data field by field is next to impossible, and a simple string-matching algorithm does not suffice [14, 15, 16, 17]. Therefore, building a dynamic program that can structure useful data efficiently is of paramount importance [18, 19, 20, 21].
Unstructured data refers to information that does not have a pre-defined frame and is not regularized within a defined framework [22, 23, 24, 25]. Such data consist of text, dates, numbers and facts. Converting this data into a structured format essentially draws on Data Mining and Natural Language Processing (NLP), which provide different strategies to find patterns in, or otherwise elucidate, this information. These mechanisms typically involve tagging of metadata or mining models. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form [1]. This rule of thumb is not based on primary or quantitative research, but is nonetheless accepted by some. Since unstructured data is detrimental to the system, structuring the data is of supreme importance. The proposed system uses Python and its Natural Language Toolkit (NLTK) collection of libraries, along with a few other Python libraries such as re, openpyxl and tabulate. Apache PDFBox is used to convert the incoming data into .txt format.
The proposed method builds a system that can efficiently extract the required data from any undergraduate resume and put it in a structured format. The system can pull any data requested by the user from the uploaded resume. To do this, we use Python's NLTK and a few other libraries. Structuring the resume enables easy readability, classification, and extraction of required fields such as CGPA, address, etc. The sole focus here is on undergraduate student resumes, chosen because undergraduate resumes typically contain a myriad of diverse fields.
2. LITERATURE REVIEW
This research work utilizes two relevant fields of computer science: Natural Language Processing and Data Mining. Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers analyze human speech and work with human languages rather than computer languages. Python's Natural Language Toolkit (NLTK) implements techniques from this field.
Talib et al. [3] discuss the various ways in which data is extracted and converted to knowledge, and the different areas where it can be put to use. One of these was the academic and research field, which provided the fundamental impetus for developing a work that converts an unstructured piece of data into a structured one. Liao et al. [8] surveyed widely used text mining techniques such as clustering, decision-tree organization, etc. Zhong et al. [9] elucidate the difficulties arising in text mining applications, specifically around dealing with unstructured text. Laxman and Sujatha [10] demonstrated that term-based approaches cannot handle polysemy and synonyms properly. Sumathy and Chidambaram [11] elucidated the applications and tools used to mine text. A generic framework for concept-based mining was also presented in [10]. Joby et al. [12] demonstrated an efficient and intelligent pattern-discovery technique, using BM25 and support vector machines to filter the Reuters corpora.
Wen et al. [13] deal with various classification experiments, typically using multi-word features from the data set. Prakash Nadkarni [4] covers the basics of natural language processing, the Naïve Bayes theorem, n-grams, etc., and also discusses the challenges of NLP. The n-grams described there help in building code that fetches a date or, possibly, the institutions a person has studied in. Carman et al. [5] elucidate the NER system, its components, and the data flow through the source code; their work describes the way a module is trained and notes that the result may never be perfect. Bird et al. [2], in their NLTK book, present the logic and concepts used in the implementation part of this work; the book was chosen because it is considered by programmers around the world to be the bible of Natural Language Processing. A recent survey on converting unstructured data to structured data in distributed data mining by Padmapriya et al. [6], along with the work of Sukanya et al. [7], discusses the conversion of data in the form of text files such as .doc, .pdf, fax and paper documents.
3. METHODOLOGY AND APPROACH
Conversion of unstructured data to structured data has three main states, depicted in Figure 1: a database containing unstructured data, the data in its intermediary state, and successfully structured data. The user uploads the resumes, which are automatically stored in a database.
Figure 1 Data flow diagram of the conversion of an unstructured resume into a structured form using Natural Language Processing
There are various intermediary forms; this system uses n-grams. The structured data can be tabulated using the Python library Tabulate or stored in an external database. In this system, we have used Microsoft Excel 2010 to store the data for increased readability.
This represents the entire data flow in the system. The user/applicant uploads a resume to a database, which stores all resumes in text format. This is fed into the system, which uses NLTK libraries and methods to return structured data to the database. Structuring the data involves several methods: n-grams, which split sentences into groups of n words; named entity recognition, which tags entities such as names and geographical locations; and regular expressions, which match objects with regular structure such as addresses, email IDs and phone numbers. There are ten modules in this work, and each is described in detail below.
To convert an unstructured resume into a structured form using Natural Language Processing, the following steps are taken.
Step 1: Creating a web application for file upload
First, we create a web application to upload the resume. In the proposed system, we use HTML to build the front end and Java servlets to extract the uploaded file from the web page; the incoming file is stored in a database. Building a UI makes uploading easy and practical. Servlets provide effective communication between client and server, the user and employer in this case: they receive requests and generate a response for each request. This web application can be hosted on an Apache Tomcat server from the Eclipse IDE or NetBeans.
Step 2: Conversion of file from the format .pdf to .txt
This conversion is imperative for efficient text extraction. PDF documents are generally very neatly arranged, which is the reason for the format's popularity; however, underneath the surface substantial encoding is done to keep the data intact, and because of this, reading the text programmatically is very tough. Converting a PDF file to .txt is therefore a tedious process due to all the rich text. Using Apache PDFBox, we can convert a PDF to a text file successfully.
All incoming PDF files are converted into .txt files so that the parser can easily work through the data within. We use Apache PDFBox and its classes to do this: the PDDocument class holds the PDF file, and PDFTextStripper reads the PDF line by line. We then use the Java class PrintWriter to create a new .txt file and write into it. This is not the structuring system itself; it is just the intermediary.
Step 3: Extracting various information
3.1. E-mail ID
Every email ID has a set pattern: a combination of letters, numbers and special characters, followed by the @ symbol and the domain name at the end. The domain name always has a string.string format. We can therefore implement pattern mining using a regular expression.
A regular expression match is used to identify the email address in a resume. The pattern is stored in a variable, and the search() method is used to find a match in the text. search() returns a match object, which is captured in a variable; the match object's group() method returns the matched substring. If there is no match, search() returns None.
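The regex lookup described above can be sketched as follows. The pattern and the helper name `extract_email` are illustrative; the paper does not reproduce its exact expression.

```python
import re

# Illustrative pattern: a local part, "@", and a dotted domain.
# A sketch, not the authors' exact regular expression.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_email(text):
    """Return the first email address found, or None if there is no match."""
    match = EMAIL_RE.search(text)
    return match.group() if match else None

resume = "John Doe\nContact: john.doe99@example.com\nPhone: 98765 43210"
print(extract_email(resume))  # john.doe99@example.com
```

Note that search() returns None when nothing matches, so the caller must guard before calling group().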
3.2. Date of birth
Unlike email IDs, dates of birth appear in different formats: dd/mm/yyyy, mm/dd/yyyy, "dd month yyyy", "month dd yyyy", and so on. There are many variations of the same three components. Despite these variations, the date always occurs as a triad, so we can use n-grams here.
Extraction of the date of birth is done in two ways, because there are two broad formats in which a person might mention their birthday. The first uses a highlighted feature of NLTK called ngrams. As the name suggests, ngrams divides a given text into groups of a chosen number of "grams", where the grams are the words of the content; one can use this class to split a sentence into groups of a particular size. Here the text is arranged as trigrams (n = 3 in ngrams) to fetch a date of birth of the format "date month year" or "month date year". All the months of the year are put into a list for efficiency, and we check whether the month occurs in the first or second position of a trigram. If the condition is satisfied, the trigram is mined. For the alternate formats "dd/mm/yyyy" and "dd-mm-yyyy", a regular expression match is found and extracted.
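A minimal sketch of the two-format approach, with a plain sliding window standing in for nltk.ngrams so the snippet stays self-contained. Names and patterns are illustrative, not the authors' exact code.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def trigrams(words):
    # Plain sliding window standing in for nltk.util.ngrams(words, 3).
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def extract_dob(text):
    """Try 'date month year' / 'month date year' trigrams first,
    then fall back to dd/mm/yyyy or dd-mm-yyyy.
    Note: the bare word 'may' can cause false positives."""
    words = text.split()
    for tri in trigrams(words):
        if tri[0].lower() in MONTHS or tri[1].lower() in MONTHS:
            return " ".join(tri)
    m = re.search(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b", text)
    return m.group() if m else None

print(extract_dob("Date of Birth: 14 August 1996"))  # 14 August 1996
print(extract_dob("DOB: 14/08/1996"))                # 14/08/1996
```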
3.3. Phone Number
Different regions of the world follow different formats for phone numbers: they vary from 6 to 14 digits, and some include + or - characters. Hence we construct a regular expression that satisfies all these conditions.
A concept similar to that of the previous two functionalities is used to extract the phone number of a person as well. The defined regular expression matches all Indian and USA phone numbers; it even fetches numbers where the country code is enclosed in brackets or where the number is hyphenated.
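A sketch of such an expression follows. The pattern is an illustration of the idea (optional, possibly bracketed country code, then hyphen- or space-separated digit groups), not the exact expression used in the paper.

```python
import re

# Illustrative pattern: optional +country code (possibly in brackets),
# then digit groups separated by hyphens or spaces.
PHONE_RE = re.compile(
    r"(?:\(?\+?\d{1,3}\)?[-\s]?)?\d{3,5}[-\s]?\d{3,5}(?:[-\s]?\d{2,4})?")

def extract_phone(text):
    """Return the first phone-number-like substring, or None."""
    m = PHONE_RE.search(text)
    return m.group().strip() if m else None

print(extract_phone("Phone: +91-98765-43210"))  # +91-98765-43210
print(extract_phone("(+1) 415-555-0132"))       # (+1) 415-555-0132
```

A pattern this loose will also match other digit runs (dates, pin codes), so in practice it would be applied only to lines labelled as contact details.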
3.4. Skills
Technical qualifications of candidates differ according to the profile and domain of work they seek. To make things easier for the employer, we create a list of all relevant technical qualifications and choose the suitable ones from it.
The technical skills of an undergraduate usually include the various programming languages the individual has learned. These are put into a list, and each element of the list is checked against the content fed in; if a matching line is found, the particular extract is fetched.
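The list-matching step might look like the sketch below; the skills vocabulary here is a hypothetical stand-in for the authors' full list.

```python
# Hypothetical skills vocabulary; in practice this would be much larger.
SKILLS = ["python", "java", "c++", "sql", "html", "javascript"]

def extract_skills(text):
    """Return every known skill that appears in the resume text.
    Naive substring check: e.g. 'java' also matches inside 'javascript'."""
    lowered = text.lower()
    return [skill for skill in SKILLS if skill in lowered]

resume = "Programming languages known: Python, Java and SQL."
print(extract_skills(resume))  # ['python', 'java', 'sql']
```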
3.5. House address
Similar to phone-number extraction, addresses can be found and mined using regular expressions: they are composed of letters, numbers and a few special characters. Here again, a regular expression is defined for the house address and matched against the content of the resume; if a match is found, the matched group of the pattern is returned.
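A rough illustration follows, assuming an Indian-style address that starts with a door number and ends in a 6-digit PIN code. Real addresses vary far more than any single pattern can cover, and the paper does not give its actual expression.

```python
import re

# Very rough sketch: a door/street number, a few free-form words,
# then a 6-digit PIN code. Purely illustrative.
ADDRESS_RE = re.compile(r"\d{1,4}[,/]?\s?[A-Za-z0-9 ,./-]+?\b\d{6}\b")

def extract_address(text):
    """Return the first address-like substring, or None."""
    m = ADDRESS_RE.search(text)
    return m.group() if m else None

print(extract_address("Address: 12/4, Gandhi Street, Thanjavur 613001"))
```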
3.6. Names
Here we use a major NLTK highlight, Named Entity Recognition (NER). NER tags words that are names, geographical locations, etc., so we can detect names by extracting objects carrying the PERSON tag; geographical locations can be found similarly. One drawback of NER in NLTK is that identifying Indian names is a rigmarole: the models need further training to tag all entities correctly. The first item retrieved from the converted .txt file is the name of the person whose resume has been uploaded. The entire document is tokenized into words so that identifying the part of speech of each word becomes easier. Chunking, one of NLTK's most renowned features, is then used to group the tagged tokens; on chunking, the tokens are identified as entities such as person, organization, etc. and stored as a tree. The name is then fished out of the subtrees recognized as persons.
3.7. Score and grade points
Most recruitment processes base their initial screening on the candidate's CGPA, so this is an important field to mine from the resume. This is also done using a regular expression. One can understand this better with the code given below.
A regular expression matching floating-point numbers is defined, and a value is extracted whenever a match is found. The "with-as" construct opens the file at the specified source path as a single object, making it easy to parse through the file's contents. Every line is checked against the regular expression inside a for loop; the enumerate() method is a convenient way to step through every line of the content. All the CGPAs and marks (in percentages) are stored in a list as they are extracted.
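A sketch of the "with-as" plus enumerate() loop described above; the pattern and helper name are illustrative, since the paper does not reproduce its exact code.

```python
import re

# Matches CGPA- or percentage-style numbers such as 8.75 or 92.4.
# Hypothetical pattern; the paper does not give its exact expression.
SCORE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\b")

def extract_scores(path):
    """Collect every score-like number from the file, line by line."""
    scores = []
    with open(path) as source:  # 'with ... as' closes the file automatically
        for lineno, line in enumerate(source, start=1):
            scores.extend(SCORE_RE.findall(line))  # all matches on this line
    return scores
```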
3.8. Education
Employers can manage resumes more easily when those from candidates of the same college are grouped together. This is achieved by extracting the candidate's college and school information.
The extraction of the colleges or universities the person has studied in follows the same procedure as school retrieval. The keywords used here are institute, college and university.
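A keyword-based sketch covering both the school and college retrieval described above. The keyword list follows the paper's institute/college/university hints plus "school"; the helper name is hypothetical.

```python
# Keywords from the paper, plus "school" for the school-retrieval step.
EDU_KEYWORDS = ("institute", "college", "university", "school")

def extract_education(text):
    """Return every resume line that mentions an education keyword."""
    return [line.strip() for line in text.splitlines()
            if any(word in line.lower() for word in EDU_KEYWORDS)]

resume = """B.Tech, SASTRA University, Thanjavur
Higher Secondary, National Higher Secondary School
Hobbies: chess, reading"""
print(extract_education(resume))
```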
4. RESULTS AND DISCUSSION
Figure 2 shows a screenshot of the sample output in a worksheet; one can see the clear segregation of fields. As each new resume is uploaded, the respective fields are appended.
Figure 2 Sample output on Excel sheet
Figure 3 Sample output after inundation
Figure 4 Screenshots of the User Interface
5. CONCLUSION AND FUTURE WORK
Thus the main aim of this article, to completely structure chaotic unstructured data, is achieved. The system can be applied to more advanced resumes by changing a few fields such as education and work experience, and it is not limited to segregating resumes: it can be extended to other domains too. It helps recruitment departments select students carefully and with ease, and has a wide variety of uses, such as resume classification. It is a simple-to-use, dynamic system that gives optimal results.
REFERENCES
[1] Grimes, Seth. A Brief History of Text Analytics. B Eye Network. Retrieved June 24, 2016.
[2] Bird, S., Klein, E. and Loper, E. Natural Language Processing with Python. http://www.nltk.org/book/
[3] Ramzan Talib, Muhammad Kashif Hanif, Shaeela Ayesha and Fakeeha Fatima (2016). Text Mining: Techniques, Applications and Issues.
[4] Prakash Nadkarni (2016), Clinical Research Computing.
[5] Mark Carman, Gholamreza Haffari (2017), Computer Speech & Language.
[6] Padmapriya, G. and Hemalatha, M. A Recent Survey on Unstructured Data to Structured Data in Distributed Data Mining.
[7] M. Sukanya and S. Biruntha (2012), "Techniques on Text Mining", IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT).
[8] S.-H. Liao, P.-H. Chu and P.-Y. Hsiao, "Data mining techniques and applications – a decade review from 2000 to 2011," Expert Systems with Applications, vol. 39, no. 12, pp. 11303–11311, 2012.
[9] N. Zhong, Y. Li and S.-T. Wu, "Effective pattern discovery for text mining," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 30–44, 2012.
[10] B. Laxman and D. Sujatha, "Improved method for pattern discovery in text mining," International Journal of Research in Engineering and Technology, vol. 2, no. 1, pp. 2321–2328, 2013.
[11] K. Sumathy and M. Chidambaram, "Text mining: Concepts, applications, tools and issues – an overview," International Journal of Computer Applications, vol. 80, no. 4, 2013.
[12] P. J. Joby and J. Korra, "Accessing accurate documents by mining auxiliary document information," in Advances in Computing and Communication Engineering (ICACCE), 2015 Second International Conference on. IEEE, 2015, pp. 634–638.
[13] Z. Wen, T. Yoshida and X. Tang, "A study with multi-word feature with text classification," in Proceedings of the 51st Annual Meeting of the ISSS-2007, Tokyo, Japan, vol. 51, 2007, p. 45.
[14] Subramaniyaswamy, V. and Logesh, R. (2017). Adaptive KNN based Recommender System through Mining of User Preferences. Wireless Personal Communications, 1-19.
[15] Gandhi, V. I., Subramaniyaswamy, V. and Logesh, R. (2017). Topological Review and Analysis of DC-DC Boost Converters. Journal of Engineering Science and Technology, 12(6), 1541-1567.
[16] Logesh, R. and Subramaniyaswamy, V. (2017). A Reliable Point of Interest Recommendation based on Trust Relevancy between Users. Wireless Personal Communications, 1-30.
[17] Subramaniyaswamy, V., Logesh, R., Chandrashekhar, M., Challa, A. and Vijayakumar, V. (2017). A personalised movie recommendation system based on collaborative filtering. International Journal of High Performance Computing and Networking, 10(1-2), 54-63.
[18] Ravi, L. and Vairavasundaram, S. (2016). A collaborative location based travel recommendation system through enhanced rating prediction for the group of users. Computational Intelligence and Neuroscience, 2016, 7.
[19] Subramaniyaswamy, V., Logesh, R., Vijayakumar, V. and Indragandhi, V. (2015). Automated Message Filtering System in Online Social Network. Procedia Computer Science, 50, 466-475.
[20] Vairavasundaram, S., Varadharajan, V., Vairavasundaram, I. and Ravi, L. (2015). Data mining-based tag recommendation system: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(3), 87-112.
[21] Lakshmi, R. B., Subramaniyaswamy, V. and Janani, M. (2013). A Novel Approach Based on Nearest Neighbor Search on Encrypted Databases. Computers and Software, 881.
[22] Janani, M., Subramaniyaswamy, V. and Lakshmi, R. B. (2013). Achieving Optimal Firewall Filtering Through Dynamic Rule Reordering. Computers and Software, 680.
[23] Sangeetha, S. and Subramaniyaswamy, V. (2016). Enhanced Travel Planning System for Group of Users using Hybrid Collaborative Filtering. Indian Journal of Science and Technology, 9(48).
[24] Agalya, D. and Subramaniyaswamy, V. (2016). Group-Aware Recommendation using Random Forest Classification for Sparsity Problem. Indian Journal of Science and Technology, 9(48).
[25] Saipraba, N. and Subramaniyaswamy, V. (2016). Enhancing Stability of Recommender System: An Ensemble based Information Retrieval Approach. Indian Journal of Science and Technology, 9(48).
[26] P. Arul and M. Asokan, Load Testing for jQuery Based E-Commerce Web Applications with Cloud Performance Testing Tools. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 10, October (2014), pp. 01-10.
[27] Vedavyas, J. and Venkatesulu, S., A Study of MVC – A Software Design Pattern for Web Application Development on J2EE Architecture. International Journal of Computer Engineering and Technology (IJCET), Volume 4, Issue 3, May-June (2013), pp. 182-187.
[28] Roma, V. J., Bewoor, M. S. and Patil, S. H., Automation Tool for Evaluation of the Quality of NLP Based Text Summary Generated Through Summarization and Clustering Techniques by Quantitative and Qualitative Metrics. Volume 4, Issue 3, May-June (2013), pp. 77-85.