http://www.iaeme.com/IJMET/index.asp 623 [email protected]
International Journal of Mechanical Engineering and Technology (IJMET) Volume 8, Issue 8, August 2017, pp. 623–630, Article ID: IJMET_08_08_068
Available online at http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=8
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication Scopus Indexed
CONVERSION OF UNSTRUCTURED DATA TO
STRUCTURED DATA WITH A PROFILE
HANDLING APPLICATION
Sivaramakrishnan N, Vandana V, Vishali M, Dharshana S G, Subramaniyaswamy V
and Umamakeswari A
School of Computing, SASTRA University, Thanjavur, India
ABSTRACT
Resumes play a vital role in the recruitment processes of institutions. A resume showcases the qualifications and abilities of a candidate. However, there is no international standard for the core blueprint of a resume: a plethora of formats and structures is in use, which makes structuring the data incredibly involved. Yet it is imperative that this be done for improved readability. In this digital era, keeping things simple is the motto. Structuring a resume aids its classification and makes it easier for employers to read through. The proposed method exhibits a simple solution that serves this purpose. Using Natural Language Processing (NLP) methods, a readable resume can be constructed successfully. Here, Python and its language-processing libraries are used to detect the data, extract them and structure them.
Keywords: Natural Language Processing, Resume Handling, Structured Data,
Unstructured Data, Data Mining
Cite this Article: Sivaramakrishnan N, Vandana V, Vishali M, Dharshana S G,
Subramaniyaswamy V and Umamakeswari A, Conversion of Unstructured Data to
Structured Data with a Profile Handling Application, International Journal of
Mechanical Engineering and Technology 8(8), 2017, pp. 623–630.
http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=8
1. INTRODUCTION
Reading through many resumes is difficult because they do not follow a standard format. Each resume has its own fields, placed in different positions. One candidate may refer to technical qualifications as academic achievements; another may segregate them into categories. This brings in disparities. Since there is no fixed blueprint for a resume, extracting data field by field is next to impossible, and a simple string-matching algorithm does not suffice [14, 15, 16, 17]. Therefore, building a dynamic program that can structure useful data efficiently is of paramount importance [18, 19, 20, 21].
Unstructured data refers to information that does not have a pre-defined frame and is not regularized within a defined framework [22, 23, 24, 25]. Such data consist of text, dates, numbers and facts. Converting this data into a structured format essentially draws on Data Mining and Natural Language Processing (NLP), which provide different strategies to find patterns in, or otherwise elucidate, this information. These mechanisms typically involve tagging of metadata or mining models. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form [1]. This rule of thumb is not based on primary or quantitative research, but is nonetheless accepted by some. Since unstructured data is detrimental to the system, structuring the data is of supreme importance. The proposed system uses Python and its Natural Language Toolkit (NLTK) collection of libraries, along with a few other Python libraries such as re, openpyxl and tabulate. Apache PDFBox is used to convert the incoming data into .txt format.
The proposed method builds a system that can efficiently extract the required data from any undergraduate resume and put it in a structured format. The system can pull any data requested by the user from the uploaded resume. To do this, we use Python's NLTK and a few other libraries. Structuring the resume enables easy readability, classification, and extraction of required fields such as CGPA, address, etc. The sole focus here is on undergraduate student resumes, chosen because undergraduate resumes typically contain a myriad of diverse fields.
2. LITERATURE REVIEW
This research work utilizes two relevant fields of computer science: Natural Language Processing and Data Mining. Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers analyze human speech and work with human languages rather than computer languages. Python's Natural Language Toolkit (NLTK) implements techniques from this field.
Talib et al. [3] discuss the various ways in which data is extracted and converted to knowledge, and the different areas where it can be put to use. One of these was the academic and research field, which provided the fundamental impetus for developing a work that converts an unstructured piece of data into a structured one. Liao et al. [8] surveyed widely used text mining techniques such as clustering, decision-tree organization, etc. Zhong et al. [9] elucidate the difficulties arising in text mining applications, specifically around dealing with unstructured text. Laxman and Sujatha [10] demonstrated that term-based approaches cannot handle polysemy and synonyms properly. Sumathy and Chidambaram [11] elucidated the applications and tools used to mine text. A generic framework for concept-based mining was also presented in [10]. Joby et al. [12] demonstrated an efficient and intelligent pattern-discovery technique, using BM25 and support vector machines to filter the Reuters corpora.
Wen et al. [13] deal with various classification experiments, typically using multi-word features from the data set. Prakash Nadkarni [4] covers the basics of natural language processing, the Naïve Bayes theorem, n-grams, etc., and also discusses the challenges of NLP. The n-grams described there help in building code that fetches a date or, possibly, the institutions a person has studied in. Carman et al. [5] elucidate the NER system, its components, and the data flow through the source code; their work describes the way a module is trained and notes that the result may never be perfect. Bird et al. [2], in their NLTK book, present the logic and concepts used in the implementation part of this work; the book was chosen because it is considered by programmers around the world to be the bible of Natural Language Processing. A recent survey on converting unstructured data to structured data in distributed data mining by Padmapriya et al. [6], along with the work of Sukanya et al. [7], discusses the conversion of data in the form of text files such as .doc, .pdf, fax and paper documents.
3. METHODOLOGY AND APPROACH
Conversion of unstructured data to structured data has three main states, depicted in Figure 1: a database containing unstructured data, the data in its intermediary state, and successfully structured data. The user uploads the resumes, which are automatically stored in a database.
Figure 1 Data flow diagram of the conversion of an unstructured resume into a structured form using Natural Language Processing
There are various intermediary forms; this system uses n-grams. The structured data can be tabulated using the Python library Tabulate or stored in an external database. In this system, we have used Microsoft Excel 2010 to store the data for increased readability.
This represents the entire data flow in the system. The user/applicant uploads a resume to a database, which stores all resumes in text format. This is fed into the system, which uses NLTK libraries and methods to return structured data to the database. Structuring the data involves several methods: n-grams, which split sentences into groups of n words; named entity recognition, which tags entities such as names and geographical locations; and regular expressions, which match objects with regular structure such as addresses, email IDs and phone numbers. There are ten modules in this work, and each is described in detail below.
To convert an unstructured resume into a structured form using Natural Language Processing, the following steps are taken.
Step 1: Creating a web application for file upload
First, we create a web application to upload the resume. In the proposed system, we use HTML to build the front end and Java servlets to extract the uploaded file from the web page; the incoming file is stored in a database. Building a UI makes uploading easy and practical. Servlets provide effective communication between client and server, the user and employer in this case: they receive requests and generate a response for each request. This web application can be hosted on an Apache Tomcat server from the Eclipse IDE or NetBeans.
Step 2: Conversion of file from the format .pdf to .txt
This conversion is imperative for efficient text extraction. PDF documents are generally very neatly arranged, which is the reason for the format's popularity; however, underneath the surface substantial encoding is done to keep the data intact, and because of this, reading the text programmatically is very tough. Converting a PDF file to .txt is therefore a tedious process due to all the rich text. Using Apache PDFBox, we can convert a PDF to a text file successfully.
All incoming PDF files are converted into .txt files so that the parser can easily work through the data within. We use Apache PDFBox and its classes to do this: the PDDocument class holds the PDF file, and PDFTextStripper reads the PDF line by line. We then use the Java class PrintWriter to create a new .txt file and write into it. This is not the structuring system itself; it is just the intermediary.
Step 3: Extracting various information
3.1. E-mail ID
Every email ID has a set pattern: a combination of letters, numbers and special characters, followed by the @ symbol and the domain name at the end. The domain name always has a string.string format. We can therefore implement pattern mining using a regular expression.
A regular expression match is used to identify the email address in a resume. The pattern is stored in a variable, and the search() method is used to find a match in the text. search() returns a match object, which is captured in a variable; the match object's group() method returns the matched substring. If there is no match, search() returns None.
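The regex lookup described above can be sketched as follows. The pattern and the helper name `extract_email` are illustrative; the paper does not reproduce its exact expression.

```python
import re

# Illustrative pattern: a local part, "@", and a dotted domain.
# A sketch, not the authors' exact regular expression.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_email(text):
    """Return the first email address found, or None if there is no match."""
    match = EMAIL_RE.search(text)
    return match.group() if match else None

resume = "John Doe\nContact: john.doe99@example.com\nPhone: 98765 43210"
print(extract_email(resume))  # john.doe99@example.com
```

Note that search() returns None when nothing matches, so the caller must guard before calling group().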
3.2. Date of birth
Unlike email IDs, dates of birth appear in different formats: dd/mm/yyyy, mm/dd/yyyy, "dd month yyyy", "month dd yyyy", and so on. There are many variations of the same three components. Despite these variations, the date always occurs as a triad, so we can use n-grams here.
Extraction of the date of birth is done in two ways, because there are two broad formats in which a person might mention their birthday. The first uses a highlighted feature of NLTK called ngrams. As the name suggests, ngrams divides a given text into groups of a chosen number of "grams", where the grams are the words of the content; one can use this class to split a sentence into groups of a particular size. Here the text is arranged as trigrams (n = 3 in ngrams) to fetch a date of birth of the format "date month year" or "month date year". All the months of the year are put into a list for efficiency, and we check whether the month occurs in the first or second position of a trigram. If the condition is satisfied, the trigram is mined. For the alternate formats "dd/mm/yyyy" and "dd-mm-yyyy", a regular expression match is found and extracted.
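A minimal sketch of the two-format approach, with a plain sliding window standing in for nltk.ngrams so the snippet stays self-contained. Names and patterns are illustrative, not the authors' exact code.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def trigrams(words):
    # Plain sliding window standing in for nltk.util.ngrams(words, 3).
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def extract_dob(text):
    """Try 'date month year' / 'month date year' trigrams first,
    then fall back to dd/mm/yyyy or dd-mm-yyyy.
    Note: the bare word 'may' can cause false positives."""
    words = text.split()
    for tri in trigrams(words):
        if tri[0].lower() in MONTHS or tri[1].lower() in MONTHS:
            return " ".join(tri)
    m = re.search(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b", text)
    return m.group() if m else None

print(extract_dob("Date of Birth: 14 August 1996"))  # 14 August 1996
print(extract_dob("DOB: 14/08/1996"))                # 14/08/1996
```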
3.3. Phone Number
Different regions of the world follow different formats for phone numbers: they vary from 6 to 14 digits, and some include + or - characters. Hence we construct a regular expression that satisfies all these conditions.
A concept similar to that of the previous two functionalities is used to extract the phone number of a person as well. The defined regular expression matches all Indian and USA phone numbers; it even fetches numbers where the country code is enclosed in brackets or where the number is hyphenated.
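A sketch of such an expression follows. The pattern is an illustration of the idea (optional, possibly bracketed country code, then hyphen- or space-separated digit groups), not the exact expression used in the paper.

```python
import re

# Illustrative pattern: optional +country code (possibly in brackets),
# then digit groups separated by hyphens or spaces.
PHONE_RE = re.compile(
    r"(?:\(?\+?\d{1,3}\)?[-\s]?)?\d{3,5}[-\s]?\d{3,5}(?:[-\s]?\d{2,4})?")

def extract_phone(text):
    """Return the first phone-number-like substring, or None."""
    m = PHONE_RE.search(text)
    return m.group().strip() if m else None

print(extract_phone("Phone: +91-98765-43210"))  # +91-98765-43210
print(extract_phone("(+1) 415-555-0132"))       # (+1) 415-555-0132
```

A pattern this loose will also match other digit runs (dates, pin codes), so in practice it would be applied only to lines labelled as contact details.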
3.4. Skills
Technical qualifications of candidates differ according to the profile and domain of work they seek. To make things easier for the employer, we create a list of all relevant technical qualifications and choose the suitable ones from it.
The technical skills of an undergraduate usually include the various programming languages the individual has learned. These are put into a list, and each element of the list is checked against the content fed in; if a matching line is found, the particular extract is fetched.
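The list-matching step might look like the sketch below; the skills vocabulary here is a hypothetical stand-in for the authors' full list.

```python
# Hypothetical skills vocabulary; in practice this would be much larger.
SKILLS = ["python", "java", "c++", "sql", "html", "javascript"]

def extract_skills(text):
    """Return every known skill that appears in the resume text.
    Naive substring check: e.g. 'java' also matches inside 'javascript'."""
    lowered = text.lower()
    return [skill for skill in SKILLS if skill in lowered]

resume = "Programming languages known: Python, Java and SQL."
print(extract_skills(resume))  # ['python', 'java', 'sql']
```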
3.5. House address
Similar to phone-number extraction, addresses can be found and mined using regular expressions: they are composed of letters, numbers and a few special characters. Here again, a regular expression is defined for the house address and matched against the content of the resume; if a match is found, the matched group of the pattern is returned.
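A rough illustration follows, assuming an Indian-style address that starts with a door number and ends in a 6-digit PIN code. Real addresses vary far more than any single pattern can cover, and the paper does not give its actual expression.

```python
import re

# Very rough sketch: a door/street number, a few free-form words,
# then a 6-digit PIN code. Purely illustrative.
ADDRESS_RE = re.compile(r"\d{1,4}[,/]?\s?[A-Za-z0-9 ,./-]+?\b\d{6}\b")

def extract_address(text):
    """Return the first address-like substring, or None."""
    m = ADDRESS_RE.search(text)
    return m.group() if m else None

print(extract_address("Address: 12/4, Gandhi Street, Thanjavur 613001"))
```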
3.6. Names
Here we use a major NLTK highlight, Named Entity Recognition (NER). NER tags words that are names, geographical locations, etc., so we can detect names by extracting objects carrying the PERSON tag; geographical locations can be found similarly. One drawback of NER in NLTK is that identifying Indian names is a rigmarole: the models need further training to tag all entities correctly. The first item retrieved from the converted .txt file is the name of the person whose resume has been uploaded. The entire document is tokenized into words so that identifying the part of speech of each word becomes easier. Chunking, one of NLTK's most renowned features, is then used to group the tagged tokens; on chunking, the tokens are identified as entities such as person, organization, etc. and stored as a tree. The name is then fished out of the subtrees recognized as persons.
3.7. Score and grade points
Most recruitment processes base their initial screening on the candidate's CGPA, so this is an important field to mine from the resume. This is also done using a regular expression. One can understand this better with the code given below.
A regular expression matching floating-point numbers is defined, and a value is extracted whenever a match is found. The "with-as" construct opens the file at the specified source path as a single object, making it easy to parse through the file's contents. Every line is checked against the regular expression inside a for loop; the enumerate() method is a convenient way to step through every line of the content. All the CGPAs and marks (in percentages) are stored in a list as they are extracted.
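A sketch of the "with-as" plus enumerate() loop described above; the pattern and helper name are illustrative, since the paper does not reproduce its exact code.

```python
import re

# Matches CGPA- or percentage-style numbers such as 8.75 or 92.4.
# Hypothetical pattern; the paper does not give its exact expression.
SCORE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\b")

def extract_scores(path):
    """Collect every score-like number from the file, line by line."""
    scores = []
    with open(path) as source:  # 'with ... as' closes the file automatically
        for lineno, line in enumerate(source, start=1):
            scores.extend(SCORE_RE.findall(line))  # all matches on this line
    return scores
```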
3.8. Education
Employers can manage resumes more easily when those from candidates of the same college are grouped together. This is achieved by extracting the candidate's college and school information.
The extraction of the colleges or universities the person has studied in follows the same procedure as school retrieval. The keywords used here are institute, college and university.
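A keyword-based sketch covering both the school and college retrieval described above. The keyword list follows the paper's institute/college/university hints plus "school"; the helper name is hypothetical.

```python
# Keywords from the paper, plus "school" for the school-retrieval step.
EDU_KEYWORDS = ("institute", "college", "university", "school")

def extract_education(text):
    """Return every resume line that mentions an education keyword."""
    return [line.strip() for line in text.splitlines()
            if any(word in line.lower() for word in EDU_KEYWORDS)]

resume = """B.Tech, SASTRA University, Thanjavur
Higher Secondary, National Higher Secondary School
Hobbies: chess, reading"""
print(extract_education(resume))
```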
4. RESULTS AND DISCUSSION
Figure 2 shows a screenshot of the sample output in a worksheet; one can see the clear segregation of fields. As each new resume is uploaded, the respective fields are appended.
Figure 2 Sample output on Excel sheet
Figure 3 Sample output after inundation
Figure 4 Screenshots of the User Interface
5. CONCLUSION AND FUTURE WORK
Thus the main aim of this article, to completely structure chaotic unstructured data, is achieved. The system can be applied to more advanced resumes by changing a few fields such as education and work experience, and it is not limited to segregating resumes: it can be extended to other domains too. It helps recruitment departments select students carefully and with ease, and has a wide variety of uses, such as resume classification. It is a simple-to-use, dynamic system that gives optimal results.
REFERENCES
[1] Grimes, Seth. A Brief History of Text Analytics. B Eye Network. Retrieved June 24, 2016.
[2] Bird, S., Klein, E. and Loper, E. Natural Language Processing with Python. http://www.nltk.org/book/
[3] Ramzan Talib, Muhammad Kashif Hanif, Shaeela Ayesha and Fakeeha Fatima (2016). Text Mining: Techniques, Applications and Issues.
[4] Prakash Nadkarni (2016), Clinical Research Computing.
[5] Mark Carman, Gholamreza Haffari (2017), Computer Speech & Language.
[6] Padmapriya, G. and Hemalatha, M. A Recent Survey on Unstructured Data to Structured Data in Distributed Data Mining.
[7] M. Sukanya and S. Biruntha (2012), "Techniques on Text Mining", IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT).
[8] S.-H. Liao, P.-H. Chu and P.-Y. Hsiao, "Data mining techniques and applications – a decade review from 2000 to 2011," Expert Systems with Applications, vol. 39, no. 12, pp. 11303–11311, 2012.
[9] N. Zhong, Y. Li and S.-T. Wu, "Effective pattern discovery for text mining," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 30–44, 2012.
[10] B. Laxman and D. Sujatha, "Improved method for pattern discovery in text mining," International Journal of Research in Engineering and Technology, vol. 2, no. 1, pp. 2321–2328, 2013.
[11] K. Sumathy and M. Chidambaram, "Text mining: Concepts, applications, tools and issues – an overview," International Journal of Computer Applications, vol. 80, no. 4, 2013.
[12] P. J. Joby and J. Korra, "Accessing accurate documents by mining auxiliary document information," in Advances in Computing and Communication Engineering (ICACCE), 2015 Second International Conference on. IEEE, 2015, pp. 634–638.
[13] Z. Wen, T. Yoshida and X. Tang, "A study with multi-word feature with text classification," in Proceedings of the 51st Annual Meeting of the ISSS-2007, Tokyo, Japan, vol. 51, 2007, p. 45.
[14] Subramaniyaswamy, V. and Logesh, R. (2017). Adaptive KNN based Recommender System through Mining of User Preferences. Wireless Personal Communications, 1-19.
[15] Gandhi, V. I., Subramaniyaswamy, V. and Logesh, R. (2017). Topological Review and Analysis of DC-DC Boost Converters. Journal of Engineering Science and Technology, 12(6), 1541-1567.
[16] Logesh, R. and Subramaniyaswamy, V. (2017). A Reliable Point of Interest Recommendation based on Trust Relevancy between Users. Wireless Personal Communications, 1-30.
[17] Subramaniyaswamy, V., Logesh, R., Chandrashekhar, M., Challa, A. and Vijayakumar, V. (2017). A personalised movie recommendation system based on collaborative filtering. International Journal of High Performance Computing and Networking, 10(1-2), 54-63.
[18] Ravi, L. and Vairavasundaram, S. (2016). A collaborative location based travel recommendation system through enhanced rating prediction for the group of users. Computational Intelligence and Neuroscience, 2016, 7.
[19] Subramaniyaswamy, V., Logesh, R., Vijayakumar, V. and Indragandhi, V. (2015). Automated Message Filtering System in Online Social Network. Procedia Computer Science, 50, 466-475.
[20] Vairavasundaram, S., Varadharajan, V., Vairavasundaram, I. and Ravi, L. (2015). Data mining-based tag recommendation system: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(3), 87-112.
[21] Lakshmi, R. B., Subramaniyaswamy, V. and Janani, M. (2013). A Novel Approach Based on Nearest Neighbor Search on Encrypted Databases. Computers and Software, 881.
[22] Janani, M., Subramaniyaswamy, V. and Lakshmi, R. B. (2013). Achieving Optimal Firewall Filtering Through Dynamic Rule Reordering. Computers and Software, 680.
[23] Sangeetha, S. and Subramaniyaswamy, V. (2016). Enhanced Travel Planning System for Group of Users using Hybrid Collaborative Filtering. Indian Journal of Science and Technology, 9(48).
[24] Agalya, D. and Subramaniyaswamy, V. (2016). Group-Aware Recommendation using Random Forest Classification for Sparsity Problem. Indian Journal of Science and Technology, 9(48).
[25] Saipraba, N. and Subramaniyaswamy, V. (2016). Enhancing Stability of Recommender System: An Ensemble based Information Retrieval Approach. Indian Journal of Science and Technology, 9(48).
[26] P. Arul and M. Asokan, Load Testing for jQuery Based E-Commerce Web Applications with Cloud Performance Testing Tools. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 10, October (2014), pp. 01-10.
[27] Vedavyas, J. and Venkatesulu, S., A Study of MVC – A Software Design Pattern for Web Application Development on J2EE Architecture. International Journal of Computer Engineering and Technology (IJCET), Volume 4, Issue 3, May-June (2013), pp. 182-187.
[28] Roma, V. J., Bewoor, M. S. and Patil, S. H., Automation Tool for Evaluation of the Quality of NLP Based Text Summary Generated Through Summarization and Clustering Techniques by Quantitative and Qualitative Metrics. Volume 4, Issue 3, May-June (2013), pp. 77-85.