
© 2019 IJRAR August 2019, Volume 6, Issue 3, www.ijrar.org (E-ISSN 2348-1269, P-ISSN 2349-5138)

IJRAR19K4870, International Journal of Research and Analytical Reviews (IJRAR), www.ijrar.org, p. 488

A COMPARATIVE STUDY AND SENTIMENT ANALYSIS USING KONSTANZ INFORMATION MINER IN SOCIAL NETWORKS

1A. Anitha, 2Dr. S. Sivakumar

1M.Sc. & Research Scholar, 2M.C.A., M.Phil., Ph.D. & Head and Asst. Professor

1Department of Computer Science

1Thanthai Hans Roever College, Perambalur, India

Abstract: Sentiment analysis, the process that gives this work its name, is the main theme of the work. Since the early 2000s, sentiment analysis has become one of the most active research areas for researchers working on natural language processing and social network analysis. In addition, data mining, web mining and text mining are also studied extensively. Moreover, sentiment analysis has spread over many fields, from computer science to management science and from social science to economics, because of its importance to the business world and to society as a whole. In this study, Konstanz Information Miner (KNIME), a powerful data mining tool with rich features and many visualization tools, was used on Facebook data. Ten thousand Facebook data items were used. The sentiment analysis study, which is in fact a classification study, was conducted using machine learning algorithms on Facebook data. The results were interpreted by carrying out an accuracy analysis. It is anticipated that the use of KNIME, with its rich visualization tools, will become widespread in sentiment analysis studies and make this work both easier and more reliable. In the same way, the text messages and the contents of the users' datasets are collected and analyzed for positive and negative words, while users can encrypt their contents and control access for requests as well as for the messages exchanged between sender and receiver. Shared contents are defined here as public or private: public posts can be viewed by all, whereas private contents can be viewed only with the permission of the owner. In addition, more than two or three attempts to send negative words from the sender's side are detected and blocked for the sake of the sender's security.

IndexTerms - Sentiment Analysis, Opinion Mining, KNIME, Facebook, Social Media

I. INTRODUCTION

Opinions are at the center of almost all human activities and are an important influence on our behavior. Our beliefs about reality, our perceptions and our choices depend on how others see and evaluate the world. For this reason, we often refer to others' opinions when we need to make a decision. This applies not only to individuals but also to organizations. In the real world, companies and organizations always want to receive opinions and comments from consumers or the public about their products and services. Individual consumers want to know the views of current users before purchasing a product, or the opinions of others about political candidates before voting in an election. Gathering public opinion and consumer perspectives has long been a major workload for marketing, public relations and political campaign companies. With the growth of social media (for example reviews, forum discussions, blogs, microblogs, Facebook comments and posts on social networking sites) and its increased influence on decision-making, it has become inevitable for individuals and organizations to take the content of these media into account. In recent years, industrial activities involving sentiment analysis have also developed rapidly. A large number of new initiatives have emerged in this area. Many large corporations have developed their own sentiment analysis systems to measure the quality of their on-site services, thereby creating awareness in the business and social environment.

Social media plays a vital role in marketing and in creating relationships with customers. With a limited barrier to entry, small businesses are beginning to use social media as a means of marketing. Unfortunately, many small businesses struggle to use social media and have no strategy going into it. As a result, without a basic understanding of the advantages of social media and how to use it to engage customers, countless opportunities are missed. The research aims to acquire an initial understanding of how a small business recognized for using social media to grow uses it to engage customers. In today's technology-driven world, social networking sites have become an avenue where retailers can extend their marketing campaigns to a wider range of consumers. Chi (2011, 46) defines social media marketing as a "connection between brands and consumers, [while] offering a personal channel and currency for user centered networking and social interaction." The tools and approaches for communicating with customers have changed greatly with the emergence of social media; therefore, businesses must learn how to use social media in a way that is consistent with their business plan (Mangold and Faulds 2009). This is especially true for companies striving to gain a competitive advantage. This review examines current literature that focuses on a retailer's development and use of social media as an extension of their marketing strategy. This phenomenon has only developed within the last decade; thus social media research has largely focused on (1) defining what it is through the explanation of new terminology and concepts that make up its foundations, and (2) exploring the impact of a company's integration of social media on consumer behavior. This paper begins with an explanation of terminology that defines social media marketing, followed by a discussion of the four main themes found within current research studies: Virtual Brand Communities, Consumers' Attitudes and Motives, User Generated Content, and Viral Advertising. Although social media marketing is a well-researched topic, it has only been studied through experimental and theoretical research; studies never precisely describe the benefits retailers gain from this marketing tactic. In reviewing the rich, multi-disciplinary literature, it has become clear that studies focus on describing what social media marketing is and on examining which factors affect consumer behavior relative to social networking. Despite the initial progress made by researchers, development in this area of study has been limited.

II. DOMAIN INTRODUCTION

Big Data

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, prevent diseases, combat crime and so on.

Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5 × 10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds or even thousands of servers. What is considered big data varies depending on the capabilities of the organization managing the set and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. Big data is a moving target: what is considered big today will not be so in years ahead. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time. Big data size is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex and of a massive scale.

In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this 3Vs model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a new V, Veracity, is added by some organizations to describe it. While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sounder distinction between big data and Business Intelligence regarding data and their use.

Business Intelligence uses descriptive statistics with data of high information density to measure things, detect trends, etc. Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships and causal effects) from large sets of data with low information density, in order to reveal relationships and dependencies and to predict outcomes and behaviors. Big data can also be defined as a large volume of unstructured data which cannot be handled by standard database management systems such as DBMS, RDBMS or ORDBMS.

Big data can be described by the following characteristics:

Volume - The quantity of data that is generated is very important in this context. It is the size of the data which determines the value and potential of the data under consideration and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence the characteristic.

Variety - The next aspect of Big Data is its variety. The category to which the data belongs is an essential fact that data analysts need to know. This helps the people who closely analyze the data to use it effectively to their advantage, thus upholding the importance of Big Data.

Velocity - The term 'velocity' in this context refers to the speed at which data is generated and processed to meet the demands and challenges that lie ahead in the path of growth and development.

Variability - This factor can be a problem for those who analyze the data. It refers to the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.

Veracity - The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.

Complexity - Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information they are supposed to convey. This situation is therefore termed the 'complexity' of Big Data.

Big data analytics enables organizations to analyze a mix of structured, semi-structured and unstructured data in search of valuable business information and insights.

III. EXISTING SYSTEM

Long before awareness of the Internet became widespread, many of us asked friends to recommend a television, asked whom they planned to vote for in local elections, asked colleagues for reference letters for job applications, or asked which dishwasher to buy. Today, however, the development of internet technologies has given us the opportunity to discover the views and experiences of both personal acquaintances and well-known professional critics. Studies show that more and more people are starting to present their opinions to strangers on the internet. Ideas such as opinions, measurements, evaluations, attitudes and interpretations, and the concepts related to them, are the areas of study of sentiment analysis and opinion mining. The rapid growth of workspaces has helped to increase the use of forums, discussion and dating pages, blogs, microblogs and other social media tools among people. With the increasing use of the Internet, communication infrastructure has undergone radical changes. The social sharing sites that emerged with the development of information technologies have taken an important place in human life. Among the most popular social networking services on the planet are web sites and applications like Facebook, Instagram, YouTube, Google Play and Vine, as well as blog, microblog, social networking and social bookmarking services. All these services come together to form the social media structure.

In addition to real-life applications, research articles have also been published in the field of sentiment analysis. For example, Leilei and his team conducted a sentiment analysis study using Facebook data to predict election results. Different sentiment analysis studies were conducted by Bernardo, Mahesh and Venetis using Facebook data, movie reviews and blogs to estimate the box office revenue of films. Ozel conducted a survey using Limesurvey, a web-based survey tool. In that study, the effect of employees' Facebook use on the company profile was analyzed. The survey was tweeted and retweeted by 10 different Facebook accounts to reach two thousand Facebook users, and the obtained data were examined by statistical analysis. Other sentiment analysis studies in social networks have used supervised machine learning approaches; in one such study, a data set consisting of comments on various products of some food companies on Facebook was obtained manually. Oguz and his team used Facebook messages and newspaper sites to investigate the detection of influenza-like illnesses through social media. Facebook data were collected using the free Topsy real-time search engine application developed for social media. Yazan and Uskudarli performed earthquake detection through social networks, using the Streaming API developed by Facebook to get the data.

Disadvantages of the existing system

Event detection and summarization, opinion mining, sentiment analysis, and many others.

Because of the limited length of a tweet (i.e., 140 characters) and the absence of restrictions on writing style, tweets often contain grammatical errors, misspellings and informal abbreviations.

On the other hand, despite the noisy nature of tweets, the core semantic information is well preserved in tweets in the form of named entities or semantic phrases.

IV. PROPOSED SYSTEM

In this study, sentiment analysis was carried out on Facebook data. The first step of the study is to collect data. It is known that collecting the data and building the data sets require the most time and effort from researchers who work on social media. There are many different types of data collection on social networks, and the literature describes many tools and methods for collecting data. Among these, the most commonly used are custom-designed APIs, web crawling, web scraping operations and scripts. This study aims to obtain the data sets in a meaningful and regular manner based on Facebook data and to carry out sentiment analysis work. A tagged Facebook data set named Sentiment140 was used. This set was created by Stanford University Computer Science graduate students Alec Go, Richa Bhayani and Lei Huang. The data set contains about 1.6 million Facebook data items tagged as positive or negative. When these data were collected and tagged, the emoticon contained in each item was used. For example, an item with a smiley is given a positive tag because the smiley expresses happiness. Likewise, an item with a sad emoticon is tagged as negative because the emoticon expresses sadness. Ten thousand Facebook data items were used in this study. Two different data sets were created, each containing 5 thousand items. The number of items with positive and negative tags in each set was calculated.
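The labelling and splitting steps described above can be sketched as follows. This is a minimal illustration, not the procedure used to build Sentiment140: the emoticon lists, the function names and the random split are all assumptions made for the example.

```python
import random

# Hypothetical emoticon lists; the text only states that a smiley maps to
# "positive" and a sad face maps to "negative".
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(text):
    """Return 'positive', 'negative', or None when the text is ambiguous."""
    has_pos = any(e in text for e in POSITIVE_EMOTICONS)
    has_neg = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # neither or both present: no reliable label
        return None
    return "positive" if has_pos else "negative"


def split_in_two(items, seed=0):
    """Shuffle a copy of the items and split it into two equal-sized sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:half * 2]


def count_labels(labelled):
    """Count positive and negative items in a list of (text, label) pairs."""
    counts = {"positive": 0, "negative": 0}
    for _, label in labelled:
        counts[label] += 1
    return counts
```

With ten thousand labelled items, `split_in_two` would yield two sets of five thousand, and `count_labels` gives the positive and negative totals of each set.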

Advantages

Reduces noisy and irrelevant words that are not associated with the users, and maintains their privacy.

Maintains data and information with better security and reduces negative words to some extent.

Provides security measures by avoiding or blocking negative words, and improves efficiency.



V. LITERATURE SURVEY

5.1 Improving Entity Resolution with Global Constraints

Author: Jim Gemmell

In this paper we investigate another online socio-economic property that, to our knowledge, has never been exploited: sites listing entities have an incentive to avoid gratuitous duplicates. For instance, duplicate movies in IMDB would have reviews and corrections applied to one copy and not the other. If Netflix has one entry for a DVD and a duplicate for the Blu-ray version, then their customers might be looking at one and not realize the other is available. Hulu supports Facebook likes for their movies and could have the like counts diluted by duplicates.

Additional examples in other domains are easily constructed. We leverage this socio-economic property to resolve entities across different web sites by applying a global one-to-one constraint to the produced matchings. The resulting resolution has much better accuracy compared to matching without such a constraint. Our framework for one-to-one entity resolution (ER) is generic in that it can constrain existing resolution methods using weighted graph matching. The goal is not to engineer features or tune high-performance domain-specific ER, but rather to develop generic algorithms that can be combined with existing methods for improving retrieval performance. The purpose of this paper is not to investigate alternate scoring functions but rather to explore generic algorithms for constrained ER. In this section we describe the abstract ER problem, the framework for including particular scoring approaches, and several generic algorithms for constrained ER.
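To make the one-to-one constraint concrete, the sketch below applies it to a table of pairwise similarity scores. It uses a simple greedy strategy, which is only an approximation of weighted graph matching and is not claimed to be the algorithm of the surveyed paper; the function name and the score format are assumptions.

```python
def greedy_one_to_one(scores):
    """Pick a one-to-one matching from pairwise similarity scores.

    scores maps (left_id, right_id) -> similarity. Pairs are visited from the
    highest score down, and a pair is kept only if neither side is used yet.
    """
    matched_left, matched_right = set(), set()
    matching = {}
    for (left, right), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if left in matched_left or right in matched_right:
            continue  # the constraint: each entity is matched at most once
        matching[left] = right
        matched_left.add(left)
        matched_right.add(right)
    return matching


# Without the constraint, both a1 and a2 would pick b1 as their best match.
example_scores = {
    ("a1", "b1"): 0.90,
    ("a1", "b2"): 0.80,
    ("a2", "b1"): 0.85,
    ("a2", "b2"): 0.30,
}
example_matching = greedy_one_to_one(example_scores)
```

An exact maximum-weight matching (for instance via the Hungarian algorithm) would be the natural replacement for the greedy loop when optimality matters.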

5.2 Entity Resolution: Theory, Practice & Open Challenges

Author: Lise Getoor

In this paper we begin by introducing a simple abstraction for the entity resolution problem. We categorize ER based on the type of input: single-entity ER, where all mentions correspond to a single entity type; relational ER, where real-world entities are linked (like in a social network); and multi-entity ER, representing the most general problem with potentially linked mentions of different entity types (e.g., products, sellers and reviews). We survey classical techniques for ER, which assume that there exists a distance function between pairs of mentions. These techniques can be broadly classified as pair-wise ER, where the decision to match a pair of mentions is made independently of other mentions, and cluster-based ER, where equivalence classes of entities are constructed via clustering. Pair-wise ER is well suited for the problem of aligning two databases of the same set of entities (e.g., lists of restaurants from two sites). We survey common algorithms for computing similarity functions between mentions, and rule-based and probabilistic methods for pair-wise and cluster-based ER. We also discuss techniques for computing cluster representatives, a.k.a. canonical entities, from the database and machine learning communities. We conclude this section by discussing the state-of-the-art collective probabilistic inference techniques for multi-entity ER. These techniques are becoming popular because of the abundance of redundant mentions of entities on the Web that are also linked, and because techniques that only consider one entity type and ignore links perform poorly. We describe approaches based on multi-relational clustering algorithms, probabilistic generative models and probabilistic logical languages, e.g., Markov logic networks and probabilistic soft logic.
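The difference between pair-wise and cluster-based ER can be illustrated with a small sketch. It assumes a string similarity from Python's standard `difflib` module and an arbitrary threshold of 0.8; both are choices made for this example, not techniques named in the surveyed paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_matches(mentions, threshold=0.8):
    """Pair-wise ER: decide each pair independently of all other mentions."""
    return [
        (a, b)
        for a, b in combinations(mentions, 2)
        if similarity(a, b) >= threshold
    ]


def cluster_mentions(mentions, threshold=0.8):
    """Cluster-based ER: merge matched pairs into equivalence classes."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent links to the representative of m's cluster.
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in pairwise_matches(mentions, threshold):
        parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())
```

Here the clusters are the transitive closure of the pair-wise decisions, which is the simplest way to turn independent matches into equivalence classes.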

5.3 Matching Unstructured Product Offers to Structured Product Specifications

Author: Anitha Kannan

We have a large database of product specifications. Each product specification (which we shall interchangeably call 'product') consists of a set of attribute ⟨name, value⟩ pairs and is represented in the database as a structured record. Some of the attributes can be numeric while the others can be categorical. The unstructured offer descriptions (which we shall call 'offer' for short) are comprised of free text. The text has embedded in it some of the values, and possibly some attribute names, corresponding to one of the products. The text may also contain additional words. The attribute names and values in the text may not precisely match those found in the database. The text does not contain an identifier that uniquely identifies the corresponding product. Different textual descriptions may be provided for the same product. An offer may match more than one product, as only partial descriptions are provided in the offers and because the same real-world product might have multiple representations in the product database. We performed extensive experiments using the Bing Shopping catalog to understand the performance characteristics of the proposed solution. The experimental results show that the proposed approach scores high on F-measure and consistently beats baseline approaches for product categories that have reasonably rich attribute structure and good data. They also point to the desirability of hybrid solutions that additionally make use of classical text matching techniques for attribute-impoverished product categories. The methodology we employed for analyzing the experimental results might also be of interest to those building and analyzing web-scale systems.
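The problem setting above can be illustrated with a deliberately naive matcher. The sketch below scores each structured product by how many of its attribute values appear as tokens in the offer text; it is plain token overlap, not the approach of the surveyed paper, and the function name and sample catalog are hypothetical.

```python
def match_offer(offer_text, products):
    """Return ids of the products whose attribute values best match the offer.

    products maps a product id to a dict of attribute name -> value. The score
    is the number of attribute values that occur as tokens in the offer text.
    """
    tokens = set(offer_text.lower().split())
    scores = {
        pid: sum(1 for value in attrs.values() if str(value).lower() in tokens)
        for pid, attrs in products.items()
    }
    best = max(scores.values(), default=0)
    if best == 0:
        return []  # nothing in the offer matched any attribute value
    return sorted(pid for pid, score in scores.items() if score == best)


# Hypothetical catalog with structured <name, value> attribute pairs.
catalog = {
    "p1": {"brand": "canon", "model": "eos", "megapixels": 18},
    "p2": {"brand": "canon", "model": "ixus", "megapixels": 16},
}
```

A partial description such as "canon camera" matches both catalog entries, which mirrors the observation that an offer may match more than one product.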

5.4 Frameworks for entity matching: A comparison

Author: Hanna Köpcke

The functional comparison reveals a number of further research directions. All frameworks focus on offline matching, i.e., they do not yet cover online matching. The definition of the blocking key is not yet derived (semi-)automatically from training data but has to be specified manually in all considered frameworks. While attribute value matchers are well supported, the combination of context and attribute matchers is not and should be further studied. Training-based EM frameworks should provide more support for (semi-)automatic selection of suitable training data with low labeling effort. So far, training-based approaches have only helped to optimize some decisions, e.g., determining parameters for matchers (e.g., similarity thresholds) and combination functions (e.g., weights for matchers), while other decisions (e.g., selection of the similarity functions and attributes to be evaluated) still have to be made manually. The published framework evaluations used diverse methodologies, measures and test problems, making it difficult to assess the effectiveness and efficiency of each single system. While the reported evaluation results are usually very positive, the tests so far mostly dealt with small match problems, so the scalability of most approaches is unclear. Hence, scalability to large test cases needs to be better addressed in future frameworks. Some recent work regarding scalability has focused on computational aspects of string similarity computation and time-completeness trade-offs. Furthermore, we see a strong need for comparative performance evaluations of different frameworks and EM strategies. Standardized benchmarks for entity matching are needed for comparative investigations; first proposals exist but have not yet been implemented or applied. Published evaluation results should also be reproducible by other researchers, ideally by providing the prototype implementations and test data.



VI. ARCHITECTURE DIAGRAM

VII. MODULES

Data Acquisition

Preprocessing

Hybrid segmentation

Named Entity Recognition

Performance Evaluation

7.1 Modules Description

7.1.1 Data Acquisition

Facebook is an online social networking service that enables users to send and read messages, images and video posts. Registered users can read and post, whereas unregistered users can only read, and only if the data owner concerned grants permission. Users access Facebook through the website interface or the mobile app. In order to form an opinion about a user, his posts have to be examined. Therefore, all posts published by the user are first crawled using the Facebook API. In this study we tried to examine the user not only through his own posts but also through his friends' posts. However, crawling all friends' posts is a huge overload and can be misleading, since the Facebook following mechanism does not always reflect an actual interest. People sometimes follow some users for a temporary occasion and then forget to un-follow them. Sometimes they follow users just to stay informed, although they are not actually interested. There are also friends who have not posted for a long time but are still followed by the user. In this module the datasets can be uploaded as a CSV file. The file contains the following id, followers id, time stamp, user following, user followers and posts. The data of all Facebook consumers in the dataset are examined and their entities are analyzed for further processing. The acquired data include the posts made by the users, the messages sent and received, and so on.
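Loading such a CSV upload could look like the sketch below. The column names are hypothetical spellings of the fields listed above, and the sample row is invented purely to show the shape of the data.

```python
import csv
import io

# Hypothetical header names derived from the fields listed in the text.
EXPECTED_COLUMNS = [
    "following_id", "followers_id", "timestamp",
    "user_following", "user_followers", "posts",
]


def load_posts(csv_file):
    """Read rows from an open CSV file and check that all columns exist."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


# Usage with an in-memory sample instead of a real upload.
sample = io.StringIO(
    "following_id,followers_id,timestamp,user_following,user_followers,posts\n"
    "1,2,2019-08-01T10:00:00,alice,bob,Great service today :)\n"
)
rows = load_posts(sample)
```

Validating the header up front keeps later modules from failing on a malformed upload.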

7.1.2 Preprocessing

For named entities to be extracted successfully, the informal writing style of posts has to be handled. Before such real-world data entered our lives, studies in this area were conducted on formal texts such as news articles. Named entities are generally assumed to be words written in uppercase, or mixed-case phrases with uppercase letters at the beginning and the end, and almost all studies are based on this assumption. However, capitalization is not a strong indicator in informal texts such as posts, and is sometimes even misleading. As the example of capitalization shows, the approaches have to be changed. To extract named entities from posts, the effect of their informality has to be minimized as far as possible.
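A minimal preprocessing pipeline for informal posts might look like the following sketch. The stop-word list and the suffix-stripping rule are simplified placeholders chosen for illustration, not the components used in the study; lowercasing reflects the point above that capitalization is unreliable in posts.

```python
import re

# Small illustrative stop-word list; a real workflow would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in"}
SUFFIXES = ("ing", "ed", "s")  # crude suffix stripping, not a real stemmer


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

For example, `preprocess("The users are posting GREAT reviews")` reduces the sentence to the stems of its content words.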

7.1.3 Hybrid segmentation

Hybrid segmentation learns from both global and local contexts and is able to learn from pseudo feedback. HybridSeg is also designed to learn iteratively from confident segments as pseudo feedback. Posts are published for information sharing and communication, and named entities and semantic phrases are well preserved in them. The global context derived from Web pages therefore helps to identify the meaningful segments in posts. The well-preserved linguistic features in these posts facilitate named entity recognition with high accuracy. Each named entity is a valid segment. The method utilizing local linguistic features is denoted HybridSegNER. It obtains confident segments based on the voting results of multiple off-the-shelf NER tools. Another method, utilizing local collocation knowledge and denoted HybridSegNGram, is proposed based on the observation that many posts published within a short time period are about the same topic. HybridSegNGram segments the posts by estimating the term dependency within a batch of posts. The segments recognized with high confidence from local context serve as good feedback for extracting more meaningful segments. The learning from pseudo feedback is conducted iteratively, and the method implementing the iterative learning is named HybridSegIter.
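The idea of estimating term dependency within a batch of posts can be sketched with a simple collocation score. The code below is not the HybridSegNGram algorithm; it substitutes a Dice coefficient over adjacent word pairs, with a minimum pair count and a threshold that are assumptions made for this illustration.

```python
from collections import Counter


def collocation_scores(batch, min_count=2):
    """Dice score for adjacent word pairs seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for post in batch:
        words = post.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {
        pair: 2 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
        if count >= min_count
    }


def segment(post, scores, threshold=0.5):
    """Join adjacent words into one segment while their score is high enough."""
    words = post.lower().split()
    if not words:
        return []
    segments = [[words[0]]]
    for prev, word in zip(words, words[1:]):
        if scores.get((prev, word), 0.0) >= threshold:
            segments[-1].append(word)  # strong dependency: extend the segment
        else:
            segments.append([word])  # weak dependency: start a new segment
    return [" ".join(s) for s in segments]


# A tiny batch of posts about the same topic.
batch = ["i love new york", "new york is big", "we visited new york"]
scores = collocation_scores(batch)
```

Because "new york" recurs across the batch, it is kept together as one segment while the other words stay separate.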

7.1.4 Named Entity Recognition

Named Entity Recognition can be basically defined as identifying and categorizing certain types of data (i.e., person, location and organization names, and date-time and numeric expressions) in a certain type of text. On the other hand, tweets are characteristically short and noisy. Given the short length of posts and their restriction-free writing style, named entity recognition on this type of data becomes challenging. After basic segmentation, a great number of named entities in the text, such as personal names, location names and organization names, are not yet segmented and recognized properly. Part-of-speech tagging is applicable to a wide range of NLP tasks, including named entity segmentation and information extraction. Named Entity Recognition strategies vary on basically three factors: language, textual genre and domain, and entity type. Language is very important because language characteristics affect the approaches. A simple tagging baseline assigns each word its most frequent tag and assigns each out-of-vocabulary (OOV) word the most common POS tag. Textual genre is another concept whose effects cannot be neglected.

[Figure: architecture diagram. Components: Data Acquisition (datasets); Preprocessing (stop-word removal, stemming, word analysis, tokenization); Hybrid Segmentation (global context, local context, pseudo feedback); POS tagger; Named Entity Recognition; network, content and blog features; KNN classifiers; trained rumors.]
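The most-frequent-tag baseline mentioned above is small enough to sketch in full. The function names and the toy training data are assumptions; the code expects a non-empty training set.

```python
from collections import Counter, defaultdict


def train_baseline_tagger(tagged_sentences):
    """Learn each word's most frequent tag and the overall most common tag."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            all_tags[tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return word_to_tag, default_tag


def tag(words, word_to_tag, default_tag):
    """Tag known words by lookup; OOV words get the default tag."""
    return [(w, word_to_tag.get(w.lower(), default_tag)) for w in words]


# Toy training data with hypothetical tag names.
training = [
    [("the", "DET"), ("dog", "NOUN"), ("chases", "VERB"), ("cat", "NOUN")],
    [("the", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
]
word_to_tag, default_tag = train_baseline_tagger(training)
tagged = tag(["The", "dog", "meows"], word_to_tag, default_tag)
```

The unseen word "meows" receives the overall most common tag, which is exactly the OOV rule described in the text.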

7.1.5 Performance Evaluation

In this module the system is evaluated using accuracy rate and normalized utility. Our proposed system provides an improved accuracy rate and normalized utility. Once a message from the sender has been received by the receiver, the system analyzes it for negative and positive words. If negative words are present, it warns the user, and if this continues more than 3 or 4 times, it makes a suggestion and blocks the users who are associated with the negative contents. In addition, the posts shared by the data owner can be shared as public or private. In public mode, the posts can be viewed by all users present in the data owner's profile, whereas in private mode only the users permitted by the owner can access the posts published by the data owner.
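The warn-then-block behavior described above can be sketched as a small stateful filter. The word list, the class name and the threshold of three strikes are illustrative assumptions; the text itself only says the process repeats "more than 3 or 4 times" before blocking.

```python
from collections import defaultdict

# Hypothetical word list and threshold chosen for the example.
NEGATIVE_WORDS = {"hate", "stupid", "ugly"}
BLOCK_AFTER = 3


class MessageFilter:
    """Track negative messages per sender and decide what to do with each."""

    def __init__(self, negative_words=NEGATIVE_WORDS, block_after=BLOCK_AFTER):
        self.negative_words = set(negative_words)
        self.block_after = block_after
        self.strikes = defaultdict(int)
        self.blocked = set()

    def handle(self, sender, text):
        """Return 'blocked', 'warn' or 'deliver' for one incoming message."""
        if sender in self.blocked:
            return "blocked"
        words = {w.strip(".,!?").lower() for w in text.split()}
        if words & self.negative_words:
            self.strikes[sender] += 1
            if self.strikes[sender] >= self.block_after:
                self.blocked.add(sender)
                return "blocked"
            return "warn"
        return "deliver"
```

Counting strikes per sender keeps one user's behavior from affecting messages sent by others.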

VIII CONCLUSION

The increase in the use of computers and the internet has caused a serious increase in the methods of information extraction from social media. In this study, the information access and interpretation steps used in the literature have been investigated in detail. The sentiment analysis work was conducted on Facebook data. Obtaining Facebook data, cleaning the data, transforming the data into numerical form, extracting meaningful results and interpreting them were performed. Machine learning algorithms were applied using the KNIME software.

In this study, the accuracy, sensitivity and selectivity values of the Decision Tree Learner and K-NN algorithms have been examined in detail in two different experimental sets. The results obtained were compared in detail with the help of tables. In future studies, it is aimed to carry out sentiment analysis on different data sets using different machine learning and intelligent optimization algorithms. To increase the accuracy rate of these studies, more suitable data sets will be prepared. Intelligent search and optimization algorithms with optimized parameters may also be used together with integrated feature selection methods to improve sentiment analysis performance. We designed novel features for use in the classification of posts in order to develop a system that filters informational data out of conversations that are of little value when searching for immediate information that relief efforts or bystanders can use to minimize damage. The results of our experiments show that classifying tweets as "rumor" vs. "non-rumor" can rely solely on the proposed features if computing resources are a concern, since the computing power required to process data into features is greatly decreased in comparison to a BOW feature set, which contains a substantially larger number of features. However, if the computing power and time necessary to process incoming Facebook data are not a concern, a combined feature set of the proposed features and the BOW-presence approach will maximize overall accuracy.
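To make the trade-off between a small hand-crafted feature set and a BOW-presence feature set concrete, the following sketch combines both and classifies with a simple k-nearest-neighbour vote. It is an illustration only: the two features in `proposed_features` are hypothetical stand-ins (the paper's actual features are not reproduced here), the toy posts are invented, and Hamming distance is an assumed choice of metric.

```python
from collections import Counter


def build_vocab(texts):
    """Sorted vocabulary of all whitespace-separated tokens."""
    return sorted({w for t in texts for w in t.lower().split()})


def bow_presence(text, vocab):
    """1 if the vocabulary word occurs in the text, else 0 (presence, not counts)."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]


def proposed_features(text):
    """Hypothetical stand-in features: has a question mark, has a digit."""
    return [int("?" in text), int(any(ch.isdigit() for ch in text))]


def combined(text, vocab):
    return proposed_features(text) + bow_presence(text, vocab)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows with smallest Hamming distance."""
    ranked = sorted(
        range(len(train_X)),
        key=lambda i: sum(a != b for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy posts, not data from the study.
texts = [
    "is it true the bridge fell ?",
    "is it true the dam broke ?",
    "official notice shelter is open",
    "official notice roads are open",
]
labels = ["rumor", "rumor", "non-rumor", "non-rumor"]

vocab = build_vocab(texts)
train_X = [combined(t, vocab) for t in texts]

pred_a = knn_predict(train_X, labels, combined("is it true the school fell ?", vocab))
pred_b = knn_predict(train_X, labels, combined("official notice bridge is open", vocab))
print(len(proposed_features(texts[0])), len(vocab), pred_a, pred_b)
```

Even in this tiny example the BOW-presence vector is several times longer than the hand-crafted one, and its length grows with the vocabulary, which is the computing-cost argument made above.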

IX FUTURE ENHANCEMENT

In future work we can extend our approach by implementing various classification algorithms to predict attackers and to eliminate them from Facebook datasets. The approach can also be implemented for the various languages used on Facebook. Likewise, it can be extended to analyze not only text but also images, videos and so on, so that the full scenario of all users and their entities is managed efficiently and inappropriate media is avoided.




BIOGRAPHIES

A. Anitha, M.Sc.

Research Scholar

Department of Computer Science

Thanthai Hans Roever College

Perambalur

Dr. S. Sivakumar, M.C.A., M.Phil., Ph.D.

Head and Asst. Professor, Department of Computer Science


Perambalur



factors affect consumer behavior relative to social networking. Despite the initial progress made by researchers, development in this area of study has been limited.

II DOMAIN INTRODUCTION

Big Data

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, prevent diseases, combat crime and so on.

Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5 × 10^18 bytes) of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds or even thousands of servers. What is considered big data varies depending on the capabilities of the organization managing the set and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. Big data is a moving target: what is considered to be "big" today will not be so years ahead. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time. Big data size is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex and of a massive scale.

In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a new V, "Veracity", is added by some organizations to describe it. While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sounder distinction between big data and Business Intelligence regarding data and their use.

Business Intelligence uses descriptive statistics with data with high information density to measure things, detect trends, etc. Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships and causal effects) from large sets of data with low information density, to reveal relationships and dependencies and to perform predictions of outcomes and behaviors. Big data can also be defined as a large volume of unstructured data that cannot be handled by standard database management systems such as DBMS, RDBMS or ORDBMS.

Big data can be described by the following characteristics:

Volume - The quantity of data that is generated is very important in this context. It is the size of the data that determines the value and potential of the data under consideration and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence the characteristic.

Variety - The next aspect of Big Data is its variety. The category to which the data belongs is an essential fact that data analysts need to know. This helps the people who closely analyze the data to use it effectively to their advantage, thus upholding the importance of the Big Data.

Velocity - The term 'velocity' in this context refers to the speed of generation of data, or how fast the data is generated and processed to meet the demands and challenges that lie ahead in the path of growth and development.

Variability - This is a factor that can be a problem for those who analyze the data. It refers to the inconsistency the data can show at times, which hampers the process of handling and managing the data effectively.

















