
Upload: vijay-karan

Post on 08-Aug-2015


M.Phil Computer Science Data Mining Projects

Web : www.kasanpro.com     Email : [email protected]

List Link : http://kasanpro.com/projects-list/m-phil-computer-science-data-mining-projects

Title : Bridging Socially Enhanced Virtual Communities     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/bridging-socially-enhanced-virtual-communities

Abstract : Interactions spanning multiple organizations have become an important aspect in today's collaboration landscape. Organizations create alliances to fulfill strategic objectives. The dynamic nature of collaborations increasingly demands automated techniques and algorithms to support the creation of such alliances. Our approach is based on the recommendation of potential alliances through the discovery of currently relevant competence sources and the support of semi-automatic formation. The environment is service-oriented, comprising humans and software services with distinct capabilities. To mediate between previously separated groups and organizations, we introduce the broker concept that bridges disconnected networks. We present a dynamic broker discovery approach based on interaction mining techniques and trust metrics. We evaluate our approach by using simulations in real Web services testbeds.

Title : Mood Recognition During Online Self Assessment Test     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/mood-recognition-during-online-self-assessment-test

Abstract : Individual emotions play a crucial role during any learning interaction. Identifying a student's emotional state and providing personalized feedback, based on integrated pedagogical models, has been considered one of the main limitations of traditional e-learning tools. This paper presents an empirical study that illustrates how learner mood may be predicted during online self-assessment tests. Here, a previous method of determining student mood has been refined based on the assumption that the influence on learner mood of questions already answered declines in relation to their distance from the current question. Moreover, this paper sets out to indicate that "exponential logic" may help produce more efficient models if integrated adequately with affective modeling. The results show that these assumptions may prove useful to future research.
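
The distance-decay assumption above can be sketched as a weighted average in which each answered question's mood contribution is discounted exponentially by its distance from the current question. The function name, decay parameter, and weighting scheme below are illustrative assumptions, not the paper's actual model:

```python
import math

def predict_mood(question_moods, decay=0.5):
    """Estimate current learner mood from per-question mood scores.

    question_moods: mood scores for already-answered questions, ordered
    from first to most recent. Each question's influence decays
    exponentially with its distance from the current question (an
    illustrative stand-in for the paper's "exponential logic").
    """
    n = len(question_moods)
    if n == 0:
        return 0.0
    # Distance of question i from the current (next) question is n - i.
    weights = [math.exp(-decay * (n - i)) for i in range(n)]
    total = sum(w * m for w, m in zip(weights, question_moods))
    return total / sum(weights)
```

With this weighting, a positive mood on the most recent question outweighs an equally positive mood several questions back, which is the behavior the refined method assumes.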

Title : On The Path To A World Wide Web Census: A Large Scale Survey     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/world-wide-web-census-large-scale-survey

Abstract : How large is the World Wide Web? We present the results of the largest Web survey performed to date. We use an interdisciplinary approach that borrows methods from ecology. In addition to Web server counts, we also present other information collected, such as Web server market share, the operating systems used by Web servers, and Web server distribution. The software system used to collect the data is a prototype of a system that we believe can be used for a complete Web census.

Title : Knowledge Sharing In Virtual Organizations: Barriers and Enablers     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/knowledge-sharing-in-virtual-organizations-barriers-enablers

Abstract : Modern organizations have to deal with many drastic external and internal constraints, due notably to the globalization of the economy, fast technological change, and shifts in customer demand. Moreover, organizations' functionally divided, hierarchical internal structures are too rigid and make it difficult for them to adjust to the changing constraints resulting from the pressure of their external environment. Consequently, to survive and maintain their competitive advantage in the market, modern organizations must alter their internal structure to become organic and flexible systems able to adapt and progress in a high-velocity environment. Virtual organizations are among the most popular solutions that provide organizations with more agility and improve their efficiency and effectiveness. Despite many success stories materialized by economic and non-economic benefits, many virtual organizations have failed to reach their goals due to the problems they have encountered while trying to manage knowledge. In this work, we analyze the barriers and enablers of knowledge management in virtual organizations.

Title : Adaptive Provisioning of Human Expertise in Service-oriented Systems     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/adaptive-provisioning-human-expertise-service-oriented-systems

Abstract : Web-based collaborations have become essential in today's business environments. Due to the availability of various SOA frameworks, Web services emerged as the de facto technology to realize flexible compositions of services. While most existing work focuses on the discovery and composition of software-based services, we highlight concepts for a people-centric Web. Knowledge-intensive environments clearly demand provisioning of human expertise along with sharing of computing resources or business data through software-based services. To address these challenges, we introduce an adaptive approach allowing humans to provide their expertise through services using SOA standards, such as WSDL and SOAP. The seamless integration of humans in the SOA loop triggers numerous social implications, such as evolving expertise and drifting interests of human service providers. Here we propose a framework based on interaction monitoring techniques enabling adaptations in SOA-based socio-technical systems.

Title : Cost-aware rank join with random and sorted access     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/cost-aware-rank-join-random-sorted-access

Abstract : In this project, we address the problem of joining ranked results produced by two or more services on the Web. We consider services endowed with two kinds of access that are often available: i) sorted access, which returns tuples sorted by score; ii) random access, which returns tuples matching a given join attribute value. Rank join operators combine objects of two or more relations and output the k combinations with the highest aggregate score. While the past literature has studied suitable bounding schemes for this setting, in this paper we focus on the definition of a pulling strategy, which determines the order of invocation of the joined services. We propose the CARS (Cost-Aware with Random and Sorted access) pulling strategy, which is derived at compile time and is oblivious of the query-dependent score distributions. We cast CARS as the solution of an optimization problem based on a small set of parameters characterizing the joined services. We validate the proposed strategy with experiments on both real and synthetic data sets. We show that CARS outperforms prior proposals and that its overall access cost is always within a very short margin of that of an oracle-based optimal strategy. In addition, CARS is shown to be robust to the uncertainty that may characterize the estimated parameters.
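
The sorted/random access setting can be sketched as a top-k rank join over two score-sorted services. The sketch below uses a naive alternating pull order (CARS itself chooses the pull order cost-optimally, which is not reproduced here) and simulates random access with dictionaries; because every pulled key is completed via random access, unseen join results are bounded by the sum of the last scores seen on each side:

```python
def rank_join(svc_a, svc_b, k):
    """Top-k rank join of two services, each a list of (join_key, score)
    sorted by descending score. Alternating sorted accesses; random
    access (dict lookup) completes each pulled key immediately."""
    ra_a = dict(svc_a)            # random access by join key
    ra_b = dict(svc_b)
    results = {}                  # join_key -> aggregate (sum) score
    ia = ib = 0
    while ia < len(svc_a) or ib < len(svc_b):
        if ia < len(svc_a):       # sorted access on service A
            key, score = svc_a[ia]; ia += 1
            if key in ra_b:
                results[key] = score + ra_b[key]
        if ib < len(svc_b):       # sorted access on service B
            key, score = svc_b[ib]; ib += 1
            if key in ra_a:
                results[key] = ra_a[key] + score
        top = sorted(results.values(), reverse=True)[:k]
        # Any still-unseen combination involves keys unseen on BOTH
        # sides, so it scores at most last_a + last_b.
        last_a = svc_a[ia - 1][1] if ia else float("inf")
        last_b = svc_b[ib - 1][1] if ib else float("inf")
        if len(top) == k and top[-1] >= last_a + last_b:
            break
    return sorted(results.items(), key=lambda kv: -kv[1])[:k]
```

Replacing the alternating pulls with a precomputed, cost-aware pull schedule is exactly the degree of freedom the CARS strategy optimizes.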

Title : USHER Improving Data Quality with Dynamic Forms     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/usher-improving-data-quality-dynamic-forms

Abstract : Data quality is a critical problem in modern databases. Data entry forms present the first and arguably best opportunity for detecting and mitigating errors, but there has been little research into automatic methods for improving data quality at entry time. In this paper, we propose USHER, an end-to-end system for form design, entry, and data quality assurance. Using previous form submissions, USHER learns a probabilistic model over the questions of the form. USHER then applies this model at every step of the data entry process to improve data quality. Before entry, it induces a form layout that captures the most important data values of a form instance as quickly as possible. During entry, it dynamically adapts the form to the values being entered and enables real-time feedback to guide the data enterer toward their intended values. After entry, it re-asks questions that it deems likely to have been entered incorrectly. We evaluate all three components of USHER using two real-world data sets. Our results demonstrate that each component has the potential to improve data quality considerably, at a reduced cost when compared to current practice.
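
The after-entry re-asking step can be sketched with a deliberately simplified model: learn an independent empirical distribution per question from past submissions, then flag answers whose probability falls below a threshold. USHER's actual model captures dependencies between questions; the per-question marginals, function names, and threshold here are illustrative assumptions:

```python
from collections import Counter

def learn_model(past_submissions):
    """Learn an independent empirical value distribution per form
    question (a simplification of USHER's probabilistic model)."""
    model = {}
    for form in past_submissions:
        for question, value in form.items():
            model.setdefault(question, Counter())[value] += 1
    return model

def questions_to_reask(model, submission, threshold=0.1):
    """Flag answers whose empirical probability falls below `threshold`
    as candidates for re-asking after entry."""
    flagged = []
    for question, value in submission.items():
        counts = model.get(question, Counter())
        total = sum(counts.values())
        prob = counts[value] / total if total else 0.0
        if prob < threshold:
            flagged.append(question)
    return flagged
```

The same learned model could drive the before-entry and during-entry components: ordering questions by how much information they carry, and warning as soon as an unlikely value is typed.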

Title : A Dual Framework and Algorithms for Targeted Data Delivery     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/algorithms-targeted-data-delivery

Abstract : In this project, we develop a framework for comparing pull-based solutions and present dual optimization approaches. The first approach maximizes user utility while satisfying constraints on the usage of system resources. The second approach satisfies the utility of user profiles while minimizing the usage of system resources. We present an adaptive algorithm and show how it can incorporate feedback to improve user utility with only a moderate increase in resource utilization.

http://kasanpro.com/ieee/final-year-project-center-thanjavur-reviews

Title : Selecting Attributes for Sentiment Classification Using Feature Relation Networks     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/sentiment-classification-using-feature-relation-networks

Abstract : A major concern when incorporating large sets of diverse n-gram features for sentiment classification is the presence of noisy, irrelevant, and redundant attributes. These concerns can often make it difficult to harness the augmented discriminatory potential of extended feature sets. We propose a rule-based multivariate text feature selection method called Feature Relation Network (FRN) that considers semantic information and also leverages the syntactic relationships between n-gram features. FRN is intended to efficiently enable the inclusion of extended sets of heterogeneous n-gram features for enhanced sentiment classification. Experiments were conducted on three online review test beds in comparison with methods used in prior sentiment classification research. FRN outperformed the comparison univariate, multivariate, and hybrid feature selection methods; it was able to select attributes resulting in significantly better classification accuracy irrespective of the feature subset sizes. Furthermore, by incorporating syntactic information about n-gram relations, FRN is able to select features in a more computationally efficient manner than many multivariate and hybrid techniques.

Title : Improving Aggregate Recommendation Diversity Using Ranking-Based Techniques     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/aggregate-recommendation-diversity-using-ranking-based

Abstract : Recommender systems are becoming increasingly important to individual users and businesses for providing personalized recommendations. However, while the majority of algorithms proposed in the recommender systems literature have focused on improving recommendation accuracy, other important aspects of recommendation quality, such as the diversity of recommendations, have often been overlooked. In this paper, we introduce and explore a number of item ranking techniques that can generate recommendations that have substantially higher aggregate diversity across all users while maintaining comparable levels of recommendation accuracy. Comprehensive empirical evaluation consistently shows the diversity gains of the proposed techniques using several real-world rating datasets and different rating prediction algorithms.
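
One ranking-based technique of this kind can be sketched as popularity-based re-ranking: among items the model predicts the user will like, recommend the least popular first, so that across all users the recommended catalog is broader. The function name, rating threshold, and tie-breaking rule below are illustrative assumptions rather than the paper's exact parameterization:

```python
def diverse_top_n(predictions, popularity, n, min_rating=3.5):
    """Popularity-based re-ranking for one user.

    predictions: {item: predicted_rating} for the user
    popularity:  {item: number of known ratings in the system}

    Only items predicted at or above min_rating are candidates (this
    preserves accuracy); among them, less popular items rank first
    (this raises aggregate diversity).
    """
    liked = [item for item, r in predictions.items() if r >= min_rating]
    # Ascending popularity; break ties by higher predicted rating.
    liked.sort(key=lambda item: (popularity.get(item, 0), -predictions[item]))
    return liked[:n]
```

Lowering min_rating trades accuracy for diversity, which is the knob such ranking-based techniques expose.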

Title : Integration of Sound Signature in Graphical Password Authentication System     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/sound-signature-graphical-password-authentication-system

Abstract : In this project, a graphical password system with a supportive sound signature to increase the memorability of the password is discussed. In the proposed work, a click-based graphical password scheme called Cued Click Points (CCP) is presented. In this system a password consists of a sequence of images in which the user selects one click-point per image. In addition, the user is asked to select a sound signature corresponding to each click-point; this sound signature is used to help the user log in. The system showed very good performance in terms of speed, accuracy, and ease of use. Users preferred CCP to PassPoints, saying that selecting and remembering only one point per image was easier, and the sound signature helped considerably in recalling the click points.
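
Verifying a CCP-style password amounts to checking one entered click per image against the stored click-point within a pixel tolerance. The tolerance value and the use of Chebyshev (per-axis) distance below are illustrative assumptions; real click-based schemes also hash the discretized points rather than storing raw coordinates:

```python
def verify_ccp(stored_points, entered_points, tolerance=10):
    """Check an entered Cued Click Points password: one click per image,
    each within `tolerance` pixels of the stored click-point on both
    axes. Returns False on any mismatch or length difference."""
    if len(stored_points) != len(entered_points):
        return False
    return all(
        abs(sx - ex) <= tolerance and abs(sy - ey) <= tolerance
        for (sx, sy), (ex, ey) in zip(stored_points, entered_points)
    )
```

In CCP each click also determines which image is shown next, so a wrong click is noticed at the next image — the sound signature adds a second recall cue on top of that.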

Title : Monitoring Service Systems from a Language-Action Perspective     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/monitoring-service-systems-language-action

Abstract : The exponential growth in the global economy is being supported by service systems, realized by recasting mission-critical application services accessed across organizational boundaries. The Language-Action Perspective (LAP) is based upon the notion that "expert behavior requires an exquisite sensitivity to context and that such sensitivity is more in the realm of the human than in that of the artificial."

Business processes are increasingly distributed and open, making them prone to failure. Monitoring is, therefore, an important concern not only for the processes themselves but also for the services that comprise these processes. We present a framework for multilevel monitoring of these service systems. It formalizes interaction protocols, policies, and commitments that account for standard and extended effects following the language-action perspective, and allows specification of goals and monitors at varied abstraction levels. We demonstrate how the framework can be implemented and evaluate it with multiple scenarios, such as merchant-customer transactions, that include specifying and monitoring open-service policy commitments.

Title : A Personalized Ontology Model for Web Information Gathering     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/ontology-model-web-information-gathering

Abstract : As a model for knowledge description and formalization, ontologies are widely used to represent user profiles in personalized web information gathering. However, when representing user profiles, many models have utilized only knowledge from either a global knowledge base or user local information. In this paper, a personalized ontology model is proposed for knowledge representation and reasoning over user profiles. This model learns ontological user profiles from both a world knowledge base and user local instance repositories. The ontology model is evaluated by comparing it against benchmark models in web information gathering. The results show that this ontology model is successful.

Title : Publishing Search Logs - A Comparative Study of Privacy Guarantees     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/publishing-search-logs-privacy-guarantees

Abstract : Search engine companies collect the "database of intentions", the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper we analyze algorithms for publishing frequent keywords, queries, and clicks of a search log. We first show how methods that achieve variants of k-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by differential privacy unfortunately does not provide any utility for this problem. Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves k-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to k-anonymity while at the same time achieving much stronger privacy guarantees.
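
ZEALOUS-style publishing can be sketched as a two-phase filter: keep only keywords whose true count reaches a first threshold, perturb the counts with Laplace noise, and publish only noisy counts that still clear a second threshold. The thresholds and noise scale below are illustrative placeholders, not the calibrated values the paper derives for its privacy guarantee:

```python
import random

def publish_frequent_keywords(counts, tau1=20, tau2=30, scale=5.0, rng=None):
    """Two-phase, noise-thresholded publishing of a keyword histogram.

    counts: {keyword: true count in the search log}
    tau1:   minimum true count to be considered at all
    tau2:   minimum noisy count to be published
    scale:  Laplace noise scale (difference of two exponentials
            with mean `scale` is Laplace-distributed).
    """
    rng = rng or random.Random()
    published = {}
    for keyword, count in counts.items():
        if count < tau1:
            continue  # infrequent keywords never reach the noise stage
        noisy = count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if noisy >= tau2:
            published[keyword] = noisy
    return published
```

The first threshold keeps rare (hence potentially identifying) keywords out entirely; the noise plus second threshold is what yields a probabilistic privacy guarantee stronger than k-anonymity.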

Title : Scalable Scheduling of Updates in Streaming Data Warehouse     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/scheduling-updates-streaming-data-warehouse

Abstract : The study of collective behavior seeks to understand how individuals behave in a social networking environment. Oceans of data generated by social media such as Facebook, Twitter, Flickr, and YouTube present opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections present in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while demonstrating comparable prediction performance to other non-scalable methods.

Title : The Awareness Network, To Whom Should I Display My Actions? And, Whose Actions Should I Monitor?     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/accessing-monitoring-inawareness-network

Abstract : The concept of awareness plays a pivotal role in research in Computer-Supported Cooperative Work (CSCW). Recently, Software Engineering researchers interested in the collaborative nature of software development have explored the implications of this concept in the design of software development tools. A critical aspect of awareness is the associated coordinative work practices of displaying and monitoring actions. This aspect concerns how colleagues monitor one another's actions to understand how these actions impact their own work, and how they display their actions in such a way that others can easily monitor them while doing their own work. In this paper, we focus on an additional aspect of awareness: the identification of the social actors who should be monitored and the actors to whom their actions should be displayed. We address this aspect by presenting software developers' work practices based on ethnographic data from three different software development teams. In addition, we illustrate how these work practices are influenced by different factors, including the organizational setting, the age of the project, and the software architecture. We discuss how our results are relevant for both CSCW and Software Engineering researchers.

Title : The World in a Nutshell: Concise Range Queries     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/world-nutshell-concise-range-queries

Abstract : With the advance of wireless communication technology, it is quite common for people to view maps or get related services from handheld devices, such as mobile phones and PDAs. Range queries, as one of the most commonly used tools, are often posed by users to retrieve needed information from a spatial database. However, due to the limits of communication bandwidth and hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communication-efficient nor informative to the users. This is simply because there are often too many results returned from a range query.

In view of this problem, we present a novel idea: a concise representation of a specified size for the range query results, while incurring minimal information loss, shall be computed and returned to the user. Such a concise range query not only reduces communication costs, but also offers better usability to the users, providing an opportunity for interactive exploration.

The usefulness of concise range queries is confirmed by comparing them with other possible alternatives, such as sampling and clustering. Unfortunately, we prove that finding the optimal representation with minimum information loss is an NP-hard problem. Therefore, we propose several effective and nontrivial algorithms to find a good approximate result. Extensive experiments on real-world data have demonstrated the effectiveness and efficiency of the proposed techniques.
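
A concise representation of this flavor can be sketched by summarizing the result set as a small number of bounding boxes with counts. Since the optimal partition is NP-hard to find, the sketch below uses simple gridding as a stand-in for the paper's approximation algorithms; the cell-size parameter is an illustrative assumption:

```python
def concise_representation(points, cell):
    """Summarize a range-query result as (bounding_box, count) pairs,
    one per nonempty grid cell of side `cell`. Each box is
    (min_x, min_y, max_x, max_y); counts preserve cardinality while
    exact coordinates are dropped (the information loss)."""
    cells = {}
    for x, y in points:
        cells.setdefault((x // cell, y // cell), []).append((x, y))
    summary = []
    for pts in cells.values():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        summary.append(((min(xs), min(ys), max(xs), max(ys)), len(pts)))
    return summary
```

The handheld client then renders a few boxes instead of thousands of points, and the user can zoom into a box to issue a finer query — the interactive exploration the abstract mentions.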

Title : A Query Formulation Language for the Data Web     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/query-formulation-language-data-web

Abstract : We present a query formulation language called MashQL in order to easily query and fuse structured data on the web. The main novelty of MashQL is that it allows people with limited IT skills to explore and query one or multiple data sources without prior knowledge about the schema, structure, vocabulary, or any technical details of these sources. More importantly, to be robust and cover most cases in practice, we do not assume that a data source has an offline or inline schema. This poses several language-design and performance complexities that we fundamentally tackle. To illustrate the query formulation power of MashQL, and without loss of generality, we chose the Data Web scenario. We also chose querying RDF, as it is the most primitive data model; hence, MashQL can be similarly used for querying relational databases and XML. We present two implementations of MashQL: an online mashup editor and a Firefox add-on. The former illustrates how MashQL can be used to query and mash up the Data Web as simply as filtering and piping web feeds; the Firefox add-on illustrates using the browser as a web composer rather than only a navigator. To end, we evaluate MashQL on querying two datasets, DBLP and DBpedia, and show that our indexing techniques allow instant user interaction.

Title : Exploring Application-Level Semantics for Data Compression     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/exploring-application-level-semantics-data-compression

Abstract : Natural phenomena show that many creatures form large social groups and move in regular patterns. However, previous works focus on finding the movement patterns of each single object or all objects. In this paper, we first propose an efficient distributed mining algorithm to jointly identify a group of moving objects and discover their movement patterns in wireless sensor networks. Afterward, we propose a compression algorithm, called 2P2D, which exploits the obtained group movement patterns to reduce the amount of delivered data.

The compression algorithm includes a sequence merge phase and an entropy reduction phase. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal solution. Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results show that the proposed compression algorithm leverages the group movement patterns to reduce the amount of delivered data effectively and efficiently.

Title : Data Leakage Detection     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/data-leakage-detection

Abstract : A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases we can also inject "realistic but fake" data records to further improve our chances of detecting leakage and identifying the guilty party.
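
The guilt assessment can be sketched with a simplified agent-guilt model: for each leaked item an agent holds, the chance that some agent (rather than independent gathering, probability p_guess) leaked it is split evenly among the agents holding that item, and an agent is guilty if it leaked at least one item. This is a sketch in the spirit of such models, not the paper's exact formulation:

```python
def guilt_probability(agent, allocations, leaked, p_guess=0.2):
    """Estimate the probability that `agent` leaked at least one item.

    allocations: {agent_name: set_of_items_given_to_that_agent}
    leaked:      set of items found in the unauthorized place
    p_guess:     probability a leaked item was gathered independently

    For each leaked item the agent holds, the leak probability
    (1 - p_guess) is shared evenly among all holders of that item;
    guilt is 1 minus the probability the agent leaked none of them.
    """
    prob_not_guilty = 1.0
    for item in leaked & allocations.get(agent, set()):
        holders = sum(1 for items in allocations.values() if item in items)
        prob_not_guilty *= 1 - (1 - p_guess) / holders
    return 1 - prob_not_guilty
```

This also shows why the allocation strategies matter: the fewer agents share an item, the more sharply a leak of that item implicates its holder.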

Title : Knowledge Based Interactive Postmining of Association Rules Using Ontologies     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/knowledge-based-interactive-postmining-association-rules-using-ontologies

Abstract : In data mining, the usefulness of association rules is strongly limited by the huge number of delivered rules. To overcome this drawback, several methods have been proposed in the literature, such as itemset concise representations, redundancy reduction, and post-processing. However, being generally based on statistical information, most of these methods do not guarantee that the extracted rules are interesting to the user. Thus, it is crucial to help the decision-maker with an efficient post-processing step in order to reduce the number of rules. This paper proposes a new interactive approach to prune and filter discovered rules. First, we propose to use ontologies in order to improve the integration of user knowledge in the post-processing task. Second, we propose the Rule Schema formalism, extending the specification language proposed by Liu et al. for user expectations. Furthermore, an interactive framework is designed to assist the user throughout the analysis task. Applying our new approach to voluminous sets of rules, we were able, by integrating domain expert knowledge in the post-processing step, to reduce the number of rules to several dozen or fewer. Moreover, the quality of the filtered rules was validated by the domain expert at various points in the interactive process.
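
The filtering step can be sketched with a toy version of a rule schema: the expert declares which items may appear in antecedents and consequents, and only matching rules survive. The real Rule Schema formalism works over ontology concepts (so a schema term matches all of its specializations); the flat item sets below are an illustrative simplification:

```python
def filter_rules(rules, schema):
    """Prune discovered association rules against a simple schema.

    rules:  list of (antecedent_set, consequent_set, confidence)
    schema: (allowed_antecedent_items, allowed_consequent_items)

    A rule is kept only if its antecedent and consequent are subsets
    of the expert-declared item sets.
    """
    allowed_ant, allowed_cons = schema
    return [
        (ant, cons, conf)
        for ant, cons, conf in rules
        if ant <= allowed_ant and cons <= allowed_cons
    ]
```

In the interactive framework, the expert would refine the schema after inspecting the surviving rules and re-filter, iterating until only a few dozen rules remain.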

Title : A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/link-analysis-mining-relational-databases

Abstract : This work introduces a link-analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random-walk model through the database defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller Markov chain containing only the elements of interest and preserving the main characteristics of the initial chain is extracted by stochastic complementation. This reduced chain is then analyzed by projecting jointly the elements of interest in the diffusion-map subspace and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined, and to multiple correspondence analysis when the database takes the form of a simple star schema. On the other hand, a kernel version of the diffusion-map distance, generalizing the basic diffusion-map distance to directed graphs, is also introduced, and the links with spectral clustering are discussed. Several datasets are analyzed using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.

Title : Query Planning for Continuous Aggregation Queries over a Network of Data Aggregators     Language : C#

Project Link : http://kasanpro.com/p/c-sharp/query-planning-continuous-aggregation-queries

Abstract : Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some aggregation function over distributed data items, for example, to know the value of a client's portfolio, or the average of temperatures sensed by a set of sensors. In these queries a client specifies a coherency requirement as part of the query. We present a low-cost, scalable technique to answer continuous aggregation queries using a network of aggregators of dynamic data items. In such a network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Just as various fragments of a dynamic web page are served by one or more nodes of a content distribution network, our technique involves decomposing a client query into sub-queries and executing the sub-queries on judiciously chosen data aggregators with their individual sub-query incoherency bounds. We provide a technique for obtaining the optimal set of sub-queries with their incoherency bounds that satisfies the client query's coherency requirement with the least number of refresh messages sent from aggregators to the client. For estimating the number of refresh messages, we build a query cost model which can be used to estimate the number of messages required to satisfy the client-specified incoherency bound. Performance results using real-world traces show that our cost-based query planning leads to queries being executed using less than one third the number of messages required by existing schemes.
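
The bound-division idea can be sketched for a SUM-style aggregate: the client's incoherency bound C is split among the chosen sub-queries so that the shares add up to C, and each data aggregator refreshes the client only when its partial sum drifts past its share. Proportional allocation by data volatility is an illustrative heuristic here, not the paper's optimal cost-model-based planner:

```python
def allocate_incoherency(client_bound, volatilities):
    """Divide a client's incoherency bound among sub-queries.

    volatilities: {aggregator: measure of how fast its data items
    change}. Each aggregator receives a share of the bound
    proportional to its volatility; for an additive aggregate, the
    shares summing to the client bound guarantees the client's
    coherency requirement is met.
    """
    total = sum(volatilities.values())
    return {agg: client_bound * v / total
            for agg, v in volatilities.items()}
```

Giving more slack to the more volatile aggregator reduces how often it must refresh the client, which is the intuition the cost model makes precise.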

Title :Scalable learning of collective behavior
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/scalable-learning-collective-behavior

Abstract : The study of collective behavior seeks to understand how individuals behave in a social networking environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections presented in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while demonstrating a prediction performance comparable to other non-scalable methods.
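The core of the edge-centric idea is that clustering edges (rather than nodes) yields sparse social dimensions: a node belongs only to the dimensions of its own incident edges. The toy sketch below assumes a caller-supplied edge-assignment rule; the paper clusters edges with a scalable k-means, which is replaced here by a trivial placeholder.

```python
# Toy sketch of edge-centric social dimensions: each edge is assigned to a
# cluster (here via an arbitrary placeholder rule; the actual scheme clusters
# edges at scale), and a node's sparse social dimensions are simply the set of
# clusters that its incident edges fall into.
from collections import defaultdict

def edge_centric_dimensions(edges, assign):
    dims = defaultdict(set)
    for edge in edges:
        cluster = assign(edge)
        for node in edge:
            dims[node].add(cluster)
    return dict(dims)

edges = [("a", "b"), ("b", "c"), ("c", "d")]
# illustrative assignment rule: edges touching "a" or "b" go to cluster 0
dims = edge_centric_dimensions(edges, assign=lambda e: 0 if "a" in e or "b" in e else 1)
# node "b" belongs only to dimension {0}; bridge node "c" spans {0, 1}
```

Because each node appears only in dimensions of its own edges, the resulting node-by-dimension affiliation matrix stays sparse even for very large networks.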

http://kasanpro.com/ieee/final-year-project-center-thanjavur-reviews

Title :Horizontal Aggregations in SQL to prepare Data Sets for Data Mining Analysis
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/horizontal-aggregations-sql-data-mining-analysis

Abstract : Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g. point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
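The CASE method above can be sketched as a small SQL code generator: one SUM(CASE ...) column per pivot value, grouped by the row-identifying column. The table and column names below (`sales`, `store`, `month`, `amount`) are invented for illustration; the generator is a minimal sketch, not the paper's full method.

```python
# Hedged sketch of the CASE-based horizontal aggregation: emit SQL that returns
# one aggregated column per pivot value instead of one row per group/value pair.
def horizontal_sum(table, group_col, pivot_col, pivot_values, measure):
    cases = ",\n  ".join(
        f"SUM(CASE WHEN {pivot_col} = '{v}' THEN {measure} ELSE 0 END) AS {measure}_{v}"
        for v in pivot_values
    )
    return f"SELECT {group_col},\n  {cases}\nFROM {table}\nGROUP BY {group_col};"

sql = horizontal_sum("sales", "store", "month", ["jan", "feb"], "amount")
print(sql)
```

The generated query returns one row per store with `amount_jan` and `amount_feb` as separate columns, which is the denormalized instance-feature layout most mining algorithms expect.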

M.Phil Computer Science Data Mining Projects

Title :A Machine Learning Approach for Identifying Disease-Treatment Relations in Short Texts
Language : C#


Project Link : http://kasanpro.com/p/c-sharp/machine-learning-identifying-disease-treatment-relations-short-texts

Abstract : The Machine Learning (ML) field has gained momentum in almost every domain of research and has just recently become a reliable tool in the medical domain. The empirical domain of automatic learning is used in tasks such as medical decision support, medical imaging, protein-protein interaction, extraction of medical knowledge, and overall patient management care.

ML is envisioned as a tool by which computer-based systems can be integrated into the healthcare field in order to get better, more efficient medical care. This paper describes an ML-based methodology for building an application that is capable of identifying and disseminating healthcare information.

It extracts sentences from published medical papers that mention diseases and treatments, and identifies semantic relations that exist between diseases and treatments.

Our evaluation results for these tasks show that the proposed methodology obtains reliable outcomes that could be integrated in an application to be used in the medical care domain. The potential value of this paper stands in the ML settings that we propose and in the fact that we outperform previous results on the same data set.

Title :m-Privacy for Collaborative Data Publishing
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/privacy-collaborative-data-publishing

Abstract : In this paper, we consider the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers. We consider a new type of "insider attack" by colluding data providers who may use their own data records (a subset of the overall data) in addition to the external background knowledge to infer the data records contributed by other data providers. The paper addresses this new threat and makes several contributions. First, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers. Second, we present heuristic algorithms exploiting the equivalence group monotonicity of privacy constraints and adaptive ordering techniques for efficiently checking m-privacy given a set of records. Finally, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of anonymized data with efficiency. Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency than existing and baseline algorithms while providing the m-privacy guarantee.
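A brute-force version of the m-privacy check can make the definition concrete: a group of records is m-private with respect to a privacy constraint if the constraint still holds after any coalition of up to m providers removes its own records. The sketch below uses plain k-anonymity as the constraint and enumerates coalitions naively; the paper's contribution is precisely the heuristics that avoid this enumeration.

```python
# Illustrative m-privacy check against a k-anonymity-style constraint: a group
# of (provider, record) pairs satisfies m-privacy if, after removing the records
# of ANY coalition of up to m providers, at least k records remain in the group.
from itertools import combinations

def is_m_private(records, m, k):
    providers = sorted({p for p, _ in records})
    for size in range(1, m + 1):
        for coalition in combinations(providers, size):
            remaining = [r for p, r in records if p not in coalition]
            if len(remaining) < k:
                return False
    return True

records = [("P1", "r1"), ("P2", "r2"), ("P3", "r3"), ("P3", "r4")]
# with k=2, removing any single provider leaves at least 2 records (m=1 holds),
# but the coalition {P1, P3} leaves only 1 record, so m=2 fails
```

This naive check is exponential in the number of providers, which is why the paper develops monotonicity-based pruning and adaptive ordering instead.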

Title :Spatial Approximate String Search
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/spatial-approximate-string-search

Abstract : This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of an SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists and reference-node-based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
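The min-wise signature idea can be shown in a few lines: summarize a string's q-gram set by the minimum of several independent hash functions, and estimate set resemblance (Jaccard similarity) by how often two signatures' minima agree. The salted-hash construction below is a generic minhash sketch, not the paper's exact implementation.

```python
# Sketch of a min-wise (minhash) signature over a string's q-grams, the kind of
# concise summary an MHR-tree-style index node could store. Each "hash function"
# is simulated by salting Python's built-in hash with a seed.
def qgrams(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minwise_signature(grams, num_hashes=8):
    return [min(hash((seed, g)) for g in grams) for seed in range(num_hashes)]

def resemblance_estimate(sig_a, sig_b):
    # fraction of slots where the minima agree estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sim = resemblance_estimate(minwise_signature(qgrams("theatre")),
                           minwise_signature(qgrams("theater")))
```

An index node whose signature shows low resemblance to the query string's q-grams can be pruned without visiting its sub-tree, which is the pruning argument analyzed in the paper.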

Title :Predicting iPhone Sales from iPhone Tweets
Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/predicting-iphone-sales-iphone-tweets

Abstract : Recent research in the field of computational social science has shown how data resulting from the widespread adoption and use of social media channels such as Twitter can be used to predict outcomes such as movie revenues, election winners, localized moods, and epidemic outbreaks. Underlying assumptions for this research stream on predictive analytics are that social media actions such as tweeting, liking, commenting and rating are proxies for users'/consumers' attention to a particular object/product and that the shared digital artefact that is persistent can create social influence. In this paper, we demonstrate how social media data from Twitter can be used to predict the sales of iPhones. Based on a conceptual model of social data consisting of a social graph (actors, actions, activities, and artefacts) and social text (topics, keywords, pronouns, and sentiments), we develop and evaluate a linear regression model that transforms iPhone tweets into a prediction of the quarterly iPhone sales with an average error close to the established prediction models from investment banks. This strong correlation between iPhone tweets and iPhone sales becomes marginally stronger after incorporating sentiments of tweets. We discuss the findings and conclude with implications for predictive analytics with big social data.

Title :A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/clustering-based-feature-subset-selection-algorithm-high-dimensional-data

Abstract : Feature selection involves identifying a subset of the most useful features that produces results compatible with the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness is related to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST and several representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
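The two-step structure of FAST can be sketched compactly: build a minimum spanning tree over the features, cut long (dissimilar) edges so the remaining components form clusters, then keep the most target-relevant feature from each cluster. In this sketch the dissimilarity and relevance scores are supplied by the caller (the paper uses symmetric uncertainty for both), and the cut threshold is a plain parameter rather than the paper's criterion.

```python
# Minimal sketch of the FAST idea: MST over features, cut edges above a
# threshold, keep the most relevant feature per resulting cluster.
def fast_select(features, dissim, relevance, cut):
    # Prim's algorithm for the minimum spanning tree over the feature graph
    in_tree, edges = {features[0]}, []
    while len(in_tree) < len(features):
        u, v = min(((a, b) for a in in_tree for b in features if b not in in_tree),
                   key=lambda e: dissim(*e))
        in_tree.add(v)
        edges.append((u, v))
    # cut edges longer than the threshold; remaining components are clusters
    clusters = {f: {f} for f in features}
    for u, v in (e for e in edges if dissim(*e) <= cut):
        merged = clusters[u] | clusters[v]
        for f in merged:
            clusters[f] = merged
    groups = {frozenset(c) for c in clusters.values()}
    # step two: one representative per cluster, the feature most related to the target
    return sorted(max(g, key=relevance) for g in groups)

# toy scores: f1 and f2 are near-duplicates, f3 is independent
d = {frozenset(("f1", "f2")): 0.1, frozenset(("f1", "f3")): 0.9,
     frozenset(("f2", "f3")): 0.8}
rel = {"f1": 0.5, "f2": 0.9, "f3": 0.4}
selected = fast_select(["f1", "f2", "f3"],
                       lambda a, b: d[frozenset((a, b))], rel.get, cut=0.5)
```

Here f1 and f2 end up in one cluster and only the more relevant f2 survives, alongside the independent f3, which mirrors FAST's goal of a small, mutually independent subset.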


Title :Crowdsourcing Predictors of Behavioral Outcomes
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/crowdsourcing-predictors-behavioral-outcomes

Abstract : Generating models from large data sets, and determining which subsets of data to mine, is becoming increasingly automated. However, choosing what data to collect in the first place requires human intuition or experience, usually supplied by a domain expert. This paper describes a new approach to machine science which demonstrates for the first time that non-domain experts can collectively formulate features and provide values for those features such that they are predictive of some behavioral outcome of interest. This was accomplished by building a web platform in which human groups interact to both respond to questions likely to help predict a behavioral outcome and pose new questions to their peers. This results in a dynamically growing online survey, but the result of this cooperative behavior also leads to models that can predict users' outcomes based on their responses to the user-generated survey questions. Here we describe two web-based experiments that instantiate this approach: the first site led to models that can predict users' monthly electric energy consumption; the other led to models that can predict users' body mass index. As exponential increases in content are often observed in successful online collaborative communities, the proposed methodology may, in the future, lead to similar exponential rises in discovery and insight into the causal factors of behavioral outcomes.

Title :Data Extraction for Deep Web Using WordNet
Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/data-extraction-deep-web-using-wordnet

Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity of data records and detect the correct data region with higher precision using the semantic properties of these data records. The advantages of this method are that it can extract three types of data records, namely, single-section data records, multiple-section data records, and loosely structured data records, and it also provides options for aligning iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from multilingual web pages and that it is domain independent.


Title :Data Extraction for Deep Web Using WordNet
Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/data-extraction-deep-web-using-wordnet-code

Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity of data records and detect the correct data region with higher precision using the semantic properties of these data records. The advantages of this method are that it can extract three types of data records, namely, single-section data records, multiple-section data records, and loosely structured data records, and it also provides options for aligning iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from multilingual web pages and that it is domain independent.

Title :Data Extraction for Deep Web Using WordNet
Language : PHP

Project Link : http://kasanpro.com/p/php/data-extraction-deep-web-using-wordnet-implement

Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity of data records and detect the correct data region with higher precision using the semantic properties of these data records. The advantages of this method are that it can extract three types of data records, namely, single-section data records, multiple-section data records, and loosely structured data records, and it also provides options for aligning iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from multilingual web pages and that it is domain independent.

Title :An Effective Retrieval of Medical Records using Data Mining Techniques
Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/retrieval-medical-records-data-mining

Abstract : Nowadays, the quality of the healthcare domain depends mainly on the delivery of modern healthcare and the efficiency of healthcare systems. Due to time and cost constraints, most people rely on healthcare systems to obtain healthcare services. It has therefore become very important to develop an automated tool that is capable of identifying and disseminating relevant healthcare information. This work focuses on the retrieval of updated, accurate and relevant information from Medline datasets using a Machine Learning approach. The proposed work uses a keyword searching algorithm for extracting relevant information from Medline datasets and the K-Nearest Neighbor (KNN) algorithm to identify the relation between disease and treatment. As a result, improvement of patient care is achieved effectively.
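The KNN step of such a pipeline can be sketched with bag-of-words vectors and cosine similarity: label a new disease-treatment sentence by the majority label among its k most similar training sentences. The training sentences and labels below are invented for illustration; a real system would train on annotated Medline sentences.

```python
# Hedged sketch of KNN relation labeling over bag-of-words sentence vectors.
from collections import Counter
from math import sqrt

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / norm if norm else 0.0

def knn_relation(sentence, train, k=3):
    vec = Counter(sentence.lower().split())
    # rank training sentences by similarity, take the k nearest, majority-vote
    ranked = sorted(train, key=lambda ex: cosine(vec, Counter(ex[0].lower().split())),
                    reverse=True)
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [("aspirin treats headache", "cure"),
         ("ibuprofen relieves headache pain", "cure"),
         ("vaccine prevents measles", "prevent")]
label = knn_relation("paracetamol treats headache", train, k=1)
```

With k=1 the query sentence lands nearest the first training example, so the relation is labeled "cure"; the keyword-searching stage of the described work would run before this step to filter candidate sentences.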


Title :An Effective Retrieval of Medical Records using Data Mining Techniques
Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/retrieval-medical-records-data-mining-code

Abstract : Nowadays, the quality of the healthcare domain depends mainly on the delivery of modern healthcare and the efficiency of healthcare systems. Due to time and cost constraints, most people rely on healthcare systems to obtain healthcare services. It has therefore become very important to develop an automated tool that is capable of identifying and disseminating relevant healthcare information. This work focuses on the retrieval of updated, accurate and relevant information from Medline datasets using a Machine Learning approach. The proposed work uses a keyword searching algorithm for extracting relevant information from Medline datasets and the K-Nearest Neighbor (KNN) algorithm to identify the relation between disease and treatment. As a result, improvement of patient care is achieved effectively.

Title :An Effective Retrieval of Medical Records using Data Mining Techniques
Language : PHP

Project Link : http://kasanpro.com/p/php/retrieval-medical-records-data-mining-implement

Abstract : Nowadays, the quality of the healthcare domain depends mainly on the delivery of modern healthcare and the efficiency of healthcare systems. Due to time and cost constraints, most people rely on healthcare systems to obtain healthcare services. It has therefore become very important to develop an automated tool that is capable of identifying and disseminating relevant healthcare information. This work focuses on the retrieval of updated, accurate and relevant information from Medline datasets using a Machine Learning approach. The proposed work uses a keyword searching algorithm for extracting relevant information from Medline datasets and the K-Nearest Neighbor (KNN) algorithm to identify the relation between disease and treatment. As a result, improvement of patient care is achieved effectively.

Title :Design and analysis of concept adapting real time data stream Applications
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/concept-adapting-real-time-data-stream-applications

Abstract : Real-time signals are continuous in nature and change abruptly, hence there is a need to apply an efficient and concept-adapting real-time data stream mining technique to take intelligent decisions online. Concept drift in a real-time data stream refers to a change in the class (concept) definitions over time. This is also called non-stationary learning (NSL).

The most important criterion is to solve the real-time data stream mining problem with concept drift in an effective manner.

Title :Data Extraction for Deep Web Using WordNet
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/data-extraction-deep-web-using-wordnet-module

Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity of data records and detect the correct data region with higher precision using the semantic properties of these data records. The advantages of this method are that it can extract three types of data records, namely, single-section data records, multiple-section data records, and loosely structured data records, and it also provides options for aligning iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from multilingual web pages and that it is domain independent.

Title :Answering General Time-Sensitive Queries
Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/answering-general-time-sensitive-queries

Abstract : Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for "recency" queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. We show that our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
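The integration of the temporal aspect into ranking can be sketched as a simple blended score: topic similarity weighted against a boost for documents published inside the query's important time intervals. The binary interval boost and the `alpha` weight below are made-up simplifications of the paper's scoring techniques.

```python
# Illustrative time-sensitive scoring: blend topic similarity with a temporal
# component that rewards documents whose publication time falls inside one of
# the query's automatically identified important intervals.
def temporal_score(pub_time, intervals):
    return 1.0 if any(lo <= pub_time <= hi for lo, hi in intervals) else 0.0

def rank(docs, intervals, alpha=0.7):
    # docs: list of (doc_id, topic_similarity, publication_time)
    scored = [(alpha * sim + (1 - alpha) * temporal_score(t, intervals), d)
              for d, sim, t in docs]
    return [d for _, d in sorted(scored, reverse=True)]

docs = [("d1", 0.9, 5), ("d2", 0.8, 100)]
# the interval (90, 110) lifts d2 above the topically stronger d1
order = rank(docs, intervals=[(90, 110)])
```

A purely topical ranker would place d1 first; the temporal boost reorders the results, which is the behavior the framework aims for on time-sensitive queries.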



Title :Answering General Time-Sensitive Queries
Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/answering-general-time-sensitive-queries-framwork

Abstract : Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for "recency" queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. We show that our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.

Title :A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/indexing-scalable-record-linkage-deduplication

Abstract : Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of twelve variations of six indexing techniques. Their complexity is analysed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
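Standard blocking, the simplest of such indexing techniques, illustrates how the number of candidate pairs is reduced: records sharing a blocking key land in the same block, and only within-block pairs are compared. The name-prefix key below is an arbitrary illustrative choice.

```python
# Sketch of standard blocking for record linkage: only record pairs that share
# a blocking key (here, the first three letters of the name) become candidates.
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key=lambda r: r["name"][:3].lower()):
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [{"id": 1, "name": "Smith"}, {"id": 2, "name": "Smyth"},
           {"id": 3, "name": "Smith"}, {"id": 4, "name": "Jones"}]
pairs = candidate_pairs(records)
# only ids 1 and 3 share a block, so 1 of the 6 possible pairs is compared
```

Note the trade-off the survey studies: blocking cut six possible comparisons down to one, but the true match Smith/Smyth was missed because their keys differ, which is why variations such as phonetic keys and sorted neighbourhoods exist.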

Title :Decentralized Probabilistic Text Clustering
Language : NS2

Project Link : http://kasanpro.com/p/ns2/decentralized-probabilistic-text-clustering

Abstract : Text clustering is an established technique for improving quality in information retrieval, for both centralized and distributed environments. However, traditional text clustering algorithms fail to scale on highly distributed environments, such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents only with very few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.
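The probabilistic assignment step can be pictured as follows: rather than comparing a document against every cluster, a peer samples a few candidate clusters and assigns the document to the best of that sample. The uniform sampling and term-overlap similarity below are illustrative stand-ins for the paper's cluster-selection rule and guarantees.

```python
# Toy sketch of probabilistic document-to-cluster assignment: compare a
# document with only a small sample of clusters instead of all of them.
import random
from collections import Counter

def similarity(doc, centroid):
    doc_words = Counter(doc.split())
    return sum(doc_words[w] * centroid.get(w, 0) for w in doc_words)

def probabilistic_assign(doc, clusters, sample_size, rng):
    # sample a few candidate clusters, then pick the most similar among them
    candidates = rng.sample(list(clusters), min(sample_size, len(clusters)))
    return max(candidates, key=lambda c: similarity(doc, clusters[c]))

clusters = {"sports": {"game": 1, "team": 1}, "tech": {"code": 1, "chip": 1}}
best = probabilistic_assign("team wins game", clusters,
                            sample_size=2, rng=random.Random(0))
```

With many clusters, a peer evaluates only `sample_size` of them per document, which is the source of the scalability the abstract claims; the paper's contribution is bounding the probability that this shortcut picks a wrong cluster.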

Title :Decentralized Probabilistic Text Clustering
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/decentralized-probabilistic-text-clustering-code

Abstract : Text clustering is an established technique for improving quality in information retrieval, for both centralized and distributed environments. However, traditional text clustering algorithms fail to scale on highly distributed environments, such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents only with very few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.

Title :Effective Pattern Discovery for Text Mining
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/effective-pattern-discovery-text-mining

Abstract : Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.


Title :Ranking Model Adaptation for Domain-Specific Search
Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/adaptation-domain-specific-search

Abstract : With the explosive emergence of vertical search domains, applying the broad-based ranking model directly to different domains is no longer desirable due to domain differences, while building a unique ranking model for each domain is both laborious for labeling data and time-consuming for training models. In this paper, we address these difficulties by proposing a regularization-based algorithm called ranking adaptation SVM (RA-SVM), through which we can adapt an existing ranking model to a new domain, so that the amount of labeled data and the training cost is reduced while the performance is still guaranteed. Our algorithm only requires the predictions from the existing ranking models, rather than their internal representations or the data from auxiliary domains. In addition, we assume that documents similar in the domain-specific feature space should have consistent rankings, and add some constraints to control the margin and slack variables of RA-SVM adaptively. Finally, a ranking adaptability measurement is proposed to quantitatively estimate whether an existing ranking model can be adapted to a new domain. Experiments performed over Letor and two large scale datasets crawled from a commercial search engine demonstrate the applicability of the proposed ranking adaptation algorithms and the ranking adaptability measurement.

Title :Ranking Model Adaptation for Domain-Specific Search
Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/ranking-adaptation-domain-specific-search

Abstract : With the explosive emergence of vertical search domains, applying the broad-based ranking model directly to different domains is no longer desirable due to domain differences, while building a unique ranking model for each domain is both laborious for labeling data and time-consuming for training models. In this paper, we address these difficulties by proposing a regularization-based algorithm called ranking adaptation SVM (RA-SVM), through which we can adapt an existing ranking model to a new domain, so that the amount of labeled data and the training cost is reduced while the performance is still guaranteed. Our algorithm only requires the predictions from the existing ranking models, rather than their internal representations or the data from auxiliary domains. In addition, we assume that documents similar in the domain-specific feature space should have consistent rankings, and add some constraints to control the margin and slack variables of RA-SVM adaptively. Finally, a ranking adaptability measurement is proposed to quantitatively estimate whether an existing ranking model can be adapted to a new domain. Experiments performed over Letor and two large scale datasets crawled from a commercial search engine demonstrate the applicability of the proposed ranking adaptation algorithms and the ranking adaptability measurement.

Title :Scalable Learning of Collective Behavior Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/scalable-learning-collective-behavior-code

Abstract : The study of collective behavior aims to understand how individuals behave in a social networking environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections presented in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while demonstrating prediction performance comparable to other non-scalable methods.
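
The edge-centric idea above can be sketched minimally: cluster the *edges* (by any method), then describe each node by the set of edge-clusters its incident edges fall into. Because a node has at most as many affiliations as its degree, the resulting social dimensions are sparse. The toy network and hand-made cluster assignment below are assumptions for illustration, not the paper's clustering procedure.

```python
from collections import defaultdict

def edge_centric_dimensions(edges, edge_cluster_of):
    """edges: list of (u, v) pairs; edge_cluster_of maps each edge to a
    cluster id (produced by any edge clustering, e.g. k-means).
    Returns sparse social dimensions: node -> set of affiliation ids."""
    dims = defaultdict(set)
    for edge in edges:
        u, v = edge
        cluster = edge_cluster_of[edge]
        dims[u].add(cluster)  # a node joins every cluster its edges touch
        dims[v].add(cluster)
    return dict(dims)

# toy network: two friend circles sharing node 0
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
clusters = {(0, 1): "A", (1, 2): "A", (0, 3): "B", (3, 4): "B"}
dims = edge_centric_dimensions(edges, clusters)
```

Node 0 ends up affiliated with both circles, while every node's affiliation count is bounded by its degree — the property that keeps the dimensions sparse at scale.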

http://kasanpro.com/ieee/final-year-project-center-thanjavur-reviews

Title :Scalable Learning of Collective Behavior Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/scalable-learning-collective-behavior-implement

Abstract : The study of collective behavior aims to understand how individuals behave in a social networking environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections presented in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while demonstrating prediction performance comparable to other non-scalable methods.

Title :Resilient Identity Crime Detection Language : C#

Project Link : http://kasanpro.com/p/c-sharp/resilient-identity-crime-detection

Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching, have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new multilayered detection system complemented with two additional layers: communal detection (CD) and spike detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud detection, the concepts of resilience, adaptivity, and quality data discussed in the paper are general to the design, implementation, and evaluation of all detection systems.
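
The spike-detection (SD) intuition — raise suspicion when a new application's attribute values show a sudden burst of duplicates in the recent stream — can be sketched as follows. The window size, threshold, and scoring formula are invented for illustration and are not the paper's actual SD layer.

```python
from collections import deque

class SpikeDetector:
    """Toy attribute-oriented duplicate-spike scorer over a sliding window."""

    def __init__(self, window=10, spike_threshold=3):
        self.threshold = spike_threshold
        self.recent = deque(maxlen=window)  # recently seen applications

    def score(self, application):
        """application: dict attribute -> value. Returns suspicion in [0, 1]:
        the fraction of attributes whose value spikes in the recent window."""
        spiking = 0
        for attr, value in application.items():
            dup = sum(1 for past in self.recent if past.get(attr) == value)
            if dup >= self.threshold:  # sharp spike of duplicates
                spiking += 1
        self.recent.append(dict(application))
        return spiking / max(1, len(application))

sd = SpikeDetector(window=10, spike_threshold=3)
# a fraudster probing with one phone number and varying names
scores = [sd.score({"phone": "555-0100", "name": n}) for n in ("a", "b", "c", "d")]
```

The first applications score 0.0; once the same phone number has recurred enough times, the suspicion score jumps — mirroring the "sharp spikes in duplicates" hypothesis in the abstract.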

Title :Resilient Identity Crime Detection Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/resilient-identity-crime-detection-code

Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching, have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new multilayered detection system complemented with two additional layers: communal detection (CD) and spike detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud detection, the concepts of resilience, adaptivity, and quality data discussed in the paper are general to the design, implementation, and evaluation of all detection systems.

Title :Resilient Identity Crime Detection Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/resilient-identity-crime-detection-implement

Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching, have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new multilayered detection system complemented with two additional layers: communal detection (CD) and spike detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud detection, the concepts of resilience, adaptivity, and quality data discussed in the paper are general to the design, implementation, and evaluation of all detection systems.

Title :Resilient Identity Crime Detection Language : PHP

Project Link : http://kasanpro.com/p/php/resilient-identity-crime-detection-module

Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching, have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new multilayered detection system complemented with two additional layers: communal detection (CD) and spike detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud detection, the concepts of resilience, adaptivity, and quality data discussed in the paper are general to the design, implementation, and evaluation of all detection systems.

Title :Real-Time Analysis of Physiological Data to Support Medical Applications Language : C#

Project Link : http://kasanpro.com/p/c-sharp/real-time-analysis-physiological-data-support-medical-applications

Abstract : This paper presents a flexible framework that performs real-time analysis of physiological data to monitor people's health conditions in any context (e.g., during daily activities, in hospital environments). Given historical physiological data, different behavioral models tailored to specific conditions (e.g., a particular disease, a specific patient) are automatically learnt. A suitable model for the currently monitored patient is exploited in the real-time stream classification phase. The framework has been designed to perform both instantaneous evaluation and stream analysis over a sliding time window. To allow ubiquitous monitoring, real-time analysis can also be executed on mobile devices. As a case study, the framework has been validated in the intensive care scenario. Experimental validation, performed on 64 patients affected by different critical illnesses, demonstrates the effectiveness and flexibility of the proposed framework in detecting different severity levels of monitored people's clinical situations.
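
Stream analysis over a sliding time window, as described above, can be sketched with a toy severity classifier. The signal (heart rate), the window length, and the thresholds below are placeholder assumptions, not the framework's learnt models.

```python
from collections import deque

def classify_stream(readings, window=5, warn=100.0, critical=120.0):
    """readings: iterable of heart-rate samples. Yields (sample, severity),
    where severity is derived from the mean over a sliding window."""
    buf = deque(maxlen=window)  # keeps only the most recent `window` samples
    for r in readings:
        buf.append(r)
        avg = sum(buf) / len(buf)
        if avg >= critical:
            yield r, "critical"
        elif avg >= warn:
            yield r, "warning"
        else:
            yield r, "normal"

# a patient whose heart rate suddenly rises mid-stream
labels = [level for _, level in classify_stream([80] * 5 + [130] * 5)]
```

Using the window mean rather than single samples smooths transient noise: the label climbs gradually from "normal" through "warning" to "critical" as elevated samples fill the window.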

Title :Contextual query classification in web search Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/contextual-query-classification-web-search

Abstract : There has been an increasing interest in exploiting multiple sources of evidence for improving the quality of a search engine's results. User context elements like interests, preferences, and intents are the main sources exploited in information retrieval approaches to better fit the user's information needs. Using the user intent to improve query-specific retrieval relies on classifying web queries into three types according to the user intent: informational, navigational, and transactional. However, the query type classification strategies involved are based solely on query features, where the query type decision is made out of the user context represented by his search history. In this paper, we present a contextual query classification method making use of both query features and the user context, defined by quality indicators of the previous query session type, called the query profile. We define a query session as a sequence of queries of the same type. Preliminary experimental results carried out using TREC data show that our approach is promising.
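
The two signals combined above — query features plus session context — can be sketched with toy rules. The keyword lists are invented stand-ins for real query features, and the fallback to the session profile is a deliberately crude version of the paper's query-profile indicators.

```python
def classify_query(query, session_profile=None):
    """Return 'navigational', 'transactional', or 'informational'.
    session_profile: dominant type of the current query session, used as
    the contextual fallback when query features alone are not decisive."""
    q = query.lower()
    if any(tok in q for tok in ("www.", ".com", "homepage", "login")):
        return "navigational"      # user wants to reach a specific site
    if any(tok in q for tok in ("buy", "download", "price", "order")):
        return "transactional"     # user wants to perform an action
    # no decisive query feature: fall back on the session context
    return session_profile or "informational"
```

For an ambiguous query like "cheap flights", the feature rules stay silent and the session profile decides — the core idea of making the classification contextual.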

Title :Contextual query classification in web search Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/contextual-query-classification-web-search-results

Abstract : There has been an increasing interest in exploiting multiple sources of evidence for improving the quality of a search engine's results. User context elements like interests, preferences, and intents are the main sources exploited in information retrieval approaches to better fit the user's information needs. Using the user intent to improve query-specific retrieval relies on classifying web queries into three types according to the user intent: informational, navigational, and transactional. However, the query type classification strategies involved are based solely on query features, where the query type decision is made out of the user context represented by his search history. In this paper, we present a contextual query classification method making use of both query features and the user context, defined by quality indicators of the previous query session type, called the query profile. We define a query session as a sequence of queries of the same type. Preliminary experimental results carried out using TREC data show that our approach is promising.

Title :Contextual query classification in web search Language : PHP

Project Link : http://kasanpro.com/p/php/query-classification-web-search

Abstract : There has been an increasing interest in exploiting multiple sources of evidence for improving the quality of a search engine's results. User context elements like interests, preferences, and intents are the main sources exploited in information retrieval approaches to better fit the user's information needs. Using the user intent to improve query-specific retrieval relies on classifying web queries into three types according to the user intent: informational, navigational, and transactional. However, the query type classification strategies involved are based solely on query features, where the query type decision is made out of the user context represented by his search history. In this paper, we present a contextual query classification method making use of both query features and the user context, defined by quality indicators of the previous query session type, called the query profile. We define a query session as a sequence of queries of the same type. Preliminary experimental results carried out using TREC data show that our approach is promising.

Title :Annotating Search Results from Web Databases Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/annotating-search-results-web-databases

Abstract : An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
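
The first step described above — aligning data units from different result records into groups sharing one semantic — can be sketched positionally. Real alignment also exploits formats and content similarity; this purely positional version, with made-up book records, is a simplification.

```python
def align_data_units(records):
    """records: list of lists of data-unit strings (one inner list per
    search result). Returns groups of units aligned by column position,
    so each group can later receive a single annotation label."""
    width = max(len(rec) for rec in records)
    groups = [[] for _ in range(width)]
    for rec in records:
        for i, unit in enumerate(rec):
            groups[i].append(unit)  # same position -> same group
    return groups

# two result records extracted from a book-shopping result page
rows = [["The C Book", "Mike Banahan", "$25"],
        ["SQL Primer", "Jane Doe", "$18"]]
columns = align_data_units(rows)
```

Once aligned, each group (here: titles, authors, prices) can be annotated as a whole, and that mapping becomes the reusable annotation wrapper for pages from the same site.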

Title :Annotating Search Results from Web Databases Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/annotating-search-results-web-databas

Abstract : An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.

Title :Annotating Search Results from Web Databases Language : PHP

Project Link : http://kasanpro.com/p/php/annotating-search-results-web-databases-efficient

Abstract : An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.

Title :A cost sensitive decision tree classification in credit card identity crime detection system Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/cost-sensitive-decision-tree-classification-credit-card-identity-crime-detection-systems

Abstract :

Title :A cost sensitive decision tree classification in credit card identity crime detection system Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/cost-sensitive-decision-tree-classification-credit-card-identity-crime-detection-system

Abstract :

Title :A cost sensitive decision tree classification in credit card identity crime detection system Language : C#

Project Link : http://kasanpro.com/p/c-sharp/cost-sensitive-decision-tree-classification-credit-card-identity-fraud-crime-detection

Abstract :

Title :A cost sensitive decision tree classification in credit card identity crime detection system Language : PHP

Project Link : http://kasanpro.com/p/php/decision-tree-classification-credit-card-identity-crime-detection-system

Abstract :

Title :A cost-sensitive decision tree approach for fraud detection Language : C#

Project Link : http://kasanpro.com/p/c-sharp/credit-card-identity-crime-detection-system-cost-sensitive-decision-tree-classification

Abstract : With the developments in information technology, fraud is spreading all over the world, resulting in huge financial losses. Though fraud prevention mechanisms such as CHIP&PIN are developed for credit card systems, these mechanisms do not prevent the most common fraud types, such as fraudulent credit card usage over virtual POS (Point Of Sale) terminals or mail orders, so-called online credit card fraud. As a result, fraud detection becomes the essential tool and probably the best way to stop such fraud types. In this study, a new cost-sensitive decision tree approach which minimizes the sum of misclassification costs while selecting the splitting attribute at each non-terminal node is developed, and the performance of this approach is compared with the well-known traditional classification models on a real-world credit card data set. In this approach, misclassification costs are taken as varying. The results show that this cost-sensitive decision tree algorithm outperforms the existing well-known methods on the given problem set with respect to not only the well-known performance metrics such as accuracy and true positive rate, but also a newly defined cost-sensitive metric specific to the credit card fraud detection domain. Accordingly, financial losses due to fraudulent transactions can be decreased further by the implementation of this approach in fraud detection systems.
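
The splitting rule named above — choose the attribute that minimizes the sum of misclassification costs at the node — can be sketched directly. The cost values, attribute names, and toy records below are invented for illustration; the paper's trees additionally use varying per-record costs.

```python
def split_cost(records, attr, cost_fn, cost_fp):
    """records: (features, is_fraud) pairs. Cost of splitting on attr when
    each branch predicts its cheaper class (fn = missed fraud, fp = false alarm)."""
    branches = {}
    for feats, fraud in records:
        branches.setdefault(feats[attr], []).append(fraud)
    total = 0.0
    for labels in branches.values():
        frauds = sum(labels)
        legits = len(labels) - frauds
        # predicting 'fraud' costs cost_fp per legit; 'legit' costs cost_fn per fraud
        total += min(cost_fp * legits, cost_fn * frauds)
    return total

def best_split(records, attrs, cost_fn=100.0, cost_fp=1.0):
    """Pick the attribute whose split leaves the lowest total cost."""
    return min(attrs, key=lambda a: split_cost(records, a, cost_fn, cost_fp))

# toy applications: 'foreign' separates fraud cleanly, 'weekday' does not
records = [({"foreign": 1, "weekday": 0}, 1), ({"foreign": 1, "weekday": 1}, 1),
           ({"foreign": 0, "weekday": 0}, 0), ({"foreign": 0, "weekday": 1}, 0)]
chosen = best_split(records, ["weekday", "foreign"])
```

Because a missed fraud is far costlier than a false alarm, the cost-based criterion naturally prefers splits that isolate fraud, even when an accuracy-based criterion would rate two splits similarly.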

Title :A cost-sensitive decision tree approach for fraud detection Language : VB.NET

Project Link : http://kasanpro.com/p/vb-net/cost-sensitive-decision-tree-classify-credit-card-identity-crime-detection-system

Abstract : With the developments in information technology, fraud is spreading all over the world, resulting in huge financial losses. Though fraud prevention mechanisms such as CHIP&PIN are developed for credit card systems, these mechanisms do not prevent the most common fraud types, such as fraudulent credit card usage over virtual POS (Point Of Sale) terminals or mail orders, so-called online credit card fraud. As a result, fraud detection becomes the essential tool and probably the best way to stop such fraud types. In this study, a new cost-sensitive decision tree approach which minimizes the sum of misclassification costs while selecting the splitting attribute at each non-terminal node is developed, and the performance of this approach is compared with the well-known traditional classification models on a real-world credit card data set. In this approach, misclassification costs are taken as varying. The results show that this cost-sensitive decision tree algorithm outperforms the existing well-known methods on the given problem set with respect to not only the well-known performance metrics such as accuracy and true positive rate, but also a newly defined cost-sensitive metric specific to the credit card fraud detection domain. Accordingly, financial losses due to fraudulent transactions can be decreased further by the implementation of this approach in fraud detection systems.

Title :PREDICTING HOME SERVICE DEMANDS FROM APPLIANCE USAGE DATA Language : C#

Project Link : http://kasanpro.com/p/c-sharp/predicting-home-service-demands-from-appliance-usage-data

Abstract : Power management in homes and offices requires appliance usage prediction when future user requests are not available. The randomness and uncertainties associated with appliance usage make the prediction of appliance usage from energy consumption data a non-trivial task. A general model for prediction at the appliance level is still lacking. In this work, we propose to enrich learning algorithms with expert knowledge and propose a general model using a knowledge-driven approach to forecast whether a particular appliance will start at a given hour or not. The approach is both knowledge-driven and data-driven. The overall energy management for a house requires that the prediction is done for the next 24 hours in the future. The proposed model is tested over the Irise data and the results are compared with some trivial knowledge-driven predictors.
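
A minimal illustration of combining a data-driven signal with expert knowledge, as proposed above: estimate from historical logs how often an appliance started at each hour, then gate the prediction with a simple expert rule. The quiet-hours rule, the threshold, and the logged hours are assumptions for illustration, not the paper's model or the Irise data.

```python
from collections import Counter

def train_hourly_profile(start_hours):
    """start_hours: logged hours (0-23) at which the appliance started.
    Returns the relative frequency of starts for each hour of the day."""
    counts = Counter(start_hours)
    total = max(1, len(start_hours))
    return {h: counts[h] / total for h in range(24)}

def will_start(profile, hour, threshold=0.2, quiet_hours=range(0, 6)):
    """Data-driven frequency estimate gated by an expert rule
    (assumed here: this appliance is never started at night)."""
    if hour in quiet_hours:
        return False
    return profile.get(hour, 0.0) >= threshold

profile = train_hourly_profile([19, 19, 19, 20, 8])  # logged start hours
```

Repeating this check for each of the next 24 hours yields the day-ahead start forecast the abstract calls for.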

Title :Data Mining and Wireless Sensor Network for Groundnut Pest/Disease Interaction and Predictions - A Preliminary Study Language : C#

Project Link : http://kasanpro.com/p/c-sharp/data-mining-wireless-sensor-network-groundnut-pest-disease-predictions

Abstract : Data-driven precision agriculture, particularly pest/disease management, requires dynamic crop-weather data. An experiment was conducted in a semi-arid region of India to understand the crop-weather-pest/disease relations using wireless sensory and field-level surveillance data on the closely related and interdependent pest (Thrips) and disease (Bud Necrosis) dynamics of the groundnut (peanut) crop. Various data mining techniques were used to turn the data into useful information, knowledge, relations, trends, and correlations of the crop-weather-pest/disease continuum. These dynamics obtained from the data mining techniques and trained through mathematical models were validated with corresponding ground-level surveillance data. It was found that Bud Necrosis viral disease infection is strongly influenced by humidity, maximum temperature, prolonged duration of leaf wetness, and age of the crop, and is propelled by the carrier pest Thrips. Results obtained from four continuous agriculture seasons (monsoon & post-monsoon) of data have led to the development of cumulative and non-cumulative prediction models, which can assist the user community in taking ameliorative measures.
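
The weather-disease relations surfaced by the data mining pass (e.g. humidity versus Bud Necrosis incidence) are, at their simplest, correlations. A minimal Pearson correlation on made-up weekly readings illustrates the kind of check involved; the numbers are invented, not the experiment's data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

humidity = [60, 65, 70, 80, 85, 90]   # illustrative weekly mean humidity (%)
incidence = [2, 3, 5, 9, 11, 14]      # illustrative % plants showing symptoms
r = pearson(humidity, incidence)
```

A strongly positive r would flag humidity as a candidate driver, to be confirmed against ground-level surveillance as the abstract describes.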

Title :Mining Social Media Data for Understanding Student's Learning Experiences Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/mining-social-media-data-understanding-students-learning-experiences

Abstract : Students' informal conversations on social media (e.g. Twitter, Facebook) shed light on their educational experiences - opinions, feelings, and concerns about the learning process. Data from such uninstrumented environments can provide valuable knowledge to inform student learning. Analyzing such data, however, can be challenging. The complexity of students' experiences reflected in social media content requires human interpretation. However, the growing scale of data demands automatic data analysis techniques. In this paper, we developed a workflow to integrate both qualitative analysis and large-scale data mining techniques. We focus on engineering students' Twitter posts to understand issues and problems in their educational experiences. We first conducted a qualitative analysis on samples taken from about 25,000 tweets related to engagement and sleep deprivation. Based on these results, we implemented a multi-label classification algorithm to classify tweets reflecting students' problems. We then used the algorithm to train a detector of student problems from about 35,000 tweets streamed at the geo-location of Purdue University. This work, for the first time, presents a methodology and results that show how informal social media data can provide insights into students' experiences.
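
The multi-label aspect is the key detail: one tweet may reflect several problem categories at once, so each label gets its own detector and the results are unioned. The keyword detectors below are toy stand-ins for the paper's trained classifiers, and the label names are invented.

```python
# illustrative per-label keyword detectors (not the paper's classifiers)
LABEL_KEYWORDS = {
    "sleep_deprivation": {"sleep", "tired", "awake", "allnighter"},
    "heavy_workload": {"homework", "exams", "assignments", "deadline"},
}

def classify_tweet(text):
    """Return the set of problem labels whose detector fires on the tweet."""
    tokens = set(text.lower().split())
    return {label for label, kws in LABEL_KEYWORDS.items() if tokens & kws}

labels = classify_tweet("No sleep again, three assignments due tomorrow")
```

Here the sample tweet correctly receives two labels at once, which a single-label classifier would be forced to collapse into one.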

Title :Mining Social Media Data for Understanding Student's Learning Experiences Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/mining-social-media-data-understanding-students-learning-experiences-code

Abstract : Students' informal conversations on social media (e.g. Twitter, Facebook) shed light on their educational experiences - opinions, feelings, and concerns about the learning process. Data from such uninstrumented environments can provide valuable knowledge to inform student learning. Analyzing such data, however, can be challenging. The complexity of students' experiences reflected in social media content requires human interpretation. However, the growing scale of data demands automatic data analysis techniques. In this paper, we developed a workflow to integrate both qualitative analysis and large-scale data mining techniques. We focus on engineering students' Twitter posts to understand issues and problems in their educational experiences. We first conducted a qualitative analysis on samples taken from about 25,000 tweets related to engagement and sleep deprivation. Based on these results, we implemented a multi-label classification algorithm to classify tweets reflecting students' problems. We then used the algorithm to train a detector of student problems from about 35,000 tweets streamed at the geo-location of Purdue University. This work, for the first time, presents a methodology and results that show how informal social media data can provide insights into students' experiences.

Title :Mining Social Media Data for Understanding Student's Learning Experiences Language : C#

Project Link : http://kasanpro.com/p/c-sharp/mining-social-media-data-understanding-students-learning-experiences-implement

Abstract : Students' informal conversations on social media (e.g. Twitter, Facebook) shed light on their educational experiences - opinions, feelings, and concerns about the learning process. Data from such uninstrumented environments can provide valuable knowledge to inform student learning. Analyzing such data, however, can be challenging. The complexity of students' experiences reflected in social media content requires human interpretation. However, the growing scale of data demands automatic data analysis techniques. In this paper, we developed a workflow to integrate both qualitative analysis and large-scale data mining techniques. We focus on engineering students' Twitter posts to understand issues and problems in their educational experiences. We first conducted a qualitative analysis on samples taken from about 25,000 tweets related to engagement and sleep deprivation. Based on these results, we implemented a multi-label classification algorithm to classify tweets reflecting students' problems. We then used the algorithm to train a detector of student problems from about 35,000 tweets streamed at the geo-location of Purdue University. This work, for the first time, presents a methodology and results that show how informal social media data can provide insights into students' experiences.

Title :Cost-effective Viral Marketing for Time-critical Campaigns in Large-scale Social Networks Language : ASP.NET with VB

Project Link : http://kasanpro.com/p/asp-net-with-vb/viral-marketing-cost-effective-time-critical-campaigns-large-scale-social-networks

Abstract : Online social networks (OSNs) have become one of the most effective channels for marketing and advertising. Since users are often influenced by their friends, "word-of-mouth" exchanges, so-called viral marketing, in social networks can be used to increase product adoption or widely spread content over the network. The common perception of viral marketing as being cheap, easy, and massively effective makes it an ideal replacement for traditional advertising. However, recent studies have revealed that the propagation often fades quickly within only a few hops from the sources, counteracting the assumption of self-perpetuating influence considered in the literature. With only limited influence propagation, is massively reaching customers via viral marketing still affordable? How can more resources be spent economically to increase the spreading speed? We investigate the cost-effective massive viral marketing problem, taking into consideration the limited influence propagation. Both analytical analysis based on power-law network theory and numerical analysis demonstrate that viral marketing might involve costly seeding. To minimize the seeding cost, we provide mathematical programming to find optimal seeding for medium-size networks and propose VirAds, an efficient algorithm, to tackle the problem on large-scale networks. VirAds guarantees a relative error bound of O(1) from the optimal solutions in power-law networks and outperforms the greedy heuristics which rely on degree centrality. Moreover, we also show that, in general, approximating the optimal seeding within a ratio better than O(log n) is unlikely possible.
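
With propagation fading after a few hops, seeding reduces to a coverage problem. The sketch below is a greedy coverage baseline of the kind the paper compares against, not VirAds itself: repeatedly seed the node reaching the most still-uncovered users within one hop. The toy star network and target fraction are assumptions for illustration.

```python
def greedy_seed(adj, target_fraction=0.9):
    """adj: node -> set of neighbors. Greedily add the node covering the
    most still-uncovered users until the target fraction is reached.
    (Assumes the graph is connected enough for the loop to terminate.)"""
    n = len(adj)
    covered, seeds = set(), []
    while len(covered) < target_fraction * n:
        best = max(adj, key=lambda u: len(({u} | adj[u]) - covered))
        seeds.append(best)
        covered |= {best} | adj[best]
    return seeds

# star network: seeding the hub alone reaches everyone in one hop
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
seeds = greedy_seed(adj)
```

The number of seeds the loop needs is exactly the seeding cost the abstract argues can be substantial when influence does not self-perpetuate.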

Title :Cost-effective Viral Marketing for Time-critical Campaigns in Large-scale Social Networks Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/cost-effective-viral-marketing-time-critical-campaigns-large-scale-social-networks

Abstract : Online social networks (OSNs) have become one of the most effective channels for marketing and advertising. Since users are often influenced by their friends, "word-of-mouth" exchanges, so-called viral marketing, in social networks can be used to increase product adoption or widely spread content over the network. The common perception of viral marketing as being cheap, easy, and massively effective makes it an ideal replacement for traditional advertising. However, recent studies have revealed that the propagation often fades quickly within only a few hops from the sources, counteracting the assumption of self-perpetuating influence considered in the literature. With only limited influence propagation, is massively reaching customers via viral marketing still affordable? How can one economically spend more resources to increase the spreading speed? We investigate the cost-effective massive viral marketing problem, taking into consideration the limited influence propagation. Both analytical analysis based on power-law network theory and numerical analysis demonstrate that viral marketing might involve costly seeding. To minimize the seeding cost, we provide mathematical programming to find optimal seeding for medium-size networks and propose VirAds, an efficient algorithm, to tackle the problem on large-scale networks. VirAds guarantees a relative error bound of O(1) from the optimal solutions in power-law networks and outperforms the greedy heuristics which rely on degree centrality. Moreover, we also show that, in general, approximating the optimal seeding within a ratio better than O(log n) is unlikely to be possible.

Title :Cost-effective Viral Marketing for Time-critical Campaigns in Large-scale Social Networks
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/effective-viral-marketing-time-critical-campaigns-large-scale-social-networks

Abstract : Online social networks (OSNs) have become one of the most effective channels for marketing and advertising. Since users are often influenced by their friends, "word-of-mouth" exchanges, so-called viral marketing, in social networks can be used to increase product adoption or widely spread content over the network. The common perception of viral marketing as being cheap, easy, and massively effective makes it an ideal replacement for traditional advertising. However, recent studies have revealed that the propagation often fades quickly within only a few hops from the sources, counteracting the assumption of self-perpetuating influence considered in the literature. With only limited influence propagation, is massively reaching customers via viral marketing still affordable? How can one economically spend more resources to increase the spreading speed? We investigate the cost-effective massive viral marketing problem, taking into consideration the limited influence propagation. Both analytical analysis based on power-law network theory and numerical analysis demonstrate that viral marketing might involve costly seeding. To minimize the seeding cost, we provide mathematical programming to find optimal seeding for medium-size networks and propose VirAds, an efficient algorithm, to tackle the problem on large-scale networks. VirAds guarantees a relative error bound of O(1) from the optimal solutions in power-law networks and outperforms the greedy heuristics which rely on degree centrality. Moreover, we also show that, in general, approximating the optimal seeding within a ratio better than O(log n) is unlikely to be possible.
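The degree-centrality greedy baseline that the abstract compares VirAds against can be sketched as follows. This shows only that baseline heuristic, not VirAds itself; the graph representation and method names are illustrative assumptions, not from the paper.

```java
import java.util.*;

// Baseline seed selection by degree centrality for viral marketing:
// simply pick the k nodes with the most neighbours as seeds.
public class DegreeSeeding {
    // graph: node id -> list of neighbour ids (undirected)
    public static List<Integer> topKByDegree(Map<Integer, List<Integer>> graph, int k) {
        List<Integer> nodes = new ArrayList<>(graph.keySet());
        // Sort by descending degree; break ties by node id for determinism.
        nodes.sort((x, y) -> {
            int d = graph.get(y).size() - graph.get(x).size();
            return d != 0 ? d : x - y;
        });
        return nodes.subList(0, Math.min(k, nodes.size()));
    }
}
```

The abstract's point is precisely that such degree-based seeding is outperformed by VirAds once limited propagation depth is taken into account.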

Title :Green Mining: Investigating Power Consumption across Versions
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/green-mining-investigating-power-consumption-versions

Abstract : Power consumption is increasingly becoming a concern not only for electrical engineers but also for software engineers, due to the increasing popularity of power-limited contexts such as mobile computing, smartphones and cloud computing. Software changes can alter software power consumption behaviour and can cause power performance regressions. By tracking software power consumption we can build models to provide suggestions to avoid power regressions. There is much research on software power consumption, but little focus on the relationship between software changes and power consumption. Most work measures the power consumption of a single software task; instead we seek to extend this work across the history (revisions) of a project. We develop a set of tests for a well-established product and then run those tests across all versions of the product while recording the power usage of these tests. We provide and demonstrate a methodology that enables the analysis of power consumption performance for over 500 nightly builds of Firefox 3.6; we show that software change does induce changes in power consumption. This methodology and case study are a first step towards combining power measurement and mining software repositories research, thus enabling developers to avoid power regressions via power consumption awareness.


Title :Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/categorical-numerical-attribute-data-clustering-based

Abstract : Most of the existing clustering approaches are applicable to purely numerical or categorical data only, but not both. In general, it is a nontrivial task to perform clustering on mixed data composed of numerical and categorical attributes because there exists an awkward gap between the similarity metrics for categorical and numerical data. This paper therefore presents a general clustering framework based on the concept of object-cluster similarity and gives a unified similarity metric which can be simply applied to data with categorical, numerical, and mixed attributes. Accordingly, an iterative clustering algorithm is developed, whose outstanding performance is experimentally demonstrated on different benchmark data sets. Moreover, to circumvent the difficult selection problem of the cluster number, we further develop a penalized competitive learning algorithm within the proposed clustering framework. The embedded competition and penalization mechanisms enable this improved algorithm to determine the number of clusters automatically by gradually eliminating the redundant clusters. The experimental results show the efficacy of the proposed approach.
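As a rough illustration of bridging the gap between numerical and categorical similarity (not the paper's exact object-cluster similarity metric, which is not reproduced here), one can average a range-normalised numeric similarity with a simple-matching categorical similarity so that every attribute contributes a value in [0, 1]:

```java
// Illustrative unified similarity over mixed attributes.
// Numeric attributes are compared after range normalisation;
// categorical attributes by simple matching. All names are assumptions.
public class MixedSimilarity {
    // numA/numB: numeric attribute values; catA/catB: categorical values;
    // range[i]: observed max - min of numeric attribute i (must be > 0).
    public static double similarity(double[] numA, double[] numB,
                                    String[] catA, String[] catB,
                                    double[] range) {
        double s = 0.0;
        for (int i = 0; i < numA.length; i++) {
            s += 1.0 - Math.abs(numA[i] - numB[i]) / range[i]; // in [0, 1]
        }
        for (int i = 0; i < catA.length; i++) {
            s += catA[i].equals(catB[i]) ? 1.0 : 0.0;          // match = 1
        }
        return s / (numA.length + catA.length); // average over all attributes
    }
}
```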

Title :Categorical-and-numerical-attribute data clustering using K-Modes clustering and Fuzzy K-Modes clustering
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/categorical-numerical-attribute-data-clustering-fuzzy

Abstract : Most of the existing clustering approaches are applicable to purely numerical or categorical data only, but not both. In general, it is a nontrivial task to perform clustering on mixed data composed of numerical and categorical attributes because there exists an awkward gap between the similarity metrics for categorical and numerical data. This paper therefore presents a general clustering framework based on the concept of object-cluster similarity and gives a unified similarity metric which can be simply applied to data with categorical, numerical, and mixed attributes. This paper proposes a novel initialization method for mixed data which is implemented using the K-Modes algorithm, and further an iterative fuzzy K-Modes clustering algorithm.
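The core of K-Modes (in Huang's standard formulation) is the simple-matching dissimilarity between an object and a cluster mode, which can be sketched as follows; the helper names are illustrative:

```java
// Core of the K-Modes algorithm: simple-matching dissimilarity,
// i.e. the number of categorical attributes on which an object
// and a cluster mode disagree, plus nearest-mode assignment.
public class KModes {
    public static int dissimilarity(String[] obj, String[] mode) {
        int d = 0;
        for (int i = 0; i < obj.length; i++) {
            if (!obj[i].equals(mode[i])) d++; // mismatch costs 1
        }
        return d;
    }

    // Assign an object to the index of its nearest mode (cluster centre).
    public static int nearestMode(String[] obj, String[][] modes) {
        int best = 0;
        for (int k = 1; k < modes.length; k++) {
            if (dissimilarity(obj, modes[k]) < dissimilarity(obj, modes[best])) best = k;
        }
        return best;
    }
}
```

The fuzzy variant replaces the hard assignment with membership degrees, but the same dissimilarity underlies both.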

Title :Multiple Cost-sensitive decision tree approach for fraud detection
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/multiple-cost-sensitive-decision-tree-fraud-detection

Abstract : Fraud can be defined as wrongful or criminal deception aimed at financial or personal gain. The two main mechanisms to avoid frauds and losses due to fraudulent activities are fraud prevention and fraud detection systems. Fraud prevention is the proactive mechanism with the goal of disabling the occurrence of fraud. Fraud detection systems come into play when fraudsters surpass the fraud prevention systems and start a fraudulent transaction. Previous research developed a cost-sensitive tree approach with a single cost, which minimizes the sum of misclassification costs while selecting the splitting attribute at each non-terminal node. This may not be feasible for real cost-sensitive decisions, which involve multiple costs. We propose to modify the existing cost-sensitive decision tree model by extending it for multiple-cost decisions. The resulting model is called CCTree. In this multiple-cost extension based CCTree model, all costs are normalized to lie in the same interval (i.e. between 0 and 1). The performance of this approach is compared with well-known traditional classification models on a real-world credit card data set.

Title :Secure Mining of Association Rules in Horizontally Distributed Databases
Language : Java

Project Link : http://kasanpro.com/p/java/secure-mining-association-rules-horizontally-distributed-databases

Abstract : We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al., which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms -- one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol of Kantarcioglu and Clifton. In addition, it is simpler and significantly more efficient in terms of communication rounds, communication cost and computational cost.
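FDM, and hence this protocol, builds on Apriori's notion of itemset support. A minimal, non-secure sketch of support counting (the quantity whose distributed computation the secure multi-party algorithms protect) might look like this; the class and method names are illustrative:

```java
import java.util.*;

// Textbook Apriori support counting: how many transactions
// contain every item of a candidate itemset.
public class Apriori {
    public static int support(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) count++; // transaction covers the itemset
        }
        return count;
    }
}
```

In the horizontally distributed setting each party computes local supports like this, and the protocol's job is to combine them into global frequent itemsets without revealing the private local counts.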

Title :An Efficient Certificateless Encryption for Secure Data Sharing in Public Clouds
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/efficient-certificateless-encryption-secure-data-sharing-public-clouds

Abstract : We propose a mediated certificateless encryption scheme without pairing operations for securely sharing sensitive information in public clouds. Mediated certificateless public key encryption (mCL-PKE) solves the key escrow problem in identity-based encryption and the certificate revocation problem in public key cryptography. However, existing mCL-PKE schemes are either inefficient because of the use of expensive pairing operations or vulnerable against partial decryption attacks. In order to address the performance and security issues, in this paper, we first propose an mCL-PKE scheme without using pairing operations. We apply our mCL-PKE scheme to construct a practical solution to the problem of sharing sensitive information in public clouds. The cloud is employed as a secure storage as well as a key generation center. In our system, the data owner encrypts the sensitive data using the cloud-generated users' public keys based on its access control policies and uploads the encrypted data to the cloud. Upon successful authorization, the cloud partially decrypts the encrypted data for the users. The users subsequently fully decrypt the partially decrypted data using their private keys. The confidentiality of the content and the keys is preserved with respect to the cloud, because the cloud cannot fully decrypt the information. We also propose an extension to the above approach to improve the efficiency of encryption at the data owner. We implement our mCL-PKE scheme and the overall cloud-based system, and evaluate its security and performance. Our results show that our schemes are efficient and practical.


Title :An Empirical Performance Evaluation of Relational Keyword Search Systems
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/empirical-performance-evaluation-relational-keyword-search-systems

Abstract : In the past decade, extending the keyword search paradigm to relational data has been an active area of research within the database and information retrieval (IR) community. A large number of approaches have been proposed and implemented, but despite numerous publications, there remains a severe lack of standardization for system evaluations. This lack of standardization has resulted in contradictory results from different evaluations, and the numerous discrepancies muddle what advantages are proffered by different approaches. In this paper, we present a thorough empirical performance evaluation of relational keyword search systems. Our results indicate that many existing search techniques do not provide acceptable performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques from scaling beyond small datasets with tens of thousands of vertices. We also explore the relationship between execution time and factors varied in previous evaluations; our analysis indicates that these factors have relatively little impact on performance. In summary, our work confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization -- as exemplified by the IR community -- when evaluating these retrieval systems.

Title :Web Image Re-Ranking Using Query-Specific Semantic Signatures
Language : ASP.NET with C#

Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/web-image-re-ranking-usingquery-specific-semantic-signatures

Abstract : Image re-ranking, as an effective way to improve the results of web-based image search, has been adopted by current commercial search engines. Given a query keyword, a pool of images is first retrieved by the search engine based on textual information. By asking the user to select a query image from the pool, the remaining images are re-ranked based on their visual similarities with the query image. A major challenge is that the similarities of visual features do not correlate well with images' semantic meanings, which interpret users' search intention. On the other hand, learning a universal visual semantic space to characterize highly diverse images from the web is difficult and inefficient. In this paper, we propose a novel image re-ranking framework, which automatically learns offline different visual semantic spaces for different query keywords through keyword expansions. The visual features of images are projected into their related visual semantic spaces to get semantic signatures. At the online stage, images are re-ranked by comparing their semantic signatures obtained from the visual semantic space specified by the query keyword. The new approach significantly improves both the accuracy and efficiency of image re-ranking. The original visual features of thousands of dimensions can be projected to semantic signatures as short as 25 dimensions. Experimental results show that a 20% to 35% relative improvement has been achieved on re-ranking precision compared with the state-of-the-art methods.

Title :Demand Bidding Program and Its Application in Hotel Energy Management
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/demand-bidding-program-its-application-hotel-energy-management

Abstract : The demand bidding program (DBP) has recently been adopted in practice by some energy operators. DBP is a risk-free demand response program targeting large energy consumers. In this paper, we consider DBP with an application in hotel energy management. For DBP, an optimization problem is formulated with the objective of maximizing the expected reward, which is received when the amount of energy saving satisfies the contract. For a general distribution of energy consumption, we give a general condition for the optimal bid and outline an algorithm to find the solution without numerical integration. Furthermore, for the Gaussian distribution, we derive closed-form expressions for the optimal bid and the corresponding expected reward. Regarding hotel energy, we characterize loads in the hotel and introduce several energy consumption models that capture major energy use. With the proposed models and DBP, simulation results show that DBP provides economic benefits to the hotel and encourages load scheduling. Furthermore, when only the mean and variance of energy consumption are known, the validity of the Gaussian approximation for computing the optimal load and expected reward is also discussed.

Title :Incremental Affinity Propagation Clustering Based on Message Passing
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/incremental-affinity-propagation-clustering-based-message-passing

Abstract : Affinity Propagation (AP) clustering has been successfully used in a lot of clustering problems. However, most of the applications deal with static data. This paper considers how to apply AP in incremental clustering problems. Firstly, we point out the difficulties in Incremental Affinity Propagation (IAP) clustering, and then propose two strategies to solve them. Correspondingly, two IAP clustering algorithms are proposed. They are IAP clustering based on K-Medoids (IAPKM) and IAP clustering based on Nearest Neighbor Assignment (IAPNA). Five popular labeled data sets, real-world time series and a video are used to test the performance of IAPKM and IAPNA. Traditional AP clustering is also implemented to provide benchmark performance. Experimental results show that IAPKM and IAPNA can achieve comparable clustering performance with traditional AP clustering on all the data sets. Meanwhile, the time cost is dramatically reduced in IAPKM and IAPNA. Both the effectiveness and the efficiency make IAPKM and IAPNA well suited to incremental clustering tasks.
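Standard AP exchanges "responsibility" and "availability" messages between data points. One responsibility sweep, following the usual formulation r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')], can be sketched as below; this is the classic batch update, not the incremental IAPKM/IAPNA variants, and the names are illustrative:

```java
// One sweep of the responsibility update from standard affinity propagation.
// s: similarity matrix, a: current availability matrix; returns new responsibilities.
public class AffinityPropagation {
    public static double[][] updateResponsibilities(double[][] s, double[][] a) {
        int n = s.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                // Best competing candidate exemplar for point i, excluding k.
                double best = Double.NEGATIVE_INFINITY;
                for (int kp = 0; kp < n; kp++) {
                    if (kp != k) best = Math.max(best, a[i][kp] + s[i][kp]);
                }
                r[i][k] = s[i][k] - best;
            }
        }
        return r;
    }
}
```

A full AP run alternates this with the availability update (and damping) until the exemplar choices stabilise.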


Title :Incremental Detection of Inconsistencies in Distributed Data
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/incremental-detection-inconsistencies-distributed-data

Abstract : This paper investigates incremental detection of errors in distributed data. Given a distributed database D, a set of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, it is to find, with minimum data shipment, changes ΔV to V in response to ΔD. The need for the study is evident since real-life data is often dirty, distributed and frequently updated. It is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for a database D that is partitioned either vertically or horizontally, even when ΔD and D are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.


Title :Keyword Query Routing
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/keyword-query-routing

Abstract : Keyword search is an intuitive paradigm for searching linked data sources on the web. We propose to route keywords only to relevant sources to reduce the high cost of processing keyword search queries over all sources. We propose a novel method for computing top-k routing plans based on their potential to contain results for a given keyword query. We employ a keyword-element relationship summary that compactly represents relationships between keywords and the data elements mentioning them. A multilevel scoring mechanism is proposed for computing the relevance of routing plans based on scores at the level of keywords, data elements, element sets, and subgraphs that connect these elements. Experiments carried out using 150 publicly available sources on the web showed that valid plans (precision@1 of 0.92) that are highly relevant (mean reciprocal rank of 0.89) can be computed in 1 second on average on a single PC. Further, we show routing greatly helps to improve the performance of keyword search, without compromising its result quality.

Title :Personalized Recommendation Combining User Interest and Social Circle
Language : C#

Project Link : http://kasanpro.com/p/c-sharp/personalized-recommendation-combining-user-interest-social-circle

Abstract : With the advent and popularity of social networks, more and more users like to share their experiences, such as ratings, reviews, and blogs. The new factors of social networks, like interpersonal influence and interest based on circles of friends, bring opportunities and challenges for recommender systems (RS) to solve the cold start and sparsity problems of datasets. Some of the social factors have been used in RS, but have not been fully considered. In this paper, three social factors -- personal interest, interpersonal interest similarity, and interpersonal influence -- are fused into a unified personalized recommendation model based on probabilistic matrix factorization. The factor of personal interest can make the RS recommend items that meet users' individualities, especially for experienced users. Moreover, for cold start users, the interpersonal interest similarity and interpersonal influence can enhance the intrinsic link among features in the latent space. We conduct a series of experiments on three rating datasets: Yelp, MovieLens, and Douban Movie. Experimental results show the proposed approach outperforms the existing RS approaches.
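The probabilistic-matrix-factorization core underlying the model can be sketched as a single stochastic gradient step on one observed rating (without the paper's social-factor terms, which are not reproduced here); all names are illustrative:

```java
// One SGD update for basic probabilistic matrix factorisation:
// minimise (r - u.v)^2 + lambda * (|u|^2 + |v|^2) for one observed rating,
// where u is a user latent vector and v an item latent vector.
public class PmfSgd {
    public static void step(double[] u, double[] v, double rating,
                            double lr, double lambda) {
        double pred = 0.0;
        for (int f = 0; f < u.length; f++) pred += u[f] * v[f]; // predicted rating
        double err = rating - pred;
        for (int f = 0; f < u.length; f++) {
            double uf = u[f]; // save before updating, v's gradient needs the old value
            u[f] += lr * (err * v[f] - lambda * uf);
            v[f] += lr * (err * uf - lambda * v[f]);
        }
    }
}
```

The paper's model would add regularisation terms pulling each user vector toward those of interest-similar and influential friends; the gradient step has the same shape.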

Title :A Random Decision Tree Framework for Privacy-preserving Data Mining
Language : Java

Project Link : http://kasanpro.com/p/java/random-decision-tree-privacy-preserving-data-mining

Abstract : Distributed data is ubiquitous in modern information-driven applications. With multiple sources of data, the natural challenge is to determine how to collaborate effectively across proprietary organizational boundaries while maximizing the utility of collected information. Since using only local data gives suboptimal utility, techniques for privacy-preserving collaborative knowledge discovery must be developed. Existing cryptography-based work for privacy-preserving data mining is still too slow to be effective for large-scale datasets in the face of today's big data challenge. Previous work on Random Decision Trees (RDT) shows that it is possible to generate equivalent and accurate models at much smaller cost. We exploit the fact that RDTs can naturally fit into a parallel and fully distributed architecture, and develop protocols to implement privacy-preserving RDTs that enable general and efficient distributed privacy-preserving knowledge discovery.
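The defining property of an RDT, which the abstract exploits for distributed privacy preservation, is that splitting attributes are chosen at random, without ever inspecting the data. A minimal structural sketch, with illustrative names:

```java
import java.util.*;

// A random decision tree picks each splitting attribute at random from the
// attributes not yet used on the path, independently of any training data --
// only the leaf statistics are later filled in from data.
public class RandomTree {
    static class Node {
        int attribute = -1;   // -1 marks a leaf
        Node left, right;     // null at leaves
    }

    public static Node build(List<Integer> attrs, int depth, Random rnd) {
        Node node = new Node();
        if (depth == 0 || attrs.isEmpty()) return node; // leaf
        int pick = rnd.nextInt(attrs.size());
        node.attribute = attrs.get(pick);
        List<Integer> rest = new ArrayList<>(attrs);    // don't mutate caller's list
        rest.remove(pick);                              // each attribute used once per path
        node.left = build(rest, depth - 1, rnd);
        node.right = build(rest, depth - 1, rnd);
        return node;
    }
}
```

Because the structure needs no data, each party can grow identical trees locally and only the (privacy-sensitive) leaf counts need to be combined, which is what makes the distributed protocols cheap.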


Title :Ontology-based annotation and retrieval of services in the cloud
Language : Java

Project Link : http://kasanpro.com/p/java/ontology-based-annotation-retrieval-services-cloud

Abstract : Cloud computing is a technological paradigm that permits computing services to be offered over the Internet. This new service model is closely related to previous well-known distributed computing initiatives such as Web services and grid computing. In the current socio-economic climate, the affordability of cloud computing has made it one of the most popular recent innovations. This has led to the availability of more and more cloud services, as a consequence of which it is becoming increasingly difficult for service consumers to find and access those cloud services that fulfil their requirements. In this paper, we present a semantically-enhanced platform that will assist in the process of discovering the cloud services that best match user needs. This fully-fledged system encompasses two basic functions: the creation of a repository with the semantic description of cloud services and the search for services that accomplish the required expectations. The cloud service semantic repository is generated by means of an automatic tool that first annotates the cloud service descriptions with semantic content and then creates a semantic vector for each service. The comprehensive evaluation of the tool in the ICT domain has led to very promising results that outperform state-of-the-art solutions in similarly broad domains.