Enhancing Clustering Mechanism by Implementation of EM Algorithm for Gaussian Mixture Model

Jyoti 1 and Dr. Rajendar Singh Chhillar 2
1,2 Department of Computer Science & Applications, Maharshi Dayanand University, Rohtak, Haryana, India

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected]

Volume 6, Issue 2, March - April 2017 ISSN 2278-6856


Abstract: Big data [1] is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The term often refers simply to the use of predictive analytics or other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk. Data mining [7] is the central step in a process called knowledge discovery in databases, namely the step in which modeling techniques are applied. Research areas such as artificial intelligence, machine learning, and soft computing have contributed to its arsenal of methods. In our opinion, fuzzy approaches can play an important role in data mining because they give comprehensible results, although this goal is sometimes hard to achieve with other methods.

Keywords—Data mining, web mining, web intelligence, knowledge discovery, fuzzy logic, K-means

1. INTRODUCTION
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, permitting businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, statistics, and database systems. The overall goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.

Fig 1 Data Mining

2. MOTIVATION & PROBLEM STATEMENT
Web Intelligence [2] based Google Analytics is a service offered by Google that generates detailed statistics about a website's traffic and traffic sources and measures conversions and sales. It is the most widely used website statistics service. Basic services are free of charge, and a premium version is available for a fee. Google Analytics can track visitors from all referrers, including search engines, direct visits, and referring sites. It also tracks email marketing and digital collateral such as links within PDF documents. A regular feature of Google Analytics is its integration with AdWords: users can review online campaigns by landing page quality and conversions (goals). Goals can include sales, viewing a specific page, or downloading data. The Google Analytics approach is to show high-level, dashboard-type information for the casual user, and more in-depth information further into the report set. Google Analytics analysis can identify poorly performing pages with techniques such as funnel visualization, and can show where visitors came from, how long they stayed, and their geographical location. It also provides more advanced features, including custom visitor segmentation.

3. TOOLS & TECHNOLOGY USED

MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. Developed by MathWorks,


MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, Java, Fortran, and Python.

3.1 Syntax
The MATLAB application is built around the MATLAB language, and most use of MATLAB involves typing MATLAB code into the Command Window (as an interactive mathematical shell) or executing text files containing MATLAB code, including scripts and functions.

1. Variables
Variables are defined using the assignment operator, =. MATLAB is a weakly typed programming language because types are implicitly converted. It is an inferred-type language: variables can be assigned without declaring their type, except when they are to be treated as symbolic objects, and their type can change. Values can come from constants, from computation involving the values of other variables, or from the output of a function. For example:

>> x = 17
x = 17
>> x = 'hat'
x = hat
>> y = x + 0
y = 104 97 116
>> x = [3*4, pi/2]
x = 12.0000 1.5708
>> y = 3*sin(x)
y = -1.6097 3.0000

4. PROPOSED WORK
The traditional EM algorithm is used to find (locally) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. That is, either there are missing values among the data, or the model can be formulated more simply by assuming the existence of additional unobserved data points.
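In symbols, the quantity being maximized is the marginal (incomplete-data) likelihood, obtained by integrating or summing the latent variables out; this standard formulation is added here for clarity and is not taken verbatim from the paper:

\[
L(\theta; X) = p(X \mid \theta) = \int p(X, Z \mid \theta)\, dZ
\]

Maximizing this quantity directly is usually intractable, because the integral (or sum) over the latent variables Z couples all the parameters together.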

The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and keep alternating between the two until the resulting values both converge to fixed points. It is not obvious that this would work at all, but it can be proven that in this particular context it does, and that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a maximum or a saddle point. In general there may be multiple maxima, and there is no guarantee that the global maximum will be found. Some likelihood functions also have singularities, i.e., nonsensical maxima. For example, one of the solutions that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter of the same component equal to one of the data points.
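The alternation just described is the standard E-step/M-step pair; the following textbook formulation is included for reference (it is not from the original paper):

\begin{align*}
\text{E-step:}\quad & Q\!\left(\theta \mid \theta^{(t)}\right) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\!\left[\log L(\theta; X, Z)\right] \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta}\; Q\!\left(\theta \mid \theta^{(t)}\right)
\end{align*}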

5. RESULT & DISCUSSION
Implementation of the EM algorithm for a Gaussian mixture model. A mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which it belongs.

EM algorithm for Gaussian mixture model

function [label, model, llh] = emgm(X, init)
% EM algorithm for Gaussian mixture model
fprintf('EM for Gaussian mixture: running ... \n');
R = initialization(X, init);              % initial responsibilities
tol = 1e-6;
maxiter = 500;
llh = -inf(1, maxiter);                   % log-likelihood trace
converged = false;
t = 1;
while ~converged && t < maxiter
    t = t + 1;
    model = maximization(X, R);           % M-step
    [R, llh(t)] = expectation(X, model);  % E-step
    converged = llh(t) - llh(t-1) < tol*abs(llh(t));
end
[~, label(1,:)] = max(R, [], 2);          % hard assignment from responsibilities
llh = llh(2:t);
if converged
    fprintf('converged in %d steps.\n', t);
else
    fprintf('not converged in %d steps.\n', maxiter);
end
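A minimal usage sketch, assuming emgm and the helper functions listed below are on the MATLAB path (the synthetic data here are hypothetical, chosen only to exercise the code; note that X is laid out d-by-n, with observations as columns, and mvnrnd requires the Statistics Toolbox):

% Draw 2-D points from three Gaussians and cluster them with emgm.
rng(0);                                   % for reproducibility
X = [mvnrnd([0 0],  eye(2), 200); ...
     mvnrnd([5 5],  eye(2), 200); ...
     mvnrnd([-5 5], eye(2), 200)]';       % transpose to d-by-n layout
[label, model, llh] = emgm(X, 3);         % random initialization with k = 3
figure; scatter(X(1,:), X(2,:), 15, label, 'filled');  % points colored by cluster
figure; plot(llh);                        % log-likelihood should be non-decreasing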


Initialization function

function R = initialization(X, init)
[d,n] = size(X);
if isstruct(init)                         % initialize with a model
    R = expectation(X, init);
elseif length(init) == 1                  % random initialization with k
    k = init;
    idx = randsample(n, k);
    m = X(:, idx);                        % pick k random points as centers
    [~, label] = max(bsxfun(@minus, m'*X, sum(m.^2,1)'/2));
    while k ~= length(unique(label))      % redraw until all k clusters occur
        idx = randsample(n, k);
        m = X(:, idx);
        [~, label] = max(bsxfun(@minus, m'*X, sum(m.^2,1)'/2));
    end
    R = full(sparse(1:n, label, 1, n, k, n));
elseif size(init,1) == 1 && size(init,2) == n   % initialize with labels
    label = init;
    k = max(label);
    R = full(sparse(1:n, label, 1, n, k, n));
elseif size(init,1) == d && size(init,2) > 1    % initialize with centers only
    k = size(init,2);
    m = init;
    [~, label] = max(bsxfun(@minus, m'*X, sum(m.^2,1)'/2));
    R = full(sparse(1:n, label, 1, n, k, n));
else
    error('ERROR: init is not valid.');
end

Expectation function

function [R, llh] = expectation(X, model)
mu = model.mu;
Sigma = model.Sigma;
w = model.weight;
n = size(X,2);
k = size(mu,2);
R = zeros(n,k);
for i = 1:k
    R(:,i) = loggausspdf(X, mu(:,i), Sigma(:,:,i));  % log N(x | mu_i, Sigma_i)
end
R = bsxfun(@plus, R, log(w));             % add log mixing weights
T = logsumexp(R, 2);                      % log of the mixture density
llh = sum(T)/n;                           % average log-likelihood
R = bsxfun(@minus, R, T);
R = exp(R);                               % normalized responsibilities

Finding a maximum likelihood solution typically requires taking derivatives of the likelihood function with respect to all the unknown values (the parameters and the latent variables) and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually not possible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.
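The listings above call two helpers, loggausspdf and logsumexp, that the paper does not show. The sketches below are one common way to implement them (assumed implementations, not taken from the paper): loggausspdf uses a Cholesky factorization for numerical stability, and logsumexp uses the usual max-shift trick.

function y = loggausspdf(X, mu, Sigma)
% Log density of a Gaussian N(mu, Sigma) evaluated at the columns of X.
d = size(X, 1);
X = bsxfun(@minus, X, mu);
[U, p] = chol(Sigma);                 % Sigma = U'*U
if p ~= 0
    error('Sigma is not positive definite.');
end
Q = U' \ X;
q = dot(Q, Q, 1);                     % squared Mahalanobis distances
c = d*log(2*pi) + 2*sum(log(diag(U)));  % normalization constant
y = -(c + q)/2;

function s = logsumexp(x, dim)
% Compute log(sum(exp(x), dim)) while avoiding numerical overflow.
y = max(x, [], dim);                  % shift by the maximum
s = y + log(sum(exp(bsxfun(@minus, x, y)), dim));
i = ~isfinite(y);
s(i) = y(i);                          % handle all -inf slices cleanly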

Maximization function

function model = maximization(X, R)
[d,n] = size(X);
k = size(R,2);
sigma0 = eye(d)*(1e-6); % regularization factor for covariance
s = sum(R,1);
w = s/n;
mu = bsxfun(@rdivide, X*R, s);
Sigma = zeros(d,d,k);
for i = 1:k
    Xo = bsxfun(@minus, X, mu(:,i));
    Xo = bsxfun(@times, Xo, sqrt(R(:,i)'));
    Sigma(:,:,i) = (Xo*Xo' + sigma0)/s(i);
end
model.mu = mu;
model.Sigma = Sigma;
model.weight = w;
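In equations, the M-step above implements the standard Gaussian mixture updates, written out here for reference (the notation is mine: r_{ji} is the responsibility of component i for point x_j, and s_i is the effective sample count):

\[
s_i = \sum_{j=1}^{n} r_{ji}, \qquad
\hat{w}_i = \frac{s_i}{n}, \qquad
\hat{\mu}_i = \frac{1}{s_i} \sum_{j=1}^{n} r_{ji}\, x_j, \qquad
\hat{\Sigma}_i = \frac{1}{s_i} \sum_{j=1}^{n} r_{ji}\, (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^{\top}
\]

The small sigma0 term added to each covariance keeps the estimate positive definite, which guards against the zero-variance singularities mentioned in Section 4.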

Fig 2 Sigma is Diagonal, Shared covariance=true

Fig 3 Sigma is Diagonal, Shared covariance=false

Fig 4 Sigma is full, Shared covariance=true


Fig 5 Sigma is full, Shared covariance=false
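Figures 2 through 5 contrast the four combinations of covariance structure. One way such panels can be reproduced is with MATLAB's Statistics and Machine Learning Toolbox; the sketch below uses the built-in fitgmdist rather than the emgm code above, and the synthetic data are hypothetical:

% Fit a 2-component GMM under each covariance setting and plot contours.
rng(1);
X = [mvnrnd([1 2],   [2 0; 0 0.5], 300); ...
     mvnrnd([-3 -5], [1 0; 0 1],   300)];
covTypes = {'diagonal', 'diagonal', 'full', 'full'};
shared   = [true, false, true, false];
for i = 1:4
    gm = fitgmdist(X, 2, 'CovarianceType', covTypes{i}, ...
                   'SharedCovariance', shared(i));
    subplot(2, 2, i);
    scatter(X(:,1), X(:,2), 10, '.'); hold on;
    ezcontour(@(x, y) pdf(gm, [x y]), [-8 6], [-10 6]);  % density contours
    title(sprintf('Sigma %s, shared = %d', covTypes{i}, shared(i)));
end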

6. CONCLUSION
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure. In order to make wise decisions, both for people and for things in the IoT, data mining technologies are made available within IoT technologies for decision-making support and system optimization. Data mining involves discovering novel, interesting, and potentially useful patterns from data and applying algorithms to the extraction of hidden information. Due to the increasing amount of data available online, the World Wide Web has become one of the most valuable resources for information retrieval and knowledge discovery. Web mining technologies are the right solutions for knowledge discovery on the Web. Knowledge extracted from the Web can be used to improve the performance of Web information retrieval, question answering, and Web-based data warehousing.

REFERENCES
[1] Hellerstein, Joe (9 November 2008). "Parallel Programming in the Age of Big Data". Gigaom Blog.
[2] J. Liu, N. Zhong, Y. Y. Yao, Z. W. Ras, "The wisdom web: new challenges for web intelligence (WI)", J. Intell. Inform. Sys., 20(1): 5-9, 2003.

[3] A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio, "Designing Grid services for distributed knowledge discovery", Web Intel. Agent Sys., 1(2): 91-104, 2003.

[4] J. A. Hendler and E. A. Feigenbaum, "Knowledge is power: the semantic web vision", in N. Zhong, et al. (eds.), Web Intelligence: Research & Development, LNAI 2198, Springer, 2001, 18-29.

[5] N. Zhong & J. Liu (eds.), Intelligent Technologies for Information Analysis, New York: Springer, 2004.

[6] Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA - Experiences with a Java open-source project".

[7] Journal of Machine Learning Research, 11: 2533-2541. The original title, "Practical machine learning", was changed ... the term "data mining" was [added] primarily for marketing reasons.

[8] Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, & Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4.

[9] Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, & Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.

[10] Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, & Algorithms. John Wiley & Sons. ISBN 0-471-22852-4. OCLC 50055336.

[11] "Microsoft Academic Search: Top conferences in data mining". Microsoft Academic Search.

[12] "Google Scholar: Top publications - Data Mining & Analysis". Google Scholar.

[13] Proceedings, International Conferences on Knowledge Discovery & Data Mining, ACM, New York.

[14] SIGKDD Explorations, ACM, New York.

AUTHOR

Jyoti received the B.Tech degree in Computer Science and Engineering from KIIT College of Engineering, Gurgaon, in 2013. She then joined the M.Tech program (2014-16) in Computer Science and Engineering at Maharshi Dayanand University, Rohtak.