A Data Mining Algorithm In Distance Learning

Dai Shangping1, Zhang Ping2

1 Department of Computer Science, Hua Zhong Normal University, Wu Han, [email protected]

2 Foreign Languages College, Zhongnan University of Economics and Law, Wuhan, China, [email protected]

Abstract

Interest in data mining and educational systems is growing, making educational data mining a new and expanding research community. One of the challenges in developing data mining systems is to integrate and coordinate existing data mining applications seamlessly, so that cost-effective systems can be developed without costly proprietary products. The popularity of distance education in higher education has grown rapidly over the last decade, yet many fundamental teaching and learning issues are still debated. This paper presents an approach for classifying students in order to predict their final grade, based on features extracted from logged data in a web-based education system. We take advantage of a genetic algorithm (GA) designed specifically for discovering association rules and propose a novel spatial mining algorithm called ARMNGA (Association Rules Mining in Novel Genetic Algorithm). Compared to the algorithm in Reference [2], ARMNGA avoids generating impossible candidates and is therefore more efficient in terms of execution time.

Keywords: data mining; genetic algorithm; association rules; distance learning

1. Introduction

Commonly used data mining algorithms fall under the following categories: decision trees and rules, nonlinear regression and classification methods, example-based methods, probabilistic graphical dependency models, and relational learning models. Over the years, genetic algorithms have been successfully applied to learning tasks in different domains, such as chemical process control, financial classification, and manufacturing scheduling. Many leading educational institutions are working to establish an online teaching and learning presence. Several systems with different capabilities and approaches have been developed to deliver online education in an academic setting [1].

Because of advances in information technology, a vast number of educational resources have accumulated on the Internet. How to mine interesting resources from databases has therefore attracted more and more attention in recent years [2]. Many data mining methods have been proposed, such as association rule mining, sequential pattern mining, calling path pattern mining, text mining, temporal data mining, and spatial data mining.

Association rule mining, as originally proposed together with its Apriori algorithm (ARMA), has developed into an active research area. Many additional algorithms have been proposed for association rule mining. The concept of an association rule has also been extended in many ways, such as generalized association rules, association rules with item constraints, and sequence rules. Beyond the earlier analysis of market basket data, these algorithms have been widely used in many other practical applications, such as customer profiling and product analysis [3].

A genetic algorithm (GA) is a self-adaptive optimization search algorithm. A GA obtains the best, or the most satisfactory, solution by evolving generations of chromosomes through operations such as reproduction, crossover, and mutation until a termination condition is met [4].

The association rules mining algorithm based on a novel genetic algorithm (ARMNGA) is an optimized algorithm combining GA with ARMA.

The contributions of this paper are as follows. We take advantage of a genetic algorithm (GA) designed specifically for discovering association rules. We propose a novel spatial mining algorithm called ARMNGA. Compared to the algorithm in Ref [2], ARMNGA avoids generating impossible candidates and is therefore more efficient in terms of execution time.

2. Association Rules

Definition 1 (Confidence). Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of items, where $i_j$ ($1 \le j \le m$) is the $j$-th item. Let $D = \{T_1, \ldots, T_N\}$ be the set of transactions, where each transaction $T_i \subseteq I$ ($1 \le i \le N$). The confidence of the rule $r \Rightarrow q$ is the probability that a transaction which includes $r$ also includes $q$.

978-1-4244-1651-6/08/$25.00 © 2008 IEEE

The association rule here is an implication of the form $r \Rightarrow q$, where $r$ is the conjunction of conditions and $q$ is the type of classification. The rule $r \Rightarrow q$ has to satisfy the specified minimum support and minimum confidence measures [5].

The support of the rule $r \Rightarrow q$ measures how frequently both $r$ and $q$ occur in $D$. For an item set $r$, the support is

$$S(r) = \frac{|r|}{|D|} \tag{1}$$

where $|r|$ is the number of transactions in $D$ that contain $r$ and $|D|$ is the total number of transactions.

The confidence of the rule $r \Rightarrow q$ is the proportion of the transactions containing $r$ that also contain $q$:

$$C(r \Rightarrow q) = \frac{S(r \cup q)}{S(r)} \tag{2}$$

Definition 2 (Weighted support). Given the item set $I = \{i_1, i_2, \ldots, i_m\}$, each item $i_j$ is assigned a weight $w_j$ ($0 \le w_j \le 1$, $1 \le j \le m$). For the rule $r \Rightarrow q$, the weighted support is

$$S_w(r) = \frac{1}{k} \left( \sum_{i_j \in r} w_j \right) S(r) \tag{3}$$

where $k$ is the size of the item set. When all items $i_j$ carry the same weight $w_j$, rules with the same support also have the same weighted support.
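The definitions of support, confidence, and weighted support can be illustrated with a minimal Python sketch. The toy transactions, the item names, the weights, and the function names are all hypothetical and chosen only for illustration. The sketch also assumes that $k$ is the number of items in $r$, which is one possible reading of Definition 2.

```python
from typing import FrozenSet, Iterable, List, Mapping

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """S(r) = |r| / |D|: fraction of transactions containing every item of r."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(r: Iterable[str], q: Iterable[str],
               transactions: List[Transaction]) -> float:
    """C(r => q) = S(r u q) / S(r); returns 0.0 when r never occurs."""
    s_r = support(r, transactions)
    if s_r == 0.0:
        return 0.0
    return support(frozenset(r) | frozenset(q), transactions) / s_r


def weighted_support(r: Iterable[str], weights: Mapping[str, float],
                     transactions: List[Transaction]) -> float:
    """S_w(r) = (1/k) * (sum of w_j for i_j in r) * S(r), assuming k = |r|."""
    items = frozenset(r)
    k = len(items)
    if k == 0:
        return 0.0
    mean_weight = sum(weights[i] for i in items) / k
    return mean_weight * support(items, transactions)


# Hypothetical logged data: each transaction is the set of features of one student.
D = [frozenset(t) for t in (
    {"forum", "quiz", "pass"},
    {"forum", "pass"},
    {"quiz"},
    {"forum", "quiz", "pass"},
)]
W = {"forum": 0.5, "quiz": 1.0, "pass": 1.0}  # hypothetical item weights

print(support({"forum"}, D))                    # 3 of 4 transactions
print(confidence({"forum"}, {"pass"}, D))       # every "forum" transaction has "pass"
print(weighted_support({"forum", "quiz"}, W, D))
```

With equal weights of 1 the weighted support reduces to the ordinary support, which matches the remark at the end of Definition 2.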

3. Genetic Algorithm (GA)

A genetic algorithm (GA) is a self-adaptive optimization search algorithm. A GA obtains the best, or the most satisfactory, solution by evolving generations of chromosomes through reproduction, crossover, mutation, and similar operations. The problem is described in general form as:

$$F(r) = a\,S(r) + b\,C(r) + c\,A(r) \tag{3}$$

where $a$, $b$, $c$ are constants with $a \ge 0$, $b \ge 0$, $c \ge 0$; $S(r)$ is the support, $C(r)$ is the confidence, and $A(r)$ is the mantle (coverage) of the rule.

4. A Data Mining Algorithm

4.1 Encoding

This paper employs natural numbers to encode the variable $A_{ij}$. That is, for every column of the matrix $A$, the number of the row in which the element 1 occurs is regarded as a gene. The genes are independent of each other. They are denoted by $A_1, A_2, \ldots, A_j, \ldots, A_n$, where $A_j \in [1, m]$ and $j \in [1, n]$, and different genes may take the same natural number.

When the initial population is produced by random assignment, the population must be of a sufficient scale for the search to reach the global optimum. The best way is to generate $M$ individuals of length $n$ at random; the resulting set of chromosomes encoded by natural numbers is taken as the initial population.
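The encoding and the random initial population can be sketched as follows. This is only an illustration under the assumption that every column of $A$ contains exactly one element 1; the function names `random_population` and `decode` and the `seed` parameter are hypothetical and do not come from the paper.

```python
import random
from typing import List, Optional

Chromosome = List[int]


def random_population(pop_size: int, n_genes: int, m: int,
                      seed: Optional[int] = None) -> List[Chromosome]:
    """Generate M = pop_size chromosomes of length n with genes drawn from [1, m]."""
    rng = random.Random(seed)
    return [[rng.randint(1, m) for _ in range(n_genes)]
            for _ in range(pop_size)]


def decode(chromosome: Chromosome, m: int) -> List[List[int]]:
    """Rebuild the 0/1 matrix A (m rows, n columns) from a chromosome.

    Gene A_j is the 1-based row index of the single 1 in column j
    (assumption: exactly one 1 per column).
    """
    n = len(chromosome)
    matrix = [[0] * n for _ in range(m)]
    for j, row in enumerate(chromosome):
        matrix[row - 1][j] = 1
    return matrix


population = random_population(pop_size=5, n_genes=4, m=3, seed=0)
print(population)
print(decode([2, 1, 3], m=3))
```

Because genes are drawn independently, two genes of the same chromosome may hold the same row number, as the text allows.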

4.2 The Fitness

Formula (3) of Section 3 is transformed into:

$$F(r) = W_s\,\frac{S(r)}{S_{\min}} + W_c\,\frac{C(r)}{C_{\min}} \tag{4}$$

Here $W_c + W_s = 1$, $W_c \ge 0$, $W_s \ge 0$, $S_{\min}$ is the minimum support, and $C_{\min}$ is the minimum confidence.
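A minimal sketch of the fitness in formula (4) is given below, reading it as $F(r) = W_s S(r)/S_{\min} + W_c C(r)/C_{\min}$. The function name, the default weights of 0.5, and the example thresholds are assumptions made for illustration; the paper does not state the weight values it used.

```python
def fitness(s: float, c: float, s_min: float, c_min: float,
            w_s: float = 0.5, w_c: float = 0.5) -> float:
    """F(r) = W_s * S(r) / S_min + W_c * C(r) / C_min, with W_s + W_c = 1."""
    if s_min <= 0 or c_min <= 0:
        raise ValueError("minimum support and confidence must be positive")
    if w_s < 0 or w_c < 0 or abs(w_s + w_c - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return w_s * s / s_min + w_c * c / c_min


# Hypothetical rule with support 0.2 and confidence 0.8,
# evaluated against thresholds S_min = 0.1 and C_min = 0.5.
print(fitness(0.2, 0.8, s_min=0.1, c_min=0.5))
```

Since the weights sum to 1, any rule that meets both thresholds has a fitness of at least 1, which gives a convenient reference value when comparing individuals.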

4.3 Reproduction Operator

Reproduction is the transmission of individual information from the parent generation to the offspring generation. The fitness value of each individual determines the probability with which it reproduces into the next generation. Through reproduction, the number of excellent individuals in the population increases steadily, and the whole evolutionary process heads toward the optimum. We adopt the roulette selection strategy: the reproduction probability of each individual is proportional to its fitness value.

1) Compute the reproduction probability of every individual:

$$P(i) = \frac{f(i)}{\sum_{j=1}^{M} f(j)} \tag{5}$$

2) Generate a random number $r$, $r = \mathrm{random}[0, 1]$.

3) If $P(0) + P(1) + \cdots + P(i-1)$ …
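The roulette selection steps can be sketched in Python as below. Step 3 is cut off in the text, so the selection rule used here (pick the first individual whose cumulative probability exceeds $r$) is the standard roulette-wheel rule and is an assumption, not necessarily the authors' exact wording. The function names are hypothetical.

```python
import random
from itertools import accumulate
from typing import List, Sequence


def reproduction_probabilities(fitnesses: Sequence[float]) -> List[float]:
    """Formula (5): P(i) = f(i) / sum of f(j) over the whole population."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    return [f / total for f in fitnesses]


def roulette_select(fitnesses: Sequence[float], rng: random.Random) -> int:
    """Return the index of one individual chosen with probability P(i)."""
    probs = reproduction_probabilities(fitnesses)
    r = rng.random()  # step 2: random number in [0, 1)
    # Assumed step 3: select the first i whose cumulative probability exceeds r.
    for i, cumulative in enumerate(accumulate(probs)):
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding


rng = random.Random(0)
fitness_values = [1.0, 3.0]  # hypothetical fitness values
picks = [roulette_select(fitness_values, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # close to P(1) = 0.75 on average
```

Individuals with zero fitness are never selected under this rule, so fitter individuals gradually dominate the next generation.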


4.4 Crossover Operator

… $P_c$ can slow down the search process [6]. The crossover operator is defined as follows.

Compute the crossover probability $P_c$:

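$$P_c = \begin{cases} p_{c1} - \dfrac{(p_{c1} - p_{c2})(f - \bar{f})}{f_{\max} - \bar{f}}, & f \ge \bar{f} \\ p_{c1}, & f < \bar{f} \end{cases} \tag{6}$$

Here $p_{c1} = 0.9$, $p_{c2} = 0.7$, $f_{\max}$ is the maximum fitness value of the population, $\bar{f}$ is the average fitness value of the population, and $f$ is the fitness value of the individual concerned.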
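The adaptive crossover probability can be sketched as a small Python function. The sketch assumes the piecewise form $P_c = p_{c1} - (p_{c1} - p_{c2})(f - \bar{f})/(f_{\max} - \bar{f})$ for $f \ge \bar{f}$ and $P_c = p_{c1}$ otherwise. The function name is hypothetical, and returning $p_{c1}$ when all individuals have the same fitness is an added guard against division by zero that the text does not specify.

```python
def crossover_probability(f: float, f_avg: float, f_max: float,
                          p_c1: float = 0.9, p_c2: float = 0.7) -> float:
    """Adaptive crossover probability in the assumed form of formula (6).

    f     -- fitness of the individual considered for crossover
    f_avg -- average fitness of the population
    f_max -- maximum fitness of the population
    """
    # Below-average individuals, or a fully converged population
    # (f_max == f_avg, an added guard), use the upper probability p_c1.
    if f < f_avg or f_max == f_avg:
        return p_c1
    return p_c1 - (p_c1 - p_c2) * (f - f_avg) / (f_max - f_avg)


# Hypothetical population statistics: average 2.0, maximum 4.0.
for f in (1.0, 3.0, 4.0):
    print(f, crossover_probability(f, f_avg=2.0, f_max=4.0))
```

Under this form the probability falls linearly from $p_{c1}$ at the average fitness to $p_{c2}$ at the maximum fitness, so the best individuals are recombined less often.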

4.5 Mutation Operator

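The mutation operator is defined as follows. Compute the mutation probability $P_m$:

$$P_m = \begin{cases} p_{m1} - \dfrac{(p_{m1} - p_{m2})(f - \bar{f})}{f_{\max} - \bar{f}}, & f \ge \bar{f} \\ p_{m1}, & f < \bar{f} \end{cases} \tag{7}$$

Here $p_{m1} = 0.1$, $p_{m2} = 0.001$, $f_{\max}$ is the maximum fitness value of the population, $\bar{f}$ is the average fitness value of the population, and $f$ is the fitness value of the individual concerned.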
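As a purely illustrative check of formula (7), assume the piecewise form $P_m = p_{m1} - (p_{m1} - p_{m2})(f - \bar{f})/(f_{\max} - \bar{f})$ for $f \ge \bar{f}$ and a population with $f_{\max} = 2.0$ and $\bar{f} = 1.0$ (values chosen for this example, not taken from the experiments). For an individual with fitness $f = 1.5$:

$$P_m = 0.1 - \frac{(0.1 - 0.001)(1.5 - 1.0)}{2.0 - 1.0} = 0.1 - 0.0495 = 0.0505$$

The best individual ($f = f_{\max}$) gets $P_m = p_{m2} = 0.001$, while any individual below the average keeps $P_m = p_{m1} = 0.1$, so good solutions are disturbed less often than poor ones.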

4.6 Termination Condition

The process stops when the matching condition is no longer satisfied.

5. Experiments

To check the search capability of the operators and their operational efficiency, simulation results are compared with the GA in [2]. The platform of the simulation experiment is a Dell PowerEdge 2600 server (dual Intel Xeon 1.8 GHz CPUs, 1 GB memory, Red Hat Linux 9.0).

We first compare the performance of our proposed method with the algorithm in Ref [2].

Fig. 1 shows the runtime vs. the minimum support for both algorithms, where the minimum support varies from 0.25% to 2% on the synthetic dataset. Our proposed algorithm runs 25 times faster than the Apriori algorithm because a large number of candidates can be pruned by the ARMNGA pruning strategy. As the minimum support threshold decreases, the runtime of the Apriori algorithm increases dramatically, since it generates too many candidates when the minimum support is small.

Fig 1. Runtime vs. minimum support.

Fig. 2 shows the runtime vs. the average size of transactions for both algorithms, where the average size of transactions varies from 4 to 14 on the synthetic dataset. As the average size of transactions increases, the runtime of the algorithm in Ref [2] increases dramatically, whereas the runtime of our proposed algorithm increases slowly. The runtimes of both algorithms increase because the number of frequent resources grows as transactions become longer, so finding the candidates present in a transaction takes longer. Our proposed algorithm is more scalable than the algorithm in Ref [2] because a large number of candidates can be pruned by the ARMNGA pruning strategy.

Fig 2. Runtime vs. average size of transactions.

From Fig. 1 and Fig. 2, we can deduce that ARMNGA has a higher convergence speed and a more reasonable selection scheme, which guarantees that the optimal solution found does not deteriorate. Therefore, according to the theoretical analysis and the experimental results, it is better than GA and ARMA.

6. Conclusions

Educational data mining is a young research area, and more specialized work oriented toward the educational domain is needed for it to reach a level of application success similar to that of other areas, such as medical data mining and e-commerce data mining. In this paper, we propose an association rules mining algorithm based on a novel genetic algorithm, designed specifically for discovering association rules. We compare the results of ARMNGA with the results of Ref [2]; according to the theoretical analysis and the experimental results, it is better than GA and ARM. We believe that future mining tools should be easier for educators to use.

Acknowledgement

This work is partially supported by the project "The research of foreign Chinese long-distance visualized teaching model" of the National Society Science Foundation of P.R. China under Grant No. 07BYY033. We thank Professor Zheng Shijue of the Department of Computer Science of Central China Normal University for his support in various aspects, and our thanks also go to all members of our research group.

References

[1] Behrouz Minaei-Bidgoli, William F. Punch III, Using genetic algorithms for data mining optimization in an educational web-based system, Lecture Notes in Computer Science, pp. 2724–2735, 2003.

[2] G. Chen, Q. Wei, Fuzzy association rules and the extended mining algorithms, Information Sciences 147 (2002) 201–228.

[3] K. Koperski, J. Han, Discovery of spatial association rules in geographic information databases, in: Proc. of International Symposium on Advances in Spatial Databases, SSD, LNCS, vol. 951, Springer Verlag, 1995, pp. 47–66.

[4] Shijue Zheng, Wanneng Shu, Li Gao, Task scheduling using parallel genetic simulated annealing algorithm, 2006 IEEE International Conference on Service Operations and Logistics, and Informatics, Proceedings, June 21–23, 2006, Shanghai, pp. 46–50.

[5] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining association rules in bag databases, Information Sciences 166 (2004) 31–47.

[6] Gao Li, Li Dan, Dai Shangping, A mining algorithm of constraint based association rules, Journal of Henan University, Vol. 33 (2003), pp. 55–58.

[7] Wu Zhaohui, Association rule mining based on simulated annealing genetic algorithm, Computer Applications, Vol. 25 (2005), pp. 1009–1011.
