a data mining algorithm in distance learning 04537118

  • 8/3/2019 A Data Mining Algorithm in Distance Learning 04537118

    A Data Mining Algorithm In Distance Learning

    Dai Shangping1, Zhang Ping2

    1 Department of Computer Science, Hua Zhong Normal University, Wu Han, [email protected]

    2 Foreign Languages College, Zhongnan University of Economics and Law, Wuhan, China

    [email protected]

    Abstract

    Currently there is an increasing interest in data mining and educational systems, making educational data mining a new and growing research community. One of the challenges in developing data mining systems is to integrate and coordinate existing data mining applications in a seamless manner so that cost-effective systems can be developed without the need for costly proprietary products. The popularity of distance education has grown rapidly over the last decade in higher education, yet many fundamental teaching and learning issues are still under debate. This paper presents an approach for classifying students in order to predict their final grade based on features extracted from logged data in a web-based education system. We take advantage of a genetic algorithm (GA) designed specifically for discovering association rules. We propose a novel spatial mining algorithm, called ARMNGA (Association Rules Mining in Novel Genetic Algorithm). Compared to the algorithm in Reference [2], the ARMNGA algorithm avoids generating impossible candidates and is therefore more efficient in terms of execution time.

    Keywords: data mining; genetic algorithm; association rules; distance learning

    1. Introduction

    Some of the commonly used data mining algorithms fall under the following categories: decision trees and rules, nonlinear regression and classification methods, example-based methods, probabilistic graphical dependency models, and relational learning models. Over the years, genetic algorithms have been successfully applied to learning tasks in different domains, such as chemical process control, financial classification, and manufacturing scheduling. Many leading educational institutions are working to establish an online teaching and learning presence. Several systems with different capabilities and approaches have been developed to deliver online education in an academic setting [1].

    Because of advances in information technology, a vast number of educational resources have accumulated on the Internet. Therefore, how to mine interesting resources from databases has attracted more and more attention in recent years [2]. Many data mining methods have been proposed, such as association rule mining, sequential pattern mining, calling path pattern mining, text mining, temporal data mining, and spatial data mining.

    Association rule mining, as originally proposed with its Apriori algorithm (ARMA), has developed into an active research area. Many additional algorithms have been proposed for association rule mining. The concept of an association rule has also been extended in many different ways, such as generalized association rules, association rules with item constraints, and sequence rules. Beyond the earlier analysis of market basket data, these algorithms have been widely used in many other practical applications, such as customer profiling and product analysis [3].

    A genetic algorithm (GA) is a self-adaptive optimization search algorithm. A GA obtains the best, or the most satisfactory, solution through the continual evolution of generations of chromosomes, using operations such as reproduction, crossover and mutation, until a certain termination condition is met [4].

    The Association Rules Mining Algorithm Based on a Novel Genetic Algorithm (ARMNGA) is an optimized algorithm that combines GA with ARMA.

    The contributions of this paper are as follows. First, we take advantage of a genetic algorithm (GA) designed specifically for discovering association rules. Second, we propose a novel spatial mining algorithm, called ARMNGA. Compared to the algorithm in Ref [2], the ARMNGA algorithm avoids generating impossible candidates and is therefore more efficient in terms of execution time.

    2. Association Rules

    Definition 1 (confidence). Let $I = \{i_1, i_2, \ldots, i_m\}$ be a collection of items, where $i_j$ ($1 \le j \le m$) denotes the $j$-th item. Let $D = \{T_1, \ldots, T_N\}$ be a collection of transactions, where each transaction $T_i \subseteq I$ ($1 \le i \le N$). The confidence of a rule $r \Rightarrow q$ is the probability that a transaction containing $r$ also contains $q$.

    978-1-4244-1651-6/08/$25.00 © 2008 IEEE

    The association rule here is an implication of the form $r \Rightarrow q$, where $r$ is the conjunction of conditions and $q$ is the type of classification. The rule $r \Rightarrow q$ has to satisfy a specified minimum support and minimum confidence measure [5].

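    The support of rule $r \Rightarrow q$ measures how frequently $r$ and $q$ occur in $D$:

    $$S(r) = \frac{|r|}{|D|} \qquad (1)$$

    where $|r|$ is the number of transactions in $D$ that contain $r$ and $|D|$ is the total number of transactions.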

    The confidence of rule $r \Rightarrow q$ measures how often a transaction that contains the premise $r$ also contains $q$:

    $$C(r \Rightarrow q) = \frac{S(r \cup q)}{S(r)} \qquad (2)$$

    Definition 2 (weighted support). Given a collection of items $I = \{i_1, i_2, \ldots, i_m\}$, each item $i_j$ is assigned a weight $w_j$ ($0 \le w_j \le 1$, $1 \le j \le m$). For a rule $r \Rightarrow q$, the weighted support is

    $$S_w(r) = \frac{1}{k} \sum_{i_j \in r} w_j \, S(r) \qquad (3)$$

    where $k$ is the size of the item set of the rule $r \Rightarrow q$. When every item $i_j$ has the same weight $w_j$, the weighted calculation gives the rules the same support.
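The measures in formulas (1) to (3) can be illustrated with a minimal sketch. The transaction data, item names and weights below are hypothetical, and the weighted support assumes that the sum and the size $k$ both range over the items of the itemset passed in, which is only one possible reading of Definition 2.

```python
from typing import Iterable, List, Mapping, Set


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """S(r) = |r| / |D|: fraction of transactions containing every item of r."""
    items = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(r: Iterable[str], q: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """C(r => q) = S(r ∪ q) / S(r); returns 0.0 when r never occurs."""
    s_r = support(r, transactions)
    if s_r == 0.0:
        return 0.0
    return support(set(r) | set(q), transactions) / s_r


def weighted_support(itemset: Iterable[str], weights: Mapping[str, float],
                     transactions: List[Set[str]]) -> float:
    """S_w = (1/k) * sum of w_j over the itemset * S(itemset).

    Assumption: k is the number of items in the itemset passed in.
    """
    items = set(itemset)
    if not items:
        return 0.0
    mean_weight = sum(weights[i] for i in items) / len(items)
    return mean_weight * support(items, transactions)


# Hypothetical log of resources accessed per student session.
D = [{"forum", "quiz"}, {"forum", "quiz", "video"}, {"forum"}, {"video"}]
s = support({"forum"}, D)                      # 3 of 4 sessions
c = confidence({"forum"}, {"quiz"}, D)         # (2/4) / (3/4)
sw = weighted_support({"forum", "quiz"}, {"forum": 1.0, "quiz": 0.5}, D)
```

With all weights equal to 1, `weighted_support` returns the same value as `support`, which matches the remark that equal weights leave the support unchanged under this reading.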

    3. Genetic Algorithm (GA)

    A genetic algorithm (GA) is a self-adaptive optimization search algorithm. A GA obtains the best, or the most satisfactory, solution through the continual evolution of generations of chromosomes, using operations such as reproduction, crossover and mutation.

    Here is the general description of this problem:

    $$F(r) = a\,S(r) + b\,C(r) + c\,A(r) \qquad (3)$$

    where $a$, $b$, $c$ are constants with $a \ge 0$, $b \ge 0$, $c \ge 0$; $S(r)$ is the support, $C(r)$ is the confidence, and $A(r)$ is the mantle.

    4. A Data Mining Algorithm

    4.1 Encoding

    This paper employs natural numbers to encode the variable $A_{ij}$. That is, for every column of the matrix $A$, the number of the row in which the element 1 appears is regarded as a gene. The genes are independent of each other. They are denoted by $A_1, A_2, \ldots, A_j, \ldots, A_n$, in which $A_j \in [1, m]$, $j \in [1, n]$, and different genes may take the same natural number.

    When the initial population is produced by random distribution, the population must be of a certain scale in order to reach the global optimal solution. The best way is to randomly generate $M$ individuals of length $n$; the chromosome strings encoded by natural numbers then form the initial population.
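The encoding and random initialization described above might be sketched as follows. The function names are hypothetical, and the reading that gene $A_j$ stores the row index of the single 1 in column $j$ of an $m \times n$ matrix is an assumption made for illustration.

```python
import random
from typing import List


def random_chromosome(m: int, n: int, rng: random.Random) -> List[int]:
    """One individual: n genes, each a natural number in [1, m]; repeats allowed."""
    return [rng.randint(1, m) for _ in range(n)]


def initial_population(M: int, m: int, n: int, seed: int = 0) -> List[List[int]]:
    """Randomly generate M individuals of length n."""
    rng = random.Random(seed)
    return [random_chromosome(m, n, rng) for _ in range(M)]


def decode(chromosome: List[int], m: int) -> List[List[int]]:
    """Rebuild the 0/1 matrix A (m rows, n columns) from a chromosome.

    Assumption: gene A_j is the 1-based row holding the 1 in column j.
    """
    n = len(chromosome)
    A = [[0] * n for _ in range(m)]
    for j, row in enumerate(chromosome):
        A[row - 1][j] = 1
    return A


population = initial_population(M=6, m=4, n=5, seed=0)
matrix = decode([2, 1, 2], m=3)   # [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
```

Storing one natural number per column keeps each chromosome at length $n$ instead of $m \times n$ bits, and every random chromosome decodes to a matrix with exactly one 1 per column.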

    4.2 The Fitness

    Formula (3) for $F(r)$ is transformed into:

    $$F(r) = W_s \frac{S(r)}{S_{\min}} + W_c \frac{C(r)}{C_{\min}} \qquad (4)$$

    Here, $W_c + W_s = 1$, $W_c \ge 0$, $W_s \ge 0$, $S_{\min}$ is the minimum support, and $C_{\min}$ is the minimum confidence.
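As a small illustration of formula (4), the sketch below evaluates the fitness of a rule from its support and confidence. The function name, the default weights of 0.5 and the example thresholds are assumptions chosen for the example, not values given in the paper.

```python
def fitness(s: float, c: float, s_min: float, c_min: float,
            w_s: float = 0.5, w_c: float = 0.5) -> float:
    """F(r) = W_s * S(r) / S_min + W_c * C(r) / C_min, with W_s + W_c = 1."""
    if s_min <= 0 or c_min <= 0:
        raise ValueError("minimum support and confidence must be positive")
    if w_s < 0 or w_c < 0 or abs(w_s + w_c - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return w_s * s / s_min + w_c * c / c_min


# A rule with support 0.5 and confidence 0.8 against thresholds 0.25 and 0.5.
example = fitness(0.5, 0.8, s_min=0.25, c_min=0.5)   # 0.5 * 2.0 + 0.5 * 1.6
```

Because both terms are normalized by their thresholds, a rule that sits exactly at the minimum support and the minimum confidence scores 1 for any valid choice of weights.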

    4.3 Reproduction Operator

    Reproduction is the transmission of individual information from the parent generation to the offspring generation. The fitness value of each individual in a generation determines the probability that it reproduces into the next generation. Through reproduction, the number of excellent individuals in the population increases constantly, and the whole process of evolution heads in the optimal direction. We adopt the roulette selection strategy: the reproduction probability of each individual is proportional to its fitness value.

    1) Compute the reproduction probability of every individual:

    $$P(i) = \frac{f(i)}{\sum_{j=1}^{M} f(j)} \qquad (5)$$

    2) Generate a random number $r$, $r = \text{random}[0, 1]$.

    3) If $P(0) + P(1) + \cdots + P(i-1)$
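Roulette selection with the probabilities of formula (5) can be sketched as below. Since step 3 is cut off in this transcript, the selection rule used here is the standard one and is an assumption: individual $i$ is chosen when $r$ falls in the cumulative interval ending at $P(0) + \cdots + P(i)$. Indices are 0-based and all function names are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence


def reproduction_probabilities(fitness_values: Sequence[float]) -> List[float]:
    """P(i) = f(i) / sum of all f(j), as in formula (5)."""
    total = sum(fitness_values)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    return [f / total for f in fitness_values]


def roulette_select(fitness_values: Sequence[float], r: float) -> int:
    """Return the index whose cumulative probability interval contains r.

    Assumed rule: pick the first i with P(0) + ... + P(i) >= r.
    """
    cumulative = list(accumulate(reproduction_probabilities(fitness_values)))
    i = bisect_left(cumulative, r)
    return min(i, len(cumulative) - 1)   # guard against rounding at r close to 1


def reproduce(population: list, fitness_values: Sequence[float],
              rng: random.Random) -> list:
    """Draw a new generation of the same size by repeated roulette selection."""
    return [population[roulette_select(fitness_values, rng.random())]
            for _ in population]


picked = [roulette_select([1.0, 3.0, 6.0], r) for r in (0.05, 0.25, 0.9)]
next_gen = reproduce(["a", "b", "c"], [1.0, 3.0, 6.0], random.Random(0))
```

With fitness values 1, 3 and 6 the cumulative probabilities are 0.1, 0.4 and 1.0, so fitter individuals occupy wider intervals and are drawn more often.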


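    $P_c$ can slow down the search process [6]. Here is the definition of the crossover operator, computing the crossover probability $P_c$:

    $$P_c = \begin{cases} p_{c1} - \dfrac{(p_{c1} - p_{c2})(f - \bar{f})}{f_{\max} - \bar{f}}, & f \ge \bar{f} \\ p_{c1}, & f < \bar{f} \end{cases} \qquad (6)$$

    Here, $p_{c1} = 0.9$, $p_{c2} = 0.7$, $f_{\max}$ is the maximum fitness value of the population, and $\bar{f}$ is the average fitness value of the population.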
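The adaptive crossover probability of formula (6) might be computed as in the sketch below. It assumes that $f$ is the fitness of the individual (or pair) being considered for crossover, which the surviving text does not state, and it adds a guard for the case where the maximum and average fitness coincide. The function name is hypothetical.

```python
def crossover_probability(f: float, f_avg: float, f_max: float,
                          p_c1: float = 0.9, p_c2: float = 0.7) -> float:
    """Adaptive P_c: p_c1 below average fitness, falling linearly to p_c2 at f_max."""
    if f < f_avg or f_max == f_avg:
        # Below-average individuals (or a fully uniform population, an added
        # guard against division by zero) use the larger probability.
        return p_c1
    return p_c1 - (p_c1 - p_c2) * (f - f_avg) / (f_max - f_avg)


below_avg = crossover_probability(4.0, f_avg=5.0, f_max=10.0)   # p_c1
midway = crossover_probability(7.5, f_avg=5.0, f_max=10.0)      # halfway between
best = crossover_probability(10.0, f_avg=5.0, f_max=10.0)       # close to p_c2
```

Under this scheme weak individuals are recombined often, while the fittest ones are crossed with the lower probability $p_{c2}$ and are therefore disrupted less frequently.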

    4.5 Mutation Operator

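    Here is the definition of the mutation operator, computing the mutation probability $P_m$:

    $$P_m = \begin{cases} p_{m1} - \dfrac{(p_{m1} - p_{m2})(f - \bar{f})}{f_{\max} - \bar{f}}, & f \ge \bar{f} \\ p_{m1}, & f < \bar{f} \end{cases} \qquad (7)$$

    Here, $p_{m1} = 0.1$, $p_{m2} = 0.001$, $f_{\max}$ is the maximum fitness value of the population, and $\bar{f}$ is the average fitness value of the population.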
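A minimal sketch of formula (7) follows, together with one possible way to apply it. The paper gives only the probability, so the `mutate` function (each gene independently redrawn from $[1, m]$) is a hypothetical operator chosen to match the natural-number encoding of Section 4.1. It is also assumed that $f$ is the fitness of the individual being mutated.

```python
import random
from typing import List


def mutation_probability(f: float, f_avg: float, f_max: float,
                         p_m1: float = 0.1, p_m2: float = 0.001) -> float:
    """Adaptive P_m: p_m1 below average fitness, falling linearly to p_m2 at f_max."""
    if f < f_avg or f_max == f_avg:
        # The f_max == f_avg case is an added guard against division by zero.
        return p_m1
    return p_m1 - (p_m1 - p_m2) * (f - f_avg) / (f_max - f_avg)


def mutate(chromosome: List[int], m: int, p_m: float,
           rng: random.Random) -> List[int]:
    """Hypothetical operator: redraw each gene from [1, m] with probability p_m."""
    return [rng.randint(1, m) if rng.random() < p_m else gene
            for gene in chromosome]


rng = random.Random(0)
p_best = mutation_probability(10.0, f_avg=5.0, f_max=10.0)    # close to p_m2
p_weak = mutation_probability(3.0, f_avg=5.0, f_max=10.0)     # p_m1
child = mutate([1, 2, 3, 4], m=4, p_m=p_weak, rng=rng)
```

The fittest individual is mutated with a probability near 0.001, so good solutions are rarely damaged, while below-average individuals keep the exploratory rate of 0.1.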

    4.6 Termination Condition

    When the termination condition is satisfied, the process stops.

    5. Experiments

    To check the search capability of the operator and its operational efficiency, simulation results are compared with those of the GA in [2]. The platform for the simulation experiment is a Dell PowerEdge 2600 server (dual Intel Xeon 1.8 GHz CPUs, 1 GB memory, Red Hat Linux 9.0).

    We first compare the performance of our proposed

    method with the algorithm in Ref [2].

    Fig. 1 shows the runtime vs. the minimum support for both algorithms, where the minimum support varies from 0.25% to 2% for the synthetic dataset. Our proposed algorithm runs 25 times faster than the Apriori algorithm, because a large number of candidates can be pruned by using the ARMNGA pruning strategy. As the minimum support threshold decreases, the runtime of the Apriori algorithm increases dramatically, since it generates too many candidates when the minimum support is small.

    Fig 1. Runtime vs. minimum support.

    Fig. 2 shows the runtime vs. the average size of transactions for both algorithms, where the average size of transactions varies from 4 to 14 for the synthetic dataset. As the average size of transactions increases, the runtime of the algorithm in Ref [2] increases dramatically, whereas the runtime of our proposed algorithm increases slowly. The runtimes of both algorithms increase because the number of frequent resources grows as transactions become longer, so finding the candidates present in a transaction takes more time. Our proposed algorithm is more scalable than the algorithm in Ref [2] because a large number of candidates can be pruned by using the ARMNGA pruning strategy.

    Fig 2. Runtime vs. average size of transactions.

    From Fig. 1 and Fig. 2, we can conclude that ARMNGA has a higher convergence speed and a more reasonable selection scheme, which guarantees the non-degrading performance of the optimal solution. Therefore, both the theoretical analysis and the experimental results indicate that it is better than GA and ARMA.

    6. Conclusions

    Educational data mining is a young research area, and more specialized work oriented toward the educational domain is necessary in order to reach a level of application success similar to that of other areas, such as medical data mining and e-commerce data mining. In this paper, we propose an association rule mining algorithm based on a novel genetic algorithm, designed specifically for discovering association rules. We compare the results of ARMNGA with the results of Ref [2]; both the theoretical analysis and the experimental results indicate that it is better than GA and ARM. We believe that future mining tools should be made easier for educators to use.

    Acknowledgement

    This work is partially supported by the project "The research of foreign Chinese long-distance visualized teaching model" of the National Society Science Foundation of P.R. China, under Grant No. 07BYY033. We thank Professor Zheng Shijue of the Department of Computer Science of Central China Normal University Technology for his support in various aspects; thanks also go to all members of our research group.

    References

    [1] Behrouz Minaei-Bidgoli, William F. Punch III, Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System, Lecture Notes in Computer Science, pp. 2724-2735, 2003.

    [2] G. Chen, Q. Wei, Fuzzy association rules and the extended mining algorithms, Information Sciences 147 (2002) 201-228.

    [3] K. Koperski, J. Han, Discovery of spatial association rules in geographic information databases, in: Proc. of International Symposium on Advance in Spatial Databases, SSD, LNCS, vol. 951, Springer Verlag, 1995, pp. 47-66.

    [4] Shijue Zheng, Wanneng Shu, Li Gao, Task Scheduling Using Parallel Genetic Simulated Annealing Algorithm, 2006 IEEE International Conference on Service Operations and Logistics, and Informatics Proceedings, June 21-23, 2006, Shanghai, pp. 46-50.

    [5] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining association rules in bag databases, Information Sciences 166 (2004) 31-47.

    [6] Gao Li, Li Dan, Dai Shangping, A mining algorithm of constraint based association rules, Journal of Henan University, Vol. 33 (2003), pp. 55-58.

    [7] Wu Zhaohui, Association rule mining based on simulated annealing genetic algorithm, Comput Applications, Vol. 25 (2005), pp. 1009-1011.