tool.doc

PRACTICAL-10

AIM: Study of a data mining tool.

KNIME, the Konstanz Information Miner, is an open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows assembly of nodes for data preprocessing (ETL: Extraction, Transformation, Loading), for modeling and data analysis and visualization.

KNIME allows users to visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. KNIME is written in Java and based on Eclipse and makes use of its extension mechanism to add plugins providing additional functionality. The core version already includes hundreds of modules for data integration (file I/O, database nodes supporting all common database management systems), data transformation (filter, converter, combiner) as well as the commonly used methods for data analysis and visualization. With the free Report Designer extension, KNIME workflows can be used as data sets to create report templates that can be exported to document formats like doc, ppt, xls, pdf and others. Other capabilities of KNIME are:

KNIME's core architecture allows processing of large data volumes that are limited only by the available hard disk space (most other open source data analysis tools work in main memory and are therefore limited by the available RAM). For example, KNIME allows analysis of 300 million customer addresses, 20 million cell images and 10 million molecular structures.
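The difference between disk-bound and RAM-bound processing can be illustrated with a small sketch. It is not KNIME's internal caching mechanism, only the general idea of streaming rows one at a time so that memory use stays constant regardless of file size. The column names and the in-memory sample are hypothetical stand-ins for a large file on disk.

```python
import csv
import io


def streaming_mean(rows, column):
    """Compute the mean of one column while holding only one row in memory."""
    count = 0
    total = 0.0
    for row in rows:  # rows can be any iterator, e.g. csv.DictReader over a file
        total += float(row[column])
        count += 1
    return total / count if count else None


# Hypothetical stand-in for a very large file on disk.
sample = io.StringIO("customer,amount\na,10\nb,20\nc,60\n")
mean_amount = streaming_mean(csv.DictReader(sample), "amount")
print(mean_amount)
```

Because the function consumes an iterator, the same code works on a file of any size opened with `open(path, newline="")`.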

Additional plugins allow the integration of methods for text mining, image mining, and time series analysis.

Features

One key behind the success of KNIME is its inherent modular workflow approach, which documents and stores the analysis process in the order it was conceived and implemented, while ensuring that intermediate results are always available.
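The modular workflow idea can be sketched as a list of named steps that run in order, with every intermediate result kept for later inspection. This is only a conceptual illustration, not KNIME's API; the node labels and the `run_workflow` helper are hypothetical.

```python
def run_workflow(data, nodes):
    """Run nodes in order and keep every intermediate result."""
    results = [("input", data)]
    for name, func in nodes:
        data = func(data)
        results.append((name, data))  # the output of each step stays available
    return results


# Illustrative labels only; each "node" is just a function from table to table.
nodes = [
    ("Row Filter", lambda rows: [r for r in rows if r is not None]),
    ("Multiply by 2", lambda rows: [r * 2 for r in rows]),
    ("Sum", lambda rows: sum(rows)),
]

results = run_workflow([1, None, 3], nodes)
for name, output in results:
    print(name, "->", output)
```

Keeping the full `results` list is what makes it possible to inspect the output of any step after the run, which mirrors the "intermediate results are always available" property described above.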

Core KNIME features include:
- Scalability through sophisticated data handling (intelligent automatic caching of data in the background while maximizing throughput performance)
- High, simple extensibility via a well-defined API for plugin extensions
- Intuitive user interface
- Import/export of workflows (for exchanging with other KNIME users)
- Parallel execution on multi-core systems
- Command line version for "headless" batch executions
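For the headless batch execution mentioned in the feature list, a workflow is normally started from the command line through KNIME's batch application. The sketch below only assembles such a command as an argument list. The executable name and workflow path are hypothetical, and the flags shown should be checked against the documentation of the installed KNIME version before use.

```python
def knime_batch_command(executable, workflow_dir, reset=True):
    """Build the argument list for a headless KNIME batch run.

    Assumption: flag names follow the KNIME batch application as commonly
    documented; verify them for the installed version.
    """
    cmd = [
        executable,
        "-nosplash",
        "-application", "org.knime.product.KNIME_BATCH_APPLICATION",
        f"-workflowDir={workflow_dir}",
    ]
    if reset:
        cmd.append("-reset")  # re-execute all nodes instead of reusing stored results
    return cmd


# Hypothetical executable name and workflow location.
command = knime_batch_command("knime", "/path/to/workspace/MyWorkflow")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where KNIME is installed.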


Available KNIME modules cover a vast range of functionality, such as:
- I/O: retrieves data from files or data bases
- Data Manipulation: pre-processes your input data with filtering, group-by, pivoting, binning, normalization, aggregation, joining, sampling, partitioning, etc.
- Views: visualize data and results through several interactive views, allowing for interactive data exploration
- Hiliting: ensures hilited data points in one view are also immediately hilited in all other views
- Mining: uses state-of-the-art data mining algorithms like clustering, rule induction, decision tree, association rules, naïve bayes, neural networks, support vector machines, etc. to better understand your data
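Two of the Data Manipulation operations listed above, normalization and group-by aggregation, can be sketched in a few lines. This is a plain illustration of what such nodes compute, not KNIME code; the function names and the sample rows are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def group_by_sum(rows, key, value):
    """Sum the `value` column separately for each distinct `key`."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


# Hypothetical sample table.
sales = [
    {"city": "A", "sales": 10},
    {"city": "B", "sales": 5},
    {"city": "A", "sales": 7},
]
print(min_max_normalize([2, 4, 6]))
print(group_by_sum(sales, "city", "sales"))
```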

Supported Operating Systems

- Windows - 32bit (regularly tested on XP and Vista)
- Windows - 64bit (regularly tested on Vista and verified to work under Windows 7)
- Linux - 32bit (regularly tested on RHEL4/5, OpenSUSE 10.2/10.3/11.0, amongst others)
- Linux - 64bit (regularly tested on RHEL4/5, OpenSUSE 10.2/10.3/11.0, amongst others)
- Mac OSX - 64bit Intel-based architecture with Java 1.6

Key Benefits

Integrate Data - Intuitively Integrate All Your Data
KNIME workflows allow you to quickly and intuitively combine and transform all of your heterogeneous data sources and create summary tables.

Report Results - Create Complex Reports
KNIME Report Designer gives you access to raw data and data summaries together with a vast library of graphical tools to put them into a publishable form.

Reuse Templates - Reports Are Reproducible
The combination of KNIME workflows and KNIME report templates allows you to re-create reports whenever needed on archived or up-to-date data. In combination with KNIME Report Server, this is even possible via an easy-to-use Web Portal.


The KNIME home screen looks as shown below:

Now click on New Project, select a File Reader from the node repository, and drag it to the project screen to read in the data. Then configure this File Reader with a dataset.


Select the dataset using Browse, as shown in the given figure:

Now execute the node by right-clicking on it, as shown in the figure above.

After execution, the node is ready with its output. You can now pass this output to any other node.
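What the File Reader step produces, a table of rows that downstream nodes can consume, can be sketched with Python's standard `csv` module. This is an illustration only, not how KNIME reads files internally; the column names and sample content are hypothetical.

```python
import csv
import io


def read_table(handle):
    """Read delimited text into a list of row dicts; empty cells become None."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append(
            {name: (value if value != "" else None) for name, value in record.items()}
        )
    return rows


# Hypothetical dataset; in practice this would be open("data.csv", newline="").
sample_file = io.StringIO(
    "sepal_length,species\n"
    "5.1,setosa\n"
    ",versicolor\n"
    "6.3,virginica\n"
)
table = read_table(sample_file)
for row in table:
    print(row)
```

Marking empty cells as `None` at read time makes missing values explicit for the steps that follow.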

Missing Value

Now choose the Missing Value node from the Data Manipulation category of the node repository and add it to the project screen. Execute it and you will get an output in which the missing values have been handled.
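One common way to handle missing values is to replace them with the column mean. The sketch below assumes that strategy; the node itself can be configured in other ways, and `impute_mean` is a hypothetical helper, not part of KNIME.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate the mean from
    fill = mean(present)
    return [fill if v is None else v for v in values]


column = [1.0, None, 3.0]
cleaned = impute_mean(column)
print(cleaned)
```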


Box Plot

Now select Box Plot from the Data Views category of the node repository and drag it to the project screen. Execute it and view the plots as shown in the figure.
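A box plot is drawn from a five-number summary of each column. The following sketch computes that summary using linear interpolation between sorted values and flags outliers with the usual 1.5 x IQR rule. Quartile conventions differ between tools, so the numbers may not match KNIME's view exactly; the function names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def box_stats(values):
    """Return the five-number summary plus outliers beyond 1.5 * IQR."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }


stats = box_stats([1, 2, 3, 4, 5, 100])
print(stats)
```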

Histogram

Now select Histogram from the same Data Views category of the node repository and drag it to the screen. Configure which columns to include in the histogram, as shown in the figure.


After execution you will see a histogram as shown in the figure below.
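The counts behind a histogram come from splitting the value range into bins and counting how many values fall into each. A minimal sketch with equal-width bins follows; the bin count and sample values are hypothetical, and KNIME's binning options may differ.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


counts = histogram([1, 2, 2, 3, 5, 9, 10], bins=3)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```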

Naïve Bayes Predictor

Now create a flow diagram as shown in the figure below to build a naïve Bayes predictor. The Scorer at the end gives the confusion matrix showing the accuracy results.
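The learner, predictor, and scorer roles in this flow can be sketched for categorical attributes. This is a textbook naïve Bayes with Laplace smoothing, written for illustration; it is not KNIME's implementation, and the weather-style data is hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Learner: count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_naive_bayes(model, row):
    """Predictor: pick the class with the highest log posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def confusion_matrix(actual, predicted):
    """Scorer: count (actual, predicted) pairs."""
    return Counter(zip(actual, predicted))


# Hypothetical training and test data: (outlook, temperature) -> play
train_rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
train_labels = ["no", "no", "yes", "yes"]
test_rows = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
test_labels = ["no", "yes", "yes"]

model = train_naive_bayes(train_rows, train_labels)
predicted = [predict_naive_bayes(model, row) for row in test_rows]
matrix = confusion_matrix(test_labels, predicted)
accuracy = sum(a == p for a, p in zip(test_labels, predicted)) / len(test_labels)
print(matrix, accuracy)
```

The diagonal entries of the confusion matrix, where actual and predicted labels agree, are the correct predictions; accuracy is their share of all test rows.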

K-means Clustering


Construct the data flow as shown in the figure below. To see the clustering effect, use the scatter plot.
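The clustering step in this flow can be illustrated with a minimal k-means sketch. It alternates between assigning points to the nearest centroid and recomputing centroids until nothing changes. Taking the first k points as initial centroids is an assumption made here for reproducibility, not a statement about how KNIME initializes; the sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Cluster points (tuples of numbers) into k groups with Lloyd's algorithm."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```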


Decision Tree Predictor

Create a data flow as shown in the figure. First, partition the data into two sets: training data and test data. Apply the training data to the learner and the test data to the predictor. Then pass the model generated by the learner to the predictor and view the resulting decision tree model, as shown in the figure.

Also use the Scorer to see the confusion matrix and the accuracy results.
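The partition, learner, predictor, and scorer steps can be sketched with a decision stump, the simplest one-level decision tree on a single numeric attribute. The stump below picks the threshold with the fewest training errors; full decision tree learners typically use criteria such as the Gini index or information gain and split recursively. The data, the top-of-table partitioning, and all function names are hypothetical.

```python
from collections import Counter


def partition(rows, train_fraction):
    """Split rows from the top: the first part for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def learn_stump(rows):
    """Learner: choose the threshold on the value that minimizes training errors."""
    values = sorted({v for v, _ in rows})
    if len(values) < 2:
        raise ValueError("need at least two distinct attribute values")
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in rows if v <= threshold]
        right = [lab for v, lab in rows if v > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]


def predict_stump(model, value):
    """Predictor: apply the learned split to one value."""
    threshold, left_label, right_label = model
    return left_label if value <= threshold else right_label


# Hypothetical data: (attribute value, class label)
data = [(1.0, "low"), (2.0, "low"), (7.0, "high"), (8.0, "high"), (3.0, "low"), (9.0, "high")]
train, test = partition(data, 0.5)
stump = learn_stump(train)
predictions = [predict_stump(stump, v) for v, _ in test]
confusion = Counter(zip([lab for _, lab in test], predictions))
stump_accuracy = sum(p == lab for p, (_, lab) in zip(predictions, test)) / len(test)
print(stump, confusion, stump_accuracy)
```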

Hierarchical Clustering

This node hierarchically clusters the input data. The figure below shows the resulting dendrogram.
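A dendrogram records the order and distance at which clusters are merged. The sketch below performs agglomerative clustering on one-dimensional points and returns that merge history. Single linkage (distance between the closest members) is an assumption chosen for brevity; the node may be configured with other linkage types, and the sample points are hypothetical.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merge history (left, right, distance)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = single_linkage([1.0, 2.0, 6.0, 7.5])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Reading the merge list from first to last corresponds to reading the dendrogram from the leaves up to the root.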


Scatter Plot

Creates a scatter plot of two selectable attributes. Each data point is displayed as a dot at the position determined by its values of the selected attributes.
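How each data point is mapped to a position can be shown with a small text-based sketch. It scales the two selected attributes to a character grid and marks each point with an asterisk. This only illustrates the coordinate mapping, not KNIME's interactive view; the grid size and sample points are hypothetical.

```python
def scatter_cells(points, width, height):
    """Map each (x, y) point to a (column, row) cell of a width x height grid."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_lo, y_hi = min(xs), max(ys)
    x_span = (max(xs) - x_lo) or 1.0
    y_span = (y_hi - min(ys)) or 1.0
    cells = []
    for x, y in points:
        col = round((x - x_lo) / x_span * (width - 1))
        row = round((y_hi - y) / y_span * (height - 1))  # row 0 is the top line
        cells.append((col, row))
    return cells


def render(cells, width, height):
    """Draw the occupied cells as asterisks on a blank grid."""
    grid = [[" "] * width for _ in range(height)]
    for col, row in cells:
        grid[row][col] = "*"
    return "\n".join("".join(line) for line in grid)


points = [(0, 0), (5, 5), (10, 10)]
cells = scatter_cells(points, width=11, height=3)
print(render(cells, width=11, height=3))
```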

