Comparative Analysis of Various Data Mining Classification Algorithms Using R Software

International Journal of Computer Science (IJCS Journal) Published by SK Research Group of Companies (SKRGC) Scholarly Peer Reviewed Research Journals

Format: Volume 6, Issue 1, No 01, 2018

Copyright: All Rights Reserved ©2018

Year of Publication: 2018

Authors: R. Palanisamy, Dr. S. S. Dhenakaran

Reference: IJCS-330


Abstract

Data mining is the process of sorting through collections of stored data to identify and discover patterns using technical analysis. To evaluate the probability of future events, data mining offers many algorithms that segment and predict data sets. Classification is one of the important mechanisms of data mining, and its main goal is to predict target data accurately. Numerous classification techniques are currently used for prediction; in this paper a few of them are implemented and compared on different data sets. Classification algorithms such as BayesNet, DT, J48, Logistic, Naïve Bayes, NBT, PART and RBFN are implemented and compared using the R software with different test data.

Index Terms— Data mining, R classification, Classification comparison, Prediction, Accuracy, Quality Metrics.

I. INTRODUCTION

Data mining is the mechanism of identifying particular data sets within a large collection of data with the help of methods from database systems, statistics and machine learning. A characteristic of data mining is that it forms the analysis step of the Knowledge Discovery in Databases (KDD) process. Data mining works with decision support systems to perform accurate prediction over multiple groups in the data. Before mining, the initial preprocessing work should be done; this includes data collection, data preparation and result interpretation. The data mining process is divided into four stages: data cleaning, data integration, data selection and data transformation. The classification techniques available in data mining include machine-learning-based approaches, neural networks and statistical-procedure-based approaches. Some of the classification algorithms are explained briefly here.

A. K-Nearest Neighbors Algorithm

K-nearest neighbors (k-NN) is a pattern recognition algorithm used for regression and classification. The working principle of k-NN is that an object is classified by the majority vote of its neighbors.
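The majority-vote principle described above can be sketched in R. The snippet below is a minimal illustration only, not code from the paper: it assumes the `class` package is installed and uses the built-in iris data as a stand-in dataset.

```r
# Minimal k-NN sketch using the 'class' package and the built-in iris data.
library(class)

set.seed(42)
idx   <- sample(nrow(iris), 100)   # 100 rows for training, the rest for testing
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]

# Each test row is assigned the majority class among its k = 5 nearest neighbours
pred <- knn(train, test, cl, k = 5)

# Accuracy: fraction of test rows whose predicted class matches the true one
mean(pred == iris$Species[-idx])
```

With `k = 5`, ties are rare and each prediction is simply the most common species among the five closest training rows in feature space.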
When k-NN is used as a classifier, the output is a class membership; when k-NN is used for regression, the output is the property value for the object [1]. The k-NN classifier comes in two variants, the 1-nearest neighbor classifier and the weighted nearest neighbor classifier; the weighted variant assigns larger weights to nearer objects. Compared with the naive method, k-NN is computationally tractable even for large data sets.

B. Naive Bayes Algorithm

The Naïve Bayes classifier belongs to the family of probabilistic classifiers and works with the help of Bayes' theorem. Naïve Bayes classifiers are highly scalable [2]. The Naïve Bayes algorithm works for both binary-class and multiclass problems. Two kinds of probabilities are used: class probabilities and conditional probabilities. The probability of each class in the training dataset is the class probability, and the class value of a new instance is derived from the conditional probabilities of each input.

C. ANN Algorithm

An Artificial Neural Network (ANN) is a collection of connected nodes of many artificial neurons; it is a computational structure modeled on biological neural networks. ANNs are used in areas such as fault detection, speech recognition, product inspection, machine translation and social network filtering [3]. A neural network carries out its processing in three layers: the input layer, the hidden layer and the output layer. The input layer collects the raw inputs that are fed into the network. The activity of each hidden unit is determined from the inputs and their weights, and the output units depend entirely on the hidden units and the weights between the hidden and output layers.

II. RELATED RESEARCH ON CLASSIFICATION

Classification is based on the inductive learning principle, which analyzes and finds patterns in a database. If the environment is dynamic, then the model must be adaptive, i.e. it should be able to learn and map efficiently. Limère et al. (2004) presented a model for firm growth using the decision tree induction principle [4]. It gives interesting results and fits the model to economic data such as growth competence and resources, growth potential and growth ambitions. Hoi et al. (2006) developed a novel framework for learning unified kernel machines from both labeled and unlabeled data. This framework includes semi-supervised learning, supervised learning and active learning. A spectral kernel is also proposed that classifies the given labeled and unlabeled data efficiently. Xu et al. (2008) proposed a reproducing kernel Hilbert space (RKHS) framework for information-theoretic learning [5]. The framework uses a symmetric nonnegative definite kernel function, the cross-information potential. Although this framework gives better results than previous RKHS frameworks, choosing an appropriate kernel function for a particular domain remains an open issue. Shilton and Palaniswami (2008) defined a unified approach to support vector machines, formulated for binary classification and later extended to one-class classification and regression. Kumar et al. (2012) explored a binary classification framework for two-stage multiple kernel learning [6]. The distinct advantage of this framework is that it makes it easier to leverage research in binary classification and to develop scalable and robust kernel-based algorithms. Takeda et al. (2012) proposed a unified robust classification model that optimizes existing classification models such as SVM, the minimax probability machine and Fisher discriminant analysis.
It provides several benefits: well-defined theoretical results, extensions of existing techniques and clarified relationships among existing models. Yee and Haykin (2013) viewed pattern classification as an ill-posed problem [7]; it is therefore a prerequisite to develop a unified theoretical framework that classifies and solves ill-posed problems. Recent literature on classification frameworks has reported better results for binary-class datasets alone. For multiclass datasets, there is a lack of accuracy and robustness, so developing an efficient classification framework for multiclass datasets is still an open research problem.

III. METHODOLOGY

To classify the collected data, nine distinct data mining classification algorithms are applied: DT, BayesNet, Logistic, J48, NBT, Naive Bayes, PART, SMO and RBFN. The R tool is used here to test the analytical performance of these algorithms and produce exact results on the data sets. These nine classification methods are evaluated using parameters such as KS (Kappa Statistic), RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error). The Kappa statistic measures the observed accuracy of a classifier against its expected (chance) accuracy. Root Mean Squared Error measures the difference between observed and predicted values. Mean Absolute Error gives the average of the absolute errors; it is a measure of the difference between two continuous variables [8].

IV. RESULTS AND DISCUSSIONS

A. Classification on Bank dataset

The classification algorithms applied to the bank dataset using the R tool are tabulated below. The measurements compare the algorithms on KS, MAE and RMSE. Among these algorithms, J48, BayesNet and SMO show the highest Kappa statistic, MAE and RMSE respectively on the bank dataset.
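As a rough illustration of the three quality metrics named above, the following R sketch computes the Kappa statistic, MAE and RMSE from hypothetical observed and predicted values. The vectors are invented for illustration and are not taken from the paper's bank dataset.

```r
# Hypothetical observed vs. predicted class labels for the Kappa statistic
observed  <- factor(c("yes", "yes", "no", "no", "yes", "no"))
predicted <- factor(c("yes", "no",  "no", "no", "yes", "yes"))

# Kappa statistic: observed accuracy corrected for chance agreement
tab <- table(observed, predicted)
po  <- sum(diag(tab)) / sum(tab)                      # observed accuracy
pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # expected (chance) accuracy
kappa <- (po - pe) / (1 - pe)

# Hypothetical observed vs. predicted numeric scores for MAE and RMSE
y    <- c(1.0, 0.0, 0.5, 0.8)
yhat <- c(0.9, 0.2, 0.4, 1.0)
mae  <- mean(abs(y - yhat))          # average absolute difference
rmse <- sqrt(mean((y - yhat)^2))     # penalizes large errors more heavily
```

In practice these metrics would be computed on each classifier's predictions for the test portion of the bank dataset, and the tabulated values compared across algorithms.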

References

[1] Available at: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[2] Available at: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
[3] Saravanan, Sasithra, "Review on Classification Based on Artificial Neural Networks", International Journal of Ambient Systems and Applications, vol. 2, no. 4, 2014, pp. 11-18.
[4] Limère et al., "A classification model for firm growth on the basis of ambitions, external potential and resources by means of decision tree induction", Working Papers 2004027, University of Antwerp, Faculty of Applied Economics.
[5] Xu et al., "A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning", IEEE Transactions on Signal Processing, vol. 56, no. 12, December 2008, pp. 5891-5902.
[6] Kumar et al., "A Binary Classification Framework for Two-Stage Multiple Kernel Learning", Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.
[7] Andrew Secker, Matthew N. Davies et al., "An Experimental Comparison of Classification Algorithms for the Hierarchical Prediction of Protein Function", Expert Update (the BCS-SGAI Magazine), 9(3), 17-22, 2007.
[8] Available at: https://en.wikipedia.org/wiki/Mean_absolute_error



This work is licensed under a Creative Commons Attribution 3.0 Unported License.   
