Data mining is the process of analyzing collections of stored data to identify and discover patterns through technical analysis. To evaluate the probability of future events, data mining offers numerous algorithms for segmenting data and making predictions over data sets. Classification is one of the most important mechanisms in data mining, and its main goal is to predict the target class accurately. Numerous classification techniques are currently in use for prediction; in this paper, several of them are implemented and compared on different data sets. Classification algorithms such as BayesNet, DT, J48, Logistic, Naïve Bayes, NBT, PART, and RBFN are implemented and compared using R software with different test data.
Index Terms— Data mining, R classification, Classification
comparison, Prediction, Accuracy, Quality Metrics.
I. INTRODUCTION
Data mining is the mechanism of identifying particular data sets within a large collection of data, using methods drawn from database systems, statistics, and machine learning. Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It works together with decision support systems to perform accurate prediction of multiple groups in the data. Before mining, initial preprocessing must be carried out, which includes data collection, data preparation, and result interpretation. The data mining process is divided into four stages: data cleaning, data integration, data selection, and data transformation. Several classification techniques are available in data mining, including machine learning based approaches, neural networks, and statistical procedure based approaches. Some of the classification algorithms are explained briefly here.
A. K-Nearest Neighbors Algorithm
K-nearest neighbors (k-NN) is a pattern recognition algorithm used for both classification and regression. The working principle of k-NN is that an object is assigned to the class chosen by the majority of its neighbors. When k-NN is used as a classifier, the output is a class membership; when it is used for regression, the output is the property value for the object. Two common variants are the 1-nearest neighbor classifier and the weighted nearest neighbor classifier, in which k-NN assigns larger weights to nearer objects. Even compared with the naive method, k-NN remains computationally tractable for large data sets.
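The majority-vote principle described above can be sketched in plain Python (an illustrative toy implementation with made-up data, not the paper's R setup):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters (hypothetical values for illustration).
train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([0.9, 1.1], "A"),
         ([5.0, 5.0], "B"), ([5.2, 4.9], "B"), ([4.8, 5.1], "B")]

print(knn_predict(train, [1.1, 0.9], k=3))  # "A"
print(knn_predict(train, [5.1, 5.0], k=3))  # "B"
```

A weighted variant would replace the raw vote count with weights that decrease with distance, as mentioned above.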
B. Naive Bayes Algorithm
The Naïve Bayes classifier belongs to the family of probabilistic classifiers and works with the help of Bayes' theorem. Naïve Bayes classifiers are highly scalable, and the algorithm works for both binary and multiclass problems. Two kinds of probabilities are involved: class probabilities and conditional probabilities. The probability of each class in the training dataset is the class probability, and the predicted class value is obtained from the conditional probabilities of each input given the class.
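The two kinds of probabilities can be estimated directly by counting, as in this minimal Python sketch for categorical features (illustrative data, no smoothing; the paper's experiments use R):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors P(c) and per-feature conditionals P(x_i | c)
    from categorical training data by simple counting."""
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}
    cond = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return priors, cond

def predict_nb(priors, cond, row):
    """Pick the class maximizing P(c) * prod_i P(x_i | c)."""
    def score(c):
        p = priors[c]
        for i, v in enumerate(row):
            counts = cond[(i, c)]
            p *= counts[v] / sum(counts.values())
        return p
    return max(priors, key=score)

# Hypothetical (outlook, wind) observations with a "play?" label.
rows   = [("sunny", "windy"), ("sunny", "calm"), ("rain", "windy"), ("rain", "calm")]
labels = ["no", "yes", "no", "yes"]
priors, cond = train_nb(rows, labels)
print(predict_nb(priors, cond, ("sunny", "calm")))  # "yes"
```

In practice a smoothing term (e.g. Laplace smoothing) is added so that an unseen feature value does not zero out the whole product.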
C. ANN Algorithm
An Artificial Neural Network (ANN) is a collection of connected nodes of many artificial neurons; it is a computational structure inspired by the function of biological neural networks. ANN is applied in areas such as fault detection, speech recognition, product inspection, machine translation, and social network filtering. A neural network processes data through three kinds of layers: an input layer, hidden layers, and an output layer. The input layer collects the raw inputs that are fed into the network. The weights between the input and hidden units determine the activity of each hidden unit, and the behavior of the output unit depends entirely on the hidden units and the weights between the hidden and output units.
II. RELATED RESEARCH ON CLASSIFICATION
Classification techniques are based on the inductive learning principle, which analyzes and finds patterns in a database. If the environment is dynamic, the model must be adaptive, i.e. able to learn and map efficiently. Limère et al. (2004) presented a model for firm growth based on decision tree induction. It gives interesting results and fits the model to economic data such as growth competence and resources, growth potential, and growth ambitions. Hoi et al. (2006) developed a novel framework for learning unified kernel machines from both labeled and unlabeled data. This framework encompasses semi-supervised learning, supervised learning, and active learning. A spectral kernel is also proposed that classifies the given labeled and unlabeled data efficiently. Xu et al. (2008) proposed a reproducing kernel Hilbert space (RKHS) framework for information-theoretic learning. The framework uses a symmetric nonnegative definite kernel function, the cross-information potential. Although this framework gives better results than previous RKHS frameworks, choosing an appropriate kernel function for a particular domain remains an open issue. Shilton and Palaniswami (2008) defined a unified approach to support vector machines, formulated for binary classification and later extended to one-class classification and regression. Kumar et al. (2012) explored a binary classification framework for two-stage multiple kernel learning. Its distinct advantage is that it makes it easier to leverage research in binary classification and to develop scalable and robust kernel-based algorithms. Takeda et al. (2012) proposed a unified robust classification model that optimizes existing classification models such as the SVM, the minimax probability machine, and Fisher discriminant analysis. It provides several benefits: well-defined theoretical results, extensions of existing techniques, and clarification of the relationships among existing models.
Yee and Haykin (2013) viewed pattern classification as an ill-posed problem; a prerequisite is to develop a unified theoretical framework that classifies and solves such ill-posed problems. Recent literature on classification frameworks has reported good results for binary-class datasets alone; for multiclass datasets, accuracy and robustness are lacking. Developing an efficient classification framework for multiclass datasets therefore remains an open research problem.
III. METHODOLOGY
To classify the collected data, nine distinct data mining classification algorithms are applied: DT, BayesNet, Logistic, J48, NBT, Naive Bayes, PART, SMO, and RBFN. The R tool is used here to test the analytical performance of these algorithms and produce exact results on the data sets. These nine classification methods are evaluated using parameters such as KS (Kappa Statistic), RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error). The Kappa statistic measures the observed accuracy of the classifier against the expected accuracy. Root Mean Squared Error quantifies the difference between observed and predicted values. Mean Absolute Error gives the average of the absolute errors and is a measure of the difference between two continuous variables.
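The three quality metrics can be computed directly from a vector of actual and predicted values. The following Python sketch (the paper's experiments use R; the input values here are illustrative, not the bank-dataset results) shows the standard formulas:

```python
import math
from collections import Counter

def kappa(actual, predicted):
    """Cohen's kappa: (observed accuracy - expected accuracy) / (1 - expected),
    where expected accuracy comes from the marginal class frequencies."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    ca, cp = Counter(actual), Counter(predicted)
    expected = sum(ca[c] * cp[c] for c in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def mae(actual, predicted):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Illustrative labels only (0/1 classes), not taken from the paper.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(kappa(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

A higher Kappa indicates better agreement beyond chance, while lower MAE and RMSE indicate smaller prediction errors.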
IV. RESULTS AND DISCUSSIONS
A. Classification on Bank dataset
The classification algorithms applied to the bank dataset using the R tool are tabulated below, together with their KS, MAE, and RMSE measurements. Among these algorithms, J48, BayesNet, and SMO obtain the highest Kappa statistic, MAE, and RMSE, respectively, in the bank dataset classification.
 Available at: http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
 Available at: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
 Saravanan, Sasithra, "Review on Classification Based on Artificial Neural Networks", International Journal of Ambient Systems and Applications, vol. 2, no. 4, 2014, pp. 11-18.
 Limère et al., "A classification model for firm growth on the basis of ambitions, external potential and resources by means of decision tree induction", Working Papers 2004027, University of Antwerp, Faculty of Applied Economics.
 Xu et al., "A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning", IEEE Transactions on Signal Processing, vol. 56, no. 12, December 2008, pp. 5891-5902.
 Kumar et al., "A Binary Classification Framework for Two-Stage Multiple Kernel Learning", in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.
 Andrew Secker, Matthew N. Davies et al., "An Experimental Comparison of Classification Algorithms for the Hierarchical Prediction of Protein Function", Expert Update (the BCS-SGAI Magazine), 9(3), 17-22, 2007.
 Available at: https://en.wikipedia.org/wiki/Mean_absolute_error