## Classification of Diabetes Data Set from Iraq via Different Machine Learning Techniques

IRAQI JOURNAL OF STATISTICAL SCIENCES

Volume 21, Issue 1, June 2024, Pages 170-189

Document Type: Research Paper

DOI: 10.33899/iqjoss.2024.183258

Authors

Dilshad Omar Altalabani*; Fevzi Erdogan

*Sulaimani Polytechnic University

Abstract

Diabetes has become one of the most prevalent diseases in Iraq and is listed among the leading causes of death. Machine learning supports effective information extraction by building predictive models from diagnostic medical datasets collected from diabetes patients in Iraq. In this study, we compared the performance of five machine learning classifiers: classification and regression trees (CART), support vector machines (SVM), random forests (RF), linear discriminant analysis (LDA), and K-nearest neighbors (KNN). We sought to design a model that predicts, with maximum accuracy, whether a person is diabetic, healthy, or expected to develop diabetes in the future, using accuracy and kappa as the two evaluation metrics. On the training data, the algorithms ranked, from highest to lowest accuracy: RF, CART, SVM, LDA, and KNN. The test data showed some differences, with the ranking SVM, RF, CART, LDA, and KNN. The training dataset refers to the samples used to construct the model, whereas the test dataset is used to evaluate the model's performance. Based on these assessment criteria, we chose the best machine learning approach for predicting diabetes mellitus in Iraq. All of the techniques above are evaluated on a supervised diabetes testing dataset, and the approach achieving the highest accuracy and kappa is regarded as the best option. The results show that the SVM and RF algorithms predicted diabetes most accurately.

Highlights

**Conclusion**
Comparing the five algorithms shows that the accuracy ranking on the training data is identical to the overall comparison, because the latter depends mainly on the training-data results: RF, CART, SVM, LDA, and KNN. The test data showed some differences in the ranking, with SVM, RF, CART, LDA, and KNN highest, respectively. The training dataset refers to the samples used to build the model, while the test (validation) dataset is used to check its performance. Building a model that captures the underlying data patterns is vital for long-lasting predictions that need little retraining. At their most basic, machine learning models are statistical equations evaluated at high speed over many data points, so statistical tests on the algorithms are essential for fine-tuning them and verifying whether a model's equation best fits the dataset at hand. We often generate multiple viable models in a machine learning project, each with its own performance profile. Using resampling techniques such as cross-validation, we can estimate how accurate each model is on unseen data and use those estimates to choose the one or two best models. This research contributes to past studies in this field of knowledge, which must be conducted across various sectors, with a focus on authentic data in all areas, notably health, because it is directly related to human life. It is vital to emphasize the importance of data at its sources and to urge governments to open data centers, particularly in countries such as Iraq.
In this pilot project, we used data from scientific research in Baghdad, Iraq's capital, and five machine learning algorithms. In the future, we hope data will become available across disciplines so that we can give future generations a better chance at healthy living. After carefully considering the assessment criteria above, we selected the machine learning methodology that best predicts diabetes mellitus in Iraq. The techniques above were evaluated on a supervised diabetes testing dataset; the optimal choice is the technique attaining the highest accuracy and kappa. The findings indicate that the Support Vector Machine (SVM) and Random Forest (RF) algorithms predicted diabetes most accurately. The study's findings might help healthcare providers in Iraq detect diabetes earlier and make better clinical decisions to control it, potentially saving lives. Our future work will consider and evaluate new features for further investigation.

Keywords

Machine Learning; Classification; Diabetes

Full Text

International Diabetes Federation (IDF) (2017) data show that hundreds of millions of people live with diabetes worldwide, and diabetes now routinely tops lists of the leading causes of death. According to World Health Organization (WHO) (2018) data, diabetes prevalence has increased rapidly over the past 30 years, especially in low- and middle-income countries. Khanam and Simon (2021) state that diabetes identification is one of the most difficult challenges in healthcare. Baran (2020) notes that the rapid growth of data sources lends diversity and importance to machine learning studies, and advancing technology has led to multi-label classification for ever-larger datasets. The focus of this study is to develop prediction models using diagnostic and interventional datasets from diabetic patients in Iraq. We employed various machine learning techniques, considering their features and performance, and compared them to obtain the best disease prediction, exploring multiple supervised learning algorithms in the R programming language. Our study applies machine learning classification algorithms to predict the likelihood of diabetes. We evaluated all algorithms across multiple measures and found that the Support Vector Machine and Random Forest classifiers achieved the highest accuracy.
**Materials and methods**
The data for this study was collected from the laboratories of Medical City Hospital and the Specialized Centre for Endocrinology and Diabetes at Al-Kindy Teaching Hospital in Baghdad, Iraq's capital, and contains 1000 records of diabetes patients of all ages. The attributes of the dataset are: Gender, AGE, Urea, Cr (creatinine ratio), HbA1c (haemoglobin A1c test), Chol (cholesterol), TG (triglycerides), HDL (high-density lipoprotein, or good cholesterol), LDL (low-density lipoprotein, sometimes called bad cholesterol), VLDL (very low-density lipoproteins), BMI (body mass index), and CLASS (Diabetic = P, Non-Diabetic = N, Predicted-Diabetic = Y). The dataset is publicly available at https://data.mendeley.com/datasets/wj9rwkp9c2/1.
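The study's analysis was done in R; as an illustrative stand-in, the sketch below builds a frame with the attribute names listed above and performs the 80/20 train/test split used throughout the study, in Python with pandas and scikit-learn. All values here are synthetic placeholders, not the real Mendeley data.

```python
# Illustrative sketch of the dataset layout and the 80/20 split (Python
# stand-in for the paper's R workflow). Values are synthetic, not patient data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000  # the real dataset holds 1000 patient records
df = pd.DataFrame({
    "Gender": rng.choice(["F", "M"], n),
    "AGE":    rng.integers(20, 80, n),
    "Urea":   rng.uniform(1, 10, n),
    "Cr":     rng.uniform(40, 120, n),
    "HbA1c":  rng.uniform(4, 12, n),
    "Chol":   rng.uniform(3, 8, n),
    "TG":     rng.uniform(0.5, 4, n),
    "HDL":    rng.uniform(0.5, 3, n),
    "LDL":    rng.uniform(1, 5, n),
    "VLDL":   rng.uniform(0.2, 2, n),
    "BMI":    rng.uniform(18, 40, n),
    "CLASS":  rng.choice(["N", "P", "Y"], n, p=[0.10, 0.05, 0.85]),
})
X = df.drop(columns=["CLASS", "Gender"])  # ten numeric predictors
y = df["CLASS"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)
```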
**Methods**
Supervised machine learning algorithms give critical, high-accuracy results for prediction (Rebala et al., 2019). Classification and Regression Trees (CART), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), K-Nearest Neighbors (K-NN), and Random Forest (RF) were utilized.
**R Studio**
R is an open-source statistical computing and graphics language that supports various statistical methods like linear and nonlinear modeling, statistical testing, time series analysis, classification, and clustering. Ramasubramanian and Singh (2019) showed that it is easy to learn and robust, making it suitable for academics and those with little programming experience. R's easy calculation of statistical features makes it a popular tool for data analysts and statisticians.
**Model Diagram**
Figure 1 below presents the architecture of the proposed approach as a model diagram.
**Classification and Regression Trees (CART)**
The first algorithm we modeled was CART (Breiman et al., 1984), fitted for the response variable CLASS, which consists of three classes: N (Non-Diabetic), P (Diabetic), and Y (Predicted-Diabetic), against all other independent variables on the training data. The data were divided into 80% training and 20% test. As Figure 2 shows, the resulting tree has 15 nodes, with the root at the top, the leaves at the bottom, and the most influential variable in the prediction model (BMI) at the root. For a new patient whose BMI at node 1 is less than 25, we go to node 2 and the next most influential variable, HbA1c. If HbA1c is less than 5.6, we go to node 3 and the Chol variable; if Chol is less than 4.9, we go to node 4 and the BMI variable again; if BMI is less than 23, we reach a terminal (decision) node where the response falls in class N with probability up to 1; otherwise, if BMI is greater than 23, it falls in class N with probability about 0.9. At node 7, if TG is less than 1.9, we go to node 8 and class N with probability above 0.6; otherwise class Y with probability up to 1. At node 10, if HbA1c is less than 6.4, we go to node 11 and class P with probability up to 1; otherwise class Y with probability up to 1. At node 13, if AGE is less than 43, we go to node 14 and class Y with probability above 0.8; otherwise class Y with probability up to 1. On the training data we obtained a model with an accuracy of 0.9823 and a kappa of 0.9381.
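A minimal CART sketch of the step above, using scikit-learn's decision tree as a stand-in for R's tree fitting, on synthetic three-class data; it reports accuracy and Cohen's kappa, the two metrics used throughout the paper. Data and parameters are illustrative.

```python
# CART sketch: fit a decision tree on synthetic three-class data with an
# 80/20 split and score it with the paper's two metrics, accuracy and kappa.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)

cart = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = cart.predict(X_te)
acc = accuracy_score(y_te, pred)       # share of correct classifications
kappa = cohen_kappa_score(y_te, pred)  # agreement corrected for chance
```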
The confusion matrix gives further detail: 778 correct classifications out of 792 (88 N, 35 P, and 655 Y) versus 14 misclassifications (10 in N and 4 in Y). The model was then compared against the test data to determine its accuracy: 0.976, with a kappa of 0.905 and a confidence interval of 0.9448 to 0.9921, considered high accuracy. The test-set confusion matrix shows 203 correct classifications out of 208 (15 N, 12 P, and 176 Y) versus five misclassifications (3 in N and 2 in Y). The confusion matrices for the training and test data are shown in Table 1.
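To see how accuracy and kappa fall out of a confusion matrix, the counts reported for the CART test set above (203 of 208 correct: 15 N, 12 P, 176 Y; 5 misses: 3 in N, 2 in Y) can be recomputed by hand. The exact off-diagonal placement below is an assumption, since the text reports only totals per class.

```python
# Accuracy and Cohen's kappa from a confusion matrix (rows = true class,
# columns = predicted class). Off-diagonal placement is assumed for illustration.
import numpy as np

cm = np.array([[15, 0, 3],    # true N: 15 correct, 3 missed
               [0, 12, 0],    # true P: 12 correct
               [2, 0, 176]])  # true Y: 176 correct, 2 missed
total = cm.sum()
po = np.trace(cm) / total                           # observed agreement = accuracy
pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / total**2   # agreement expected by chance
kappa = (po - pe) / (1 - pe)
# po ≈ 0.976 and kappa ≈ 0.905, matching the values reported in the text.
```

Because class Y dominates, chance agreement pe is high (about 0.75), which is why a 0.976 accuracy corresponds to the noticeably lower kappa of 0.905.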
**Support Vector Machines (SVM)**
The second algorithm we modeled was SVM (Rebala et al., 2019). In the simple case of a feature vector x, a linear classifier creates a hyperplane: points on one side of it fall in class 1, while all others fall in class 2. This linear form assumes the feature vectors are linearly separable. In Figure 3, the red lines a, b, and c represent valid class boundaries: all of them correctly divide the data points into two groups, but line b gives both classes the most room to maneuver. SVM seeks the boundary that maximizes the margin to the data points; the points closest to lines a and c are the support vectors that define the class boundary. Applying the SVM algorithm, we obtained a model with an accuracy of 0.954 and a kappa of 0.825. In the confusion matrix, 954 out of 1000 were classified correctly (96 N, 29 P, and 830 Y), while the misclassifications were 17 N, 3 P, and 26 Y. We then used hyperparameter optimization (tuning) to select the best model, tuning the cost parameter that penalizes constraint violations. Comparing the optimized model against the test data, we obtained an accuracy of 0.999 and a kappa of 0.996, considered near-perfect. Applying the model to the entire dataset, the confusion matrix shows 999 correct classifications out of 1000 (103 N, 53 P, and 843 Y) versus one misclassification in the P class. As shown in Table 2, based on the confusion matrix, misclassification error, and accuracy after tuning, this model is considered the best.
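The cost-tuning step can be sketched with scikit-learn's grid search as an analogue of the R tuning function used in the paper: cross-validate the cost parameter C over a grid of powers of two (the study's best cost, 256, is such a power) and keep the best. The data and grid below are illustrative.

```python
# SVM cost tuning sketch: 10-fold cross-validated grid search over C.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
c_grid = [2**k for k in range(-2, 9)]  # 0.25 ... 256
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": c_grid},
                    cv=10, scoring="accuracy").fit(X, y)
best_C = grid.best_params_["C"]    # cost minimizing cross-validated error
best_acc = grid.best_score_        # mean accuracy at that cost
```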
The darker sections in Figure 4 indicate better outcomes, i.e., lower misclassification error across the cost and epsilon values tried. The tuning summary for the support vector machine model gives the following: sampling method, 10-fold cross-validation; best parameters, epsilon = 0 and cost = 256; best performance error, 0.042, as shown in Table 3.
**Random forest (RF)**
The bagging method, first described by Breiman, is a historical predecessor of the random forest, a machine learning algorithm used for classification and regression. As Genuer and Poggi (2020) illustrate, random forests, consisting of ensembles of decision trees, have excellent predictive capacity and versatility, making them widely used in many applications. They allow simultaneous use of qualitative and quantitative explanatory variables without preprocessing, making them suitable for analyzing both traditional data with many observations and high-dimensional data with many variables. As a result, statisticians and data scientists increasingly favor random forests. Random forest is also attractive because it takes less time to train than many other algorithms, predicts with excellent accuracy and speed even on vast datasets, and can maintain accuracy even when significant portions of the data are missing. The random forest is formed in two phases: the first combines N decision trees to build the forest, and the second makes predictions from each tree created in the first phase.
The third algorithm we modeled was RF. We divided the data into 80% training and 20% test and fitted the model with the random forest function. Next to the confusion matrix, a class-error column reports the error for each class: applying the model to the training data gives errors of 0.0235, 0.1220, and 0.0059 for classes N, P, and Y, respectively. The overall statistics on the training data are perfect, with accuracy 1 and kappa 1; the training confusion matrix shows 802 correct classifications out of 802 (85 N, 41 P, and 676 Y) and no misclassifications. Applying the RF model to the test data and comparing predictions with the true labels yields an accuracy of 0.9899 and a kappa of 0.9623; the test confusion matrix shows 196 correct classifications out of 198 (18 N, 11 P, and 167 Y), with just two misclassifications (1 in N and 1 in Y), as shown in Table 4. Plotting the RF model shows that as the number of trees grows, the out-of-bag (OOB) error initially drops and then becomes more or less constant; it cannot be improved beyond about 400 trees, as shown in Figure 6. To minimize the OOB error we optimized this parameter with the tuneRF function, which returned an optimal value of 6, as shown in Figure 7. A variable-importance plot identifies which variables play a vital role in the model; the importance of every variable in the random forest, in terms of mean decrease in accuracy and mean decrease in Gini, is shown in Table 7 and Figure 8.
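The steps above can be sketched with scikit-learn's random forest: fit with out-of-bag scoring enabled and inspect variable importance (impurity-based importance plays the role of R's mean decrease in Gini; scikit-learn does not report mean decrease in accuracy directly). Data and parameters are illustrative.

```python
# Random forest sketch: OOB error estimate and impurity-based importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
rf = RandomForestClassifier(n_estimators=400,  # roughly where OOB error flattens
                            oob_score=True, random_state=0).fit(X, y)
oob_error = 1 - rf.oob_score_            # out-of-bag error estimate
importance = rf.feature_importances_     # analogue of mean decrease in Gini
top_feature = int(np.argmax(importance)) # most influential predictor's index
```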
**Linear Discriminant Analysis (LDA)**
The fourth algorithm we modeled was LDA. In general, discriminant or classification strategies aim to group samples based on predictor features; how this is accomplished varies by methodology and follows a mathematical path. As Kuhn and Johnson (2016) illustrate, the roots of linear discriminant analysis date back to Fisher in 1936 and Welch in 1939. Using a single characteristic, as James et al. (2021) describe, LDA seeks to:
- Maximize the difference between the classes' means.
- Minimize the variation within each class.
Linear discriminant analysis (LDA) is a statistical technique for classifying subjects into distinct, precisely defined groups. It finds the linear combination of the original variables that best separates the groups; the essential purpose is to estimate the relationship between a single categorical dependent variable and a set of independent variables. Here, discriminant analysis finds the linear combination of the ten variables that gives the best possible separation among the three classes. We randomly partitioned the dataset into training and test sets by sampling with replacement, then fitted the model with the LDA function, with the response variable CLASS as a function of the other ten variables on the training data. The table below gives the coefficients of the linear discriminants: each discriminant function is a linear combination of the ten variables, with scaled coefficients LD1 and LD2 for every variable, shown in Table 6 and Figure 10. The proportion of trace tells us the percentage of separation achieved: the first discriminant function achieves 0.9785, which is exceptionally high, while the second achieves only 0.0215, indicating it contributes little additional separation. The dataset was partitioned into a training subset (80%) and a test subset (20%). The fitted model exhibited a misclassification error rate of 0.1014, an accuracy of 0.898, and a kappa coefficient of 0.64.
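As an illustrative analogue of the R output above: with three classes there are at most two discriminant functions, and scikit-learn's `explained_variance_ratio_` plays the role of R's "proportion of trace", the per-function share of between-class separation. Synthetic data stands in for the real dataset.

```python
# LDA sketch: two discriminant functions for a three-class problem, with the
# per-function share of between-class separation ("proportion of trace").
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)
trace_prop = lda.explained_variance_ratio_  # separation share of LD1, LD2
coefs = lda.scalings_                       # coefficients of the discriminants
```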
In the training confusion matrix, 718 out of 799 were classified correctly (70 N, 13 P, and 635 Y), while the misclassifications were 33 N, 18 P, and 30 Y. Comparing the model against the test data, we obtained a misclassification error of 0.0697, an accuracy of 0.93, and a kappa of 0.75, considered outstanding accuracy. The numbers on the diagonal indicate correct classifications, while the off-diagonal numbers indicate misclassifications: 187 correct classifications out of 208 (20 N, 3 P, and 164 Y) versus 14 misclassifications (4 in N, 6 in P, and 4 in Y), as shown by the confusion matrix in Table 7 below.
The fifth algorithm we modeled was K-NN (Breiman et al., 1984; Rebala et al., 2019). We partitioned the dataset into two independent samples with replacement: 80% for training (792 observations) and 20% for testing (208 observations). A majority vote of its neighbors determines the classification of an item; if k = 1, the item is assigned to the class of its single nearest neighbor. The outcome depends on whether k-NN is used for classification or regression, and the method has widespread application in the medical field. Before fitting the k-nearest neighbor model, we performed resampling to find the best model, choosing the tuning parameter values with trainControl, which specifies the resampling scheme in the caret package. We used repeated cross-validation with ten folds and three repeats, meaning the complete set of folds is repeated three times, and we set a seed so the outcome is repeatable. We fitted the model on the training data with CLASS as the response against all the independent variables: HbA1c, BMI, AGE, TG, VLDL, Chol, Urea, Cr, HDL, and LDL. The output shows 792 samples, ten predictors, and three classes. In the resampling cross-validation, each training set is split into ten folds: nine are used to build the model and the remaining one to assess it, repeated three times. The model's accuracy and kappa values were then listed for various values of k.
Accuracy was used to select the optimal model, with the largest value obtained at k = 7, as shown in Table 8 and Figure 12. The variable-importance values are scaled between 0 and 100, and we can see that HbA1c and BMI are the most important variables, while LDL turns out to be the least important; Table 9 displays the variables in order of importance. The data was divided into 80% training and 20% test, and the model obtained from the training data had a classification error of 0.094, an accuracy of 0.906, and a kappa of 0.658, considered good accuracy. The training confusion matrix shows 718 correct classifications out of 792 (59 N, 19 P, and 640 Y) versus 74 misclassifications (27 in N, 14 in P, and 33 in Y). Comparing the model against the test data, we obtained a classification error of 0.101, an accuracy of 0.899, and a kappa of 0.6187, considered good accuracy. The test confusion matrix shows 187 correct classifications out of 208 (11 N, 6 P, and 170 Y) versus 21 misclassifications (10 in N, 6 in P, and 5 in Y), as shown in Table 10.
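The caret-style tuning described above can be sketched with scikit-learn: repeated 10-fold cross-validation (3 repeats) over a small grid of k, keeping the k with the highest mean accuracy. The data and the k grid are illustrative.

```python
# KNN tuning sketch: repeated 10-fold CV (3 repeats) to choose k by accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
ks = [3, 5, 7, 9, 11]
mean_acc = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X, y, cv=cv, scoring="accuracy").mean()
            for k in ks}
best_k = max(mean_acc, key=mean_acc.get)  # k with highest mean CV accuracy
```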
**Comparison of Machine Learning Algorithms**
The goal of comparing machine learning algorithms is to discover their strengths in delivering better results: predictive models that last, are easy to evaluate, and provide solid, multiple statistical indicators. Demonstrating these practical qualities is what makes comparing machine learning algorithms so important. There are several key advantages to successfully comparing multiple trials. The fundamental goal of model comparison and selection is improved machine learning performance; the other goal is identifying the algorithms that best meet the data and business needs. At their most basic, machine learning models are statistical equations evaluated at high speed over many data points, so statistical tests on the algorithms are crucial for fine-tuning them and determining whether a model's equation best matches the dataset at hand. In the previous sections, we presented the machine learning algorithms and how the data was processed, gave an idea of the data structure and characteristics, and applied five important algorithms to analyze and graphically represent the data, showing the essential indicators for predicting whether a person has diabetes, does not, or is expected to develop it. With each algorithm's results in hand, we now compare the five algorithms directly. We frequently end up with several good models in a machine learning project, each with its own performance characteristics; resampling approaches such as cross-validation let us evaluate how accurate each model is on unseen data. We should examine the predicted accuracy of our machine learning algorithms in various ways before selecting one or two final models.
We can do this by displaying the average accuracy, variance, and other aspects of the distribution of model accuracies using various visualization approaches. We compare the models using repeated cross-validation with ten folds and three repetitions, a popular standard design, with accuracy and kappa as the assessment metrics. After training, the models are added to a list and resamples is run on that list; this function checks that the models are comparable and were trained with the same trainControl configuration, and the resulting object holds the evaluation metrics for each fold and repeat of each method. Table 11 presents the collected results of the comparison between models using repeated cross-validation with ten folds and three repetitions. For example, in repeats 1 to 3 of fold 1, the RF, CART, and SVM models have the best accuracy, respectively, and checking the table extensively shows this pattern repeated throughout. Table 12, Figure 13, Figure 14, and Figure 15 summarize the results of the five algorithms under the comparison function: the accuracy ranking on the training data is identical to the overall comparison, because the latter depends mainly on the training-data results (RF, CART, SVM, LDA, and KNN), while the test data showed some differences, with SVM, RF, CART, LDA, and KNN highest, respectively.
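The comparison step can be sketched as an analogue of caret's resamples(): evaluate all five classifiers on the *same* repeated 10-fold splits so their per-fold accuracies are directly comparable. Models, settings, and data below are illustrative.

```python
# Model comparison sketch: score five classifiers on identical CV splits.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
models = {"CART": DecisionTreeClassifier(random_state=0),
          "SVM":  SVC(),
          "RF":   RandomForestClassifier(random_state=0),
          "LDA":  LinearDiscriminantAnalysis(),
          "KNN":  KNeighborsClassifier()}
# 30 per-fold accuracies (10 folds x 3 repeats) per model, on shared splits.
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="accuracy")
          for name, m in models.items()}
ranking = sorted(scores, key=lambda n: scores[n].mean(), reverse=True)
```

Fixing the random state of the fold generator is what makes the comparison fair: every model sees exactly the same 30 train/test partitions.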

References

- International Diabetes Federation (2017). Chapter 3: The global picture. In: *Diabetes Atlas*, 8th ed. Brussels, Belgium: International Diabetes Federation. https://idf.org/e-library/epidemiology-research/diabetes-atlas/134-idf-diabetes-atlas-8th-edition.html. Accessed March 2019.
- World Health Organization (2018). Diabetes. Geneva, Switzerland: World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes. Updated October 30, 2018. Accessed March 2019.
- Mansour, A.A., Al-Maliky, A.A., Kasem, B., Jabar, A., Mosbeh, K.A. (2014). Prevalence of diagnosed and undiagnosed diabetes mellitus in adults aged 19 years and older in Basrah, Iraq. Diabetes, Metabolic Syndrome and Obesity, 7, 139-144.
- Khanam, J.J., Simon, Y.F. (2021). Comparison of machine learning algorithms for diabetes prediction. ICT Express, 7, 432-439.
- Baran, M. (2020). Classification of multi-label data with machine learning methods (Master's thesis). Sivas Cumhuriyet University, Sivas.
- Alan, A. (2020). Evaluation of performance metrics and test techniques on various data sets in machine learning classification methods (Master's thesis). Firat University, Fen Bilimleri Enstitusu, Elazig.
- Parthiban, G., Srivatsa, S.K. (2012). Applying machine learning methods in diagnosing heart disease for diabetic patients. International Journal of Applied Information Systems (IJAIS), 3(7), 25-30.
- Keskin, A.K. (2018). Investigation of machine learning classification algorithms (Master's thesis). Sinop University, Fen Bilimleri Institute, Sinop.
- Nahzat, S., Yaganoglu, M. (2021). Diabetes prediction using machine learning classification algorithms. European Journal of Science and Technology, (24), 53-59.
- Rebala, G., Ravi, A., Churiwala, S. (2019). *An Introduction to Machine Learning*. Springer, Cham, Switzerland. pp. 9-11, 58-91.
- Ramasubramanian, K., Singh, A. (2019). Machine Learning Using R: With Time Series and Industry-Based Use Cases in R, 2nd ed. Apress Media, California, USA. p. 3.
- Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees, 1st ed. Chapman & Hall/CRC, USA. pp. 15-17, 41.
- Genuer, R., Poggi, J. (2020). Random Forests with R (Use R! series), 1st ed. Springer Nature Switzerland, Cham. pp. 10-12, 43-107.
- Suthaharan, S. (2016). Machine Learning Models and Algorithms for Big Data Classification, Vol. 36. Springer Science & Business Media, New York, USA. p. 7.
- Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C., Haussler, D. (1999). *Support Vector Machine Classification of Microarray Gene Expression Data*. Department of Computer Science & Biology, University of California; Department of Engineering Mathematics, University of Bristol, Bristol, UK. pp. 1-10.
- Kuhn, M., Johnson, K. (2016). *Applied Predictive Modeling*, 5th ed. Springer Science+Business Media, New York, USA. pp. 275-300.
- Croux, C., Filzmoser, P., Joossens, K. (2008). Classification efficiency for robust linear discriminant analysis. Statistica Sinica, 18(1), 581-599.
- James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). *An Introduction to Statistical Learning with Applications in R*, 2nd ed. Springer, New York, USA. pp. 132-153.
- Kramer, O. (2013). Dimensionality Reduction with Unsupervised Nearest Neighbors, Vol. 51. Springer, USA. pp. 14-15.
