Penyelesaian Masalah Ketidakseimbangan Data Melalui Teknik Oversampling dan Undersampling pada Klasifikasi Siswa Tidak Naik Kelas
Solving the Data Imbalance Problem with Oversampling and Undersampling Techniques in the Classification of Students Not Promoted to the Next Grade
Abstract
Data mining is the process of extracting patterns and knowledge from large datasets, which may come from databases, the web, or other information stores. Most data mining algorithms work best when each class has roughly the same number of samples. In classification problems, however, it is common for one class to have significantly fewer observations than the others. This is called imbalanced data. Resampling techniques can be used to address the imbalance, and they fall into two types: undersampling and oversampling. This study applies oversampling and undersampling techniques, followed by classification with the C5.0 algorithm, to the case of classifying students who are not promoted to the next grade. In tests on three datasets (the original dataset, a randomly oversampled dataset, and a randomly undersampled dataset), the C5.0 algorithm with k-fold cross-validation performed better on the randomly oversampled dataset than on the other two. This is indicated by the per-fold accuracy, which tended to be stable and consistent, ranging from 93% to 97.6%.
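The two resampling techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `random_oversample` and `random_undersample`, the class labels, and the toy 95/5 class split are assumptions made for illustration, and the C5.0 classification and k-fold cross-validation steps are left out.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling with replacement, so a row may be duplicated repeatedly.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Randomly discard rows of the larger classes until every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling without replacement, so kept rows are distinct.
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 95 promoted students, 5 not promoted.
rows = [[i] for i in range(100)]
labels = ["promoted"] * 95 + ["not_promoted"] * 5

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)

over_counts = Counter(over_labels)    # both classes now have 95 rows
under_counts = Counter(under_labels)  # both classes now have 5 rows
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows and leaves a much smaller dataset. That difference in retained information is one plausible reason the two techniques can lead to different classifier accuracy.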