Penyelesaian Masalah Ketidakseimbangan Data Melalui Teknik Oversampling dan Undersampling pada Klasifikasi Siswa Tidak Naik Kelas

Implementation of Data Imbalance Problems using Oversampling and Undersampling Techniques in the Classification of Students Not Upgraded

  • Anjas Aprihartha, Universitas Dian Nuswantoro
Keywords: C5.0, data mining, k-fold cross validation, oversampling, undersampling

Abstract

Data mining is the process of extracting patterns and knowledge from large datasets. Data can be obtained from databases, the web, or other information stores. Most data mining algorithms work best when the number of samples in each class is roughly equal. In classification problems, however, it is common for one class to contain significantly fewer observations than the others. This condition is called imbalanced data. Resampling techniques, which are divided into undersampling and oversampling, can be used to address it. This research applies oversampling and undersampling techniques, followed by classification with the C5.0 algorithm, to the case of classifying students who are not promoted to the next grade. Based on tests carried out on three different datasets, the C5.0 algorithm with k-fold cross validation works better on the dataset processed with random oversampling than on the original dataset or the dataset produced by random undersampling. This is indicated by the per-fold accuracy, which tends to be stable and consistent in the range of 93% to 97.6%.
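
The two resampling techniques named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy data, and the class labels "promoted" and "not_promoted" are hypothetical, and the C5.0 classifier and k-fold cross validation steps are not included. Random oversampling duplicates randomly chosen minority rows until every class matches the largest class; random undersampling randomly discards majority rows until every class matches the smallest class.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of larger classes down to the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Sample without replacement so no kept row is repeated.
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 18 promoted students, 2 not promoted.
rows = [[i] for i in range(20)]
labels = ["promoted"] * 18 + ["not_promoted"] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(labels), Counter(over_labels), Counter(under_labels))
```

With this toy data, oversampling produces 18 rows per class and undersampling produces 2 rows per class. Undersampling discards most of the majority class, which is one plausible reason a classifier trained on it could be less stable across folds.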

Published
2024-06-30