Feature Extraction Methods and Classification for Malware Incident News

EasyChair Preprint 11119
6 pages • Date: October 23, 2023

Abstract

Data mining has received considerable interest recently, including its application to unstructured data. One commonly discussed topic is automatic classification using machine learning methods. The large volume of data is the main obstacle to manual classification, yet many practitioners still have difficulty determining the right combination of feature extraction and classification methods. We therefore suggest combinations of methods that can produce better accuracy in text classification. This research compares several feature extraction methods: Bag-of-Words (BoW), Term Frequency - Inverse Document Frequency (TF-IDF), and Word2Vec, with a focus on the Skip-gram model. It also uses several classification methods: Support Vector Machine (SVM), Decision Tree, Logistic Regression, Gaussian Naive Bayes, K-Nearest Neighbor, Neural Network, Random Forest, and Doc2Vec. The dataset consists of 200 articles crawled from several web blogs, manually labeled and split into two classes: malware incident news and non-malware incident news. The quality of the dataset was also measured using the open-source Python library "cleanlab".

Keyphrases: Document Embedding, malware incident, text classification, text mining, web crawling
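
As a rough illustration of what comparing feature extraction methods under one classifier can look like, the sketch below builds BoW and TF-IDF vectors in plain Python and feeds them to a 1-nearest-neighbor classifier with cosine similarity. It is not the authors' pipeline: the four-sentence corpus, the function names, the smoothed IDF formula, and the choice of a 1-NN classifier are all assumptions made for illustration, and Word2Vec and the other classifiers from the abstract are left out.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the 200 crawled articles from the study are not reproduced here.
TRAIN = [
    ("ransomware attack encrypts hospital files", "malware"),
    ("new trojan malware steals banking passwords", "malware"),
    ("company releases quarterly earnings report", "non-malware"),
    ("city council approves new park budget", "non-malware"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bow_vector(tokens, vocab):
    """Bag-of-Words: raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def idf_weights(docs_tokens, vocab):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs_tokens)
    weights = []
    for term in vocab:
        df = sum(1 for doc in docs_tokens if term in doc)
        weights.append(math.log((1 + n) / (1 + df)) + 1)
    return weights


def tfidf_vector(tokens, vocab, idf):
    """TF-IDF: term counts scaled by the IDF weight of each term."""
    return [tf * w for tf, w in zip(bow_vector(tokens, vocab), idf)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Fit the vocabulary and IDF weights on the training documents only.
docs_tokens = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = sorted({term for doc in docs_tokens for term in doc})
idf = idf_weights(docs_tokens, vocab)


def classify(text, use_tfidf=True):
    """Return the label of the most similar training document (1-NN)."""
    def featurize(tokens):
        if use_tfidf:
            return tfidf_vector(tokens, vocab, idf)
        return bow_vector(tokens, vocab)

    query = featurize(tokenize(text))
    train_vecs = [featurize(doc) for doc in docs_tokens]
    best = max(range(len(train_vecs)), key=lambda i: cosine(query, train_vecs[i]))
    return labels[best]


if __name__ == "__main__":
    for flag in (False, True):
        name = "TF-IDF" if flag else "BoW"
        print(name, classify("trojan malware encrypts files", use_tfidf=flag))
```

In a real comparison, each feature extraction method would be paired with each classifier and scored on held-out data; this sketch only shows how the feature representation can be swapped while the classifier stays fixed.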