Ontology-Driven Scientific Literature Classification using Clustering and Self-Supervised Learning
EasyChair Preprint 7288
19 pages • Date: January 5, 2022
Abstract
The rapid growth of scientific literature in computer engineering (CE) and computer science (CS) makes it difficult for researchers to explore publication records by standard scientific categories. This creates a need for automatic classification of text documents into scientific categories using content and contextual information. Document classification is a significant application of supervised learning, which requires a labeled data set for training the classifier. However, the research publication records available on Google Scholar and dblp are not labeled. First, manual annotation of a large body of scientific research work based on standard scientific terminology requires domain expertise and is extremely time-consuming. Second, hierarchical labeling of records facilitates more effective and context-aware retrieval of documents. In this paper, we propose an ontology-driven classification technique based on zero-shot learning in conjunction with agglomerative clustering to automatically label a scientific literature data set related to CE and CS. We study and compare the effectiveness of multiple text classifiers: logistic regression, support vector machines (SVM), and gradient boosting with Word2vec and bag-of-words (BOW) embeddings; recurrent neural networks (RNN) with GloVe embeddings; and feed-forward neural networks with BOW embeddings. Our study shows that the RNN with GloVe embeddings outperforms the other models, with an F1 score above 0.85 at all granularity levels. Our proposed technique will help junior and experienced researchers identify emerging technologies and domains for their research.
Keyphrases: Granularity Level, Hierarchical Document Classification, Machine Learning Application, Natural Language Processing, document classification, machine learning, scientific literature, self-supervised learning, text classification, unsupervised learning
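
To make the labeling idea in the abstract more concrete, the following is a minimal sketch, not the authors' pipeline. It groups a few paper titles with average-linkage agglomerative clustering over bag-of-words vectors, then assigns each cluster the best-matching category from a toy ontology. The titles, the `ONTOLOGY` dictionary, the keyword-overlap scoring, and all function names are assumptions made for illustration; the paper itself uses zero-shot learning for the labeling step, which is replaced here by simple keyword overlap.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy ontology (category -> keywords); not taken from the paper.
ONTOLOGY = {
    "Machine Learning": {"neural", "learning", "classifier", "training"},
    "Computer Networks": {"routing", "packet", "wireless", "protocol"},
}

def bow(text):
    """Bag-of-words vector: lowercase token -> count."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of item indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters at each step.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def label_cluster(cluster, vectors):
    """Pick the ontology category whose keywords occur most often in the cluster."""
    merged = Counter()
    for i in cluster:
        merged.update(vectors[i])
    return max(ONTOLOGY, key=lambda cat: sum(merged[k] for k in ONTOLOGY[cat]))

# Made-up example titles standing in for unlabeled publication records.
titles = [
    "training a neural classifier with self-supervised learning",
    "neural learning for text classifier training",
    "routing protocol for wireless packet delivery",
    "packet routing in wireless sensor protocol design",
]
vectors = [bow(t) for t in titles]
clusters = agglomerative(vectors, n_clusters=2)
labels = [label_cluster(c, vectors) for c in clusters]

for cluster, label in zip(clusters, labels):
    print(label, [titles[i] for i in cluster])
```

In a setup like the one the abstract describes, cluster labels produced this way could then serve as training targets for a supervised classifier; repeating the procedure per ontology level would be one way to obtain hierarchical labels, though the paper's exact procedure may differ.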