Class-Oriented Text Vectorization for Text Classification: Case Study of Job Offer Classification

Ghislain Wabo Tatchum(1), Armel Jacques Nzekon Nzeko'o(2), Fritz Sosso Makembe(3), Xaviera Youh Djam(4)


(1) Department of computer sciences, University of Yaounde I, Yaounde, BP 812
(2) Sorbonne University, IRD, UMI 209 UMMISCO Bondy, F-93143 France, and University of Yaounde I, Yaounde, BP 812
(3) University of Yaounde I, Yaounde, BP 812
(4) University of Yaounde I, Yaounde, BP 812

Abstract


Advances in data science have made it possible to solve many real-life problems with automatic text classification. One example is e-recruitment, where job offers are classified and recommended to jobseekers. In natural language processing, text classification involves a vectorization step in which each document is represented as a vector whose coordinates are each linked to a keyword. These keywords are obtained by vectorizing the entire corpus and serve to distinguish one document from another. However, it is preferable for each keyword to distinguish one class from another. To obtain such keywords, we take the class of each document into account during vectorization: we first create a class-oriented document for each class by merging all documents from that class, and then apply a vectorization algorithm. Experiments are carried out on datasets from Minajobs, Nigham, and Monster with four classification models: Decision Tree (DT), Naive Bayes (NB), Support Vector Machine (SVM), and a deep neural network self-attention transformer (TFM). The vectorization methods applied to the class-oriented documents are Doc2Vec and TF-IDF, combined with our class-oriented vectorization strategies, including OC, ZIPF, and OWDC. We evaluate the experiments with the precision, MAP, and F1-score metrics. According to the results, the TFM methods can improve accuracy by 29, 40, and 33% compared to previous work and to the traditional way of classifying text documents. The NB methods can improve accuracy by 19, 22, and 20%, while the DT methods can improve accuracy by 34, 37, and 34%. The SVM methods can improve accuracy by 33, 34, and 34% on the Monster, Nigham, and Minajobs datasets. In addition, we validate our contribution by comparing our approach with three other works in the literature on four datasets (RE'16, Wap, WebKB, and Kla), obtaining improvements in accuracy and F1-score of up to 55%.
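The class-oriented step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce the OC, ZIPF, or OWDC strategies: it only merges documents by class and applies a plain TF-IDF in which each class-oriented document counts as one document. The function names, the whitespace tokenizer, the unsmoothed IDF formula, and the toy job offers are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_class_documents(docs, labels):
    """Merge all documents that share a label into one class-oriented document."""
    merged = defaultdict(list)
    for text, label in zip(docs, labels):
        # Naive lowercase whitespace tokenizer (an assumption of this sketch).
        merged[label].extend(text.lower().split())
    return dict(merged)


def class_tfidf(class_docs):
    """TF-IDF in which each 'document' is a whole class, so IDF counts classes."""
    n_classes = len(class_docs)
    class_freq = Counter()  # number of classes in which each term occurs
    for tokens in class_docs.values():
        class_freq.update(set(tokens))
    weights = {}
    for label, tokens in class_docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Unsmoothed variant: tf = count / length, idf = log(N / class frequency).
        weights[label] = {
            term: (count / total) * math.log(n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return weights


def top_keywords(weights, k=2):
    """Keep the k highest-weighted terms of each class (ties broken alphabetically)."""
    return {
        label: sorted(w, key=lambda t: (-w[t], t))[:k]
        for label, w in weights.items()
    }


def vectorize(text, vocabulary):
    """Represent one document as keyword counts over the selected vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Toy job offers, invented for illustration only.
docs = [
    "python developer backend python experience",
    "java developer backend experience",
    "nurse hospital care experience",
    "nurse clinic care patient experience",
]
labels = ["it", "it", "health", "health"]

class_docs = build_class_documents(docs, labels)
weights = class_tfidf(class_docs)
keywords = top_keywords(weights, k=2)
vocabulary = sorted({term for terms in keywords.values() for term in terms})
vectors = [vectorize(text, vocabulary) for text in docs]
```

In this toy corpus the word "experience" occurs in both classes, so its class-level IDF is zero and it is never selected as a keyword, whereas a corpus-level TF-IDF would only discount it relative to rarer words. That is the intuition behind preferring keywords that separate classes rather than individual documents.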


Keywords


Supervised machine learning; Natural Language Processing (NLP); Text classification; Class-oriented text vectorization; Pre-processing



DOI: https://doi.org/10.36596/jcse.v5i2.870



Journal of Computer Science and Engineering (JCSE)
ISSN 2721-0251 (online)
Published by : ICSE (Institute of Computer Sciences and Engineering)
Website : http://icsejournal.com/index.php/JCSE/
Email: jcse@icsejournal.com

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.