Selection of the Optimal Dataset to Improve the Efficiency of Skin Raman Spectra Identification
Ksenia E. Tomnikova¹, Irina A. Matveeva¹; ¹Samara University, Samara, Russia
Abstract
Analysis of skin Raman spectra is a challenging task, as spectral data contain extensive information about the chemical composition of the skin as well as background signals [1]. Various artificial intelligence methods are commonly used for the identification of Raman spectra [2,3]. Effective application of these methods requires careful preparation of training data. The aim of this work is to select the optimal dataset that achieves the highest classification efficiency of Raman spectra using a multilayer perceptron model.
Four types of datasets were considered. The first dataset consists solely of Raman spectra (spectral counts). A total of 615 Raman spectra of skin with various diseases were recorded. The study was conducted at the Samara Regional Clinical Oncology Dispensary. The second dataset consists of the relative concentrations of thirty components obtained by multivariate curve resolution (MCR) analysis of the Raman spectra. This data reduction method allows the resulting components to be interpreted physically and their contributions to the original Raman spectrum to be assessed. The third dataset is based on patient anamnesis, which plays an important role in the diagnosis of skin diseases: information on past illnesses, allergies, heredity, and lifestyle helps the physician form a comprehensive picture of the patient's condition. In this work, 15 risk factors were considered. The fourth dataset combines the second and third datasets and includes both the relative concentrations of the thirty components and all 15 risk factors.
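To illustrate how relative concentrations can be derived from a matrix of spectra, the following is a minimal sketch of MCR with alternating least squares and non-negativity constraints. It is not the authors' implementation: the function names, the random initialisation, the fixed iteration count, the row-wise normalisation used to obtain "relative" concentrations, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def mcr_als(D, n_components, n_iter=50, seed=0):
    """Factor D (spectra x channels) into non-negative C (spectra x components)
    and S (components x channels) so that D is approximated by C @ S."""
    rng = np.random.default_rng(seed)
    # Assumption: component spectra start from random non-negative values.
    S = rng.random((n_components, D.shape[1]))
    for _ in range(n_iter):
        # Least-squares concentrations for fixed spectra, clipped to be non-negative.
        C = np.clip(D @ np.linalg.pinv(S), 0.0, None)
        # Least-squares spectra for fixed concentrations, clipped to be non-negative.
        S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)
    return C, S

def relative_concentrations(C, eps=1e-12):
    """Assumption: 'relative' means each row of C is scaled to sum to 1."""
    return C / (C.sum(axis=1, keepdims=True) + eps)

# Synthetic stand-in data: 20 "spectra" mixed from 3 hidden components.
rng = np.random.default_rng(1)
true_S = rng.random((3, 40))
true_C = rng.random((20, 3))
D = true_C @ true_S

C, S = mcr_als(D, n_components=3)
rel = relative_concentrations(C)
print(rel.shape)  # one row of component fractions per spectrum
```

In the setting described above, `D` would hold the 615 recorded spectra and `n_components` would be 30, giving a 615 x 30 feature matrix in place of the full spectral counts.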
Classification models were developed for three cases: benign neoplasms vs. malignant neoplasms; malignant melanoma vs. pigmented nevus; malignant melanoma vs. pigmented nevus and seborrheic keratosis. For each case, models based on the multilayer perceptron algorithm were developed using the following feature combinations: spectral counts ("counts"), relative concentrations from MCR analysis ("MCR"), patient anamnesis ("anamnesis"), and relative concentrations from MCR analysis combined with anamnesis ("MCR+anamnesis").
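The comparison of feature combinations can be sketched as follows, assuming scikit-learn is available. The data here are random placeholders, so the scores carry no meaning; the number of spectral channels, the binary encoding of risk factors, the hidden layer size, the scaling step, and the use of 5-fold cross-validated accuracy are assumptions for illustration and are not taken from the study.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 120
# Placeholder binary labels (e.g. benign vs. malignant), balanced by construction.
y = np.tile([0, 1], n_samples // 2)

# Placeholder feature blocks; shapes other than 30 and 15 are assumptions.
counts = rng.random((n_samples, 200))                                # spectral counts
mcr = rng.random((n_samples, 30))                                    # MCR relative concentrations
anamnesis = rng.integers(0, 2, size=(n_samples, 15)).astype(float)   # risk factors

feature_sets = {
    "counts": counts,
    "MCR": mcr,
    "anamnesis": anamnesis,
    "MCR+anamnesis": np.hstack([mcr, anamnesis]),  # column-wise concatenation
}

scores = {}
for name, X in feature_sets.items():
    # Standardise features, then fit a small multilayer perceptron.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
    )
    # Mean accuracy over 5 stratified folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The same loop would be repeated for each of the three classification cases, changing only the labels and the subset of spectra included.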
The most effective models were those trained on the dataset whose classification features were the relative concentrations obtained from MCR analysis together with patient anamnesis. Using the "MCR+anamnesis" dataset not only increases the efficiency of the classification models but also improves computation speed (compared to the "counts" dataset) and allows for physical interpretation of the results.
[1] Bratchenko I. A., Bratchenko L. A., Moryatov A. A., Khristoforova Y. A., Artemyev D. N., Myakinin O. O., Orlov A. E., Kozlov S. V., Zakharov V. P., In vivo diagnosis of skin cancer with a portable Raman spectroscopic device, Experimental Dermatology, vol. 30, no. 5, pp. 652-663, (2021).
[2] Santos I. P., van Doorn R., Caspers P. J., Bakker Schut T. C., Barroso E. M., Nijsten T. E. C., Noordhoek Hegt V., Koljenović S., Puppels G. J., Improving clinical diagnosis of early-stage cutaneous melanoma based on Raman spectroscopy, British Journal of Cancer, vol. 119, no. 11, pp. 1339-1346, (2018).
[3] Araújo D. C., Veloso A. A., de Oliveira Filho R. S., Giraud M. N., Raniero L. J., Ferreira L. M., Bitar R. A., Finding reduced Raman spectroscopy fingerprint of skin samples for melanoma diagnosis through machine learning, Artificial Intelligence in Medicine, vol. 120, p. 102161, (2021).
Speaker
Ksenia Tomnikova
Samara University
Russia