Transcriptomics and the “curse of dimensionality”: Monte Carlo simulations of ml-models as a tool for analyzing multidimensional data in tasks of searching markers of biological processes

G. J. Osmak; Осьмак Г. Ж.; M. V. Pisklova; Писклова М. В.

doi:10.31857/S0026898425010117

Authors: Osmak G.J.¹,², Pisklova M.V.¹,²
Affiliations:
1. Chazov National Medical Research Center for Cardiology
2. Pirogov Russian National Research Medical University
Issue: Vol. 59, No. 1 (2025)
Pages: 154-161
Section: BIOINFORMATICS
URL: https://rjpbr.com/0026-8984/article/view/682236
DOI: https://doi.org/10.31857/S0026898425010117
EDN: https://elibrary.ru/HCCMTU
ID: 682236

Abstract

High-throughput transcriptomic methods measure a vast number of features at once, which makes them valuable to researchers. At the same time, they give rise to the “curse of dimensionality”, which places greater demands on data processing and analysis methods. In this study, we propose a new algorithm that combines Monte Carlo methods and machine learning. The algorithm reduces the feature space by highlighting the genes most likely to be associated with the disease under investigation. Our approach not only generates a set of “interesting” genes but also assigns each gene a weight indicating its “importance”. This weight can be used in subsequent statistical analysis, visualization, and interpretation of results. The performance of the algorithm was demonstrated on open transcriptomic data from patients with hypertrophic cardiomyopathy (HCM; GSE36961 and GSE1145). The analysis identified the genes MYH6, FCN3, RASD1, and SERPINA3, which is in good agreement with the available literature.
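
The general idea of weighting genes by repeated random simulation can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifiers or how the weight is computed, so the sketch assumes that a gene's weight is the fraction of simulated models that select it, and a simple ranking by difference of class means stands in for a trained ML classifier. The function name, parameters, and toy data are all hypothetical.

```python
import random
from statistics import mean


def monte_carlo_gene_weights(X, y, gene_names, n_sim=200, top_k=2,
                             train_frac=0.7, seed=0):
    """Weight each gene by the fraction of simulations that select it.

    X: list of samples, each a list of expression values (one per gene).
    y: list of 0/1 class labels.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    counts = {g: 0 for g in gene_names}

    for _ in range(n_sim):
        # Stratified random subsample so both classes are always present.
        s0 = rng.sample(idx0, max(1, int(train_frac * len(idx0))))
        s1 = rng.sample(idx1, max(1, int(train_frac * len(idx1))))
        # Stand-in "model" (assumption): score each gene by the absolute
        # difference of class means and keep the top_k genes.
        scores = []
        for j, g in enumerate(gene_names):
            m0 = mean(X[i][j] for i in s0)
            m1 = mean(X[i][j] for i in s1)
            scores.append((abs(m1 - m0), g))
        scores.sort(reverse=True)
        for _, g in scores[:top_k]:
            counts[g] += 1

    # Selection frequency across simulations serves as the gene weight.
    return {g: counts[g] / n_sim for g in gene_names}


# Toy data: one gene shifted between classes, three pure-noise genes.
data_rng = random.Random(1)
genes = ["g_signal", "g_noise1", "g_noise2", "g_noise3"]
y = [0] * 10 + [1] * 10
X = [[data_rng.gauss(5.0 * label, 1.0)]
     + [data_rng.gauss(0.0, 1.0) for _ in range(3)]
     for label in y]
weights = monte_carlo_gene_weights(X, y, genes, n_sim=100, top_k=1)
```

On this toy data the shifted gene should receive a weight close to 1 and the noise genes weights close to 0. In a real analysis the stand-in ranking would be replaced by a fitted classifier with feature selection.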

Keywords

transcriptomics, machine learning, Monte Carlo, hypertrophic cardiomyopathy, biomarkers

About the authors

G. Osmak

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Corresponding author.
Email: german.osmak@gmail.com
Russia, Moscow; Moscow

M. Pisklova

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Email: german.osmak@gmail.com
Russia, Moscow; Moscow

References

Akond Z., Alam M., Mollah Md.N.H. (2018) Biomarker identification from RNA-seq data using a robust statistical approach. Bioinformation. 14(4), 153–163.
Tang M., Sun J., Shimizu K., Kadota K. (2015) Evaluation of methods for differential expression analysis on multi-group RNA-seq count data. BMC Bioinformatics. 16(1), 360.
Barbiero P., Squillero G., Tonda A. (2020) Modeling generalization in machine learning: a methodological and computational study. arXiv. 2006.15680.
Robinson M.D., McCarthy D.J., Smyth G.K. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 26(1), 139–140.
Smyth G.K. (2005) Limma: linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer.
Benjamini Y., Hochberg Y. (1997) Multiple hypotheses testing with weights. Scandinavian J. Statistics. 24(3), 407–418.
Holm S. (1979) A simple sequentially rejective multiple test procedure. Scandinavian J. Statistics. 6(2), 65–70.
Gui J., Tosteson T.D., Borsuk M. (2012) Weighted multiple testing procedures for genomic studies. BioData Mining. 5(1), 4.
Basu P., Cai T.T., Das K., Sun W. (2018) Weighted false discovery rate control in large-scale multiple testing. J. Am. Stat. Assoc. 113(523), 1172–1183.
Mann H.B., Whitney D.R. (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann. Mathemat. Statistics. 18(1), 50–60.
Benjamini Y., Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Statist. Soc.: Series B (Methodological). 57(1), 289–300.
Genovese C.R., Roeder K., Wasserman L. (2006) False discovery control with p-value weighting. Biometrika. 93(3), 509–524.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Duchesnay E. (2011) Scikit-learn: machine learning in python. J. Machine Learning Res. 12(Oct), 2825–2830.
Anfinson M., Fitts R.H., Lough J.W., James J.M., Simpson P.M., Handler S.S., Mitchell M.E., Tomita-Mitchell A. (2022) Significance of α-myosin heavy chain (MYH6) variants in hypoplastic left heart syndrome and related cardiovascular diseases. J. Cardiovascular Dev. Dis. 9(5), 144.
Ntelios D., Meditskou S., Efthimiadis G., Pitsis A., Zegkos T., Parcharidou D., Theotokis P., Alexouda S., Karvounis H., Tzimagiorgis G. (2022) α-Myosin heavy chain (MYH6) in hypertrophic cardiomyopathy: prominent expression in areas with vacuolar degeneration of myocardial cells. Pathol. Int. 72(5), 308–310.
Suzuki T., Saito K., Yoshikawa T., Hirono K., Hata Y., Nishida N., Yasuda K., Nagashima M. (2022) A double heterozygous variant in MYH6 and MYH7 associated with hypertrophic cardiomyopathy in a Japanese family. J. Cardiol. Cases. 25(4), 213–217.
Michalski M., Świerzko A.S., Pągowska-Klimek I., Niemir Z.I., Mazerant K., Domżalska-Popadiuk I., Moll M., Cedzyński M. (2015) Primary ficolin-3 deficiency — is it associated with increased susceptibility to infections? Immunobiology. 220(6), 711–713.
Prohászka Z., Munthe-Fog L., Ueland T., Gombos T., Yndestad A., Förhécz Z., Skjoedt M.O., Pozsonyi Z., Gustavsen A., Jánoskuti L., Karádi I., Gullestad L., Dahl C.P., Askevold E.T., Füst G., Aukrust P., Mollnes T.E., Garred P. (2013) Association of ficolin-3 with severity and outcome of chronic heart failure. PLoS One. 8(4), e60976.
Li D., Lin H., Li L. (2020) Multiple feature selection strategies identified novel cardiac gene expression signature for heart failure. Front. Physiol. 11, 604241.
Song H., Chen S., Zhang T., Huang X., Zhang Q., Li C., Chen C., Chen S., Liu D., Wang J., Tu Y., Wu Y., Liu Y. (2022) Integrated strategies of diverse feature selection methods identify aging-based reliable gene signatures for ischemic cardiomyopathy. Front. Mol. Biosci. 9, 805235.
Wie J., Kim B.J., Myeong J., Ha K., Jeong S.J., Yang D., Kim E., Jeon J.H., So I. (2015) The roles of Rasd1 small G proteins and leptin in the activation of TRPC4 transient receptor potential channels. Channels. 9(4), 186–195.
Kemppainen R.J., Behrend E.N. (1998) Dexamethasone rapidly induces a novel Ras superfamily member-related gene in AtT-20 cells. J. Biol. Chem. 273(6), 3129–3131.
McGrath M.F., Ogawa T., De Bold A.J. (2012) Ras dexamethasone-induced protein 1 is a modulator of hormone secretion in the volume overloaded heart. Am. J. Physiol. Heart Circ. Physiol. 302(9), H1826–H1837.
Baker C., Belbin O., Kalsheker N., Morgan K. (2007) SERPINA3 (aka alpha-1-antichymotrypsin). Front. Biosci. 12(8–12), 2821–2835.
de Mezer M., Rogaliński J., Przewoźny S., Chojnicki M., Niepolski L., Sobieska M., Przystańska A. (2023) SERPINA3: stimulator or inhibitor of pathological changes. Biomedicines. 11(1), 156.
You H., Dong M. (2023) Prediction of diagnostic gene biomarkers for hypertrophic cardiomyopathy by integrated machine learning. J. Int. Med. Res. 51(11), 03000605231213781.

Supplementary files

1. JATS XML

2. Fig. 1. Research scheme.

3. Fig. 2. Results of Monte Carlo simulations of classifier training. (a) Convergence of the algorithm in terms of the size of the set of most significant genes; red dashes along the abscissa mark the iterations at which the composition of this set changed. (b) Growth of the number of selected genes with algorithm iteration (green line); weights of genes included in more than half of the models (red line); iterations at which the set of most significant genes changed (red vertical dashes along the abscissa). (c) Histogram of the ROC-AUC distribution for ML classifiers over 3000 Monte Carlo simulations. (d) Histogram of the distribution of estimated weights for genes included in at least one model.

4. Fig. 3. Testing hypotheses about the association of the selected genes on the independent GSE1145 dataset. (a) Volcano plot comparing gene expression; dot size reflects WeightML. (b) Summary table of statistics; only results significant by p-value are shown. p-valMW: p-value from the Mann–Whitney test; FDRBH: Benjamini–Hochberg correction for multiple comparisons; FDRwBH: weighted Benjamini–Hochberg correction for multiple comparisons; WeightML: gene weight reflecting its importance for the classification models, based on the Monte Carlo simulations; log2FC: base-2 logarithm of the ratio of means.
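
The difference between the FDRBH and FDRwBH columns named in the Fig. 3 caption can be illustrated with a short sketch. It assumes the p-value weighting scheme of Genovese et al. (2006): weights are rescaled to have mean 1, each p-value is divided by its weight, and the ordinary Benjamini–Hochberg procedure is applied to the result. Whether the article uses exactly this variant is not stated on this page, and the function names and example numbers below are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def weighted_benjamini_hochberg(pvals, weights):
    """Weighted BH (assumed variant): BH applied to p / w, mean(w) == 1."""
    m = len(pvals)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Rescale so the weights average to 1, as the weighting scheme requires.
    norm = [w * m / total for w in weights]
    # A zero weight means the hypothesis can never be rejected.
    scaled = [min(1.0, p / w) if w > 0 else 1.0
              for p, w in zip(pvals, norm)]
    return benjamini_hochberg(scaled)


# Illustrative numbers only: four genes, the second one up-weighted.
pvals = [0.01, 0.04, 0.03, 0.5]
gene_weights = [0.5, 2.5, 0.5, 0.5]
q = benjamini_hochberg(pvals)
wq = weighted_benjamini_hochberg(pvals, gene_weights)
```

With equal weights the two procedures coincide. A gene with a larger weight gets a smaller weighted adjusted p-value, at the cost of larger adjusted values for down-weighted genes.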
