Transcriptomics and the “curse of dimensionality”: Monte Carlo simulations of ml-models as a tool for analyzing multidimensional data in tasks of searching markers of biological processes


Abstract

High-throughput transcriptomic methods measure a vast number of factors that are valuable to researchers. At the same time, the “curse of dimensionality” arises, placing greater demands on data processing and analysis methods. In this study, we propose a new algorithm that combines Monte Carlo methods and machine learning. The algorithm reduces the feature space by highlighting the genes most likely to be associated with the disease under investigation. Our approach not only generates a set of “interesting” genes but also assigns each gene a weight indicating its “importance”. This weight can be used in subsequent statistical analysis, visualization, and interpretation of results. The performance of the algorithm was demonstrated on open transcriptomic data from patients with hypertrophic cardiomyopathy (HCM; datasets GSE36961 and GSE1145). The analysis identified the genes MYH6, FCN3, RASD1, and SERPINA3, which is in good agreement with the available literature.
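As a rough illustration of how Monte Carlo simulation of ML models can yield per-gene weights, the sketch below repeatedly draws random train/test splits, fits a small classifier on each, keeps the models whose ROC-AUC passes a threshold, and scores each gene by the share of kept models that used it. The abstract does not specify the classifier, the sampling scheme, or the weight definition, so all of these are assumptions made for illustration: the shallow decision tree, the 0.7 threshold, the function name `monte_carlo_gene_weights`, and the synthetic data are hypothetical and are not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def monte_carlo_gene_weights(X, y, n_sim=200, auc_threshold=0.7, seed=0):
    """Weight of a gene = share of accepted models that used it (assumed definition)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    accepted = 0
    for _ in range(n_sim):
        split_seed = int(rng.integers(0, 2**31 - 1))
        # Each simulation sees a different random train/test partition.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=split_seed
        )
        model = DecisionTreeClassifier(max_depth=3, random_state=split_seed)
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        if auc < auc_threshold:
            continue  # discard models that separate the classes poorly
        accepted += 1
        # A gene counts as "used" if the tree split on it at least once.
        counts += model.feature_importances_ > 0
    return counts / accepted if accepted else counts


# Synthetic expression matrix: 80 samples, 50 genes, only gene 0 is informative.
data_rng = np.random.default_rng(42)
n_samples, n_genes = 80, 50
y = np.repeat([0, 1], n_samples // 2)
X = data_rng.normal(size=(n_samples, n_genes))
X[y == 1, 0] += 3.0

weights = monte_carlo_gene_weights(X, y)
top_gene = int(np.argmax(weights))
```

On real data, `X` would hold expression values (samples by genes) and `y` the case/control labels; genes with a weight above 0.5 would correspond to those "included in more than half of the models".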


About the authors

G. Osmak

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Corresponding author
Email: german.osmak@gmail.com
Russia, Moscow; Moscow

M. Pisklova

Chazov National Medical Research Center for Cardiology; Pirogov Russian National Research Medical University

Email: german.osmak@gmail.com
Russia, Moscow; Moscow


Supplementary files
1. JATS XML
2. Fig. 1. Research scheme.

3. Fig. 2. Results of Monte Carlo simulations for training classifiers. (a) Convergence of the algorithm in terms of the size of the set of most significant genes; red dashes along the abscissa mark the iterations at which the composition of this set changed. (b) Number of selected genes as a function of algorithm iteration (green line); weights of genes included in more than half of the models (red line); iterations at which the set of most significant genes changed (red vertical dashes along the abscissa). (c) Histogram of the ROC-AUC distribution for ML classifiers over 3000 Monte Carlo simulations. (d) Histogram of the estimated weights of genes included in at least one model.

4. Fig. 3. Testing hypotheses about the association of the selected genes on the independent GSE1145 dataset. (a) Volcano plot comparing gene expression; the size of each dot reflects the gene's WeightML. (b) Summary table of statistics; only results significant by p-value are shown. p-valMW: p-value from the Mann–Whitney test; FDRBH: Benjamini–Hochberg correction for multiple comparisons; FDRwBH: weighted Benjamini–Hochberg correction for multiple comparisons; WeightML: gene weight reflecting its significance for the classification models, based on the results of the Monte Carlo simulations; log2FC: base-2 logarithm of the ratio of means.
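To make the difference between FDRBH and FDRwBH concrete, the sketch below shows one common form of weighted Benjamini–Hochberg correction: each p-value is divided by a weight normalised to mean one, and the ordinary Benjamini–Hochberg step-up adjustment is applied to the result. Whether the authors map WeightML to weights in exactly this way is not stated here, so the function name `weighted_bh`, the normalisation, and the example numbers are assumptions for illustration only.

```python
import numpy as np


def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: BH applied to p_i / w_i with mean(w) == 1."""
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.mean()  # normalise so that the weights average to one
    # Genes with zero weight can never be rejected (weighted p-value = inf).
    q = np.divide(p, w, out=np.full_like(p, np.inf), where=w > 0)

    m = q.size
    order = np.argsort(q)
    ranked = q[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest rank downwards, then cap at 1.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted, adjusted <= alpha


# Hypothetical p-values for four genes.
pvals = [0.001, 0.02, 0.03, 0.5]

# Equal weights reduce to the ordinary Benjamini-Hochberg adjustment.
fdr_bh, reject_bh = weighted_bh(pvals, [1, 1, 1, 1])

# Unequal weights favour the second gene and penalise the third.
fdr_wbh, reject_wbh = weighted_bh(pvals, [0.5, 3.0, 0.25, 0.25])
```

With equal weights the third gene passes the 0.05 threshold; with the unequal weights it does not, because its low weight inflates its weighted p-value.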


Copyright © Russian Academy of Sciences, 2025