Volume 19, No. 5 (2024)

Life Sciences

Full-length PacBio Amplicon Sequencing to Unveil RNA Editing Sites

Zhu X., Liao M., Zhu Y., Dong Y.

Abstract

Background: RNA editing introduces post-transcriptional sequence changes. RNA editing sites are currently detected mostly with Sanger sequencing or second-generation sequencing. However, Sanger-based detection is limited by interfering background peaks when direct sequencing is used and by the number of clones when clone sequencing is used, while second-generation sequencing is constrained by its short reads.

Objective: We aimed to design a pipeline that accurately detects RNA editing sites in full-length long-read amplicons, for studies that focus on a few specific genes of interest.

Method: We developed a novel high-throughput pipeline for detecting RNA editing sites based on PacBio circular consensus sequencing, which combines high accuracy with high throughput and long-read coverage. We tested the pipeline on cytosolic malate dehydrogenase in the hard-shelled mussel Mytilus coruscus and further validated it using direct Sanger sequencing.

Results: Data generated from PacBio circular consensus sequence (CCS) amplicons from three mussels were first filtered by quality and then selected by open reading frame. After filtering, 225-2047 sequences per mussel were used to identify RNA editing sites. Using the corresponding genomic DNA sequences, we extracted 227-799 candidate RNA editing sites, excluding heterozygous sites. We then identified 7-11 final RNA editing sites (RESs) using a new error model designed specifically for RNA editing site detection. All resulting RNA editing sites agree with the Sanger sequencing validation.
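The comparison of reads against genomic DNA described in the Results can be illustrated with a minimal Python sketch. It assumes the cDNA reads are already aligned to the genomic reference on the same coordinates and contain no indels. The function name `candidate_editing_sites`, the `min_fraction` threshold, and the toy sequences are hypothetical, and the authors' error model for reducing candidates to final sites is not reproduced here.

```python
from collections import Counter


def candidate_editing_sites(genomic_ref, het_positions, cdna_reads, min_fraction=0.05):
    """Return {position: (ref_base, alt_base, alt_fraction)} for candidate sites.

    Assumption: every read is aligned to genomic_ref position by position.
    Positions listed in het_positions (heterozygous in genomic DNA) are skipped.
    """
    candidates = {}
    for pos, ref_base in enumerate(genomic_ref):
        if pos in het_positions:
            continue  # a genomic polymorphism, not an RNA-level change
        counts = Counter(read[pos] for read in cdna_reads if pos < len(read))
        depth = sum(counts.values())
        alt_counts = {base: n for base, n in counts.items() if base != ref_base}
        if depth == 0 or not alt_counts:
            continue
        # Keep the most frequent non-reference base at this position.
        alt_base = max(alt_counts, key=alt_counts.get)
        fraction = alt_counts[alt_base] / depth
        if fraction >= min_fraction:
            candidates[pos] = (ref_base, alt_base, fraction)
    return candidates


# Toy data (hypothetical): position 1 is heterozygous in the genome.
toy_ref = "ACGTAC"
toy_reads = ["ACGTAC", "ACGTGC", "ATGTGC", "ACGTAC"]
toy_sites = candidate_editing_sites(toy_ref, {1}, toy_reads)
```

In this toy case only position 4 is reported: half of the reads carry G where the genome has A, while the mismatch at position 1 is ignored because it is heterozygous in the genomic DNA.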

Conclusion: We report a method with a near-zero error rate for identifying RNA editing sites in long-read amplicons using PacBio CCS sequencing.

Current Bioinformatics. 2024;19(5):425-433

SCV Filter: A Hybrid Deep Learning Model for SARS-CoV-2 Variants Classification

Wang H., Gao J.

Abstract

Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is highly mutable, so mutations arise readily during transmission. As the epidemic has continued, several mutated strains have emerged. Researchers worldwide are working on the effective identification of SARS-CoV-2.

Objective: In this paper, we propose SCVfilter, a new deep learning method that can effectively identify SARS-CoV-2 variant sequences. It is a deep hybrid model with embedding, an attention residual network, and long short-term memory as components.

Methods: Deep learning is effective at extracting rich features from sequence data, which has significant implications for the study of Coronavirus Disease 2019 (COVID-19), a disease that has become prevalent in recent years. SCVfilter is a deep hybrid model with embedding, an attention residual network, and long short-term memory as components.

Results: SCVfilter achieves an accuracy of 93.833% on Dataset-I, consisting of different variant strains; 90.367% on Dataset-II, consisting of data collected from China, Taiwan, and Hong Kong; and 79.701% on Dataset-III, consisting of data collected from six continents (Africa, Asia, Europe, North America, Oceania, and South America).

Conclusion: When processing lengthy, highly homologous SARS-CoV-2 data, SCVfilter can automatically select features and accurately detect different variant strains of SARS-CoV-2. In addition, SCVfilter is sufficiently robust to handle the problems caused by sample imbalance and sequence incompleteness.

Other: SCVfilter is an open-source method available at https://github.com/deconvolutionw/SCVfilter.

Current Bioinformatics. 2024;19(5):434-445

Revealing ANXA6 as a Novel Autophagy-related Target for Pre-eclampsia Based on the Machine Learning

Zhu B., Geng H., Yang F., Wu Y., Cao T., Wang D., Wang Z.

Abstract

Background: Pre-eclampsia (PE) is a severe pregnancy complication associated with autophagy.

Objective: This research sought to uncover autophagy-related genes in pre-eclampsia through bioinformatics and machine learning.

Methods: The GEO series GSE75010 was subjected to WGCNA to identify key modular genes in PE. Autophagy genes retrieved from THANATOS were overlapped with the modular genes to yield PE-related autophagy genes. Two machine learning algorithms (LASSO and SVM-RFE) were then used for dimensionality reduction. The candidate gene was further verified by quantitative reverse transcription polymerase chain reaction, western blot, and immunohistochemistry. Preliminary experiments were conducted on HTR-8/SVneo cell lines to explore the role of candidate genes in autophagy regulation.
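The overlap step in the Methods amounts to a set intersection, which the following Python sketch illustrates. The helper names and the placeholder gene symbols (`GENE_A`, `GENE_B`, `GENE_C`) are hypothetical. Taking the intersection of the LASSO and SVM-RFE selections is also an assumption, since the abstract does not state how the two results were combined.

```python
def overlap_genes(module_genes, autophagy_genes):
    """Genes present in both the WGCNA module list and the autophagy gene list."""
    return sorted(set(module_genes) & set(autophagy_genes))


def consensus_features(lasso_selected, svm_rfe_selected):
    """Genes retained by both feature-selection methods (assumed combination rule)."""
    return sorted(set(lasso_selected) & set(svm_rfe_selected))


# Toy lists (hypothetical, apart from the ANXA6 symbol named in the abstract).
module = ["ANXA6", "GENE_A", "GENE_B"]
autophagy = ["GENE_B", "ANXA6", "GENE_C"]
pe_related_args = overlap_genes(module, autophagy)

candidates = consensus_features(["ANXA6", "GENE_B"], ["ANXA6"])
```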

Results: WGCNA identified 291 genes from 5 hubs, and after overlapping with 1087 autophagy-related genes (ARGs) obtained from THANATOS, 42 PE-related ARGs were identified. ANXA6 was recognized as a potential target through SVM-RFE and LASSO analyses. The mRNA and protein expression of ANXA6 were verified in placenta samples. In HTR-8/SVneo cells, modulating ANXA6 expression altered autophagy levels. Knocking down ANXA6 resulted in an anti-autophagy effect, which was reversed by treatment with CAL101, an inhibitor of PI3K, Akt, and mTOR.

Conclusion: Our observations suggest that ANXA6 may serve as a target in PE and that autophagy may be crucial to the pathogenesis of PE.

Current Bioinformatics. 2024;19(5):446-457

Prediction of Plant Ubiquitylation Proteins and Sites by Fusing Multiple Features

Guan M., Qiu W., Wang Q., Xiao X.

Abstract

Introduction: Protein ubiquitylation is an important post-translational modification (PTM) and is considered one of the most important processes regulating cell function and various diseases. Accurate prediction of ubiquitylation proteins and their PTM sites is therefore of great significance for the study of basic biological processes and the development of related drugs. Researchers have developed several large-scale computational methods to predict ubiquitylation sites, but there is still much room for improvement. Much ubiquitylation research is cross-species, yet life forms are diverse, and prediction methods tend to be species-specific in practical application. This study focuses on plants and constructs computational methods for identifying ubiquitylation proteins and ubiquitylation sites.

Method: In this work, we constructed two predictive models to identify plant ubiquitylation proteins and sites. In the ubiquitylation protein prediction model, to better reflect protein sequence information and obtain better prediction results, features are extracted with a KNN scoring matrix model based on functional domain Gene Ontology (GO) annotation and with word embedding models, i.e., Skip-Gram and Continuous Bag of Words (CBOW), and the light gradient boosting machine (LGBM) is selected as the prediction engine.

Results: The accuracy (ACC), precision, recall, F1 score, and AUC are 85.12%, 80.96%, 72.80%, 76.37%, and 0.9193, respectively, in 10-fold cross-validation on the independent dataset. In the ubiquitylation site prediction model, Skip-Gram, CBOW, and enhanced amino acid composition (EAAC) encodings were used to extract features from protein sequence fragments, and the predictions on the training and independent test data also achieved good performance.
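For readers less familiar with the metrics reported above, the sketch below shows the standard definitions of accuracy, precision, recall, and F1 computed from a binary confusion matrix. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

Metrics averaged over cross-validation folds need not satisfy the F1 formula exactly when it is applied to the averaged precision and recall, because the harmonic mean is computed per fold.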

Conclusion: The comparison results demonstrate that our models have a clear advantage in predicting ubiquitylation proteins and sites, and they may provide useful insights for studying the mechanisms and modulation of ubiquitination pathways.

Current Bioinformatics. 2024;19(5):458-469

Transformer and Graph Transformer-Based Prediction of Drug-Target Interactions

Qian M., Lu W., Zhang Y., Liu J., Wu H., Lu Y., Li H., Fu Q., Shen J., Xiao Y.

Abstract

Background: Finding new pharmaceuticals requires a great deal of time and money, which has motivated the search for more effective approaches to drug discovery. Researchers have recently made significant progress in using Deep Learning (DL) to create DTI

Methods: We propose a deep learning model that applies the Transformer to DTI prediction. The model uses a Transformer and a Graph Transformer to extract feature information from proteins and compound molecules, respectively, and combines the two representations to predict interactions.

Results: We evaluated the proposed method on two benchmark datasets, Human and C. elegans, in different experimental settings and compared it with the latest DL models.

Conclusion: The results show that the proposed DL-based model is an effective method for DTI prediction, and its performance on the two datasets is significantly better than that of other DL-based methods.

Current Bioinformatics. 2024;19(5):470-481

Predicting the Risk of Breast Cancer Recurrence and Metastasis based on miRNA Expression

Lv Y., Wang Y., Zhang Y., Chen S., Yao Y.

Abstract

Background: Even after surgery, breast cancer patients may still suffer from recurrence and metastasis. It is therefore critical to accurately predict each patient's risk of recurrence and metastasis, which can help determine the appropriate adjuvant therapy.

Methods: The purpose of this study is to investigate and compare the performance of several categories of molecular biomarkers, i.e., microRNA (miRNA), long non-coding RNA (lncRNA), messenger RNA (mRNA), and copy number variation (CNV), in predicting the risk of breast cancer recurrence and metastasis. First, the molecular data (miRNA, lncRNA, mRNA, and CNV) of 483 breast cancer patients were downloaded from The Cancer Genome Atlas and randomly divided into training and test sets at a ratio of 7:3. Second, features were selected on the training set by univariate Cox and multivariate Cox variance analysis (e.g., 15 miRNAs). Using the selected features, a random forest classifier and several other classification methods were trained on the recurrence and metastasis labels. Finally, the performances of the classification models were compared and evaluated on the test set.
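The 7:3 random split mentioned in the Methods can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name, the fixed seed, and the integer patient identifiers are assumptions, and the abstract does not say whether the authors stratified the split by outcome.

```python
import random


def split_train_test(sample_ids, train_fraction=0.7, seed=0):
    """Shuffle sample identifiers and split them into training and test lists."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# 483 hypothetical patient identifiers, matching the cohort size in the abstract.
train_ids, test_ids = split_train_test(range(483))
```

Feature selection is then fitted on `train_ids` only, so that no information from the test set leaks into the selected features.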

Results: The area under the ROC curve was 0.70 for miRNA, better than for the other biomarkers.
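The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function name and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = recurrence or metastasis, 0 = none (hypothetical values).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of this size; a rank-based formulation would scale better.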

Conclusion: These results indicate that miRNA is an important guide for predicting recurrence and metastasis of breast cancer.

Current Bioinformatics. 2024;19(5):482-489

DMR_Kmeans: Identifying Differentially Methylated Regions Based on k-means Clustering and Read Methylation Haplotype Filtering

Peng X., Cui W., Kong X., Huang Y., Li J.

Abstract

Introduction: Differentially methylated regions (DMRs), including tissue-specific DMRs and disease-specific DMRs, can be used to reveal the mechanisms of gene regulation and to screen for diseases. To date, many methods have been proposed to detect DMRs from bisulfite sequencing data. In these methods, differentially methylated CpG sites and DMRs are usually identified with statistical tests or distribution models, which neglect the joint methylation statuses provided in each read and result in inaccurate DMR boundaries.

Methods: In this paper, a method named DMR_Kmeans is proposed to detect DMRs based on k-means clustering and read methylation haplotype filtering. In DMR_Kmeans, for each CpG site, the k-means algorithm is used to cluster the methylation levels from two groups, and the methylation difference of the CpG is measured based on the different distributions in clusters. Methylation haplotypes of reads are employed to extract the methylation patterns in a candidate region. Finally, DMRs are identified based on the methylation differences and the methylation patterns in candidate regions.
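The per-CpG clustering idea can be illustrated with a small Python sketch. It pools the methylation levels of two groups at one CpG site, clusters them with a one-dimensional k-means, and then compares how the two groups distribute across the clusters. The abstract does not specify the difference measure, so the total variation distance used here, the choice of k = 2, and all function names are assumptions rather than the authors' definition.

```python
def kmeans_1d(values, k=2, iterations=50):
    """Lloyd's algorithm on scalars; returns (cluster index per value, centers)."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly between the smallest and largest value.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return assignment, centers


def cluster_difference(group_a, group_b, k=2):
    """Total variation distance between the groups' cluster memberships (0 to 1)."""
    values = list(group_a) + list(group_b)
    assignment, _ = kmeans_1d(values, k)
    a_assign = assignment[:len(group_a)]
    b_assign = assignment[len(group_a):]
    diff = 0.0
    for c in range(k):
        share_a = a_assign.count(c) / len(group_a)
        share_b = b_assign.count(c) / len(group_b)
        diff += abs(share_a - share_b)
    return diff / 2


# Hypothetical methylation levels of one CpG site in two tissues.
site_difference = cluster_difference([0.90, 0.85, 0.95], [0.10, 0.15, 0.05])
```

A value near 1 means the two groups fall into different clusters at this site, and a value near 0 means they are indistinguishable.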

Results: We compared DMR_Kmeans with eight DMR detection methods on whole-genome bisulfite sequencing data from six pairs of tissues. DMR_Kmeans achieves higher Qn and Ql and more overlapped promoters than the other methods given a methylation difference threshold greater than 0.4, which indicates that the DMRs predicted by DMR_Kmeans have accurate boundaries and contain fewer CpGs with small methylation differences than those predicted by other methods.

Conclusion: The results further suggest that DMR_Kmeans can provide a high-quality DMR set for downstream analysis, since the total length of the DMRs it predicts is longer and the total number of CpG sites in those DMRs is greater than for other methods.

Current Bioinformatics. 2024;19(5):490-501

A Novel In silico Filtration Method for Discovery of Encrypted Antimicrobial Peptides

Barneh F., Nazarian A., Mousavi Nadoshan R., Pooshang Bagheri K.

Abstract

Background: Antibacterial resistance has been one of the most important causes of death in recent decades, creating a need to discover new antibiotics. Antimicrobial peptides (AMPs) are among the best candidates because of their broad-spectrum, potent activity against bacteria and the low probability that bacteria develop resistance to them.

Objective: In this study, we propose a novel filtration method that uses knowledge-based approaches to discover encrypted AMPs within a protein sequence.

Methods: The encrypted AMPs were selected from a protein sequence, in this case lactoferrin, based on hydrophobicity, cationicity, alpha-helix structure, helical wheel projection, and binding affinities to gram-negative and gram-positive bacterial membranes.
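The first two criteria, hydrophobicity and cationicity, lend themselves to a simple sliding-window filter, sketched below in Python. The window length, the thresholds, the hydrophobic residue set, and the function names are all illustrative assumptions. The structural and docking criteria in the Methods are not covered, and the toy sequence is not lactoferrin.

```python
# A simple choice of hydrophobic residues; other scales would give other results.
HYDROPHOBIC = set("AILMFWV")


def net_charge(peptide):
    """Approximate net charge: +1 for K and R, -1 for D and E (histidine ignored)."""
    return sum(1 for aa in peptide if aa in "KR") - sum(1 for aa in peptide if aa in "DE")


def hydrophobic_fraction(peptide):
    """Fraction of residues that belong to the HYDROPHOBIC set."""
    return sum(1 for aa in peptide if aa in HYDROPHOBIC) / len(peptide)


def candidate_windows(protein, length=20, min_charge=2, min_hydrophobic=0.3):
    """Return (start, peptide) for every window passing both thresholds."""
    hits = []
    for start in range(len(protein) - length + 1):
        peptide = protein[start:start + length]
        if net_charge(peptide) >= min_charge and hydrophobic_fraction(peptide) >= min_hydrophobic:
            hits.append((start, peptide))
    return hits


# Toy sequence (hypothetical) scanned with a short window.
toy_hits = candidate_windows("KKLLDDEE", length=4, min_charge=2, min_hydrophobic=0.5)
```

Windows that pass such a filter would still need the helicity, helical wheel, and membrane-binding checks named in the Methods before being treated as candidates.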

Results: Six out of 20 potential encrypted AMPs were ultimately selected for further assays. Molecular docking of the selected AMPs with the outer and inner membranes of gram-negative bacteria and with gram-positive bacterial membranes showed reasonable binding affinities ranging from -6.7 to -7.5, -4.5 to -5.7, and -4.6 to -5.7 kcal/mol, respectively. The candidate AMPs showed no toxicity.

Conclusion: According to the in silico results, our method succeeded in discovering six new encrypted AMPs from human lactoferrin, designated lactoferrin-derived peptides (LDPs). Further in silico and experimental assays should be performed to confirm the efficiency of our knowledge-based filtration method.

Current Bioinformatics. 2024;19(5):502-512