rna_match
RNA
RNA secondary structure prediction using deep learning with thermodynamic integration
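RNA secondary structure methods:
- Thermodynamic model: the nearest-neighbor method
  - The secondary structure is decomposed into substructure loops (hairpin loops, internal loops, bulge loops, base-pair stackings, multi-branch loops, and external loops), and the free energies of the loops are summed. The free energy of each loop type is obtained experimentally.
  - Dynamic programming: the Zuker algorithm computes the minimum free energy
- Machine learning
  - Trains scoring parameters for the decomposed substructures
  - Prone to overfitting the training data
  - Probabilistic generative models: stochastic context-free grammars (SCFGs)
- Hybrid methods
  - Combine thermodynamics and machine learning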
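The dynamic-programming idea can be illustrated with a much simpler recursion than Zuker's. The sketch below is the Nussinov base-pair maximization algorithm, not the Zuker minimum-free-energy algorithm: it counts nested canonical pairs instead of summing experimentally measured loop energies. The function name `nussinov_pairs` and the `min_loop` parameter are illustrative assumptions, not taken from the paper.

```python
def nussinov_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov recursion).

    Simplified stand-in for Zuker's algorithm: pairs are scored by count,
    not by nearest-neighbor loop free energies.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case: position i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in allowed:  # case: i pairs with k
                    left = best[i + 1][k - 1]
                    right = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + left + right)
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(nussinov_pairs("GGGAAAUCC"))
```

Replacing the pair count with loop free energies and minimizing instead of maximizing gives the general shape of an energy-based recursion, although the real Zuker algorithm needs several coupled tables to handle the different loop types.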
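The data consist of triplets: (1) the RNA sequence, (2) the known secondary structure in dot-bracket format, and (3) the free energy of that secondary structure.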
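As a small illustration of the dot-bracket format, the sketch below converts a structure string into a list of base pairs using a stack. It handles only `(`, `)` and `.`; extended bracket types sometimes used for pseudoknots are not covered. The function name `dot_bracket_to_pairs` is an illustrative choice.

```python
def dot_bracket_to_pairs(structure):
    """Return sorted (i, j) base pairs (0-based, i < j) from a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)          # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # closes the most recent '('
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("(((...)))"))
```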
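Method for predicting the four types of folding scores:
- Input: an RNA sequence of length L, with each sequence element encoded as a d-dimensional vector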
UFold: fast and accurate RNA secondary structure prediction with deep learning
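RNA classes
ribosomal RNA, transfer RNA, small nuclear RNA, micro RNA, noncoding RNA
RNA function is tied to the secondary structure rather than to the RNA sequence itself
Secondary structure can be determined from atomic coordinates
If the base pairs are nested (no pseudoknots), the minimum free energy can be found by dynamic programming; non-nested base pairs (with pseudoknots) fall outside this case
Research approaches: (1) single-sequence energy minimization, (2) multi-sequence covariation analysis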
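Deep learning methods:
- Encode the input sequence with an LSTM or a transformer encoder
- Combine deep learning with dynamic programming
Challenges:
- LSTM and transformer encoders have a large number of model parameters, which leads to high computational cost and low efficiency
- When deep learning is combined with dynamic programming or thermodynamic models, the constraints of the latter can limit performance
- Deep learning performance depends heavily on the data distribution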
Data:
- bpRNA-new dataset: used for cross-family evaluation
- RNAStralign: 24895 sequences for training, 2854 for testing
- ArchiveII
- bpRNA-1m
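Method of this paper:
The input is a 2D image with 16 channels (to cover every possible base pair, including non-canonical pairings), plus one extra channel that records the pairing probability derived from the three types of matching base pairs
- input: $x=(x_1, x_2, \cdots, x_L), \quad x_i\in\{\text{A}, \text{U}, \text{C}, \text{G}\}$
- One-hot encode each element of the sequence $x$ to obtain the binary matrix $X\in\{0,1\}^{L\times 4}$
- Take the Kronecker product of $X$ and $X^T$ to obtain $K\in\{0,1\}^{16\times L\times L}$, where $K(i,j,k)$ indicates whether $x_j$ and $x_k$ match the $i$-th of the 16 base-pairing rules
- To overcome the sparsity of $K$ and to reflect potential non-canonical base pairings, add one more feature channel that records the probability of pairing between bases; this matrix is non-binary, $W\in\mathbb{R}^{1\times L\times L}$
- Concatenate $W\in\mathbb{R}^{1\times L\times L}$ with $K$ to obtain $I\in\mathbb{R}^{17\times L\times L}$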
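The construction of the input tensor can be sketched in plain Python with nested lists. This is a minimal illustration under stated assumptions: the base order `AUCG`, the channel order `4*a + b`, and the function names are choices made here, and the extra channel $W$ is passed in by the caller rather than computed, because the notes do not give its formula.

```python
BASES = "AUCG"  # assumed column order of the one-hot encoding


def one_hot(seq):
    """L x 4 binary matrix X; row i has a single 1 in the column of base x_i."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def pair_channels(seq):
    """16 x L x L binary tensor K.

    Channel 4*a + b is the outer product of column a and column b of X,
    so K[4*a + b][j][k] == 1 exactly when x_j == BASES[a] and x_k == BASES[b].
    """
    X = one_hot(seq)
    L = len(seq)
    return [[[X[j][a] * X[k][b] for k in range(L)] for j in range(L)]
            for a in range(4) for b in range(4)]


def build_input(seq, W):
    """Stack the 16 channels of K with the L x L matrix W into a 17 x L x L tensor."""
    return pair_channels(seq) + [W]


seq = "AUG"
W = [[0.0] * len(seq) for _ in seq]  # placeholder for the pairing-probability channel
I = build_input(seq, W)
print(len(I), len(I[0]), len(I[0][0]))
```

Because each position holds exactly one base, exactly one of the 16 channels is set at every $(j,k)$, which is the sparsity that the extra channel $W$ is meant to offset.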
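An encoder-decoder framework extracts multi-scale correlation features
UFold takes the tensor $I$ as input
The encoder consists of a series of downsampling layers; the decoder consists of a series of upsampling layers together with lateral connections to the encoder
Output: $Y=f(I;\theta)$, $Y\in[0,1]^{L\times L}$, where $Y_{ij}$ is the probability that $x_i$ and $x_j$ pair
The output is a matrix of correlation scores between bases
- The secondary structure is represented by a contact matrix $A\in\{0,1\}^{L\times L}$, where $A_{ij}=1$ if there is a base pairing between $x_i$ and $x_j$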
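For concreteness, a contact matrix can be built from a list of base pairs as follows. This is a minimal sketch; the function name `contact_matrix` and the 0-based indexing are assumptions made for the example.

```python
def contact_matrix(length, pairs):
    """Symmetric L x L binary matrix A with A[i][j] = 1 for every base pair (i, j)."""
    A = [[0] * length for _ in range(length)]
    for i, j in pairs:
        A[i][j] = 1
        A[j][i] = 1  # pairing is symmetric
    return A


A = contact_matrix(9, [(0, 8), (1, 7), (2, 6)])  # the structure (((...)))
for row in A:
    print("".join(str(v) for v in row))
```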
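Post-processing
- Convert the resulting tensor $Y$ into a secondary structure
- The post-processing applies 4 constraints
  - The contact matrix is symmetric
  - Only canonical pairings (A-U, G-C) and U-G pairings are allowed
  - No sharp loops are allowed, i.e. the loop size is constrained to at least 4 bases
  - Only non-overlapping pairs are allowed, i.e. each base pairs with at most one other base
- Use the constraints to generate the processed score matrix $\hat{Y}^*$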
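The four constraints can be illustrated with a simple greedy procedure. This is not the post-processing used in the paper, which these notes do not spell out; it is a swapped-in simplification that symmetrizes the scores, masks disallowed pairs and sharp loops, and then accepts the highest-scoring pairs that do not reuse a base. The names `postprocess`, `min_loop`, and `threshold` are hypothetical, and `min_loop=4` follows the "at least 4 bases" statement in the notes.

```python
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"),
           ("C", "G"), ("G", "U"), ("U", "G")}


def postprocess(seq, Y, min_loop=4, threshold=0.5):
    """Greedy conversion of a score matrix Y into a binary contact matrix."""
    L = len(seq)
    candidates = []
    for i in range(L):
        # constraint 3: at least min_loop bases enclosed between i and j
        for j in range(i + min_loop + 1, L):
            if (seq[i], seq[j]) not in ALLOWED:  # constraint 2
                continue
            score = 0.5 * (Y[i][j] + Y[j][i])    # constraint 1: symmetrize
            if score > threshold:
                candidates.append((score, i, j))
    candidates.sort(reverse=True)                # best-scoring pairs first
    used = set()
    A = [[0] * L for _ in range(L)]
    for score, i, j in candidates:
        if i in used or j in used:               # constraint 4: one partner per base
            continue
        used.update((i, j))
        A[i][j] = A[j][i] = 1
    return A
```

A greedy choice is not guaranteed to find the best-scoring set of pairs overall; it is used here only to make the constraints concrete.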
A novel SHAPE reagent enables the analysis of RNA structure in living cells with unprecedented accuracy
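Two classes of compounds
- Nucleobase-specific probes: dimethyl sulfate (DMS) and others
  - Advantage: high signal-to-noise ratio in MaP experiments
- Selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE)
  - Existing reagents: 1M7, FAI, NAI
  - Advantage: unbiased probing of all 4 RNA bases
  - 2A3
    - Advantages: increased reactivity with RNA and high permeability through cell membranes; together these give a higher signal-to-noise ratio than NAI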
RNA readout: mutational profiling (MaP) approaches
- Typical: detecting reverse transcription (RT) drop-off events
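A Phred quality score is defined through a negative logarithm, $Q=-10\log_{10}P$, where $P$ is the base-calling error probability. The higher the score, the more reliable the sequencing result and the lower the error rate.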
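The relation between a Phred score and the error probability can be checked numerically. The sketch below uses only the standard definition $Q=-10\log_{10}P$ and the common Phred+33 ASCII encoding of FASTQ quality strings; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Per-base Phred scores from a Phred+33 encoded quality string."""
    return [ord(ch) - 33 for ch in quality_string]


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

A score of 20 therefore corresponds to 1 error in 100 calls, and a score of 30 to 1 error in 1000.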
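rf-map module: maps the reads
rf-count module: generates the mutation signal
rf-jackknife module: grid search for the optimal slope/intercept
Experimentally driven RNA structure modeling uses SHAPE-derived reactivities: the reactivities are converted into pseudo-free energy contributions, and the conversion depends on the slope (m) and intercept (b) parameters. From the paper: "we first used the ViennaRNA Package 2.0 (29) to perform a grid search of the optimal m/b pairs"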
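The role of the slope and intercept can be sketched as follows, assuming the commonly used linear-log conversion $\Delta G_{\mathrm{SHAPE}}(i) = m\ln(\mathrm{reactivity}_i + 1) + b$; the notes do not write the formula out, so this form is an assumption. The function names, the handling of missing or negative values, and the grid values are illustrative choices. Scoring each (m, b) pair requires folding the RNA and comparing against a reference structure, which is not shown.

```python
import math


def shape_pseudo_energy(reactivity, m, b):
    """Pseudo-free energy m * ln(reactivity + 1) + b for one nucleotide.

    Assumptions: None marks missing data and contributes 0.0;
    negative reactivities are clipped to 0.
    """
    if reactivity is None:
        return 0.0
    return m * math.log(max(reactivity, 0.0) + 1.0) + b


def slope_intercept_grid(m_values, b_values):
    """All (m, b) combinations to be evaluated in a grid search."""
    return [(m, b) for m in m_values for b in b_values]


reactivities = [0.0, 0.3, None, 1.2]          # made-up values for illustration
grid = slope_intercept_grid([1.0, 2.0, 3.0], [-1.0, -0.5, 0.0])
for m, b in grid[:2]:
    print(m, b, [round(shape_pseudo_energy(r, m, b), 3) for r in reactivities])
```

With a positive slope and a negative intercept, unreactive nucleotides get a small bonus for pairing and highly reactive ones get a penalty.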
GPT:
Predicting RNA secondary structure using SHAPE (Selective 2’-Hydroxyl Acylation analyzed by Primer Extension) reactivities involves a combination of experimental data and computational methods. Here’s a step-by-step guide on how to use SHAPE reactivities to predict RNA secondary structure:
Obtain SHAPE Reactivity Data:
Perform a SHAPE experiment to obtain reactivity data for the RNA molecule of interest. This involves treating the RNA with a SHAPE reagent and measuring the extent of reactivity at each nucleotide position.
Normalize the reactivity values to a scale between 0 and 1, where 0 represents no reactivity, and 1 represents the highest reactivity. This normalization is important for consistency and comparability.
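The normalization described here can be sketched as a simple scaling by the maximum value. This is only the literal reading of the step above; more robust schemes that discard outliers before scaling are often used in practice. The function name `normalize_reactivities`, the use of `None` for missing positions, and the clipping of negative values are assumptions made for the example.

```python
def normalize_reactivities(raw):
    """Scale raw reactivities to [0, 1] by dividing by the maximum.

    None marks positions without data and is passed through unchanged;
    negative values are clipped to 0.
    """
    valid = [r for r in raw if r is not None]
    top = max(valid, default=0.0)
    if top <= 0:
        return [None if r is None else 0.0 for r in raw]
    return [None if r is None else min(max(r / top, 0.0), 1.0) for r in raw]


print(normalize_reactivities([0.0, 2.0, None, 4.0, -1.0]))
```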
Calculate Pseudo-Free Energy Contributions:
Use the SHAPE reactivity data to calculate a pseudo-free energy contribution for each nucleotide. A commonly used conversion is linear in the logarithm of the reactivity, with the slope (m) and intercept (b) parameters mentioned above:
ΔG_SHAPE(i) = m * ln(reactivity_i + 1) + b
Where:
ΔG_SHAPE(i) is the pseudo-free energy contribution of nucleotide i.
m is the slope; a positive value penalizes pairing of highly reactive nucleotides.
b is the intercept; a negative value rewards pairing of nucleotides with low reactivity.
reactivity_i is the normalized SHAPE reactivity value for nucleotide i.
Nucleotides without reactivity data receive no pseudo-free energy contribution.
Use RNA Folding Software:
Utilize RNA secondary structure prediction software or algorithms that incorporate pseudo-free energy contributions. Popular software packages include RNAstructure, ViennaRNA, and others.
Input the RNA sequence, the calculated pseudo-free energy contributions, and any structural constraints or pairing information if available.
Run Secondary Structure Prediction:
Run the secondary structure prediction algorithm or software to generate the RNA’s secondary structure. The software will use the calculated pseudo-free energy contributions to guide the folding process.
The output will typically provide you with a predicted secondary structure, including base pairs and unpaired regions.
Analyze and Refine the Prediction:
Examine the predicted secondary structure and consider whether it aligns with known structural motifs or functional elements within the RNA.
Optionally, refine the prediction by adjusting the pseudo-free energy contributions or incorporating additional experimental data, if available.
Validate and Compare with Experimental Data:
If experimental structural data, such as chemical probing or NMR, is available for the same RNA molecule, compare the predicted structure with the experimental data to assess the accuracy of the prediction.
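One common way to quantify such a comparison is to score the predicted base pairs against the reference pairs with precision, recall, and F1. The sketch below assumes both structures are given as collections of (i, j) pairs; the function name `pair_metrics` is illustrative.

```python
def pair_metrics(predicted, reference):
    """Precision, recall and F1 of predicted base pairs against reference pairs."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)                      # pairs present in both structures
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = [(0, 8), (1, 7), (2, 5)]
reference = [(0, 8), (1, 7), (2, 6)]
print(pair_metrics(predicted, reference))
```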
Iterate and Improve:
Depending on the quality of the prediction and the RNA’s complexity, you may need to iterate through the process, making adjustments to the pseudo-free energy contributions or exploring different structural constraints to improve the accuracy of the prediction.
It’s important to note that while SHAPE-guided structure prediction can provide valuable insights into RNA secondary structure, it is a computational approximation and may not always perfectly capture the full complexity of RNA folding. Careful consideration of the results and validation against experimental data are essential for a reliable prediction.