计算 Ribosome 距离二肽/三肽基序距离的密度

引言


在 Molecular Cell 文章部分结果复现(续) 中我们计算了一些在 eIF5A 缺失条件下一些核糖体占据明显增多的三肽基序。


接下来我们计算核糖体的 P/E 位点距离这些三肽基序密度的分布情况,下面是图中结果:

安装

install.packages(“devtools”)

devtools::install_github(“junjunlab/RiboProfiler”)

or

remotes::install_github(“junjunlab/RiboProfiler”)

library(RiboProfiler)
测试

adjust_offset 函数默认矫正后是 P 位点的位置:

test

offset <- PsiteOffsetCheck(qc_data = qc_df,read_length = c(28,28))
offset$plot
offset$summary_offset

sample readLengths Offsets bamFiles bamLegends

1 eIF5Ad 28 -12 eIF5Ad eIF5Ad

2 WT 28 -12 WT WT

offset_df <- adjust_offset(offset_df = offset$summary_offset,
qc_df = qc_df)

norm_df <- get_nomalized_counts(longest_trans_file = “longest_info_extend.txt”,
qc_df = offset_df)

计算距离 PPP 基序的相对距离的密度分布, min_counts 筛选 counts 数大于等于 64 的,保持和文章中一致,motif 指定我们的基序,upstream/downstream 则是基序上下游的距离 :
rel_dist_motif(amino_file = “./sac_cds_aa.fa”,
longest_trans_file = “longest_info_extend.txt”,
normed_file = norm_df,
min_counts = 64,
motif = “PPP”,
upstream = -50,
downstream = 50)
筛选原理就是匹配指定的氨基酸并返回位置,筛选在指定位置范围里的目标,然后把 density 储存起来,下面是代码:

#

2_load longtest transcript anno file

#

with open(normed_file,’r’) as input:
for line in input:
# fileds = line.split()
# print(line.split())
,,,,,,trans_pos,trans_id,counts,,,exp = line.split()

    if trans_id in gene_info:
        seq = str(amino[trans_id])

        # search motif patterns for amino acids
        # pattern = re.compile("PPP",flags=re.IGNORECASE)
        pattern = re.compile(motif, flags=re.IGNORECASE)
        pos_res = pattern.finditer(seq)

        # motif codon position
        codon_motif_pos = [pos.start() + 1 for pos in pos_res]

        # check whether has target motif
        if len(codon_motif_pos) > 0:
            # motif codon position to nt position
            motif_nt_pos = [pos*3- 2 for pos in codon_motif_pos]

            # calculate relative distance to motif
            utr5_len = gene_info[trans_id][0]
            rel_pos_to_motif = [int(trans_pos) - utr5_len - nt_pos for nt_pos in motif_nt_pos]

            # filter rel_pos in limited range (-50,50)
            rel_pos_to_motif_in_range = [i for i in rel_pos_to_motif if i >= int(upstream) and i <= int(downstream)]

            # save in dict
            for j in rel_pos_to_motif_in_range:
                if j in rel_dist:
                    rel_dist[j][0] += 1
                    rel_dist[j][1] += float(exp)
                    # rel_dist[j][1] += int(counts)

normalize density | Average ribosome occupancy

sum_counts = sum([val[1] for key,val in rel_dist.items()])
average_exp = sum_counts/len(range(int(upstream),int(downstream + 1)))
rel_dist_norm = {key:val[1]/average_exp for key,val in rel_dist.items()}
然后会在 relative_distance_motif 文件夹下产生这些文件:

然后画图:
library(ggplot2)

plot

files <- c(“relative_distance_motif/WT_PPP_motif_occupancy.txt”,
“relative_distance_motif/eIF5Ad_PPP_motif_occupancy.txt”)

samples <- c(“WT”,”eIF5Ad”)

rel_dist_motif_plot(motif_occupancy_file = files,
sample_name = samples,
motif = “PPP”,
site = “P”)


如果我们要算 E site 的话,adjust_offset 函数指定一下 shift 参数再向左移动 3nt 即可,A site 则向右移动 3nt:

E site

offset_df <- adjust_offset(offset_df = offset$summary_offset,
qc_df = qc_df,shift = -3)

A site

offset_df <- adjust_offset(offset_df = offset$summary_offset,
qc_df = qc_df,shift = 3)
同样的我们计算 PP 二肽基序的看看:
rel_dist_motif(amino_file = “./sac_cds_aa.fa”,
longest_trans_file = “longest_info_extend.txt”,
normed_file = norm_df,
min_counts = 64,
motif = “PP”,
upstream = -50,
downstream = 50)

files <- c(“relative_distance_motif/WT_PP_motif_occupancy.txt”,
“relative_distance_motif/eIF5Ad_PP_motif_occupancy.txt”)

samples <- c(“WT”,”eIF5Ad”)

rel_dist_motif_plot(motif_occupancy_file = files,
sample_name = samples,
motif = “PP”,
site = “P”)

tri-peptide 计算

前面我讲过这个问题:


仔细想想,我们海油一种更综合的算法,就是争对特定的一个氨基酸位置,可分为向两边取,向前取两位和向后取两位三种方式,我们都纳入进来,就可以把氨基酸长度不是 3 的倍数的序列都容纳进来了。
with open(codon_exp_file,’r’) as input:
for line in input:
trans_id,codon_pos,exp = line.split()

      # # get tripeptide position
      # if int(codon_pos)%3 == 0:
      #     pos = int(codon_pos)//3
      # else:
      #     pos = int(int(codon_pos)//3) + 1

      # # check transid in AA fasta file
      # if trans_id in amino.keys():
      #     tripeptide_motif = amino.fetch(trans_id,(pos*3 - 2,pos*3))

      #     # make sure length is 3
      #     if len(tripeptide_motif) == 3:
      #         # get tripeptide and save in dict
      #         if tripeptide_motif in tripeptide_seq:
      #             tmp_exp,tmp_count = tripeptide_seq[tripeptide_motif]
      #             tripeptide_seq[tripeptide_motif] = [float(exp) + tmp_exp,tmp_count + 1]
      #         else:
      #             tripeptide_seq[tripeptide_motif] = [float(exp),1]

      # check transid in AA fasta file (revised code)!
      if trans_id in amino.keys():
          # 3nt window to fetch tri-peptide motif
          type1 = amino.fetch(trans_id,(int(codon_pos) - 1,int(codon_pos) + 1))
          type2 = amino.fetch(trans_id,(int(codon_pos) - 2,int(codon_pos)))
          type3 = amino.fetch(trans_id,(int(codon_pos),int(codon_pos) + 2))

          tripeptide_motif_list = [type1, type2, type3]

          # filter motif length is 3
          tripeptide_motif_list_filterd = [i for i in tripeptide_motif_list if len(i) == 3]

          # get tripeptide and save in dict
          for motif in tripeptide_motif_list_filterd:
              if motif in tripeptide_seq:
                  tmp_exp,tmp_count = tripeptide_seq[motif]
                  tripeptide_seq[motif] = [float(exp) + tmp_exp,tmp_count + 1]
              else:
                  tripeptide_seq[motif] = [float(exp),1]

再分析看看:

calculate

peptide_motif_score(amino_file = “sac_cds_aa.fa”,
codon_exp_file = c(“codon_occupancy/WT_codon_pos_exp.txt”,
“codon_occupancy/eIF5Ad_codon_pos_exp.txt”),
sample_name = c(“WT”,”eIF5Ad”),
occurrence_threshold = 100)

plot

triAmino_scater_plot(occupancy_file = c(“peptide_motif/WT_tripeptide_occupancy.txt”,
“peptide_motif/eIF5Ad_tripeptide_occupancy.txt”),
sampe_name = c(“WT”,”eIF5Ad”),
mark_motif = c(“DDP”,”DPP”,”PPP”,”PDK”,”PDG”,
“GPP”,”PDP”,”DNP”,”DGP”,”SPP”))

结尾


路漫漫其修远兮,吾将上下而求索。
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群 (微信交流群需收取 20 元入群费用,一旦交费,拒不退还!(防止骗子和便于管理)) 。QQ 群可免费加入, 记得进群按格式修改备注哦。

声明:文中观点不代表本站立场。本文传送门:https://eyangzhen.com/420878.html

(0)
联系我们
联系我们
分享本页
返回顶部