https://www.sciencedirect.com/science/article/pii/S0092867420306188
Pan-Genome of Wild and Cultivated Soybeans
大豆基因组数据下载链接
https://ngdc.cncb.ac.cn/soyomics/download
下载基因组fasta和对应的蛋白注释文件,用gffread提取cds序列和蛋白序列
把所有样本的cds合并到一起
cat *.CDS.fasta > all.samples.cds
计算dN/dS值的和核苷酸多样性整体的计算量还是挺大的,我这里每个类别的基因家族随便选择几个
dat.family.group %>%
filter(group==”Core”) %>%
sample_n(10) %>%
pull(familyID) %>%
write_lines(“cell.soybean.PanGenome/core_family.txt”)
python get_wgd_input.py core_family.txt ../06.longestTranscriptProt/OrthoFinder/Results_Jul27/Orthogroups/Orthogroups.txt core_wgd.input core.gene.ids
python脚本把Orthofinder的结果整理成wgd这个软件的输入格式,同时生成一个所有基因的id,把这些基因先从所有cds里提取出来
python脚本第一个位置是基因家族ID列表,第二个位置是orthofinder的结果,第三个位置是wgd的输入文件,第四个位置是所有计算kaks的基因的id
python脚本的内容
import sys
family_ids = []
with open(sys.argv[1],’r’) as fr:
for line in fr:
family_ids.append(line.strip())
fw01 = open(sys.argv[3],’w’)
fw02 = open(sys.argv[4],’w’)
with open(sys.argv[2],’r’) as fr:
for line in fr:
family_id = line.strip().split(” “)[0].replace(“:”, “”)
if family_id in family_ids:
temp_list = line.strip().split(” “)[1:]
fw01.write("%s\n"%('\t'.join(temp_list)))
for gene_id in temp_list:
fw02.write("%s\n"%(gene_id))
fw01.close()
fw02.close()
提取cds,wgd好像是可以自动提取,速度相对偏慢,用seqkit提取
seqkit grep -f core.gene.ids ../02.cds/all.samples.cds -o core.cds
conda activate wgd01
wgd ksd –n_threads 8 core_wgd.input core.cds –preserve -o core.kaks
重复以上步骤
软核心
dat.family.group %>%
filter(group==”SoftCore”) %>%
sample_n(10) %>%
pull(familyID) %>%
write_lines(“cell.soybean.PanGenome/softcore_family.txt”)
python get_wgd_input.py softcore_family.txt ../06.longestTranscriptProt/OrthoFinder/Results_Jul27/Orthogroups/Orthogroups.txt softcore_wgd.input softcore.gene.ids
~/anaconda3/envs/syri/bin/seqkit grep -f softcore.gene.ids ../02.cds/all.samples.cds -o softcore.cds
wgd ksd –n_threads 8 softcore_wgd.input softcore.cds –preserve -o softcore.kaks
python get_wgd_input.py dispensable_family.txt ../06.longestTranscriptProt/OrthoFinder/Results_Jul27/Orthogroups/Orthogroups.txt dispensable_wgd.input dispensable.gene.ids
~/anaconda3/envs/syri/bin/seqkit grep -f dispensable.gene.ids ../02.cds/all.samples.cds -o dispensable.cds
wgd ksd –n_threads 8 dispensable_wgd.input dispensable.cds –preserve -o dispensable.kaks
dat.family.group %>%
filter(group==”Private”) %>%
sample_n(10) %>%
pull(familyID) %>%
write_lines(“cell.soybean.PanGenome/private_family.txt”)
python get_wgd_input.py private_family.txt ../06.longestTranscriptProt/OrthoFinder/Results_Jul27/Orthogroups/Orthogroups.txt private_wgd.input private.gene.ids
~/anaconda3/envs/syri/bin/seqkit grep -f private.gene.ids ../02.cds/all.samples.cds -o private.cds
wgd ksd –n_threads 8 private_wgd.input private.cds –preserve -o private.kaks
所有结果合并
myfun<-function(x){read_tsv(x)%>%mutate(group=str_extract(x,pattern = “[a-z]+.cds”))}
list.files(“.”,pattern = “*.tsv”,recursive = TRUE,full.names = TRUE)%>%map(myfun)%>%bind_rows()%>%write_tsv(“kaks.txt”)
作图
read_tsv(“cell.soybean.PanGenome/kaks.txt”) %>%
mutate(group=str_replace(group,”.cds”,””)) %>%
mutate(group=factor(group,levels=c(“core”,”softcore”,
“dispensable”,”private”))) %>%
ggplot(aes(x=group,y=Omega))+
geom_boxplot(aes(fill=group),
show.legend = FALSE)+
ylim(0,2)+
theme_bw(base_size = 15)+
theme(panel.grid = element_blank())+
labs(x=NULL)
这个样本量很小,基本的规律是体现不出来的
欢迎大家关注我的公众号
声明:文中观点不代表本站立场。本文传送门:https://eyangzhen.com/423974.html