大豆Cell论文中泛基因家族分析复现二：不同类别基因家族比例饼图和频率柱形图

https://www.sciencedirect.com/science/article/pii/S0092867420306188
Pan-Genome of Wild and Cultivated Soybeans

大豆基因组数据下载链接

https://ngdc.cncb.ac.cn/soyomics/download
下载基因组fasta和对应的蛋白注释文件，用gffread提取cds序列和蛋白序列
下载的基因组数据的序列ID和对应的gff文件里的序列ID不一样，基因组数据里的序列有前缀，需要对应的修改
seqkit replace -p “SoyC01.” -r “” SoyC01.v1.fasta -o SoyC01.v1.replaceID.fasta

总共27个基因组都需要这样检查修改
提取cds和蛋白
gffread -g SoyC01.v1.replaceID.fasta -x cds.fa -y pep.fa 04.gff/SoyC01.gene.gff
大豆的每个基因组注释每个基因对应着多个转录本，我这里利用gff文件获取最长转录本id，然后根据id提取cds和蛋白
Rscript getLongestTranscriptID.R 04.gff/SoyC01.gene.gff SoyC01.list
seqkit grep -f SoyC01.list cds.fa -o SoyC01.cds.fa
seqkit grep -f SoyC01.list pep.fa -o SoyC01.pep.fa
每个基因组都重复一下以上的步骤，然后将所有的蛋白序列放到一个文件夹下，运行orthofinder
orthofinder -f 06.longestTranscriptProt/ -M msa -S diamond -T fasttree -a 52 -t 52
这一步数据比较多的话需要的时间比较长，也需要比较大的运行内存
运行完这一步得到两个文件，在文件夹下 06.longestTranscriptProt/OrthoFinder/Results_Jul27/Orthogroups/
Orthogroups.GeneCount.tsv
这个文件里是基因家族ID和每个物种在这个基因家族里有多少个基因

部分数据截图
Orthogroups_UnassignedGenes.tsv
这个文件里是没有被归类到基因家族里的单个基因
这里总共是27个基因组，在27个基因组里都存在的核心基因家族，25 26个事软核心，2-24是可变基因家族，1个事私有基因家族
R语言代码

library(tidyverse)
read_tsv(“cell.soybean.PanGenome/Orthogroups.GeneCount.tsv”) %>%
dplyr::select(-Total) %>%
column_to_rownames(“Orthogroup”) %>%
mutate(across(everything(),~ifelse(.>0,1,0))) %>%
rowSums() %>%
as.data.frame() %>%
rownames_to_column() %>%
magrittr::set_colnames(c(“familyID”,”total”)) %>%
mutate(group=case_when(
total == 27 ~ “Core”,
total < 27 & total >=25 ~ “SoftCore”,
total < 25 & total >=2 ~ “Dispensable”,
total == 1 ~ “Private”
)) -> dat.family.group

myfun<-function(x){
y<-c()
for(i in 1:length(x)){
if(i == 1){
new_y = x[i]/2
y<-append(y,new_y)
}
else{
new_y = sum(x[1:i-1])+x[i]/2
y<-append(y,new_y)
}
}
return(y)
}

myfun(c(1,4,7,8))

library(ggrepel)

dat.family.group %>%
group_by(group) %>%
summarise(count=n()) %>%
mutate(prop=sprintf(“%0.2f”,count*100/sum(count))) %>%
mutate(label=paste0(prop,”%”)) %>%
mutate(group=factor(group,levels=rev(c(“Core”,”SoftCore”,”Dispensable”,”Private”)))) %>%
arrange(group) %>%
ungroup() %>%
mutate(y.position=myfun(rev(count))) %>%
ggplot(aes(y=count,x=””,fill=group))+
geom_text_repel(aes(label=paste0(rev(group),
“\n”,
rev(label)),
y=y.position),
nudge_x = c(1,1,2,1))+
geom_bar(width = 1, stat=”identity”, alpha=1,
show.legend = FALSE)+
coord_polar(“y”, start=0)+
theme_void(base_size = 15)+
scale_fill_manual(values = rev(c(“#e46a6f”,”#e58ea2″,”#5f9ed5″,”#e6db91″)))+
theme(plot.margin = unit(c(1,1,1,1),’cm’))-> fig2.1

fig2.1

read_tsv(“cell.soybean.PanGenome/Orthogroups_UnassignedGenes.tsv”) %>%
pivot_longer(!Orthogroup) %>%
na.omit() %>%
pull(name) %>%
table() %>%
as.data.frame() %>%
arrange(Freq) %>%
magrittr::set_colnames(c(“species”,”Freq”)) %>%
mutate(species=factor(species,levels=species)) -> bar.dat.02

dat.family.group %>%
pull(total) %>% table() %>%
as.data.frame() %>%
magrittr::set_colnames(c(“Frequency”,”Family number”)) %>%
mutate(Frequency=as.numeric(Frequency)) %>%
mutate(group=case_when(
Frequency == 27 ~ “Core”,
Frequency < 27 & Frequency >=25 ~ “SoftCore”,
Frequency < 25 & Frequency >=2 ~ “Dispensable”,
Frequency == 1 ~ “Private”
)) %>%
mutate(group=factor(group,levels=c(“Core”,”SoftCore”,”Dispensable”,”Private”))) -> bar.dat.01

ggplot(data=bar.dat.01,
aes(x=Frequency,y=Family number))+
geom_col(aes(fill = group),
show.legend = FALSE)+
scale_x_continuous(breaks = c(seq(1,27,3),27))+
scale_fill_manual(values = c(“#e46a6f”,”#e58ea2″,”#5f9ed5″,”#e6db91″))+
theme_bw(base_size = 15)+
theme(panel.grid = element_blank())+
scale_y_continuous(expand = expansion(mult = c(0,0.1)))+
geom_bar(data=bar.dat.02,
aes(x=1,y=Freq),
stat=”identity”,
position = “stack”,
color=”grey”,
lty=”dashed”,
show.legend = FALSE,
fill=”#e6db91″)-> fig2.2

fig2.2

把两个图组合到一起
library(ggpmisc)

fig2.2+
geom_plot(data=tibble(x=2,y=22000,plot=list(fig2.1)),
aes(x=x,y=y,label=plot),
vp.width=0.8,vp.height=0.8)

这个数据和论文中不是完全一致，这个问题不大
欢迎大家关注我的公众号
小明的数据分析笔记本

声明：文中观点不代表本站立场。本文传送门：https://eyangzhen.com/423757.html

大豆Cell论文中泛基因家族分析复现二：不同类别基因家族比例饼图和频率柱形图

作者专栏