大豆Cell论文中泛基因家族分析复现三：核心和泛基因家族曲线

https://www.sciencedirect.com/science/article/pii/S0092867420306188
Pan-Genome of Wild and Cultivated Soybeans

大豆基因组数据下载链接

https://ngdc.cncb.ac.cn/soyomics/download下载基因组fasta和对应的蛋白注释文件，用gffread提取cds序列和蛋白序列前一篇推文已经运行了orthofinder，拿到了Orthogroups.GeneCount.tsv文件，利用这个文件转换得到PanGP这个软件的输入数据，然后用PanGP得到画图数据文件格式转换代码

library(tidyverse)
read_tsv(“cell.soybean.PanGenome/Orthogroups.GeneCount.tsv”) %>%
dplyr::select(-Total) %>%
column_to_rownames(“Orthogroup”) %>%
mutate(across(everything(),~ifelse(.>0,1,0))) %>%
write_delim(file = “cell.soybean.PanGenome/cell.soybean.PanGP.input”,
delim = “”,
col_names = FALSE)
PanGP输入数据的部分截图

每行是一个基因家族，每列是一个样本，1代表这个样本里有这个基因家族，0代表这个样本里没有这个基因家族
输入数据类型选择0/1矩阵
算法选择 total random
sample size选择200
sample repeat 选择4
这个最快，如果样本量比较大的话运行相对也比较慢

直接出一个图
在file选项里可以把作图数据导出
在 analyze选项里可以拟合方程
它的拟合泛基因家族用到的是幂函数，核心基因家族用到的是指数函数拟合
导出数据自己用R语言作图
导出的数据格式截图

先拟合一下数据

核心和泛基因家族都用指数函数来拟合
library(tidyverse)
read_tsv(“cell.soybean.PanGenome/Orthogroups.GeneCount.tsv”) %>%
dplyr::select(-Total) %>%
column_to_rownames(“Orthogroup”) %>%
mutate(across(everything(),~ifelse(.>0,1,0))) %>%
write_delim(file = “cell.soybean.PanGenome/cell.soybean.PanGP.input”,
delim = “”,
col_names = FALSE)

library(ggtrendline)

PanGP.dat<-read_tsv(“cell.soybean.PanGenome/PanGenomeData.txt”) PanGP.dat %>% colnames()

sample.size.x<-PanGP.dat %>% pull(GenomeNum)

pan.y<-PanGP.dat %>% pull(Pan-Genome Size)
core.y<-PanGP.dat %>% pull(Core Genome Size)

ggtrendline(sample.size.x,pan.y,
model = “exp3P”,eSize = 5,
eq.x = 20,
eq.y = 52000,
rrp.x = 20,
rrp.y = 50000)+
theme_bw()+
theme(panel.grid = element_blank())+
labs(x=”Sample size”,y=”Pan gene family”) -> p1

ggtrendline(sample.size.x,core.y,
model = “exp3P”,eSize = 5,
eq.x = 20,
eq.y = 32000,
rrp.x = 20,
rrp.y = 30000)+
theme_bw()+
theme(panel.grid = element_blank())+
labs(x=”Sample size”,y=”Core gene family”) -> p2

library(patchwork)

p1+p2

做箱线图的代码

PanGP.dat %>%
pivot_longer(!GenomeNum) %>%
mutate(GenomeNum=factor(GenomeNum,levels=1:27)) %>%
ggplot(aes(x=GenomeNum,y=value))+
geom_boxplot(aes(fill=name,color=name),
outliers = FALSE)+
theme_bw(base_size = 15)+
theme(panel.grid = element_blank(),
legend.position = c(0.7,0.5))+
scale_fill_manual(values = c(“#f07571″,”#4ea5d6”),
name=NULL)+
scale_color_manual(values = c(“#f07571″,”#4ea5d6”),
name=NULL)+
geom_line(data = data.frame(x=seq(1,27,by=0.1)) %>%
mutate(y=-14916exp(-0.152x)+58428),
aes(x=x,y=y),
lty=”dashed”,
color=”grey”)+
geom_line(data = data.frame(x=seq(1,27,by=0.1)) %>%
mutate(y=18675exp(-0.0865x)+22056),
aes(x=x,y=y),
lty=”dashed”,
color=”grey”)+
scale_x_discrete(breaks=c(seq(1,27,3),27))+
labs(x=”Sample number”,
y=”Family number”)

欢迎大家关注我的公众号
小明的数据分析笔记本

声明：文中观点不代表本站立场。本文传送门：https://eyangzhen.com/423799.html

大豆Cell论文中泛基因家族分析复现三：核心和泛基因家族曲线

作者专栏