最新发表的土豆泛基因组Nature论文中用pggb做泛基因分析的代码

论文

Leveraging a phased pangenome for haplotype design of hybrid potato
https://www.nature.com/articles/s41586-024-08476-9
论文中的代码链接
https://github.com/Chenglin20170390/Haplotype-diversity/tree/main
论文中关于pggb构建泛基因组的方法描述
The Minigraph-Cactus pipeline27 and PGGB26 were used to construct a pseudo-phased pangenome with all 61 haplotypes based on the whole-genome alignment (including the DMv6.1 reference genome). For the PGGB, we estimated the divergence of each chromosome with mash distances and confirmed chromosome community with wfmash87 mapping. Then, we used “pggb -s 10000 -n 61 -p 90 -k 47 -P asm20 -O 0.001” to build each chromosome graph. We visualized the 1D layout of the graph and estimated presence and absence ratios to the DMv6.1 reference in 100-kb sliding windows using ODGI88. The small variants and SVs were detected by vg deconstruct from snarls, and we only kept top-level and <1-Mb variants with vcfbub. For the Minigraph-Cactus pipeline, we assigned DMv6.1 as the guide for the paths, and progressively aligned the 60 haplotypes to it. We used the cactus-pangenome script with parameters “–gfa full –gbz full –vcfReference DMv6.1” to generate complete workflows and execute commands. The generated graph fragment assembly (GFA) format graph was used for edge, node and coverage statistics and subgraph generation from a BED input. The VCF output file comprises all graph variations based on the DMv6.1 reference, enabling the calculation of polymorphisms.
关于方法部分自己还有一些不明白的地方
1、confirmed chromosome community 这一步的作用是什么？
2、pggb构建泛基因组的参数怎么选？人类的泛基因组论文里好像是做了这方面的工作，再好好看看人类泛基因的那篇论文
3、依据参考基因组的滑窗bed统计pav是不是只能统计参考基因组里有的序列，如果参考基因组里没有的序列是不是就忽略了
4、top-level的变异是什么意思？都有哪些level
用拟南芥的数据运行一下论文中的代码

第一步是把拟南芥的序列id改成PanSN的形式

改序列id并把把4个拟南芥的基因组合并到一起
sed -i ‘s/Chr/col0#1#Chr/g’ at.col0.chr.fna
sed -i ‘s/Chr/An1#1#Chr/g’ An1.fa
sed -i ‘s/Chr/C24#1#Chr/g’ C24.fa
sed -i ‘s/Chr/Kyo#1#Chr/g’ Kyo.fa

cat at.col0.chr.fna An1.fa C24.fa Kyo.fa > pggb.input.4AT.genomes.fa

第二步是？

confirmed chromosome community，这一步我理解的是对染色体进行一个聚类？
wfmash -m -p 90 -s 20000 -n 4 -t 24 pggb.input.4AT.genomes.fa.gz > pggb.input.4AT.genomes.paf
less -S pggb.input.4AT.genomes.paf

python paf2net.py -p pggb.input.4AT.genomes.paf
pggb -i abc/pggb.input.4AT.genomes.fa.bf3285f.community.1.fa -p 90 -s 20000 -n 4 -k 47 -P asm10 -O 0.001 -G 700,900,1100 -t 24 -v -o pggb.chr2.output

这一步会生成三个用于画网络图的数据

用这个数据画网络图
library(tidyverse)
read_delim(“C:/Users/xiaoming/Desktop/pggb.input.4AT.genomes.paf.vertices.id2name.txt”,
col_names = FALSE,delim=” “) -> node

read_delim(“C:/Users/xiaoming/Desktop/pggb.input.4AT.genomes.paf.edges.list.txt”,
col_names = FALSE,delim=” “) %>%
bind_cols(read_delim(“C:/Users/xiaoming/Desktop/pggb.input.4AT.genomes.paf.edges.weights.txt”,
col_names = FALSE,delim=” “) %>%
magrittr::set_colnames(“X3”)) -> edges

library(igraph)
net<-graph_from_data_frame(d=edges,
vertices=node,directed=T)

library(ggraph)
?geom_edge_link
ggraph(net)+
geom_edge_link(color=”green”,aes(edge_width =X3))+
geom_node_point(color=”red”,size=10)+
theme_void()+
geom_node_label(aes(label=X2))

partition-before-pggb -i in.fa \ # input file in FASTA format
-o output \ # output directory
-n 9 \ # number of haplotypes (optional with PanSN-spec)
-t 16
运行这个命令可以把合并到一起的数据按照染色体拆分
第三步是运行pggb

我这里只用2号染色体测试
pggb -i abc/pggb.input.4AT.genomes.fa.bf3285f.community.1.fa \
-p 90 -s 20000 -n 4 -k 47 -P asm10 -O 0.001 \
-G 700,900,1100 -t 24 -v -o pggb.chr2.output
这里的参数怎么选还不确定
第四步是odgi 可视化

这里我也没太搞懂，可视化的结果怎么看还需要研究研究
第五步是odgi算pav

这一步先跳过
第六步是odgi统计图形泛基因组的一些指标

odgi stats -t 12 -S -i pggb.chr2.output/pggb.input.4AT.genomes.fa.bf3285f.community.1.fa.2a595b4.7608fc1.9cf87a9.smooth.final.og
第七步是 vg deconstruct

将gfa文件拆成vcf文件
vg deconstruct -P col0 -H ‘#’ -e -a -t 12 pggb.chr2.output/pggb.input.4AT.genomes.fa.bf3285f.community.1.fa.2a595b4.7608fc1.9cf87a9.smooth.final.gfa > pggb.chr2.vg.deconstruct.vcf
把vcf文件中的snp数据拆出来
bcftools view –threads 4 -v snps pggb.chr2.vg.deconstruct.vcf -O v -o pggb.chr2.vg.deconstruct.snp.vcf
bcftools view –threads 4 -V snps pggb.chr2.vg.deconstruct.vcf -O v -o pggb.chr2.vg.deconstruct.non.snp.vcf

这里有个疑问是从gfa文件拆出来的vcf文件里有倒位（Inversion）的变异吗？
是不是只有插入和删除变异？
论文中的代码部分也写了
1Mb < SVs < 100Mb (large inversion)
那还是有inversion的吗？
挑选大于50bp的变异
论文中提供的工具是vcfbub
我这边用这个工具是报错的，没有-A参数，可能是版本不一样？
第八步是关于path的内容

这一步也先跳过，暂时没有搞懂

声明：文中观点不代表本站立场。本文传送门：https://eyangzhen.com/424688.html

最新发表的土豆泛基因组Nature论文中用pggb做泛基因分析的代码

作者专栏

小明的数据分析笔记本