Learning data analysis from Nature papers: the graph pangenome analysis in the 1,086 near-T2T yeast genomes paper (1)

From genotype to phenotype with 1,086 near telomere-to-telomere yeast genomes
https://www.nature.com/articles/s41586-025-09637-0
Data and code

https://zenodo.org/records/15698884
https://github.com/HaploTeam/1086YeastGenomes/tree/main
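For the graph pangenome part, the authors built graph pangenomes with both minigraph and Minigraph-Cactus. The paper covers more than 1,000 genomes in total, but only 500 of them were used to build the graph pangenome.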

How the paper's Methods section describes this part
We build a graph pangenome using the Minigraph-Cactus pipeline v.2.6.4 (ref. 59) with 500 haplotypes, including the reference genome and 499 genomes selected to represent a maximum number of SVs. We used the first graph generated by Minigraph (ref. 58), that uniquely contains SVs, in order to identify repetitive reference segments in the graph and novel sequences. Only segments larger than 100 bp were considered for these analyses. The segments were mapped to the reference genome using minimap2 -ax asm5 v.2.24 (ref. 70) and the coverage depth along the genome was retrieved using samtools depth v.1.16.1 (ref. 89).
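The commands below start from a FASTA file of graph segments, but the step that produces it is not shown. As a rough sketch only, assuming the segments come from the S lines of the minigraph GFA (segment name in column 2, sequence in column 3), the conversion could look like this. The function name gfa_segments_to_fasta and the toy graph are made up for illustration and are not taken from the authors' repository.

```python
import io

def gfa_segments_to_fasta(gfa_handle, fasta_handle):
    """Write every GFA segment (S line) that carries a sequence as a FASTA record."""
    n_written = 0
    for line in gfa_handle:
        if not line.startswith("S\t"):
            continue  # skip header, link, walk and path lines
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # the GFA does not store a sequence for this segment
        fasta_handle.write(f">{name}\n{seq}\n")
        n_written += 1
    return n_written

# Tiny made-up graph: two segments and one link (hypothetical data)
toy_gfa = (
    "H\tVN:Z:1.0\n"
    "S\ts1\tACGT\tSN:Z:chrI\tSO:i:0\tSR:i:0\n"
    "S\ts2\tGG\n"
    "L\ts1\t+\ts2\t+\t0M\n"
)
out = io.StringIO()
count = gfa_segments_to_fasta(io.StringIO(toy_gfa), out)
print(count)
print(out.getvalue(), end="")
```

For a real graph you would pass open file handles instead of the in-memory strings used here.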
Code provided with the paper

First, filter out the segments shorter than 100 bp:
python ../1086YeastGenomes-main/AssemblyPipelineCollapsed/GenomeAssemblyTools/filterContigSize.py -f SacePangenomeGraph.500Haplotypes.sv.gfa.fa -m 0.1 -o abc
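I have not looked inside filterContigSize.py, so the following is only a minimal sketch of the idea behind this step: read a FASTA file and keep the records that reach a minimum length. The helper names read_fasta and filter_by_length are my own, and whether the authors' script treats the threshold as inclusive is an assumption here.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_bp=100):
    """Keep records whose sequence is at least min_bp long (inclusive: an assumption)."""
    for header, seq in records:
        if len(seq) >= min_bp:
            yield header, seq

# Toy input: one 4 bp record and one 120 bp record split over two lines
toy = ">short\nACGT\n>long\n" + "A" * 60 + "\n" + "C" * 60 + "\n"
kept = list(filter_by_length(read_fasta(io.StringIO(toy)), min_bp=100))
print([(h, len(s)) for h, s in kept])
```

Only the 120 bp record survives the filter; the 4 bp record is dropped.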
Map the filtered sequences to the reference genome:
minimap2 -ax asm5 -t 8 00.genomes/S288C.genome.fa GraphBasedPangenomes/abc.min0.1kb.fasta | samtools sort -o MappingOnRef.bam
Compute the depth at every position:
samtools depth -aa MappingOnRef.bam | gzip > bam.depth.tsv.gz
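Before plotting, it can help to get a quick numeric summary of the depth table. samtools depth writes three tab-separated columns without a header: chromosome, 1-based position and depth. The sketch below tallies, per chromosome, the number of positions, the maximum depth and how many positions exceed a threshold. The function name summarise_depth and the threshold of 100 are illustrative choices of mine, and the toy lines are made up.

```python
import gzip
from collections import defaultdict

def summarise_depth(lines, threshold=100):
    """Per chromosome: position count, maximum depth, positions with depth > threshold."""
    stats = defaultdict(lambda: {"positions": 0, "max_depth": 0, "above": 0})
    for line in lines:
        chrom, _pos, depth = line.rstrip("\n").split("\t")
        depth = int(depth)
        s = stats[chrom]
        s["positions"] += 1
        s["max_depth"] = max(s["max_depth"], depth)
        if depth > threshold:
            s["above"] += 1
    return dict(stats)

# Made-up lines in samtools depth format
toy_lines = ["chrI\t1\t0\n", "chrI\t2\t3\n", "chrII\t1\t150\n", "chrII\t2\t2\n"]
summary = summarise_depth(toy_lines, threshold=100)
print(summary)

# For the real file, something like:
# with gzip.open("bam.depth.tsv.gz", "rt") as fh:
#     summary = summarise_depth(fh)
```

Because the file is read line by line, this works on the full per-base table without loading it into memory.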
Plotting code

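library(tidyverse)  # provides read_tsv(), %>% and ggplot2

# Columns written by samtools depth: X1 = chromosome, X2 = position, X3 = depth
read_tsv("C:/Users/lenovo/Desktop/bam.depth.tsv.gz",
         col_names = FALSE) -> dat

pdf(file = "figs18.pdf", width = 16, height = 16)
dat %>%
  # filter(X1 == "chrIII") %>%   # uncomment to plot a single chromosome
  ggplot(aes(x = X2, y = X3)) +
  geom_line(aes(color = X1), show.legend = FALSE) +
  facet_wrap(~X1) +              # one panel per chromosome
  theme_bw(base_size = 15) +
  theme(panel.grid = element_blank()) +
  labs(x = "Position on chromosome", y = "Number of occurrences in the graph pangenome") +
  scale_x_continuous(labels = function(x){paste0(x/1000000, " M")})  # show positions in Mb
dev.off()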

This plot corresponds to Fig. S18 in the paper, which describes this part as follows:

Although the size of the graphs differs, they are both much larger than the linear reference (Table S13). The main reason for this increase in size is the large redundancy present in the graph, caused by non-collapsed CNVs. Out of 34,319 segments larger than 100 bp in the Minigraph pangenome (for a cumulative length of 47,416,522 bp), 19,276 segments (cumulative length of 41,783,718 bp) align back to the reference genome. Some regions of the linear reference are present more than a hundred times in the graph, mostly corresponding to Ty elements or long terminal repeats (LTR) (Fig. S18)
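To put the numbers in this passage in proportion, here is a small arithmetic check. It uses only the four figures quoted above; the variable names are mine.

```python
# Figures quoted from the paper (segments larger than 100 bp in the Minigraph pangenome)
total_segments = 34_319
total_bp = 47_416_522
aligned_segments = 19_276
aligned_bp = 41_783_718

# Share of segments and of bases that align back to the reference
frac_segments = aligned_segments / total_segments
frac_bp = aligned_bp / total_bp

# What is left over after removing the segments that align back
unaligned_segments = total_segments - aligned_segments
unaligned_bp = total_bp - aligned_bp

print(f"Segments aligning back: {frac_segments:.1%}")
print(f"Bases aligning back:    {frac_bp:.1%}")
print(f"Remaining segments:     {unaligned_segments}")
print(f"Remaining bases:        {unaligned_bp}")
```

Roughly 56% of the segments but about 88% of the cumulative length align back to the reference, so the segments that do not align back are, on average, considerably shorter than those that do.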
You are welcome to follow my WeChat official account:
小明的数据分析笔记本

Statement: from 小明的数据分析笔记本; this represents the author's views only. Link: http://eyangzhen.com/5277.html
