Gap-filling strategy for the human T2T genome in the Cell Research paper

Paper

https://www.nature.com/articles/s41422-023-00849-5
The complete and fully-phased diploid genome of a male Han Chinese
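The initial assemblies were produced with Verkko and hifiasm.
A Flye assembly was also built, using Canu-based binned ONT reads as input.
I have not yet worked out what "binned ONT reads" means here. A quick search turned up this explanation:
“Binned ONT reads” refers to a set of DNA sequencing reads generated by Oxford Nanopore Technology (ONT) that have been grouped together based on specific characteristics, typically used in metagenomics to separate reads belonging to different microbial genomes within a sample, essentially assigning each group of reads to a “bin” representing a potential individual organism.
So the reads are grouped by some shared feature. In metagenomic data the reads can be assigned to different organisms, but how would reads from a single human be grouped? By paternal versus maternal origin? That would be consistent with Canu's trio-binning mode, which partitions an offspring's long reads by haplotype using k-mers from the parents' reads, but I have not confirmed this yet. Comments and discussion on this question are welcome.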
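First, the hifiasm and Verkko assemblies are joined into scaffolds. TGS-GapCloser then fills gaps using the hifiasm and Flye assemblies and the binned ONT reads. The remaining gaps are filled with ONT reads: the 5 kb upstream and downstream of each gap are extracted and aligned to the ONT reads, and the alignments are visualized with LINKVIEW (https://github.com/YangJianshun/LINKVIEW2). The result is then polished.
That is my understanding of the overall workflow; how to actually implement it still needs more thought.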
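As a starting point for the "extract the 5 kb on either side of each gap" step, here is a minimal Python sketch. It assumes that gaps are represented as runs of N in the scaffold FASTA, which is the usual convention but is not stated in the post. The function names and the output header format are made up for illustration and are not taken from the paper's code.

```python
import re


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the ID before the first space
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gap_flanks(seq, flank=5000):
    """Yield (gap_start, gap_end, left_flank, right_flank) for every run of N.

    Coordinates are 0-based and end-exclusive. Assumption: a gap is any run
    of N/n. If two gaps lie closer together than `flank`, a flank can itself
    contain N bases; this sketch does not trim them.
    """
    for match in re.finditer(r"[Nn]+", seq):
        start, end = match.start(), match.end()
        left = seq[max(0, start - flank):start]
        right = seq[end:end + flank]
        yield start, end, left, right


def flanks_to_fasta(records, flank=5000):
    """Return the flanking sequences of all gaps as FASTA text."""
    lines = []
    for name, seq in records.items():
        for i, (start, end, left, right) in enumerate(gap_flanks(seq, flank), 1):
            if left:
                lines.append(f">{name}_gap{i}_left_{start - len(left)}_{start}")
                lines.append(left)
            if right:
                lines.append(f">{name}_gap{i}_right_{end}_{end + len(right)}")
                lines.append(right)
    return "".join(line + "\n" for line in lines)
```

The FASTA text returned by flanks_to_fasta could be written to a file and aligned against the ONT reads, for example with minimap2, to look for reads that span both flanks of a gap.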
In the code that accompanies the paper:
https://github.com/dongyawu/humanT2TPhasedGenome/blob/main/gap_filling_by_assembly.sh
gaps are filled in turn with the hifiasm + Hi-C, Verkko and Flye assemblies.
The filling itself is done by a Perl script:
https://github.com/dongyawu/humanT2TPhasedGenome/blob/main/R7_gap_filling_by_assembly.pl
I downloaded the script and tested it:
perl R7_gap_filling_by_assembly.pl gap.fill.test/practice.fasta verkko.asm.output/assembly.fasta output.gap.filling
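The first positional argument is the genome sequence whose gaps need filling.
The second positional argument is the genome assembled by another assembler.
The third positional argument is the prefix for the output files.
The minimap2 path inside the script has to be changed to match your own system.
Also remove the last line of the script. It draws a dotplot, and if you keep it you have to fix the paths in it as well.
On my own data, not a single gap was filled. I have only one type of sequencing data, so filling gaps with assemblies of the same data from different assemblers may not be very workable.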
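One way to check whether any gap was actually closed is to count the runs of N per sequence before and after gap filling. The sketch below assumes that gaps are runs of N and that sequence names are unchanged between the two FASTA files. Both are assumptions, not something verified against the Perl script, and the helper names are hypothetical.

```python
import re


def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_gaps(seq):
    """Return (number of N runs, total number of N bases) for one sequence."""
    runs = re.findall(r"[Nn]+", seq)
    return len(runs), sum(len(run) for run in runs)


def gap_report(path):
    """Map each sequence name in a FASTA file to its (gap count, N bases)."""
    return {name: count_gaps(seq) for name, seq in read_fasta(path)}


def compare(before_path, after_path):
    """Return {name: number of gaps closed} between two assemblies.

    Assumption: sequence names are identical in both files. A sequence
    missing from the 'after' file is reported as having closed 0 gaps.
    """
    before, after = gap_report(before_path), gap_report(after_path)
    closed = {}
    for name, (n_before, _) in before.items():
        n_after = after.get(name, (n_before, 0))[0]
        closed[name] = n_before - n_after
    return closed
```

Calling compare() with the input FASTA and whichever FASTA the gap-filling step writes returns a per-sequence count of closed gaps. A result of all zeros would confirm that nothing was filled.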
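(The main problem is that I cannot read Perl, so I cannot tell what the script actually does. Perl is still worth learning.)
The paper's supplementary methods also describe how the telomere sequences were extended; I still need to read that part carefully.
The paper's authors have also posted a preprint on rice centromeres: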
https://www.biorxiv.org/content/10.1101/2024.07.28.605524v1
Genetic diversity and evolution of rice centromeres
The assembly and gap-filling strategy in that paper:
hifiasm and Verkko are each run for assembly, and then (quoting the methods):
Hifiasm and Verkko for assembly, respectively. The scaffolded assembly of Verkko was split into contigs and used to scaffold and extend Hifiasm contigs using a in-house script. The extended contigs (larger than 10 Kb) were then scaffolded on chromosomes using RagTag with no further chimeric splitting (v2.1.0, scaffold -f 2000 --remove-small --aligner minimap2) (Alonge et al., 2022), using T2T assembly of NIP as the reference genome
The code corresponding to this part of the methods:
https://github.com/dongyawu/CenTools/blob/main/Asm/02_gap_filling_polising.sh
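The first step in the quoted methods, splitting a scaffolded assembly back into contigs, can be sketched in a few lines of Python. This is only an illustration under the assumption that scaffold joins are runs of N; it is not the authors' in-house Perl script. The length filter borrows the 10 Kb threshold from the quoted text, where it is applied to the extended contigs rather than to the freshly split ones. The function name and the contig naming scheme are hypothetical.

```python
import re


def split_scaffold(name, seq, min_gap=1, min_len=10_000):
    """Split one scaffold at runs of N and return [(contig_name, contig_seq), ...].

    min_gap: minimum number of consecutive N bases treated as a scaffold join
             (assumption: joins are encoded as runs of N).
    min_len: only contigs longer than this many bases are kept.
    Contigs are numbered by their order in the scaffold before filtering, so
    a dropped short contig leaves a hole in the numbering.
    """
    pattern = r"[Nn]{%d,}" % min_gap
    contigs = []
    index = 0
    for piece in re.split(pattern, seq):
        if not piece:  # empty string from a leading or trailing N run
            continue
        index += 1
        if len(piece) > min_len:
            contigs.append((f"{name}_ctg{index}", piece))
    return contigs
```

Numbering the pieces before filtering keeps each contig name tied to its position in the original scaffold, which makes it easier to trace a contig back after later scaffolding steps.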
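When I find time, I will work through the method description and the code using the Arabidopsis genome.
The in-house script mentioned in the methods is written in Perl, which I cannot read yet. It looks like I need to spend some time on basic Perl syntax.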


Author column

小明的数据分析笔记本