跟着NAR学数据分析:群体重测序SNP数据划分不同单倍型

https://academic.oup.com/nar/article/54/5/gkag134/8500555Glycine max Jumonji C domain-containing proteins JMJ19/20 exhibit endopeptidase activity and interact with LUXs to mediate flowering time 论文里Figure7部分的分析用到的数据是之前cell发表的大豆的泛基因论文里的数据，这个数据可以下载，可以用这个数据复现一下论文中Figue7的内容，今天的推文复现一下论文中的Figure7D 和Figure7E首先是对原始snp数据的过滤After filtering out SNPs with missing rates exceeding 10% and minor allele frequency below 5% in VCF format file, the remaining Single Nucleotide Polymorphisms (SNPs) were utilized for genetic diversity analysis.这里用bcftools进行过滤，这里加一个过滤条件，只保留二等位的变异位点bcftools view input.vcf -m2 -M2 -i ‘F_MISSING < 0.1 && MAF > 0.05’ -O v -o input.filter.vcf
提取基因区间的snp位点bcftools view -r Chr12:4046347-4050864 01.filter.snp.vcf/Chr12.snp.filter.vcf.gz -O v -o Chr12_4046347-4050864.vcf
用snpEff 注释一下变异位点相对于基因的位置java -Xmx8G -jar ~/biotools/snpEff/snpEff.jar ZH13V2 Chr12_4046347-4050864.vcf > Chr12_4046347-4050864.snpEff.vcf
与论文中说的两个变异能够对应上用genehapR做单倍型分析library(geneHapR)
input_vcf<-“C:/Users/lenovo/Desktop/Chr12_4046347-4050864.snpEff.vcf”

vcf.dat<-import_vcf(input_vcf)

hapResult<-vcf2hap(vcf.dat)
hapSummary<-hap_summary(hapResult)

plotHapTable(hapSummary) -> p1

p1
不同单倍型在野生农家品种和栽培品种中的分布data.frame(hap=hapResult$Hap,
acc=hapResult$Accession) %>%
inner_join(read_excel(“metainfo.xlsx”,skip = 1),
by=c(“acc”=”Materials”)) %>%
dplyr::select(1,2,4) -> dat

table(dat$hap,dat$Type)
画一个饼图展示单倍型的分布library(scatterpie)

table(dat$hap,dat$Type) %>%
as.data.frame() %>%
pivot_wider(names_from = “Var1”,values_from = “Freq”) %>%
mutate(x=c(1,3,2),y=1,region=1:3) %>%
ggplot()+
geom_scatterpie(aes(x,y,group=Var2,r=0.4),
cols=paste0(“H00”,1:4))+
theme_void()+
coord_equal()+
theme(legend.position = “top”,
legend.title = element_blank(),
axis.text.x = element_text())+
scale_x_continuous(breaks = 1:3,
label=c(“Wild”,”Landrace”,”Cultivar”))

阅读原文 >>

声明：来自小明的数据分析笔记本，仅代表创作者观点。链接：https://eyangzhen.com/6556.html

跟着NAR学数据分析:群体重测序SNP数据划分不同单倍型

相关推荐