全基因组关联分析(GWAS)后使用机器学习算法+显著位点+表型做基因组预测

论文

Integrative genomics reveals the polygenic basis of seedlessness in grapevine
https://doi.org/10.1016/j.cub.2024.07.022
论文中提供的部分代码
https://github.com/zhouyflab/Polygenic_Basis_Seedless_Grapes
论文中做完GWAS后接着用显著位点加表型用机器学习算法做基因组预测，使用了很多模型。显著位点用连锁不平衡值做了过滤。今天的推文按照论文中的思路做一下。
论文中首先使用beagle软件对vcf文件进行基因型填充，使用bcftools view命令提取需要用到的变异位点，然后提供了一个perl脚本把vcf文件中的基因型修改成了 0 1 2 的格式
运行论文中提供的perl脚本，我用拟南芥的数据做测试
perl change_vcf_format_impute.pl practice.impute.vcf | cut -f 3,10- | grep -v “^#”

image.png
每行是一个位点，每列是一个样本
没有找到论文中做机器学习的代码，使用sklearn，先试着用简单的线性模型跑一下代码（其他模型的代码都类似）
我这里就随便选择了拟南芥vcf文件的前100行位点，没有选择显著位点，这个预测出来的准确性肯定是没有保证的
sklearn的代码参考 https://www.geeksforgeeks.org/multiple-linear-regression-with-scikit-learn/#
表型数据是每行一个样本 Y
基因型数据 X 每行是样本列是位点，上面获取的基因型数据还需要做个转置
perl change_vcf_format_impute.pl practice.impute.vcf | cut -f 3,10- | grep -v “^##” > genotype.txt
读取基因型数据

import pandas as pd
genotype = pd.read_table(“GWAS/ATgwas/genotype.txt”).set_index(“ID”) ## 把第一列样本ID设置为行名
genotype_T = genotype.transpose() ## 转置

读取表型数据

pheno = pd.read_table(“GWAS/ATgwas/ATphenoNewPhytologist.txt”,dtype={“sampleID”:str})

因为样本名是数字编号，将其指定为字符类型

合并数据

dat = pd.merge(pheno,genotype_T.reset_index(),left_on=”sampleID”,right_on=”index”)
new_dat = dat.drop(columns=[“sampleID”,”index”]) ## 去掉多余的两列数据
线性模型需要用到的模块

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_socre
拆分测试集和训练集

X = new_dat.drop(columns=[‘seedSize’])
Y = new_dat[‘seedSize’]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=101)
运行模型
from sklearn import preprocessing
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(‘mean_squared_error : ‘, mean_squared_error(y_test, predictions))
print(‘mean_absolute_error : ‘, mean_absolute_error(y_test, predictions))
r2_score(y_test,predictions)

image.png
这里的R2 非常非常小，因为位点都是随机选择的，有时间把前面的gwas分析也做一下，用显著位点来试试，也试试其他模型

声明：文中观点不代表本站立场。本文传送门：https://eyangzhen.com/419616.html

全基因组关联分析(GWAS)后使用机器学习算法+显著位点+表型做基因组预测

因为样本名是数字编号，将其指定为字符类型

作者专栏

小明的数据分析笔记本