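I have recently been reading about SHAP analysis in machine learning. In the paper
https://www.sciencedirect.com/science/article/pii/S2590346225003888
Integrating pan-genome analysis, GWAS, and interpretable machine learning to prioritize trait-associated structural variations in Setaria italica
I came across this kind of analysis: a machine learning model can combine genotype and phenotype data for genomic prediction, and SHAP analysis can rank the variables in the model and quantify each variant's contribution to the model's predictions, which makes it possible to pick out the more important variants. I still do not understand the plots that SHAP analysis produces, though, so the plan is to get the analysis code working first and then study what each plot means.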
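To make "contribution" a bit more concrete, here is a minimal sketch of how exact Shapley values can be computed by enumerating feature coalitions. Everything in it is an assumption for illustration: the toy `model` function, the input `x`, and the single all-zero `background` point are made up and have nothing to do with the paper's data. Real SHAP implementations average over a background dataset and use much faster algorithms, so this is only meant to show the idea.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy "model" (not from either paper), chosen so the result is easy to check by hand.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]

def coalition_value(x, background, subset):
    """Model output when features in `subset` come from x and all others from the background."""
    mixed = [x[i] if i in subset else background[i] for i in range(len(x))]
    return model(mixed)

def shapley_values(x, background):
    """Exact Shapley values by enumerating every coalition (only feasible for a few features)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = coalition_value(x, background, s | {i}) - coalition_value(x, background, s)
                phi[i] += weight * gain
    return phi

x = [1.0, 2.0, 1.0]           # the instance to explain (made up)
background = [0.0, 0.0, 0.0]  # a single reference point, a simplification of a background dataset

phi = shapley_values(x, background)
base_value = model(background)
prediction = model(x)

# Rank features by the size of their contribution, largest first
ranking = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)

print("Shapley values:", phi)
print("base value + sum of contributions:", base_value + sum(phi), "prediction:", prediction)
print("feature ranking:", ranking)
```

The contributions add up to the difference between the prediction and the base value, and the interaction term between features 0 and 2 is split evenly between them. That additivity is the property the plots later in this post rely on.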
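I found a paper that explains the plots produced by SHAP analysis:
https://ascpt.onlinelibrary.wiley.com/doi/10.1111/cts.70056
Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development
The data and code are both available through the paper. It contains several examples; today I reproduce just one of them:
an xgboost model that predicts blood pressure (the BPSysAve column) from a set of indicators.
Code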
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import xgboost
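import sklearn
import sklearn.model_selection  # imported explicitly so sklearn.model_selection.train_test_split resolves
import sklearn.metrics  # imported explicitly so sklearn.metrics.* resolves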
from sklearn.neural_network import MLPRegressor
import shap
df = pd.read_csv('D:/rstudioProject/env001/hanes.csv')
X = df.drop(["ID", "BPSysAve"], axis=1)  # not considering the ID nor the target column
y = df['BPSysAve']
# Standard train-test split
X_train, X_test, y_train, y_test= sklearn.model_selection.train_test_split(X, y, test_size=0.3,random_state=6)
Xbackground = shap.utils.sample(X_train, nsamples=1000, random_state=3454) # 1000 instances for use as the background distribution, in case we calculate SHAP using a kernel method
xgb_mod = xgboost.XGBRegressor(random_state = 342).fit(X_train, y_train)
y_pred=xgb_mod.predict(X_test)
mse = sklearn.metrics.mean_squared_error(y_test, y_pred)  # argument order is (y_true, y_pred)
mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)
print("Mean Squared Error:", mse, " Mean absolute error: ", mae)
SHAP analysis
explainer = shap.TreeExplainer(xgb_mod)
xgb_shap_vals = explainer(X_test)
shap.plots.bar(xgb_shap_vals)
shap.plots.beeswarm(xgb_shap_vals)
shap.plots.scatter(xgb_shap_vals[:, "Age"], color=xgb_shap_vals[:, "BMI"])
sample_id=21
shap.plots.waterfall(xgb_shap_vals[sample_id])
shap.plots.heatmap(xgb_shap_vals)
The resulting plots
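As a rough aid for reading the bar and waterfall plots, the sketch below redoes their arithmetic by hand on a small made-up matrix of SHAP values. The numbers, the `base_value`, and the third feature name `feature_c` are all hypothetical and are not taken from the hanes data. With the code above, the corresponding arrays would be `xgb_shap_vals.values` and `xgb_shap_vals.base_values`.

```python
# Hypothetical SHAP values for 4 samples x 3 features (made-up numbers for illustration only).
feature_names = ["Age", "BMI", "feature_c"]
shap_matrix = [
    [ 4.0, -1.0,  0.5],
    [-6.0,  2.0, -0.5],
    [ 2.0, -3.0,  0.0],
    [-4.0,  2.0,  1.0],
]
base_value = 120.0  # hypothetical average model output over the background data

# Bar plot: by default each bar is the mean absolute SHAP value of one feature across all samples.
n_samples = len(shap_matrix)
mean_abs = [
    sum(abs(row[j]) for row in shap_matrix) / n_samples
    for j in range(len(feature_names))
]
bar_order = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)

# Waterfall plot: for a single sample, the base value plus that sample's SHAP values
# gives the model's prediction for that sample.
sample_id = 1
prediction = base_value + sum(shap_matrix[sample_id])

print("global importance (bar plot order):", bar_order)
print("prediction for sample", sample_id, "=", prediction)
```

So the bar plot answers "which features matter most on average", while the waterfall plot shows how one sample's prediction is built up, feature by feature, from the base value.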
Disclaimer: from 小明的数据分析笔记本; the content represents the author's views only. Link: https://eyangzhen.com/7718.html