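Introduction
Single-cell RNA sequencing (scRNA-seq) data analysis typically involves a complex, iterative workflow that demands considerable expertise and time. To address this complexity, researchers developed SCassist, an R package that uses large language models (LLMs) to guide and enhance scRNA-seq analysis. SCassist integrates LLMs into key workflow steps: it analyzes the user's data and recommends parameters for filtering, normalization, and clustering. It also provides LLM-guided interpretation of variable features and principal components, along with cell type annotation and enrichment analysis. SCassist works with popular LLMs such as Google's Gemini, OpenAI's GPT, and Meta's Llama3, with the aim of making scRNA-seq analysis accessible to researchers at all levels.
GitHub: https://github.com/NIH-NEI/SCassist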
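Figure 1. Overall architecture of SCassist. SCassist, an LLM-powered assistant, simplifies single-cell analysis within the standard Seurat workflow. The upper part of the figure shows the typical Seurat steps (quality control, normalization, dimensionality reduction, clustering, and annotation), while the interconnected pink boxes represent the SCassist components that provide data-driven insights and parameter recommendations for each step. SCassist can be used at any stage of a standard single-cell workflow. Starting from the quality control stage, the user only needs to supply a Seurat object containing the raw count matrix as input. For a given Seurat object, SCassist generates metrics such as summary statistics, quantile data, and explained variance. These metrics are then used to build augmented prompts that recommend settings for filtering, normalization, and dimensionality reduction, identify significant features, provide insights (from differentially expressed genes and principal components), and annotate clusters, each with detailed reasoning.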
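To make the "metrics into prompt" idea in the caption more concrete, here is a minimal base R sketch. It is not SCassist's internal code: the helper name `build_qc_prompt`, the prompt wording, and the toy data are all assumptions made for illustration.

```r
# Hypothetical helper (not part of SCassist): summarize per-cell QC metrics
# as quantiles and embed them in a plain-text prompt for an LLM.
build_qc_prompt <- function(metrics) {
  probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
  lines <- vapply(names(metrics), function(name) {
    q <- quantile(metrics[[name]], probs = probs, names = FALSE)
    labels <- sprintf("p%02d=%.1f", as.integer(round(probs * 100)), q)
    sprintf("%s: %s", name, paste(labels, collapse = ", "))
  }, character(1))
  paste(c("Per-cell QC metric quantiles:",
          lines,
          "Recommend lower and upper filtering cutoffs for each metric and explain the reasoning."),
        collapse = "\n")
}

# Toy metadata (made-up values standing in for a Seurat object's metadata)
set.seed(1)
toy_metrics <- data.frame(
  nCount_RNA = rpois(200, lambda = 8000),
  nFeature_RNA = rpois(200, lambda = 2500)
)

prompt <- build_qc_prompt(toy_metrics)
cat(prompt, "\n")
```

The point of the sketch is only that the LLM sees compact summary statistics rather than the raw count matrix.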
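Installation
# Install the required packages
install.packages("visNetwork")
install.packages("httr")
# clusterProfiler is distributed through Bioconductor, so BiocManager is needed first
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("clusterProfiler")
# Install the devtools package if you don't have it
install.packages("devtools")
# Install SCassist from GitHub
devtools::install_github("NIH-NEI/SCassist")
# Install the rollama package to use a local Ollama LLM server
install.packages("rollama")
# Download the model (in R); this requires a running Ollama server
rollama::pull_model("llama3")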
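After installation, it can help to confirm that everything is actually available before starting an analysis. The following is a small sketch using base R's `requireNamespace`. The helper name `check_installed` is hypothetical, and Seurat is included in the list because the example in this post loads it, even though the installation commands do not install it.

```r
# Packages used in this post (Seurat is assumed to be installed separately)
required <- c("visNetwork", "httr", "clusterProfiler", "devtools",
              "SCassist", "rollama", "Seurat")

# Hypothetical helper: TRUE for each package whose namespace can be loaded
check_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

status <- check_installed(required)
print(status)

missing_pkgs <- names(status)[!status]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```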
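Example
All you need is a Seurat object. At each step SCassist recommends key parameters for the relevant functions, and you can then run the standard Seurat workflow with those recommended parameters: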
# Load the SCassist and Seurat packages
library(SCassist)
library(Seurat)
# Load the downloaded example file
KO <- Read10X_h5("GSM6625298_scRNA_LCMV_Day4_CD4_CD8_NK_WT_filtered_feature_bc_matrix.h5", use.names = TRUE)
# Create the Seurat object
KO <- CreateSeuratObject(counts = KO[["Gene Expression"]], names.field = 2, names.delim = "\\-")
# Set the api_key_file variable (a text file containing the API key)
api_key_file <- "api_key_from_google.txt"
# Recommend quality control filters using Gemini (online)
qc_recommendations <- SCassist_analyze_quality("KO", llm_server = "google", api_key_file = api_key_file)
# Recommend quality control filters using an OpenAI GPT model (online);
# api_key_file must then point to a file containing an OpenAI key
qc_recommendations <- SCassist_analyze_quality("KO", llm_server = "openai", api_key_file = api_key_file)
# Recommend quality control filters using Llama3 (local)
qc_recommendations <- SCassist_analyze_quality("KO", llm_server = "ollama")
# ...and many more functions!
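The quality-control recommendations shown later in this post refer to `percent.mito`, `percent.ribo`, and `percent.hb`, but the post does not show how those columns are produced. In Seurat they are usually added with `PercentageFeatureSet()` and a gene-name pattern. The sketch below reproduces the same arithmetic in base R on a made-up count matrix. The gene names and the patterns (`^mt-`, `^Rp[sl]`, `^Hb[ab]-`, which follow mouse naming) are assumptions, not taken from the original.

```r
# Toy count matrix: genes in rows, cells in columns (values are made up)
counts <- matrix(
  c(10,  0,  5,
     2,  1,  0,
     8,  9,  5,
    20, 10, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("mt-Nd1", "Hbb-bs", "Rpl3", "Cd4"),
                  c("cell1", "cell2", "cell3"))
)

# Hypothetical helper: percentage of each cell's counts that come from
# genes whose names match `pattern`
percent_matching <- function(counts, pattern) {
  hits <- grepl(pattern, rownames(counts))
  100 * colSums(counts[hits, , drop = FALSE]) / colSums(counts)
}

percent.mito <- percent_matching(counts, "^mt-")     # assumed mitochondrial prefix
percent.ribo <- percent_matching(counts, "^Rp[sl]")  # assumed ribosomal prefix
percent.hb   <- percent_matching(counts, "^Hb[ab]-") # assumed hemoglobin prefix

print(rbind(percent.mito, percent.ribo, percent.hb))
```

The right patterns depend on the species and annotation, so they should be checked against the gene names in the actual dataset.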
Detailed example:
For instance, SCassist_analyze_quality returns recommendations like the following:
## Based on the data summary, below are my recommendations for the quality filtering of the data:
##
## **nCount_RNA:**
##
## * **Lower Cutoff:** 1500. This value is chosen to be slightly above the 5th percentile (947) to remove cells with very low counts, but still capture a significant portion of the data.
## * **Upper Cutoff:** 35000. This value is chosen to be slightly below the 95th percentile (26528) to remove cells with extremely high counts, which could indicate potential doublets or other artifacts.
##
## **nFeature_RNA:**
##
## * **Lower Cutoff:** 700. This value is chosen to be slightly above the 5th percentile (528) to remove cells with very few detected genes, but still capture a significant portion of the data.
## * **Upper Cutoff:** 5500. This value is chosen to be slightly below the 95th percentile (4793) to remove cells with an unusually high number of detected genes, which could indicate potential doublets or other artifacts.
##
## **percent.mito:**
##
## * **Upper Cutoff:** 15. This value is chosen to be significantly lower than the 95th percentile (23.01) to remove cells with a high percentage of mitochondrial reads, which could indicate cell stress or damage.
##
## **percent.ribo:**
##
## * **Upper Cutoff:** 45. This value is chosen to be slightly lower than the 95th percentile (41.97) to remove cells with a high percentage of ribosomal reads, which could indicate cell stress or damage.
##
## **percent.hb:**
##
## * **Upper Cutoff:** 0.025. This value is chosen to be slightly higher than the 95th percentile (0.019) to remove cells with a high percentage of hemoglobin reads, which could indicate contamination from blood cells.
##
## **Important Note:** These are just recommendations, and the researcher should test a range of values around these recommendations to determine the optimal cutoffs for their specific dataset. The optimal cutoffs will depend on the specific characteristics of the data and the research question being addressed.
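Because these cutoffs come from an LLM, it is worth checking them against the data before applying them. In the output above, for example, the upper nCount_RNA cutoff of 35000 is described as slightly below the 95th percentile (26528) although it is above it, and the percent.ribo cutoff of 45 is described as lower than 41.97. A minimal base R sketch of such a check follows. The helper name `cutoff_report` and the simulated values are assumptions for illustration.

```r
# Hypothetical helper: report the 5th/95th percentiles of a metric next to
# the proposed cutoffs, plus the fraction of cells the cutoffs would keep
cutoff_report <- function(x, lower = -Inf, upper = Inf) {
  q <- quantile(x, probs = c(0.05, 0.95), names = FALSE)
  c(p05 = q[1], p95 = q[2], lower = lower, upper = upper,
    kept = mean(x > lower & x < upper))
}

# Simulated per-cell UMI counts (made-up data, not the dataset in this post)
set.seed(42)
toy_counts <- round(rlnorm(1000, meanlog = 9, sdlog = 0.8))

report <- cutoff_report(toy_counts, lower = 1500, upper = 35000)
print(round(report, 3))
```

On a real object, the same function could be applied to a metadata column such as `KO$nCount_RNA`.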
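The recommended parameters can then be used to filter cells:
# Filter cells using the recommended cutoffs.
# `allsamples` is the Seurat object that was analyzed (`KO` in the example above);
# the percent.* columns must already be present in its metadata.
allsamplesgood <- subset(allsamples,
  subset = nCount_RNA > 1500 &
    nCount_RNA < 35000 &
    nFeature_RNA > 700 &
    nFeature_RNA < 5500 &
    percent.mito < 15 &
    percent.ribo < 45 &
    percent.hb < 0.025
)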
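When several cutoffs are combined like this, it is useful to know which criterion removes the most cells. The sketch below does this in base R on a small made-up metadata table with only three of the metrics. On a real Seurat object the same logic could be applied to its metadata data frame (for example `allsamples@meta.data`).

```r
# Made-up per-cell metadata for four cells
meta <- data.frame(
  nCount_RNA   = c(1000, 5000, 40000, 8000),
  nFeature_RNA = c(500, 2000, 6000, 3000),
  percent.mito = c(5, 20, 3, 4)
)

# One logical column per criterion: TRUE when the cell passes it
pass <- with(meta, cbind(
  count   = nCount_RNA > 1500 & nCount_RNA < 35000,
  feature = nFeature_RNA > 700 & nFeature_RNA < 5500,
  mito    = percent.mito < 15
))

# Number of cells failing each criterion, and cells passing all of them
failed_per_criterion <- colSums(!pass)
kept <- rowSums(pass) == ncol(pass)

print(failed_per_criterion)
cat("Cells kept:", sum(kept), "of", nrow(meta), "\n")
```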
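Closing
"The road ahead is long and far; I shall search high and low." (Qu Yuan, Li Sao)
Disclaimer: This post comes from 老俊俊的生信笔记 and represents only the author's views. Link: https://eyangzhen.com/2251.html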