pySCENIC 转录因子调控网络分析教程

引言


SCENIC (Single-Cell Regulatory Network Inference and Clustering) 是一种专门用于单细胞 RNA 测序数据的计算工具。它结合了基因调控网络的推断与计算分析,能够从单细胞转录组数据中推断转录因子活性并识别细胞状态和种类。SCENIC 的核心目标是解析单细胞数据中转录因子与其下游靶基因的调控关系,从而揭示每个细胞的调控状态及细胞群的生物学特性。
核心分析步骤:
1.基因共表达网络构建:利用工具如 GENIE3 或 GRNBoost2,从 RNA-seq 数据中推断基因之间的相互作用,并识别潜在转录因子和其靶基因间的关系。
2.调控模块生成:将转录因子及其推断的靶基因集合定义为「调控模块」,此步骤通过结合数据库中已知的 转录因子结合基序(Motif) 进一步验证,筛选出更可信的调控网络。
3.调控模块活性评分:使用 AUCell 算法计算每个调控模块在每个单细胞中的活性。AUCell(Area Under the Curve for geneset enrichment) 是用来衡量特定基因模块在单细胞中的表达水平的方法。
原理可以看看这个:SCENIC:单细胞转录因子调控模式(一)

官网:https://scenic.aertslab.org/

R 语言版 scenic 已经不再维护了,此外 R 语言的运行速度会慢很多,因此团队转向了 python 版的 pyscenic:

pySCENIC:https://github.com/aertslab/pySCENIC

使用文档:https://pyscenic.readthedocs.io/en/latest/index.html

安装

即使你安装官网的安装也会在分析过程中保很多错误,这里分享我解决的方案:
conda create -y -n pyscenic python=3.10
conda activate pyscenic

pip install pyscenic
pip install dask-expr==0.5.3 distributed==2024.2.1
pip install pandas==1.5.3
准备文件

地址:https://resources.aertslab.org/cistarget/


主要需要三个文件,motif 的排序文件,motif 注释文件和转录因子文件:

wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather

wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl

这里我下的是,当然你下载上面的也没问题的,这里我用 wget 下载的是乱码,于是自己复制了一下内容到 hgnc_tfs.txt 里:

wget https://github.com/aertslab/pySCENIC/blob/master/resources/hs_hgnc_tfs.txt
运行 pySCENIC

这部分参考了 生信随笔 和 生信技能树 公众号的 Python 版 SCENIC 转录因子分析(四)一文就够了 推文。

pySCENIC 的输入文件是 counts 矩阵,csv 格式或者 loom 格式(The name of the file that contains the expression matrix for the single cell experiment. Two file formats are supported: csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells) ):
csv 格式的需要转置一下,用 csv 格式分析完后会直接输出转录因子的活性矩阵:

这里我们使用 loom 文件来跑:
输出转置矩阵:
library(Seurat)

remotes::install_github(“satijalab/seurat-data”)

library(SeuratData)
AvailableData()

InstallData(“pbmc3k”)

data(“pbmc3k”)
pbmc3k

head(t(as.matrix(pbmc3k@assays$RNA@counts))[1:3,1:3])
colnames(t(as.matrix(pbmc3k@assays$RNA@counts)))

output matrix

write.csv(t(as.matrix(pbmc3k@assays$RNA@counts)),file = “scenic.data.csv”)
转为 loom 文件格式:

in jupyter notebook

counts to loom

import os,sys
os.getcwd()
os.listdir(os.getcwd())

import loompy as lp;
import numpy as np;
import scanpy as sc;
x=sc.read_csv(“scenic.data.csv”);
row_attrs = {“Gene”: np.array(x.var_names),};
col_attrs = {“CellID”: np.array(x.obs_names)};
lp.create(“scenic.loom”,x.X.transpose(),row_attrs,col_attrs);
三步走(grn,ctx,AUCell),最终输出的 aucell.loom 是我们需要的:

in terminal

step1 grn

pyscenic grn \
–num_workers 15 \
–output grn.tsv \
–method grnboost2 \
scenic.loom \
hgnc_tfs.txt

step2 ctx

pyscenic ctx \
grn.tsv hg19-tss-centered-5kb-7species.mc9nr.genes_vs_motifs.rankings.feather \
–annotations_fname motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl \
–expression_mtx_fname scenic.loom \
–mode “dask_multiprocessing” \
–output ctx.csv \
–num_workers 20 \
–mask_dropouts

step3 AUCell

pyscenic aucell \
scenic.loom \
ctx.csv \
–output aucell.loom \
–num_workers 15
结果可视化

get_regulons 函数返回行为转录因子,列为基因的矩阵,值代表该基因是否属于转录因子的靶基因,是的话值为 1,不是则为 0;

==============================================================================

plot in R

==============================================================================

library(Seurat)
library(SCopeLoomR)
library(AUCell)
library(SCENIC)
library(dplyr)
library(KernSmooth)
library(RColorBrewer)
library(plotly)
library(BiocParallel)
library(grid)
library(ComplexHeatmap)
library(data.table)
library(scRNAseq)
library(patchwork)
library(ggplot2)
library(stringr)
library(circlize)

setwd(“pyscenic_test/”)

load data

loom <- open_loom(‘aucell.loom’)

regulons_incidMat <- SCopeLoomR::get_regulons(loom, column.attr.name=”Regulons”)
regulons_incidMat[1:4,1:4]

AL627309.1 AP006222.2 RP11-206L10.2 RP11-206L10.9

ACAA1(+) 0 0 0 0

ATF1(+) 0 0 0 0

ATF3(+) 0 0 0 0

ATF4(+) 0 0 0 0

regulonsToGeneLists 函数返回每个转录因子调控的靶基因列表:
regulons <- SCENIC::regulonsToGeneLists(regulons_incidMat)
head(regulons,3)

$ACAA1(+)

[1] “SEMA4A” “ZEB2” “PCYT1A” “TRIM38” “HLA-DRA” “TSPAN33” “CNOT4”

[8] “OSTF1” “SH2B3” “ERCC1”

#

$ATF1(+)

[1] “UBE4B” “ZCCHC11” “SFT2D2” “DDX59” “LBH” “FOXP1” “UGDH”

[8] “HOMER1” “NDFIP1” “LY86” “NUB1” “SASH3” “NFX1” “ATF1”

[15] “DICER1” “ATP10A” “ALDOA” “CCL5” “TBX21”

#

$ATF3(+)

[1] “ERI3” “CPT2” “CTSS” “PEA15” “USF1” “ATF3”

[7] “INSIG2” “PID1” “ITPR1” “BHLHE40” “NR2C2” “PRKAR2A”

[13] “PLSCR1” “SH3BP2” “NAAA” “CD14” “HRH2” “HLA-DQA2”

[19] “HLA-DMA” “CDKN1A” “CNPY3” “CYBB” “VSIG4” “ARMCX3”

[25] “SSR4” “HCFC1” “IRAK1” “PDLIM2” “BNIP3L” “AKNA”

[31] “RHOG” “AMICA1” “CBL” “ZNF384” “MCRS1” “TMBIM6”

[37] “NR4A1” “ZNF385A” “SPPL3” “SFSWAP” “SLC7A7” “MAX”

[43] “C15orf48” “IQGAP1” “KLHL36” “CBFA2T3” “GABARAP” “TGIF1”

[49] “CHMP1B” “DNTTIP1” “MCOLN1” “CD97” “PKN1” “TTLL1”

get_regulons_AUC 返回计算好的转录因子的活性分数矩阵,行为转录因子,列为细胞
regulonAUC <- SCopeLoomR::get_regulons_AUC(loom,column.attr.name=’RegulonsAUC’)
regulonAUC

AUC for 114 regulons (rows) and 2700 cells (columns).

#

Top-left corner of the AUC matrix:

cells

regulons AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG

ACAA1(+) 0.01370262 0.09650146 0.00000000 0.11807580 0.04227405

ATF1(+) 0.05048335 0.07250269 0.03521559 0.00000000 0.08424122

ATF3(+) 0.03876471 0.02731886 0.02380952 0.06187237 0.02094806

ATF4(+) 0.07221350 0.08779996 0.00000000 0.00000000 0.00000000

ATF5(+) 0.04269430 0.06856565 0.05392273 0.06860505 0.07770599

BACH1(+) 0.09394234 0.14415290 0.18934240 0.07912213 0.11726595

get_regulon_thresholds 返回每个转录因子的 AUC 阈值:
regulonAucThresholds <- SCopeLoomR::get_regulon_thresholds(loom)
head(regulonAucThresholds,3)

0.0420843310110376 0.0788214508694053 0.102648709410756

“ACAA1(+)” “ATF1(+)” “ATF3(+)”

embeddings <- SCopeLoomR::get_embeddings(loom)
embeddings

list()

close_loom(loom)
处理一下 pbmc3k:

plot

pbmc3k <- UpdateSeuratObject(pbmc3k)

seurat.data <- pbmc3k %>%
NormalizeData(verbose = F) %>%
FindVariableFeatures(selection.method = “vst”, nfeatures = 2000, verbose = F) %>%
ScaleData(verbose = F) %>%
RunPCA(npcs = 30, verbose = F) %>%
RunUMAP(reduction = “pca”, dims = 1:30, verbose = F) %>%
FindNeighbors(reduction = “pca”, k.param = 10, dims = 1:30)

Idents(seurat.data) <- “seurat_annotations”
把每个转录因子的活性合并到 metadata 里面:

precoess

sub_regulonAUC <- regulonAUC[,match(colnames(seurat.data),colnames(regulonAUC))]

check

identical(colnames(sub_regulonAUC), colnames(seurat.data))

combine

seurat.data@meta.data = cbind(seurat.data@meta.data ,t(assay(sub_regulonAUC)))

VlnPlot(seurat.data, features = c(“TCF4(+)”,”GATA3(+)”),pt.size = 0)

FeaturePlot(object = seurat.data,features = c(“TCF4(+)”,”GATA3(+)”))

RidgePlot(seurat.data, features = c(“TCF4(+)”,”GATA3(+)”) , ncol = 2)

热图可视化:

==============================================================================

AVG expression

==============================================================================

cellClusters <- data.frame(row.names = colnames(seurat.data), seurat_clusters = as.character(seurat.data$seurat_annotations)) |>
dplyr::mutate(seurat_clusters = ifelse(is.na(seurat_clusters),”unkown”,seurat_clusters))

cellsPerGroup <- split(rownames(cellClusters),cellClusters$seurat_clusters)

remove extened regulons

sub_regulonAUC <- sub_regulonAUC[onlyNonDuplicatedExtended(rownames(sub_regulonAUC)),]

Calculate average expression:

regulonActivity_byGroup <- sapply(cellsPerGroup,
function(cells)
rowMeans(getAUC(sub_regulonAUC)[,cells]))

scale

regulonActivity_byGroup_Scaled <- t(scale(t(regulonActivity_byGroup),center = T, scale=T))

heatmap

Heatmap(matrix = t(regulonActivity_byGroup_Scaled[1:20,]))


绘制所有细胞的转录因子活性热图:
ht_auc <- assay(sub_regulonAUC)
ht_auc <- ht_auc[,as.character(unlist(cellsPerGroup))]
identical(colnames(ht_auc),as.character(unlist(cellsPerGroup)))

[1] TRUE

ht_auc_scale <- t(scale(t(ht_auc),center = T,scale = T))

x = 1

lapply(seq_along(names(cellsPerGroup)),function(x){
rep(names(cellsPerGroup)[x],length(cellsPerGroup[[x]]))
}) |> unlist() -> celltypes

col_anno <- columnAnnotation(celltype = celltypes)

plot

Heatmap(matrix = ht_auc_scale,
top_annotation = col_anno,
col = colorRamp2(c(-2, 0, 2), c(“#003399”, “white”, “#990066”)),
cluster_columns = F,
show_row_names = F,
show_column_names = F)


利用 AUC 计算的阈值,绘制二分类变量热图,也是文献经常见的:

==============================================================================

all cells heatmap

==============================================================================

ht_auc <- data.frame(ht_auc)

x = 1

purrr::map_df(seq_along(regulonAucThresholds),function(x){
tmp <- ht_auc[regulonAucThresholds[x],] val <- as.numeric(names(regulonAucThresholds[x])) tmp <- data.frame(apply(tmp, c(1,2), function(x) ifelse(x > val,1,0)))
return(tmp)
}) -> binary_tfs

binary_tfs <- binary_tfs[,as.character(unlist(cellsPerGroup))]
identical(colnames(binary_tfs),as.character(unlist(cellsPerGroup)))

[1] TRUE

mark <- sample(rownames(binary_tfs),size = 10,replace = F)
at <- match(mark,rownames(binary_tfs))

tfs_mark = rowAnnotation(foo = anno_mark(at = at,
labels =mark))

plot

Heatmap(matrix = binary_tfs,
name = “Binary activity of regulon”,
cluster_columns = F,
col = c(“white”, “black”),
top_annotation = col_anno,
right_annotation = tfs_mark,
# km = 5,
show_row_names = F,
show_column_names = F)

对细胞聚类:

plot

Heatmap(matrix = binary_tfs,
name = “Binary activity of regulon”,
col = c(“white”, “black”),
top_annotation = col_anno,
show_row_names = F,
show_column_names = F)

结尾


路漫漫其修远兮,吾将上下而求索。
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群 (微信交流群需收取 20 元入群费用,一旦交费,拒不退还!(防止骗子和便于管理)) 。

声明:文中观点不代表本站立场。本文传送门:https://eyangzhen.com/424325.html

联系我们
联系我们
分享本页
返回顶部