引言
❝
TCGA 数据库里的很多癌症对应的正常组织的样本数量相对于癌症样本非常少,这对于后续的下游分析比如差异分析可能会带来一定的偏差。GTEX 数据库则收集来自健康捐献者的多个正常组织,通过对成千上万名捐献者的 RNA 测序 (RNA-seq) 、基因型测序 (Genotyping) 和 表型数据 ,建立了一个跨多个组织的基因表达数据库。 我们可以去里面下载对应组织的数据来作为正常样本,解决 TCGA 癌症正常样本数量过少的问题。本教程参考了部分 被炸熟的虾推文:矫正批次效应。致以感谢。
https://gtexportal.org/home/
GTEX 已经更新到 V10 版本,引入了 small RNAseq 的数据:
如果我们只想分析某个癌症,我们可以根据具体的组织名称去下载数据,而不用下载所有的数据:
我们这里下载了肺的数据:https://storage.googleapis.com/adult-gtex/bulk-gex/v10/rna-seq/counts-by-tissue/gene_reads_v10_lung.gct.gz
我们然后去 ucscxena 数据库下载 TCGA 的 lung 的表达数据
❝
表达矩阵:https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.star_counts.tsv.gz
注释文件:https://gdc-hub.s3.us-east-1.amazonaws.com/download/gencode.v36.annotation.gtf.gene.probemap
phenotype:https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LUAD.clinical.tsv.gz
数据处理
❝
下载好相应的数据后,我们需要对数据进行预处理。把 GTEX 的矩阵和 TCGA 的进行合并。
BiocManager::install(“sva”)
BiocManager::install(“apeglm”)
library(tidyverse)
library(data.table)
library(edgeR)
library(sva)
library(“FactoMineR”)
library(“factoextra”)
==============================================================================
pca plot
PCA_plot = function(data,col){
df.pca <- PCA(t(data), graph = FALSE)
fviz_pca_ind(df.pca,
geom.ind = “point”,
col.ind = col ,
addEllipses = TRUE,
legend.title = “Groups”) +
ggplot2::theme_bw() +
ggplot2::theme(panel.grid = element_blank())
}
首先读入 TCGA 的表达数据和表型信息进行对应筛选,可以看到只有 76 的正常样本,我们使用 edgeR::filterByExpr 来过滤低表达基因:
==============================================================================
load raw counts
count <- fread(“TCGA-LUAD.star_counts.tsv.gz”) %>%
column_to_rownames(var = “Ensembl_ID”)
count <- 2^count – 1
count <- apply(count, c(1,2), as.integer)
filter low counts
count_ft <- count[edgeR::filterByExpr(count,min.count = 1),]
load phenotype
pheno <- fread(“TCGA-LUAD.clinical.tsv.gz”) %>%
filter(sample %in% colnames(count)) %>%
mutate(type = ifelse(endsWith(sample,”01A”),”tumor”,”normal”),
.after = sample) %>%
arrange(type)
table(pheno$type)
normal tumor
76 513
order count matrix
count_ft <- count_ft[,pheno$sample]
check
identical(colnames(count_ft),pheno$sample)
[1] TRUE
接着读取 GTEX 的矩阵,进行合并,使用 cpm 函数获取标准化的矩阵来绘制 PCA, PCA 图显示 GTEX 的样本和 TCGA 的正常样本具有明显的批次效应:
==============================================================================
load gtex data
==============================================================================
gtex <- data.table::fread(“gene_reads_v10_lung.gct.gz”,skip = 2,header = T) gtex_ct <- gtex[,-2] %>%
column_to_rownames(var = “Name”)
filter common genes
co_gene <- intersect(rownames(count_ft),rownames(gtex_ct))
count_ft <- count_ft[co_gene,]
gtex_ct <- gtex_ct[co_gene,]
check
identical(rownames(count_ft),rownames(gtex_ct))
[1] TRUE
combine matrix
cmb <- cbind(count_ft,gtex_ct)
pheno_new <- data.frame(sample = c(pheno$sample,colnames(gtex_ct)), group = c(pheno$type,rep(“normal”,ncol(gtex_ct))), batch = c(rep(“TCGA”,nrow(pheno)),rep(“GTEX”,ncol(gtex_ct)))) %>%
mutate(var = paste(batch,group,sep = “-“))
get cpm
cpm_mt <- cpm(y = cmb,log = T,normalized.lib.sizes = T)
plot pca
PCA_plot(data = cpm_mt,col = pheno_new$var)
❝
既然有批次效应,我们就不能简单的进行合并了,我们使用 sva::ComBat_seq 函数来去除批次效应,这个是专门争对 RNASEQ 数据的。可以看到,去除批次效应后,GTEX 的正常样本可以和 TCGA 的正常样本比较好的混合了:
==============================================================================
remove batch effect
==============================================================================
rmb_cmb <- sva::ComBat_seq(counts = as.matrix(cmb),
batch = pheno_new$batch,
group = pheno_new$group)
check
head(rmb_cmb[1:4,1:4])
TCGA-44-6777-11A TCGA-55-6982-11A TCGA-50-6595-11A TCGA-55-6968-11A
ENSG00000000003.15 495 719 1132 385
ENSG00000000457.14 171 399 811 309
ENSG00000000460.17 41 64 106 67
ENSG00000000938.13 4677 3568 6744 5836
get cpm
rmb_cpm_mt <- cpm(y = rmb_cmb,log = T,normalized.lib.sizes = T)
plot pca
PCA_plot(data = rmb_cpm_mt,col = pheno_new$var)
❝
sva::ComBat_seq 对于大点的数据,运行速度可能会比较慢,github 上面也有人提到这个问题。
❝
于是我找到了这个,去年发表的 python 版的:
github: https://github.com/epigenelabs/inmoose
说明文档:https://inmoose.readthedocs.io/en/latest/index.html感兴趣的可以去试试。
❝
此外,ComBat 还提出了一种新的方法 ComBat-ref, 采用负二项分布模型来调整计数数据,可以提高 RNA-seq 数据中差异表达分析的统计效力和可靠性。
github:https://github.com/xiaoyu12/Combat-ref/tree/main
差异分析
❝
处理好上面的数据,我们就可以进行简单的差异分析了:
==============================================================================
diff analysis
==============================================================================
library(rtracklayer)
library(DESeq2)
load gene annotation
g_anno <- fread(“gencode.v36.annotation.gtf.gene.probemap”) %>%
select(id,gene) %>%
rowwise() %>%
mutate(gene_id = sapply(strsplit(id,split = “\.”),”[“,1))
load gtf annotation
gtf <- import.gff(“Homo_sapiens.GRCh38.113.gtf.gz”,format = “gtf”) %>%
data.frame() %>%
select(gene_id,gene_name,gene_biotype) %>% unique()
merge annotation
g_anno <- g_anno %>%
left_join(y = gtf,by = “gene_id”)
diff analysis
coldata <- data.frame(row.names = colnames(rmb_cmb),
condition = pheno_new$group)
dds <- DESeqDataSetFromMatrix(countData = rmb_cmb,
colData = coldata, design = ~condition)
dds <- DESeq(dds)
resultsNames(dds)
[1] “Intercept” “condition_tumor_vs_normal”
res_lfc <- lfcShrink(dds = dds,coef = 2,type = “apeglm”)
#
plotMA(res_lfc)
plotMA(dds)
res <- results(dds,contrast = c(“condition”,”tumor”,”normal”)) %>%
data.frame() %>%
rownames_to_column(var = “gene_id”) %>%
filter(log2FoldChange != “NA”) %>%
mutate(type = case_when(log2FoldChange >= 1 & pvalue < 0.05 ~ “sigUp”,
log2FoldChange <= -1 & pvalue < 0.05 ~ “sigDown”,
.default = “nonSig”))
load gene anno
res_anno <- res %>% left_join(y = g_anno,by = c(“gene_id”=”id”))
output
write.csv(res_anno,file = “all_deseq2_diff.csv”,row.names = F)
❝
讨论:在进行不同数据集之前的合并时,是需要考虑批次效应的问题,防止分析产生错误的结论。
结尾
❝
路漫漫其修远兮,吾将上下而求索。
声明:文中观点不代表本站立场。本文传送门:https://eyangzhen.com/423619.html