eQTL分析中对转录组表达量的值进行分位数标准化和反正则转换

论文

Structural variants and short tandem repeats impact gene expression and splicing in bovine testis tissue

https://academic.oup.com/genetics/article/225/3/iyad161/7258327?login=false

代码

https://github.com/Meenu-Bhati/SV-STR/blob/main/RNA_quantification/TPM_normalization.r
https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py
https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py

为啥要做这个分位数标准化和反正则转换暂时不太理解,这个链接给出一些解释 https://bsunfullsite.github.io/post/quantile-nomrlization-and-inverse-normal-transform/quantile-normalization-and-inverse-normal-transform/

也没看太明白,基本上做eQTL分析的论文里方法部分都会写这个,比如开头提到的论文,方法不部分写到

Finally, TPM values were quantile normalized and inverse normal transformed across samples per gene using the R package RNOmni

水稻泛基因组的论文 (A super pan-genomic landscape of rice),做eQTL分析,方法部分写到

To obtain a normal distribution of expression values for each gene, FPKM values of each gene were further normalized using the quantile-quantile normalization (qqnorm) function in R (version 3.1.2).

玉米eQTL的论文 (A role for heritable transcriptomic variation in maize adaptation to temperate environments)方法部分写到

For each gene, expression values were transformed using the Box-Cox method [103] prior to mapping

西红柿泛基因组 eQTL分析部分方法

Genes with TPM values of >0.5 were defined as expressed. Only genes expressed in at least 100 accessions were retained for the downstream analyses. Raw expression levels were normalized with quantile–quantile normalization

这篇文章里提供的代码链接 https://github.com/YaoZhou89/TGG/blob/main/5.Genetic_analysis/scripts/prepare_gene_expression.R

这里标准化是自己自定义的函数

quantile_normalisation <- function(df){
  df_rank <- apply(df,2,rank,ties.method="min")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)
  
  index_to_mean <- function(my_index, my_mean){
    return(my_mean[my_index])
  }
  
  df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
  rownames(df_final) <- rownames(df)
  return(df_final)
}

我试了一下这个函数的输出和preprocessCore::normalize.quantiles 这个函数的输出是一致的

https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py 这个链接里提供了python做这个标准化的函数

def normalize_quantiles(df):
    """
    Quantile normalization to the average empirical distribution
    Note: replicates behavior of R function normalize.quantiles
          from library("preprocessCore")

    Reference:
     [1] Bolstad et al., Bioinformatics 19(2), pp. 185-193, 2003

    Adapted from https://github.com/andrewdyates/quantile_normalize
    """
    M = df.values.copy()

    Q = M.argsort(axis=0)
    m,n = M.shape

    # compute quantile vector
    quantiles = np.zeros(m)
    for i in range(n):
        quantiles += M[Q[:,i],i]
    quantiles = quantiles / n

    for i in range(n):
        # Get equivalence classes; unique values == 0
        dupes = np.zeros(m, dtype=np.int)
        for j in range(m-1):
            if M[Q[j,i],i] == M[Q[j+1,i],i]:
                dupes[j+1] = dupes[j]+1

        # Replace column with quantile ranks
        M[Q[:,i],i] = quantiles

        # Average together equivalence classes
        j = m-1
        while j >= 0:
            if dupes[j] == 0:
                j -= 1
            else:
                idxs = Q[j-dupes[j]:j+1,i]
                M[idxs,i] = np.median(M[idxs,i])
                j -= 1 + dupes[j]
        assert j == -1

    return pd.DataFrame(M, index=df.index, columns=df.columns)

开头提到的论文里除了分位数标准化还做了反正则转换,这个有现成的R包 RNOmni,代码expr.int = t(apply(file_filter_norm, 1, RankNorm ))

eQTL分析还有一步是用peer这个包计算混杂因素(To remove potential batch effects and cconfounding factors),之前有一个困惑是直接用TPM值去计算混杂因素还是用标准化后的表达数据去计算这个混杂因素

https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl

这个链接里有一些步骤,这里用的是标准化后的数据。计算多少个混杂因素这个链接里也给出了建议

The number of PEER factors was selected as function of sample size (N):

  • 15 factors for N < 150
  • 30 factors for 150 ≤ N < 250
  • 45 factors for 250 ≤ N < 350
  • 60 factors for N ≥ 350

声明:文中观点不代表本站立场。本文传送门:https://eyangzhen.com/388019.html

(0)
联系我们
联系我们
分享本页
返回顶部