翻译 - TCGAbiolinks 9 - Extension

TCGAbiolinks - 新功能

作者:Mohamed Mounir

发表时间:2019年3月20日

新功能:TCGAbiolinks版本

TCGAbiolinks目前还提供了查询和下载TARGET数据库的数据的函数,并获得用于下游分析的SummarizedExperiments对象。此外,TCGAanalyze_DEA准备进一步的的设计,以允许更多用户定制和支持limma分析流程。

1. TARGET数据:支持

TARGET(Therapeutically Applicable Research To Generate Effective Treatments)采用综合的基因组方法( comprehensive genomic approach)来确定驱动儿童癌症的分子变化。研究人员形成一个协作网络组成,以促进分子靶标的发现,并将这些发现转化为临床应用。TARGET目前由NCI’s Office of Cancer Genomics and Cancer Therapy Evaluation Program维护。

1.1. TARGET数据:查询,下载和准备

下面是通过tcgabiolinks包从TARGET数据库查询、下载和准备数据的示例

## 目前可用的TARGET项目:
projects <- getGDCprojects()$project_id
as.data.frame(grep("TARGET", projects,value = TRUE))
##   grep("TARGET", projects, value = TRUE)
## 1                             TARGET-AML
## 2                             TARGET-NBL
## 3                              TARGET-OS
## 4                              TARGET-RT
## 5                            TARGET-CCSK
## 6                              TARGET-WT
## 7                          TARGET-ALL-P3
#Downloading and prepare TARGET CASE
query_Target <- GDCquery(project = "TARGET-AML", 
                         data.category = "Transcriptome Profiling", 
                         data.type = "Gene Expression Quantification", 
                         workflow.type = "HTSeq - Counts")

samplesDown_Target <- getResults(query_Target, cols = c("cases"))

dataSmTB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
                                         typesample = "TB")

dataSmNB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
                                         typesample = "TBM")
dataSmTB_short_Target <- dataSmTB_Target[1:10]
dataSmNB_short_Target <- dataSmNB_Target[1:10]

queryDown_Target <- GDCquery(project = "TARGET-AML", 
                             data.category = "Transcriptome Profiling",
                             data.type = "Gene Expression Quantification", 
                             workflow.type = "HTSeq - Counts", 
                             barcode = c(dataSmTB_short_Target, dataSmNB_short_Target))

GDCdownload(queryDown_Target)

### SummarizedExperiment = TRUE
data <- GDCprepare(queryDown_Target)

dataPrep_Target <- TCGAanalyze_Preprocessing(object = data, 
                                             cor.cut = 0.6,
                                             datatype = "HTSeq - Counts")                      

dataNorm_Target <- TCGAanalyze_Normalization(tabDF = dataPrep_Target,
                                             geneInfo = geneInfoHT,
                                             method = "gcContent") 

boxplot(dataPrep_Target, outline = FALSE)

boxplot(dataNorm_Target, outline = FALSE)

dataFilt_Target <- TCGAanalyze_Filtering(tabDF = dataNorm_Target,
                                         method = "quantile", 
                                         qnt.cut =  0.25)   

2. BRCA数据:准备及下游分析(差异表达分析)

CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")

query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

samplesDown <- getResults(query,cols = c("cases"))

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

queryDown <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))

GDCdownload(query = queryDown)

dataPrep1 <- GDCprepare(query = queryDown, 
                        save = TRUE, 
                        save.filename = "TCGA_BRCA_HTSeq_Countds.rda")

dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")


dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "geneLength")

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

3. UseRaw_afterFilter:过滤后,保留原始计数(raw counts)

应用过滤和规范化(filtering and normalization)步骤后,此函数会将原始计数(raw counts)重新分配给过滤后的数据框genesXsamples。此步骤对于仅仅需要使用原始计数(raw counts)的方法非常有用,例如edgeR分析流程。

参数为:

  • DataPrep:由GDCprepare()返回的DataPrep对象
  • Filtered:在标准化和过滤(normalization and filtering)步骤之后获得的数据框(data frame),其中列是样品,行是基因
### dataframe will have genes from dataFilt but raw counts from dataPrep
dataPrep_raw <- UseRaw_afterFilter(dataPrep, dataFilt)

4. TCGA_MolecularSubtype:查询癌症数据的子类型:

此函数将TCGA Barcodes的向量作为输入,并返回下面两个列表:

  • filtered :没有分子亚型的TCGA Barcodes的向量
  • Condtypes:一个数据框,包含了:samplessubtypecolorbarcodes.

如果需要更多数据,请检查包数据中可用的TbSubtypesCol_merged数据框。

##use previously fetched BRCA data:
Pam50.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt))

5. TCGAanalyze_DEA():差异表达分析:

TCGAbiolinks正在为TCGAanalyze_DEA()函数提供一种新设计,通过提供差异分析公式,以在两组或者多组中执行差异表达分析(DEA)。Limma分析流程也被贴加到EdgeR中。

该函数采用以下参数:

  • mat1:数值矩阵,每行代表一个基因,每列代表一个Cond1type的样本
  • mat2:数值矩阵,每行代表一个基因,每列代表一个Cond2type的样本
  • Cond1type:包含mat1中样本的类标签的字符串(例如,对照组)
  • Cond2type:包含mat2中样本的类标签的字符串(例如,案例组)
  • pipeline :是一个字符串,用来指定要使用的差异分析包(“limma”或“edgeR”)
  • method :用于edgeR的方法:‘glmLRT’或者‘exactTest’ 。1)glmLRT将负二项式广义对数线性模型(negative binomial generalized log-linear model)拟合到每个基因的读数计数(read counts);2)exactTest计算 genewise exact tests for differences in the means between two groups of negative-binomially distributed counts.
  • fdr.cut:是一个阈值,根据修正的p值(p-value corrected)过滤DEGs
  • logFC.cut:是一个阈值,是根据logFC过滤DEGs。
  • batch.factors:包含字符串的向量,用于指定批量修正的选项。选项有:“Plate”、“TSS”、“Year”、“Portion”、“Center”、“Patients” 、以及它们在设计矩阵中的标注
  • ClinicalDF:由GDCquery_clinic()返回的数据框,用作提取进行批量修正(batch correction (blocking))的“Year”数据提取的输入项
  • paired :一个布尔值,以考虑配对或非配对样本。设置为TRUE则是配对样本
  • log.trans:一个布尔值,log cpm转换。设置为TRUE则执行log transformation
  • voom:一个布尔值,执行voom transform data。设置为TRUE则执行转换
  • trend:一个布尔值,来执行limma-trend pipeline。设置为TRUE则执行limma-trend
  • MAT:一个矩阵,包含基因表达,列是全部样本,行是基因。如果使用mat1和mat2,则不需要提供该矩阵
  • contrast.formula:字符串,作为输入,用于确定系数(coefficients),并以自定义方式设计对比度(contrasts)
  • Condtypes:一个向量,用于确定MAT中的样本的分组

5.1. Limma分析流程

在该案例中,将在通过GDC门户网站的tcgabiolinks下载数据,并通过控制组和案例组执行差异表达式分析。其他分析流程的选项可以是是edgeR

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmTP_short],
                            mat2 = dataFilt[,dataSmNT_short],
                            pipeline="limma",
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT", ClinicalDF = data.frame())

5.2. contrast.formula参数 - 自定义对比度(contrast)

contrast.formula参数是一个字符串公式,用于limma pipeline差异表达分析的输入。公式中对比元素(elements of contrast)应该是CondTypes向量中的唯一元素之一。下面的示例显示了contrast.formula参数应具有的语法。我们在这里对比了一些肺癌样本的Pam50分子亚型(Pam50 molecular subtypes)。

注意:对比公式的语法应该采用这种形式:formula <- “contrast1=type1-type2, contrast2=type2-type3, contrast3=type1-(type2+type3)/2,…”

CondTypes 向量中的元素最好是没有空格字符。

### ##download and prepare lung data through GDC### ##
query.lung <- GDCquery(project = "TCGA-LUSC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification", 
                       workflow.type = "HTSeq - Counts")

samplesDown.lusc <- getResults(query.lung,cols=c("cases"))

dataSmTP.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
                                       typesample = "TP")

dataSmNT.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
                                       typesample = "NT")
dataSmTP_short.lusc <- dataSmTP.lusc[1:10]
dataSmNT_short.lusc <- dataSmNT.lusc[1:10]
length(dataSmTP.lusc)


queryDown.lung <- GDCquery(project = "TCGA-LUSC", 
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification", 
                           workflow.type = "HTSeq - Counts", 
                           barcode = c(dataSmTP.lusc, dataSmNT.lusc))

GDCdownload(query = queryDown.lung)

dataPrep1.tcga <- GDCprepare(query = queryDown.lung, 
                             save = TRUE, 
                             save.filename = "TCGA_LUSC_HTSeq_Countds.rda")
dataPrep.tcga <- TCGAanalyze_Preprocessing(object = dataPrep1.tcga, 
                                           cor.cut = 0.6,
                                           datatype = "HTSeq - Counts")

dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataPrep.tcga,
                                           geneInfo = geneInfoHT,
                                           method = "gcContent")

dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataNorm.tcga,
                                           geneInfo = geneInfoHT,
                                           method = "geneLength")
dataFilt.tcga <- TCGAanalyze_Filtering(tabDF = dataNorm.tcga,
                                       method = "quantile", 
                                       qnt.cut =  0.25)

### #Filtering data so all samples have a pam50 subtype for LUSC

diff<-setdiff(colnames(dataFilt.tcga),
              TCGA_MolecularSubtype(colnames(dataFilt.tcga))$filtered)
TCGA_MolecularSubtype(diff)$subtypes$barcodes

dataFilt.tcga.lusc<-dataFilt.tcga[,diff]
LUSC.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt.tcga.lusc))$subtypes$subtype

### Differential expression analysis after correcting for "Plate" factor.
DEG.lusc <- TCGAanalyze_DEA(MAT=dataFilt.tcga.lusc,
                            pipeline="limma",
                            batch.factors = c("Plate"),
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT", ClinicalDF = data.frame(),
                            Condtypes = LUSC.subtypes,
                            contrast.formula = "LUSC.basalvsLUSC.classical=LUSC.basal - LUSC.classical")

6. TCGAbatch_correction:执行批量修正和lima-voom转换

此函数从sva包调用ComBat函数来处理批量修正,显示一些用于探索性数据分析(exploratory data analysis purposes)的图。更具体地说,它显示了一个图,其中黑色作为经验批效应密度的核估计(kernel estimate of the empirical batch effect density),红色作为参数(parametric)。

  • tabDF:数值矩阵,每行代表一个基因,每列代表一个样本
  • batch.factor:一个包含批处理因子(batch factor)的字符串,用于批处理校正,字符串选项有:“Plate”、“TSS”、“Year”、“Portion”、“Center”、“Patients”。该函数一次只接受一个批处理因子
  • adjustment:一个包含协变量因子(covariate factors)的字符串向量,用来对ComBat进行调整。选项有:“Plate”、“TSS”、“Year”、“Portion”、“Center”、“Patients”。
  • ClinicalDF:由GDCquery_clinic()返回的数据框,用于提取年份数据

除了通过ComBat图进行数据探索之外,如果没有提供批量因子,则不会执行任何操作。运行以下代码查看示例:

batch.corrected <- TCGAbatch_Correction(tabDF = dataFilt.tcga, batch.factor = "Plate")

batch.corrected.adjusted <- TCGAbatch_Correction(tabDF = dataFilt.tcga, 
                                                 batch.factor = "Plate", 
                                                 adjustment=c("TSS"))

7. TCGAbatch_correction:未发布的数据集处理

此函数从sva包调用ComBat函数来处理批量修正,显示一些用于探索性数据分析(exploratory data analysis purposes)的图。更具体地说,它显示了一个图,其中黑色作为经验批效应密度的核估计(kernel estimate of the empirical batch effect density),红色作为参数(parametric)。

  • tabDF:数值矩阵,每行代表一个基因,每列代表一个样本
  • UnpublishedData:如果是TRUE,在添加新数据后执行批量修正
  • AnnotationDF:一个数据框,其中包含Batch列,表示tabDF中不同批次的样本

除了通过ComBat图进行数据探索之外,如果没有提供批量因子,则不会执行任何操作。运行以下代码查看示例:

以下示例显示了如何考虑两种癌症类型LUAD和LUSC以及正常和肿瘤样本。例如,用户最后可以用未发布的数据替换LUSC。

#dataFilt is a gene expression matrix from pancancer33 RNASeq data that can be generated following our previous described workflow

sampleTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "TP")
sampleNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "NT")
dataFiltTP <- dataFilt[,sampleTP]
dataFiltNT <- dataFilt[,sampleNT]
    
dataSubt_LUAD <-TCGAquery_subtype(tumor = "LUAD")
sampleStageI_LUAD <- dataSubt_LUAD[dataSubt_LUAD$Tumor.stage %in% c("Stage I","Stage IA","Stage IB"),]
dataFilt.LUAD.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUAD$patient]
dataFilt.LUAD.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUAD$patient]

# here for example the end-user can replace LUSC with unpublished data in a dataframe containing counts.
# and adding the information for LUSC.stageI.TP or LUSC.stageI.NT with unpublished data

dataSubt_LUSC <-TCGAquery_subtype(tumor = "LUSC")
sampleStageI_LUSC <- dataSubt_LUSC[dataSubt_LUSC$T.stage %in% c("T1","T1a","T1b"),]
dataFilt.LUSC.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUSC$patient]
dataFilt.LUSC.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUSC$patient]

countsTable <-cbind(dataFilt.LUAD.stageI.TP,dataFilt.LUAD.stageI.NT,
                    dataFilt.LUSC.stageI.TP, dataFilt.LUSC.stageI.NT)

AnnotationCounts <- matrix(0,ncol(countsTable),2)
colnames(AnnotationCounts) <- c("Samples","Batch")
rownames(AnnotationCounts) <- colnames(countsTable)

AnnotationCounts <- as.data.frame(AnnotationCounts)
AnnotationCounts$Samples <- colnames(countsTable)

AnnotationCounts[colnames(dataFilt.LUAD.stageI.TP),"Batch"] <- "LUAD.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUAD.stageI.NT),"Batch"] <- "LUAD.stageI.NT"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.TP),"Batch"] <- "LUSC.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.NT),"Batch"] <- "LUSC.stageI.NT"

countsCorrected <- TCGAbatch_Correction(tabDF =countsTable,
                                        UnpublishedData = TRUE, 
                                        AnnotationDF = AnnotationCounts)

8. TCGAtumor_purity:根据肿瘤纯度(tumor purity)过滤TCGA样品

该函数将TCGA条形码作为输入,并根据用户指定的值对其进行过滤。所有的默认值都是0。参数是使用从TCGA导入的不同度量的估计值,它们如下(参见*[1]*):

  • estimate:使用141个免疫基因和141个基质基因的基因表达谱
  • absolute:使用体细胞拷贝数数据(somatic copy-number data),估计仅适用于11种癌症类型
  • lump (leukocytes unmethylation for purity):平均44个非甲基化免疫特异性CpG位点(non-methylated immune-specific CpG sites)
  • ihc:是通过全国儿童医院生物样本核心资源产生的苏木精和伊红染色载玻片的图像分析的估计
  • cpe: CPE是衍生的共识测量值,作为从所有方法归一化水平后的中位数纯度水平(median purity level),以给予它们相等的平均值和标准偏差。

该函数返回一个列表(list)对象,其中:1)pure_barcodes属性是纯样本的向量;2)filtered属性是过滤样本的向量,这些样本没有纯度信息。

有关所有TCGA测量样本的肿瘤纯度信息,请查看包装中可用的Tumor.purity数据框

[1] D. Aran, M. Sirota, and A. J. Butte. Systematic pan-cancer analysis of tumour purity. Nat Commun, 6:8971, 2015.

Purity.BRCA <- TCGAtumor_purity(colnames(dataPrep.tcga), 0, 0, 0, 0, 0.7)

9. 通过Recount2项目下载GTEx数据:

Recount2项目已将基因计数数据提供给科学界下载。Recount2是一个在线资源,由RNA-seq gene 和 exon counts组成,以及覆盖2041个不同研究的bigWig文件。它是ReCount项目的第二代。如在Recount2论文中所述,使用Rail-RNA处理原始测序数据。此功能的目的是在单个管道下处理数据,以避免任何技术变化影响下游整合分析的。返回的对象将是一个列表,其中包含名为project_tissue的元素和每个样本的基因计数的SummarizedExperiment (SE)。这对于研究无批次数据或增加健康/正常样品库特别有用。

参数是:

  • project:字符串输入,指定它是TCGA数据还是GTEx
  • tissue:包含感兴趣的组织以进行查询。
##Brain data through Recount2

brain.rec <- TCGAquery_recount2(project = "gtex", tissue = "brain")

评论

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×