翻译 - TCGAbiolinks 7 - Analyzing and visualizing TCGA data

TCGAbiolinks:TCGA数据分析和可视化

在运行程序前,我们需要导入必要的包:

library(TCGAbiolinks)

library(SummarizedExperiment)

1. TCGA的数据分析 - TCGAanalyze

我们可以使用下面的函数轻松分析TCGA数据:

1.1. 基因表达数据的预处理 - TCGAanalyze_Preprocessing

我们可以轻松搜索TCGA样本,并下载、准备基因表达矩阵。

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseqSE <- GDCprepare(query)

BRCAMatrix <- assay(BRCARnaseqSE,"raw_count") # or BRCAMatrix <- assay(BRCARnaseqSE,"raw_count")

# For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
BRCARnaseq_CorOutliers <- TCGAanalyze_Preprocessing(BRCARnaseqSE)

结果如下所示:

  1. 基因表达矩阵的示例(行中10个基因,列中2个样本)

    TCGA-BH-A1FC-11A-32R-A13Q-07TCGA-A7-A13D-01A-13R-A12P-07
    LMX1A | 400923.000.00
    CPNE1 | 89041608.637119.57
    UBE2C | 1106527.007596.00
    PREP | 55501493.003215.00
    OR6S1 | 3417990.000.00
  2. TCGAanalyze_Preprocessing的分析结果如下所示:

1.2. 差异表达分析 - TCGAanalyze_DEA & TCGAanalyze_LevelTab

我们可以是使用TCGAanalyze_DEA函数进行差异表达分析(DEA, Differential expression analysis),从而鉴定差异表达的基因(DEG, differentially expressed genes)。

TCGAanalyze_DEA执行差异表达分析,需要使用R中的一些函数:

  1. edgeR :: DGEList:将基因的count matrix转换为一个edgeR对象
  2. edgeR :: estimateCommonDisp:为每个基因分配相同的色散估计(dispersion estimate)
  3. edgeR :: exactTest:对两组之间的差异表达式执行成对测试(pair-wise tests)
  4. edgeR :: topTags:从exactTest()获取输出,使用假发现率(FDR, False Discovery Rate)校正调整原始P-values,并返回差异表达最大的基因

使用TCGAanalyze_DEA函数,需要以下几个参数:

  • mat1:Group 1的矩阵(在本例中,Group 1是正常样本的表达数据),
  • mat2:Group 2的矩阵(在本例中,Group 2是肿瘤样本的表达数据)
  • Cond1type:Group 1标签
  • Cond2type:Group 2标签

接下来,我们通过abs(LogFC) >=1过滤dataDEGs的输出,并使用TCGAanalyze_LevelTab函数创建一个表格,这个表格中包含:1)差异表达基因(DEGs);2)Log Fold Change (FC);3)False Discovery Rate (FDR);4)在Cond1type和Cond2type的样品中的基因的表达水平;5) Delta value (the difference of gene expression between the two conditions multiplied logFC)

# Downstream analysis using gene expression data  
# TCGA samples from IlluminaHiSeq_RNASeqV2 with type rsem.genes.results
# save(dataBRCA, geneInfo , file = "dataGeneExpression.rda")
library(TCGAbiolinks)

# normalization of genes
dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo =  geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), 
                                   typesample = c("TP"))

# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                            mat2 = dataFilt[,samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")

# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal",
                                          dataFilt[,samplesTP],dataFilt[,samplesNT])

结果如下所示:

DEA分析后的DEGs表

mRNAlogFCFDRTumorNormalDelta
FN12.881.296151e-19347787.4841234.121001017.3
COL1A11.771.680844e-08358010.3289293.72633086.3
C4orf75.202.826474e-5087821.362132.76456425.4
COL1A21.409.480478e-06273385.4491241.32383242.9
GAPDH1.323.290678e-05179057.4463663.00236255.5
CLEC3A6.797.971002e-7427257.16259.60185158.6
IGFBP51.241.060717e-04128186.8853323.12158674.6
CPB14.273.044021e-3737001.762637.72157968.8
CARTPT6.721.023371e-7221700.96215.16145872.8
DCD7.261.047988e-8019941.2084.80144806.3

其他示例将在下一节中介绍。

1.2.1. HTSeq数据:BRCA下游分析

CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")

query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

samplesDown <- getResults(query,cols=c("cases"))

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

queryDown <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))
                    
GDCdownload(query = queryDown,
            directory = DataDirectory)

dataPrep <- GDCprepare(query = queryDown, 
                       save = TRUE, 
                       directory =  DataDirectory,
                       save.filename = FileNameData)

dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")                      

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent") 

boxplot(dataPrep, outline = FALSE)

boxplot(dataNorm, outline = FALSE)

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmTP_short],
                            mat2 = dataFilt[,dataSmNT_short],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  

1.2.2. miRNA表达数据:BRCA下游分析

require(TCGAbiolinks)

CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","miRNA_gene_quantification",".rda")

query.miR <- GDCquery(project = CancerProject, 
                  data.category = "Gene expression",
                  data.type = "miRNA gene quantification",
                  file.type = "hg19.mirna",
                  legacy = TRUE)

samplesDown.miR <- getResults(query.miR,cols=c("cases"))

dataSmTP.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                  typesample = "TP")

dataSmNT.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                  typesample = "NT")
dataSmTP_short.miR <- dataSmTP.miR[1:10]
dataSmNT_short.miR <- dataSmNT.miR[1:10]

queryDown.miR <- GDCquery(project = CancerProject, 
                      data.category = "Gene expression",
                      data.type = "miRNA gene quantification",
                      file.type = "hg19.mirna",
                      legacy = TRUE,
                      barcode = c(dataSmTP_short.miR, dataSmNT_short.miR))

GDCdownload(query = queryDown.miR,
            directory = DataDirectory)

dataAssy.miR <- GDCprepare(query = queryDown.miR, 
                           save = TRUE, 
                           save.filename = FileNameData, 
                           summarizedExperiment = TRUE,
                           directory =DataDirectory )
rownames(dataAssy.miR) <- dataAssy.miR$miRNA_ID

# using read_count's data 
read_countData <-  colnames(dataAssy.miR)[grep("count", colnames(dataAssy.miR))]
dataAssy.miR <- dataAssy.miR[,read_countData]
colnames(dataAssy.miR) <- gsub("read_count_","", colnames(dataAssy.miR))

dataFilt <- TCGAanalyze_Filtering(tabDF = dataAssy.miR,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT_short.miR],
                            mat2 = dataFilt[,dataSmTP_short.miR],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  

1.3. 富集分析 - TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot

为了更好地理解潜在的生物过程,研究人员通常希望获得一组基因的功能特征。这可以通过富集分析(enrichment analysis)来完成。

  1. 富集分析

    我们可以使用TCGAanalyze_EAcomplete函数对基因集进行富集分析。在给定一定条件下上调的一组基因,富集分析将使用该基因集的注释信息确定过量表达的基因或蛋白质的类别。

  2. 结果查看

    在完成富集分析后,可以使用TCGAvisualize_EAbarplot进行结果展示

library(TCGAbiolinks)
# Enrichment Analysis EA
# Gene Ontology (GO) and Pathway enrichment by DEGs list
Genelist <- rownames(dataDEGsFiltLevel)

system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist))

# Enrichment Analysis EA (TCGAVisualize)
# Gene Ontology (GO) and Pathway enrichment barPlot

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), 
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = Genelist, 
                        nBar = 10)

结果如下所示:

1.4. 生存分析 - TCGAanalyze_survival

在分析生存时,会出现与目前讨论的问题不同的问题。一个问题是我们如何处理退出研究的病人数据。例如,假设我们测试了一种新的抗癌药物。当一些受试者死亡时,其他人可能认为新药无效,并决定在研究结束前退出研究。当我们调查机器在故障发生前持续运行多长时间的时候,会遇到类似的问题。

使用TCGAanalyze_survival函数,我们可以利用临床数据得到生存分析的结果:

clin.gbm <- GDCquery_clinic("TCGA-GBM", "clinical")
TCGAanalyze_survival(clin.gbm,
                     "gender",
                     main = "TCGA Set\n GBM",height = 10, width=10)

TCGAanalyze_survival的参数有:

  • clinical_patient: TCGA 临床病人,并且有病人死亡日期(days_to_death)的信息
  • clusterCol: 包含要绘制的组别的列。这是一个必填字段,标题将基于此列
  • legend:图片的图例标题
  • xlim:限制X轴的范围,比如xlim = c(0, 1000). 呈现较窄的X轴,但不影响生存估计
  • main:绘图的主要标题main title of the plot
  • ylab:Y轴的内容
  • xlab:X轴的内容
  • filename:保存为PDF文件的时候,文件的名称
  • color:定义图中线条的颜色
  • pvalue:在图中,展示生存分析的P-value.
  • risk.table:决定释放显示风险表格
  • conf.int:显示生存曲线的点估计的置信区间

结果如下所示:

1.5. 关联基因表达和生存分析 - TCGAanalyze_SurvivalKM

library(TCGAbiolinks)
# Survival Analysis SA

clinical_patient_Cancer <- GDCquery_clinic("TCGA-BRCA","clinical")
dataBRCAcomplete <- log2(BRCA_rnaseqv2)

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dataBRCAcomplete)/100)){
    message( paste( i, "of ", round(nrow(dataBRCAcomplete)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM<-TCGAanalyze_SurvivalKM(clinical_patient_Cancer,
                                      dataBRCAcomplete,
                                      Genelist = rownames(dataBRCAcomplete)[tokenStart:tokenStop],
                                      Survresult = F,
                                      ThreshTop=0.67,
                                      ThreshDown=0.33)
    
    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

tabSurvKMcomplete <- tabSurvKMcomplete[tabSurvKMcomplete$pvalue < 0.01,]
tabSurvKMcomplete <- tabSurvKMcomplete[order(tabSurvKMcomplete$pvalue, decreasing=F),]

tabSurvKMcompleteDEGs <- tabSurvKMcomplete[
    rownames(tabSurvKMcomplete) %in% dataDEGsFiltLevel$mRNA,
    ]

结果如下所示:

生存分析后的KM-survival genes的列表

pvalueCancer DeathsCancer Deaths with TopCancer Deaths with Down
DCTPP16.204170e-08664620
APOO9.390193e-06654916
LOC3876461.039097e-05694821
PGK11.198577e-05714922
CCNE22.100348e-05654817
Mean Tumor TopMean Tumor DownMean Normal
DCTPP113.3112.0111.74
APOO11.4010.1710.01
LOC3876467.924.645.90
PGK115.6614.1814.28
CCNE211.078.237.03

1.6. 差异甲基化区域分析 - TCGAanalyze_DMR

我们将使用TCGAanalyze_DMR函数搜索差异甲基化的CpG位点(differentially methylated CpG sites)。为了找到这些区域,我们使用β值(甲基化值范围从0.0到1.0)来比较两组之间的差异。

  1. 首先,计算每个探针的每组平均DNA甲基化之间的差异

  2. 其次,使用通过Benjamini-Hochberg方法调整的wilcoxon检验来测试两组之间的差异表达。默认参数设置为:1)要求最小绝对β值差值为0.2;2)调整后的p值<0.01

  3. 在这些测试之后,我们保存火山图(x轴:平均甲基化差异,y轴:统计显着性),这将帮助用户识别差异甲基化的CpG位点,然后将结果保存在csv文件中(DMR_results.groupCol.group1.group2.csv),最后,使用rowRanges中的微积分返回对象

TCGAanalyze_DMR的参数是:

参数描述
dataTCGAPrepare获得的SummarizedExperiment对象
groupColSummarizedExperiment对象内的列,这些列包含组别信息(这将通过函数colData(data)获得)
group1如果我们的对象有两个以上的组,您应该设置该组的名称
group2如果我们的对象有两个以上的组,您应该设置该组的名称
calculate.pvalues.probes为了更快地获得探针,用户可以选择具有DNA甲基化差异的探针来计算P-value。默认是计算所有探针。可选的值有: “all”、“differential”。默认是“all”
plot.filename文件名。默认是:volcano.pdf,volcano.svg,volcano.png。如果设置为FALSE,则没有绘图。
ylaby轴内容
xlabX轴内容
title绘图的主要标题。如果没有指定,将是"Volcano plot (group1 vs group2)"
legend图例标题
color用于图形中颜色的向量
label要在图中使用的标签的向量。例如:c(“Not Significant”,“Hypermethylated in group1”, “Hypomethylated in group1”))
xlimxx轴的限制范围
ylimy轴的限制范围
p.cutP-values阈值。默认是: 0.01
probe.names是否是探针名称
diffmean.cut平均差异(diffmean)阈值。默认是: 0.2
pairedWilcoxon paired parameter。默认是: FALSE
adj.method调整P-value计算的方法
overwriteOverwrite the pvalues and diffmean values if already in the object for both groups? Default: FALSE
coresNumber of cores to be used in the non-parametric test Default = groupCol.group1.group2.rda
saveSave object with results? Default: TRUE
data <- TCGAanalyze_DMR(data, groupCol = "methylation_subtype",
                        group1 = "CIMP.H",
                        group2="CIMP.L",
                        p.cut = 10^-5,
                        diffmean.cut = 0.25,
                        legend = "State",
                        plot.filename = "coad_CIMPHvsCIMPL_metvolcano.png")

输出将是如下所示的图。绿点是与组1相比,组2中低甲基化(hypomethylated)的探针,而红色组是组2中与组1相比的高甲基化(hypermethylated)探针:

此外,TCGAanalyze_DMR函数将绘图保存为pdf,并返回相同的SummarizedExperiment对象,该对象包含了:1)p-value;2)p-value adjusted;3)diffmean;4)the group it belongs in the graph (non significant, hypomethylated, hypermethylated) in the rowRanges。collumns将是(其中group1和group2是组的名称):

  • diffmean.group1.group2 (mean.group2 - mean.group1)
  • diffmean.group2.group1 (mean.group1 - mean.group2)
  • p.value.group1.group2
  • p.value.adj.group1.group2
  • status.group1.group2 (Status of probes in group2 in relation to group1)
  • status.group2.group1 (Status of probes in group1 in relation to group2)

可以使用rowRanges (rowRanges(data))查看/访问这些数值。

**观察:**使用相同的参数再次调用相同的函数将仅绘制结果,因为它已经计算过。如果您想要让他们重新计算,请设置overwriteTRUE,或删除计算collumns。

2. TCGA的数据可视化 - TCGAvisualize

您可以轻松地查看以下某些功能的结果:

2.1. 热图 - TCGAvisualize_Heatmap

为了更好地了解基因在样本中的聚类信息,我们通常使用热图。而TCGAvisualize_Heatmap可以绘制热图,并添加每一个样本bar来代表很多特征。此函数是ComplexHeatmap包的包装器,

这个函数的参数是:

  • data:用于绘制热图的数据(expression, methylation)
  • col.metadata:热图中列(患者)的元数据。这些列可以是bcr_patient_barcode或者代表patients barcodes的病人或者ID
  • row.metadata:热图中行的元数据,这些行可以是genes (expression) 或者 probes (methylation)
  • col.colors:列的颜色列表
  • row.colors:行的颜色列表
  • show_column_names:是否显示列的名称?默认是:FALSE
  • show_row_names:是否显示行的名称?默认是:FALSE
  • cluster_rows:是否对行进行聚类?默认是:FALSE
  • cluster_columns:是否对列进行聚类?默认是:FALSE
  • sortCol:用列的名称进行列的排序
  • title:热图的标题
  • type:热图中值的颜色。候选的值有:“expression” (default)、“methylation”
  • scale:使用z-score来制作热图?1)如果我们想显示基因之间的差异,最好通过样本进行Z-score(对于每个样本的均值为:0;标准差为:1);2)如果我们想要显示样本之间的差异,最好通过基因进行Z-score(对于每个基因的均值为:0;标准差为:1)。候选的值有:“row”、“col“、”none"(Default)

有关更多信息,请查看**case study #2**

结果如下所示:

2.2. 火山图 - TCGAvisualize_Volcano

为DNA甲基化或基因表达绘制火山图

这个函数的参数是:

ArgumentDescription
xx轴数据
yy轴数据
filename绘制的火山图文件名称,默认是: volcano.pdf、volcano.svg、volcano.png
ylaby轴的标题
xlabx轴的标题
title火山图标题。如果没有指定,则是“Volcano plot (group1 vs group2)”
legend火山图图例标题
label要在火山图中使用的标签的向量。例如: c(“Not Significant”,“Hypermethylated in group1”, “Hypomethylated in group1”))#’
xlimx轴的限制范围
ylimy轴的限制范围
color在火山图中使用的颜色的向量
namesNames to be ploted if significant. Should be the same size of x and y
names.fillNames should be filled in a color box? Default: TRUE
show.namesWhat names will be showd? Possibilities: “both”, “significant”, “highlighted”
x.cutx-axis threshold. Default: 0.0 If you give only one number (e.g. 0.2) the cut-offs will be -0.2 and 0.2. Or you can give diffenrent cutt-ofs as a vector (e.g. c(-0.3,0.4))
y.cutp-values threshold.
height图片高度
width图片宽度
highlightList of genes/probes to be highlighted. It should be in the names argument.
highlight.colorColor of the points highlighted
names.size火山图中文字的大小
dpi图片的分辨率:dpi

有关更多信息,请查看**case study #3**。

2.3. 差异表达基因的主成分分析 - TCGAvisualize_PCA

为了更好地理解我们的基因,我们可以进行PCA来减少基因组的维数。TCGAvisualize_PCA将为不同的组绘制PCA。

这个函数的参数是:

  • dataFilt The expression matrix after normalization and quantile filter
  • dataDEGsFiltLevel The TCGAanalyze_LevelTab output
  • ntopgenes number of DEGs genes to plot in PCA
  • ** group1 a string containing the barcode list of the samples in control group
  • ** group2 a string containing the barcode list of the samples in disease group
# normalization of genes
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# selection of normal samples "NT" 
group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
# selection of normal samples "TP" 
group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))

# Principal Component Analysis plot for ntop selected DEGs
 pca <- TCGAvisualize_PCA(dataFilt,dataDEGsFiltLevel, ntopgenes = 200, group1, group2)

结果如下所示:

2.4. 平均DNA甲基化分析 - TCGAvisualize_meanMethylation

使用该数据并计算每组的平均DNA甲基化,可以使用TCGAvisualize_meanMethylation函数创建平均DNA甲基化箱图:

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 27",
                  legacy = TRUE, 
                  barcode = c("TCGA-02-0058-01A-01D-0186-05", "TCGA-12-1597-01B-01D-0915-05",
                              "TCGA-12-0829-01A-01D-0392-05", "TCGA-06-0155-01B-01D-0521-05",
                              "TCGA-02-0099-01A-01D-0199-05", "TCGA-19-4068-01A-01D-1228-05",
                              "TCGA-19-1788-01A-01D-0595-05", "TCGA-16-0848-01A-01D-0392-05"))
GDCdownload(query, method = "api")
data <- GDCprepare(query)

# "shortLetterCode" is a column in the SummarizedExperiment::colData(data) matrix
TCGAvisualize_meanMethylation(data, groupCol = "shortLetterCode",filename = NULL)
##   groups      Mean    Median       Max       Min
## 1     NT 0.4938523 0.4956219 0.5247570 0.4687553
## 2     TP 0.4954753 0.4996517 0.5097262 0.4792454
## 3     TR 0.4967371 0.5010733 0.5128546 0.4801191
##           NT        TP        TR
## NT        NA 0.8567414 0.7446906
## TP 0.8567414        NA 0.8557347
## TR 0.7446906 0.8557347        NA

TCGAvisualize_meanMethylation 的参数是:

ArgumentDescription
dataSummarizedExperiment object obtained from GDCPrepare
groupColColumns in colData(data) that defines the groups. If no columns defined a columns called “Patients” will be used
subgroupColColumns in colData(data) that defines the subgroups.
shapesShape vector of the subgroups. It must have the size of the levels of the subgroups. Example: shapes = c(21,23) if for two levels
print.pvaluePrint p-value for two groups
plot.jitterPlot jitter? Default TRUE
jitter.sizePlot jitter size? Default 3
filenameThe name of the pdf that will be saved
ylaby axis text in the plot
xlabx axis text in the plot
titlemain title in the plot
labelsLabels of the groups
group.legendName of the group legend. DEFAULT: groupCol
subgroup.legendName of the subgroup legend. DEFAULT: subgroupCol
colorvector of colors to be used in graph
y.limitsChange lower/upper y-axis limit
sortSort boxplot by mean or median. Possible values: mean.asc, mean.desc, median.asc, meadian.desc
orderOrder of the boxplots
legend.positionLegend position (“top”, “right”,“left”,“bottom”)
legend.title.positionLegend title position (“top”, “right”,“left”,“bottom”)
legend.ncolsNumber of columns of the legend
add.axis.x.textAdd text to x-axis? Default: FALSE
widthPlot width default:10
heightPlot height default:10
dpiPdf dpi default:600
axis.text.x.angleAngle of text in the x axis

其他使用这些参数的案例:

# setting y limits: lower 0, upper 1
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode", 
                              filename = NULL, y.limits = c(0,1))
##   groups      Mean    Median       Max       Min
## 1     NT 0.4938523 0.4956219 0.5247570 0.4687553
## 2     TP 0.4954753 0.4996517 0.5097262 0.4792454
## 3     TR 0.4967371 0.5010733 0.5128546 0.4801191
##           NT        TP        TR
## NT        NA 0.8567414 0.7446906
## TP 0.8567414        NA 0.8557347
## TR 0.7446906 0.8557347        NA

# setting y limits: lower 0
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode", 
                              filename = NULL, y.limits = 0)
##   groups      Mean    Median       Max       Min
## 1     NT 0.4938523 0.4956219 0.5247570 0.4687553
## 2     TP 0.4954753 0.4996517 0.5097262 0.4792454
## 3     TR 0.4967371 0.5010733 0.5128546 0.4801191
##           NT        TP        TR
## NT        NA 0.8567414 0.7446906
## TP 0.8567414        NA 0.8557347
## TR 0.7446906 0.8557347        NA

# Changing shapes of jitters to show subgroups
TCGAvisualize_meanMethylation(data,
                              groupCol = "subtype_Pan.Glioma.DNA.Methylation.Cluster", 
                              subgroupCol ="vital_status", filename = NULL)
##   groups      Mean    Median       Max       Min
## 1 group1 0.4959208 0.5010733 0.5075258 0.4792454
## 2 group2 0.5012787 0.5047264 0.5247570 0.4730471
## 3 group3 0.4888652 0.4890000 0.5044062 0.4687553
##           group1    group2    group3
## group1        NA 0.5258007 0.2715927
## group2 0.5258007        NA 0.1658830
## group3 0.2715927 0.1658830        NA

# Sorting bars by descending mean methylation
TCGAvisualize_meanMethylation(data,
                              groupCol = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
                              sort="mean.desc",
                              filename=NULL)
##   groups      Mean    Median       Max       Min
## 1 group1 0.4959208 0.5010733 0.5075258 0.4792454
## 2 group2 0.5012787 0.5047264 0.5247570 0.4730471
## 3 group3 0.4888652 0.4890000 0.5044062 0.4687553
##           group1    group2    group3
## group1        NA 0.5258007 0.2715927
## group2 0.5258007        NA 0.1658830
## group3 0.2715927 0.1658830        NA

# Sorting bars by asc mean methylation
TCGAvisualize_meanMethylation(data,
                              groupCol = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
                              sort = "mean.asc",
                              filename=NULL)
##   groups      Mean    Median       Max       Min
## 1 group1 0.4959208 0.5010733 0.5075258 0.4792454
## 2 group2 0.5012787 0.5047264 0.5247570 0.4730471
## 3 group3 0.4888652 0.4890000 0.5044062 0.4687553
##           group1    group2    group3
## group1        NA 0.5258007 0.2715927
## group2 0.5258007        NA 0.1658830
## group3 0.2715927 0.1658830        NA

TCGAvisualize_meanMethylation(data,
                              groupCol = "vital_status",
                              sort = "mean.asc",
                              filename=NULL)
##   groups      Mean    Median       Max       Min
## 1  ALIVE 0.4954753 0.4996517 0.5097262 0.4792454
## 2   DEAD 0.4952947 0.4983476 0.5247570 0.4687553
##           ALIVE      DEAD
## ALIVE        NA 0.9780798
## DEAD  0.9780798        NA

2.5. 基因表达和DNA甲基化数据的整合 - TCGAvisualize_starburst

Starburst绘图首次在2010年被提出(Noushmehr et al. 2010),它能够结合两个火山图的信息,并应用于DNA甲基化和基因表达的研究。为了重现这个图,我们将使用TCGAvisualize_starburst函数。

该函数可以创建Starburst图,用于比较DNA甲基化和基因表达。对于每个基因,DNA甲基化的log10 (FDR-corrected P value)绘制在x轴上,基因表达绘制在y轴上。黑色虚线表示FDR-adjusted P value为0.01。

这个函数的参数是:

ArgumentDescription
expObject obtained by DEArnaSEQ function
group1The name of the group 1 Obs: Column p.value.adj.group1.group2 should exist
group2The name of the group 2. Obs: Column p.value.adj.group1.group2 should exist
exp.p.cutexpression p value cut-off
met.p.cutmethylation p value cut-off
diffmean.cutIf set, the probes with diffmean higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted.
logFC.cutIf set, the probes with expression fold change higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted.
met.platformDNA methylation platform (“27K”,“450K” or “EPIC”)
genomeGenome of reference (“hg38” or “hg19”) used to identify nearest probes TSS
namesAdd the names of the significant genes? Default: FALSE
names.fillNames should be filled in a color box? Default: TRUE
filenameThe filename of the file (it can be pdf, svg, png, etc)
return.plotIf true only plot object will be returned (pdf will not be created)
ylaby axis text
xlabx axis text
titlemain title
legendlegend title
colorvector of colors to be used in graph
labelvector of labels to be used in graph
xlimx limits to cut image
ylimy limits to cut image
heightFigure height
widthFigure width
dpiFigure dpi
starburst <- TCGAvisualize_starburst(coad.SummarizeExperiment, 
                                     different.experssion.analysis.data,
                                     group1 = "CIMP.H",
                                     group2 = "CIMP.L",
                                     met.platform = "450K",
                                     genome = "hg19",
                                     met.p.cut = 10^-5, 
                                     exp.p.cut = 10^-5,
                                     names = TRUE)

结果,该函数将绘制下图并返回具有Gene_symbol的矩阵,其状态与基因表达(up regulated/down regulated)和DNA甲基化(Hyper/Hypo methylated)有关。case study 3显示了创建此图的完整过程。

3. 版本信息

sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] png_0.1-7                   DT_0.5                     
##  [3] dplyr_0.8.0.1               SummarizedExperiment_1.12.0
##  [5] DelayedArray_0.8.0          BiocParallel_1.16.6        
##  [7] matrixStats_0.54.0          Biobase_2.42.0             
##  [9] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
## [11] IRanges_2.16.0              S4Vectors_0.20.1           
## [13] BiocGenerics_0.28.0         TCGAbiolinks_2.10.5        
## [15] testthat_2.0.1             
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.3               circlize_0.4.5               
##   [3] AnnotationHub_2.14.5          aroma.light_3.12.0           
##   [5] plyr_1.8.4                    selectr_0.4-1                
##   [7] ConsensusClusterPlus_1.46.0   lazyeval_0.2.2               
##   [9] splines_3.5.3                 usethis_1.4.0                
##  [11] ggplot2_3.1.0                 sva_3.30.1                   
##  [13] digest_0.6.18                 foreach_1.4.4                
##  [15] htmltools_0.3.6               magrittr_1.5                 
##  [17] memoise_1.1.0                 cluster_2.0.7-1              
##  [19] doParallel_1.0.14             limma_3.38.3                 
##  [21] remotes_2.0.2                 ComplexHeatmap_1.20.0        
##  [23] Biostrings_2.50.2             readr_1.3.1                  
##  [25] annotate_1.60.1               sesameData_1.0.0             
##  [27] R.utils_2.8.0                 prettyunits_1.0.2            
##  [29] colorspace_1.4-1              ggrepel_0.8.0                
##  [31] blob_1.1.1                    rvest_0.3.2                  
##  [33] xfun_0.5                      callr_3.2.0                  
##  [35] crayon_1.3.4                  RCurl_1.95-4.12              
##  [37] jsonlite_1.6                  genefilter_1.64.0            
##  [39] survival_2.43-3               zoo_1.8-4                    
##  [41] iterators_1.0.10              glue_1.3.1                   
##  [43] survminer_0.4.3               gtable_0.2.0                 
##  [45] sesame_1.0.0                  zlibbioc_1.28.0              
##  [47] XVector_0.22.0                GetoptLong_0.1.7             
##  [49] wheatmap_0.1.0                pkgbuild_1.0.2               
##  [51] shape_1.4.4                   scales_1.0.0                 
##  [53] DESeq_1.34.1                  DBI_1.0.0                    
##  [55] edgeR_3.24.3                  ggthemes_4.1.0               
##  [57] Rcpp_1.0.1                    xtable_1.8-3                 
##  [59] progress_1.2.0                cmprsk_2.2-7                 
##  [61] matlab_1.0.2                  bit_1.1-14                   
##  [63] preprocessCore_1.44.0         km.ci_0.5-2                  
##  [65] htmlwidgets_1.3               httr_1.4.0                   
##  [67] RColorBrewer_1.1-2            pkgconfig_2.0.2              
##  [69] XML_3.98-1.19                 R.methodsS3_1.7.1            
##  [71] DNAcopy_1.56.0                locfit_1.5-9.1               
##  [73] labeling_0.3                  later_0.8.0                  
##  [75] tidyselect_0.2.5              rlang_0.3.1                  
##  [77] AnnotationDbi_1.44.0          munsell_0.5.0                
##  [79] tools_3.5.3                   downloader_0.4               
##  [81] cli_1.1.0                     ExperimentHub_1.8.0          
##  [83] generics_0.0.2                RSQLite_2.1.1                
##  [85] devtools_2.0.1                broom_0.5.1                  
##  [87] evaluate_0.13                 stringr_1.4.0                
##  [89] yaml_2.2.0                    processx_3.3.0               
##  [91] knitr_1.22                    bit64_0.9-7                  
##  [93] fs_1.2.7                      survMisc_0.5.5               
##  [95] randomForest_4.6-14           purrr_0.3.2                  
##  [97] EDASeq_2.16.3                 nlme_3.1-137                 
##  [99] mime_0.6                      R.oo_1.22.0                  
## [101] xml2_1.2.0                    biomaRt_2.38.0               
## [103] compiler_3.5.3                rstudioapi_0.10              
## [105] curl_3.3                      interactiveDisplayBase_1.20.0
## [107] tibble_2.1.1                  geneplotter_1.60.0           
## [109] stringi_1.4.3                 highr_0.7                    
## [111] ps_1.3.0                      GenomicFeatures_1.34.6       
## [113] desc_1.2.0                    lattice_0.20-38              
## [115] Matrix_1.2-16                 KMsurv_0.1-5                 
## [117] pillar_1.3.1                  BiocManager_1.30.4           
## [119] GlobalOptions_0.1.0           data.table_1.12.0            
## [121] bitops_1.0-6                  httpuv_1.5.0                 
## [123] rtracklayer_1.42.2            R6_2.4.0                     
## [125] latticeExtra_0.6-28           hwriter_1.3.2                
## [127] promises_1.0.1                ShortRead_1.40.0             
## [129] gridExtra_2.3                 codetools_0.2-16             
## [131] sessioninfo_1.1.1             assertthat_0.2.0             
## [133] pkgload_1.0.2                 rprojroot_1.3-2              
## [135] rjson_0.2.20                  withr_2.1.2                  
## [137] GenomicAlignments_1.18.1      Rsamtools_1.34.1             
## [139] GenomeInfoDbData_1.2.0        mgcv_1.8-27                  
## [141] hms_0.4.2                     tidyr_0.8.3                  
## [143] rmarkdown_1.12                ggpubr_0.2                   
## [145] shiny_1.2.0

4. 参考

  • Noushmehr, Houtan, Daniel J Weisenberger, Kristin Diefes, Heidi S Phillips, Kanan Pujara, Benjamin P Berman, Fei Pan, et al. 2010. “Identification of a Cpg Island Methylator Phenotype That Defines a Distinct Subgroup of Glioma.” Cancer Cell 17 (5). Elsevier:510–22.
更新时间:2019-05-25 17:28:54

本文由 石九流 创作,如果您觉得本文不错,请随意赞赏
采用 知识共享署名4.0 国际许可协议进行许可
本站文章除注明转载/出处外,均为本站原创或翻译,转载前请务必署名
原文链接:https://blog.computsystmed.com/archives/translation-tcgabiolinks-analyzing-and-visualizing-tcga-data
最后更新:2019-05-25 17:28:54

评论

Your browser is out of date!

Update your browser to view this website correctly. Update my browser now

×