在运行程序前,我们需要导入必要的包:
library(TCGAbiolinks)
library(SummarizedExperiment)
TCGAanalyze
我们可以使用下面的函数轻松分析TCGA数据:
TCGAanalyze_Preprocessing
我们可以轻松搜索TCGA样本,并下载、准备基因表达矩阵。
# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
"TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
"TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
"TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
"TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")
# Query platform Illumina HiSeq with a list of barcode
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
experimental.strategy = "RNA-Seq",
platform = "Illumina HiSeq",
file.type = "results",
barcode = listSamples,
legacy = TRUE)
# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)
# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseqSE <- GDCprepare(query)
BRCAMatrix <- assay(BRCARnaseqSE,"raw_count") # or BRCAMatrix <- assay(BRCARnaseqSE,"raw_count")
# For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
BRCARnaseq_CorOutliers <- TCGAanalyze_Preprocessing(BRCARnaseqSE)
结果如下所示:
基因表达矩阵的示例(行中10个基因,列中2个样本)
TCGA-BH-A1FC-11A-32R-A13Q-07 | TCGA-A7-A13D-01A-13R-A12P-07 | |
---|---|---|
LMX1A | 4009 | 23.00 | 0.00 |
CPNE1 | 8904 | 1608.63 | 7119.57 |
UBE2C | 11065 | 27.00 | 7596.00 |
PREP | 5550 | 1493.00 | 3215.00 |
OR6S1 | 341799 | 0.00 | 0.00 |
TCGAanalyze_Preprocessing
的分析结果如下所示:
TCGAanalyze_DEA
&
TCGAanalyze_LevelTab
我们可以是使用TCGAanalyze_DEA
函数进行差异表达分析(DEA, Differential expression analysis),从而鉴定差异表达的基因(DEG, differentially expressed genes)。
TCGAanalyze_DEA
执行差异表达分析,需要使用R中的一些函数:
count matrix
转换为一个edgeR
对象exactTest()
获取输出,使用假发现率(FDR, False Discovery Rate)校正调整原始P-values,并返回差异表达最大的基因使用TCGAanalyze_DEA
函数,需要以下几个参数:
接下来,我们通过abs(LogFC) >=1
过滤dataDEGs
的输出,并使用TCGAanalyze_LevelTab
函数创建一个表格,这个表格中包含:1)差异表达基因(DEGs);2)Log Fold Change (FC);3)False Discovery Rate (FDR);4)在Cond1type和Cond2type的样品中的基因的表达水平;5) Delta value (the difference of gene expression between the two conditions multiplied logFC)
# Downstream analysis using gene expression data
# TCGA samples from IlluminaHiSeq_RNASeqV2 with type rsem.genes.results
# save(dataBRCA, geneInfo , file = "dataGeneExpression.rda")
library(TCGAbiolinks)
# normalization of genes
dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo = geneInfo)
# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
method = "quantile",
qnt.cut = 0.25)
# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
typesample = c("NT"))
# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
typesample = c("TP"))
# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
mat2 = dataFilt[,samplesTP],
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT")
# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal",
dataFilt[,samplesTP],dataFilt[,samplesNT])
结果如下所示:
DEA分析后的DEGs表
mRNA | logFC | FDR | Tumor | Normal | Delta |
---|---|---|---|---|---|
FN1 | 2.88 | 1.296151e-19 | 347787.48 | 41234.12 | 1001017.3 |
COL1A1 | 1.77 | 1.680844e-08 | 358010.32 | 89293.72 | 633086.3 |
C4orf7 | 5.20 | 2.826474e-50 | 87821.36 | 2132.76 | 456425.4 |
COL1A2 | 1.40 | 9.480478e-06 | 273385.44 | 91241.32 | 383242.9 |
GAPDH | 1.32 | 3.290678e-05 | 179057.44 | 63663.00 | 236255.5 |
CLEC3A | 6.79 | 7.971002e-74 | 27257.16 | 259.60 | 185158.6 |
IGFBP5 | 1.24 | 1.060717e-04 | 128186.88 | 53323.12 | 158674.6 |
CPB1 | 4.27 | 3.044021e-37 | 37001.76 | 2637.72 | 157968.8 |
CARTPT | 6.72 | 1.023371e-72 | 21700.96 | 215.16 | 145872.8 |
DCD | 7.26 | 1.047988e-80 | 19941.20 | 84.80 | 144806.3 |
其他示例将在下一节中介绍。
CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")
query <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown <- getResults(query,cols=c("cases"))
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
typesample = "TP")
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]
queryDown <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTP_short, dataSmNT_short))
GDCdownload(query = queryDown,
directory = DataDirectory)
dataPrep <- GDCprepare(query = queryDown,
save = TRUE,
directory = DataDirectory,
save.filename = FileNameData)
dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
geneInfo = geneInfoHT,
method = "gcContent")
boxplot(dataPrep, outline = FALSE)
boxplot(dataNorm, outline = FALSE)
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
method = "quantile",
qnt.cut = 0.25)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmTP_short],
mat2 = dataFilt[,dataSmNT_short],
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT")
require(TCGAbiolinks)
CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","miRNA_gene_quantification",".rda")
query.miR <- GDCquery(project = CancerProject,
data.category = "Gene expression",
data.type = "miRNA gene quantification",
file.type = "hg19.mirna",
legacy = TRUE)
samplesDown.miR <- getResults(query.miR,cols=c("cases"))
dataSmTP.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
typesample = "TP")
dataSmNT.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
typesample = "NT")
dataSmTP_short.miR <- dataSmTP.miR[1:10]
dataSmNT_short.miR <- dataSmNT.miR[1:10]
queryDown.miR <- GDCquery(project = CancerProject,
data.category = "Gene expression",
data.type = "miRNA gene quantification",
file.type = "hg19.mirna",
legacy = TRUE,
barcode = c(dataSmTP_short.miR, dataSmNT_short.miR))
GDCdownload(query = queryDown.miR,
directory = DataDirectory)
dataAssy.miR <- GDCprepare(query = queryDown.miR,
save = TRUE,
save.filename = FileNameData,
summarizedExperiment = TRUE,
directory =DataDirectory )
rownames(dataAssy.miR) <- dataAssy.miR$miRNA_ID
# using read_count's data
read_countData <- colnames(dataAssy.miR)[grep("count", colnames(dataAssy.miR))]
dataAssy.miR <- dataAssy.miR[,read_countData]
colnames(dataAssy.miR) <- gsub("read_count_","", colnames(dataAssy.miR))
dataFilt <- TCGAanalyze_Filtering(tabDF = dataAssy.miR,
method = "quantile",
qnt.cut = 0.25)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT_short.miR],
mat2 = dataFilt[,dataSmTP_short.miR],
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT")
TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot
为了更好地理解潜在的生物过程,研究人员通常希望获得一组基因的功能特征。这可以通过富集分析(enrichment analysis)来完成。
富集分析
我们可以使用TCGAanalyze_EAcomplete
函数对基因集进行富集分析。在给定一定条件下上调的一组基因,富集分析将使用该基因集的注释信息确定过量表达的基因或蛋白质的类别。
结果查看
在完成富集分析后,可以使用TCGAvisualize_EAbarplot
进行结果展示
library(TCGAbiolinks)
# Enrichment Analysis EA
# Gene Ontology (GO) and Pathway enrichment by DEGs list
Genelist <- rownames(dataDEGsFiltLevel)
system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist))
# Enrichment Analysis EA (TCGAVisualize)
# Gene Ontology (GO) and Pathway enrichment barPlot
TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
GOBPTab = ansEA$ResBP,
GOCCTab = ansEA$ResCC,
GOMFTab = ansEA$ResMF,
PathTab = ansEA$ResPat,
nRGTab = Genelist,
nBar = 10)
结果如下所示:
TCGAanalyze_survival
在分析生存时,会出现与目前讨论的问题不同的问题。一个问题是我们如何处理退出研究的病人数据。例如,假设我们测试了一种新的抗癌药物。当一些受试者死亡时,其他人可能认为新药无效,并决定在研究结束前退出研究。当我们调查机器在故障发生前持续运行多长时间的时候,会遇到类似的问题。
使用TCGAanalyze_survival
函数,我们可以利用临床数据得到生存分析的结果:
clin.gbm <- GDCquery_clinic("TCGA-GBM", "clinical")
TCGAanalyze_survival(clin.gbm,
"gender",
main = "TCGA Set\n GBM",height = 10, width=10)
TCGAanalyze_survival
的参数有:
xlim = c(0, 1000)
. 呈现较窄的X轴,但不影响生存估计结果如下所示:
TCGAanalyze_SurvivalKM
library(TCGAbiolinks)
# Survival Analysis SA
clinical_patient_Cancer <- GDCquery_clinic("TCGA-BRCA","clinical")
dataBRCAcomplete <- log2(BRCA_rnaseqv2)
tokenStop<- 1
tabSurvKMcomplete <- NULL
for( i in 1: round(nrow(dataBRCAcomplete)/100)){
message( paste( i, "of ", round(nrow(dataBRCAcomplete)/100)))
tokenStart <- tokenStop
tokenStop <-100*i
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical_patient_Cancer,
dataBRCAcomplete,
Genelist = rownames(dataBRCAcomplete)[tokenStart:tokenStop],
Survresult = F,
ThreshTop=0.67,
ThreshDown=0.33)
tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}
tabSurvKMcomplete <- tabSurvKMcomplete[tabSurvKMcomplete$pvalue < 0.01,]
tabSurvKMcomplete <- tabSurvKMcomplete[order(tabSurvKMcomplete$pvalue, decreasing=F),]
tabSurvKMcompleteDEGs <- tabSurvKMcomplete[
rownames(tabSurvKMcomplete) %in% dataDEGsFiltLevel$mRNA,
]
结果如下所示:
生存分析后的KM-survival genes的列表
pvalue | Cancer Deaths | Cancer Deaths with Top | Cancer Deaths with Down | |
---|---|---|---|---|
DCTPP1 | 6.204170e-08 | 66 | 46 | 20 |
APOO | 9.390193e-06 | 65 | 49 | 16 |
LOC387646 | 1.039097e-05 | 69 | 48 | 21 |
PGK1 | 1.198577e-05 | 71 | 49 | 22 |
CCNE2 | 2.100348e-05 | 65 | 48 | 17 |
Mean Tumor Top | Mean Tumor Down | Mean Normal | |
---|---|---|---|
DCTPP1 | 13.31 | 12.01 | 11.74 |
APOO | 11.40 | 10.17 | 10.01 |
LOC387646 | 7.92 | 4.64 | 5.90 |
PGK1 | 15.66 | 14.18 | 14.28 |
CCNE2 | 11.07 | 8.23 | 7.03 |
TCGAanalyze_DMR
我们将使用TCGAanalyze_DMR
函数搜索差异甲基化的CpG位点(differentially methylated CpG sites)。为了找到这些区域,我们使用β值(甲基化值范围从0.0到1.0)来比较两组之间的差异。
首先,计算每个探针的每组平均DNA甲基化之间的差异
其次,使用通过Benjamini-Hochberg
方法调整的wilcoxon检验
来测试两组之间的差异表达。默认参数设置为:1)要求最小绝对β值差值为0.2;2)调整后的p值<0.01
在这些测试之后,我们保存火山图(x轴:平均甲基化差异,y轴:统计显着性),这将帮助用户识别差异甲基化的CpG位点,然后将结果保存在csv文件中(DMR_results.groupCol.group1.group2.csv
),最后,使用rowRanges中的微积分返回对象
TCGAanalyze_DMR
的参数是:
参数 | 描述 |
---|---|
data | 从TCGAPrepare 获得的SummarizedExperiment 对象 |
groupCol | SummarizedExperiment 对象内的列,这些列包含组别信息(这将通过函数colData(data) 获得) |
group1 | 如果我们的对象有两个以上的组,您应该设置该组的名称 |
group2 | 如果我们的对象有两个以上的组,您应该设置该组的名称 |
calculate.pvalues.probes | 为了更快地获得探针,用户可以选择具有DNA甲基化差异的探针来计算P-value。默认是计算所有探针。可选的值有: “all”、“differential”。默认是“all” |
plot.filename | 文件名。默认是:volcano.pdf,volcano.svg,volcano.png。如果设置为FALSE,则没有绘图。 |
ylab | y轴内容 |
xlab | X轴内容 |
title | 绘图的主要标题。如果没有指定,将是"Volcano plot (group1 vs group2)" |
legend | 图例标题 |
color | 用于图形中颜色的向量 |
label | 要在图中使用的标签的向量。例如:c(“Not Significant”,“Hypermethylated in group1”, “Hypomethylated in group1”)) |
xlim | xx轴的限制范围 |
ylim | y轴的限制范围 |
p.cut | P-values阈值。默认是: 0.01 |
probe.names | 是否是探针名称 |
diffmean.cut | 平均差异(diffmean)阈值。默认是: 0.2 |
paired | Wilcoxon paired parameter。默认是: FALSE |
adj.method | 调整P-value计算的方法 |
overwrite | Overwrite the pvalues and diffmean values if already in the object for both groups? Default: FALSE |
cores | Number of cores to be used in the non-parametric test Default = groupCol.group1.group2.rda |
save | Save object with results? Default: TRUE |
data <- TCGAanalyze_DMR(data, groupCol = "methylation_subtype",
group1 = "CIMP.H",
group2="CIMP.L",
p.cut = 10^-5,
diffmean.cut = 0.25,
legend = "State",
plot.filename = "coad_CIMPHvsCIMPL_metvolcano.png")
输出将是如下所示的图。绿点是与组1相比,组2中低甲基化(hypomethylated)的探针,而红色组是组2中与组1相比的高甲基化(hypermethylated)探针:
此外,TCGAanalyze_DMR
函数将绘图保存为pdf,并返回相同的SummarizedExperiment
对象,该对象包含了:1)p-value;2)p-value adjusted;3)diffmean;4)the group it belongs in the graph (non significant, hypomethylated, hypermethylated) in the rowRanges。collumns将是(其中group1和group2是组的名称):
可以使用rowRanges
(rowRanges(data)
)查看/访问这些数值。
**观察:**使用相同的参数再次调用相同的函数将仅绘制结果,因为它已经计算过。如果您想要让他们重新计算,请设置
overwrite
为TRUE
,或删除计算collumns。
TCGAvisualize
您可以轻松地查看以下某些功能的结果:
TCGAvisualize_Heatmap
为了更好地了解基因在样本中的聚类信息,我们通常使用热图。而TCGAvisualize_Heatmap
可以绘制热图,并添加每一个样本bar来代表很多特征。此函数是ComplexHeatmap包的包装器,
这个函数的参数是:
bcr_patient_barcode
或者代表patients barcodes
的病人或者IDFALSE
FALSE
FALSE
FALSE
z-score
来制作热图?1)如果我们想显示基因之间的差异,最好通过样本进行Z-score(对于每个样本的均值为:0;标准差为:1);2)如果我们想要显示样本之间的差异,最好通过基因进行Z-score(对于每个基因的均值为:0;标准差为:1)。候选的值有:“row”、“col“、”none"(Default)有关更多信息,请查看**case study #2
**
结果如下所示:
TCGAvisualize_Volcano
为DNA甲基化或基因表达绘制火山图
这个函数的参数是:
Argument | Description |
---|---|
x | x轴数据 |
y | y轴数据 |
filename | 绘制的火山图文件名称,默认是: volcano.pdf、volcano.svg、volcano.png |
ylab | y轴的标题 |
xlab | x轴的标题 |
title | 火山图标题。如果没有指定,则是“Volcano plot (group1 vs group2)” |
legend | 火山图图例标题 |
label | 要在火山图中使用的标签的向量。例如: c(“Not Significant”,“Hypermethylated in group1”, “Hypomethylated in group1”))#’ |
xlim | x轴的限制范围 |
ylim | y轴的限制范围 |
color | 在火山图中使用的颜色的向量 |
names | Names to be ploted if significant. Should be the same size of x and y |
names.fill | Names should be filled in a color box? Default: TRUE |
show.names | What names will be showd? Possibilities: “both”, “significant”, “highlighted” |
x.cut | x-axis threshold. Default: 0.0 If you give only one number (e.g. 0.2) the cut-offs will be -0.2 and 0.2. Or you can give diffenrent cutt-ofs as a vector (e.g. c(-0.3,0.4)) |
y.cut | p-values threshold. |
height | 图片高度 |
width | 图片宽度 |
highlight | List of genes/probes to be highlighted. It should be in the names argument. |
highlight.color | Color of the points highlighted |
names.size | 火山图中文字的大小 |
dpi | 图片的分辨率:dpi |
有关更多信息,请查看**case study #3
**。
TCGAvisualize_PCA
为了更好地理解我们的基因,我们可以进行PCA来减少基因组的维数。TCGAvisualize_PCA
将为不同的组绘制PCA。
这个函数的参数是:
# normalization of genes
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
method = "quantile",
qnt.cut = 0.25)
# selection of normal samples "NT"
group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
# selection of normal samples "TP"
group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
# Principal Component Analysis plot for ntop selected DEGs
pca <- TCGAvisualize_PCA(dataFilt,dataDEGsFiltLevel, ntopgenes = 200, group1, group2)
结果如下所示:
TCGAvisualize_meanMethylation
使用该数据并计算每组的平均DNA甲基化,可以使用TCGAvisualize_meanMethylation
函数创建平均DNA甲基化箱图:
query <- GDCquery(project = "TCGA-GBM",
data.category = "DNA methylation",
platform = "Illumina Human Methylation 27",
legacy = TRUE,
barcode = c("TCGA-02-0058-01A-01D-0186-05", "TCGA-12-1597-01B-01D-0915-05",
"TCGA-12-0829-01A-01D-0392-05", "TCGA-06-0155-01B-01D-0521-05",
"TCGA-02-0099-01A-01D-0199-05", "TCGA-19-4068-01A-01D-1228-05",
"TCGA-19-1788-01A-01D-0595-05", "TCGA-16-0848-01A-01D-0392-05"))
GDCdownload(query, method = "api")
data <- GDCprepare(query)
# "shortLetterCode" is a column in the SummarizedExperiment::colData(data) matrix
TCGAvisualize_meanMethylation(data, groupCol = "shortLetterCode",filename = NULL)
## groups Mean Median Max Min
## 1 NT 0.4938523 0.4956219 0.5247570 0.4687553
## 2 TP 0.4954753 0.4996517 0.5097262 0.4792454
## 3 TR 0.4967371 0.5010733 0.5128546 0.4801191
## NT TP TR
## NT NA 0.8567414 0.7446906
## TP 0.8567414 NA 0.8557347
## TR 0.7446906 0.8557347 NA
TCGAvisualize_meanMethylation
的参数是:
Argument | Description |
---|---|
data | SummarizedExperiment object obtained from GDCPrepare |
groupCol | Columns in colData(data) that defines the groups. If no columns defined a columns called “Patients” will be used |
subgroupCol | Columns in colData(data) that defines the subgroups. |
shapes | Shape vector of the subgroups. It must have the size of the levels of the subgroups. Example: shapes = c(21,23) if for two levels |
print.pvalue | Print p-value for two groups |
plot.jitter | Plot jitter? Default TRUE |
jitter.size | Plot jitter size? Default 3 |
filename | The name of the pdf that will be saved |
ylab | y axis text in the plot |
xlab | x axis text in the plot |
title | main title in the plot |
labels | Labels of the groups |
group.legend | Name of the group legend. DEFAULT: groupCol |
subgroup.legend | Name of the subgroup legend. DEFAULT: subgroupCol |
color | vector of colors to be used in graph |
y.limits | Change lower/upper y-axis limit |
sort | Sort boxplot by mean or median. Possible values: mean.asc, mean.desc, median.asc, meadian.desc |
order | Order of the boxplots |
legend.position | Legend position (“top”, “right”,“left”,“bottom”) |
legend.title.position | Legend title position (“top”, “right”,“left”,“bottom”) |
legend.ncols | Number of columns of the legend |
add.axis.x.text | Add text to x-axis? Default: FALSE |
width | Plot width default:10 |
height | Plot height default:10 |
dpi | Pdf dpi default:600 |
axis.text.x.angle | Angle of text in the x axis |
其他使用这些参数的案例:
# setting y limits: lower 0, upper 1
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode",
filename = NULL, y.limits = c(0,1))
## groups Mean Median Max Min
## 1 NT 0.4938523 0.4956219 0.5247570 0.4687553
## 2 TP 0.4954753 0.4996517 0.5097262 0.4792454
## 3 TR 0.4967371 0.5010733 0.5128546 0.4801191
## NT TP TR
## NT NA 0.8567414 0.7446906
## TP 0.8567414 NA 0.8557347
## TR 0.7446906 0.8557347 NA
# setting y limits: lower 0
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode",
filename = NULL, y.limits = 0)
## groups Mean Median Max Min
## 1 NT 0.4938523 0.4956219 0.5247570 0.4687553
## 2 TP 0.4954753 0.4996517 0.5097262 0.4792454
## 3 TR 0.4967371 0.5010733 0.5128546 0.4801191
## NT TP TR
## NT NA 0.8567414 0.7446906
## TP 0.8567414 NA 0.8557347
## TR 0.7446906 0.8557347 NA
# Changing shapes of jitters to show subgroups
TCGAvisualize_meanMethylation(data,
groupCol = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
subgroupCol ="vital_status", filename = NULL)
## groups Mean Median Max Min
## 1 group1 0.4959208 0.5010733 0.5075258 0.4792454
## 2 group2 0.5012787 0.5047264 0.5247570 0.4730471
## 3 group3 0.4888652 0.4890000 0.5044062 0.4687553
## group1 group2 group3
## group1 NA 0.5258007 0.2715927
## group2 0.5258007 NA 0.1658830
## group3 0.2715927 0.1658830 NA
# Sorting bars by descending mean methylation
TCGAvisualize_meanMethylation(data,
groupCol = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
sort="mean.desc",
filename=NULL)
## groups Mean Median Max Min
## 1 group1 0.4959208 0.5010733 0.5075258 0.4792454
## 2 group2 0.5012787 0.5047264 0.5247570 0.4730471
## 3 group3 0.4888652 0.4890000 0.5044062 0.4687553
## group1 group2 group3
## group1 NA 0.5258007 0.2715927
## group2 0.5258007 NA 0.1658830
## group3 0.2715927 0.1658830 NA
# Sorting bars by asc mean methylation
TCGAvisualize_meanMethylation(data,
groupCol = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
sort = "mean.asc",
filename=NULL)
## groups Mean Median Max Min
## 1 group1 0.4959208 0.5010733 0.5075258 0.4792454
## 2 group2 0.5012787 0.5047264 0.5247570 0.4730471
## 3 group3 0.4888652 0.4890000 0.5044062 0.4687553
## group1 group2 group3
## group1 NA 0.5258007 0.2715927
## group2 0.5258007 NA 0.1658830
## group3 0.2715927 0.1658830 NA
TCGAvisualize_meanMethylation(data,
groupCol = "vital_status",
sort = "mean.asc",
filename=NULL)
## groups Mean Median Max Min
## 1 ALIVE 0.4954753 0.4996517 0.5097262 0.4792454
## 2 DEAD 0.4952947 0.4983476 0.5247570 0.4687553
## ALIVE DEAD
## ALIVE NA 0.9780798
## DEAD 0.9780798 NA
TCGAvisualize_starburst
:Starburst绘图
首次在2010年被提出(Noushmehr et al. 2010),它能够结合两个火山图的信息,并应用于DNA甲基化和基因表达的研究。为了重现这个图,我们将使用TCGAvisualize_starburst
函数。
该函数可以创建Starburst图
,用于比较DNA甲基化和基因表达。对于每个基因,DNA甲基化的log10 (FDR-corrected P value)
绘制在x轴上,基因表达绘制在y轴上。黑色虚线表示FDR-adjusted P value
为0.01。
这个函数的参数是:
Argument | Description |
---|---|
exp | Object obtained by DEArnaSEQ function |
group1 | The name of the group 1 Obs: Column p.value.adj.group1.group2 should exist |
group2 | The name of the group 2. Obs: Column p.value.adj.group1.group2 should exist |
exp.p.cut | expression p value cut-off |
met.p.cut | methylation p value cut-off |
diffmean.cut | If set, the probes with diffmean higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted. |
logFC.cut | If set, the probes with expression fold change higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted. |
met.platform | DNA methylation platform (“27K”,“450K” or “EPIC”) |
genome | Genome of reference (“hg38” or “hg19”) used to identify nearest probes TSS |
names | Add the names of the significant genes? Default: FALSE |
names.fill | Names should be filled in a color box? Default: TRUE |
filename | The filename of the file (it can be pdf, svg, png, etc) |
return.plot | If true only plot object will be returned (pdf will not be created) |
ylab | y axis text |
xlab | x axis text |
title | main title |
legend | legend title |
color | vector of colors to be used in graph |
label | vector of labels to be used in graph |
xlim | x limits to cut image |
ylim | y limits to cut image |
height | Figure height |
width | Figure width |
dpi | Figure dpi |
starburst <- TCGAvisualize_starburst(coad.SummarizeExperiment,
different.experssion.analysis.data,
group1 = "CIMP.H",
group2 = "CIMP.L",
met.platform = "450K",
genome = "hg19",
met.p.cut = 10^-5,
exp.p.cut = 10^-5,
names = TRUE)
结果,该函数将绘制下图并返回具有Gene_symbol的矩阵,其状态与基因表达(up regulated/down regulated)和DNA甲基化(Hyper/Hypo methylated)有关。case study 3
显示了创建此图的完整过程。
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] png_0.1-7 DT_0.5
## [3] dplyr_0.8.0.1 SummarizedExperiment_1.12.0
## [5] DelayedArray_0.8.0 BiocParallel_1.16.6
## [7] matrixStats_0.54.0 Biobase_2.42.0
## [9] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
## [11] IRanges_2.16.0 S4Vectors_0.20.1
## [13] BiocGenerics_0.28.0 TCGAbiolinks_2.10.5
## [15] testthat_2.0.1
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.3 circlize_0.4.5
## [3] AnnotationHub_2.14.5 aroma.light_3.12.0
## [5] plyr_1.8.4 selectr_0.4-1
## [7] ConsensusClusterPlus_1.46.0 lazyeval_0.2.2
## [9] splines_3.5.3 usethis_1.4.0
## [11] ggplot2_3.1.0 sva_3.30.1
## [13] digest_0.6.18 foreach_1.4.4
## [15] htmltools_0.3.6 magrittr_1.5
## [17] memoise_1.1.0 cluster_2.0.7-1
## [19] doParallel_1.0.14 limma_3.38.3
## [21] remotes_2.0.2 ComplexHeatmap_1.20.0
## [23] Biostrings_2.50.2 readr_1.3.1
## [25] annotate_1.60.1 sesameData_1.0.0
## [27] R.utils_2.8.0 prettyunits_1.0.2
## [29] colorspace_1.4-1 ggrepel_0.8.0
## [31] blob_1.1.1 rvest_0.3.2
## [33] xfun_0.5 callr_3.2.0
## [35] crayon_1.3.4 RCurl_1.95-4.12
## [37] jsonlite_1.6 genefilter_1.64.0
## [39] survival_2.43-3 zoo_1.8-4
## [41] iterators_1.0.10 glue_1.3.1
## [43] survminer_0.4.3 gtable_0.2.0
## [45] sesame_1.0.0 zlibbioc_1.28.0
## [47] XVector_0.22.0 GetoptLong_0.1.7
## [49] wheatmap_0.1.0 pkgbuild_1.0.2
## [51] shape_1.4.4 scales_1.0.0
## [53] DESeq_1.34.1 DBI_1.0.0
## [55] edgeR_3.24.3 ggthemes_4.1.0
## [57] Rcpp_1.0.1 xtable_1.8-3
## [59] progress_1.2.0 cmprsk_2.2-7
## [61] matlab_1.0.2 bit_1.1-14
## [63] preprocessCore_1.44.0 km.ci_0.5-2
## [65] htmlwidgets_1.3 httr_1.4.0
## [67] RColorBrewer_1.1-2 pkgconfig_2.0.2
## [69] XML_3.98-1.19 R.methodsS3_1.7.1
## [71] DNAcopy_1.56.0 locfit_1.5-9.1
## [73] labeling_0.3 later_0.8.0
## [75] tidyselect_0.2.5 rlang_0.3.1
## [77] AnnotationDbi_1.44.0 munsell_0.5.0
## [79] tools_3.5.3 downloader_0.4
## [81] cli_1.1.0 ExperimentHub_1.8.0
## [83] generics_0.0.2 RSQLite_2.1.1
## [85] devtools_2.0.1 broom_0.5.1
## [87] evaluate_0.13 stringr_1.4.0
## [89] yaml_2.2.0 processx_3.3.0
## [91] knitr_1.22 bit64_0.9-7
## [93] fs_1.2.7 survMisc_0.5.5
## [95] randomForest_4.6-14 purrr_0.3.2
## [97] EDASeq_2.16.3 nlme_3.1-137
## [99] mime_0.6 R.oo_1.22.0
## [101] xml2_1.2.0 biomaRt_2.38.0
## [103] compiler_3.5.3 rstudioapi_0.10
## [105] curl_3.3 interactiveDisplayBase_1.20.0
## [107] tibble_2.1.1 geneplotter_1.60.0
## [109] stringi_1.4.3 highr_0.7
## [111] ps_1.3.0 GenomicFeatures_1.34.6
## [113] desc_1.2.0 lattice_0.20-38
## [115] Matrix_1.2-16 KMsurv_0.1-5
## [117] pillar_1.3.1 BiocManager_1.30.4
## [119] GlobalOptions_0.1.0 data.table_1.12.0
## [121] bitops_1.0-6 httpuv_1.5.0
## [123] rtracklayer_1.42.2 R6_2.4.0
## [125] latticeExtra_0.6-28 hwriter_1.3.2
## [127] promises_1.0.1 ShortRead_1.40.0
## [129] gridExtra_2.3 codetools_0.2-16
## [131] sessioninfo_1.1.1 assertthat_0.2.0
## [133] pkgload_1.0.2 rprojroot_1.3-2
## [135] rjson_0.2.20 withr_2.1.2
## [137] GenomicAlignments_1.18.1 Rsamtools_1.34.1
## [139] GenomeInfoDbData_1.2.0 mgcv_1.8-27
## [141] hms_0.4.2 tidyr_0.8.3
## [143] rmarkdown_1.12 ggpubr_0.2
## [145] shiny_1.2.0
本文由 石九流 创作,如果您觉得本文不错,请随意赞赏
采用 知识共享署名4.0 国际许可协议进行许可
本站文章除注明转载/出处外,均为本站原创或翻译,转载前请务必署名
原文链接:https://blog.computsystmed.com/archives/translation-tcgabiolinks-analyzing-and-visualizing-tcga-data
最后更新:2019-05-25 17:28:54
Update your browser to view this website correctly. Update my browser now