作者:Mohamed Mounir
发表时间:2019年3月20日
新功能:TCGAbiolinks版本
TCGAbiolinks目前还提供了查询和下载TARGET
数据库的数据的函数,并获得用于下游分析的SummarizedExperiments
对象。此外,TCGAanalyze_DEA准备进一步的的设计,以允许更多用户定制和支持limma分析流程。
TARGET(Therapeutically Applicable Research To Generate Effective Treatments)采用综合的基因组方法( comprehensive genomic approach)来确定驱动儿童癌症的分子变化。研究人员形成一个协作网络组成,以促进分子靶标的发现,并将这些发现转化为临床应用。TARGET目前由NCI’s Office of Cancer Genomics and Cancer Therapy Evaluation Program
维护。
下面是通过tcgabiolinks
包从TARGET数据库查询、下载和准备数据的示例:
## 目前可用的TARGET项目:
projects <- getGDCprojects()$project_id
as.data.frame(grep("TARGET", projects,value = TRUE))
## grep("TARGET", projects, value = TRUE)
## 1 TARGET-AML
## 2 TARGET-NBL
## 3 TARGET-OS
## 4 TARGET-RT
## 5 TARGET-CCSK
## 6 TARGET-WT
## 7 TARGET-ALL-P3
#Downloading and prepare TARGET CASE
query_Target <- GDCquery(project = "TARGET-AML",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown_Target <- getResults(query_Target, cols = c("cases"))
dataSmTB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
typesample = "TB")
dataSmNB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
typesample = "TBM")
dataSmTB_short_Target <- dataSmTB_Target[1:10]
dataSmNB_short_Target <- dataSmNB_Target[1:10]
queryDown_Target <- GDCquery(project = "TARGET-AML",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTB_short_Target, dataSmNB_short_Target))
GDCdownload(queryDown_Target)
### SummarizedExperiment = TRUE
data <- GDCprepare(queryDown_Target)
dataPrep_Target <- TCGAanalyze_Preprocessing(object = data,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm_Target <- TCGAanalyze_Normalization(tabDF = dataPrep_Target,
geneInfo = geneInfoHT,
method = "gcContent")
boxplot(dataPrep_Target, outline = FALSE)
boxplot(dataNorm_Target, outline = FALSE)
dataFilt_Target <- TCGAanalyze_Filtering(tabDF = dataNorm_Target,
method = "quantile",
qnt.cut = 0.25)
CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")
query <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown <- getResults(query,cols = c("cases"))
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
typesample = "TP")
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]
queryDown <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTP_short, dataSmNT_short))
GDCdownload(query = queryDown)
dataPrep1 <- GDCprepare(query = queryDown,
save = TRUE,
save.filename = "TCGA_BRCA_HTSeq_Countds.rda")
dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
geneInfo = geneInfoHT,
method = "gcContent")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
geneInfo = geneInfoHT,
method = "geneLength")
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
method = "quantile",
qnt.cut = 0.25)
UseRaw_afterFilter
:过滤后,保留原始计数(raw counts)应用过滤和规范化(filtering and normalization)步骤后,此函数会将原始计数(raw counts)重新分配给过滤后的数据框genesXsamples
。此步骤对于仅仅需要使用原始计数(raw counts)的方法非常有用,例如edgeR
分析流程。
参数为:
GDCprepare()
返回的DataPrep
对象### dataframe will have genes from dataFilt but raw counts from dataPrep
dataPrep_raw <- UseRaw_afterFilter(dataPrep, dataFilt)
TCGA_MolecularSubtype
:查询癌症数据的子类型:此函数将TCGA Barcodes
的向量作为输入,并返回下面两个列表:
TCGA Barcodes
的向量如果需要更多数据,请检查包数据中可用的
TbSubtypesCol_merged
数据框。
##use previously fetched BRCA data:
Pam50.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt))
TCGAanalyze_DEA()
:差异表达分析:TCGAbiolinks正在为TCGAanalyze_DEA()
函数提供一种新设计,通过提供差异分析公式,以在两组或者多组中执行差异表达分析(DEA)。Limma分析流程也被贴加到EdgeR中。
该函数采用以下参数:
Cond1type
的样本Cond2type
的样本mat1
中样本的类标签的字符串(例如,对照组)mat2
中样本的类标签的字符串(例如,案例组)edgeR
的方法:‘glmLRT’或者‘exactTest’ 。1)glmLRT
将负二项式广义对数线性模型(negative binomial generalized log-linear model
)拟合到每个基因的读数计数(read counts
);2)exactTest
计算 genewise exact tests for differences in the means between two groups of negative-binomially distributed counts.logFC
过滤DEGs。GDCquery_clinic()
返回的数据框,用作提取进行批量修正(batch correction (blocking))的“Year”数据提取的输入项log cpm
转换。设置为TRUE则执行log transformation
voom transform data
。设置为TRUE则执行转换limma-trend pipeline
。设置为TRUE则执行limma-trend
在该案例中,将在通过GDC门户网站的tcgabiolinks
下载数据,并通过控制组和案例组执行差异表达式分析。其他分析流程的选项可以是是edgeR
。
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmTP_short],
mat2 = dataFilt[,dataSmNT_short],
pipeline="limma",
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT", ClinicalDF = data.frame())
contrast.formula
参数 - 自定义对比度(contrast)contrast.formula
参数是一个字符串公式,用于limma pipeline
差异表达分析的输入。公式中对比元素(elements of contrast)应该是CondTypes
向量中的唯一元素之一。下面的示例显示了contrast.formula
参数应具有的语法。我们在这里对比了一些肺癌样本的Pam50分子亚型(Pam50 molecular subtypes)。
注意:对比公式的语法应该采用这种形式:formula <- “contrast1=type1-type2, contrast2=type2-type3, contrast3=type1-(type2+type3)/2,…”
CondTypes
向量中的元素最好是没有空格字符。
### ##download and prepare lung data through GDC### ##
query.lung <- GDCquery(project = "TCGA-LUSC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown.lusc <- getResults(query.lung,cols=c("cases"))
dataSmTP.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
typesample = "TP")
dataSmNT.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
typesample = "NT")
dataSmTP_short.lusc <- dataSmTP.lusc[1:10]
dataSmNT_short.lusc <- dataSmNT.lusc[1:10]
length(dataSmTP.lusc)
queryDown.lung <- GDCquery(project = "TCGA-LUSC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTP.lusc, dataSmNT.lusc))
GDCdownload(query = queryDown.lung)
dataPrep1.tcga <- GDCprepare(query = queryDown.lung,
save = TRUE,
save.filename = "TCGA_LUSC_HTSeq_Countds.rda")
dataPrep.tcga <- TCGAanalyze_Preprocessing(object = dataPrep1.tcga,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataPrep.tcga,
geneInfo = geneInfoHT,
method = "gcContent")
dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataNorm.tcga,
geneInfo = geneInfoHT,
method = "geneLength")
dataFilt.tcga <- TCGAanalyze_Filtering(tabDF = dataNorm.tcga,
method = "quantile",
qnt.cut = 0.25)
### #Filtering data so all samples have a pam50 subtype for LUSC
diff<-setdiff(colnames(dataFilt.tcga),
TCGA_MolecularSubtype(colnames(dataFilt.tcga))$filtered)
TCGA_MolecularSubtype(diff)$subtypes$barcodes
dataFilt.tcga.lusc<-dataFilt.tcga[,diff]
LUSC.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt.tcga.lusc))$subtypes$subtype
### Differential expression analysis after correcting for "Plate" factor.
DEG.lusc <- TCGAanalyze_DEA(MAT=dataFilt.tcga.lusc,
pipeline="limma",
batch.factors = c("Plate"),
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT", ClinicalDF = data.frame(),
Condtypes = LUSC.subtypes,
contrast.formula = "LUSC.basalvsLUSC.classical=LUSC.basal - LUSC.classical")
TCGAbatch_correction
:执行批量修正和lima-voom
转换此函数从sva
包调用ComBat
函数来处理批量修正,显示一些用于探索性数据分析(exploratory data analysis purposes)的图。更具体地说,它显示了一个图,其中黑色作为经验批效应密度的核估计(kernel estimate of the empirical batch effect density),红色作为参数(parametric)。
ComBat
进行调整。选项有:“Plate”、“TSS”、“Year”、“Portion”、“Center”、“Patients”。GDCquery_clinic()
返回的数据框,用于提取年份数据除了通过ComBat图进行数据探索之外,如果没有提供批量因子,则不会执行任何操作。运行以下代码查看示例:
batch.corrected <- TCGAbatch_Correction(tabDF = dataFilt.tcga, batch.factor = "Plate")
batch.corrected.adjusted <- TCGAbatch_Correction(tabDF = dataFilt.tcga,
batch.factor = "Plate",
adjustment=c("TSS"))
TCGAbatch_correction
:未发布的数据集处理此函数从sva
包调用ComBat
函数来处理批量修正,显示一些用于探索性数据分析(exploratory data analysis purposes)的图。更具体地说,它显示了一个图,其中黑色作为经验批效应密度的核估计(kernel estimate of the empirical batch effect density),红色作为参数(parametric)。
除了通过ComBat图进行数据探索之外,如果没有提供批量因子,则不会执行任何操作。运行以下代码查看示例:
以下示例显示了如何考虑两种癌症类型LUAD和LUSC以及正常和肿瘤样本。例如,用户最后可以用未发布的数据替换LUSC。
#dataFilt is a gene expression matrix from pancancer33 RNASeq data that can be generated following our previous described workflow
sampleTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "TP")
sampleNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "NT")
dataFiltTP <- dataFilt[,sampleTP]
dataFiltNT <- dataFilt[,sampleNT]
dataSubt_LUAD <-TCGAquery_subtype(tumor = "LUAD")
sampleStageI_LUAD <- dataSubt_LUAD[dataSubt_LUAD$Tumor.stage %in% c("Stage I","Stage IA","Stage IB"),]
dataFilt.LUAD.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUAD$patient]
dataFilt.LUAD.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUAD$patient]
# here for example the end-user can replace LUSC with unpublished data in a dataframe containing counts.
# and adding the information for LUSC.stageI.TP or LUSC.stageI.NT with unpublished data
dataSubt_LUSC <-TCGAquery_subtype(tumor = "LUSC")
sampleStageI_LUSC <- dataSubt_LUSC[dataSubt_LUSC$T.stage %in% c("T1","T1a","T1b"),]
dataFilt.LUSC.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUSC$patient]
dataFilt.LUSC.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUSC$patient]
countsTable <-cbind(dataFilt.LUAD.stageI.TP,dataFilt.LUAD.stageI.NT,
dataFilt.LUSC.stageI.TP, dataFilt.LUSC.stageI.NT)
AnnotationCounts <- matrix(0,ncol(countsTable),2)
colnames(AnnotationCounts) <- c("Samples","Batch")
rownames(AnnotationCounts) <- colnames(countsTable)
AnnotationCounts <- as.data.frame(AnnotationCounts)
AnnotationCounts$Samples <- colnames(countsTable)
AnnotationCounts[colnames(dataFilt.LUAD.stageI.TP),"Batch"] <- "LUAD.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUAD.stageI.NT),"Batch"] <- "LUAD.stageI.NT"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.TP),"Batch"] <- "LUSC.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.NT),"Batch"] <- "LUSC.stageI.NT"
countsCorrected <- TCGAbatch_Correction(tabDF =countsTable,
UnpublishedData = TRUE,
AnnotationDF = AnnotationCounts)
TCGAtumor_purity
:根据肿瘤纯度(tumor purity)过滤TCGA样品该函数将TCGA条形码作为输入,并根据用户指定的值对其进行过滤。所有的默认值都是0。参数是使用从TCGA导入的不同度量的估计值,它们如下(参见*[1]*):
该函数返回一个列表(list)对象,其中:1)pure_barcodes
属性是纯样本的向量;2)filtered
属性是过滤样本的向量,这些样本没有纯度信息。
有关所有TCGA测量样本的肿瘤纯度信息,请查看包装中可用的Tumor.purity
数据框
[1] D. Aran, M. Sirota, and A. J. Butte. Systematic pan-cancer analysis of tumour purity. Nat Commun, 6:8971, 2015.
Purity.BRCA <- TCGAtumor_purity(colnames(dataPrep.tcga), 0, 0, 0, 0, 0.7)
Recount2项目已将基因计数数据提供给科学界下载。Recount2是一个在线资源,由RNA-seq gene 和 exon counts组成,以及覆盖2041个不同研究的bigWig
文件。它是ReCount项目的第二代。如在Recount2论文中所述,使用Rail-RNA
处理原始测序数据。此功能的目的是在单个管道下处理数据,以避免任何技术变化影响下游整合分析的。返回的对象将是一个列表,其中包含名为project_tissue
的元素和每个样本的基因计数的SummarizedExperiment (SE)
。这对于研究无批次数据或增加健康/正常样品库特别有用。
参数是:
##Brain data through Recount2
brain.rec <- TCGAquery_recount2(project = "gtex", tissue = "brain")
本文由 石九流 创作,如果您觉得本文不错,请随意赞赏
采用 知识共享署名4.0 国际许可协议进行许可
本站文章除注明转载/出处外,均为本站原创或翻译,转载前请务必署名
原文链接:https://blog.computsystmed.com/archives/translation-tcgabiolinks-extension
最后更新:2019-05-25 17:32:43
Update your browser to view this website correctly. Update my browser now