Methylation software




















The preseq package predicts the number of distinct reads, and how many more can be expected from additional sequencing, based on an initial sequencing experiment. The estimates can then be used to assess the utility of further sequencing, optimize sequencing depth, or screen multiple libraries to avoid low-complexity samples. It can also estimate the expected number of species as a function of the number of captures, known in ecology as the species discovery curve or species accumulation curve.

For people who prefer to work in the R statistical computing environment, we provide an R package called preseqR, which makes the functionality of preseq available in R.

It can work with or without a control sample. It can be used to find regions with differential histone modification patterns, comparing either two cell types or two kinds of histone modification.

RMAP aims to accurately map reads from next-generation sequencing technologies. RMAP can map reads with or without error probability information (quality scores) and supports mapping of paired-end reads and bisulfite-treated reads. There are no limitations on read widths or number of mismatches. RMAP can now map more than 8 million reads in an hour at full sensitivity to 2 mismatches.

Ribotricer is a method for detecting actively translating ORFs by directly leveraging the three-nucleotide periodicity of Ribo-seq data.

RnBeads uses differences in copy number of genomic regions located on the sex chromosomes to predict the sex of sample donors, in order to infer missing annotation information or to detect sample mix-ups. For microarray data, sex prediction is based on the comparison of the average signal intensities for the sex chromosomes with those on the autosomes, calculating a predicted sex probability by logistic regression.

For bisulfite sequencing data, RnBeads compares the sequencing coverage for the sex chromosomes with those for the autosomes, followed by logistic regression trained on datasets with known sex information. In addition to analysis based on individual CpGs, RnBeads aggregates and compares DNA methylation levels across genomic regions of interest, which can enhance statistical power and interpretability [ 17 ]. This collection includes region sets defined based on consensus epigenome profiles such as putative regulatory regions in the Ensembl Regulatory Build [ 33 ] and regions associated with DNA methylation variability [ 50 ].

The presence of missing values in DNA methylation datasets constitutes an important analytical challenge, for which RnBeads implements several alternative solutions: sample-wise means and medians, CpG-wise means and medians, random replacement from other samples in the dataset, and k-nearest neighbor (KNN) imputation.

KNN imputation tends to provide adequate estimates of missing values when enough nearby data points are available. It has been used extensively for gene expression microarray data [ 51 ], and it has also been applied to DNA methylation data [ 8 ]. For those cases in which the model assumptions of KNN imputation are not met due to disproportionately high numbers of missing values (which is not uncommon for bisulfite sequencing datasets), we implemented the mean and median imputation approaches.
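The mean and median strategies described above can be illustrated with a minimal Python sketch. This is not the RnBeads implementation (RnBeads is an R package); the function names `impute` and `_stat_or_none`, the CpG-by-sample list-of-lists layout, and the use of `None` for missing values are all assumptions made for illustration.

```python
from statistics import mean, median


def _stat_or_none(values, stat):
    """Apply stat to the observed (non-None) values, or return None if there are none."""
    observed = [v for v in values if v is not None]
    return stat(observed) if observed else None


def impute(matrix, axis="cpg", stat=mean):
    """Fill None entries of a CpG-by-sample matrix of methylation levels.

    axis="cpg" fills with the statistic of the same row (CpG-wise);
    axis="sample" fills with the statistic of the same column (sample-wise).
    Hypothetical helper, not part of any published package.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if axis == "cpg":
        row_fill = [_stat_or_none(row, stat) for row in matrix]
        fill = lambda i, j: row_fill[i]
    elif axis == "sample":
        columns = [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]
        col_fill = [_stat_or_none(col, stat) for col in columns]
        fill = lambda i, j: col_fill[j]
    else:
        raise ValueError("axis must be 'cpg' or 'sample'")
    return [
        [matrix[i][j] if matrix[i][j] is not None else fill(i, j) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Two CpGs (rows) measured in three samples (columns), with two missing values.
beta = [[0.25, None, 0.75],
        [1.0, 0.5, None]]
cpg_wise = impute(beta, axis="cpg", stat=mean)
sample_wise = impute(beta, axis="sample", stat=median)
```

A row or column with no observed values is left as `None` in this sketch, which is the situation where a different strategy would be needed.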

This metric quantifies the deviation of the signals of autosomal single nucleotide polymorphism probes on the microarray from the expected values of 0 and 1 (homozygosity) as well as 0. Such deviations can indicate technical problems of the microarray-based analysis, contamination with DNA samples from other individuals, or deviations from the diploid case e. RnBeads implements reference-based and reference-free methods for estimating intra-sample heterogeneity [ 21 , 30 , 52 , 53 ].

This includes reference-based estimation of immune cell content [ 30 ] based on the DNA methylation profiles of purified blood cell populations [ 29 ] as well as the LUMP algorithm [ 22 ] for estimating immune cell invasion in bulk tumor samples based on a preselected set of CpGs that are exclusively unmethylated in blood cells.

While this algorithm was developed specifically for the Infinium k assay, its implementation in RnBeads supports both microarray-based and bisulfite sequencing-based assays. CpGs and genomic regions can differ between cases and controls not only in terms of their average DNA methylation levels, but also in terms of the variability of DNA methylation levels; for example, epigenetic variability may be higher or lower in tumors than in healthy tissue.

DiffVar uses an empirical Bayes framework, while iEVORA is based on the Bartlett test, which tests for differences in variance (heteroscedasticity) across samples. Striking the right balance between reporting too few and too many differentially variable cytosines (DVCs) and differentially variable regions (DVRs) represents an unsolved statistical challenge, especially when the data do not follow a normal distribution.

Therefore, we implemented a strategy analogous to the identification of differentially methylated cytosines (DMCs) and differentially methylated regions (DMRs) between sample groups in RnBeads: DVCs and DVRs are ranked by the worst (highest) rank of the following criteria: (i) the adjusted p value of the statistical test (either diffVar or iEVORA), (ii) the difference in variance between the groups, and (iii) the log-ratio of the two group-wise variances.
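The worst-rank combination described above can be sketched in a few lines of Python. This is an illustrative sketch, not RnBeads code: the function names are hypothetical, the use of absolute values for the variance difference and the log-ratio is an assumption, and ties are broken by input order rather than by any documented rule.

```python
import math


def rank_positions(scores, reverse=False):
    """Return 1-based ranks; rank 1 is the smallest score (or largest if reverse)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def combined_rank(adj_p, var_a, var_b):
    """Worst (highest) rank across the three criteria for each site or region.

    Assumes strictly positive group-wise variances so the log-ratio is defined.
    """
    abs_diff = [abs(a - b) for a, b in zip(var_a, var_b)]
    abs_log_ratio = [abs(math.log2(a / b)) for a, b in zip(var_a, var_b)]
    r_p = rank_positions(adj_p)                      # small p values rank first
    r_d = rank_positions(abs_diff, reverse=True)     # large differences rank first
    r_l = rank_positions(abs_log_ratio, reverse=True)
    return [max(triple) for triple in zip(r_p, r_d, r_l)]


# Three hypothetical CpGs: adjusted p values and group-wise variances.
ranks = combined_rank(
    adj_p=[0.04, 0.5, 0.01],
    var_a=[0.5, 0.25, 0.25],
    var_b=[0.125, 0.25, 0.125],
)
```

Taking the maximum means a site only ranks well if it does well on all three criteria, which is the point of using the worst rank rather than, say, the mean rank.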

RnBeads produces summary plots comparing group-wise variances, p values, and ranks, while also exporting detailed tables of DVCs and DVRs. To investigate the biological processes relevant to observed DNA methylation differences, RnBeads implements region set enrichment analysis using LOLA [ 27 ], in addition to gene set analysis based on Gene Ontology terms.

The LOLA tool compares a set of genomic regions of interest i. By default, RnBeads uses the LOLA Core database as a reference, which includes transcription factor binding sites, tissue-specific enhancer elements, and genome annotations such as CpG islands and repetitive elements.

Plots showing enrichment p values and log-odds ratios visualize the most enriched region sets in the RnBeads report. We have also developed an interface that facilitates the automatic distribution of RnBeads analysis runs across an HPC cluster e.

Finally, to facilitate DNA methylation analysis on small computers including personal laptops, RnBeads provides options that disable the most resource-intensive steps; these configurations are available as pre-defined option profiles for low-, medium-, and high-resource settings.

The benchmarking was performed on a Debian Wheezy machine with 32 cores 1. Three different tool configurations with different depths of analysis were evaluated (Additional file 2: Table S2): (i) data import only, (ii) core modules enabled, and (iii) comprehensive analysis with most features enabled. Furthermore, to complement the performance-oriented benchmarking with a feature-oriented comparison, we conducted a comprehensive survey of popular software tools for DNA methylation analysis in comparison to RnBeads (Additional file 1: Table S1).

We also considered selected DNA methylation analysis tools outside Bioconductor based on the literature review.

Advances in the profiling of DNA modifications: cytosine methylation and beyond. Nat Rev Genet.
Laird PW. Principles and challenges of genome-wide DNA methylation analysis.
Human DNA methylomes at base resolution show widespread epigenomic differences.
Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods.
Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia. Cancer Cell.
Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia. Nat Med.
DNA methylation heterogeneity defines a disease spectrum in Ewing sarcoma.
The DNA methylation landscape of glioblastoma disease progression shows extensive heterogeneity in time and space.
Targeted bisulfite sequencing reveals changes in DNA methylation associated with nuclear reprogramming. Nat Biotechnol.
Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells.
Validation of a DNA methylation microarray for , CpG sites of the human genome enriched in enhancer sequences.
Integrative analysis of reference human epigenomes.
The International Human Epigenome Consortium: a blueprint for scientific collaboration and discovery.
Epigenome-wide association studies for common human diseases.
Recommendations for the design and analysis of epigenome-wide association studies.
Bock C. Analysing and interpreting DNA methylation data.
Strategies for analyzing bisulfite sequencing data. J Biotechnol.
Methods for identifying differentially methylated regions for sequence- and array-based data. Brief Funct Genomics.
Cell-type deconvolution in epigenome-wide association studies: a review and recommendations.
Systematic pan-cancer analysis of tumour purity. Nat Commun.
Horvath S. DNA methylation age of human tissues and cell types. Genome Biol.
Horvath S, Levine AJ.

In the gene body, CG methylation is weakly positively correlated with gene expression in humans, while in Arabidopsis, modest CG methylation is related to higher gene expression [ 9 , 10 ]. Although the global trends of the correlation described above have been reported, variability exists for individual genes, and more recent research has shown that the correlation between promoter methylation and gene expression is not always negative [ 11 , 12 , 13 ].

Dynamic changes in DNA methylation in the genome-wide profile i. For instance, methylation changes play a role in gene regulation during sexual reproduction in both plants and animals [ 15 ].

In plants, DNA methylation can shape the transcriptome of the plant during seed germination and under biotic and abiotic stresses [ 15 , 16 ]. In mammals, alterations of DNA methylation have been shown to be associated with altered gene expression in the development of cancer and cardiovascular diseases [ 17 ]. The relationship between methylation changes and gene expression changes under different biological conditions and at different timepoints is important, but the effects of DNA methylation on gene expression remain unclear and complicated [ 18 ].

Measuring their correlation is therefore important for understanding epigenetic regulatory networks. Whole-genome bisulfite sequencing (WGBS) enables genome-wide analyses of cytosine methylation at single-nucleotide resolution [ 19 ], whereas RNA sequencing (RNA-seq) can quantify gene expression by counting the reads mapped to the transcriptome [ 20 ]. ViewBS can correlate non-CG methylation with gene expression, but users must first process the data to allow correlation analyses.

They do not allow users to provide their own data, and they can only be applied to specific species. Therefore, bioinformatics tools specialized for evaluating the correlation between DNA methylation and gene expression could help facilitate epigenomic research. In this research, we developed MethGET, web-based bioinformatics software for analyzing the correlation between genome-wide DNA methylation and gene expression.

MethGET includes single-methylome analyses for viewing the correlation within a single sample and multiple-methylome analyses for detecting the correlations between DNA methylation changes and gene expression changes between two groups of samples. It also determines DNA methylation in different contexts (CG, CHG, and CHH) and across different genomic regions (gene body, promoter, exon, and intron) to explore the different roles of methylation mechanisms in gene expression.

We demonstrated the capability of MethGET with Japonica rice data, and MethGET revealed a decrease in both CHH methylation and gene expression in most genes in the gene body region as the embryo developed into a regenerated callus, which was not reported in the original paper [ 26 ] and warrants further investigation.

MethGET is Python software that performs various analyses, including single-methylome analyses and multiple-methylome analyses Fig.

In single-methylome analyses, the correlations within a single sample are detected; these analyses include the following: (1) correlation analyses of genome-wide DNA methylation and gene expression (correlation); (2) ordinal association analyses with genes ranked by gene expression level (ordinal association); (3) distribution of DNA methylation by groups of genes with different expression levels (grouping statistics); and (4) average methylation level profiling according to different expression groups around genes (metagene).

In multiple-methylome analyses, two groups of samples (Group A vs. Group B) are compared; these analyses include the following: (1) gene-level associations between DNA methylation changes and gene expression changes (comparison) and (2) visualization of DNA methylation and gene expression data together (heatmap).

Schematic diagram of MethGET. The diagram shows the inputs and outputs of single-methylome analyses and multiple-methylome analyses.

CGmap files, which include the DNA methylation level, read counts, and methylation context of each cytosine, are the output of bisulfite-specific aligners such as BS-Seeker and its variants [ 29 , 30 , 31 ]. Other methylation calling files can be converted to CGmap format by MethGET, including CX report files generated by Bismark, the methylation calls generated by methratio.

Gene expression values represent quantitative measurements of gene expression. The gene body is defined as the region from the transcription start site (TSS) to the transcription end site (TES), and the promoter is defined as the region two kilobases upstream of the gene body. Finally, MethGET averages the methylation levels at different genomic locations for downstream analysis and methylome visualization.

Single-methylome analyses investigate the association between the methylome and transcriptome within a single sample.
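The gene body and promoter definitions above can be made concrete with a short Python sketch. It is not MethGET's code: the function names, the `(position, context, level)` tuples, the strand handling, and the choice of an unweighted mean over cytosines are assumptions made for illustration.

```python
def promoter_interval(start, end, strand, upstream=2000):
    """Return the (first, last) 1-based coordinates of the promoter.

    start and end are the genomic bounds of the gene body (start <= end).
    On the "+" strand the TSS is at start, so the promoter precedes it;
    on the "-" strand the TSS is at end, so the promoter follows it.
    """
    if strand == "+":
        return max(1, start - upstream), start - 1
    return end + 1, end + upstream


def mean_level(sites, first, last, context="CG"):
    """Unweighted mean methylation level of cytosines in [first, last] for one context."""
    levels = [level for pos, ctx, level in sites if ctx == context and first <= pos <= last]
    return sum(levels) / len(levels) if levels else None


# Hypothetical cytosines on one chromosome: (position, context, methylation level).
sites = [(900, "CG", 0.5), (1500, "CG", 1.0), (1600, "CHH", 0.0),
         (2500, "CG", 0.25), (4100, "CG", 0.75)]

gene_start, gene_end = 2000, 4000
body_cg = mean_level(sites, gene_start, gene_end)
promoter_plus = mean_level(sites, *promoter_interval(gene_start, gene_end, "+"))
promoter_minus = mean_level(sites, *promoter_interval(gene_start, gene_end, "-"))
```

A read-count-weighted mean would be an equally plausible choice; the sketch uses the unweighted form only because it is the simplest to check by hand.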

We demonstrate the following single-methylome analyses using the data from human cancer-associated fibroblasts [ 37 ] and Arabidopsis thaliana ecotype Columbia [ 38 ]. Since over-plotting often occurs in the scatterplot, a 2D kernel density plot is also provided to represent the density distribution. Groups of genes can be identified on the basis of deeper coloration; for example, it can be seen in Fig.

Correlation analyses of genome-wide DNA methylation and gene expression (human data).

The correlation coefficient (R) and p-value (P) are provided in the top right corner of the plot. To investigate the methylation pattern associated with relative gene expression, MethGET provides scatterplots with genes ranked by gene expression level from low expression levels to high expression levels.

Additionally, MethGET can generate fitting curves for the scatterplot via the moving average method to smooth out noise and highlight methylation trends. In Fig.

Ordinal association analyses with genes ranked by gene expression level (human data). Scatterplot and fitting curves of DNA methylation and relative gene expression.

To better reveal the complex regulation of methylation, MethGET provides both boxplots and violin plots to visualize the central tendency and dispersion of DNA methylation levels across groups of genes with different expression levels Fig.
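The moving-average fitting curve over expression-ranked genes, mentioned above, can be sketched as follows. This is a plain sliding-window mean written for illustration; the function names and the window handling (no padding at the ends, so the curve is shorter than the input) are assumptions and may differ from what MethGET does.

```python
def moving_average(values, window):
    """Sliding-window mean; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one gene
        averages.append(total / window)
    return averages


def ranked_trend(expression, methylation, window):
    """Order genes from low to high expression, then smooth their methylation levels."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranked_methylation = [methylation[i] for i in order]
    return moving_average(ranked_methylation, window)


# Four hypothetical genes: expression values and gene-body methylation levels.
trend = ranked_trend(
    expression=[5.0, 0.0, 2.0, 9.0],
    methylation=[0.5, 1.0, 0.75, 0.25],
    window=2,
)
```

A larger window gives a smoother curve at the cost of losing more points at the ends of the ranking.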

Genes are grouped into non-expressed genes and five quintiles of expressed genes, ordered from low to high expression; the 1st quintile is the lowest, and the 5th is the highest. In addition, the correlation coefficient of DNA methylation and gene expression in each group as well as descriptive statistics such as the mean and standard deviation are available in the provided spreadsheet (Additional file 2: Table S1).
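As a rough illustration of this grouping, the sketch below assigns each gene to group 0 (non-expressed) or to quintiles 1 to 5 of the expressed genes. Treating an expression value of zero as "non-expressed" and splitting by rank rather than by value cut-offs are assumptions; the function name is hypothetical.

```python
def expression_groups(expression):
    """Map gene name -> group: 0 for non-expressed, 1..5 for expression quintiles.

    expression is a dict of gene name -> expression value.
    """
    groups = {gene: 0 for gene, value in expression.items() if value <= 0}
    expressed = sorted(
        (gene for gene, value in expression.items() if value > 0),
        key=lambda gene: expression[gene],
    )
    n = len(expressed)
    for rank, gene in enumerate(expressed):
        # rank * 5 // n maps ranks 0..n-1 onto 0..4 in near-equal blocks
        groups[gene] = rank * 5 // n + 1
    return groups


# One silent gene plus ten expressed genes with expression values 1..10.
expression = {"g0": 0.0}
expression.update({f"g{i}": float(i) for i in range(1, 11)})
groups = expression_groups(expression)
```

The per-group methylation values can then be passed to any boxplot or violin plot routine, or summarized with a mean and standard deviation per group.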

Distribution of DNA methylation by groups of genes with different expression levels (Arabidopsis data).

The methylation patterns both upstream and downstream of genes are shown for half of the gene body i. This can help to elucidate the mechanisms of DNA methylation at certain bases around a specific point. The regions two kilobases upstream and downstream of the reference point are divided into 10 windows, and the average methylation level is calculated in each window.
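The windowed averaging can be sketched as below for a single flank. The sketch assumes that each 2-kb flank is split into 10 equal windows (the text does not say whether the 10 windows cover one flank or both), covers only the upstream side of a reference point, and uses hypothetical names; the downstream side would be handled the same way.

```python
def upstream_profile(sites, ref, flank=2000, n_windows=10):
    """Mean methylation level per window over the interval [ref - flank, ref).

    sites is an iterable of (position, level); windows are ordered from the
    most distal (index 0) to the one adjacent to the reference point.
    Windows without any cytosine are reported as None.
    """
    size = flank // n_windows
    sums = [0.0] * n_windows
    counts = [0] * n_windows
    for pos, level in sites:
        offset = pos - (ref - flank)
        if 0 <= offset < flank:
            window = min(offset // size, n_windows - 1)
            sums[window] += level
            counts[window] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical cytosines around a reference point (for example a TSS) at 5000.
sites = [(2999, 1.0), (3000, 0.5), (3199, 1.0), (3200, 0.25), (4999, 0.75), (5000, 1.0)]
profile = upstream_profile(sites, ref=5000)
```

Pooling the sites of all genes in one expression group before calling the function gives one curve per group, which is the shape of plot the metagene analysis describes.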

Average methylation level profiling according to different expression groups around genes (Arabidopsis data).

Multiple-methylome analyses investigate the correlation between alterations in methylomes and the differences in transcriptomes between two groups of samples e. Moreover, the correlation can be explored at the gene level to understand the DNA methylation regulatory network associated with gene expression changes. DNA methylation changes between two groups of samples may exert a specific functional impact on gene expression between them e.

To calculate the changes between two groups Group A vs. Multiple-methylome analyses Arabidopsis mutant Group a vs. To identify the genes with clear changes of DNA methylation and gene expression i. They are marked in red in the scatterplot, and users can choose to show the number of differential genes in each of the four quadrants of the plot.

These genes with different DNA methylation statuses associated with gene expression changes are important because their expression may potentially be regulated by differences in DNA methylation between the two groups. The information for the differential genes (gene names, methylation levels, and gene expression values) in the output table allows for downstream analyses such as KEGG pathway analysis or Gene Ontology functional analysis [ 40 , 41 ].

Each row represents a gene, and the DNA methylation level and gene expression are averaged within each group in the columns. Hierarchical clustering of similar methylation and gene expression patterns can also be performed, and the resulting dendrogram is presented at the left margin of the heatmap.

This is useful for identifying genes that are commonly regulated, and the order of the clustered genes will be listed in the output table. MethGET is available through both the web application and the stand-alone version for command-line usage. On the web platform, users can directly upload their datasets and download all output figures with a high resolution of dpi in one click.

The web tutorial is provided in Additional file 1, and guidance regarding the stand-alone version is provided at the GitHub repository. The processing times with and without metagene analyses for Arabidopsis, rice, human, and wheat are given in Table 1.

Y next-generation bioinformatics software for research in life science, biotech, food and plant industries, as well as academia.

The powerful visualization-based data analysis tool with inbuilt powerful statistics delivers immediate results and provides instant exploration and visualization of big data. As such, it is part of the epigenetic code and is also the best characterized epigenetic mechanism. Research has shown that DNA methylation is manifested in a number of important biological processes and human diseases including cancer.

The program is also well suited to the analysis of Illumina DNA methylation and cancer data. A range of samples, including DNA from patient blood, primary tissue from tumors, and cell lines, are studied.

This case study shows how public information from multiple sources was used to propose a new classification of glioma.


