Chapter 8 Omics Analysis

After filtering and annotation, the next step is often to place metabolomics results into a broader biological context. Omics analysis aims to connect metabolite-level findings with pathways, networks, and other omics layers in order to answer a specific biological question. At this stage, data generated from xcms or other preprocessing platforms can be transferred into pathway tools, databases, or multi-omics integration frameworks.

You will get an updated database list here.

It is still difficult to connect genes, proteins, metabolites, and other molecular entities seamlessly across all databases for a complete view of one biological process. In practice, however, even partial integration across these layers can generate useful mechanistic insight.

8.1 From Bottom-up to Top-down

Bottom-up analysis models each metabolite separately. In this setting, the goal is to identify which metabolites are associated with the experimental design or phenotype of interest. As always, multiple-comparison control is essential.

\[ metabolite = f(control/treatment, co\text{-}variables) \]
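The multiple-comparison step can be illustrated with a minimal sketch. It assumes that one p-value per metabolite has already been obtained from whatever per-metabolite model was fitted, and it applies the Benjamini-Hochberg adjustment to control the false discovery rate. The function name `bh_adjust` and the example p-values are hypothetical and used only for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one per metabolite, from the bottom-up models.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = bh_adjust(raw_p)
print(adj_p)
```

Metabolites whose adjusted p-value falls below the chosen false discovery rate (for example 0.05) would then be carried forward to interpretation.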

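Top-down analysis models the outcome (for example, control versus treatment) as a function of all metabolites together. In this setting, the contribution of each metabolite to the outcome can be evaluated. Because the number of metabolites usually exceeds the number of samples, variable selection is needed to obtain a stable and interpretable model.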

\[ control/treatment = f(metabolite 1,metabolite 2,...,metaboliteN,co\text{-}variables) \]
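As a rough illustration of variable selection in a top-down model, the sketch below fits an L1-penalized (lasso) linear model by coordinate descent. It is a simplification under stated assumptions: the 0/1 group label is treated as a numeric outcome, whereas a real analysis would more likely use penalized logistic regression from an established package. The function names, the penalty value, and the toy intensities are all hypothetical.

```python
from math import sqrt


def standardize(column):
    """Center a feature and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]


def soft_threshold(value, penalty):
    """Shrink a value toward zero; small values become exactly zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_select(features, outcome, penalty=0.1, n_iter=100):
    """Lasso by coordinate descent on standardized features.

    Minimizes (1 / 2n) * ||y - X b||^2 + penalty * ||b||_1.
    """
    names = list(features)
    n = len(outcome)
    columns = [standardize(features[name]) for name in names]
    y_mean = sum(outcome) / n
    residual = [y - y_mean for y in outcome]
    coefs = [0.0] * len(names)
    for _ in range(n_iter):
        for j, column in enumerate(columns):
            # Covariance of feature j with the residual after adding its own fit back.
            rho = sum(x * (r + coefs[j] * x) for x, r in zip(column, residual)) / n
            new_coef = soft_threshold(rho, penalty)
            delta = new_coef - coefs[j]
            residual = [r - delta * x for x, r in zip(column, residual)]
            coefs[j] = new_coef
    return dict(zip(names, coefs))


# Toy data: met_1 separates the groups, met_2 is unrelated to them.
features = {"met_1": [10, 10, 20, 20], "met_2": [7, 5, 7, 5]}
outcome = [0, 0, 1, 1]  # 0 = control, 1 = treatment
selected = lasso_select(features, outcome)
print(selected)
```

Metabolites with a coefficient of exactly zero are dropped from the model; the penalty would normally be tuned by cross-validation rather than fixed in advance.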

In a multi-omics study, datasets from different sources may need to be integrated into the same model.

\[ control/treatment = f(metabolites, proteins, genes, miRNA,co\text{-}variables) \]

8.2 Pathway analysis

Pathway analysis maps annotated metabolites onto known pathways and applies statistical tests to find the pathways that are most affected, or the compounds that have a strong influence on a given pathway.

8.2.1 A practical pathway analysis workflow

In practice, pathway analysis is not simply uploading a metabolite list into a website. A more reliable workflow is:

  1. Start from a filtered and normalized feature table with clear sample grouping and statistical results.
  2. Separate annotated metabolites from unknown features. Known compounds can enter classical pathway enrichment, while unknown features may require mummichog-like approaches or chemical-class analysis.
  3. Harmonize metabolite identifiers before pathway mapping. Different tools may expect KEGG IDs, HMDB IDs, PubChem identifiers, or metabolite names, and identifier mismatch is a common source of failure.
  4. Choose the pathway strategy:
    • over-representation analysis for a list of significant metabolites
    • topology-based analysis when pathway position and connectivity are important
    • mummichog-like analysis when many signals remain unannotated
    • chemical similarity enrichment when pathway databases are incomplete
  5. Inspect pathway hits manually rather than accepting the ranked list directly. Check whether the mapped metabolites are biologically coherent, whether the direction of change is consistent, and whether the pathway result is driven by only one or two compounds.
  6. Return to metabolite-level evidence after pathway ranking. A pathway hit is only as strong as the features and annotations supporting it.

Therefore, pathway analysis should be treated as a downstream interpretation step built on annotation, normalization, and statistical modeling rather than as a shortcut around them.
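Over-representation analysis, the first strategy in step 4, can be sketched as a one-sided hypergeometric test. The example assumes that the background (the metabolite universe), the pathway membership, and the list of significant metabolites are all given as sets of harmonized identifiers; the function name and the identifiers are hypothetical.

```python
from math import comb


def ora_pvalue(background, pathway, significant):
    """P(overlap >= observed) under the hypergeometric distribution."""
    # Only metabolites that could have been measured count toward the test.
    pathway = pathway & background
    significant = significant & background
    n_background = len(background)
    n_pathway = len(pathway)
    n_selected = len(significant)
    n_overlap = len(pathway & significant)
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical identifiers: 10 measured metabolites, 4 in the pathway,
# 3 significant, 2 of which fall in the pathway.
background = {f"M{i}" for i in range(1, 11)}
pathway = {"M1", "M2", "M3", "M4"}
significant = {"M1", "M2", "M9"}
p_value = ora_pvalue(background, pathway, significant)
print(p_value)
```

Restricting both sets to the background makes the dependence on the metabolite universe explicit: the same overlap can give a very different p-value if the background is changed from the measured metabolites to a whole database.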

8.2.2 Pathway Database

  • SMPDB (The Small Molecule Pathway Database) is an interactive, visual database containing more than 618 small molecule pathways found in humans. More than 70% of these pathways (>433) are not found in any other pathway database. The pathways include metabolic, drug, and disease pathways.

  • KEGG (Kyoto Encyclopedia of Genes and Genomes) is one of the most complete and widely used databases containing metabolic pathways (495 reference pathways) from a wide variety of organisms (>4,700). These pathways are hyperlinked to metabolite and protein/enzyme information. Currently KEGG has >17,000 compounds (from animals, plants and bacteria), 10,000 drugs (including different salt forms and drug carriers) and nearly 11,000 glycan structures.

  • BioCyc is a collection of 14558 Pathway/Genome Databases (PGDBs), plus software tools for exploring them.

  • Reactome is an open-source, open access, manually curated and peer-reviewed pathway database. It provides intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology and education.

  • WikiPathways is a database of biological pathways maintained by and for the scientific community.

8.2.3 Pathway software

  • MetaboAnalyst is one of the most widely used platforms for metabolomics pathway analysis (Pang et al. 2024). It integrates Mummichog for pathway activity prediction directly from m/z features without prior annotation, as well as conventional over-representation analysis and pathway topology analysis using annotated compound lists.

  • FELLA is an R package that performs metabolite enrichment analysis using KEGG sub-network topology rather than simple pathway lists, which can capture cross-pathway connections (Picart-Armada et al. 2018).

  • ChemRICH performs Chemical Similarity Enrichment Analysis as an alternative to biochemical pathway mapping (Barupal and Fiehn 2017). Instead of relying on incomplete pathway databases, ChemRICH groups metabolites by chemical similarity (e.g., Tanimoto scores) and tests for enrichment within these chemical classes. This is particularly useful for untargeted metabolomics where many identified metabolites may not map to known pathways.

  • RaMP-DB is a relational database that integrates pathway information from KEGG, Reactome, WikiPathways and HMDB for batch pathway analysis (B. Zhang et al. 2023).

  • Pathway Commons provides online tools for pathway analysis.

  • metabox can perform pathway analysis.

  • IMPaLA is used for pathway enrichment analysis.

  • Metscape visualizes metabolomics data as networks and can display partial-correlation networks estimated with the Debiased Sparse Partial Correlation (DSPC) algorithm (Basu et al. 2017).
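The Tanimoto score mentioned for chemical similarity enrichment can be illustrated with a short sketch. It assumes that each compound is already described by a structural fingerprint, represented here as the set of bit positions that are switched on; the bit positions are invented for illustration and this is not ChemRICH's own implementation.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared 'on' bits divided by all 'on' bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # two empty fingerprints carry no structural information
    return len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints for two metabolites.
compound_a = {1, 2, 3, 4}
compound_b = {3, 4, 5}
similarity = tanimoto(compound_a, compound_b)
print(similarity)  # 2 shared bits out of 5 distinct bits
```

Compounds with high pairwise scores can then be clustered into chemical classes, and enrichment is tested within those classes instead of within pathways.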

8.2.4 Pathway interpretation pitfalls

Pathway analysis is useful, but it is also easy to over-interpret. Common pitfalls include:

  • identifier mismatch: the same metabolite may appear under different names or map to multiple database entries

  • small hit count: a pathway may look significant because only one or two metabolites are mapped

  • database incompleteness: many metabolites, especially lipids, xenobiotics, and unknowns, are poorly represented in pathway resources

  • background set problems: enrichment results depend strongly on what is treated as the metabolite universe

  • false confidence from annotation uncertainty: weak or putative annotations can produce very confident-looking pathways

  • pathway redundancy: overlapping pathways may all appear significant because they share the same few metabolites

  • mixing association with mechanism: a significant pathway does not prove causal involvement

For these reasons, pathway results should be presented as biological hypotheses supported by mapped metabolites, not as proof that a pathway is definitively activated or inhibited.
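Two of these pitfalls, small hit counts and pathway redundancy, can be screened for automatically before manual inspection. The sketch below assumes the pathway result is available as a mapping from pathway name to the set of mapped significant metabolites; the function name, thresholds, and example pathways are hypothetical.

```python
from itertools import combinations


def audit_pathway_hits(pathway_hits, min_hits=2, max_jaccard=0.5):
    """Flag pathways with few hits and pathway pairs sharing most of their hits."""
    few_hits = sorted(
        name for name, hits in pathway_hits.items() if len(hits) < min_hits
    )
    redundant = []
    for a, b in combinations(sorted(pathway_hits), 2):
        union = pathway_hits[a] | pathway_hits[b]
        if not union:
            continue
        # Jaccard index of the two hit sets: shared hits over all hits.
        jaccard = len(pathway_hits[a] & pathway_hits[b]) / len(union)
        if jaccard >= max_jaccard:
            redundant.append((a, b, jaccard))
    return few_hits, redundant


# Hypothetical pathway result.
pathway_hits = {
    "Citrate cycle": {"citrate", "succinate", "malate"},
    "Glyoxylate metabolism": {"citrate", "malate"},
    "Tryptophan metabolism": {"kynurenine"},
}
few_hits, redundant = audit_pathway_hits(pathway_hits)
print(few_hits)
print(redundant)
```

Flagged pathways are not necessarily wrong, but they are the ones where the supporting metabolites should be checked before the result is reported.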

8.3 Network analysis

Mummichog can perform pathway and network analysis directly on m/z features without prior annotation. The algorithm is now integrated into MetaboAnalyst (Pang et al. 2024).

MSS is a sequential feature screening procedure that selects important sub-networks and identifies the optimal matching for metabolomics data (Cai et al. 2017).

Metapone is a joint pathway testing package for untargeted metabolomics data (Tian et al. 2022).

8.3.1 Network construction choices

Network analysis in metabolomics can mean very different things, so the first task is to define what the edges represent.

  • Pathway networks connect metabolites through known biochemical reactions or curated pathway relationships. These are useful for interpretation but limited by database coverage.

  • Correlation or partial-correlation networks connect metabolites with similar behavior across samples. These are useful for discovering coordinated modules, but edges represent statistical association rather than known chemistry.

  • MS/MS similarity or molecular networks connect compounds by fragmentation similarity. These are especially useful for annotation propagation and chemical family discovery.

  • Multi-omics networks connect metabolites with genes, proteins, or microbiome features based on known biology or statistical association.

The choice depends on the question:

  • use pathway networks when you already have confident annotation and want biochemical interpretation

  • use correlation networks when you want to discover modules or co-regulated metabolite sets

  • use molecular networks when the study is rich in MS/MS and compound family discovery is important

  • use multi-omics networks when the goal is cross-layer integration rather than metabolite-only structure

For correlation-style networks, construction choices matter a lot:

  • Pearson correlation is simple and common, but sensitive to outliers and linearity assumptions

  • Spearman correlation is more robust for monotonic relationships and often safer for metabolomics

  • Partial correlation can reduce indirect associations but usually needs more samples and careful regularization

  • Sparse graphical models may improve interpretability in high-dimensional settings, but assumptions must be checked

Threshold choice is also critical. Very low correlation cutoffs produce dense, hard-to-interpret networks; very high cutoffs may miss meaningful modules. In practice, thresholds should be justified by sample size, expected noise level, and network stability rather than chosen only for visual appeal.
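A correlation network with an explicit threshold can be sketched as follows. The example computes Spearman correlation as Pearson correlation on average ranks and keeps an edge when the absolute correlation reaches the cutoff. The feature names, intensities, and the cutoff of 0.7 are hypothetical; with real data the cutoff would need the justification described above.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature has no defined correlation
    return sxy / sqrt(sxx * syy)


def correlation_edges(table, cutoff=0.7):
    """Return {(feature_a, feature_b): rho} for pairs with |rho| >= cutoff."""
    edges = {}
    for a, b in combinations(sorted(table), 2):
        rho = spearman(table[a], table[b])
        if abs(rho) >= cutoff:
            edges[(a, b)] = rho
    return edges


# Hypothetical intensities for four features across five samples.
table = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 5, 8, 20],
    "C": [5, 3, 4, 1, 2],
    "D": [3, 1, 5, 2, 4],
}
edges = correlation_edges(table)
print(edges)
```

In this toy table, feature D is only weakly correlated with the others and stays unconnected, while lowering the cutoff would quickly connect every pair.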

8.3.2 Network interpretation pitfalls

As with pathway analysis, network plots can be visually persuasive but biologically misleading if used carelessly.

  • correlation is not mechanism: two metabolites can correlate because of shared sample structure, batch effects, diet, or other confounders

  • hub nodes may reflect abundance or detectability, not biological centrality

  • small sample sizes create unstable networks, especially in untargeted studies with many features

  • different preprocessing choices change the network substantially, including normalization, filtering, and missing-value handling

  • community detection is method-dependent, so modules should not be treated as unique truths

  • annotated and unknown features may mix, which can be useful for discovery but difficult to interpret confidently

Therefore, network analysis should usually be used for hypothesis generation, prioritization, and visualization of structure in the data rather than as standalone proof of biological regulation.
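One simple way to probe network instability is to check whether an edge survives when each sample is left out in turn. The sketch below is one possible check under stated assumptions, not a standard procedure from a specific package: it uses Pearson correlation, a hypothetical cutoff of 0.7, and invented intensities in which one pair is driven by a single extreme sample.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either feature is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def edge_stability(x, y, cutoff=0.7):
    """Fraction of leave-one-out subsets in which |r| still reaches the cutoff."""
    n = len(x)
    kept = 0
    for left_out in range(n):
        xs = x[:left_out] + x[left_out + 1:]
        ys = y[:left_out] + y[left_out + 1:]
        if abs(pearson(xs, ys)) >= cutoff:
            kept += 1
    return kept / n


# A consistently linear pair versus a pair driven by the last sample.
stable = edge_stability([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
fragile = edge_stability([1, 2, 3, 4, 10], [2, 1, 2, 1, 10])
print(stable, fragile)
```

An edge that disappears when a single sample is removed, as in the second pair, is better treated as a lead to examine than as evidence of coordinated regulation.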

8.4 Omics integration

Multi-omics integration aims to combine data from different omics layers (genomics, transcriptomics, proteomics, metabolomics) to gain a more comprehensive understanding of biological systems. Several approaches and tools are available:

  • mixOmics is an R package for multi-omics data integration using multivariate projection-based methods including sparse PLS, DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents) and other supervised/unsupervised approaches (Rohart et al. 2017).

  • MOFA+ (Multi-Omics Factor Analysis) is a statistical framework for comprehensive integration of multi-modal data. It identifies shared and dataset-specific sources of variation across omics layers using a Bayesian factor model (Argelaguet et al. 2020). MOFA+ extends the original MOFA to support single-cell data and multiple sample groups.

  • The Omics Discovery Index (OmicsDI) provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics).

  • For standardized multi-omics of Earth’s microbiomes, see this GNPS-based work (Shaffer et al. 2022).

  • Windows Scanning Multiomics is an approach for integrated metabolomics and proteomics (Shi et al. 2023).

References

Argelaguet, Ricard, Damien Arnol, Danila Bredikhin, et al. 2020. “MOFA+: A Statistical Framework for Comprehensive Integration of Multi-Modal Single-Cell Data.” Genome Biology 21: 111. https://doi.org/10.1186/s13059-020-02015-1.
Barupal, Dinesh Kumar, and Oliver Fiehn. 2017. “Chemical Similarity Enrichment Analysis (ChemRICH) as Alternative to Biochemical Pathway Mapping for Metabolomic Datasets.” Scientific Reports 7: 14567. https://doi.org/10.1038/s41598-017-15231-w.
Basu, Sumanta, William Duren, Charles R. Evans, Charles F. Burant, George Michailidis, and Alla Karnovsky. 2017. “Sparse Network Modeling and Metscape-Based Visualization Methods for the Analysis of Large-Scale Metabolomics Data.” Bioinformatics 33 (10): 1545–53. https://doi.org/10.1093/bioinformatics/btx012.
Cai, Qingpo, Jessica A. Alvarez, Jian Kang, and Tianwei Yu. 2017. “Network Marker Selection for Untargeted LCMS Metabolomics Data.” Journal of Proteome Research 16 (3): 1261–69. https://doi.org/10.1021/acs.jproteome.6b00861.
Pang, Zhiqiang, Lei Xu, Charles Viau, et al. 2024. “MetaboAnalystR 4.0: A Unified LC-MS Workflow for Global Metabolomics.” Nature Communications 15 (1): 3675. https://doi.org/10.1038/s41467-024-48009-6.
Picart-Armada, Sergio, Francesc Fernández-Albert, Maria Vinaixa, Oscar Yanes, and Alexandre Perera-Lluna. 2018. “FELLA: An R Package to Enrich Metabolomics Data.” BMC Bioinformatics 19: 538. https://doi.org/10.1186/s12859-018-2487-5.
Rohart, Florian, Benoît Gautier, Amrit Singh, and Kim-Anh Lê Cao. 2017. “mixOmics: An R Package for ’Omics Feature Selection and Multiple Data Integration.” PLOS Computational Biology 13 (11): e1005752. https://doi.org/10.1371/journal.pcbi.1005752.
Shaffer, Justin P., Louis-Félix Nothias, Luke R. Thompson, et al. 2022. “Standardized Multi-Omics of Earth’s Microbiomes Reveals Microbial and Metabolite Diversity.” Nature Microbiology 7 (12): 2128–50. https://doi.org/10.1038/s41564-022-01266-x.
Shi, Jiachen, Jialiang Zhao, Yu Zhang, et al. 2023. “Windows Scanning Multiomics: Integrated Metabolomics and Proteomics.” Analytical Chemistry, ahead of print, December. https://doi.org/10.1021/acs.analchem.3c03785.
Tian, Leqi, Zhenjiang Li, Guoxuan Ma, et al. 2022. “Metapone: A Bioconductor Package for Joint Pathway Testing for Untargeted Metabolomics Data.” Bioinformatics 38 (14): 3662–64. https://doi.org/10.1093/bioinformatics/btac364.
Zhang, Bofei, Shunchao Hu, Elizabeth Baskin, Andrew Patt, Jalal K Siddiqui, and Ewy A Mathé. 2023. “RaMP-DB 2.0: A Renovated Knowledgebase for Deriving Biological and Chemical Insight from Metabolites, Proteins, and Genes.” Bioinformatics 39 (1): btac726. https://doi.org/10.1093/bioinformatics/btac726.