Chapter 5 Workflow
This chapter focuses on practical workflow choices for metabolomics data analysis, from preprocessing platforms to project organization and data sharing (Li 2020).
DiagrammeR::mermaid("
flowchart TB
I(peak-picking) --> C
C(visualization) --> D(normalization/batch correction)
D --> A(annotation/identification)
A --> H(statistical analysis)
C --> A --> B(omics analysis)
D --> H
B --> H
H --> E(experimental validation)
A --> E
H --> A
B --> E
C --> H
")
5.1 Platform for metabolomics data analysis
Many open-source metabolomics projects are available, and a useful overview can be found here.
5.1.1 Recommended pipelines by use case
The metabolomics software ecosystem is now large enough that too many choices can slow down a project rather than help it. In practice, most users do not need to evaluate dozens of tools before starting. A more useful strategy is to choose a workflow according to study type, software background, and whether the main goal is data processing, annotation, or downstream interpretation.
Practical recommendations are:
If you are new to untargeted LC-MS metabolomics and want a local reproducible workflow: use ProteoWizard/msconvert -> xcms or xcmsrocker -> IPO or AutoTuner if needed -> annotation tools -> MetaboAnalyst for downstream statistics and pathway analysis.
If you prefer a graphical interface and strong MS/MS support: use MS-DIAL -> MS-FINDER or GNPS/SIRIUS -> MetaboAnalyst or other downstream tools.
If you want strong community networking and MS/MS-centered interpretation: use MZmine or MS-DIAL -> GNPS feature-based molecular networking -> SIRIUS if needed -> pathway/statistical tools.
If your goal is targeted quantification and validation: use vendor software or triple-quadrupole-oriented workflows first, then export quantitative tables for statistical analysis in R or MetaboAnalyst.
If you work at large scale and need scripted, reproducible analysis: choose an R- or Python-based workflow such as xcms, tidymass, OpenMS, or Asari rather than a click-based interface alone.
These are not the only valid choices, but they are realistic starting points. In most cases, it is better to complete one coherent workflow end to end than to combine too many partially overlapping tools.
5.1.2 XCMS & XCMS online
XCMS online is hosted by the Scripps Research Institute. If your datasets are not large and you want a web-based workflow, XCMS online is still one of the most accessible starting points. It uses METLIN and isoMETLIN to annotate MS/MS data, and pathway analysis is also supported. This is a reasonable option for teaching, pilot studies, or users who are not ready to script their workflow locally.
xcms is different from XCMS online although they share some conceptual background. For local metabolomics data analysis, xcms remains one of the most flexible and reproducible options, especially for users who are comfortable with R. A practical default workflow is msconvert -> IPO or AutoTuner -> xcms -> annotation tools -> MetaboAnalyst or R-based downstream analysis. If you want full scripting, parameter tracking, and scalability, this is still one of the strongest starting points. If you are not familiar with R, the learning curve is real and a GUI-centered platform may be easier.
IPO is a tool for automated optimization of xcms parameters (Libiseller et al. 2015), and Warpgroup is used for chromatogram subregion detection, consensus integration bound determination, and accurate missing value integration (Mahieu, Spalding, and Patti 2016). A case study comparing different xcms parameter settings with IPO is available for GC-MS (Dos Santos and Canuto 2023). Another option is AutoTuner, which is much faster than IPO (McLean and Kujawinski 2020). In practice, parameter optimization is most useful when you have representative QC files and enough time to test settings. It is not necessary for every small project, but default settings should not be treated as universally safe.
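The local xcms steps described above can be sketched as a single function. This is a minimal sketch, not a recommended configuration: the function name `run_xcms_sketch` is hypothetical, every parameter value (ppm, peak width, signal-to-noise threshold, minimum fraction) is a placeholder that should be tuned on representative QC files, and the code assumes the Bioconductor packages xcms and MSnbase are installed and that the input files are already converted to mzML.

```r
# Sketch of a local xcms run: peak picking, retention time alignment,
# correspondence, gap filling, and export of a feature table.
# All parameter values below are placeholders, not recommendations.
run_xcms_sketch <- function(mzml_files, sample_groups) {
  stopifnot(length(mzml_files) == length(sample_groups))
  if (!all(file.exists(mzml_files))) {
    stop("Some mzML files were not found")
  }
  if (!requireNamespace("xcms", quietly = TRUE) ||
      !requireNamespace("MSnbase", quietly = TRUE)) {
    stop("This sketch needs the Bioconductor packages xcms and MSnbase")
  }
  # Read the files without loading all spectra into memory
  raw <- MSnbase::readMSData(mzml_files, mode = "onDisk")
  # Peak picking with centWave
  peaks <- xcms::findChromPeaks(
    raw,
    param = xcms::CentWaveParam(ppm = 15, peakwidth = c(5, 30), snthresh = 10)
  )
  # Retention time alignment with obiwarp
  aligned <- xcms::adjustRtime(peaks, param = xcms::ObiwarpParam())
  # Group peaks across samples into features
  grouped <- xcms::groupChromPeaks(
    aligned,
    param = xcms::PeakDensityParam(sampleGroups = sample_groups,
                                   minFraction = 0.5)
  )
  # Integrate signal where a feature was not detected in a sample
  filled <- xcms::fillChromPeaks(grouped)
  # Feature-by-sample matrix of integrated peak areas
  xcms::featureValues(filled, value = "into")
}

# Example call (not run here; the file paths are hypothetical):
# feature_table <- run_xcms_sketch(
#   c("data/QC_B1_001.mzML", "data/case_B1_002.mzML"),
#   sample_groups = c("QC", "case")
# )
```

Wrapping the steps in one function keeps the parameter values in a single place, which makes them easy to record or to swap for settings suggested by IPO or AutoTuner.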
See these papers for XCMS-based workflows (Forsberg et al. 2018; Huan et al. 2017; Mahieu, Spalding, Gelman, et al. 2016; Montenegro-Burke et al. 2017; Domingo-Almenara and Siuzdak 2020; Stancliffe et al. 2022). For METLIN-related annotation, see these papers (Guijas et al. 2018; Tautenhahn et al. 2012; Xue, Guijas, et al. 2020; Domingo-Almenara, Montenegro-Burke, Ivanisevic, et al. 2018).
MAIT is based on xcms, and its source code is available online (Fernández-Albert et al. 2014).
iMet-Q is an automated tool with a user-friendly interface for quantifying metabolites in full-scan liquid chromatography-mass spectrometry (LC-MS) data (Chang et al. 2016).
compMS2Miner is an automatable metabolite identification, visualization, and data-sharing R package for high-resolution LC–MS data sets. Related papers are available (Edmands et al. 2017; Edmands et al. 2018, 2015).
mzMatch is a modular, open-source, and platform-independent data processing pipeline for metabolomics LC/MS data written in Java, which can be coupled with xcms (Scheltema et al. 2011; Creek et al. 2012). It can also be used for annotation with MetAssign (Daly et al. 2014).
5.1.3 PRIMe
PRIMe is from RIKEN and UC Davis. They update their database frequently (Tsugawa et al. 2016). You can use MS-DIAL for untargeted analysis and MRMPROBS for targeted analysis. For annotation, they developed MS-FINDER, and they also provide Excel-based statistical tools. This platform is especially strong for MS/MS-rich workflows, lipidomics, and users who want a mature GUI. In my view, MS-DIAL is one of the best first choices for users who want serious untargeted analysis without committing to an R-based workflow from the start. The main limitation is that pathway analysis is not the center of this ecosystem, so downstream interpretation may still move to other tools.
MS-DIAL 4 added support for lipidomics with an integrated CCS and retention time atlas(Tsugawa et al. 2020). The latest version, MS-DIAL 5, further extends the platform with multimodal mass spectrometry data mining capabilities including improved DIA deconvolution(Tsugawa et al. 2024).
For PRIMe-based workflows, see these papers (Lai et al. 2018; Matsuo et al. 2017; Treutler et al. 2016; Tsugawa et al. 2015; Tsugawa et al. 2016; Kind et al. 2018). There are also extensions of their workflow (Uchino et al. 2022) and a workflow for environmental science (Bonnefille et al. 2023).
5.1.4 GNPS
GNPS is an open-access knowledge base for community-wide organization and sharing of raw, processed, or identified tandem mass spectrometry (MS/MS) data. It is not a full replacement for primary preprocessing software, but it is one of the most useful platforms for MS/MS-centered annotation, feature-based molecular networking, and community data sharing. Feature-based molecular networking within GNPS can be coupled with xcms, OpenMS, MS-DIAL, MZmine, and other popular software. If your study relies heavily on tandem MS interpretation, GNPS should be considered early rather than added only at the end.
See these papers for GNPS and related projects (Aron et al. 2020; Nothias et al. 2020; Scheubert et al. 2017; Silva et al. 2018; Wang et al. 2016; Bittremieux et al. 2023; Schmid et al. 2021).
5.1.5 OpenMS & SIRIUS
OpenMS is another good platform for mass spectrometry data analysis, written in C++. It can be used as a KNIME plugin. OpenMS is a strong option when transparency of workflow steps, interoperability, and scalable processing are more important than a minimal learning curve. TOPPView is also one of the better tools for visualizing MS data. If you want a workflow that is explicit, modular, and suitable for engineering-style data pipelines, OpenMS is a good choice.
See these papers for OpenMS-based workflows (Bertsch et al. 2011; Pfeuffer et al. 2017, 2024; Röst et al. 2014, 2016; Rurik et al. 2020; Alka et al. 2020).
OpenMS can be coupled to SIRIUS for annotation. SIRIUS is a software framework for de novo identification of metabolites using single and tandem mass spectrometry. It integrates tools such as CSI:FingerID, ZODIAC, and CANOPUS. If your project emphasizes formula assignment, structural class prediction, and in silico annotation, SIRIUS is one of the most important downstream tools to learn.
5.1.6 MZmine
MZmine was originally developed on the Java platform. In 2023, MZmine 3 was released with a completely rewritten architecture, adding support for ion mobility spectrometry data, improved feature detection, and native integration with GNPS molecular networking and SIRIUS(Schmid et al. 2023). MZmine 3 is now one of the most actively maintained open-source platforms for untargeted metabolomics. If you want a GUI workflow with strong modern integration to annotation tools, MZmine is one of the best current choices. Like MS-DIAL, it usually needs to be paired with other tools for pathway analysis.
See these papers for MZmine-based workflows (Pluskal et al. 2010; Pluskal et al. 2020; Schmid et al. 2023).
5.1.7 Emory MaHPIC
This platform is composed of several tools from Emory University, mostly R packages: apLCMS for LC-MS peak detection and preprocessing, xMSanalyzer for automated pipelines for large-scale, non-targeted metabolomics data, xMSannotator for annotation of LC-MS data, and Mummichog for pathway and network analysis of high-throughput metabolomics data. Note that the original Mummichog is no longer actively maintained; its algorithm is now integrated into MetaboAnalyst (Pang et al. 2024). This platform may suit researchers in environmental science who study the exposome.
See these papers for the Emory workflow (Uppal et al. 2013, 2017; Yu et al. 2009; S. Li et al. 2013; Liu et al. 2020).
5.1.8 Others
MetaboAnalyst is a comprehensive web-based platform for metabolomics data analysis, covering statistical analysis, pathway analysis, biomarker discovery and more. The latest version, MetaboAnalyst 6.0(Pang et al. 2024), provides a unified platform for metabolomics data processing, analysis and interpretation, integrating Mummichog for pathway analysis without prior annotation. My suggestion is to use it mainly for downstream statistics, visualization, and pathway interpretation, not as the only place where all upstream preprocessing decisions are made.
PMDDA is a reproducible workflow for exhaustive MS2 data acquisition of MS1 features(Yu et al. 2022) with data and script available online.
tidymass is an object-oriented reproducible analysis framework for LC–MS data(Shen et al. 2022).
R for mass spectrometry is an R software collection for the analysis and interpretation of high-throughput mass spectrometry assays.
Additional tools exist for specialized settings such as GC-MS, imaging MS, environmental non-target screening, feature matching across studies, high-dimensional ion mobility data, and automated QQQ preprocessing(Melamud et al. 2010; Clasquin et al. 2012; Y.-J. Yu et al. 2019; Zhang et al. 2020; Palmer et al. 2017; Riquelme et al. 2020; Hiller et al. 2009; Giacomoni et al. 2015; Wen et al. 2017; Jalili et al. 2020; Kew et al. 2017; Habra et al. 2021; Delabriere et al. 2021; Helmus et al. 2021; Bai et al. 2022; Eilertz et al. 2022; Baygi et al. 2022; Colby et al. 2022; Plyushchenko et al. 2022; Zheng et al. 2022; Li et al. 2023; Volikov et al. 2023; Goracci et al. 2024; Liu et al. 2023). For most readers, these are better treated as second-round options after one core workflow is already working.
5.1.9 Workflow Comparison
Several studies have compared different workflows, and their results can guide your selection (Myers et al. 2017; Weber et al. 2017; Li et al. 2018; Liao et al. 2023).
xcmsrocker is a Docker image for metabolomics that provides templates for comparing R-based software (Yu et al. 2022).
5.1.10 A simple opinionated choice guide
If a short list is needed instead of a long catalog:
For reproducible R-based untargeted metabolomics: xcms and related R tools
For GUI-centered untargeted workflows with strong MS/MS support: MS-DIAL or MZmine
For MS/MS networking and community annotation: GNPS
For in silico structural annotation: SIRIUS
For downstream statistics and pathway analysis: MetaboAnalyst or scripted analysis in R
This is enough for many real projects. The best workflow is usually not the one with the most software, but the one where every step is documented, reproducible, and appropriate for the study objective.
5.2 Project Setup
I suggest building your data analysis projects in RStudio (click File - New Project - New Directory - Empty Project), then giving the project a name. I also recommend the following practices once you are comfortable with RStudio.
Use Git/GitHub for version control of your code and to sync your project online.
Don’t name the project after yourself, because other people may collaborate with you and readers may inspect your data after you publish. Each project should correspond to one paper or one chapter of your thesis.
Use a workflow document (txt or doc) in your project to record all of the steps and code you ran for this project. Treat this document as the digital version of your experiment notebook.
Use a data folder in your project folder for the raw data and the results of your data analysis.
Use a figure folder in your project folder for the figures.
Use a manuscript folder in your project folder for the manuscript (you can write the paper in RStudio with the help of R Markdown templates).
Double-click [yourprojectname].Rproj to open your project.
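The folder layout described above can also be created from the R console. The following is a minimal base R sketch: the function name `create_project`, the project name, and the notebook file name `workflow.txt` are illustrative assumptions, and the example writes into a temporary directory so that nothing is created in your working directory.

```r
# Create a minimal project skeleton with data, figure, and manuscript folders
# plus a plain-text notebook. All names here are illustrative.
create_project <- function(root) {
  folders <- file.path(root, c("data", "figure", "manuscript"))
  for (folder in folders) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  # Plain-text notebook for the steps and code run in this project
  notebook <- file.path(root, "workflow.txt")
  if (!file.exists(notebook)) {
    writeLines(paste("Project started:", Sys.Date()), notebook)
  }
  invisible(c(folders, notebook))
}

# Hypothetical project name, created under a temporary directory
project_root <- file.path(tempdir(), "plasma-lipid-study")
created <- create_project(project_root)
print(basename(created))
```

Creating the skeleton with a script rather than by hand keeps the layout identical across projects, which helps collaborators find raw data, figures, and the manuscript in the same place every time.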
5.3 Data Standards and Metadata
Reproducible metabolomics research requires not only sharing raw data but also providing well-structured metadata that describes how the data was generated and processed. The FAIR principles (Findable, Accessible, Interoperable, Reusable) provide a general framework for scientific data management(Wilkinson et al. 2016), and their adoption in metabolomics is critical for cross-study comparisons and meta-analyses.
5.3.1 Metadata Standards
The Investigation-Study-Assay (ISA) framework is the most widely adopted metadata standard in metabolomics(Sansone et al. 2012). ISA-Tab provides a structured format to describe the experimental design (Investigation), the samples and their biological context (Study), and the analytical measurements (Assay). For metabolomics specifically, mwTab is the format used by the Metabolomics Workbench(Sud et al. 2016).
When submitting data to public repositories, the minimum reporting standards proposed by the Metabolomics Standards Initiative (MSI) should be followed(Salek et al. 2013). These standards cover the biological context, chemical analysis, data processing and statistical analysis metadata. In practice, compliance with these minimum reporting standards remains a challenge in the community(R. A. Spicer et al. 2017).
5.3.2 FAIR in Metabolomics
FAIR principles (Wilkinson et al. 2016) have been increasingly adopted in metabolomics workflows, and several tools and resources have been developed to make metabolomics data more FAIR (Rocca-Serra et al. 2016). Practical steps include:
Use standard structure identifiers (e.g., InChI, SMILES) for compounds and persistent identifiers such as DOIs for datasets
Deposit raw data in open formats (mzML) to public repositories (MetaboLights(Haug et al. 2020), Metabolomics Workbench(Sud et al. 2016))
Document the complete analytical and computational workflow with version-controlled parameters
Use controlled vocabularies and ontologies (e.g., Chemical Entities of Biological Interest, ChEBI) for annotation
The metaRbolomics initiative provides an overview of R-based tools that support FAIR-compliant metabolomics workflows(Stanstrup et al. 2019). For quality assurance and quality control standards in practice, check the mQACC consortium guidelines(O’Brien et al. 2024).
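One simple way to keep processing parameters under version control, as recommended in the list above, is to write them to a small text file that is committed alongside the analysis code. The sketch below uses only base R. The parameter names and values are hypothetical placeholders rather than recommended settings, and the tab-separated layout is an assumption, not a repository or ISA-Tab requirement.

```r
# Hypothetical processing parameters; the values are placeholders.
params <- list(
  software = "xcms",
  ppm = 15,
  peakwidth_min = 5,
  peakwidth_max = 30,
  snthresh = 10
)

# Write parameters and the R version to a tab-separated file that can be
# tracked with Git next to the analysis scripts.
write_param_record <- function(params, path) {
  record <- data.frame(
    parameter = names(params),
    value = unname(vapply(params, as.character, character(1))),
    stringsAsFactors = FALSE
  )
  # Tie the record to the computing environment
  record <- rbind(
    record,
    data.frame(parameter = "r_version", value = R.version.string,
               stringsAsFactors = FALSE)
  )
  write.table(record, path, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(record)
}

param_file <- file.path(tempdir(), "processing_parameters.tsv")
record <- write_param_record(params, param_file)
print(record)
```

A plain text file is used because differences between versions are then readable in ordinary Git diffs.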
5.3.3 Practical Recommendations
For new metabolomics practitioners, the following checklist could help improve data quality and reproducibility(Rampler et al. 2021; Broadhurst, Goodacre, Stacey N. Reinke, et al. 2018a):
Record all instrument parameters, column information and mobile phase composition in a machine-readable format
Include pooled QC and blank samples in every analytical batch and document their preparation
Convert vendor-specific raw files to open formats (mzML via ProteoWizard) immediately after acquisition
Use standardized file naming conventions that encode sample metadata (group, batch, injection order)
Track all data processing parameters (software version, peak picking thresholds, alignment settings) in a reproducible script or workflow file
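The file naming item in the checklist above can be made concrete with a small helper that builds and parses names. This is a sketch under an assumed convention, `<group>_<batch>_<injection order>.mzML`; the convention, the function names, and the sample values are all illustrative, and group or batch labels must not contain underscores for the parser to work.

```r
# Assumed convention: <group>_<batch>_<injection order>.mzML,
# for example "case_B1_007.mzML".
make_sample_name <- function(group, batch, order) {
  sprintf("%s_%s_%03d.mzML", group, batch, as.integer(order))
}

# Recover the sample metadata encoded in the file names
parse_sample_name <- function(filenames) {
  stem <- sub("\\.mzML$", "", basename(filenames))
  parts <- strsplit(stem, "_", fixed = TRUE)
  if (any(lengths(parts) != 3)) {
    stop("File names must follow <group>_<batch>_<order>.mzML")
  }
  fields <- do.call(rbind, parts)
  data.frame(
    file = filenames,
    group = fields[, 1],
    batch = fields[, 2],
    injection_order = as.integer(fields[, 3]),
    stringsAsFactors = FALSE
  )
}

files <- c(
  make_sample_name("QC", "B1", 1),
  make_sample_name("case", "B1", 7),
  make_sample_name("control", "B1", 3)
)
meta <- parse_sample_name(files)
# Sort by batch and injection order, as needed for drift or batch checks
meta <- meta[order(meta$batch, meta$injection_order), ]
print(meta)
```

Zero-padding the injection order keeps alphabetical and numerical sorting consistent, so file browsers and scripts list the samples in the same order.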
5.4 Data sharing
See Haug et al. (2017) for background. The main repositories are:
MetaboLights is a major general-purpose international repository and a good default choice for many studies.
The Metabolomics Workbench is another major repository with strong adoption, especially in the United States, and uses the mwTab ecosystem.
MetaboBank is a useful repository in Japan and Asia-Pacific contexts.
MetabolomeXchange is best treated as a discovery portal rather than the primary home for your submission.
MetabolomeExpress is a public place to process, interpret and share GC/MS metabolomics datasets(Carroll et al. 2010).
In practice, the decision can be simple:
choose MetaboLights if you want a broadly recognized default repository with strong international visibility
choose Metabolomics Workbench if your community, funder, journal, or collaborators already work within that ecosystem
use MetaboBank when it best matches your regional infrastructure or collaboration network
Whichever repository you choose, the important point is to deposit raw data in open formats when possible, include metadata that satisfy MSI-style minimum reporting, and provide enough information for another group to reproduce the computational workflow.
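Before submission, it can help to check that the sample table and the converted files agree. The base R sketch below compares a metadata table against the mzML files in a folder. The function name `check_deposit`, the required column names, and the demo files are hypothetical, and the check is not a substitute for the validation performed by MetaboLights or the Metabolomics Workbench.

```r
# Compare a sample table against the mzML files present in a folder.
# The required column names are illustrative, not a repository schema.
check_deposit <- function(metadata, data_dir,
                          required = c("sample", "group", "batch")) {
  if (!"sample" %in% names(metadata)) {
    stop("The metadata table needs a 'sample' column")
  }
  expected <- paste0(metadata$sample, ".mzML")
  present <- list.files(data_dir, pattern = "\\.mzML$")
  list(
    missing_columns = setdiff(required, names(metadata)),
    missing_files = setdiff(expected, present),
    undocumented_files = setdiff(present, expected)
  )
}

# Demo with empty placeholder files in a temporary folder
demo_dir <- file.path(tempdir(), "deposit_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("QC_B1_001.mzML", "case_B1_002.mzML", "extra_B1_009.mzML")
invisible(file.create(file.path(demo_dir, demo_files)))

metadata <- data.frame(
  sample = c("QC_B1_001", "case_B1_002", "control_B1_003"),
  group = c("QC", "case", "control"),
  batch = "B1",
  stringsAsFactors = FALSE
)
report <- check_deposit(metadata, demo_dir)
print(report)
```

In this demo the report flags one sample without a file and one file without a metadata row, which are the two mismatches most likely to block another group from reproducing the workflow.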
5.5 Contest
- CASMI (Critical Assessment of Small Molecule Identification) is a contest for small-molecule identification (Blaženović et al. 2017)