Chapter 10 Statistical Analysis
The statistical strategy in a metabolomics study depends strongly on the research goal. Because many metabolomics experiments are performed in an untargeted mode, analysis often begins with exploratory methods before moving to formal hypothesis testing or predictive modeling. The basic goals of exploratory analysis are:
Find the relationships among variables.
Find the relationships among samples or groups of samples.
These goals correspond to unsupervised analysis.
However, many studies also include group labels or continuous outcomes that can be used to identify biomarkers or model associations between variables and phenotype. These settings require supervised methods. For a broader discussion of statistical analysis in metabolic phenotyping, see (Blaise et al. 2021). Before discussing specific algorithms, it is useful to review a few core concepts.
10.1 Basic Statistical Analysis
A statistic is a numerical summary used to describe properties of variables or samples. It can be designed to extract signal, reduce noise, or support inference. Statistical models and hypothesis tests are built from such summaries rather than from raw observations alone.
\[Statistic = f(sample_1,sample_2,...,sample_n)\]
Null Hypothesis Significance Testing (NHST) is commonly used for statistical inference. A p-value is the probability of observing a statistic at least as extreme as the one obtained, assuming the null hypothesis holds.
In omics studies, the multiple-comparison problem appears as soon as many metabolites are tested at the same time. False discovery rate (FDR) control is therefore essential to limit false positives. Common approaches include the Benjamini-Hochberg adjustment of raw p-values or direct estimation of q-values using Storey’s method.
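As a concrete illustration, the Benjamini-Hochberg step-down procedure can be sketched in a few lines of plain Python (a teaching sketch with a hypothetical `bh_adjust` helper; in practice you would use `p.adjust(p, method = "BH")` in R or an equivalent library routine):

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    # sort p-values while remembering their original positions
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    # step down from the largest p-value, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adj = bh_adjust(pvals)
print([round(v, 3) for v in adj])
# [0.008, 0.032, 0.067, 0.067, 0.067, 0.08, 0.085, 0.205]
```

Note that several adjacent p-values collapse to the same adjusted value because of the monotonicity constraint, which is expected behavior for BH.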
NHST is widely used, but p-values are often misinterpreted and do not by themselves quantify evidence in favor of an alternative model. Bayesian hypothesis testing can address some of these limitations by using Bayes factors to compare the null hypothesis with an alternative hypothesis.
\[Bayes\ factor = \frac{p(D \mid H_a)}{p(D \mid H_0)} = \frac{posterior\ odds}{prior\ odds}\]
Statistical models use measured data to support prediction or explanation. Most models require parameter tuning to achieve good performance, and they can be evaluated with metrics such as \(R^2\) or ROC curves. Once candidate models are fitted, model selection can be used to compare their performance and choose an appropriate balance between interpretability and predictive accuracy.
\[Target = g(Statistic) = g(f(sample_1,sample_2,...,sample_n))\]
The bias-variance tradeoff is a core concept in statistical modeling. Poorly tuned models may be overfitted (low bias, high variance) or underfitted (high bias, low variance).
\[E[(y - \hat f)^2] = \sigma^2 + Var[\hat f] + Bias[\hat f]^2\]
Cross-validation is commonly used to compare models through repeated training-testing schemes such as jackknife resampling, bootstrap resampling, or k-fold cross-validation.
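A minimal sketch of k-fold cross-validation, assuming a generic `fit`/`predict` pair (all names here are illustrative, not from any particular library):

```python
import random

def kfold_indices(n, k, seed=42):
    """Split sample indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=5):
    """Return the held-out mean squared error of each fold."""
    errors = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(len(X)) if i not in held_out]
        y_tr = [y[i] for i in range(len(X)) if i not in held_out]
        model = fit(X_tr, y_tr)
        mse = sum((predict(model, X[i]) - y[i]) ** 2
                  for i in test_idx) / len(test_idx)
        errors.append(mse)
    return errors

# toy model: "fitting" just stores the training mean, predictions are constant
X = [[i] for i in range(20)]
y = [2.0 * i + 1 for i in range(20)]
errs = cross_validate(X, y,
                      fit=lambda X, y: sum(y) / len(y),
                      predict=lambda model, x: model, k=5)
```

The key point is that the model is refit from scratch on each training split; any preprocessing that looks across samples must also happen inside this loop.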
Regularization can also improve predictive performance and robustness. Ridge regression, LASSO, and related penalties are widely used for this purpose.
For supervised analysis, linear models and tree-based models are two common starting points. Linear models are useful when the goal is to estimate interpretable relationships between variables and outcomes. Tree-based models, including bagging, random forest, and boosting, instead learn hierarchical decision structures and often perform well for prediction. Other model families, such as Support Vector Machines (SVMs), artificial neural networks (ANNs), and deep learning methods, make different assumptions and may also be useful depending on data size and study goals. If prediction is the main objective, comparing several model classes is usually more informative than relying on a single algorithm.
10.2 A practical method-choice guide
For most metabolomics studies, statistical analysis can be organized around a few common goals:
Do I want to test individual metabolites? Use univariate methods such as t-tests, ANOVA, LIMMA, or regression models with multiple-testing correction.
Do I want to explore overall sample structure without using group labels? Use unsupervised methods such as PCA, clustering, and exploratory visualization.
Do I want to build a supervised separation model between groups? Consider PLS-DA, sPLS-DA, OPLS-DA, regularized regression, random forest, or SVM, but only with careful validation.
Do I want feature selection for biomarker discovery? Use sparse supervised methods, penalized regression, or tree-based models, and compare them with simpler baselines.
Do I want network-level or systems-level interpretation? Use pathway analysis and network analysis after normalization, annotation, and differential analysis are already stable.
As a rough guide:
Small two-group untargeted study: start with PCA, QC checks, t-test or LIMMA, and simple effect-size reporting.
Multi-group or covariate-rich study: use ANOVA, linear models, LIMMA, and adjusted regression rather than many separate t-tests.
Biomarker panel discovery: compare univariate methods with sPLS-DA, regularized regression, and random forest instead of relying on one classifier.
Mechanism-oriented study: combine univariate analysis, PCA, pathway analysis, and network interpretation.
Prediction-focused study with external validation: supervised ML is reasonable, but model selection, nested validation, and batch control become critical.
10.3 Differences analysis
After we obtain corrected peaks across samples, the next step is to find the differences between two groups. For comparisons among more than two groups, ANOVA or the Kruskal-Wallis test can be used. The basic idea behind the statistical analysis is to find meaningful differences between groups and extract the corresponding ions or peak groups.
So how do we find the differences? In most metabolomics software, this task is handled by a t-test, which reports p-values and fold changes. If you only compare two groups on one peak, that is fine. However, if you compare two groups across thousands of peaks, any statistics textbook will warn you about false positives. For a single comparison at the 0.05 significance level, there is a 5% chance of a false positive. For two comparisons, the chance of at least one false positive is \(1-0.95^2\). For 10 comparisons, it is \(1-0.95^{10} = 0.4012631\), and for 100 comparisons, \(1-0.95^{100} = 0.9940795\). You would almost certainly make mistakes in your results.
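The arithmetic above is easy to verify directly:

```python
# Family-wise error: the chance of at least one false positive among
# m independent tests at significance level alpha is 1 - (1 - alpha)^m.
alpha = 0.05
fwer = {m: 1 - (1 - alpha) ** m for m in (1, 2, 10, 100)}
print(fwer[10])   # ~0.401
print(fwer[100])  # ~0.994
```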
In omics studies with many simultaneous tests, false discovery rate (FDR) control is standard practice. I suggest using q-values to control the FDR: if a peak's q-value is below 0.05, we expect fewer than 5% false selections among all comparisons in the dataset with q-values at or below that threshold. We could also use the local false discovery rate, which gives an FDR estimate for an individual peak, although such values are hard to estimate accurately.
Ortmayr and colleagues suggested that fold change might be better than p-values for finding the differences (Ortmayr et al. 2016).
10.3.1 T-test or ANOVA
If a peak shows significant differences between two groups, a t-test can be used to find it; for multiple groups, ANOVA is the analogue. However, when many hypothesis tests are performed, the probability of false positives increases, so false discovery rate (FDR) control is required. Q-values or adjusted p-values can be used in this situation. At a given confidence level, we can then find peaks with significant differences after FDR control.
10.3.2 LIMMA
Linear Models for Microarray Data (LIMMA) can also be used for high-dimensional data such as metabolomics. LIMMA uses a moderated t-statistic, estimated with the empirical Bayes method for differential expression: a hierarchical model shrinks each peak's t-statistic toward the behavior of all peaks, which makes the estimates more robust. In LIMMA, a known batch-effect variable can be added as a covariate in the model. Although LIMMA differs from the t-test and ANOVA, p-values and FDR control can still be applied to its results.
10.3.3 Bayesian mixture model
Another way to perform difference analysis is a Bayesian mixture model without p-values. Such a model does not use hypothesis testing and instead directly generates posterior estimates of the parameters. A posterior probability can then be used to check whether certain peaks are related to different conditions. To compare a classical model such as LIMMA with a Bayesian mixture model, simulation is needed to find an appropriate cutoff.
10.4 Exploratory multivariate analysis
10.4.1 PCA
In metabolomics, PCA is primarily an exploratory data analysis method. It is widely used for visualization, but its value goes beyond plotting alone. PCA compresses a high-dimensional data matrix into a smaller set of orthogonal components that capture as much variance as possible. For example, instead of visualizing 100 metabolites across 100 samples directly, PCA projects the data into a smaller number of principal components that summarize the dominant structure of the dataset.
This projection makes it easier to inspect sample relationships, outliers, QC clustering, and batch structure. Samples with similar metabolite patterns will tend to project near each other in score plots, while dissimilar samples will separate along the principal components. Although two principal components are commonly plotted for convenience, they are usually not sufficient to represent the whole dataset. In many cases, several PCs are needed to explain most of the variance.
PCA can also be used to explore metabolite relationships through loadings, which describe how strongly each metabolite contributes to a principal component. Metabolites with similar loadings often show similar abundance patterns across samples. In this way, PCA helps summarize both sample-level structure and variable-level structure.
Besides visualization, PCA loadings can sometimes be used for downstream interpretation. Yamamoto and colleagues used statistical testing on loading-related correlations to identify peaks associated with biologically meaningful PCs (Yamamoto et al. 2014). Such approaches are most useful when a relatively small number of PCs explain a large proportion of the variance.
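As an illustration of what PCA actually computes, the first principal component of a small data matrix can be found by power iteration on the centered data (a plain-Python teaching sketch; real analyses would use `prcomp()` in R or an equivalent routine, and the `first_pc` name is illustrative):

```python
import random

def first_pc(X, iters=200, seed=1):
    """First principal-component loadings and scores via power iteration."""
    n, p = len(X), len(X[0])
    # mean-center each column (metabolite)
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]
    for _ in range(iters):
        # repeatedly multiply by X^T X: the direction of maximal
        # variance dominates, so v converges to the first loading vector
        Xv = [sum(row[j] * v[j] for j in range(p)) for row in Xc]
        w = [sum(Xc[i][j] * Xv[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # scores: projection of each (centered) sample onto the loading vector
    scores = [sum(row[j] * v[j] for j in range(p)) for row in Xc]
    return v, scores
```

The loadings `v` describe how strongly each variable contributes to the component, and the scores place the samples along it, which is exactly what score and loading plots display.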
10.4.2 Cluster Analysis
After collecting many samples and measuring the concentrations of many compounds in them, we may ask about the relationships between the samples. If you have sampling information such as date and location, you can use box plots or violin plots to explore relationships among those categorical variables. However, you can also use the data themselves to find potential relationships.
But how? If two samples’ data are almost the same, we might conclude that those samples come from the same underlying group. On the other hand, how do we define “the same” in the data?
Cluster analysis answers this by defining a “distance” to measure the similarity between samples. Mathematically, such distances can be defined in many ways, such as the sum of the absolute differences between samples.
For example, suppose we measured the amounts of compounds A, B, and C in two samples and obtained the results:
| Compounds(ng) | A | B | C |
|---|---|---|---|
| Sample 1 | 10 | 13 | 21 |
| Sample 2 | 54 | 23 | 16 |
The distance could be:
\[ distance = |10-54|+|13-23|+|21-16| = 59 \]
You could also use the sum of squares or another measure of similarity. After defining a distance, compute the distances between all pairs of samples. If two samples have the smallest distance, put them together as one group. Then recalculate the distances, merging small groups into larger ones until all samples are included in a single group, and draw a dendrogram of the process.
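The worked example above translates directly into code (the `manhattan` helper name is illustrative):

```python
def manhattan(a, b):
    """Manhattan distance: sum of absolute differences between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b))

sample1 = [10, 13, 21]  # amounts of compounds A, B, C in sample 1
sample2 = [54, 23, 16]
print(manhattan(sample1, sample2))  # 59, matching the worked example
```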
The next issue is how to cut the samples into clusters. You might set a cutoff and read the groups directly from the dendrogram. However, sometimes you are asked to cluster the samples into a fixed number of groups, such as three. In that situation, you need k-means cluster analysis.
The basic idea behind k-means is to generate three virtual samples (centers) and calculate the distances between those centers and all of the other samples. Each sample then has three distance values; it is assigned to the group of the center with the smallest one. Once the samples are classified into three groups, the center of each group is recalculated to give three new virtual samples. This process repeats until group membership no longer changes, and the samples are classified.
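The iterative procedure above can be sketched in plain Python (a teaching sketch; in practice use `kmeans()` in R or an equivalent implementation, and the `kmeans` name here is illustrative):

```python
import random

def kmeans(samples, k=3, iters=100, seed=0):
    """Plain k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    # initialize the "virtual samples" (centers) as k random real samples
    centers = [list(s) for s in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(iters):
        # assignment step: each sample joins its nearest center
        new_labels = [min(range(k), key=lambda c: sum(
            (x - y) ** 2 for x, y in zip(samples[i], centers[c])))
            for i in range(len(samples))]
        if new_labels == labels:
            break  # membership unchanged: converged
        labels = new_labels
        # update step: recompute each center as the mean of its group
        for c in range(k):
            members = [samples[i] for i in range(len(samples)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return labels, centers
```

On well-separated data the loop converges in a few iterations; on real data the result can depend on the random initialization, so multiple restarts are common.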
In short, the basic idea behind cluster analysis can be summarized as: define the distances, set your cutoff, and find the groups. In this way, you can reveal potential relationships among samples.
10.5 Supervised multivariate discrimination
10.5.1 PCA / PLS-DA / sPLS-DA / OPLS-DA
These methods are often used together in metabolomics papers, but they answer different questions and should not be treated as interchangeable.
| Method | Uses class labels | Main purpose | Strengths | Main risks / limitations |
|---|---|---|---|---|
| PCA | No | Explore global structure and major variance directions | Simple, robust, useful for QC, outliers, and batch patterns | May miss group separation when biological variance is smaller than technical variance |
| PLS-DA | Yes | Find latent components related to class separation | Handles collinearity well and gives interpretable loadings | Easily overfits in small high-dimensional studies |
| sPLS-DA | Yes | Class separation with embedded feature selection | More suitable for biomarker discovery because it removes many irrelevant variables | Feature selection can still be unstable without resampling |
| OPLS-DA | Yes | Separate predictive variation from orthogonal variation | Often gives clearer visual separation and easier interpretation | Can appear more convincing than justified if validation is weak |
In practical terms:
PCA is usually the first multivariate method to run because it does not use labels and is useful for checking QC clustering, outliers, batch effects, and dominant variance structure.
PLS-DA is useful when the goal is supervised discrimination and interpretation of class-related latent variables, but it should never be trusted from score plots alone.
sPLS-DA is often a better choice than standard PLS-DA when the goal is feature prioritization, because sparse penalties reduce noise from irrelevant metabolites.
OPLS-DA can be useful when you want to isolate variation related to the class from variation unrelated to the class, but it requires especially careful validation because attractive separation plots can be misleading.
No supervised separation method should be interpreted without cross-validation, permutation testing, and comparison with simpler alternatives.
10.5.2 PLS-DA
PLS-DA, OPLS-DA, and variants such as HPSO-OPLS-DA (Yang et al. 2017) can be used.
Partial least squares discriminant analysis (PLS-DA) was first used in the 1990s, although partial least squares (PLS) itself was proposed in the 1960s by Herman Wold. Principal component analysis produces a weight matrix reflecting the covariance structure among the variables, while partial least squares produces a weight matrix reflecting the covariance structure between the variables and the classes. After rotation by the weight matrix, the new variables capture the relationship with the classes.
The classification performance of PLS-DA is identical to linear discriminant analysis (LDA) if class sizes are balanced or the columns are adjusted according to the class means. If the number of variables exceeds the number of samples, LDA can be performed on the principal components. Quadratic discriminant analysis (QDA) can model nonlinear relationships between variables, while PLS-DA is better suited to collinear variables. As a pure classifier, however, PLS-DA offers little advantage; its strength is that the model reveals relationships among variables, which is not the goal of a regular classifier.
Different PLS-DA algorithms (Andersson 2009) produce different scores, while PCA always returns the same scores for a fixed algorithm. In PCA, both the scores and the loadings are orthogonal. In PLS (Wold), only the scores are orthogonal; in PLS (Martens), only the loadings are orthogonal. The details of using such methods are discussed in (Brereton and Lloyd 2018).
Sparse PLS discriminant analysis (sPLS-DA) applies an L1 penalty for variable selection to remove the influence of unrelated variables, which makes sense for high-throughput omics data (Lê Cao et al. 2011).
For OPLS-DA, the S-plot can be used to find important features (Wiklund et al. 2008).
10.5.3 When to choose which method
For typical metabolomics study designs:
Exploratory cohort profiling: use PCA first, then clustering if you want to inspect natural sample grouping.
Two-class discrimination with moderate sample size: use PCA for diagnostics and then compare PLS-DA with simpler classifiers or univariate baselines.
Biomarker shortlist generation in high-dimensional data: prefer sPLS-DA or regularized regression because they embed feature selection.
Need to separate class-related from unrelated structured variation: consider OPLS-DA, but validate with permutation tests and external or nested cross-validation.
Very small sample size with many features: avoid over-interpreting any supervised latent-variable model; rely more on PCA, univariate analysis, and conservative reporting.
Prediction as the main goal: compare PLS-DA family methods against random forest, SVM, or penalized regression rather than assuming latent-variable methods are best.
10.6 Network analysis
10.6.1 Vertex and edge
Each node is a vertex, and the connection between nodes is an edge in the network. A connection can be directed or undirected depending on the relationship.
10.6.2 Build the network
Adjacency matrices are commonly used to build networks. An adjacency matrix is a square n-dimensional matrix in which the entry at row i and column j equals 1 if and only if vertices i and j are connected. In a directed network, the entry could be 1 for an edge from i to j and -1 for an edge from j to i.
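A minimal sketch of building such a matrix from an edge list (the `adjacency` helper name is illustrative):

```python
def adjacency(n, edges):
    """Undirected adjacency matrix from an edge list of (i, j) pairs."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: the matrix is symmetric
    return A

# a 4-node chain network: 0-1, 1-2, 2-3
A = adjacency(4, [(0, 1), (1, 2), (2, 3)])
```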
10.6.3 Network attributes
Vertex/edge attributes could be the group information or metadata about the nodes/connections. The edges could be weighted as attribute.
A path is a route from one node to another in the network, and you can find the shortest path between any pair of nodes. The largest shortest-path distance in a graph is called its diameter.
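Shortest paths and the diameter can be computed from an adjacency matrix with breadth-first search (an illustrative sketch; dedicated packages such as igraph provide optimized versions, and the function names here are assumptions):

```python
from collections import deque

def shortest_paths(A, start):
    """BFS distances from `start` over an adjacency matrix A;
    unreachable nodes keep distance -1."""
    n = len(A)
    dist = [-1] * n
    dist[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u][v] and dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(A):
    """Largest shortest-path distance over all pairs (connected graph)."""
    return max(max(shortest_paths(A, s)) for s in range(len(A)))

# the 4-node chain 0-1-2-3 has diameter 3
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
```

A network is connected exactly when no BFS distance remains -1 from any starting vertex.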
An undirected network is connected if there is a path from any vertex to any other. Connected networks can be further classified according to the strength of their connectedness. An undirected network with at least two independent paths between each pair of nodes is said to be biconnected.
The transitivity of a network is a crude summary of its structure. A high value means that nodes are well connected locally, forming dense subgraphs. Network datasets typically show high transitivity.
Maximum flows and minimum cuts can be used to find the largest flow volume and the bottleneck between two nodes. For example, if two hubs are connected through a single node, the maximum flow and minimum cut between nodes in different hubs are determined at that connecting node.
A sparse network has a number of edges similar to its number of nodes, while in a dense network the number of edges grows as a quadratic function of the number of nodes.
10.7 Machine Learning and Deep Learning
Machine learning (ML) and deep learning (DL) have been increasingly applied across the metabolomics workflow, from peak detection to biomarker discovery. However, proper application requires understanding both the opportunities and the pitfalls specific to high-dimensional, low-sample-size metabolomics data (Galal et al. 2022; Sen et al. 2021).
10.7.1 Where ML/DL is applied in metabolomics
ML/DL methods appear at multiple stages of the metabolomics pipeline:
Peak detection and deconvolution: Deep learning models can learn peak shapes from training data and improve detection in noisy spectra. NeatMS uses a convolutional neural network for peak quality assessment (Gloaguen et al. 2022), and 2D pseudo-color image-based deep learning has been applied to LC-MS feature detection (Zhao et al. 2021).
Spectral annotation: ML models predict molecular fingerprints from MS/MS spectra. Tools like CSI:FingerID (Dührkop et al. 2019), CANOPUS (Dührkop et al. 2020), MS2DeepScore (Huber et al. 2020), and MSBERT (H. Zhang et al. 2024) use various architectures (random forest, deep neural networks, transformers) to match or predict spectra. MS2Query enables analogue searching using learned spectral embeddings (de Jonge et al. 2023).
Retention time prediction: Models such as RT-Transformer (Xue et al. 2024) and Retip (Bonini et al. 2020) predict chromatographic retention times from molecular structure, adding an orthogonal dimension for annotation confidence.
Biomarker discovery and classification: Supervised models (random forest, SVM, PLS-DA, gradient boosting, neural networks) are commonly used to classify samples and select discriminating features. Ensemble methods and model stacking often outperform single models.
Data integration: Multi-omics integration benefits from ML approaches that can handle heterogeneous data types, such as MOFA (Multi-Omics Factor Analysis) and similarity network fusion.
10.7.2 Common pitfalls
Metabolomics datasets are particularly susceptible to ML overfitting due to the high number of features relative to the sample size (the “p >> n” problem). Common mistakes include:
Data leakage: Performing feature selection, normalization, or missing value imputation before splitting data into training/test sets. All preprocessing steps that use information across samples must be performed within the cross-validation loop.
Inadequate validation: Using only a single random train/test split instead of repeated k-fold cross-validation. For small sample sizes (n < 100), nested cross-validation is recommended to get unbiased performance estimates.
Ignoring batch effects: If batch structure correlates with the biological groups, ML models may learn batch signatures rather than biology. Always check whether your model predictions correlate with known technical factors.
Reporting bias: Reporting only the best-performing model among many tried, without adjusting for the model selection process. Permutation testing (shuffling labels to establish null performance) is essential to verify that classification accuracy exceeds what is expected by chance.
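A label-permutation test can be sketched as follows, assuming some accuracy-like `score` function (both the score function and the dataset here are toy illustrations, not a real fitted model):

```python
import random

def score(X, y):
    """Toy classifier accuracy: predict class 1 when the first
    feature is positive (hypothetical stand-in for a fitted model)."""
    preds = [1 if row[0] > 0 else 0 for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def permutation_pvalue(score, X, y, n_perm=1000, seed=7):
    """Fraction of label shuffles scoring at least as well as the
    real labels (add-one corrected so the p-value is never zero)."""
    rng = random.Random(seed)
    observed = score(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any real label-feature association
        if score(X, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

X = [[1], [2], [3], [-1], [-2], [-3]]
y = [1, 1, 1, 0, 0, 0]
p = permutation_pvalue(score, X, y)
```

In a real study the `score` call would include the full cross-validated training pipeline, so that the null distribution reflects the entire model-selection process.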
10.7.3 Best practices
Practical guidance for training and reporting ML models in metabolomics has been proposed (Zhao et al. 2025). The EMBRACE checklist provides a framework for environmental ML, emphasizing baseline reporting and comprehensive evaluation (Zhu et al. 2024). Key recommendations include:
Always report the full cross-validation scheme, including how hyperparameters were tuned
Include a permutation test to verify statistical significance of classification accuracy
Report multiple metrics (accuracy, AUC, sensitivity, specificity, F1) rather than a single number
Make code and processed data available for reproducibility
Compare ML results against simple baselines (e.g., univariate t-test + fold change) to justify the added complexity
For feature importance, use model-agnostic methods (e.g., SHAP values, permutation importance) that provide interpretable results
10.7.4 Deep learning considerations
Deep learning models (autoencoders, variational autoencoders, transformers, graph neural networks) are powerful but typically require larger training sets than classical ML. In metabolomics, they are most useful when:
Large public spectral databases are available for pretraining (e.g., MS/MS spectra prediction)
The task involves pattern recognition in complex spectral or image data
Transfer learning from related domains can compensate for limited sample sizes
For classification tasks with fewer than a few hundred samples, classical methods (random forest, gradient boosting, regularized regression) often perform as well as or better than deep learning, with the added benefit of interpretability (Sen et al. 2021).
10.8 Causal inference in metabolomics
Standard metabolomics analyses identify associations between metabolites and outcomes, but association does not imply causation. Confounding, reverse causation, and measurement error can all produce spurious associations. Several approaches are now being applied to strengthen causal claims from metabolomics data.
10.8.1 Mendelian randomization
Mendelian randomization (MR) uses genetic variants as instrumental variables to estimate the causal effect of an exposure (e.g., a metabolite level) on an outcome (e.g., disease risk). Because genetic variants are randomly allocated at conception, they are generally not affected by the confounders that complicate observational studies. With the availability of large-scale genome-wide association studies (GWAS) for hundreds of circulating metabolites, two-sample MR has become a popular approach to test whether a metabolite is causally related to a health outcome (Holmes et al. 2018). The MR-Base platform and the TwoSampleMR R package provide accessible tools for conducting such analyses (Hemani et al. 2018).
Key assumptions of MR include: the genetic instrument must be associated with the metabolite (relevance), must not be associated with confounders (independence), and must affect the outcome only through the metabolite (exclusion restriction). Violations of these assumptions, particularly horizontal pleiotropy, can bias results. Sensitivity analyses such as MR-Egger, weighted median, and MR-PRESSO are essential to assess robustness.
10.8.2 Practical considerations
For beginners applying causal inference methods in metabolomics, it is worth noting that MR works best for metabolites with strong genetic determinants (high heritability) and is less powerful for metabolites primarily driven by recent environmental exposure. Combining MR evidence with traditional observational associations and biological pathway knowledge provides the strongest basis for causal claims. Mediation analysis and structural equation modeling are additional tools for understanding how metabolites sit on the causal path between exposure and disease.
10.9 Software
caret can employ more than 200 statistical models in a general framework for building and selecting models. It can also show variable importance for some of the models.
caretEnsemble provides functions for creating ensembles of caret models.
pROC provides tools for visualizing, smoothing, and comparing receiver operating characteristic (ROC) curves. The (partial) area under the curve (AUC) can be compared with statistical tests based on U-statistics or the bootstrap, and confidence intervals can be computed for (p)AUC or ROC curves.
gWQS fits weighted quantile sum (WQS) regressions for continuous, binomial, multinomial, and count outcomes.
Community ecology tools can also be used to analyze metabolomics data (Passos Mansoldo et al. 2022).