# Reactomics and Paired Mass Distance Analysis
### Miao Yu
### 2018/11/11 (updated: 2019-12-10)

---

## MS-based Targeted/Untargeted Analysis

- Targeted analysis and untargeted analysis are designed for different purposes
- They could be part of one workflow for certain research

---

## Workflow for Untargeted Analysis

.large[
- [Sample collection]
- [Pretreatment]
- [Instrumental analysis (Mass Spectrometry)]
- [From raw data to peaks in each sample]
- Align peaks and correct retention time across multiple samples
- Fill in missing peaks for the aligned peak list
- Peak list
  - Peaks (mass-to-charge ratio @ retention time) in rows
  - Samples in columns
- Annotation for peaks
- Validation by standards (targeted analysis)
- [Prediction/Inference for scientific purposes]
]

---

## Demo of XC-MS Data

---

## Demo of Peaks

---

## Demo of Retention Time Correction

<div class="figure" style="text-align: center">
<img src="https://yufree.github.io/presentation/figure/obiwarp.gif" alt="Demo of Obiwarp" width="50%" />
<p class="caption">Demo of Obiwarp</p>
</div>

.half[
Prince, J. T., & Marcotte, E. M. (2006). Chromatographic Alignment of ESI-LC-MS Proteomics Data Sets by Ordered Bijective Interpolated Warping. Analytical Chemistry, 78(17), 6140–6152. doi:10.1021/ac0605344
]

???
Loess alignment uses local regions to align the peaks, whereas obiwarp aligns runs with ordered bijective interpolated dynamic time warping. Raw data from two LC−MS runs, whether successive fractions or across different biological conditions, (1) is interpolated into a (2) uniform matrix (or rectilinear matrix). (3) An all vs all similarity matrix of the spectra is constructed. (4) The similarity matrix distribution is mean centered and normalized by the standard deviation. (5) Dynamic programming is performed by adding similarity scores along a recursively generated optimal path while off-diagonal transitions are penalized by either a local or global gap penalty to give (6) an additive score matrix. (7) Pointers are kept in a traceback matrix used to deliver (8) the optimal alignment path. (9) High scoring points in the optimal path are selected to create a bijective (one-to-one) mapping, which is used as anchors for PCHIP interpolation to generate a smooth warp function. (II) Verification and optimization. (11) MS/MS spectra from the raw MS runs are searched via SEQUEST and Peptide/Protein Prophet to determine peak identities. (12) High-confidence identifications are selected and (13) the overlapping set of peptide identifications (after filtering outliers) is used as the alignment standard. (14) The warp function produced through the comparison of MS data is applied to the standards. (15) The ideal alignment would shift all standards to the diagonal. The accuracy of an alignment is calculated as the sum of the square residuals from the diagonal.

---

## Demo of Peaks Filling

---

## Demo of Many XC-MS Data

---

## Major issue

<div class="figure" style="text-align: center">
<img src="https://yufree.github.io/presentation/figure/cat.jpg" alt="Annotation is similar to finding the real cat in this picture" width="42%" />
<p class="caption">Annotation is similar to finding the real cat in this picture</p>
</div>

---

## Annotation for peaks

- Predefined rules between peaks/features and compounds

- Generate pseudo-spectrum

- Search database or *in silico* prediction to identify compounds

- Build the links between compounds by pathway/network analysis

> Features -> Compounds -> Relationship among compounds

- Problems
    - Time consuming - too many peaks
    - Sensitivity - DDA or MS/MS
    - Standards coverage

---

## My Idea

> Features -> Compounds -> Relationship among compounds

- You ACTUALLY don't need people's (compounds') names to know their relationships

From [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:A_Sunday_on_La_Grande_Jatte,_Georges_Seurat,_1884.jpg): A Sunday on La Grande Jatte, Georges Seurat

???
- All compounds from a metabolomics study form a snapshot of metabolites and parent compounds
- We could find the relationships among people without knowing the name of each person
- Mass spectrometry could measure the distances without knowing the names of the compounds

---

## My Idea

> Features -> ~~Compounds~~ -> Relationship among compounds

- Mass spectrometry could directly measure reactions

???
- Annotation is not really necessary for certain scientific problems
- The relationships among compounds, i.e. the reactions, are what matter

---
## Why Reactions?

- Unit: Gene(5) < Protein(20+2) < Metabolite(100K) < Compound(100M)

- Combination: Gene(20,000-25,000) < Protein(20,000-25,000) < Compound(???)

- For small molecules, **combination** means a chemical reaction, i.e. a paired mass distance

---

## Why PMD?

- [Nuclear Binding Energy](https://en.wikipedia.org/wiki/Nuclear_binding_energy)

`$$\Delta m = Zm_{H} + Nm_{n} - M$$`
- The missing mass was converted into energy ( `$E=mc^2$` ) and emitted when the atom was formed

- Atoms -> Compounds -> Mass distances between compounds

- **Paired Mass Distances (PMDs)** are unique

- **High-resolution** mass spectrometry WINS
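
As a worked example of the formula above, take carbon-12 ( `$Z = 6$`, `$N = 6$` ), whose atomic mass is exactly 12 u by definition. The rounded masses `$m_H \approx 1.007825$` u and `$m_n \approx 1.008665$` u are standard values and are not taken from this deck:

`$$\Delta m = 6(1.007825) + 6(1.008665) - 12.000000 \approx 0.09894\ \mathrm{u}$$`

This corresponds to roughly 92 MeV of binding energy (1 u is about 931.5 MeV). It is the reason exact atomic masses are not integer multiples of a nucleon mass, so each elemental composition change carries its own characteristic mass distance.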

???
- Mass defects could be transferred from atom to paired mass distance
- HRMS could measure PMDs for qualitative analysis

---

## Sources of PMDs in the real data

### Where is PMD?

- In-source reactions
  - `$[M+H]^+$` vs. `$[M+Na]^+$`
    - 21.982 Da

- Lipid `$-[CH_2]-$`
  - 14.016 Da

- Xenobiotic metabolism
  - Phase I hydroxylation
    - 15.995 Da
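
The three distances on this slide can be reproduced from monoisotopic masses. The snippet below is a minimal sketch: the mass table uses standard rounded values, and the dictionary keys and the helper `formula_mass` are hypothetical names introduced only for illustration.

```python
# Monoisotopic masses (Da) of the most abundant isotopes, rounded to 6 decimals.
MONO = {"H": 1.007825, "C": 12.000000, "O": 15.994915, "Na": 22.989770}

def formula_mass(counts):
    """Sum monoisotopic masses; a negative count means the atom is lost."""
    return sum(MONO[element] * n for element, n in counts.items())

# Elemental changes behind the three PMD sources listed on this slide.
pmd_sources = {
    "[M+Na]+ vs [M+H]+": {"Na": 1, "H": -1},
    "lipid CH2 unit": {"C": 1, "H": 2},
    "hydroxylation (+O)": {"O": 1},
}

pmds = {name: round(formula_mass(c), 3) for name, c in pmd_sources.items()}
for name, value in pmds.items():
    print(f"{name}: {value} Da")
```

Because a PMD depends only on the elemental change, the same value appears for any pair of compounds linked by that change, whatever the rest of the molecule looks like.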

---

## Quantitative and Qualitative analysis for Reaction

### KEGG reaction database

|   PMD   | Freq |                        Example                        |
|:-------:|:----:|:-----------------------------------------------------:|
|  1.008  | 2037 |     NAD(+) + succinate <=> fumarate + H(+) + NADH     |
|  2.016  | 1748 | NAD(+) + propanoyl-CoA <=> acryloyl-CoA + H(+) + NADH |
| 15.995  | 1170 |                ATP + GDP <=> ADP + GTP                |
| 13.979  | 1122 |   deoxynogalonate + O2 <=> H(+) + H2O + nogalonate    |
| 17.003  | 929  | H2O + hypotaurine + NAD(+) <=> H(+) + NADH + taurine  |
| 79.966  | 750  |         ATP + H2O <=> ADP + H(+) + phosphate          |
| 14.016  | 611  |  acetyl-CoA + propanoate <=> acetate + propanoyl-CoA  |
|    0    | 533  |              L-glutamate <=> D-glutamate              |
| 162.053 | 365  |       H2O + lactose <=> D-galactose + D-glucose       |
| 18.011  | 361  |        L-serine <=> 2-aminoprop-2-enoate + H2O        |

- Real reactions contain ions
- Skewed by known reactions

---

## Quantitative and Qualitative analysis for Reaction

### HMDB compounds database

| PMD   | C | H | O |
|:------|:-:|:-:|:-:|
|14.016 | 1 | 2 | 0 |
|2.016  | 0 | 2 | 0 |
|28.031 | 2 | 4 | 0 |
|26.016 | 2 | 2 | 0 |
|15.995 | 0 | 0 | 1 |
|12     | 1 | 0 | 0 |
|56.063 | 4 | 8 | 0 |
|42.047 | 3 | 6 | 0 |
|30.011 | 1 | 2 | 1 |
|24     | 2 | 0 | 0 |

- Dominated by C, H and O
- Structure or reaction?

???
- We need a quantitative, mass-ready database for PMD annotation

---
## Quantitative and Qualitative analysis for Reaction

### HMDB compounds database

| Change | PMD (3 decimals) | frequency | accuracy | PMD (2 decimals) | frequency | accuracy |
|:-----|:------:|:---------:|:--------:|:-----:|:---------:|:--------:|
|+C2H  | 14.016 |   4934    |  0.9755  | 14.02 |   8003    |  0.6014  |
|+2H   | 2.016  |   4909    |  0.9703  | 2.02  |   7959    |  0.5984  |
|+2C4H | 28.031 |   4878    |  0.9783  | 28.03 |   7799    |  0.6119  |
|+2C2H | 26.016 |   4229    |  0.9775  | 26.02 |   7343    |  0.5630  |
|+O    | 15.995 |   4214    |  0.9808  | 15.99 |   7731    |  0.5346  |
|+C    | 12.000 |   3861    |  0.9826  | 12.00 |   7145    |  0.5310  |
|+4C8H | 56.063 |   3861    |  0.9653  | 56.06 |   6699    |  0.5564  |
|+3C6H | 42.047 |   3771    |  0.9737  | 42.05 |   6558    |  0.5599  |
|+C2HO | 30.011 |   3698    |  0.9440  | 30.01 |   6761    |  0.5163  |
|+2C   | 24.000 |   3689    |  0.9810  | 24.00 |   6963    |  0.5197  |

---
## Quantitative and Qualitative analysis for Reaction

### HMDB compounds database

| Change | PMD (1 decimal) | frequency | accuracy | PMD (0 decimals) | frequency | accuracy |
|:-----|:----:|:---------:|:--------:|:---:|:---------:|:--------:|
|+C2H  | 14.0 |   50419   |  0.0955  | 14  |  156245   |  0.0354  |
|+2H   | 2.0  |   50467   |  0.0944  |  2  |  156260   |  0.0352  |
|+2C4H | 28.0 |   50797   |  0.0939  | 28  |  155410   |  0.0356  |
|+2C2H | 26.0 |   48517   |  0.0852  | 26  |  154346   |  0.0309  |
|+O    | 16.0 |   51278   |  0.0806  | 16  |  155811   |  0.0307  |
|+C    | 12.0 |   49335   |  0.0769  | 12  |  155339   |  0.0283  |
|+4C8H | 56.1 |   36417   |  0.1026  | 56  |  151894   |  0.0286  |
|+3C6H | 42.0 |   49808   |  0.0737  | 42  |  153764   |  0.0275  |
|+C2HO | 30.0 |   51241   |  0.0681  | 30  |  154369   |  0.0260  |
|+2C   | 24.0 |   48099   |  0.0752  | 24  |  154278   |  0.0273  |
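
One way to see why accuracy drops as digits are removed is to round a few elemental changes that share the same nominal mass. This is a hedged sketch: the three candidate changes and the helper `distinct_pmds` are chosen for illustration and are not taken from the HMDB analysis itself.

```python
# Monoisotopic masses (Da), standard rounded values.
MONO = {"H": 1.007825, "C": 12.000000, "N": 14.003074, "O": 15.994915}

# Three different elemental changes that all have a nominal mass of 16.
candidates = {
    "+O": MONO["O"],
    "+NH2": MONO["N"] + 2 * MONO["H"],
    "+CH4": MONO["C"] + 4 * MONO["H"],
}

def distinct_pmds(digits):
    """Number of distinct PMD values left after rounding to `digits` decimals."""
    return len({round(mass, digits) for mass in candidates.values()})

resolved = {digits: distinct_pmds(digits) for digits in (0, 1, 2, 3)}
print(resolved)
```

With two or more decimals the three changes stay separate, while at one decimal or fewer they collapse into a single value, so one PMD no longer points to one elemental change.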

---

## Quantitative and Qualitative analysis for Reaction

### Static vs. dynamic

- Static mass pairs: paired intensity ratio is stable across samples
- Dynamic mass pairs: paired intensity ratio changes across samples
- For example, [A,B], [C,D] and [E,F] are involved in the same PMD:

|  A   |  B  | Intensity ratio |  C  | D  | Intensity ratio |  E  |  F  | Intensity ratio |
|:----:|:---:|:---------:|:---:|:--:|:---------:|:---:|:---:|:---------:|
| 100  | 50  |    2:1    | 100 | 50 |    2:1    | 30  | 40  |    3:4    |
| 1000 | 500 |    2:1    | 10  | 95 |   2:19    | 120 | 160 |    3:4    |

- [A,B] and [E,F] could be used for quantitative analysis of a certain PMD (RSD cutoff: 30%)
- [C,D] could be used to check the dynamics of a specific reaction
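
The static/dynamic split can be sketched with the intensities from the table above. This assumes the RSD is computed on the paired intensity ratio using the sample standard deviation; the function name `ratio_rsd` and the data layout are hypothetical.

```python
from statistics import mean, stdev

# Intensities from the table above: one (first ion, second ion) tuple per sample.
pairs = {
    "[A,B]": [(100, 50), (1000, 500)],
    "[C,D]": [(100, 50), (10, 95)],
    "[E,F]": [(30, 40), (120, 160)],
}

def ratio_rsd(samples):
    """Relative standard deviation (%) of the paired intensity ratio."""
    ratios = [first / second for first, second in samples]
    return 100 * stdev(ratios) / mean(ratios)

RSD_CUTOFF = 30  # percent, the cutoff quoted on this slide

labels = {
    name: "static" if ratio_rsd(samples) <= RSD_CUTOFF else "dynamic"
    for name, samples in pairs.items()
}
print(labels)
```

Here [A,B] and [E,F] keep a constant ratio and fall under the cutoff, while the ratio of [C,D] changes between the two samples and is flagged as dynamic.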

???
- Response factor is the slope of the calibration curve for a certain compound
- Total intensity of all pairs with the same PMD
- Count once for ions involved in multiple reactions

---
class: inverse, center, middle

# Reactomics Application

## Exhaustive screen

---

## Sensitivity matters

- Targeted analysis could capture peaks with low intensity

- Untargeted analysis loses sensitivity when it tries to capture all peaks

- Send unknown but independent peaks for MS/MS

---

## How many real compounds among features?

.half[
Mahieu, N. G., & Patti, G. J. (2017). Systems-Level Annotation of a Metabolomics Data Set Reduces 25 000 Features to Fewer than 1000 Unique Metabolites. Analytical Chemistry, 89(19), 10397–10406. doi:10.1021/acs.analchem.7b02380
]

---

## Gap between features and compounds

---

## GlobalStd Algorithm

.half[
Yu, M., Olkowicz, M., & Pawliszyn, J. (2019). Structure/reaction directed analysis for LC-MS based untargeted analysis. Analytica Chimica Acta, 1050, 16–24. doi:10.1016/j.aca.2018.10.062
]

---
## GlobalStd Algorithm Step 1

### Retention time cluster analysis

---

## GlobalStd Algorithm Step 2

### High frequency PMD analysis across RT clusters - example

- Based on the data itself, so the adducts/multiply charged ions/neutral losses/isotopologues do not need to be known in advance
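
Steps 1 and 2 can be illustrated with a toy peak list. This is a simplified sketch and not the pmd package implementation: retention time groups are formed with a plain gap threshold as a stand-in for the cluster analysis, and the peak values, function names, and thresholds are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy peak list of (m/z, retention time in seconds); values are made up.
peaks = [
    (100.000, 10.0), (121.982, 10.1),
    (200.000, 50.0), (221.982, 50.2),
    (300.000, 90.0), (321.982, 90.1), (350.500, 90.05),
]

def rt_groups(peaks, max_gap=1.0):
    """Step 1 (simplified): split RT-sorted peaks wherever the gap exceeds max_gap."""
    ordered = sorted(peaks, key=lambda p: p[1])
    groups = [[ordered[0]]]
    for peak in ordered[1:]:
        if peak[1] - groups[-1][-1][1] > max_gap:
            groups.append([])
        groups[-1].append(peak)
    return groups

def high_freq_pmds(groups, digits=2, min_groups=2):
    """Step 2 (simplified): keep PMDs seen in at least min_groups RT groups."""
    counts = Counter()
    for group in groups:
        pmds = {round(abs(a[0] - b[0]), digits) for a, b in combinations(group, 2)}
        counts.update(pmds)  # each PMD is counted once per RT group
    return {pmd: n for pmd, n in counts.items() if n >= min_groups}

groups = rt_groups(peaks)
frequent = high_freq_pmds(groups)
print(len(groups), frequent)
```

In this toy data the 21.98 Da distance recurs in every retention time group, so it is flagged as a likely adduct-type relationship without any predefined adduct list, while the distances seen only once are ignored.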

---
## GlobalStd Algorithm Step 3

### Independent peaks selection

---
## GlobalStd Algorithm Step 3

### Independent peaks selection - example

---
## GlobalStd Algorithm Step 3

### Why redundant?

- ~14.3% of peaks can capture variance similar to that of all peaks
- For CAMERA/RAMclust, the peak with the highest intensity in each pcgroup was selected as the independent peak

???
- Similar to isotope labeled results (5% peaks)
- Untargeted analysis does not mean big data

---

## Target compounds validation

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Independent peaks </th>
   <th style="text-align:right;"> Target compounds found </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> pmd </td>
   <td style="text-align:right;"> 985 </td>
   <td style="text-align:right;"> 18 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> CAMERA </td>
   <td style="text-align:right;"> 1297 </td>
   <td style="text-align:right;"> 15 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> RAMclust </td>
   <td style="text-align:right;"> 461 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> profinder </td>
   <td style="text-align:right;"> 6628 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
</tbody>
</table>

- 103 compounds for validation
- 36 compounds could be found among the 6885 features from xcms
- 7 could be found among the 6628 features from profinder untargeted analysis

---
## Untargeted MS/MS analysis - PMDDA

- Only use GlobalStd peaks for MS/MS analysis
    - Multiple injections

- MS/MS spectral library annotation on [GNPS](https://gnps.ucsd.edu)

- Comparison with Data Dependent Acquisition (DDA) (173 compounds)
    - Annotated 235 extra compounds, with 59 compounds overlapping
    - Fewer contaminant ions

???
- GNPS MS/MS annotation
- 235:59:114 PMDDA:overlap:DDA

---

## Untargeted MS/MS analysis - PMDDA

???
- GNPS MS/MS annotation
- 235:59:114 PMDDA:overlap:DDA

---

## Untargeted MS/MS analysis - PMD Annotation

- Use PMD and the rank of PMD for annotation

- Intensity filter (10%), robust to noise

- 957/1098 PMDR/HMDB QqQ data

- Some compounds share the same PMD (87%)

---
class: inverse, center, middle

# Reactomics Application

## Metabolites Discovery

---

## Metabolites of exogenous compound

- Environmental pollutant metabolites
- Drug metabolites

### Xenobiotic metabolism

- Phase I
    - Oxidation (R-H ⇒ R-OH, pmd 15.995 Da)
    - Reduction (R-C=O ⇒ R-C-OH, pmd 2.016 Da)

- Phase II
    - Methylation (R-OH ⇒ R-O-CH3, pmd 14.016 Da)
    - Sulfation (R-OH ⇒ R-O-SO3H, pmd 79.957 Da)
    - Acetylation (R-OH ⇒ R-O-COCH3, pmd 42.011 Da)
    - Glucuronidation (R-NH2 ⇒ R-NH-C6H9O7, pmd 192.027 Da)
    - Glycosylation (R-OH ⇒ R-O-C6H11O5, pmd 162.053 Da)
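
A reaction list like this can be turned into a simple screen for candidate metabolites of a known parent compound. The sketch below is a minimal illustration, assuming the parent and its metabolites are observed as the same ion type; the parent m/z, peak list, tolerance, and function name are hypothetical, and only a subset of the PMDs above is used.

```python
# A subset of the PMDs (Da) from the Phase I/II list above.
REACTION_PMDS = {
    "oxidation": 15.995,
    "reduction": 2.016,
    "methylation": 14.016,
    "acetylation": 42.011,
    "glycosylation": 162.053,
}

def candidate_metabolites(parent_mz, peak_mzs, tol=0.005):
    """Return (m/z, reaction) for peaks whose distance to the parent matches a PMD."""
    hits = []
    for mz in peak_mzs:
        for reaction, pmd in REACTION_PMDS.items():
            if abs((mz - parent_mz) - pmd) <= tol:
                hits.append((mz, reaction))
    return hits

# Hypothetical parent ion and peak list, for illustration only.
parent = 300.100
peaks = [302.116, 316.095, 350.000, 462.153]
hits = candidate_metabolites(parent, peaks)
print(hits)
```

Applying the same search again from each hit would extend the screen into a PMD network around the parent compound.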

---

## Metabolites of TBBPA in Pumpkin

- Mass defect analysis to screen Brominated Compounds

- Confirmation by synthesized standards

.half[
Hou, X., Yu, M., Liu, A., Wang, X., Li, Y., Liu, J., … Jiang, G. (2019). Glycosylation of Tetrabromobisphenol A in Pumpkin. Environmental Science & Technology. doi:10.1021/acs.est.9b02122
]

---

## Metabolites of TBBPA in Pumpkin

- TBBPA Metabolites PMD network

.half[
Hou, X., Yu, M., Liu, A., Wang, X., Li, Y., Liu, J., … Jiang, G. (2019). Glycosylation of Tetrabromobisphenol A in Pumpkin. Environmental Science & Technology. doi:10.1021/acs.est.9b02122
]

---

## KEGG reaction network

- Metabolites of four compounds

---

## Endogenous vs Exogenous

- T3DB Endogenous (255) vs Exogenous (705)

- Use top 20 high frequency PMDs

---

# Reactomics Application

## Biomarker Reaction

---

## Lung cancer

- MTBLS28: 1005 human urine samples

- PMD 2.02 Da shows differences between control and disease groups

---

## How

<div class="figure" style="text-align: center">
<img src="https://yufree.github.io/presentation/figure/owl.png" alt="Paper method vs. practical method in metabolomics" width="72%" />
<p class="caption">Paper method vs. practical method in metabolomics</p>
</div>

---

## Software

### [enviGCMS package](http://yufree.github.io/enviGCMS/)
  
  - Target analysis
  - Mass defect analysis

### [pmd package](http://yufree.github.io/pmd/)
  
  - Untargeted analysis
  - GlobalStd algorithm
  - Reactomics analysis

### [rmwf package](https://github.com/yufree/rmwf)

  - NIST 1950 data
  - Script

---

# Thanks

## Q&A

## miao.yu@mssm.edu