Reactomics: Using mass spectrometry as a chemical reaction detector

class: center, middle, inverse, title-slide

# Reactomics: Using mass spectrometry as a chemical reaction detector
## ID:304104
### Miao Yu, Lauren Petrick
### 2020-06-06

---

## Typical workflow for HRMS metabolomics

- Collect samples

- Acquire data (peaks) using mass spectrometry

- Annotate peaks for identification with compound name

- Build links between compounds using pathway/network analysis

> Sample -> Peaks -> Compounds -> Relationship among compounds

- Problems
    - Time consuming - too many peaks ~20k
    - Standards coverage - unknown unknown

???
Hello everyone, my name is Miao Yu and I am from the Icahn School of Medicine at Mount Sinai. My title is 'Reactomics: Using mass spectrometry as a chemical reaction detector'. I fully understand that you are tired of different 'omics', and I hope this one will make a difference.

Before discussing the concepts, let's review the general workflow for metabolomics or untargeted analysis. We collect samples and then perform the instrumental analysis. In terms of mass spectrometry, we collect peak signals. Then we process the data, turn the peak signals into a feature table, and annotate the features as compounds by either database search or prediction. The next step is building the links between those compounds by pathway analysis or network analysis. In summary, samples turn into peaks, peaks turn into compounds, and compounds turn into relationships among compounds. The final purpose of metabolomics is not the identification of a compound; the real scientific purpose is always the relationship among compounds.

However, there are some problems at the current stage. The first one is that it is time consuming. We need to deal with around 20 thousand peaks from each sample, or features from multiple samples. It's just impossible to check them one by one, and we need to focus on a reasonable number of compounds of interest. The second issue is standards coverage. I wonder how many of you have received comments from reviewer 3 about standards confirmation or annotation of unknown peaks. In most cases, metabolomics and untargeted analysis face the unknown unknown paradox: if a compound has never been found before, how could we annotate it based on current databases?

---

## Idea

> Sample -> Peaks -> ~~Compounds~~ -> Relationship among compounds

- Mass spectrum could directly measure relationship (reactions)

???
- Annotation is not really necessary for certain scientific problem
- Relationship among compounds or reaction matters

Here is my idea for addressing those problems. Let's review the information flow: samples, peaks, compounds, relationships among compounds... Wait, if we are trying to find the relationships among compounds, is there a way to skip the compound identification phase?

The answer should be yes. Let's take oxidation as an example. If it happens, no matter which compounds are involved, there should always be a mass difference of about sixteen daltons (15.995 Da) between those compounds. Since our mass spectrometer detects their mass-to-charge ratios, we could directly extract all those relationships by checking paired mass distances among the unknown peaks. The unit of the relationship between compounds should be the chemical reaction.
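
As a minimal sketch of the paired mass distance check described above, the snippet below scans a peak list for pairs separated by the oxidation PMD. The function name `find_pmd_pairs`, the tolerance, and the peak values are hypothetical and chosen only for illustration; it also assumes singly charged ions, so that m/z differences equal mass differences. It is not the pmd package's implementation.

```python
from itertools import combinations

def find_pmd_pairs(mzs, pmd, tol=0.002):
    """Return (low, high) m/z pairs whose difference matches pmd within tol (Da)."""
    pairs = []
    for low, high in combinations(sorted(mzs), 2):
        if abs((high - low) - pmd) <= tol:
            pairs.append((low, high))
    return pairs

# Hypothetical peak list: the m/z values are made up for illustration
peaks = [250.1200, 180.0634, 212.0532, 196.0583]

# Look for pairs separated by one oxygen atom (15.9949 Da)
oxidation_pairs = find_pmd_pairs(peaks, pmd=15.9949)
print(oxidation_pairs)
```

No compound is annotated at any point: the pairs are found from the mass differences alone.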

---
## Why Reactions?

- Unit: Gene (5) < Protein (20+2) < Metabolite (100K) < Compound (100M)

- Combination: Gene (20,000-25,000) < Protein (20,000-25,000) < Compound(???)

- Small molecule **combination** is a chemical reaction or paired mass distance

???
So why reactions? Compared with other omics, the major difference in metabolomics is the missing unit of relationship. For genomics, five nucleobases are the units and their combination unit is the base pair. For proteomics, twenty-two amino acids are the molecular units and their combination unit is the peptide bond. For small molecules, over one hundred million compounds are the units, while we actually only care about the ones that react with other molecules. In this case, the combination unit should be the chemical reaction. In terms of mass spectrometry, a reaction means a paired mass distance between different compounds.

---

## Why PMD?

- [Nuclear Binding Energy](https://en.wikipedia.org/wiki/Nuclear_binding_energy)

`$$\Delta m = Zm_{H} + Nm_{n} - M$$`
- The missing mass was converted into energy ( `$E=mc^2$` ) and emitted when the atom was made

- Atoms -> Compounds -> Mass distances between compounds

- **Paired Mass Distances (PMD)** is unique

- **High resolution** mass spectrometry WINs

???
- Mass defects could be transferred from atom to paired mass distance
- HRMS could measure PMDs for qualitative analysis

So why PMD? The best feature of PMD is that it comes from the nuclear binding energy and is almost unique to a specific elemental composition. In this case, if we have high resolution mass spectrometry, we could annotate a certain paired mass distance with an elemental composition and check whether it is linked to a known reaction.
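
To make the mass defect argument concrete, here is a small sketch that computes monoisotopic masses for three elemental compositions sharing the nominal mass 16 and shows that only a tight tolerance tells them apart. The helper names `formula_mass` and `annotate_pmd`, the candidate list, and the tolerances are assumptions for illustration; the isotope masses are standard values rounded to six decimals.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes, rounded
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074,
        "O": 15.994915, "S": 31.972071, "Na": 22.989770}

def formula_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C6H10O5' (no brackets)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONO[element] * (int(count) if count else 1)
    return total

# Three compositions with the same nominal mass of 16 Da
candidates = {name: formula_mass(name) for name in ("O", "CH4", "NH2")}

def annotate_pmd(measured, tol=0.005):
    """Return the candidate compositions within tol (Da) of a measured PMD."""
    return [name for name, mass in candidates.items()
            if abs(mass - measured) <= tol]

print(annotate_pmd(15.995))           # tight tolerance, high resolution
print(annotate_pmd(16.02, tol=0.05))  # loose tolerance, low resolution
```

With the tight tolerance only the oxygen composition matches, while the loose tolerance cannot separate the three candidates.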

---

## Sources of PMDs in real data

### Where is PMD?

.pull-left[
- Isotopologues
  - `$[M]^+$` `$[M+1]^+$`
  - 1.006 Da

- In-source reaction
  - `$[M+H]^+$` `$[M+Na]^+$`
  - 21.982 Da
]

.pull-right[
- Homologous series
  - Lipid `$-[CH_2]-$`
  - 14.016 Da

- Xenobiotic metabolism
  - Phase I hydroxylation
  - 15.995 Da
]

???
However, not all PMDs are linked to reactions among compounds. Some PMDs can be generated from a single compound, such as isotopologues, in-source reactions, or MS/MS fragment ions. Some PMDs can also reflect a homologous series such as lipids, which indicates structural similarity instead of a reaction relationship.

---

## Quantitative and Qualitative analysis in Reactomics

### KEGG reaction database

|   PMD   | Freq | ExampleReactionClass | ExampleEnzyme |
|:-------:|:----:|:--------------------:|:-------------:|
|  2.016  | 1732 |       RC00095        |   1.3.1.84    |
| 15.995  | 1169 |       RC00046        |   1.3.99.18   |
| 79.966  | 729  |       RC00002        |    3.6.1.3    |
| 14.016  | 594  |       RC00060        |    1.5.3.1    |
|    0    | 532  |       RC00302        |    5.1.1.3    |
| 18.011  | 359  |       RC00680        |    3.5.2.5    |
| 162.053 | 365  |       RC00049        |   3.2.1.23    |
| 159.933 | 243  |       RC02056        |   4.2.3.42    |
|  1.032  | 262  |       RC00006        |    1.4.1.2    |
| 42.011  | 237  |       RC00004        |   2.3.1.54    |

???
In this case, though we don't need to annotate all the compounds, a PMD database is still needed to annotate PMDs with elemental compositions and their corresponding reaction classes or enzymes. Here, I analyzed the PMDs of all the reactions in the KEGG reaction database and removed the ion-related PMDs, which are not detectable by mass spectrometry. I found that known reactions are actually highly enriched in a few PMDs. Reactions that form or break a double bond (PMD 2.016 Da) appeared in 1732 reactions, almost fifteen percent of the reactions in KEGG. In this case, for reactomics we could skip quantitative analysis of endless unknown compounds and focus on the changes of those enriched PMDs.
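
The enrichment idea can be sketched as a simple frequency count over reactant pairs. The reaction list below is hypothetical (it is not taken from KEGG), and `pmd_frequency` is an illustrative helper rather than the pmd package's API; rounding to three decimals mirrors the table on the slide.

```python
from collections import Counter

def pmd_frequency(reactions, digits=3):
    """Count rounded absolute mass differences over (substrate, product) mass pairs."""
    return Counter(round(abs(product - substrate), digits)
                   for substrate, product in reactions)

# Hypothetical (substrate mass, product mass) pairs in Da
reactions = [
    (180.063, 196.058),  # gain of O
    (300.100, 316.095),  # gain of O
    (150.050, 152.066),  # gain of H2
    (152.066, 150.050),  # loss of H2, same PMD as the gain
    (200.000, 214.016),  # gain of CH2
]

freq = pmd_frequency(reactions)
for pmd, count in freq.most_common():
    print(pmd, count)
```

Taking the absolute difference means a reaction and its reverse contribute to the same PMD, which is what a mass spectrometer would observe.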

---

## Quantitative and Qualitative analysis in Reactomics

### Static vs. dynamic

- Static mass pairs: paired intensity ratio is stable across samples
- Dynamic mass pairs: paired intensity ratio is not stable across samples
- For example, [A,B], [C,D] and [E,F] are involved in the same PMD:

|  A   |  B  | A:B intensity ratio |  C  | D  | C:D intensity ratio |  E  |  F  | E:F intensity ratio |
|:----:|:---:|:-------------------:|:---:|:--:|:-------------------:|:---:|:---:|:-------------------:|
| 100  | 50  |         2:1         | 100 | 50 |         2:1         | 30  | 40  |         3:4         |
| 1000 | 500 |         2:1         | 10  | 95 |        2:19         | 120 | 160 |         3:4         |

- [A,B] and [E,F] are static mass pairs, so they can be used in quantitative analysis (intensity ratio RSD below a 30% cutoff)
- [C,D] can be used to check dynamics of a specific reaction

???
- Response factor is the slope of calibration curve for certain compound
- Total intensity of all pairs with the same PMD
- Count once for ions involved in multiple reactions

Once we have the list of enriched PMDs, we need to define how to perform quantitative analysis of paired mass distances. Here I propose using static PMD pairs, which always show stable intensity ratios across samples, for quantitative analysis of reactions. After this discussion of concepts and technical issues, we can now look at some applications of reactomics.
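
A minimal sketch of the static versus dynamic distinction, using the intensities from the table on this slide. It assumes the relative standard deviation (RSD) is computed on the per-sample intensity ratios with the sample standard deviation; the helper name `ratio_rsd` is hypothetical.

```python
from statistics import mean, stdev

def ratio_rsd(intensity_a, intensity_b):
    """Relative standard deviation (%) of the per-sample intensity ratio a/b."""
    ratios = [a / b for a, b in zip(intensity_a, intensity_b)]
    return 100 * stdev(ratios) / mean(ratios)

# Intensities of each ion pair across the two samples in the slide table
pairs = {
    "A,B": ([100, 1000], [50, 500]),
    "C,D": ([100, 10], [50, 95]),
    "E,F": ([30, 120], [40, 160]),
}

RSD_CUTOFF = 30  # percent, the cutoff stated on the slide
static = {name: ratio_rsd(a, b) <= RSD_CUTOFF for name, (a, b) in pairs.items()}
print(static)
```

[A,B] and [E,F] keep a constant ratio and pass the cutoff, while the ratio of [C,D] changes between samples, so it is flagged as dynamic.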

---
class: inverse, center, middle

# Reactomics Application

## Metabolites Discovery

???
The first application is metabolite discovery. The current workflow for metabolite discovery is to predict metabolites based on an MS/MS database or pathway database and then match the predicted ions with real data. Here, based on reactomics, we propose a local recursive search for metabolite discovery.

---

## Metabolites of exogenous compound

- Environmental pollution metabolites
- Drug metabolites

### Xenobiotic metabolism

- Phase I
    - Oxidation (R-H => R-OH, pmd 15.995 Da)
    - Reduction (R-C=O => R-C-OH, pmd 2.016 Da)

- Phase II
    - Methylation (R-OH => R-O-CH3, pmd 14.016 Da)
    - Sulfation (R-OH => R-O-SO3H, pmd 79.957 Da)
    - Acetylation (R-OH => R-O-COCH3, pmd 42.011 Da)
    - Glucuronidation (R-NH2 => R-NH-C6H9O6, pmd 176.032 Da)
    - Glycosylation (R-OH => R-O-C6H11O5, pmd 162.053 Da)

???
Metabolite discovery is common in two types of studies: environmental pollutant metabolites and drug metabolites. In general, it involves the xenobiotic metabolism pathway. This usually contains phase one reactions, which are the oxidation or reduction of the original compounds, followed by phase two reactions, which add functional groups to increase the polarity of the compounds and help release those exogenous compounds into the environment. At the chemical reaction level, there are only a few PMDs associated with this process.

---

## Metabolites of TBBPA in Pumpkin

- Mass defect analysis to screen Brominated Compounds

- Confirmation by synthesized standards

.half[
Hou, X., Yu, M., Liu, A., Wang, X., Li, Y., Liu, J., … Jiang, G. (2019). Glycosylation of Tetrabromobisphenol A in Pumpkin. Environmental Science & Technology, 53(15), 8805–8812. doi:10.1021/acs.est.9b02122
]

???
Here I will use an example to demonstrate this process. This is a study to find the metabolites of tetrabromobisphenol A (TBBPA) in pumpkin. I am actually one of the authors of this published work. Here we used mass defect analysis to screen all the brominated compounds and then confirmed them with synthesized standards. You can see that we found seven metabolites of TBBPA. If you check the structures, you will find that the metabolites seem to follow a reaction flow. Metabolite one is the glycosylation product of TBBPA, and metabolite two is the malonylation product of metabolite one. Metabolite three is glycosylated on both hydroxyl groups of TBBPA. It seems the metabolites could be extended by certain PMDs.

---

## Metabolites of TBBPA in Pumpkin

- TBBPA Metabolites PMD network

???
Since I am one of the authors of this work, I have the raw data of this study and I reanalyzed the data in terms of reactomics. I gathered the five PMDs reported in this study: debromination, glycosylation, malonylation, methylation, and hydroxylation. You might notice that a PMD is different from a formula. For example, debromination involves both removing a bromine atom and adding a hydrogen atom, so the PMD cannot be shown as a single chemical formula but only in a plus and minus format. Anyway, I labeled all the previously found compounds as nodes of a network. In this case, metabolite discovery is not based on prediction but on the data itself. If the search finds a peak at a certain PMD, the node is extended and a new search begins at the new node for the predefined PMDs. In this way, we could find the metabolites of metabolites of metabolites of metabolites of metabolites in a few seconds, and it turns out that we still missed lots of TBBPA metabolites when we did not expect those reaction flows. Reactomics, however, shows the power to reveal all the possible metabolites.
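
The local recursive search can be sketched as a breadth-first expansion from a seed peak. This is only an illustration under stated assumptions: the function name `grow_pmd_network`, the tolerance, the two PMDs, and all m/z values are hypothetical, and each peak is linked only to the node that first reaches it. It is not the implementation in the pmd package.

```python
from collections import deque

def grow_pmd_network(peaks, seed, pmds, tol=0.002):
    """Expand from seed, linking any peak that differs from a found node by a listed PMD."""
    found = {seed}
    edges = []
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for mz in peaks:
            if mz in found:
                continue
            diff = abs(mz - current)
            for name, pmd in pmds.items():
                if abs(diff - pmd) <= tol:
                    found.add(mz)
                    edges.append((current, mz, name))
                    queue.append(mz)  # the new node becomes a search start
                    break
    return edges

# Hypothetical parent ion and peak list (values made up for illustration)
pmds = {"glycosylation": 162.053, "methylation": 14.016}
peaks = [500.000, 662.053, 824.106, 676.069, 300.500]

edges = grow_pmd_network(peaks, seed=500.000, pmds=pmds)
for parent, child, reaction in edges:
    print(parent, "->", child, reaction)
```

The peak at 824.106 is only reachable through 662.053, which is the "metabolite of a metabolite" case a one-step prediction from the parent would miss.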

---
class: inverse, center, middle

# Reactomics Application

## Endogenous or Exogenous

???
The second application of reactomics addresses an issue that confuses lots of researchers. Do we have a way to separate endogenous compounds from exogenous compounds in our samples? For example, suppose we collect serum samples and get the untargeted metabolite profile. We want to link a disease with certain unknown compounds that show differences between the control and disease groups. However, we have no clue whether such a compound is from ourselves or from the environment. So, how could reactomics help?

---

## Endogenous vs Exogenous

- T3DB Endogenous (255) vs Exogenous (705)

- Use top 10 high frequency PMDs

???
Here I use T3DB, which is the only database that annotates its compounds as endogenous or exogenous. I collected all 255 endogenous compounds and 705 exogenous compounds and pooled their exact masses together. Then I performed PMD analysis to get all possible PMDs among those compounds. As I mentioned before, PMDs are always enriched in a few high frequency ones. Here we didn't predefine the set but directly used the top 10 high frequency PMDs to construct the PMD network as we did for the pumpkin study, and here are the results.

I removed all the compounds without any high frequency PMD relationship with others, and you can find an interesting pattern for endogenous and exogenous compounds. Most of the endogenous compounds could be connected into a few large clusters. As for the exogenous compounds, though there are more of them, they could not form a larger network. In this case, we need to think about what endogenous means. It means the compound could be generated in a biological process, and biological processes are always enriched in a few high frequency PMDs. Exogenous compounds are new to the biological process, and we might not have too many pathways to metabolize them.
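
A rough sketch of the network construction described above: compute all pairwise mass differences, keep only the most frequent PMDs, connect the compounds they link, and report the connected clusters. The function `top_pmd_clusters`, the two-decimal rounding, and the masses are illustrative assumptions; compounds without any high frequency PMD link are dropped, as in the talk.

```python
from collections import Counter, defaultdict
from itertools import combinations

def top_pmd_clusters(masses, top_n=10, digits=2):
    """Cluster masses connected by the top_n most frequent pairwise mass differences."""
    diffs = {(a, b): round(abs(a - b), digits) for a, b in combinations(masses, 2)}
    top = {pmd for pmd, _ in Counter(diffs.values()).most_common(top_n)}
    graph = defaultdict(set)
    for (a, b), pmd in diffs.items():
        if pmd in top:
            graph[a].add(b)
            graph[b].add(a)
    clusters, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk of one connected component
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        clusters.append(members)
    return sorted(clusters, key=len, reverse=True)

# Hypothetical exact masses: four differ stepwise by 2.016 Da, two are unrelated
masses = [100.0, 102.016, 104.032, 106.048, 250.3, 377.7]
clusters = top_pmd_clusters(masses, top_n=1)
print(clusters)
```

Only the four masses chained by the most frequent PMD form a cluster; the two unrelated masses never enter the network.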

---

## KEGG reaction network

- Metabolites of four compounds

- Average network distance: Glucose(9.74), 5-Cholestene(6.55), Caffeine(3.31), Bromophenol(1.8), TBBPA(3.88 for pumpkin)

???
Let's see some real reactions from the KEGG database. Here I use the top 10 high frequency PMDs from KEGG and generate the reaction network for four compounds. This is glucose, which is very important to any biological process and is an endogenous compound. You can see its network extends along all of the high frequency PMDs, and lots of compounds are coupled into this network. Then we see cholestene, which is also an endogenous compound, though not as important as glucose. Its network is smaller. Then we see caffeine, which is an exogenous compound for most animals. It also shows a network, but the network structure is much simpler than those of the endogenous compounds. Then we see bromophenol, which is also an exogenous compound, and its network shows that it is not involved in many reactions.

OK, we cannot always use our eyes to annotate compounds as endogenous or exogenous. Based on reactomics, the average network distance of a compound's high frequency PMD network could be a good option. The average network distance is the average number of PMD connections separating compounds within a network. If this number is large, we will see a much larger network for that compound. The number for glucose is 9.74 and for bromophenol is 1.8.
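
One way to compute such a number is the mean shortest-path length over all connected node pairs, sketched below with a breadth-first search. Treating "average network distance" as the mean shortest-path length is an assumption here, and `mean_distance` and the toy edge lists are hypothetical.

```python
from collections import deque

def mean_distance(edges):
    """Mean shortest-path length (in edges) over all connected node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    def bfs(start):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    total, count = 0, 0
    for start in graph:
        for node, d in bfs(start).items():
            if node != start:
                total += d
                count += 1
    # Each pair is visited from both ends, which leaves the mean unchanged
    return total / count if count else 0.0

# Toy PMD networks: a chain of four compounds and a star around one compound
chain = [("a", "b"), ("b", "c"), ("c", "d")]
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
print(mean_distance(chain), mean_distance(star))
```

The chain, where compounds are reached through successive reactions, has a larger mean distance than the star with the same number of nodes.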

To be honest, there is no hard cutoff for whether a compound is endogenous or exogenous. Some compounds are endogenous for plant samples but exogenous for animal samples. In terms of reactomics, I suggest such a cutoff should be generated from the samples themselves: use the high frequency PMDs of your samples to get the PMD network for your specific samples. In my experience, you will always see a few large clusters and lots of smaller clusters. Then, when you find some compounds of interest, check whether each compound is in a larger or a smaller PMD cluster to guess whether it comes from the environment or not. Based on known compounds or T3DB, this method works in most cases.

---

class: inverse, center, middle

# Reactomics Application

## Biomarker Reaction

???
This is my favourite application of reactomics. I am from a medical school, and all of our samples are linked to healthcare. Biomarkers are always of the most interest to our collaborators for a certain disease. In the past, biomarker always meant biomarker compound, and that's why lots of analytical papers spent lots of time on the identification of unknown compounds. However, in reactomics such identification is skipped by using PMDs. In this case, what we check is not a biomarker compound but a biomarker reaction. I don't believe one compound alone can serve well as a biomarker; a reaction always involves at least two compounds, and reaction-level changes should be relatively stable for prediction.

---

## Lung cancer

- MTBLS28 1005 human urine samples

- PMD 2.02 Da shows differences between control and disease groups

???
Though we had some exciting data, I prefer to use public data to avoid bias from our methods. Here we use the MTBLS28 dataset from MetaboLights. It contains 1005 human urine samples and is linked to lung cancer. What I did was simply check the high frequency PMDs from the data itself and some known high frequency PMDs from KEGG.

As I mentioned in the quantitative analysis of reactomics, only the pairs with stable intensity ratios were used to check whether a PMD is a biomarker reaction or not. Here I found that PMD 2.02 Da could be a biomarker reaction for lung cancer. As you can see in the figure, the lung cancer group shows a relatively lower PMD 2.02 Da level compared with the control group. If this is true, urine samples could be used to predict disease at the reaction level.
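
A minimal sketch of a reaction-level comparison: for one PMD, sum the intensities of all ions in its static pairs for each sample, counting an ion once even if it appears in several pairs, and compare the group means. The ion names, intensities, and group labels are made up, `pmd_level` is a hypothetical helper, and this is not the analysis run on MTBLS28.

```python
from statistics import mean

def pmd_level(sample, pairs):
    """Total intensity of the ions in the given pairs, each ion counted once."""
    ions = {ion for pair in pairs for ion in pair}
    return sum(sample[ion] for ion in ions)

# Hypothetical static pairs for one PMD; ion2 takes part in both pairs
static_pairs = [("ion1", "ion2"), ("ion2", "ion3")]

# Hypothetical intensities per sample
samples = {
    "control_1": {"ion1": 100, "ion2": 50, "ion3": 25},
    "control_2": {"ion1": 120, "ion2": 60, "ion3": 30},
    "case_1": {"ion1": 60, "ion2": 30, "ion3": 15},
    "case_2": {"ion1": 80, "ion2": 40, "ion3": 20},
}
groups = {"control": ["control_1", "control_2"], "case": ["case_1", "case_2"]}

group_means = {
    group: mean(pmd_level(samples[name], static_pairs) for name in names)
    for group, names in groups.items()
}
print(group_means)
```

In practice a statistical test across many samples would follow; the group means here only show what is being compared.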

---

## Software

### [pmd package](http://yufree.github.io/pmd/)
  
  - GlobalStd algorithm
  - pmd-reaction database based on KEGG/HMDB
  - quantitative pmd analysis
  - pmd network analysis
  - pmd MS/MS annotation algorithm
  - GUI application based on shiny

### [rmwf package](https://github.com/yufree/rmwf)

- NIST 1950 data
- Rmarkdown template
- From raw data to report
- part of `xcmsrocker` docker image
  
### [POSTER](https://docs.google.com/presentation/d/18qDbjy1PYuLZgOOlbSgzkpFBQnc3k-H5XhQPHjszfHA/edit?usp=sharing) 304118

???
Here I showed three applications of reactomics, and you might find more. I wrote an R package to include all the functions for PMD analysis or reactomics analysis, such as the pmd-reaction database based on KEGG/HMDB, quantitative PMD analysis, and PMD network analysis. To be clear, I only covered the reactions between different compounds instead of reactions from one compound. However, we do have functions for MS/MS annotation based on PMD, and you could check poster 304118 for details.

Meanwhile, to make reactomics fully reproducible from the very beginning, I developed another R package called rmwf, which stands for reproducible metabolomics workflow. I included our lab's data from NIST 1950, which is a standard reference material, along with lots of R code and templates to help users process raw data into a report. This package is also part of the xcmsrocker project, which tries to create a Docker image for a fully reproducible metabolomics data analysis environment based on RStudio.

---

## Take-home message

### Reaction is the "base pair" for small molecules

### Reactomics means the omics study at reaction or PMD level

### Real data are always enriched in a few PMDs or reactions

### Reaction level changes could be quantitative and predictable

???
Here is the take-home message for reactomics: reaction is the "base pair" for small molecules, reactomics means the omics study at the reaction or PMD level, real data are always enriched in a few PMDs or reactions, and reaction-level changes could be quantitative and predictable.

---

## Acknowledgement

.pull-left[- Institute for Exposomic Research, ISMMS

- Department of Environmental Medicine and Public Health, ISMMS

- Dr. Petrick's research group

- Dr. Xinwang Hou

- Q&A: miao.yu@mssm.edu
]

.pull-right[
<img src="../figure/mssmlabmember.jpg" width="300px" style="display: block; margin: auto;" />
]

???
This research is supported by the Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, and the Department of Environmental Medicine and Public Health, ISMMS. I am a postdoc in Dr. Petrick's research group, and here are our group members: Dr. Petrick, Georgia, me, Kelsey, and Dr. Tu. Special thanks to Dr. Xinwang Hou, who shared the pumpkin raw data with me; the data are included in the enviGCMS package. If you have any questions, feel free to contact me via email or Twitter. I will be in the live Q&A. See you there and thanks!