Miao Yu

Tips for local installation of MetaboAnalyst on Windows

Miao Yu — 2017-03-29T00:00:00+00:00

I run Windows 7 for metabolomics data analysis (mainly for msconvert). Recently I found that MetaboAnalyst can be installed locally. Since some group members really care about the safety of their data, I installed MetaboAnalyst on one of the group computers. Here are some tips:

Windows 7 is currently not supported by MetaboAnalyst, so I used VirtualBox to install 64-bit Ubuntu 16.10.
For Ubuntu, you need to install a few system packages to support both the R and Java environments. You might run the following in bash:

sudo apt-get install libnetcdf-dev graphviz libxml2-dev libcairo2-dev default-jdk r-base-dev 

You also need to install some R packages from either CRAN or Bioconductor.
- Install Rserve in bash from the Ubuntu repository to avoid having to configure it in R, then start R

sudo apt-get install r-cran-rserve
R

# Use the following code to install packages in R:
install.packages(c("ellipse", "scatterplot3d","pls", "caret", "lattice", "Cairo", "randomForest", "e1071","gplots", "som", "xtable", "RColorBrewer", "pheatmap", "igraph", "RJSONIO", "caTools", "ROCR", "pROC"))
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("xcms", "impute", "pcaMethods", "siggenes", "globaltest", "GlobalAncova", "Rgraphviz", "KEGGgraph", "preprocessCore", "genefilter", "SSPA", "sva"))
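Before moving on, it can help to confirm that everything actually got installed. The following is a minimal sketch in R, assuming the same package names as in the two install commands above; the variable names `pkgs` and `not_installed` are only illustrative.

```r
# Packages requested above (CRAN and Bioconductor)
pkgs <- c("ellipse", "scatterplot3d", "pls", "caret", "lattice", "Cairo",
          "randomForest", "e1071", "gplots", "som", "xtable", "RColorBrewer",
          "pheatmap", "igraph", "RJSONIO", "caTools", "ROCR", "pROC",
          "xcms", "impute", "pcaMethods", "siggenes", "globaltest",
          "GlobalAncova", "Rgraphviz", "KEGGgraph", "preprocessCore",
          "genefilter", "SSPA", "sva")

# Compare against what is installed in the current R library paths
not_installed <- setdiff(pkgs, rownames(installed.packages()))

if (length(not_installed) > 0) {
  message("Still missing: ", paste(not_installed, collapse = ", "))
} else {
  message("All requested packages are installed.")
}
```

If anything is listed as missing, it is usually a system library that failed to compile, so check the output of the install step for that package.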

If you want to install RStudio on 64-bit Ubuntu, you need the following steps:
- Download “libgstreamer plugin” from here
- Download “libgstreamer” from here
- Install two packages above
- Install the following packages
```
sudo apt-get install libjpeg62
```
- Install RStudio
MetaboAnalyst is actually a Java-based web application (that also relies on R). You need a Java environment and Tomcat or GlassFish to host the *.war file on a server (Linux or Mac OS). Then you only need to access it from a browser, just as you would online.
Install GlassFish. I tried Tomcat, but the deployment always failed, so I suggest using GlassFish following the guide (you might need to set up a user and password) and uploading the *.war file through the web interface at http://localhost:4848

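wget download.java.net/glassfish/4.1.1/release/glassfish-4.1.1.zip
apt-get install unzip
# unzip the archive that was just downloaded (the file name must match)
unzip glassfish-4.1.1.zip -d /opt
# adjust this path if the archive unpacks to a different folder name under /opt
cd /opt/glassfish/bin
./asadmin start-domain
./asadmin enable-secure-admin
./asadmin restart-domain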

Start Rserve in bash:

R CMD Rserve

After deploying MetaboAnalyst on GlassFish, set up port forwarding so that you can access MetaboAnalyst from browsers on the Windows host. You need to know the local IP addresses of both your host and the virtual machine (VM).
- Your host address is the IP used for the connection between the host and the VM. Use ipconfig /all to get it
- Your VM address can be found in its connection information
- Set up NAT port forwarding so that you can access MetaboAnalyst on the VM from the host browser
- Save a bookmark for the URL (in my case: http://192.168.56.1:8080/MetaboAnalyst/ )
- Keep VirtualBox running in the background all the time
- Enjoy local access to MetaboAnalyst (although it will not be updated)
Every time you restart your computer, run this in bash to start MetaboAnalyst:

R CMD Rserve
cd /opt/glassfish/bin
./asadmin start-domain

For everything else, just follow the official guide here

Statistical uncertainty of Isotope Ratio

Miao Yu — 2017-01-15T00:00:00+00:00

In analytical chemistry, measurements of isotope ratios are common. However, I found that the uncertainty of a ratio is usually reported as the standard deviation of a single independent variable, which is statistically inappropriate: you actually measure at least two values to get one ratio.

In fact, if you want to use differences in isotope ratios as a measure of a certain process, you need to accept the assumption that the intensities of the different isotopes are independent. Then we can make a Taylor series expansion of the ratio x/y around the means of x and y:

$\begin{split} \frac{x}{y} \approx \frac{x}{y}\Big|_{\mu_x,\mu_y}&+(x-\mu_x)\frac{\partial}{\partial x}\Big(\frac{x}{y}\Big)\Big|_{\mu_x,\mu_y}+(y-\mu_y)\frac{\partial}{\partial y}\Big(\frac{x}{y}\Big)\Big|_{\mu_x,\mu_y}\\&+\frac{1}{2}(x-\mu_x)^2\frac{\partial^2}{\partial x^2}\Big(\frac{x}{y}\Big)\Big|_{\mu_x,\mu_y}+\frac{1}{2}(y-\mu_y)^2\frac{\partial^2}{\partial y^2}\Big(\frac{x}{y}\Big)\Big|_{\mu_x,\mu_y}+(x-\mu_x)(y-\mu_y)\frac{\partial^2}{\partial x \partial y}\Big(\frac{x}{y}\Big)\Big|_{\mu_x,\mu_y}\\&+\mathcal{O}\Big(\Big((x-\mu_x)\frac{\partial}{\partial x}+(y-\mu_y)\frac{\partial}{\partial y}\Big)^3\Big(\frac{x}{y}\Big)\Big) \end{split}$

The expectation of the ratio is

$\mathbb{E}[r] = \mathbb{E}\Big[\frac{\bar x}{\bar y}\Big] = \frac{\mu_x}{\mu_y} + Var(\bar y)\frac{\mu_x}{\mu_y^3} - \frac{Cov(\bar x,\bar y)}{\mu_y^2} \approx \frac{\mu_x}{\mu_y} + \frac{1}{n}\Big(Var(y)\frac{\mu_x}{\mu_y^3} - \frac{Cov(x,y)}{\mu_y^2}\Big)$

The variance of the ratio is

$\begin{split} Var(r) &= Var\Big( \frac{\bar x}{\bar y} \Big) = \mathbb{E}\Big[\Big(\frac{\bar x}{\bar y} - \mathbb{E}\Big[\frac{\bar x}{\bar y}\Big]\Big)^2\Big] \\&\approx \mathbb{E}\Big[\Big(\frac{\bar x}{\bar y} - \frac{\mu_x}{\mu_y}\Big)^2\Big]\\&\approx \mathbb{E}\Big[\Big((\bar x-\mu_x)\frac{\partial}{\partial \bar x}\Big(\frac{\bar x}{\bar y}\Big)\Big|_{\mu_x,\mu_y} + (\bar y - \mu_y)\frac{\partial}{\partial \bar y}\Big(\frac{\bar x}{\bar y}\Big)\Big|_{\mu_x,\mu_y}\Big)^2\Big]\\&\approx\frac{Var(\bar x)}{\mu^2_y} + \frac{\mu^2_x Var(\bar y)}{\mu^4_y} - \frac{2\mu_x Cov(\bar x, \bar y)}{\mu^3_y}\\&\approx\frac{1}{n}\Big(\frac{Var(x)}{\mu^2_y} + \frac{\mu_x^2 Var(y)}{\mu^4_y} - \frac{2\mu_x Cov(x,y)}{\mu^3_y}\Big) \end{split}$

The square root of this variance can be used as the uncertainty of the isotope ratio, instead of the standard deviation of the individual ratios.
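The final variance formula is easy to evaluate in R. Here is a minimal sketch, assuming paired intensity vectors `x` and `y` for the two isotopes; the function name `ratio_se` and the example intensities are made up for illustration only.

```r
# First-order (delta method) standard error of the ratio of two means.
# x, y: paired intensities of the two isotopes (hypothetical inputs).
ratio_se <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n  <- length(x)
  mx <- mean(x)
  my <- mean(y)
  # Var(r) ~ (1/n) * (Var(x)/my^2 + mx^2 Var(y)/my^4 - 2 mx Cov(x, y)/my^3)
  v <- (var(x) / my^2 + mx^2 * var(y) / my^4 -
          2 * mx * cov(x, y) / my^3) / n
  c(ratio = mx / my, se = sqrt(v))
}

# Made-up intensities, not real measurements
x <- c(101, 99, 102, 98, 100)
y <- c(50, 49, 51, 50, 50)
res <- ratio_se(x, y)
res
```

Sample means, variances and the covariance are plugged in for the population values, so for small n this is only an approximation. If you accept the independence assumption, you can drop the covariance term.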

Evaluation and reduction of the analytical uncertainties in GC-MS analysis using a boundary regression model

Miao Yu — 2016-11-29T00:00:00+00:00

This paper received opposite comments from the reviewers: one rejected it and the other recommended it. Anyway, this is just the beginning of this kind of data analysis for mass spectrometry. This work was also the basis of one chapter of my thesis.

In this work, I wanted to assess and reduce the uncertainties in the whole procedure of environmental analysis. In routine analysis, we use pure standards to optimize the analytical method, and recovery and RSD are commonly used for quality control. My concerns are:

Uncertainties are hard to find in advance with standards. When you inject a dirty sample, your instrument is already polluted by the time you see the results. Furthermore, when you find that your target compounds are affected by something from the matrix, you have to start the analysis over with new methods. So I wondered whether we could assess some common properties of the analysis before we analyze the samples. I therefore used visualization methods to show the uncertainties in the raw GC-MS data.
Another issue is how to escape the influences of the uncertainties found by the visualization methods. My solution was to build a boundary regression model to separate the “clean” zone from the “dirty” zone in the raw data. With this model, we get better sensitivity by choosing the right ions regardless of the matrix or pretreatment.

I have always wondered whether different pretreatments would give similar results for a given matrix and set of compounds. From this paper, my answer is: almost yes. Certain pretreatments remove things that we do not like or that are harmful to the instrument. However, such influences might be irrelevant and undetectable in the mass spectrum. In GC-MS, co-eluting interferences rarely affect your target compounds, because they would need the same retention time and the same masses. Only the rising baseline matters, and we can get rid of it with the boundary model. Then the only thing left to consider is the pollution of the instrument.

Meanwhile, I should say that such a model might not be suitable for high-resolution mass spectrometry. However, the idea can be used to improve analytical methods for some compounds, especially PBDEs. This paper also supplies some basic data for environmental analysis. As a rule of thumb:

For every 1 °C rise in temperature, the boundary of the ‘dirty’ zone rises by about 2 mass units in the worst matrix and pretreatment. Always try to choose heavier ions for qualitative and quantitative analysis.

Here is the graphical summary of the whole method. I think more patterns could be mined from GC-MS data:

I also developed an R package to perform this kind of analysis. Check here. The package has been published on CRAN, and you can install and load it with:

install.packages('enviGCMS')
library('enviGCMS')

You might find Easter Eggs in this package.

If you have questions about this paper, comment here and I will reply as soon as possible.

Use Chinese in RStudio Beamer Slides

Miao Yu — 2016-09-19T00:00:00+00:00

RStudio is an excellent IDE for R. However, using Chinese with the default Rmd settings to output a PDF document is always annoying. Well, the root of the problem is tex.

RStudio uses knitr to convert the Rmd document into an md document. Then it uses Pandoc to convert the md document into a tex document. Finally, a tex engine such as pdflatex or xelatex produces the PDF document.

Why does Chinese not display? The issue arises at the last step. By default, some templates in RStudio, such as beamer, use pdflatex. With pdflatex you need the CJK package, and you have to wrap the text in a CJK environment to display Chinese. I don’t think that is a good way, because you end up writing ugly documents.

The xeCJK package is preferable because you only need to set the font for your Chinese text to get the output. However, this configuration requires you to compile your documents with xelatex.

The beamer template in RStudio uses pdflatex. So the first step to show Chinese is to tell Pandoc to use xelatex rather than pdflatex. You can set this in the yaml.

The second issue is the font. When you use xelatex (actually the xeCJK package), you need to set the font for CJK characters such as Chinese. You could try the following yaml, which uses the FandolFang font.

---
title: "中文测试"
author: "Yufree"
date: "2016年9月19日"
CJKmainfont: FandolFang
output:
  beamer_presentation:
    latex_engine: xelatex
---

Not everyone knows how to find the right name of a font. However, the updated ctex package solves this problem: it uses default settings that avoid the font issue. All you need to do is load the ctex package in your tex template.

We might also use yaml:

---
title: "中文测试"
author: "Yufree"
date: "2016年9月19日"
header-includes:
  - \usepackage{ctex}
output: 
  beamer_presentation:
    latex_engine: xelatex
---

OK, now you would see Chinese in your Beamer PDF slides.

Summary

Three solutions:

Use the CJK package along with pdflatex (not recommended, only for gurus from the 20th century)

Set the font yourself in the yaml with xelatex (for geeks)

Use the ctex package in your yaml (for everyone)

For Chinese in the figure and pdf, check here.

Basic idea behind cluster analysis

Miao Yu — 2016-09-11T00:00:00+00:00

After we have collected a lot of samples and analyzed the concentrations of many compounds in them, we may ask about the relationships between the samples. You might have sampling information such as the date and the position, and you could use boxplots or violin plots to explore the relationships among those categorical variables. However, you could also use the data themselves to find potential relationships.

But how? If two samples’ data are almost the same, we might think those samples come from the same potential group. On the other hand, how do we define “the same” in the data?

Cluster analysis tells us to define a “distance” that measures the similarity between samples. Mathematically, such distances can be defined in many different ways, such as the sum of the absolute values of the differences between samples.

For example, suppose we analyzed the amounts of compounds A, B and C in two samples and got these results:

Compounds(ng)	A	B	C
Sample 1	10	13	21
Sample 2	54	23	16

The distance could be:

$distance = |10-54|+|13-23|+|21-16| = 59$

You could also use the sum of squares or another measure to represent the similarity. Once you have defined a “distance”, you can compute the distances between all pairs of samples. Put the two samples with the smallest distance together as one group. Then calculate the distances again and keep merging small groups into bigger ones until all of the samples are included in one group. Finally, draw a dendrogram of this process.
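In R, the distance and the step-by-step merging described here correspond to `dist()` and `hclust()`. The sketch below uses the two samples from the table; Sample 3 and Sample 4 are hypothetical additions so that there is something to merge, and average linkage is just one possible choice for how to measure the distance between groups.

```r
# Samples from the table above (amounts in ng)
amounts <- rbind(
  "Sample 1" = c(A = 10, B = 13, C = 21),
  "Sample 2" = c(A = 54, B = 23, C = 16)
)

# Manhattan distance = sum of absolute differences
d <- dist(amounts, method = "manhattan")
d

# Two hypothetical extra samples, for illustration only
more <- rbind(
  amounts,
  "Sample 3" = c(A = 12, B = 15, C = 20),
  "Sample 4" = c(A = 50, B = 25, C = 18)
)

# Repeatedly merge the closest groups and draw the dendrogram
hc <- hclust(dist(more, method = "manhattan"), method = "average")
plot(hc)

# Cut the dendrogram into two groups
groups <- cutree(hc, k = 2)
groups
```

Swapping `method = "manhattan"` for `"euclidean"` in `dist()` gives the root of the sum of squares instead.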

The next issue is how to cluster the samples. You might set a cut-off and read the groups directly from the dendrogram. However, sometimes you are asked to cluster the samples into a certain number of groups, such as three. In that situation, you need K-means cluster analysis.

The basic idea behind K-means is to generate three virtual samples and calculate the distances between those three virtual samples and all of the real samples. There are then three values for each sample. Choose the smallest value and assign the sample to that group. Now your samples are classified into three groups. Calculate the center of each group to get three new virtual samples. Repeat this process until the group members no longer change, and your samples are classified.
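Base R's `kmeans()` runs this loop for you. Here is a minimal sketch on simulated data: 30 hypothetical samples are scattered around three centres (the first two borrowed from the table earlier in this post, the third made up). `algorithm = "Lloyd"` is chosen because it is the assign-then-recompute loop described here; the default in R is Hartigan-Wong.

```r
set.seed(1)

# Simulated data, for illustration only: 10 samples around each of 3 centres
centres <- rbind(c(10, 13, 21), c(54, 23, 16), c(30, 40, 5))
x <- centres[rep(1:3, each = 10), ] + matrix(rnorm(90, sd = 2), ncol = 3)
colnames(x) <- c("A", "B", "C")

# K-means with three groups; nstart tries several random sets of
# "virtual samples" and keeps the best result
km <- kmeans(x, centers = 3, iter.max = 50, nstart = 10, algorithm = "Lloyd")

km$cluster  # group label for each sample
km$centers  # final "virtual samples" (group centres)
```

Note that `kmeans()` works with squared Euclidean distances, so the result can depend on how the compounds are scaled.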

OK, the basic idea behind cluster analysis can be summarized as: define the distance, set your cut-off and find the groups. In this way, you might reveal potential relationships among samples.

Basic idea behind principal components analysis

Miao Yu — 2016-08-31T00:00:00+00:00

For environmental scientists, data analysis might be the only way to show your ability once you have data from observations. I have found that many students and even researchers present their data poorly, and that many data analysis patterns are simply copied from one particular paper. However, data analysis methods always have their scope, and some methods might just not suit your case.

Thanks to data analysis software, you do not need to calculate values by hand. But to use the software properly, you need to know the basic ideas. I will show the basic ideas behind certain methods in a few posts. The first one is principal components analysis (PCA).

In most cases, PCA is used as an exploratory data analysis (EDA) method. In most of those cases, PCA just serves as a visualization method. I mean, when I need to visualize some high-dimensional data, I use PCA.

So, the basic idea behind PCA is compression. When you have 100 samples with concentrations of a certain compound, you can plot the concentrations against the sample IDs. However, if you have 100 compounds to analyze, it is hard to show the relationships between the samples. You would need to show a samples-by-compounds matrix (100 * 100, filled with the concentrations) in an informative way.

PCA says: OK, guys, I can convert your data into a 100 * 2 matrix while minimizing the loss of information. Yeah, that is what the mathematicians or computer programmers do; you just run the PCA command. The two new “compounds” are correlated with the original 100 compounds and retain as much of the variance among them as possible. After this projection, you see the compressed relationships between the 100 samples. If some samples’ data are similar, they are projected close together in the plot of the two new “compounds”. That is why PCA can be used for clustering, and the new “compounds” are referred to as principal components (PCs).

However, you might ask why only two new compounds can accomplish this task. I have to say, two PCs are just convenient for visualization. In most cases, we need to keep enough PCs to account for more than 80% of the variance in our data if we want to recover the data from the PCs. If the compounds have no relationship with each other, the PCs are essentially still those 100 compounds. So you have found a property of the PCs: they are orthogonal to each other.

Another issue is how to find the relationships between the compounds. We use PCA to find the relationships between samples, but we can also extract the influence of each compound on a given PC. You might find that many compounds show the same loading on the first PC. That means the concentration patterns of those compounds look similar. So PCA can also be used to explore the relationships between compounds.
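To make this concrete, here is a minimal sketch with `prcomp()` from base R. The 100 * 100 concentration matrix is simulated from two hidden factors plus noise, so it is purely hypothetical; with real data you would replace `conc` with your own samples-by-compounds matrix.

```r
set.seed(42)

# Simulated data, for illustration only: 100 samples x 100 compounds
# driven by two hidden factors plus a little noise
n <- 100
p <- 100
hidden_scores   <- matrix(rnorm(n * 2), ncol = 2)
hidden_loadings <- matrix(rnorm(2 * p), nrow = 2)
conc <- hidden_scores %*% hidden_loadings +
  matrix(rnorm(n * p, sd = 0.1), ncol = p)

# Centre and scale each compound, then run PCA
pca <- prcomp(conc, center = TRUE, scale. = TRUE)

# Proportion of variance carried by each PC, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)[1:5]

# The compressed 100 x 2 matrix of samples (scores) used for plotting
scores <- pca$x[, 1:2]
plot(scores, xlab = "PC1", ylab = "PC2")

# Loadings: the influence of each compound on PC1
head(pca$rotation[, 1])
```

The scores show the relationships between samples, while the columns of `pca$rotation` show which compounds move together. The cumulative sum of `var_explained` tells you how many PCs you need to pass a threshold such as 80%.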

OK, next time you might turn to PCA because you need it, not because another paper used it.

Metabolomics workflow in Rstudio

Miao Yu — 2016-08-21T00:00:00+00:00

I moved to Canada about three weeks ago. Now I am a postdoc at the University of Waterloo. I will handle two projects on in silico studies in analytical chemistry. Well, I see them as yet more data- and modeling-driven interdisciplinary studies.

The first step is building the data analysis environment for the group members. Since I could set up such an environment on a powerful computer with 128 GB of RAM, I preferred to use R and xcms for metabolomics data analysis.

For a well-trained analytical chemist, software and programming are always something agonizing. However, for a degree or a promotion, researchers have to learn them. xcms online is a well-designed metabolomics data analysis tool for users with limited coding experience. Actually, the earlier online version might have come from the xcms package for R.

If you know more of the details of data processing, you might get more insight into the data. Understanding each step might cost you a whole day, but I also want to show the steps in my own way. This process will be helpful if you want to develop things further to answer your own scientific problems.

Here is the workflow in RStudio and a brief version in Chinese.

Structure Prediction of Methyoxy-polybrominated diphenyls ethers (MeO-PBDEs) through GC-MS analysis of their corresponding PBDEs

Miao Yu — 2016-02-11T00:00:00+00:00

This is a paper with many rejections and comments. It was finally published in Talanta with DOI 10.1016/j.talanta.2016.01.047. One study said that an academic paper is read no more than three times on average. In my case, at least 11 reviewers had read this paper before it was published, and I thank all of them, though some of them really misunderstood my idea.

The basic idea behind the structure prediction is the combination of two qualitative methods. Usually, we use a full-scan mass spectrum to derive some rules about the structure, such as the position of a substituent group in a compound. Meanwhile, the retention time from the separation process also gives us some information about the compound. For unknown MeO-PBDEs, the mass spectrum can tell us the position of the MeO- group but not the positions of the Br atoms. Chromatography can show us the positions of the Br atoms for PBDEs but not for MeO-PBDEs. What I needed to do was build a model that connects those two information sources to get the structure of unknown MeO-PBDEs in certain samples.

But how? I collected 32 MeO-PBDEs and their corresponding PBDEs and measured the retention times of those pairs under the same analytical conditions. I found that we could use those data to connect the information from the mass spectrum and from chromatography. The basic model is

$RT_{\text{MeO-PBDEs}} = RT_{\text{PBDEs}} + \text{Group position}$

For each substitution position, which the mass spectrum reveals, the shift is a constant. For example, if BDE-47’s retention time is 21.753 min, the ortho-substituted MeO-BDE-47 shows a retention time of 25.648 min. The difference within such retention time pairs is a constant of around 4 min. We use regression analysis to estimate these constants. Then, when we find a potential MeO-PBDE peak, the mass spectrum tells us the mass, the number of Br atoms and the position of the MeO- group. We then just run the standards of the candidate PBDEs and use the model to check the positions of the Br atoms. The trick is that there are 209 standards for PBDEs but 837 for MeO-PBDEs, so we use a small number of standards to cover a large number of unknowns for which no standards are available. Another trick is that we can use multiple dissimilar columns to build such models, which makes the estimation much more accurate.
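A rough sketch of how such position constants could be estimated in R follows. Only the first row (BDE-47 and its ortho MeO- analogue) comes from this post; every other retention time is a made-up placeholder, and the variable names and the unknown peak are hypothetical. This is not the model or the data from the paper.

```r
# Row 1 is the BDE-47 pair quoted above; all other values are
# made-up placeholders, not measured retention times.
rt <- data.frame(
  rt_pbde  = c(21.753, 18.20, 24.10, 19.50, 22.80, 26.00),
  rt_meo   = c(25.648, 22.15, 27.95, 22.60, 25.95, 29.05),
  position = factor(c("ortho", "ortho", "ortho", "meta", "meta", "meta"))
)

# RT_MeO = RT_PBDE + constant(position):
# slope fixed at 1, one offset per MeO- position
rt$shift <- rt$rt_meo - rt$rt_pbde
fit <- lm(shift ~ 0 + position, data = rt)
offsets <- coef(fit)
offsets

# For an unknown ortho MeO-PBDE peak (hypothetical retention time),
# the matching PBDE standard should elute near this time
rt_unknown <- 27.30
expected_pbde_rt <- rt_unknown - offsets[["positionortho"]]
expected_pbde_rt
```

With the slope fixed at 1, the regression reduces to the mean shift per position. A freely fitted slope or data from several columns would be natural extensions.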

This is just a first try. I used this method to obtain three unknown MeO-PBDE structures. However, the most important point is that we should try to combine data from different analytical methods to build a much stronger model. In many studies, scientists use several independent analytical methods to explain one problem. I think that is also a model, and once we build it, the rest can be left to computers or automation. We humans should do the smart things!

If you have questions about this paper, comment here and I will reply as soon as possible.

HPLC 2015 Beijing

Miao Yu — 2015-09-30T00:00:00+00:00

Last week the 43rd International Symposium on High Performance Liquid Phase Separations and Related Techniques (HPLC 2015 Beijing) was held at the Beijing International Conference Center in Beijing, China. Since my supervisor was the chair of this conference, I stayed there for three days and presented a poster. Here are some takeaways from HPLC 2015 Beijing.

“NMR is your mother, MS is your love and the LC is your superhero,” said Prof. Peter Schoenmakers.
The hairstyle of Prof. Jonathan Sweedler is impressive, and I wonder whether he ever played punk.
Though Prof. Robert Kennedy has rejected my paper before, I admit he is handsome.
The girls from South Korea were really beautiful. We might see them again at HPLC 2017 Jeju.
2D and 3D LC were really popular. However, I found they were limited to a few applications. Yeah, they show fantastic column efficiency, but in practice they are somewhat like the art of butchering dragons when there are no dragons to butcher. In environmental analysis, I think the complex matrix effects in various samples might be a dragon for them.
Superficially porous particles are interesting and attractive. A better choice for start-up groups.
Modeling in HPLC is really naive and ignores the development of novel methods in computer science and data science.
Mass spectrometry is the best spouse for (U)HPLC. However, omics studies care about features or profiles more than individual compounds.
Some groups have noticed the data mining of hyphenated method data, and I think this is the best application for data science.
Chemical modification of certain materials or nanomaterials to gain selectivity is just permutation and combination. However, such work can still be published in good journals…
Oral presentation skills are very important for PIs. Some Chinese PIs need to learn how to give a presentation; if your English is not good, try to put the content on the slides.
Young scientists are the future of HPLC, and I really appreciated the workshop for starters.
I bet you only care about the beautiful and the handsome in this post.
See you at the next HPLC (well, I always need financial support)!

p.s. I can finally vote on Stack Overflow!

The Data Analysis Similarity between Microarray and GC-MS

Miao Yu — 2015-09-11T00:00:00+00:00

I finished Data Analysis for Genomics (HarvardX-PH525x) by Prof. Rafael A Irizarry and Dr. Michael I Love more than a year ago, but only recently did I realise the similarity in data analysis between microarray and gas chromatography–mass spectrometry (GC-MS).

When we talk about data analysis of microarrays, we use different genes or probes as the rows and different samples as the columns. The responses are fluorescence signals.

When we talk about data analysis of GC-MS, we use different m/z values as the rows and different retention times as the columns. The responses are count signals.

Interestingly, the total ion chromatogram (TIC) is widely used in GC-MS, while the heatmap is widely used for microarrays. How about showing a heatmap of GC-MS data and a TIC of microarray data?

Wait, we shouldn’t do something that has no meaning. Why use the TIC in GC-MS? Because we always assume that one compound shows up at a certain retention time. However, under an EI source or other hard ionization, one compound can show many m/z responses. In environmental analysis, the matrix might also produce responses. So we have the meaning: a heatmap of GC-MS data gives a visualization of the matrix effect.
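The two views are easy to compare in R once the GC-MS data sit in a matrix with m/z as rows and retention times as columns. The sketch below uses simulated counts: a noisy background plus one hypothetical compound eluting near 15 min with ions at m/z 77 and 105. None of these numbers come from real samples.

```r
set.seed(7)

# Simulated GC-MS data, for illustration only:
# rows = m/z channels, columns = retention time scans
mz <- 50:149
rt <- seq(5, 25, length.out = 200)
counts <- matrix(rpois(length(mz) * length(rt), lambda = 5),
                 nrow = length(mz),
                 dimnames = list(as.character(mz), NULL))

# One hypothetical compound eluting near 15 min with two fragment ions
peak <- 500 * exp(-(rt - 15)^2 / (2 * 0.1^2))
counts["77", ]  <- counts["77", ] + round(peak)
counts["105", ] <- counts["105", ] + round(0.6 * peak)

# TIC: sum over all m/z at each retention time
tic <- colSums(counts)
plot(rt, tic, type = "l", xlab = "Retention time (min)", ylab = "TIC")

# Heatmap of the same matrix keeps the m/z dimension visible
image(rt, mz, t(log10(counts + 1)),
      xlab = "Retention time (min)", ylab = "m/z")
```

The TIC collapses every m/z into one trace, so background from the matrix and signal from the compound are summed together. The heatmap shows which m/z channels the background occupies at each retention time.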

How about a TIC for microarrays? I don’t think such a plot has meaning, because there is no time dependence in microarray samples.

But when the data can be shown as a heatmap, we might employ noise reduction methods to ease the matrix effect. The following two heatmaps are a naive “before and after” result processed with some microarray data analysis methods. Yeah, now I think it is OK to use such methods to reduce the matrix effect in environmental samples.

Wait, my paper is still being written. I will show the details of the method soon (maybe).