Paper methods vs. practical methods in metabolomics
Most papers look like the owl-drawing meme: they only show two circles to explain how they process the data.
$$Statistic = f(sample_1,sample_2,...,sample_n)$$
A statistic describes a certain property of the samples
A statistic can be designed for a specific purpose
Statistics extract signal and remove noise
Statistical inference is based on statistics
A small p value doesn't mean the effect is strong!
NHST can only tell you \(p(D|H0)\), not \(p(H0|D)\)!
$$ 1- (1 - 0.05) = 0.05$$
$$1 - (1 - 0.05)^2 = 0.0975$$
$$1 - (1 - 0.05)^{10} = 0.4013$$
More tests mean more chances to get a false positive
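A minimal Python sketch reproducing the calculations above (the 0.05 cutoff and the numbers of tests are illustrative):

```python
# Probability of at least one false positive among m independent tests at alpha = 0.05
alpha = 0.05
for m in (1, 2, 10, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:4d} tests: P(at least one false positive) = {fwer:.4f}")
```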
Thousands of peaks mean thousands of tests; a single cutoff would yield lots of false positives
False Discovery Rate (FDR) control is required for multiple tests
$$p_i \leq \frac{i}{m} \alpha$$
\(\alpha\) is the p value cutoff, i is the rank of a given test, and m is the number of comparisons
Adjusted p value for FDR control
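A minimal sketch of the BH adjustment in Python; the helper name `bh_adjust` and the example p values are illustrative (in practice a library routine such as statsmodels' `multipletests(..., method="fdr_bh")` does the same job):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values: p_i * m / i, with monotonicity enforced."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                               # rank the p values from smallest to largest
    ranked = p[order] * m / np.arange(1, m + 1)         # p_i * m / i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity from the largest rank down
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)           # cap at 1 and restore the original order
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(bh_adjust(pvals))
```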
$$\hat\pi_0 = \frac{\#\{p_i>\lambda\}}{(1-\lambda)m}$$
Direct estimation of the FDR from the distribution of p values
The q value is the FDR for each test
The BH-adjusted p value is also called a q value; Storey's q value is not as stable as the BH method
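A small sketch of the \(\hat\pi_0\) estimate defined above; the uniform p values are simulated and the helper name is illustrative:

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Storey's estimate of the proportion of true nulls: #{p_i > lambda} / ((1 - lambda) * m)."""
    p = np.asarray(pvals, dtype=float)
    return np.sum(p > lam) / ((1 - lam) * p.size)

# under the null hypothesis p values are uniform, so the estimate should be close to 1
rng = np.random.default_rng(1)
print(estimate_pi0(rng.uniform(size=1000), lam=0.5))
```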
This is publication bias or "cherry-picking". Try to avoid relying only on p values in the future.
$$p(H0|D) \propto p(D|H0) p(H0)$$
$$Bayes\ factor = \frac{p(D|Ha)}{p(D|H0)} = \frac{posterior\ odds}{prior\ odds}$$
The Bayes factor can show the difference between the null hypothesis and any other hypothesis
Bayesian Inference Demo: http://rpsychologist.com/d3/bayes/
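A toy numeric sketch of a Bayes factor for two point hypotheses about a normal mean; the sample size, sigma and observed mean are made up:

```python
from scipy.stats import norm

# H0: mu = 0 versus Ha: mu = 1, with known sigma and an observed sample mean xbar
n, sigma, xbar = 20, 1.0, 0.6
se = sigma / n ** 0.5                                   # standard error of the mean
bf = norm.pdf(xbar, loc=1.0, scale=se) / norm.pdf(xbar, loc=0.0, scale=se)
print(f"Bayes factor (Ha over H0): {bf:.2f}")
```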
$$Target = g(Statistic) = g(f(sample_1,sample_2,...,sample_n))$$
Use statistics to make predictions/explanations
Use parameters to fit the data
Based on real data and/or hypotheses
Diagnosed by other statistics (\(R^2\), ROC)
We can tune statistical models by their parameters or perform model selection
$$ t = \frac{\bar x - \mu}{\frac{\sigma}{\sqrt n}} $$
One-sample t-test: tests whether the mean is 0
Two-sample unpaired t-test: tests whether the difference between the two group means is 0
Two-sample paired t-test: tests whether the mean paired difference between the two groups is 0
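The three tests in Python with scipy; the intensities are simulated and the group means are chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=30)   # e.g. log fold changes
a = rng.normal(loc=5.0, scale=1.0, size=30)   # group A intensities
b = rng.normal(loc=5.5, scale=1.0, size=30)   # group B intensities

print(stats.ttest_1samp(x, popmean=0))        # one-sample: is the mean 0?
print(stats.ttest_ind(a, b))                  # two-sample unpaired
print(stats.ttest_rel(a, b))                  # two-sample paired (same subjects measured twice)
```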
$$ F = \frac{explained\ variance}{unexplained\ variance} $$
Most parametric tests make assumptions about the data
You need extra tests to check those assumptions before using a parametric test
Non-parametric tests are "distribution-free"
Always using non-parametric tests is safe, but they have less power (it is harder to find differences)
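A small sketch comparing a parametric and a non-parametric test on skewed, simulated intensities; the Shapiro-Wilk test checks the normality assumption first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.lognormal(mean=1.0, sigma=0.5, size=25)   # skewed intensities
b = rng.lognormal(mean=1.3, sigma=0.5, size=25)

print(stats.shapiro(a))          # check the normality assumption first
print(stats.ttest_ind(a, b))     # parametric: assumes roughly normal data
print(stats.mannwhitneyu(a, b))  # non-parametric alternative: fewer assumptions, less power
```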
$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$
$$Intensity = Group + Random\ Error$$
T-test: peaks show differences between two groups
One-way ANOVA: peaks show differences among multiple groups
Linear regression can replace both of them in most cases
$$Intensity = Group + Random\ Error$$
After regression, you get a parameter for each group.
A t-test is used to test whether each parameter is 0.
If a parameter is not significantly different from 0, the group information contributes little to this peak's intensity.
Such a peak would not be suitable for predicting the group.
You also need FDR control for regression analysis
You can use an F-test to assess the variance explained ($R^2$)
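A sketch of peak-wise regression with FDR control on simulated data; the sample sizes, number of peaks and injected effects are illustrative, and with a 0/1 group the slope test is equivalent to a two-sample t-test:

```python
import numpy as np
from scipy.stats import linregress
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_samples, n_peaks = 40, 200
group = np.repeat([0, 1], n_samples // 2)          # two groups coded 0/1
peaks = rng.normal(size=(n_samples, n_peaks))
peaks[group == 1, :10] += 1.0                      # 10 truly different peaks

# Intensity = Group + Random Error, fitted peak by peak
pvals = np.array([linregress(group, peaks[:, j]).pvalue for j in range(n_peaks)])

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")   # FDR control
print("peaks passing FDR control:", np.where(reject)[0])
```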
$$Intensity_{t} = Intensity_{t-1} + Group + Random$$
Data points in a time series are auto-correlated
Standard regression analysis is not suitable for time series
Survival analysis might also be used in certain contexts
If you know nothing about time series analysis, just show the trends without claiming statistical significance
$$Group = f_m(peak_m) $$
$$Group/Value = f_m(peak_1, peak_2, ...,peak_n)$$
$$Sensitivity = \frac{Group_{TP}}{Group_{TP}+Group_{FN}}$$
$$Specificity = \frac{Group_{TN}}{Group_{TN}+Group_{FP}}$$
$$Accuracy = \frac{Group_{TP}+Group_{TN}}{Group_{TP}+Group_{TN}+Group_{FP}+Group_{FN}}$$
This is actually not the same as in analytical chemistry. A model with higher sensitivity avoids false negatives, and a model with higher specificity avoids false positives. Accuracy alone cannot distinguish between the two.
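A minimal sketch computing the three metrics from a toy confusion matrix; the labels and predictions are made up:

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 1 = case, 0 = control
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

sensitivity = tp / (tp + fn)                 # high sensitivity avoids false negatives
specificity = tn / (tn + fp)                 # high specificity avoids false positives
accuracy = (tp + tn) / (tp + tn + fp + fn)   # cannot distinguish the two error types
print(sensitivity, specificity, accuracy)
```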
For a given model, a given cutoff (e.g. a p value) gives one point. By changing the cutoff, each model traces its own ROC curve. The closer the ROC curve gets to the top-left corner, the better the corresponding model performs.
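A sketch of an ROC curve and AUC with scikit-learn on simulated scores; the noise level is arbitrary:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
score = y_true + rng.normal(scale=0.8, size=100)   # a noisy prediction score

fpr, tpr, thresholds = roc_curve(y_true, score)    # each cutoff gives one (FPR, TPR) point
print("AUC:", roc_auc_score(y_true, score))        # 1.0 = perfect, 0.5 = random guessing
```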
$$E[(y - \hat f)^2] = \sigma^2 + Var[\hat f] + Bias[\hat f]^2$$
Estimate the model parameters using all of the data
Leave out one sample and estimate the model parameters again
Calculate the difference between the 'full' model and the 'leave one out' model
Repeat, leaving out each sample of the data set once
Use the differences to estimate the model performance
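A leave-one-out sketch on simulated data, comparing each refit against the 'full' model; the linear model and coefficients are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=30)

full_coef = LinearRegression().fit(X, y).coef_        # model fitted on all data

errors, coef_shift = [], []
for train_idx, test_idx in LeaveOneOut().split(X):    # leave one sample out each time
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append((y[test_idx] - fit.predict(X[test_idx])) ** 2)
    coef_shift.append(np.abs(fit.coef_ - full_coef))  # difference from the 'full' model

print("leave-one-out prediction error:", np.mean(errors))
print("mean parameter shift          :", np.mean(coef_shift, axis=0))
```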
Estimate the model parameters using all of the data
Resample the data with replacement and estimate the model parameters again
Calculate the difference between the 'full' model and the 'bootstrap' model
Repeat the resampling for the entire data set
Use the differences to estimate the model performance
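A bootstrap sketch on the same kind of simulated data; 500 resamples is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=30)

full_coef = LinearRegression().fit(X, y).coef_       # model fitted on all data

boot_coefs = []
for _ in range(500):                                 # resample with replacement
    idx = rng.integers(0, len(y), size=len(y))
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
boot_coefs = np.array(boot_coefs)

print("full-data coefficients:", full_coef)
print("bootstrap spread      :", boot_coefs.std(axis=0))   # spread reflects model stability
```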
Split the data into a training set (60%), a validation set (20%) and a test set (20%)
The training set is used to build the model
The validation set is used to tune the parameters of the model built on the training set
When the model is done, use the test set to report the final performance
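A sketch of the 60/20/20 split via two successive `train_test_split` calls; the peak table and groups are simulated:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # 100 samples x 20 peaks (simulated)
y = rng.integers(0, 2, size=100)      # two groups

# 60 % training, 20 % validation, 20 % test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)
print(len(y_train), len(y_val), len(y_test))   # 60, 20, 20
```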
$$RSS = \sum_{i=1}^n(y_i - f(x_i))^2$$
To avoid overfitting, regularization is commonly applied to penalize the parameters
Ridge regression (L2)
$$RSS + \lambda \sum_{j = 1}^{p} \beta_j^2$$
LASSO regression (L1)
$$RSS + \lambda \sum_{j = 1}^{p} |\beta_j|$$
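A sketch of both penalties with scikit-learn on simulated data; the `alpha` values (the \(\lambda\) above) are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))        # more peaks than samples: a high overfitting risk
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: drives many coefficients to exactly 0

print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))
```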
$$y = f(x)$$
$$x = g(x)$$
When the first 2 or 3 principal components explain about 80% of the variance
If samples cluster together, they might be similar on the major principal components
PCA is an Exploratory Data Analysis (EDA) method, not statistical inference with a p value. Conclusions should be validated by extra statistical methods or simulation.
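A PCA sketch on a simulated peak table; the shared pattern in half of the samples is injected for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
peaks = rng.normal(size=(40, 500))            # 40 samples x 500 peaks (simulated)
peaks[:20] += rng.normal(size=500) * 0.5      # give half of the samples a shared pattern

pca = PCA(n_components=3)
scores = pca.fit_transform(peaks)             # sample scores on PC1-PC3
print("explained variance ratio:", pca.explained_variance_ratio_)
# plot scores[:, 0] vs scores[:, 1] to look for clusters; this is EDA, not a hypothesis test
```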
Partial least squares discriminant analysis (PLS-DA) was first used in the 1990s, while partial least squares (PLS) itself was proposed in the 1960s by Herman Wold. Principal components analysis produces a weight matrix reflecting the covariance structure between the variables, while partial least squares produces a weight matrix reflecting the covariance structure between the variables and the classes. After rotation by the weight matrix, the new variables carry the relationship with the classes.
The classification performance of PLS-DA is identical to linear discriminant analysis (LDA) if the class sizes are balanced, or if the columns are adjusted according to the class means. If the number of variables exceeds the number of samples, LDA can be performed on the principal components. Quadratic discriminant analysis (QDA) can model nonlinear relationships between variables, while PLS-DA is better suited to collinear variables. However, as a classifier, PLS-DA offers little advantage. Its advantage is that the model can show the relationships between the variables, which is not the goal of a regular classifier.
Different algorithms for PLS-DA give different scores, while PCA always gives the same scores for a fixed algorithm. For PCA, both the new variables and the classes are orthogonal. However, for PLS (Wold), only the new classes are orthogonal; for PLS (Martens), only the new variables are orthogonal.
Sparse PLS discriminant analysis (sPLS-DA) puts an L1 penalty on the variable selection to remove the influence of unrelated variables, which makes sense for high-throughput omics data [@lecao2011].
For O-PLS-DA, the S-plot can be used to find features [@wiklund2008].
Focus on the specific projection related to the target
Summarize the importance of the variables and the importance of the projection
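One way to sketch PLS-DA is to fit scikit-learn's `PLSRegression` on a dummy-coded class; the simulated peak table, injected effect and 0.5 threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))                     # collinear, high-dimensional peak table
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.0                               # 5 peaks carry the class signal

pls = PLSRegression(n_components=2).fit(X, y)      # y is the dummy-coded class
scores = pls.transform(X)                          # sample scores for a score plot
pred = (pls.predict(X).ravel() > 0.5).astype(int)  # threshold the prediction for classification
print("training accuracy:", np.mean(pred == y))
# pls.x_weights_ reflects the covariance between peaks and class and feeds variable importance
```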
Use the distances between samples (computed from their variables) to reveal their inner relationships
Find Homogeneity from Heterogeneity
Hierarchical clustering
K-means
Self-organizing map
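A sketch of hierarchical clustering and k-means on a simulated peak table with two hidden groups; the group shift and k = 2 are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 50)),
               rng.normal(2, 1, size=(20, 50))])            # two hidden groups of samples

Z = linkage(X, method="ward")                               # hierarchical clustering on distances
hier_labels = fcluster(Z, t=2, criterion="maxclust")

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)   # k-means with k = 2

print(hier_labels)
print(km_labels)
```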
At each level, different variables play a role in separating the samples
Each branch of the tree belongs to a certain group
Peaks can be used at different levels for the separation. Tree-based models can also be used to select important variables.
Each tree is grown on a bootstrap sample, and a random subset of peaks is considered at each split
Multiple trees vote for the separation
Variable importance can be computed from each variable's influence on the separation
Each model is unique since random selection is involved
Cross validation can be used to obtain a stable estimate of the performance and of the important variables or peaks
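A random forest sketch on simulated peaks; the number of trees and the injected effect are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))          # 60 samples x 200 peaks (simulated)
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 1.0                    # 5 informative peaks

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("top peaks by importance :", np.argsort(rf.feature_importances_)[::-1][:5])
```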
General model
Each tree is built on the previous trees, and the trees are weighted
New trees are shrunk to fit the residual errors
$$\hat f(x) = \sum_{b=1}^B \lambda \hat f^b(x)$$
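A gradient boosting sketch on simulated data; `learning_rate` plays the role of the shrinkage \(\lambda\) above, and the tree depth and counts are arbitrary:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
y = np.repeat([0, 1], 30)
X[y == 1, :3] += 1.0                    # 3 informative peaks

# each new shallow tree is fitted to the residual errors and shrunk by learning_rate
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=2,
                                 random_state=0).fit(X, y)
print("training accuracy:", gbm.score(X, y))
```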
A p-dimensional vector space can be separated by a (p-1)-dimensional hyperplane
Find the hyperplane that gives the largest separation (margin) between the groups
A kernel function is used to map the data into a higher-dimensional space where the groups become separable
Similar to logistic regression with a kernel function
Variable importance can also be computed by cross validation
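An SVM sketch with a linear and an RBF kernel on simulated data; the injected effect and the number of CV folds are arbitrary:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
y = np.repeat([0, 1], 30)
X[y == 1, :3] += 1.0                    # 3 informative peaks

linear_svm = SVC(kernel="linear")       # separating hyperplane with the largest margin
rbf_svm = SVC(kernel="rbf")             # kernel implicitly maps the data into a higher-dimensional space

for name, model in [("linear", linear_svm), ("rbf", rbf_svm)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```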
multiple layers