We choose to do so because this kind of information leakage is not associated with classifier training and is therefore not expected to lead to significant performance bias.
Ratio-based data are obtained by scaling the sample expression values (intensities) by an array of reference expression values (intensities). When a batch contains several reference control samples, the reference is calculated as the mean of those control samples. Both the arithmetic mean and the geometric mean of the control sample intensity values have been used to compute the reference. We use the acronyms Ratio-G and Ratio-A for the ratio-based approaches whose reference is based on the geometric and arithmetic mean, respectively. If one or more reference samples are possible outliers, the median could be used instead as a more robust reference.
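The ratio-based scaling described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `ratio_scale`, the `method` argument, and the array layout (rows are samples, columns are genes) are assumptions, and intensities are assumed to be strictly positive and on the raw (not log) scale.

```python
import numpy as np

def ratio_scale(samples, references, method="geometric"):
    """Divide each sample by a per-gene reference built from control samples.

    samples:    (n_samples, n_genes) positive intensities from one batch
    references: (n_refs, n_genes) positive control intensities, same batch
    method:     "geometric" (Ratio-G), "arithmetic" (Ratio-A) or "median"
    """
    samples = np.asarray(samples, dtype=float)
    references = np.asarray(references, dtype=float)

    if method == "arithmetic":      # Ratio-A
        ref = references.mean(axis=0)
    elif method == "geometric":     # Ratio-G: exp of the mean log intensity
        ref = np.exp(np.log(references).mean(axis=0))
    elif method == "median":        # more robust to outlying controls
        ref = np.median(references, axis=0)
    else:
        raise ValueError(f"unknown method: {method!r}")

    # Broadcasting divides every sample row by the per-gene reference.
    return samples / ref

# Hypothetical toy batch: two control samples and one test sample, two genes.
controls = np.array([[1.0, 4.0], [4.0, 16.0]])
sample = np.array([[2.0, 8.0]])
ratio_g = ratio_scale(sample, controls, method="geometric")
ratio_a = ratio_scale(sample, controls, method="arithmetic")
```

Because the reference is computed within each batch, the scaling is applied batch by batch and needs no class labels.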
EJLR (Extended Johnson-Li-Rabinovic) method
This method is based on the method of Johnson et al.,9 which adjusts the expression values of both the training batch and the test batch. The Johnson et al. method is also called COMBAT or the Empirical Bayes method. For a predictive model to be applicable to future samples, it has to be developed from the training set alone, without being affected by the future set. The original algorithm has therefore been modified so that the training batch serves as a reference batch when adjusting the batch effect in future batches. The reference (training) batch does not change during the removal process. Thus, a model constructed on this unchanged training set can be used to predict samples in a test set.
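The reference-batch idea can be illustrated with a deliberately simplified sketch. It is not the EJLR algorithm: it replaces the empirical Bayes estimation with a plain per-gene location and scale adjustment, ignores covariates, and applies no shrinkage. The function names `fit_reference` and `adjust_to_reference` and the `eps` guard are hypothetical. What it does show is the key property described above: the training batch is only read, never modified, and each later batch is mapped onto it.

```python
import numpy as np

def fit_reference(train):
    """Per-gene mean and standard deviation of the training (reference) batch."""
    train = np.asarray(train, dtype=float)
    return train.mean(axis=0), train.std(axis=0, ddof=1)

def adjust_to_reference(batch, ref_mean, ref_std, eps=1e-8):
    """Shift and rescale a new batch so each gene matches the reference batch.

    Simplified stand-in for an empirical Bayes adjustment: the batch is
    standardized gene by gene, then mapped to the reference mean and spread.
    """
    batch = np.asarray(batch, dtype=float)
    mu = batch.mean(axis=0)
    sd = batch.std(axis=0, ddof=1)
    z = (batch - mu) / np.maximum(sd, eps)   # eps guards constant genes
    return z * ref_std + ref_mean

# Hypothetical toy data: three samples by two genes in each batch.
train = np.array([[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]])
test = np.array([[10.0, 0.0], [20.0, 2.0], [30.0, 4.0]])

train_before = train.copy()
ref_mean, ref_std = fit_reference(train)
test_adj = adjust_to_reference(test, ref_mean, ref_std)
```

A classifier trained on `train` can then be applied to `test_adj` without retraining, since the training values are untouched.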
It should be stressed that the applicability and efficacy of all batch effect removal approaches described above, except the ratio-based method, rely on the assumption that each individual batch has reasonable numbers of both positive and negative samples. If this assumption is not satisfied, biological information might be jeopardized. Recently, a promising hybrid method combining the use of reference samples and the empirical Bayes approach was published by Walker et al.18
Evaluation of batch effect removal effectiveness
Cross-batch (group) prediction performance is used as the evaluation measure for batch effect removal, as this is the most practical measure for diagnostic purposes. The class label information of the test set is used only when evaluating the prediction performance; it is kept strictly blind during the model construction process.
The Matthews Correlation Coefficient (MCC), the primary performance metric in the MAQC-II study,5 is used in this work. It is essentially the Pearson correlation coefficient between the true labels in the test set and the predicted labels, both in binary form. Its definition can be found at http://en.wikipedia.org/wiki/Matthews_Correlation_Coefficient.
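As a small worked example, the MCC can be computed from the confusion-matrix counts and checked against the Pearson correlation of the binary label vectors. The function name `mcc` and the toy labels are illustrative assumptions. Returning 0 when a marginal count is zero is a common convention, not something specified in this paper.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels coded as 0/1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = int(np.sum(y_true & y_pred))     # true positives
    tn = int(np.sum(~y_true & ~y_pred))   # true negatives
    fp = int(np.sum(~y_true & y_pred))    # false positives
    fn = int(np.sum(y_true & ~y_pred))    # false negatives

    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0.0:
        return 0.0   # convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom

# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

score = mcc(y_true, y_pred)
pearson = np.corrcoef(y_true, y_pred)[0, 1]
```

Here TP = 3, TN = 3, FP = 1 and FN = 1, so the MCC is (9 - 1) / 16 = 0.5, which agrees with the Pearson correlation of the two 0/1 vectors.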