Robust Tests for the Mean Difference in Paired Data by Using Bootstrap Resampling Technique

The paired-sample t-test for the difference between two means in paired data is not robust against violation of the normality assumption. In this paper, alternative robust tests are suggested by applying the bootstrap method to the paired t-test and by combining the bootstrap method with the W.M test. Monte Carlo simulation experiments were employed to study the performance of the test statistics of each of these three tests in terms of Type I error rates and power rates. The three tests were applied to different sample sizes generated from three distributions: the bivariate normal distribution, the bivariate contaminated normal distribution, and the bivariate exponential distribution.


Introduction
Comparing the means of two correlated variables is often of interest to researchers in various fields, especially medical and biological ones. The paired t-test is one of the most important and most widely used tests for this purpose. However, the paired t-test is not robust against departures from the normality assumption. The concept of robustness was first introduced by Box in 1953. There are many definitions of robustness; perhaps the most important is Huber's (1981), which states that robustness has many meanings and implications that may be inconsistent with each other, but that it can broadly be expressed as insensitivity to slight departures from the assumptions underlying a test statistic [1]. Bradley (1978) defined a robust test as one for which the violation of one or more of the test's assumptions does not affect the distribution of the test statistic by causing the true probability of a Type I error to differ from the nominal level α. He suggested a criterion of robustness, called the liberal criterion: a test is regarded as robust only if its estimated Type I error rate α̂ falls in the following interval [2]:

0.9α < α̂ < 1.1α, i.e., |α̂ − α| ≤ 0.1α (1)

On the other hand, Salter and Fawcett (1985) proposed another criterion for the robustness of a test, which requires the Type I error rate to lie within the following interval [3]:

α ± 2√(α(1 − α)/R) (2)

where R represents the number of replications. This paper aims to study and investigate the effect of violating some assumptions of the hypothesis test for the equality of the means of two correlated variables on the distribution of the test statistics. These violations are represented by the following points:

1. Violation of the normality assumption due to the existence of outliers.
2. Small sample sizes.
3. The paired data follow a distribution other than the normal distribution.
4. Heterogeneity of the variances of the two dependent variables.

The main goal of this paper is to find a robust test that achieves the highest power when the paired data violate the assumptions of normality and homogeneity of variances of the correlated variables. Therefore, a number of robust tests are suggested: the Wilcoxon matched-pairs signed-ranks test using the bootstrap (BWS), the Wilcoxon matched-pairs signed-ranks test for sample sizes n > 25 using the bootstrap (BWL), and the bootstrapped paired t-test (BT).
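The two robustness criteria introduced above can be sketched in a few lines of Python (illustrative code, not from the paper; the defaults α = 0.05 and R = 10,000 match the values used later in the paper):

```python
# Sketch (illustrative, not the paper's code) of the two robustness criteria
# for an estimated Type I error rate alpha_hat at nominal level alpha.
import math

def bradley_liberal(alpha_hat, alpha=0.05):
    """Bradley's liberal criterion: 0.9*alpha < alpha_hat < 1.1*alpha."""
    return 0.9 * alpha < alpha_hat < 1.1 * alpha

def salter_fawcett(alpha_hat, alpha=0.05, R=10000):
    """Salter-Fawcett criterion: alpha_hat in alpha +/- 2*sqrt(alpha*(1-alpha)/R)."""
    half_width = 2 * math.sqrt(alpha * (1 - alpha) / R)
    return alpha - half_width <= alpha_hat <= alpha + half_width
```

For α = 0.05 the two intervals are very similar: Bradley's liberal criterion gives (0.045, 0.055), while the Salter-Fawcett interval with R = 10,000 is approximately (0.0456, 0.0544).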

Test Statistics
Paired t-Test
The paired t-test is one of the most important tests employed to assess the significance of the difference between the means of two dependent variables; it is sometimes called the dependent-sample t-test, and also the repeated-measures t-test when measurements are taken before and after a treatment. Let the two-dimensional random variable (X, Y) have a bivariate normal distribution with parameters µX, µY, σX, σY and ρ, with joint pdf [4,5]

f(x, y) = [1 / (2πσXσY√(1 − ρ²))] exp{ −[1 / (2(1 − ρ²))] [((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)²] }

where µX, µY ∈ ℝ; σX, σY ∈ ℝ⁺; ρ ∈ (−1, 1). The paired t-test aims to test the null hypothesis H0: µX = µY against the alternative hypothesis H1: µX ≠ µY. Let D = X − Y be the difference variable. Then D ∼ Normal(µD, σD²), where µD = µX − µY is the mean of the difference between the two populations and σD² = σX² + σY² − 2Cov(X, Y) is the variance of D. Therefore, the significance of the difference between µX and µY can be tested using the paired t-test by testing the following null hypothesis:

H0: µD = 0 (6)

against the alternative hypothesis H1: µD ≠ 0. The paired t-test statistic is given by [6,7]

T = D̄ / (SD / √n)

where D̄ and SD are, respectively, the mean and the standard deviation of the differences D in the matched sample. Notice that the statistic T is the one-sample t-test applied to the differences D between the two dependent variables.
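As a small illustration, the statistic T = D̄/(SD/√n) above can be computed as follows (a sketch, not the paper's code):

```python
# Sketch of the paired t statistic T = Dbar / (S_D / sqrt(n)) described above.
import math

def paired_t(x, y):
    """Paired t statistic for matched samples x and y of equal length n."""
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]          # differences D_i
    d_bar = sum(d) / n                             # mean difference
    s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))  # sample SD of D
    return d_bar / (s_d / math.sqrt(n))
```

The resulting value of T is compared with the t distribution with n − 1 degrees of freedom.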

Wilcoxon Matched-Pairs Signed-Ranks (W.M)
This test is an extension of the Wilcoxon signed-rank test, proposed by Frank Wilcoxon in 1945. It is widely used as an alternative to the paired t-test when the paired data violate the normality assumption, a violation that inflates the Type I error rate [8]. The test requires that the paired sample be random and the pairs independent. It is used to compare two dependent samples, or repeated measurements on a single sample, when the data are non-normal. The W.M test assesses whether the matched random sample is drawn from a population in which the median of the differences equals a specific value; in other words, it tests the two-sided null hypothesis

H0: θD = 0

against the alternative hypothesis H1: θD ≠ 0, where θD is the median of the differences Di between the two populations. The W.M test can be carried out using the following steps:

1. Compute the difference scores Di (i = 1, 2, …, n) for each pair of data.
2. Rank the absolute values of the difference scores |Di| from 1 through n. If two or more difference scores are equal, the mean of the ranks of these scores is assigned to each of the tied scores.
3. When Di = 0, the pair is not assigned a rank, and n is reduced by the number of cases in which the difference score equals 0.
4. Calculate the sum of the ranks of the positive differences (R⁺) and of the negative differences (R⁻). Notice that R⁺ + R⁻ = n(n + 1)/2.
5. The test statistic, say W, is given by W = min(R⁺, R⁻).
6. Compare the test statistic W with the critical value W* at a specific significance level, then reject H0 if W ≤ W* [8].

If the sample size is relatively large, the normal approximation of the W.M statistic can be used for testing the null hypothesis by means of the following test statistic [8]:

z = (W − n(n + 1)/4) / √(n(n + 1)(2n + 1)/24)

In all cases, the null hypothesis is rejected if |z| ≥ z*, where z* represents the tabulated critical value of the test at a specific level of significance.
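The steps above, including the mid-rank handling of ties and the large-sample normal approximation, can be sketched as follows (illustrative code, not the paper's):

```python
# Sketch of the W.M steps: mid-ranks for tied |D_i|, W = min(R+, R-), and the
# large-sample normal approximation z = (W - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24).
import math

def wilcoxon_wm(x, y):
    d = [xi - yi for xi, yi in zip(x, y) if xi != yi]  # step 3: drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))  # step 2: order by |D_i|
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1                                     # extend block of tied |D_i|
        mid_rank = (i + j) / 2 + 1                     # mean rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    r_plus = sum(r for r, di in zip(ranks, d) if di > 0)   # step 4
    r_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    w = min(r_plus, r_minus)                               # step 5
    mu_w = n * (n + 1) / 4                                 # normal approximation
    sigma_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu_w) / sigma_w
    return w, z
```

Note that r_plus + r_minus always equals n(n + 1)/2, as stated in step 4.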

Bivariate Contaminated Normal Distribution
In order to study the robustness of the test statistics against departures from the normality assumption, the bivariate normal distribution was contaminated by outliers. This was done by generating the random sample from the original distribution, denoted F, and allowing a specified proportion of the sample observations to come from other distributions G1, G2, …, Gk that differ in their parameters from the original distribution. These observations are known as contaminated. The mixture is usually expressed as follows:

(1 − λ1 − λ2 − ⋯ − λk)F + λ1G1 + ⋯ + λkGk

where λi is the contamination rate contributed by the distribution Gi, i = 1, …, k. There are two types of contamination. The first type is known as symmetric contamination. It is obtained when the contaminating distribution G is symmetric about the centre of the original distribution F: the two distributions have equal means, but the variance of G is inflated to make it bigger than the variance of F. If both G and F are normal distributions with

F ~ N(µ, σ²), G ~ N(µ, kσ²), k > 1,

then the continuous random variable X resulting from the mixture has a symmetric contaminated normal distribution with contamination rate λ, i.e.,

X ~ (1 − λ)F + λG

The other type is known as asymmetric contamination. It is obtained when the contaminating distribution G2 is located away from the centre of F: G2 and F have the same variance but differ in location, i.e.,

G2 ~ N(µ + a, σ²), a > 0

In this case, the distribution of the random variable X can be expressed as X ~ (1 − λ)F + λG2.
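The symmetric contamination scheme described above can be sketched as follows (illustrative Python; the defaults λ = 0.2 and k = 25 mirror the 20% contamination rate and 25-fold variance inflation used in the simulation experiments):

```python
# Sketch of symmetric contamination: each observation comes from F = N(mu, Sigma)
# with probability 1 - lam and from G = N(mu, k*Sigma) with probability lam.
import numpy as np

def contaminated_bvn(n, mu, sigma_x, sigma_y, rho, lam=0.2, k=25.0, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[sigma_x ** 2, rho * sigma_x * sigma_y],
                    [rho * sigma_x * sigma_y, sigma_y ** 2]])
    from_g = rng.random(n) < lam                 # which rows are contaminated
    sample = np.empty((n, 2))
    sample[~from_g] = rng.multivariate_normal(mu, cov, size=int((~from_g).sum()))
    sample[from_g] = rng.multivariate_normal(mu, k * cov, size=int(from_g.sum()))
    return sample
```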

The Bootstrap Resampling Technique
The bootstrap is the most popular resampling technique used in statistical analysis. It was first developed and introduced by Efron in 1979. It is a computer-based resampling technique developed to make statistical inference simpler [10], and it can be used for precision-based data-simulation problems in statistical reasoning. According to Efron, the bootstrap differs from classical statistical inference in that the method is very simple and based on resampling procedures. In this paper, the bootstrapped paired t-test (BT) is obtained with the following procedure, which gave better results: the paired t-test statistic T*j (j = 1, 2, …, b) is computed on the resampled differences D*ij (i = 1, 2, …, n), where D*ij represents the difference of the i-th resampled pair in the j-th bootstrap resample, and the observed statistic is compared with the bootstrap distribution of T*1, …, T*b.
Similarly, we bootstrap the W.M test and the normal approximation of the W.M test, where WSj is the j-th Wilcoxon matched-pairs signed-ranks statistic for small samples and WLj is the j-th Wilcoxon matched-pairs signed-ranks statistic for large samples. In this paper, we use the nominal level α = 0.05; therefore, Bradley's liberal criterion becomes 0.045 < α̂ < 0.055.
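A sketch of the bootstrapped paired t-test just described, using the standard bootstrap p-value obtained by resampling the centred differences under H0 (an illustrative implementation under these assumptions, not the paper's code):

```python
# Sketch of the BT procedure: centre the differences under H0: mu_D = 0,
# resample them with replacement b times, and compare the observed |T| with
# the bootstrap distribution of |T*_j|.
import math
import random

def t_stat(d):
    n = len(d)
    d_bar = sum(d) / n
    s = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    if s == 0:                                  # degenerate resample: no evidence
        return 0.0
    return d_bar / (s / math.sqrt(n))

def bootstrap_paired_t(x, y, b=1000, seed=0):
    rng = random.Random(seed)
    d = [xi - yi for xi, yi in zip(x, y)]
    t_obs = t_stat(d)
    d_bar = sum(d) / len(d)
    d0 = [di - d_bar for di in d]               # centre the D_i under H0
    exceed = sum(
        abs(t_stat([rng.choice(d0) for _ in d0])) >= abs(t_obs)
        for _ in range(b)
    )
    return exceed / b                           # bootstrap p-value
```

Replacing `t_stat` with a Wilcoxon statistic gives the analogous bootstrapped W.M tests (BWS, BWL).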

Simulation Study
According to the Salter and Fawcett criterion [3], the test is regarded as robust if its Type I error rate α̂ satisfies 0.05 ± 2√(0.05(1 − 0.05)/10000), i.e., the test is robust if α̂ falls within the interval (0.0456, 0.0544). Notice that, in this article, the two criteria of robustness are quite close; Bradley's liberal criterion will be used because it is more popular. The algorithm of the simulation experiments can be summarized in the following table:
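One cell of such a simulation experiment can be sketched as follows (illustrative Python: the number of replications is reduced from the paper's 10,000 for speed, and the two-sided critical value t(0.975, 29) ≈ 2.045 is hardcoded for n = 30):

```python
# Sketch of one Monte Carlo cell: estimate the Type I error rate of the
# paired t-test under H0 (equal means) for bivariate normal paired data.
import numpy as np

def type_i_error_rate(n=30, rho=0.4, R=1000, t_crit=2.045, seed=1):
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    rejections = 0
    for _ in range(R):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # H0 is true
        d = xy[:, 0] - xy[:, 1]
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
        rejections += abs(t) > t_crit
    return rejections / R
```

An estimate near 0.05 (inside Bradley's interval 0.045–0.055) indicates robustness in this cell; the full study repeats this over sample sizes, distributions, and values of ρ, and estimates power by generating data with unequal means.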

Simulation Results
To examine and compare the behavior of the test statistics under different cases, the simulation results, represented by Type I error rates and power rates, are summarized in Tables 2-11. The behavior of the different tests is discussed briefly according to the distribution of the population from which the matched sample is drawn, as follows:

1. Bivariate normal distribution with equality of variances
i) Type I error rates
Type I error rates for the different tests at α = 0.05, applied to matched data from a bivariate normal distribution, are tabulated in Table (2). With large sample sizes (n ≥ 30), the normal approximation (WL) can be used.
ii) Power rates
 It is clear that, with increasing sample size, the power rates of all tests increase and converge to 1, in line with the central limit theorem.
 The power rates increase as the correlation coefficient ρ increases.
 It can be observed that the power rates of the BT test with sample size n = 100 are greater than those of the T test for all the different values of ρ.
 The paired t-test statistic is extremely sensitive (not robust) to contaminated data when n ≤ 30 for the different values of ρ, which means it is not robust against departure from the normality assumption.

2. Bivariate contaminated normal distribution with equality of variances
 The most robust tests are BWL and WL in all cases, followed by BWS for small sample sizes.
 The BT test improves the robustness of the paired t-test: it is robust in all cases except n = 10 with ρ = 0 and ρ = 0.4.
 All test statistics are robust against departure from the normality assumption when n ≥ 50.

3. Bivariate normal distribution without equality of variances
i) Type I error rates
Table (6) shows that when σY² increases (σY² = 25) and differs from σX² (σX² = 1), the Type I error rates differ very little from those obtained when σY² = 1 (see Table 2 and its discussion).
ii) Power rates
The results for the power rates in Table (7) show that:
 Generally, the power rates decrease when the variance of Y increases, compared with the corresponding results for the equal-variance case.
 It is clear that the power rates increase as the correlation coefficient increases.
 It can be seen that, with increasing sample size, the power rates of all tests increase and converge to each other.
 Finally, for all test statistics, the power rates increase with increasing Type I error rate.

4. Bivariate contaminated normal distribution without equality of variances
In this case, the different tests were applied to paired data from the bivariate contaminated normal distribution with the variances of X and Y assumed unequal.
i) Type I error rates
Table (8) includes the results of the Type I error rates for the bivariate contaminated normal distribution with X ~ 80% N(1, 1) + 20% N(1, 25) and Y ~ N(1, 25). The important points of the results can be summarized as follows:
 The results of the non-parametric tests (BWS, BWL, WS, WL) are nearly the same as the results for the bivariate contaminated normal distribution with equal variances assumed (homogeneity of variances; see Table 4).
 In general, the Type I error rates of the T test become better than the corresponding values for the bivariate contaminated normal distribution with equal variances assumed (see Table 4), because 20% of the matched samples are contaminated by paired data in which the two correlated variables have the same variance, i.e., σX² = σY² = 25.
 It is noticed that the T test is insensitive to the non-normality assumption in all cases except when the sample size is n = 10, for the different values of ρ.

ii) Power rates
The results of the simulation study of the power rates can be summarized in Table (11).
 The power rates for all methods increase with increasing sample size and correlation coefficient, and may approach 1 when n = 30, 50 and ρ = 0.8.
 The t-test has the highest power rates for all sample sizes compared with the other tests when ρ = 0.4, 0.8.
 Generally, the power rates under the bivariate exponential distribution and the bivariate normal distribution are higher than those under the bivariate contaminated normal distribution.
 When ρ = 0 and n = 30, 50, it can be seen that BWL has higher power rates than the other tests.
 The WL test has higher power rates when the sample size equals 10 and ρ ≤ 0.4.
 It is clear that the power rates of the non-parametric tests (WS, WL, BWS, BWL, BT) when the data follow the bivariate exponential distribution are higher than the power rates when the data follow the bivariate contaminated normal distribution, for the different values of ρ.

Conclusion
The Monte Carlo simulation was employed to study the behavior of the different test statistics used for comparing the means of two paired populations. Based on the theoretical part and on the Type I error rates and power rates of the tests, the most important conclusions are:
 It is obvious that the power rates increase with increasing correlation coefficient and sample size.
 The presence of outliers leads to a decrease in the Type I error rates of the paired t-test statistic.
 In the case of 20% outliers with homogeneity of variances of the correlated variables, the bootstrapped Wilcoxon signed-rank test for large sample sizes (BWL) performs best in comparison with the other tests in all cases and for the different values of ρ, followed by the bootstrapped Wilcoxon signed-rank test for small sample sizes (BWS) when n ≤ 10.
 When the paired data follow the bivariate exponential distribution, the Wilcoxon signed-rank test for large sample sizes (WL) is the most powerful compared with the other tests for small sample sizes, for the different values of ρ, while the bootstrapped paired t-test (BT) is best compared with the other tests when n ≥ 30 and ρ > 0.
 In the case of outliers (with homogeneity of variances), we recommend applying the Wilcoxon signed-rank test for large sample sizes (WL) when the sample size n ≤ 50 and ρ > 0.