logo

STATISTICAL TEST OF SIGNIFICANCE


Statistical inference is the art of generating conclusions about the distribution of the data. A data scientist is often exposed to question that can only be answered scientifically. Therefore, statistical inference is a strategy to test whether a hypothesis is true, i.e. validated by the data.




Some Definitions

HYPOTHESIS: A hypothesis is a statement or assumption that we make about the population which is tested based on information from a sample. The hypothesis is either accepted after the test or is being rejected. There are two types of hypothesis, null hypothesis and alternative hypothesis.

NULL HYPOTHESIS:
A null hypothesis is a general statement or assumption that there is no difference between two factors, or groups or population parameters. It is usually denoted by \(H_0\).

ALTERNATIVE HYPOTHESIS:
An alternative hypothesis is a complementary statement to our null hypothesis. This statement is accepted when our null hypothesis is being rejected. It is generally denoted by \(H_1\).

Here we will discuss about the following statistical tests-

t-test

t-test is used when we are comparing a single sample mean with the mean of the population from which the sample is generated or when we are comparing the means of two different samples.

It is to be noted that t-test is only applicable when the sample size is less than 30

t-test for single mean

The t-test, or student’s test for single mean, compares the mean of a vector against a theoretical population mean, . The formula used to compute the t-test is:

\[ t= \dfrac{\bar{x}-\mu}{S/\sqrt{n}}\]

Here

  • \(\bar{x}\) refers to the sample mean.
  • \(\mu\) to the assumed population mean.
  • S is the standard error.
  • n the number of observations.

Here we are interested in testing the null hypothesis
\(H_0: \mu(\text{$pop^n$ mean})=\mu_0(\text{particular value})\)

Against the alternative \(H_1: \mu(\text{$pop^n$ mean})\ne\mu_0(\text{particular value})\)

To evaluate the statistical significance of the t-test, you need to compute the p-value. The p-value ranges from 0 to 1, and is interpreted as follow:

  • A p-value lower than 0.05 means you are strongly confident to reject the null hypothesis, thus the alternative hypothesis is accepted.

  • A p-value higher than 0.05 indicates that you don’t have enough evidences to reject the null hypothesis and hence it may be accepted.

Example

In an All-India level test, 25 students from a certain coaching center secured the following marks
49, 25, 50, 49, 36, 7, 16, 26, 30, 66, 41, 18, 13, 26, 26, 24, 34, 17, 37, 11, 27, 36, 32, 42, 29
The owner of the center claimed that average mark scored by a student (All over India) is 30. Test whether his claim is correct or not.

Solution:

Here we are testing the null hypothesis: \(H_0:\mu=30\)
Against the alternative \(H_1: \mu \ne 30\)

# Enter the data
x=c(49, 25, 50, 49, 36, 7, 16, 26, 30, 66, 41, 18, 13, 26, 26, 24, 34, 17, 37, 11, 27, 36, 32, 42, 29)
# Apply t test sor single mean
t.test(x, mu=30)

    One Sample t-test

data:  x
t = 0.24507, df = 24, p-value = 0.8085
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 24.95326 36.40674
sample estimates:
mean of x 
    30.68 

Here, the p-value is 0.8085 which is greater than 0.05, hence we does not have enough evidence to reject his claim. So we should accept his claim. Also, we are 95% sure that the mark of a student will lie within 24.95326~25 and 36.40674~36.

t-test for paired mean

This test is used when we want to compare the mean of two sets of data assuming that they have equal amount of variation. The test statistics is calculated using the formula- \[t=\dfrac{\bar{x}-\bar{y}}{S\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\] Here,

  • \(\bar{x}\) is the mean of first set of data.
  • \(\bar{y}\) is the mean of second set of data.
  • S is the standard error.
  • \(n_1\) is the number of elements in first set.
  • \(n_2\) is the number of elements in second set.

Here we are interested in testing the null hypothesis \(\bar{x}=\bar{y}\)

Against the alternative \(\bar{x}\ne \bar{y}\)

Example

In an experiment, two different kinds of protein supplement were provided to the members of a Gym. Their strength on a particular scale were measured after one month of using the product. The results are as shown below
Supplement A: 25,32,30,34,24,14,32,24,30,31,35,25.
Supplement B: 44,34,22,10,47,31,40,30,32,35,18,21,35,29,22.
The owner of the gym claimed that the two supplements are not of much difference. Is he correct about his claim? Test using t-test for paired mean.

Solution

Here, our null hypothesis is, \(H_0:\) The two supplements has the same result
Against the alternative that \(H_1:\) They are significantly different.

# Enter first set of data
x=c(25,32,30,34,24,14,32,24,30,31,35,25)

# Enter second set of data
y=c(44,34,22,10,47,31,40,30,32,35,18,21,35,29,22)

# Apply t-test

t.test(x,y,var.equal =T)

    Two Sample t-test

data:  x and y
t = -0.61028, df = 25, p-value = 0.5472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.749507  4.749507
sample estimates:
mean of x mean of y 
       28        30 

Since the p-value = 0.5472 is greater than 0.05, so we should accept the null hypothesis and conclude that the owner’s claim is correct.

Chi-square test

Chi-Square test is a statistical method to determine if two categorical variables have a significant correlation between them. Both those variables should be from same population and they should be categorical like − Yes/No, Male/Female, Red/Green etc.

Chi square test is only applicable when your variable is categorical.

The formula used for Chi-square test is: \[\chi^2=\sum_{i=1}^n\dfrac{(O_i-E_i)^2}{E_i^2}\] Where \(O_i\) is the observed frequency in each cell and \(E_i\) is the expected frequency in that cell.

For example, we can build a data set with observations on people’s ice-cream buying pattern and try to correlate the gender of a person with the flavor of the ice-cream they prefer. If a correlation is found we can plan for appropriate stock of flavors by knowing the number of gender of people visiting.

The function used for performing chi-Square test is chisq.test().

The basic syntax for creating a chi-square test in R is −

chisq.test(data)

Following is the description of the parameters used −

  • data is the data in form of a table containing the count value of the variables in the observation.

Example:

15 males and 15 females visiting a shopping mall were asked whether they do online shopping or note. The result is shown below. Check whether their online shopping habit is related to their gender or not.

Gender Online shopping Gender Online Shopping
Male Yes Male Yes
Male No Female Yes
Female No Female Yes
Female No Female Yes
Male Yes Female Yes
Male Yes Male Yes
Female Yes Male No
Female No Male No
Male No Male Yes
Female Yes Female Yes
Male No Male Yes
Female No Male Yes
Female No Female No
Female Yes Female Yes
Male Yes Male Yes

Solution:

Here, the null hypothesis that we are testing is, \(H_0:\) There is no relationship between gender and online shopping habit. Against the alternative that they are related.

# Create a factor variable for gender 
# We will use 0 for female and 1 for male
gender=c(1,1,0,0,1,1,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,1,1,1,0,1,1,0,0,1)
gender=factor(gender,labels = c("Female","Male"))

# Create a factor for shopping habit
# 0: No, 1: Yes
habit=c(1,0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,0,1,1)
habit=factor(habit, labels = c("No", "Yes"))

# Create a cross table
my.data=table(gender,habit)
my.data
        habit
gender   No Yes
  Female  6   9
  Male    5  10
# apply chi square test

chisq.test(my.data,correct = FALSE)

    Pearson's Chi-squared test

data:  my.data
X-squared = 0.14354, df = 1, p-value = 0.7048

Since the p-value is greater than 0.05, hence we should accept the null hypothesis and conclude that online shopping habit does not depends on gender of a person.

ANOVA

Analysis of Variation (ANOVA) is a way of testing whether the variation present in your data set is due to one or more grouping factors or they are just by chance.

Based on the number of groups into which your data is divided, there are two different methods of ANOVA,

  • One way ANOVA
  • Two way ANOVA

If your data is classified according to only one criterion then we will conduct One way ANOVA and if your data is divided according to two different criterion then we carry out Two way ANOVA.

For example, consider that you have collected the marks of 500 students from 5 different schools of Guwahati. Now obviously the marks of different students will vary. Now we can group them according to their respective schools and see if their marks varies due to their school (Or in Statistical language: Whether school is a significant factor of variation in marks). In this case since we are considering only one grouping factor, so we will have to apply One way ANOVA. On the other hand, if we consider another grouping factor say their gender and want to see whether gender is a significant factor of variation or not along with their school, then we will have to apply Two way ANOVA.

The basic syntax for an ANOVA test is

aov(formula, data)

Where,

  • formula: The relationship we want to test.
  • data: The data set we are using.

It should be noted that to perform ANOVA, your data should be a numerical variable and grouping variable should be a factor of at least 2 levels.

One way ANOVA:

Example:

Marks of 30 students were collected from 3 different schools for both male and female. Analyse whether school and gender are significant factor for variation in marks.

Solution:

The data set we will be using here can be downloaded from here: ANOVA Data

  • Save the downloaded anova.csv file to your R working directory.

Here, we are interested in testing the null hypothesis that “Gender is not a significant factor for variation in marks”

The analysis can be carried out using the following R program-

# Import the data set
data=read.csv("anova.csv", header = TRUE)
# lets dee the structure of the data
str(data)
'data.frame':   30 obs. of  3 variables:
 $ School: chr  "School A" "School A" "School A" "School A" ...
 $ Gender: chr  "Male" "Male" "Male" "Male" ...
 $ Marks : int  54 47 89 23 84 74 55 64 39 48 ...
# Here we have two factor variables "Gender" and "School"
# and a int variable "Marks"
# apply one way anova

one_way_school=aov(Marks~School, data=data)
summary(one_way_school)
            Df Sum Sq Mean Sq F value Pr(>F)
School       2    150    75.2   0.189  0.829
Residuals   27  10762   398.6               

Since the p-value 0.829 is greater than 0.05, hence we should accept our null hypothesis and conclude that school is not a significant factor of variation in marks of students.

Again if we want to check whether gender is a significant factor of variation, then we can use the following program-

one_way_gender=aov(Marks~Gender, data=data)
summary(one_way_gender)
            Df Sum Sq Mean Sq F value Pr(>F)
Gender       1    173   172.8   0.451  0.508
Residuals   28  10740   383.6               

Here also, the p-value(0.508) is greater than 0.05 and hence we conclude that gender is also not a significant factor of variation.

Two Way ANOVA:

Here we are interested in testing the null hypothesis that gender and school both are not a significant factor of variation in marks of student.

two_way_aov=aov(Marks~School+Gender, data=data)
summary(two_way_aov)
            Df Sum Sq Mean Sq F value Pr(>F)
School       2    150    75.2   0.186  0.832
Gender       1    222   222.5   0.549  0.465
Residuals   26  10540   405.4               

Since p-value for both School and Gender are greater than 0.05, hence both are insignificant for the variation of marks.