Neapolitan ice cream is a very well-known ice cream flavor all over the world. It is made up of \(3\) flavors, which are chocolate, vanilla, and strawberry.
It is likely that there are many people in the world who love Neapolitan ice cream, but among those \(3\) flavors, will there be one that is more popular than the others?
In statistics, there are tests that can help you answer these kinds of questions. In this article, different types of tests for inferring the distribution of categorical data will be mentioned. In particular, you'll work with the Chi-square Test for Goodness of Fit.
Recall that categorical data (often called qualitative data) are data that can be grouped into categories whose members share similar characteristics, and that are generally described using words.
Examples of categorical data are blood type, which can be A+, B+, O-, or AB+; eye color, which can be blue, green, or brown; and favorite flavor in Neapolitan ice cream, which can be chocolate, vanilla, or strawberry.
For categorical data, you are interested in how many cases are in each category. To display this data, you can use frequency tables, bar charts, or pie charts.
Some analyses you can do with categorical data are:
Analyze whether categorical data obtained from a population follow a given distribution.
Check whether the distributions of the same categorical variable in different populations are equal to each other.
Determine if there is a relationship between two categorical variables.
To analyze the behavior of your categorical data, you perform a hypothesis test.
This consists of the following:
You propose a null hypothesis, i.e., you propose some expected behavior for the distribution of your categorical data.
You suggest an alternative hypothesis, i.e., something that would happen if your null hypothesis is false, or a hypothesis that negates your null hypothesis.
You use a test (such as the chi-square test) to decide whether the data in your sample provide enough evidence to reject the null hypothesis.
Let's go back to the Neapolitan ice cream example.
Suppose that to find out if people have a preference for a certain flavor, you surveyed \(120\) random people, of whom \(50\) said their favorite flavor was chocolate, \(35\) said vanilla and \(35\) said strawberry.
Using a frequency table to display the information, you will have
Flavor | Chocolate | Vanilla | Strawberry |
Frequency | \(50\) | \(35\) | \(35\) |
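As a minimal sketch of how such a frequency table can be tallied, the Python snippet below rebuilds the survey answers from the counts above and counts them by category. The variable name `responses` is a hypothetical stand-in for raw survey data, which the article does not show.

```python
from collections import Counter

# Rebuild the 120 answers from the counts reported in the survey.
# `responses` is a hypothetical stand-in for the raw survey data.
responses = ["Chocolate"] * 50 + ["Vanilla"] * 35 + ["Strawberry"] * 35

# Counter tallies how many times each category appears.
frequency = Counter(responses)

for flavor, count in frequency.items():
    print(f"{flavor}: {count}")
```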
In this case, your intuition says that there is no preference for any flavor of Neapolitan ice cream, so your null hypothesis would be
\(H_0\): All flavors are equally popular.
Meanwhile, your alternative hypothesis would be the negation of the above, i.e.,
\(H_a\): The flavors are not all equally popular.
A distribution used to analyze the behavior of categorical data is the so-called Chi-square distribution.
The Chi-square test statistic is a measure of how much the observed counts differ from the expected counts. It is calculated using the formula
\[\chi^2=\sum_{i=1}^n\frac{(O_i-E_i)^2}{E_i}, \]
where \(O_i\) is the observed count in the \(i^{th}\) category, \(E_i\) is the expected count in the \(i^{th}\) category, and \(n\) is the number of categories.
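The formula translates almost directly into code. The following is a sketch in Python, not a reference implementation; the function name `chi_square_statistic` is chosen here for illustration. It is applied to the ice cream counts, assuming the expected count under \(H_0\) is \(120/3=40\) per flavor.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Ice cream survey: chocolate, vanilla, strawberry.
observed = [50, 35, 35]
# Assumption: under H0 each flavor is expected 120 / 3 = 40 times.
expected = [40, 40, 40]

statistic = chi_square_statistic(observed, expected)
print(statistic)
```

Each term measures how far one category is from its expected count, scaled by that expected count, so large categories do not dominate simply because they are large.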
There are \(3\) types of Chi-square tests for different situations:
Chi-square Test for Goodness of Fit: It is used to verify whether the observed data follow an expected distribution.
Chi-square Test for Homogeneity: It is used to compare distributions between different groups or populations.
Chi-square Test for Independence: It is used to check if there is a relationship between two variables.
Each of these tests has its own explanation on StudySmarter. Check them out to learn more!
Before using the Chi-square statistic, make sure that the following conditions are met:
You are working with a categorical variable;
The individuals should be a random sample from the population of interest;
Categories are mutually exclusive. It is not possible for one observation to fall into two categories;
At least \(5\) individuals are expected in each category.
Important: The data must be counts (not percentages, proportions, or measurements).
Note that these conditions are indeed satisfied for the ice cream example: the variable is categorical, the surveyed individuals were picked at random, each individual could only choose \(1\) flavor, and the data are counts.
Finally, the assumption was that all flavors were equally popular, meaning that the expected value for each category was \(120/3=40\), which is greater than \(5\).
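The expected-count condition is easy to check programmatically. Below is a small Python sketch under the assumption that the null hypothesis makes all categories equally likely; the helper names `expected_counts_equal` and `meets_expected_count_rule` are made up for this example.

```python
def expected_counts_equal(total, n_categories):
    """Expected count per category when all categories are equally likely."""
    return [total / n_categories] * n_categories


def meets_expected_count_rule(expected, minimum=5):
    """True if every category has an expected count of at least `minimum`."""
    return all(e >= minimum for e in expected)


observed = {"Chocolate": 50, "Vanilla": 35, "Strawberry": 35}
total = sum(observed.values())

expected = expected_counts_equal(total, len(observed))
print(expected)                             # [40.0, 40.0, 40.0]
print(meets_expected_count_rule(expected))  # True
```

Note that the rule is applied to the expected counts, not the observed ones.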
To use a chi-square test, you first need to set a significance level, \(\alpha\). The most commonly used level is \(5\%\) (\(\alpha=0.05\)).
The next step is to find the degrees of freedom, which is the number of independent categories of the categorical variable. For the Chi-square Test for Goodness of Fit, the number of degrees of freedom equals the number of categories minus \(1\).
After that, you find the critical Chi-square value or the \(p\)-value. Either of these values can be used to reach a conclusion. Part of a chi-square table is presented below.
Degrees of freedom | \(\alpha=0.9\) | \(\alpha=0.1\) | \(\alpha=0.05\) | \(\alpha=0.025\) |
\(1\) | \(0.02\) | \(2.71\) | \(3.84\) | \(5.02\) |
\(2\) | \(0.21\) | \(4.61\) | \(5.99\) | \(7.38\) |
\(3\) | \(0.58\) | \(6.25\) | \(7.82\) | \(9.35\) |
\(4\) | \(1.06\) | \(7.78\) | \(9.49\) | \(11.14\) |
For instance, suppose you are conducting a test with a significance level of \(5\%\) and \(3\) degrees of freedom. Using the table, the corresponding critical Chi-square value (the entry where \(\alpha=0.05\) and \(3\) degrees of freedom meet) is \(7.82\). If your test statistic is lower than this critical value, you fail to reject the null hypothesis; otherwise, you reject it.
Another approach is to use the \(p\)-value. If your test statistic is \(6.5\), it lies between \(6.25\) and \(7.82\) in the row for \(3\) degrees of freedom, so your \(p\)-value is between \(0.05\) and \(0.1\). Since the \(p\)-value is greater than the significance level, the null hypothesis is not rejected.
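Both readings of the table can be sketched in Python. The dictionary `CRITICAL` simply copies the table entries above, and `chi2_sf` is an illustrative helper (not a library function) that computes the upper-tail probability of a Chi-square distribution with an integer number of degrees of freedom, using a standard recurrence for the regularized incomplete gamma function. In practice, a statistics library would normally be used instead.

```python
import math

# Critical values copied from the table above: CRITICAL[df][alpha].
CRITICAL = {
    1: {0.9: 0.02, 0.1: 2.71, 0.05: 3.84, 0.025: 5.02},
    2: {0.9: 0.21, 0.1: 4.61, 0.05: 5.99, 0.025: 7.38},
    3: {0.9: 0.58, 0.1: 6.25, 0.05: 7.82, 0.025: 9.35},
    4: {0.9: 1.06, 0.1: 7.78, 0.05: 9.49, 0.025: 11.14},
}


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a Chi-square variable.

    Assumes x >= 0 and df is a positive integer.
    """
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # exact tail for df = 1
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * e^(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


statistic, df, alpha = 6.5, 3, 0.05

critical = CRITICAL[df][alpha]
p_value = chi2_sf(statistic, df)

print("critical value:", critical)
print("reject H0 by critical value:", statistic >= critical)
print("p-value:", round(p_value, 3))
print("reject H0 by p-value:", p_value < alpha)
```

For this example the computed \(p\)-value is roughly \(0.09\), which falls between \(0.05\) and \(0.1\), and both decision rules agree.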
Take a look at the articles Chi-square Distribution and Chi-square Tests to learn more about this!
Let's perform the Chi-square test of Goodness-of-Fit to determine the distribution of the ice cream example.
Using the data obtained in your experiment, with a significance level of \(5\%\) (\(\alpha=0.05\)), is there sufficient evidence to conclude that the chocolate flavor is more popular than the other flavors in Neapolitan ice cream?
Solution:
Putting the expected values and observed values in a table, you'll have
Flavor | Chocolate | Vanilla | Strawberry |
Expected | \(40\) | \(40\) | \(40\) |
Observed | \(50\) | \(35\) | \(35\) |
Calculating the test statistic:
\[ \begin{align} \chi^2 &= \sum_{i=1}^3\frac{(O_i-E_i)^2}{E_i} \\ &= \frac{(50-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(35-40)^2}{40} \\ &= \frac{(10)^2}{40}+\frac{(-5)^2}{40}+\frac{(-5)^2}{40} \\ &=\frac{150}{40} \\ &=3.75. \end{align} \]
The significance level is \(\alpha= 0.05\) and the degrees of freedom are \(3-1=2\). Using a Chi-square table, the corresponding critical Chi-square value is \(5.99\). Because the test statistic, \(3.75\), is smaller than the critical value, the null hypothesis is not rejected.
Another way of looking at it is that the \(p\)-value lies between \(0.1\) and \(0.9\), so it is greater than \(0.05\). Therefore, there is not enough evidence to reject \(H_0\): the data are consistent with all flavors being equally popular.
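The whole worked example can be reproduced in a few lines of Python. This is a sketch rather than a general-purpose routine: it relies on the fact that, for exactly \(2\) degrees of freedom, the upper-tail probability of the Chi-square distribution has the closed form \(e^{-x/2}\), so no statistics library is needed. Note that \(150/40\) evaluates to \(3.75\).

```python
import math

observed = [50, 35, 35]  # chocolate, vanilla, strawberry
total = sum(observed)

# Under H0 all flavors are equally popular: 120 / 3 = 40 each.
expected = [total / len(observed)] * len(observed)

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
alpha = 0.05

# Only valid for df == 2: P(X >= x) = exp(-x / 2).
p_value = math.exp(-statistic / 2)

reject_h0 = p_value < alpha
print(statistic, df, round(p_value, 3), reject_h0)  # 3.75 2 0.153 False
```

The \(p\)-value of about \(0.153\) sits between \(0.1\) and \(0.9\), matching the range read from the table.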
You can verify whether the distribution of the categorical data follows a proposed model. You can also check whether two or more categorical variables are related, that is, whether they are dependent or independent.
Depending on whether you want to find a relation or differences between two variables or more, you can use the Chi-Square test for Goodness of Fit, the Chi-Square test for Homogeneity or the Chi-Square test for Independence.
In a frequency table, you find all the categories of the variable of interest and how many times each category was observed.
A pie chart is a circle divided into slices, where each slice represents a category. The area of each slice is proportional to the number of cases in that category.
With bar charts, you can compare the observed values with the expected values in a simpler way, plotting them side by side.