# Inference for Distributions of Categorical Data

Neapolitan ice cream is a well-known ice cream flavor all over the world. It is made up of $$3$$ flavors: chocolate, vanilla, and strawberry.

It is likely that there are many people in the world who love Neapolitan ice cream, but among those $$3$$ flavors, will there be one that is more popular than the others?

In statistics, there are tests that can help you answer these kinds of questions. This article introduces the tests used for inference about the distribution of categorical data. In particular, you'll work with the Chi-square Test for Goodness of Fit.

## Statistics Inference for Distributions of Categorical Data Test

Recall that categorical data (often called qualitative data) can be grouped into categories whose members share similar characteristics, and are generally described using words.

Examples of categorical data are blood type, which can be A+, B+, O-, or AB+; eye color, which can be blue, green, or brown; or favorite flavor in Neapolitan ice cream, which can be chocolate, vanilla, or strawberry.

For categorical data, you are interested in how many cases are in each category. To display this data, you can use frequency tables, bar charts, or pie charts.

Some analyses you can do with categorical data are:

• Analyze whether categorical data obtained from a population follow a given distribution.

• See whether the distributions of the same categorical variable in different populations are equal to each other.

• Determine if there is a relationship between two categorical variables.

## Hypothesis Testing for Categorical Variables

To analyze the behavior of your categorical data, you carry out a hypothesis test.

This consists of the following:

1. You propose a null hypothesis, i.e., you propose some expected behavior for the distribution of your categorical data.

2. You suggest an alternative hypothesis, i.e., something that would happen if your null hypothesis is false, or a hypothesis that negates your null hypothesis.

3. You use a test (such as the chi-square test) to decide whether the data in your sample provide enough evidence to reject the null hypothesis.

## Example of Categorical Hypothesis

Let's go back to the Neapolitan ice cream example.

Suppose that to find out if people have a preference for a certain flavor, you surveyed $$120$$ random people, of whom $$50$$ said their favorite flavor was chocolate, $$35$$ said vanilla and $$35$$ said strawberry.

Using a frequency table to display the information, you will have

| Flavor | Chocolate | Vanilla | Strawberry |
|---|---|---|---|
| Frequency | $$50$$ | $$35$$ | $$35$$ |

In this case, suppose your intuition says that there is no preference for any flavor of Neapolitan ice cream, so your null hypothesis would be

$$H_0$$: All flavors are equally popular.

Meanwhile, your alternative hypothesis would be the negation of the above, i.e.,

$$H_a$$: The flavors are not all equally popular.

## Introducing the Chi-Square Test

A distribution commonly used for inference about categorical data is the Chi-square distribution.

The Chi-square test statistic is a measure of how much the observed counts differ from the expected counts. It is calculated using the formula

$\chi^2=\sum_{i=1}^n\frac{(O_i-E_i)^2}{E_i},$

where $$O_i$$ is the $$i^{th}$$ observed value and $$E_i$$ is the $$i^{th}$$ expected value.
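
As a minimal sketch of the formula above, the statistic can be computed in a few lines of Python. The function name `chi_square_statistic` is an illustrative choice, not something defined in this article, and the counts in the call are made up. The sketch assumes both lists hold counts and that every expected count is positive.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts, chosen only to illustrate the call.
print(chi_square_statistic([18, 22, 20], [20, 20, 20]))
```

A statistic of $$0$$ means the observed counts match the expected counts exactly; the larger the value, the worse the fit.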

There are $$3$$ types of Chi-square tests for different situations:

• Chi-square Test for Goodness of Fit: It is used to verify whether the observed data follow an expected distribution.

• Chi-square Test for Homogeneity: It is used to compare distributions between different groups or populations.

• Chi-square Test for Independence: It is used to check if there is a relationship between two variables.

### Conditions for Chi-Square Test

Before using the Chi-square statistic, make sure that the following conditions are met:

• You are working with a categorical variable;

• The individuals should be a random sample from the population of interest;

• Categories are mutually exclusive. It is not possible for one observation to fall into two categories;

• At least $$5$$ individuals are expected in each category.

Note: the data must be counts (not percentages, proportions, or measurements).

Note that these conditions are indeed satisfied for the ice cream example. The variable is categorical, the surveyed individuals were picked at random, and each individual could only choose $$1$$ flavor.

Finally, the assumption was that all flavors were equally popular, meaning that the expected value for each category was $$120/3=40$$, which is greater than $$5$$.
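
The expected-count condition can be checked mechanically. The sketch below assumes the null hypothesis assigns a probability to each category (equal probabilities for the ice cream example). The helper names `expected_counts` and `meets_expected_count_rule` are hypothetical and used only for illustration.

```python
def expected_counts(sample_size, probabilities):
    """Expected count per category: sample size times hypothesized probability."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return [sample_size * p for p in probabilities]


def meets_expected_count_rule(expected, minimum=5):
    """True when every expected count is at least `minimum`."""
    return all(e >= minimum for e in expected)


# Ice cream example: 120 people, three equally likely flavors under H0.
expected = expected_counts(120, [1 / 3, 1 / 3, 1 / 3])
print(expected, meets_expected_count_rule(expected))
```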

## Inference using the Chi-Square Test

To use a chi-square test, you first need to set a significance level, $$\alpha$$. The most commonly used level is $$5\%$$ ($$\alpha=0.05$$).

Next, you need the degrees of freedom, which is the number of categories that can vary independently. For the Chi-square Test for Goodness of Fit, the number of degrees of freedom equals the number of categories minus $$1$$.

The next step is to find the critical Chi-square value or the $$p$$-value. Either of these values can be used to reach a conclusion. A piece of the chi-square table is presented below.

| Degrees of freedom | $$\alpha=0.9$$ | $$\alpha=0.1$$ | $$\alpha=0.05$$ | $$\alpha=0.025$$ |
|---|---|---|---|---|
| $$1$$ | $$0.02$$ | $$2.71$$ | $$3.84$$ | $$5.02$$ |
| $$2$$ | $$0.21$$ | $$4.61$$ | $$5.99$$ | $$7.38$$ |
| $$3$$ | $$0.58$$ | $$6.25$$ | $$7.82$$ | $$9.35$$ |
| $$4$$ | $$1.06$$ | $$7.78$$ | $$9.49$$ | $$11.14$$ |

For instance, suppose you are conducting a test with a significance level of $$5\%$$ and $$3$$ degrees of freedom. Using the table, the corresponding critical Chi-square value (where the column for $$\alpha=0.05$$ and the row for $$3$$ degrees of freedom meet) is $$7.82$$. If your test statistic is lower than this critical value, you fail to reject (informally, "accept") the null hypothesis; otherwise, you reject it.

Another approach uses the $$p$$-value. If your test statistic is $$6.5$$, it lies between $$6.25$$ and $$7.82$$ in the row for $$3$$ degrees of freedom, so your $$p$$-value is between $$0.05$$ and $$0.1$$. Since the $$p$$-value is greater than the significance level, the null hypothesis is not rejected.
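
If you prefer an exact $$p$$-value to a table lookup, the upper-tail probability of a chi-square distribution with a whole number of degrees of freedom can be computed with the Python standard library alone. This is a sketch rather than a library routine: `chi_square_p_value` is a name made up here, and it is intended for the small degrees of freedom typical of textbook tables. It relies on the recurrence $$Q(a+1, z) = Q(a, z) + z^a e^{-z}/\Gamma(a+1)$$ for the regularized upper incomplete gamma function, starting from $$Q(1/2, z) = \operatorname{erfc}(\sqrt{z})$$ or $$Q(1, z) = e^{-z}$$, with $$z$$ equal to half the test statistic.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square
    distribution with a positive whole number of degrees of freedom."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive whole number")
    if statistic <= 0:
        return 1.0
    z = statistic / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z) = e^(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z) = erfc(sqrt(z))
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Test statistic 6.5 with 3 degrees of freedom, as in the text above.
print(chi_square_p_value(6.5, 3))
```

The printed value falls between $$0.05$$ and $$0.1$$, in line with the table lookup.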

## Example using Chi-Square Test for Categorical Data

Let's perform the Chi-square test of Goodness-of-Fit to determine the distribution of the ice cream example.

Using the data obtained in your experiment, with a significance level of $$5\%$$ ($$\alpha=0.05$$), is there sufficient evidence to conclude that the chocolate flavor is more popular than the other flavors in Neapolitan ice cream?

Solution:

Putting the expected values and observed values in a table, you'll have

| Flavor | Chocolate | Vanilla | Strawberry |
|---|---|---|---|
| Expected | $$40$$ | $$40$$ | $$40$$ |
| Observed | $$50$$ | $$35$$ | $$35$$ |

Calculating the test statistic:

\begin{align} \chi^2 &= \sum_{i=1}^3\frac{(O_i-E_i)^2}{E_i} \\ &= \frac{(50-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(35-40)^2}{40} \\ &= \frac{(10)^2}{40}+\frac{(-5)^2}{40}+\frac{(-5)^2}{40} \\ &=\frac{150}{40} \\ &=3.75. \end{align}

The significance level is $$\alpha= 0.05$$ and the degrees of freedom are $$3-1=2$$. Using a Chi-square table, the corresponding critical value is $$5.99$$. Because the test statistic, $$3.75$$, is smaller than the critical value, the null hypothesis is not rejected.

Another way of looking at it is that the $$p$$-value lies between $$0.1$$ and $$0.9$$, so it is greater than $$0.05$$. Therefore, there is not enough evidence to reject $$H_0$$. The data are consistent with all flavors being equally popular, although this does not prove that they are.
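
The whole worked example can be reproduced in a short script. This is a sketch under one stated assumption: because the test has exactly $$2$$ degrees of freedom, the upper-tail probability has the closed form $$e^{-\chi^2/2}$$, so no table or statistics library is needed. The variable names are illustrative.

```python
import math

# Survey counts from the ice cream example.
observed = {"chocolate": 50, "vanilla": 35, "strawberry": 35}
alpha = 0.05

n = sum(observed.values())
expected = n / len(observed)  # 40 per flavor under H0

statistic = sum((count - expected) ** 2 / expected for count in observed.values())
df = len(observed) - 1

# Closed form that holds only for 2 degrees of freedom (assumption stated above).
p_value = math.exp(-statistic / 2)
reject_h0 = p_value < alpha

print(f"chi-square = {statistic:.2f}, df = {df}, "
      f"p-value = {p_value:.3f}, reject H0: {reject_h0}")
# prints: chi-square = 3.75, df = 2, p-value = 0.153, reject H0: False
```

The exact $$p$$-value of about $$0.153$$ agrees with the table-based bracket and leads to the same decision.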

## Inference for Distributions of Categorical Data - Key takeaways

• The Chi-square test statistic is calculated by $\chi^2=\sum_{i=1}^n\frac{(O_i-E_i)^2}{E_i},$ where $$O_i$$ is the $$i^{th}$$ observed value and $$E_i$$ is the $$i^{th}$$ expected value.
• Make sure you meet all $$4$$ conditions before using a chi-square test.
• There are $$3$$ types of Chi-square tests: for Goodness-of-Fit, for Homogeneity and for Independence.

You can verify whether the distribution of the categorical data follows a proposed model. You can also verify whether two or more categorical variables are related (dependent) or independent.

Depending on whether you want to compare data with a proposed model, compare distributions across populations, or find a relationship between two variables, you can use the Chi-Square test for Goodness of Fit, the Chi-Square test for Homogeneity or the Chi-Square test for Independence.

A frequency table lists the categories of interest and how many times each one was observed.

A pie chart is a circle divided into slices, where each slice represents a category. The area of each slice is proportional to the number of cases in that category.

With bar charts, you can compare the observed values with the expected values in a simpler way, plotting them side by side.

## Final Inference for Distributions of Categorical Data Quiz

Question

What are the parameters that dictate the shape of a chi-square distribution?

The only parameter is the Degrees of Freedom, $$k$$.

Question

What is the range of a $$\chi^{2}_{k}$$ distribution?

The range is $$0$$ to $$\infty$$.

Question

What is the standard deviation of a $$\chi^{2}_{k}$$ distribution?

$$\sqrt{2k}$$.

Question

A chi-square distribution with $$4$$ degrees of freedom has a $$95\%$$ critical value of $$9.49$$.

True.

Question

A chi-square distribution with $$18$$ degrees of freedom has a $$10\%$$ critical value of $$25.99$$.

False.

Question

What is the mode of a $$\chi^{2}_{k}$$ distribution?

$$k - 2$$ if $$k \geq 2$$.

Question

What is the variance of a $$\chi^{2}_{k}$$ distribution?

$$\sigma^{2} = 2k$$.

Question

A chi-square distribution is a _____  distribution that becomes increasingly ____ as its degrees of freedom, $$k$$, increases.

non-symmetric, symmetrical.

Question

You have a chi-square distribution with a standard deviation of $$4$$. How many degrees of freedom does the distribution have?

1. Start with the formula for standard deviation: $\sigma = \sqrt{2k},$ where $k = \text{degrees of freedom}.$
2. Rearrange the formula to solve for $$k$$: $\dfrac{\sigma^{2}}{2} = k.$
3. Plug in the value for standard deviation: $k = \dfrac{(4)^{2}}{2}.$
4. Solve for $$k$$: $k=8.$

Question

Let $$Z_{1}, \ldots, Z_{15}$$ be independent standard normal random variables. What distribution does $$\sum^{15}_{i = 1} Z^{2}_{i}$$ follow?

$$\chi^{2}_{15}$$.

Question

What is the mean of a $$\chi^{2}_{9}$$ distribution?

$$\mu = k = 9$$.

Question

What is the mean of a $$\chi^{2}_{k}$$ distribution?

$$\mu = k$$.

Question

What distribution does

$Q = \chi^{2}_{6} + \chi^{2}_{11}$

follow?

$$Q \sim \chi^{2}_{17}$$, provided the two chi-square variables are independent.

Question

You want to know if people of different hair colors prefer different cuisines. Which chi-square test should you perform?

Chi-Square Test for Homogeneity

Question

You only have one categorical variable. What kind of chi-square test can you perform?

Chi-Square Test for Goodness of Fit

Question

Your friend seems to roll a lot of sixes...what kind of chi-square test can tell you if they are cheating?

Chi-Square Test for Goodness of Fit

Question

You work for a car dealership. You want to know if race plays a role in what model car people buy. You can get the data from your company's records, but what kind of chi-square test should you perform to answer your question?

Chi-Square Test for Independence

Question

Your contingency table has 5 columns and 5 rows. You are performing a chi-square test for homogeneity. What are the degrees of freedom?

16

Question

When considering whether gender and eye color are independent, you measure gender as either Male, Female, or Non-Binary. You also keep counts of 6 eye colors. How many degrees of freedom will your chi-square test have?

10

Question

A roulette wheel has 38 possible outcomes. You've been watching a roulette wheel for a few hours, gathering data to determine if the wheel is fair. How many degrees of freedom will your chi-square test have?

37

Question

Can you use percentages for a chi-square test? If not, what kind of data can you use?

No. A chi-square test requires count data.

Question

What is the Chi-square test for goodness of fit?

The Chi-square test for goodness of fit is a statistical test that can be used to confirm or deny a hypothesis about the distribution of a categorical data set.

Question

What is a categorical data set?

Categorical data is data that is divided into discrete, unordered groups.

Question

Give an example of a categorical data set.

Anything that fits the definition

'Categorical data is data that is divided into discrete, unordered groups.'

Such as types of candy in a bag, the number of people with various eye colors, etc.

Question

How many conditions need to be met by a data set for a Chi-square test for goodness of fit to be used?

Four.

Question

What are the conditions that need to be met by a data set for a Chi-square test for goodness of fit to be used?

• The sampling method is simple random sampling.

• The variable under study is categorical.

• The expected value of observations for each category must be at least five.

• Each outcome in the variable under study must be independent.

Question

What makes a sampling method random?

The constituents of the sample must be chosen totally at random.

Question

Someone flips a coin a hundred times and records the result of each flip, heads or tails.

Is each outcome in this dataset independent?

Yes, the outcome of any given flip does not impact the probability of that or another outcome arising again.

Question

What is the formula for calculating the Chi-square test statistic?

$\chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$

Question

What is the significance level of a Chi-square test for goodness of fit?

The significance level sets how strong the evidence must be before you reject the null hypothesis. It is the probability of rejecting the null hypothesis when it is actually true.

Question

What is a null hypothesis?

A null hypothesis is a hypothesis that states that any statistical difference between populations is down to random chance.

Question

What is the alternative hypothesis in a Chi-square test for goodness of fit?

The alternative hypothesis is what will be true if your null hypothesis proves wrong.

Question

How can you find the number of degrees of freedom of a categorical variable?

$df =\text { number of groups} - 1$

Question

What values do you need to find the Chi-square value for a Chi-square test for goodness of fit?

Degrees of freedom, significance level, and a Chi-square table.

Question

What makes a group of a categorical variable independent?

The likelihood of a given observation belonging to that group is not affected by the number of observations belonging to any other group.

Question

Why is a chi-square test for goodness of fit always right-tailed?

The Chi-square goodness of fit test is right-tailed because the test statistic is a sum of squared differences divided by positive expected counts, so it is never negative, and larger values indicate a worse fit.

Question

The chi-square distribution is the basis for three chi-square hypothesis tests:

the chi-square test for goodness of fit, the chi-square test for homogeneity, and the chi-square test for independence.

Question

What is a Chi-square test for homogeneity used for?

A Chi-square test for homogeneity is a Chi-square test that is applied to a single categorical variable from two or more different populations to determine whether they have the same distribution.

Question

A Chi-square test for homogeneity has the same basic assumptions as any other Pearson Chi-square test:

The variables must be categorical.

Question

The approach to use a Chi-square test for homogeneity has six basic steps:

1. State the hypotheses.

2. Calculate the expected frequencies.

3. Calculate the Chi-square test statistic.

4. Find the critical Chi-square value.

5. Compare the Chi-square test statistic to the Chi-square critical value.

6. Decide whether to reject the null hypothesis.

Question

What is the null hypothesis of a Chi-square test for homogeneity?

The null hypothesis is that the two populations have the same distribution of the categorical variable.

\begin{align} H_{0}: p_{1,1} &= p_{2,1} \text{ AND } \\ p_{1,2} &= p_{2,2} \text{ AND } \ldots \text{ AND } \\ p_{1,n} &= p_{2,n} \end{align}.

Question

What is the alternative hypothesis of a Chi-square test for homogeneity?

The alternative hypothesis is that the two populations do not have the same distribution, i.e., at least one of the equalities in the null hypothesis is false.

\begin{align} H_{a}: p_{1,1} &\neq p_{2,1} \text{ OR } \\ p_{1,2} &\neq p_{2,2} \text{ OR } \ldots \text{ OR } \\ p_{1,n} &\neq p_{2,n} \end{align}.

Question

As with any statistical test, your analysis plan when doing a Chi-square test for homogeneity describes how you will use the sample data to decide whether to reject the null hypothesis. Your plan should specify the following:

Significance level.

Question

When doing a Chi-square test for homogeneity, you use the sample data from your contingency table to find:

The degrees of freedom.

Question

There are __ variables in a Chi-square test for homogeneity. Therefore, you need the totals to add up in __ dimensions.

2.

Question

The degrees of freedom of a Chi-square test for homogeneity is calculated by:

$k = (r - 1)(c - 1)$

where,

• $$k$$ is the degrees of freedom,

• $$r$$ is the number of populations, which is also the number of rows in the contingency table, and

• $$c$$ is the number of levels of the categorical variable, which is also the number of columns in the contingency table.

Question

You must calculate the expected frequencies of a Chi-square test for homogeneity individually for each population at each level of the categorical variable, as given by the formula:

$E_{r,c} = \frac{n_{r} \cdot n_{c}}{n}$

where,

• $$E_{r,c}$$ is the expected frequency for population $$r$$ at level $$c$$ of the categorical variable,

• $$r$$ indexes the populations, which correspond to the rows of the contingency table,

• $$c$$ indexes the levels of the categorical variable, which correspond to the columns of the contingency table,

• $$n_{r}$$ is the number of observations from population $$r$$,

• $$n_{c}$$ is the number of observations from level $$c$$ of the categorical variable, and

• $$n$$ is the total sample size.
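
As a rough illustration of this formula, the sketch below computes every expected frequency from a table of observed counts stored as a list of rows. The function name `expected_frequencies` and the counts in the example table are hypothetical, not data from this article.

```python
def expected_frequencies(table):
    """Expected counts E[r][c] = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[n_r * n_c / n for n_c in col_totals] for n_r in row_totals]


# Hypothetical 2 x 3 table: rows are populations, columns are levels.
table = [[20, 30, 50],
         [30, 30, 40]]
for row in expected_frequencies(table):
    print(row)
# prints [25.0, 30.0, 45.0] twice, because both populations have 100 observations
```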

Question

The test statistic of a Chi-square test for homogeneity is given by the formula:

$\chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}}$

where,

• $$O_{r,c}$$ is the observed frequency for population $$r$$ at level $$c$$, and

• $$E_{r,c}$$ is the expected frequency for population $$r$$ at level $$c$$.
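
A minimal sketch of this calculation is shown below. It assumes the observed counts are stored as a list of rows with no empty row or column, and it also returns the degrees of freedom $$(r - 1)(c - 1)$$. The function name `homogeneity_statistic` and the example counts are made up for illustration.

```python
def homogeneity_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Expected count for population r at level c.
            expected = row_totals[r] * col_totals[c] / n
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 3 table: rows are populations, columns are levels.
statistic, df = homogeneity_statistic([[20, 30, 50], [30, 30, 40]])
print(round(statistic, 2), df)
```

The resulting statistic is then compared with the critical value for `df` degrees of freedom, or converted to a $$p$$-value, exactly as in the goodness of fit test.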

Question

The $$p$$-value of a Chi-square test for homogeneity is the probability that the test statistic, with $$k$$ degrees of freedom, is __ than its calculated value.

More extreme.

Question

True or False? To interpret the results of a Chi-square test for homogeneity, you compare the $$p$$-value to the significance level and reject the null hypothesis if the $$p$$-value is less than the significance level.

True.
