Determine whether two characteristics are independent
When we looked at categorical data in the previous chapter, it was related to a single variable, or characteristic of interest, such as favorite movie or car color. To illustrate the data, we made a frequency table and used it to create a pie chart or bar chart. But what if we want to illustrate the relationship between two categorical variables? To do this, we can use a contingency table.
Subsection 4.1.1 Contingency Tables
A contingency table summarizes all the possible combinations for two categorical variables. Each value in the table represents the number of times a particular combination of outcomes occurs. For example, suppose we randomly select 250 households from the greater Portland area and ask whether they have a cat and whether they have a dog. In this case, “have a cat” and “have a dog” are the two variables, and each variable has two categories: Yes and No. To create the contingency table, we make columns for the categories of one variable, and rows for the categories of the other variable. We also add a row and column for the subtotals of each category. Each cell of the resulting table contains the number of outcomes having the characteristics of the intersecting row and column categories. For our dog and cat example, the table would look like this:
| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | Yes Cat and Yes Dog | Yes Cat and No Dog | Yes Cat Total |
| No Cat | No Cat and Yes Dog | No Cat and No Dog | No Cat Total |
| Total | Yes Dog Total | No Dog Total | Grand total |
Suppose that of the 250 households surveyed, 180 said they have a cat, 95 said they have a dog, and 52 said they have both a cat and a dog. We can use this information to fill in the cells of the table.
| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | | 180 |
| No Cat | | | |
| Total | 95 | | 250 |
The first cell we can fill in is the grand total, which is the total number of subjects in the study. In this case, there are 250 households participating in the survey. The next two cells we can fill in are the total number of households that have a cat, 180, and the total number of households that have a dog, 95. The final cell we can fill in from the given information is the intersection of the “have a dog” column and the “have a cat” row, which is 52 households.
Since each row and each column must sum to its total, we can use subtraction to find the missing numbers as shown below.
| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | \(180-52=128\) | 180 |
| No Cat | \(95-52=43\) | \(155-128=27\) or \(70-43=27\) | \(250-180=70\) |
| Total | 95 | \(250-95=155\) | 250 |
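The subtraction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `complete_two_by_two` and its parameter names are hypothetical, and it assumes we are given the grand total, one row total, one column total, and the cell where that row and column intersect.

```python
def complete_two_by_two(grand_total, row_total, col_total, both):
    """Fill in a 2x2 contingency table (with totals) by subtraction.

    row_total: count with the row characteristic (here, has a cat)
    col_total: count with the column characteristic (here, has a dog)
    both:      count with both characteristics
    """
    row_only = row_total - both              # cat, no dog
    col_only = col_total - both              # dog, no cat
    no_row_total = grand_total - row_total   # "No Cat" row total
    no_col_total = grand_total - col_total   # "No Dog" column total
    neither = no_row_total - col_only        # no cat, no dog
    return [
        [both, row_only, row_total],
        [col_only, neither, no_row_total],
        [col_total, no_col_total, grand_total],
    ]


table = complete_two_by_two(grand_total=250, row_total=180, col_total=95, both=52)
for label, row in zip(["Cat", "No Cat", "Total"], table):
    print(label, row)
```

Each row of the returned list matches a row of the table: Dog, No Dog, Total.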
Now that we have our contingency table completed, notice that the numbers in the central four cells add to the grand total, as shown in the first table below. The entries in the total row (95 and 155) and the entries in the total column (180 and 70) also each add to the grand total, as shown in the second table.
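| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | 128 | 180 |
| No Cat | 43 | 27 | 70 |
| Total | 95 | 155 | 250 |

| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | 128 | 180 |
| No Cat | 43 | 27 | 70 |
| Total | 95 | 155 | 250 |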
Subsection 4.1.2 Contingency Tables and Venn Diagrams
If the subtractions we just did seem familiar, they should! This is very similar to what we did for reporting data with a Venn diagram. The Venn diagram for this data is shown below. We also subtracted the intersection from the total of the cat and dog owners to find numbers in the crescent regions.
Notice that the numbers in the four regions of the Venn diagram are the same as the four cells in the center of the contingency table and add to the grand total.
Subsection 4.1.3 “And” Statements
Now we can use the contingency table or the Venn diagram to determine the percentage of households that meet certain conditions. For instance, what percent of those surveyed own a cat and do not own a dog? In the Venn diagram, this is 128 households in the cat only region.
In the contingency table we see the 128 households at the intersection of the row of households who own a cat and the column of households who do not own a dog. As a percentage of the total number of households surveyed, this is \(\frac{128}{250}=0.512\), so 51.2% of the households surveyed have a cat and no dog.
| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | 128 | 180 |
| No Cat | 43 | 27 | 70 |
| Total | 95 | 155 | 250 |
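As a small illustration of an “and” statement, the sketch below stores the four central cells in a Python dictionary and divides one cell by the grand total. The dictionary layout and the helper name `percent_and` are assumptions made for this example only.

```python
# Central cells of the cat/dog table; keys are (cat status, dog status).
counts = {
    ("cat", "dog"): 52,
    ("cat", "no dog"): 128,
    ("no cat", "dog"): 43,
    ("no cat", "no dog"): 27,
}
grand_total = sum(counts.values())  # the four cells add to 250


def percent_and(cat_status, dog_status):
    """Percent of all surveyed households in one cell of the table."""
    return 100 * counts[(cat_status, dog_status)] / grand_total


cat_no_dog = percent_and("cat", "no dog")
print(f"{cat_no_dog:.1f}% of households have a cat and no dog")
```

The denominator is always the grand total for an “and” statement, because we are describing a share of every household surveyed.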
Subsection 4.1.4 “Or” Statements
How about the percentage of households surveyed that have a cat or a dog? We know from Venn diagrams that the inclusive or includes the number of households who own a cat only, a dog only, and both a cat and a dog, or \(128+52+43=223\) households. As a percentage of the total surveyed, we get \(\frac{223}{250}=0.892\) or 89.2% of households in the sample have a dog or a cat (or both).
We can get the same answer from the contingency table by adding the cells for households that have a cat and not a dog, a dog and not a cat, and both a cat and a dog. This also gives us 223 households.
There is another way to calculate an or statement from a contingency table. We could add the row and column totals for having a cat and having a dog, but then we have counted the 52 households in the intersection twice. We can subtract that number to get \(180+95-52=223\) households with a dog or a cat, which we know is 89.2% of those surveyed.
| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | 128 | 180 |
| No Cat | 43 | 27 | 70 |
| Total | 95 | 155 | 250 |
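The two ways of counting an inclusive “or” can be checked against each other with a short Python sketch. The variable names are illustrative; the counts come from the cat and dog table.

```python
# Counts taken from the completed cat/dog contingency table.
cat_and_dog, cat_only, dog_only = 52, 128, 43
cat_total, dog_total, grand_total = 180, 95, 250

# Method 1: add the three cells that satisfy "cat or dog".
by_cells = cat_only + dog_only + cat_and_dog

# Method 2: add the two totals, then subtract the double-counted overlap.
by_totals = cat_total + dog_total - cat_and_dog

print(by_cells, by_totals)       # both methods should agree
print(by_cells / grand_total)    # share of households with a cat or a dog
```

Both approaches count the same households; the second simply corrects for the overlap after the fact.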
Subsection 4.1.5 Conditional Statements
Another question we can answer using a contingency table is: what percentage of dog-owning households also own a cat? In this case the group that we are interested in isn’t every household surveyed (the grand total), but just those households that own a dog.
| | Dog | No Dog | Total |
|---|---|---|---|
| Cat | 52 | 128 | 180 |
| No Cat | 43 | 27 | 70 |
| Total | 95 | 155 | 250 |
We call this a conditional statement because we are only considering the households with a certain condition. If we focus on the column representing the households that own a dog, we see that there is a total of 95 households with a dog, and that 52 of those 95 households also have a cat. Therefore, \(\frac{52}{95} \approx 0.547\) or approximately 54.7% of the households with a dog also have a cat. Another way to phrase this conditional statement is, “What percent of households have a cat, given that they have a dog?” You will see the word given quite a bit in this chapter; it signals that the denominator changes from the grand total to the total for the given group. It is also possible to find this conditional percentage using the Venn diagram by taking the number in the intersection and dividing it by the total in the whole dog circle.
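To see how the word given changes the denominator, here is a minimal Python sketch. The helper name `conditional_percent` is hypothetical. The second calculation (dog given cat) is not worked in the text; it is included only to show that swapping the condition swaps the denominator.

```python
def conditional_percent(both, given_total):
    """Percent of the 'given' group that also has the other characteristic."""
    return 100 * both / given_total


# Cat given dog: restrict to the 95 households in the Dog column.
cat_given_dog = conditional_percent(both=52, given_total=95)

# Dog given cat: restrict to the 180 households in the Cat row instead.
dog_given_cat = conditional_percent(both=52, given_total=180)

print(f"cat given dog: {cat_given_dog:.1f}%")
print(f"dog given cat: {dog_given_cat:.1f}%")
```

The numerator is the same 52 households in both cases; only the group we condition on is different.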
Subsection 4.1.6 Contingency Tables with More Than Two Categories
When there are only two categories for each variable, like yes/no questions, Venn diagrams and contingency tables provide basically the same information and can be used interchangeably. A Venn diagram works well for yes/no variables since a subject is either inside the circle (has the characteristic) or outside the circle (does not have the characteristic). If either variable has more than two possible categories, we cannot use a Venn diagram, but we can still use a contingency table. Here is an example where one variable has four categories and the other has three categories.
Example 4.1.2.
910 randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should (i) be allowed to keep their jobs and apply for US citizenship, (ii) be allowed to keep their jobs as temporary guest workers but not be allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country. Not sure was also an option (iv). The results of the survey by political ideology are shown below. Use the contingency table to answer the questions.
| | Conservative | Moderate | Liberal | Total |
|---|---|---|---|---|
| (i) Apply for citizenship | 57 | 120 | 101 | 278 |
| (ii) Guest worker | 121 | 113 | 28 | 262 |
| (iii) Leave the country | 179 | 126 | 45 | 350 |
| (iv) Not sure | 15 | 4 | 1 | 20 |
| Total | 372 | 363 | 175 | 910 |
a. What percent of the sampled Tampa, FL voters identified themselves as conservatives?
b. What percent of the sampled voters are in favor of the citizenship option?
c. What percent of the sampled voters identify themselves as conservatives and are in favor of the citizenship option?
d. What percent of the sampled voters identify themselves as liberal or are in favor of the leaving the country option?
e. What percent of the sampled voters who identify as conservatives are also in favor of the citizenship option? What percent of moderate and liberal voters share this view?
Solution.
To answer this question, we find the conservative column and look to the bottom cell for the total number of conservative voters and divide that by the total number of voters surveyed. This gives us \(\frac{372}{910}\approx 0.409\), so approximately 41% of the sampled Tampa, FL voters identify as conservative.
For this question we find the apply for citizenship row, look across to find the total, and divide this by the total number of voters surveyed. We get \(\frac{278}{910} \approx 0.305\) or approximately 31% of these voters are in favor of the citizenship option.
For this question we are looking for the cell that is the intersection of those who identify as conservative and those who are in favor of the citizenship option. This cell has 57 voters, so we divide that by the total number of voters. This gives us \(\frac{57}{910} \approx 0.063\) or approximately 6.3% of these voters identify as conservatives and are in favor of the citizenship option.
The or in this question is inclusive, so we need to determine the number of voters who identify as liberal, who are in favor of the leaving the country option, or both.
| | Conservative | Moderate | Liberal | Total |
|---|---|---|---|---|
| (i) Apply for citizenship | 57 | 120 | 101 | 278 |
| (ii) Guest worker | 121 | 113 | 28 | 262 |
| (iii) Leave the country | 179 | 126 | 45 | 350 |
| (iv) Not sure | 15 | 4 | 1 | 20 |
| Total | 372 | 363 | 175 | 910 |
In terms of the individual cells, the number of voters who have the specified characteristics is the sum \(179+126+101+28+45+1=480\text{,}\) which we can divide by the total number of voters surveyed to get the percent. So, we have \(\frac{480}{910} \approx 0.527\) or approximately 53% of the voters identify as liberal or are in favor of the leave the country option.
Another way to calculate this is to add the total number who identify as liberal (175 voters) and the total number who are in favor of the leave the country option (350 voters), then subtract the double counted cell (45 voters) who are liberal and in favor of the leave the country option: \(175+350-45=480\).
As we saw before, these are conditional statements. For the first part of this question, we want to focus just on those voters who identify as conservatives, and from among that group determine the percent in favor of the citizenship option. We calculate that \(\frac{57}{372} \approx 0.153\) or approximately 15% of conservative voters are in favor of the citizenship option.
For the second part, we want to focus on just those voters who identify as moderate, and from among that group determine the percent in favor of the citizenship option. Then we have \(\frac{120}{363} \approx 0.33\) or approximately 33% of moderate voters are in favor of the citizenship option.
Finally, we want to focus on just those voters who identify as liberal, and from among that group determine the percent in favor of the citizenship option. We calculate \(\frac{101}{175} \approx 0.58\) or approximately 58% of liberal voters are in favor of the citizenship option. Looking at these three percentages, it is clear that support of the citizenship option depends on political ideology. If support of the citizenship option were the same across political ideologies, then we would say that favoring the citizenship option and political ideology were independent of each other.
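The comparison of the three conditional percentages can be sketched in Python as follows. The nested dictionary layout and the short category labels are assumptions made for this illustration; the counts are those in the survey table.

```python
# Survey counts by ideology (columns of the table) and response (rows).
survey = {
    "Conservative": {"citizenship": 57, "guest worker": 121, "leave": 179, "not sure": 15},
    "Moderate": {"citizenship": 120, "guest worker": 113, "leave": 126, "not sure": 4},
    "Liberal": {"citizenship": 101, "guest worker": 28, "leave": 45, "not sure": 1},
}

# Conditional proportion: citizenship support given each ideology.
support = {}
for ideology, answers in survey.items():
    column_total = sum(answers.values())
    support[ideology] = answers["citizenship"] / column_total

# Overall proportion in favor of citizenship, ignoring ideology.
grand_total = sum(sum(answers.values()) for answers in survey.values())
overall = sum(answers["citizenship"] for answers in survey.values()) / grand_total

for ideology, share in support.items():
    print(f"{ideology}: {share:.3f}")
print(f"Overall: {overall:.3f}")
```

If support were independent of ideology, each conditional proportion would be close to the overall proportion. In this sample they differ noticeably.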
Subsection 4.1.7 Empirical Probability
If our sample is representative of the population, then we can also interpret a percentage we calculate from a contingency table as a probability, or the likelihood that something will happen. Since a contingency table is constructed from data collected through sampling or an experiment, we call it an empirical or experimental probability. This is different from a theoretical probability which we will look at in the next section.
Subsection 4.1.8 Finding Empirical Probabilities with a Contingency Table
Suppose that 60% of students in our class have a summer birthday (June, July, or August). Now suppose everyone’s name and birth month are written on slips of paper and thrown into a bag. If we pull a slip of paper out of the bag at random, what is the probability that the selected student has a summer birthday? If you think there should be a 60% chance, you are right! The relative frequency of the characteristic of interest is equal to its empirical probability. Written as a probability statement, this is \(P(\text{summer birthday}) = 0.60\).
Probability is a function named P, and the function is applied to what follows in the parentheses. Let’s look at another example where we write probability statements and find empirical probabilities.
Example 4.1.3.
A survey of licensed drivers asked whether they had received a speeding ticket in the last year and whether their car is red. The results of the survey are shown in the contingency table below.
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
Find the probability that a randomly selected survey participant:
a. has a red car.
b. has had a speeding ticket in the last year.
c. has a red car and has not had a speeding ticket in the last year.
d. has a red car or has had a speeding ticket in the last year.
e. has had a speeding ticket in the last year given they have a red car.
f. who has received a speeding ticket in the last year also has a red car.
g. What do the answers to b and e suggest about the relationship between owning a red car and getting a speeding ticket?
Solution.
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
To find \(P(\text{red car})\text{,}\) we divide the number of participants who own a red car by the total number of people surveyed:\(P(\text{red car})=\frac{150}{665} \approx 0.226\) or 22.6%.
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
To find \(P(\text{speeding ticket})\text{,}\) we divide the number of participants who got a speeding ticket in the last year by the total number of people surveyed: \(P(\text{speeding ticket})=\frac{60}{665} \approx 0.09\) or 9%.
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
To find \(P(\text{red and no ticket})\text{,}\) we find the intersection of the red car category and the no ticket category and divide by the total number of participants: \(P(\text{red and no ticket}) =\frac{135}{665} \approx 0.203\) or 20.3%.
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
To find \(P(\text{red or ticket})\text{,}\) we need to add those who drive a red car and did not have a speeding ticket (just red), those who had a speeding ticket and do not drive a red car (just ticket) and those who drive a red car and had a speeding ticket (both), and divide by the total number of participants:
\begin{gather*}
P(\text{red or ticket})=\frac{135+45+15}{665}=\frac{195}{665} \approx 0.293\text{ or } 29.3\%
\end{gather*}
Recall from our earlier discussion that we could also calculate the or probability as:
\begin{align*}
P(\text{red or ticket})\amp= P(\text{red})+ P(\text{speeding ticket}) - P(\text{red and speeding ticket})\\
\amp=\frac{150}{665}+\frac{60}{665}-\frac{15}{665}\\
\amp=\frac{195}{665}
\end{align*}
which gives us the same answer as counting the individual cells.
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
The probability \(P(\text{speeding ticket given red car})\) is a conditional probability, as we have seen before, since it is conditional on the given characteristic occurring. In this problem, the given characteristic is owning a red car, so we restrict our attention to just the row of 150 red car owners and see how many have had a speeding ticket in the last year. Looking at the table, we see that there were 15 red car owners who had a speeding ticket in the last year, so we calculate:
\begin{gather*}
P(\text{speeding ticket given red car})=\frac{15}{150} = 0.10\text{ or } 10\%
\end{gather*}
| | Speeding Ticket | No Speeding Ticket | Total |
|---|---|---|---|
| Red Car | 15 | 135 | 150 |
| Not Red Car | 45 | 470 | 515 |
| Total | 60 | 605 | 665 |
This question is also asking for a conditional probability, \(P(\text{red car given speeding ticket})\text{,}\) but it is phrased in more everyday language. In this case the given characteristic is that the person has received a speeding ticket, so we restrict our attention to just the speeding ticket column. Among the 60 people who had a speeding ticket in the last year, we see that 15 also drove a red car. Now we can calculate the probability:
\begin{gather*}
P(\text{red car given speeding ticket})=\frac{15}{60} = 0.25\text{ or }25\%
\end{gather*}
Notice that compared with part e, when we change the conditional characteristic, we change the denominator of the fraction.
In part b, we determined that there was a 9% chance of randomly selecting a participant who had received a speeding ticket in the last year. In part e, we found that there was a 10% chance of having received a ticket in the last year if the person had a red car. This suggests a slightly higher likelihood of getting a speeding ticket if you own a red car. In this sample, then, getting a speeding ticket appears to depend on whether the person drives a red car, since that condition changes the probability of getting a ticket, although the difference is small. We cannot say, however, whether driving a red car makes you speed or whether people who tend to drive faster buy red cars.
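As a rough check of the probabilities in this example, the Python sketch below recomputes the overall ticket probability and the two conditional probabilities from the four central cells of the table. The variable names are illustrative only.

```python
# Central cells of the red car / speeding ticket table.
red_ticket, red_no_ticket = 15, 135
other_ticket, other_no_ticket = 45, 470
total = red_ticket + red_no_ticket + other_ticket + other_no_ticket  # 665

# Part b: P(speeding ticket), using the whole sample as the denominator.
p_ticket = (red_ticket + other_ticket) / total

# Part e: P(speeding ticket given red car), using only the red car row.
p_ticket_given_red = red_ticket / (red_ticket + red_no_ticket)

# Part f: P(red car given speeding ticket), using only the ticket column.
p_red_given_ticket = red_ticket / (red_ticket + other_ticket)

print(f"P(ticket)             = {p_ticket:.3f}")
print(f"P(ticket given red)   = {p_ticket_given_red:.3f}")
print(f"P(red given ticket)   = {p_red_given_ticket:.3f}")
```

Comparing the first two values is the independence check: if they were equal, knowing that a driver has a red car would tell us nothing about the chance of a ticket.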
Subsection 4.1.9 Conditional Probabilities
We have mentioned conditional probabilities, which we find by restricting our attention to the given row or column. Here is another example of finding conditional probabilities.
Example 4.1.4.
A home pregnancy test was given to a sample of 93 cisgender women, and their pregnancy status was then verified by a blood test. The contingency table below shows the home pregnancy test result and whether or not each woman was actually pregnant as determined by the blood test. Find the probability that a randomly selected woman in the sample
a. was not pregnant given the home test was positive.
b. had a positive home pregnancy test given they were not pregnant.
| | Positive Test | Negative Test | Total |
|---|---|---|---|
| Pregnant | 70 | 4 | 74 |
| Not Pregnant | 5 | 14 | 19 |
| Total | 75 | 18 | 93 |
Solution.
Here are the solutions:
Since we are given the home test result was positive, we are limited to the 75 women in the positive test column, of which 5 were not pregnant. This gives:
| | Positive Test | Negative Test | Total |
|---|---|---|---|
| Pregnant | 70 | 4 | 74 |
| Not Pregnant | 5 | 14 | 19 |
| Total | 75 | 18 | 93 |
\begin{gather*}
P(\text{not pregnant given positive test})=\frac{5}{75} \approx 0.067\text{ or } 6.7\%
\end{gather*}
Since we are given the woman is not pregnant, we are limited to the 19 women in the not pregnant row, of which 5 had a positive test. This gives:
| | Positive Test | Negative Test | Total |
|---|---|---|---|
| Pregnant | 70 | 4 | 74 |
| Not Pregnant | 5 | 14 | 19 |
| Total | 75 | 18 | 93 |
\begin{gather*}
P(\text{positive test given not pregnant})=\frac{5}{19} \approx 0.263\text{ or } 26.3\%
\end{gather*}
This result is referred to as a false positive: A positive test result when the woman is not actually pregnant.
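The two conditional probabilities in this example use the same numerator but different denominators, which the following Python sketch makes explicit. The variable names (`true_pos`, `false_pos`, and so on) are labels chosen for this illustration and do not appear in the table itself.

```python
# Cells of the pregnancy test table: rows are actual status, columns are test result.
true_pos, false_neg = 70, 4    # pregnant: positive test, negative test
false_pos, true_neg = 5, 14    # not pregnant: positive test, negative test

# Given a positive test: restrict to the 75 women in the Positive Test column.
p_not_pregnant_given_positive = false_pos / (true_pos + false_pos)

# Given not pregnant: restrict to the 19 women in the Not Pregnant row.
p_positive_given_not_pregnant = false_pos / (false_pos + true_neg)

print(f"P(not pregnant given positive test) = {p_not_pregnant_given_positive:.3f}")
print(f"P(positive test given not pregnant) = {p_positive_given_not_pregnant:.3f}")
```

The same 5 women appear in both numerators, yet the answers differ because the condition determines which total we divide by.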
In this section we have learned about empirical probability. In the next section we will discuss another kind of probability that you may be familiar with – theoretical probability.
Exercises 4.1.10
1.
A recent survey asked a random sample of PCC students if they are currently experiencing food insecurity and if they are currently experiencing housing insecurity. Fill in the missing entries of the contingency table below.
Food Insecure
Not Food Insecure
Total
Housing Insecure
60
Not Housing Insecure
460
760
Total
680
2.
A recent survey asked a random sample of PCC students if they have purchased food from the cafeteria in the last week, and if they purchased their textbooks through the bookstore. Fill in the missing entries of the contingency table below.
Bookstore
No Bookstore
Total
Cafeteria
375
No Cafeteria
135
Total
630
850
3.
A recent survey asked PCC students if they regularly eat breakfast and if they regularly floss their teeth. Use the completed Venn Diagram to fill in the corresponding contingency table.
| | Breakfast | No Breakfast | Total |
|---|---|---|---|
| Floss | | | |
| No Floss | | | |
| Total | | | |
4.
A recent survey asked PCC students if they used an Apple phone, and if they regularly used a Chromebook outside of school. Use the completed Venn Diagram to fill in the corresponding contingency table.
| | Chromebook | No Chromebook | Total |
|---|---|---|---|
| Apple | | | |
| No Apple | | | |
| Total | | | |
5.
Use the following information to complete the contingency table:
\(\displaystyle \text{P(A and B)} = 10/75\)
\(\displaystyle \text{P(A)} = 40/75 \)
\(\displaystyle \text{P(not B)} = 45/75\)
| | A | Not A | Total |
|---|---|---|---|
| B | | | |
| Not B | | | |
| Total | | | |
6.
Use the following information to complete the contingency table:
\(\displaystyle \text{P(A given B)} = 30/80\)
\(\displaystyle \text{P(Not A and Not B)} = 10/120\)
| | A | Not A | Total |
|---|---|---|---|
| B | | | |
| Not B | | | |
| Total | | | |
7.
A professor gave a test to students in a morning class and the same test to the afternoon class. The grades are summarized below.
| | A | B | C | Total |
|---|---|---|---|---|
| Morning Class | 8 | 18 | 13 | 39 |
| Afternoon Class | 10 | 4 | 12 | 26 |
| Total | 18 | 22 | 25 | 65 |
If one student was chosen at random:
What is the probability they were in the morning class?
What is the probability they earned a C?
What is the probability that they earned an A and they were in the afternoon class?
What is the probability that they earned an A given they were in the morning class?
What is the probability that they were in the morning class or they earned a B?
8.
A professor surveyed students in her morning and afternoon Math 105 class, and asked what their class standing was. The class standings are summarized below:
| | Freshman | Sophomore | Junior | Senior | Total |
|---|---|---|---|---|---|
| Morning Class | 12 | 5 | 7 | 8 | 32 |
| Afternoon Class | 5 | 13 | 8 | 2 | 28 |
| Total | 17 | 18 | 15 | 10 | 60 |
If one student was chosen at random:
What is the probability they were in the morning class?
What is the probability they were a Freshman?
What is the probability that they were a Senior and they were in the afternoon class?
What is the probability that they were a Sophomore given they were in the morning class?
What is the probability that they were in the morning class or they were a Junior?
9.
The contingency table below shows the number of credit cards owned by a group of individuals between the ages of 18 and 35 and a group over the age of 35.
| | Zero | One | Two or more | Total |
|---|---|---|---|---|
| Between the ages of 18-35 | 9 | 5 | 19 | 33 |
| Over age 35 | 18 | 10 | 20 | 48 |
| Total | 27 | 15 | 39 | 81 |
If one person was chosen at random:
What is the probability they had no credit cards?
What is the probability they had one credit card?
What is the probability they had no credit cards and are over 35?
What is the probability they are between the ages of 18 and 35, or have zero credit cards?
What is the probability they had no credit cards given that they are between the ages of 18 and 35?
What is the probability they have no credit cards given that they are over age 35?
Does it appear that having no credit cards depends on age? Or are they independent? Use probability to support your claim.
10.
The following contingency table provides data from a sample of 6,224 individuals who were exposed to smallpox in Boston. (Data taken from Mostly Harmless Probability & Statistics by Rachel Webb.)
| | Inoculated | Not Inoculated | Total |
|---|---|---|---|
| Lived | 238 | 5136 | 5374 |
| Died | 6 | 844 | 850 |
| Total | 244 | 5980 | 6224 |
What is the probability that a person was inoculated?
What is the probability that a person lived?
What is the probability that a person died or was inoculated?
What is the probability that a person died given they were inoculated?
What is the probability that a person died given they were not inoculated?
Does it appear that survival depended on whether a person was inoculated? Or are they independent? Use probability to support your claim.
11.
The contingency table below shows the survival data for the passengers of the Titanic.
| | First | Second | Third | Crew | Total |
|---|---|---|---|---|---|
| Survive | 203 | 118 | 178 | 212 | 711 |
| Not Survive | 122 | 167 | 528 | 673 | 1490 |
| Total | 325 | 285 | 706 | 885 | 2201 |
What is the probability that a passenger did not survive?
What is the probability that a passenger was crew?
What is the probability that a passenger was first class and did not survive?
What is the probability that a passenger did not survive or was crew?
What is the probability that a passenger survived given they were first class?
What is the probability that a passenger survived given they were second class?
What is the probability that a passenger survived given they were third class?
Does it appear that survival depended on the passenger’s class? Or are they independent? Use probability to support your claim.
12.
The following table shows the utility patents granted for a specific year.
| | Corporation | Government | Individual | Total |
|---|---|---|---|---|
| United States | 45% | 2% | 8% | 55% |
| Foreign | 41% | 1% | 3% | 45% |
| Total | 86% | 3% | 11% | 100% |
What is the probability that a patent is foreign and from the government?
What is the probability that a patent is from the U.S. and from a corporation?
What is the probability that a patent is foreign or from the government?
What is the probability that a patent is from the U.S. given it is from an individual?
What is the probability that a patent is foreign given it is from the government?
13.
There is a 15% chance that a shopper entering a computer store will purchase a computer, a 25% chance they will purchase a game/software, and there is a 10% chance they will purchase both a computer and a game/software.
Create a contingency table for the information.
| | Game/Software | No Game/Software | Total |
|---|---|---|---|
| Computer | | | |
| No Computer | | | |
| Total | | | |
What is the probability that a shopper will not purchase a computer and will not purchase a game/software?
What is the probability that a shopper will purchase a computer or purchase a game/software?
What is the probability that a shopper will purchase a game/software given they have purchased a computer?
What is the probability that a shopper will purchase a game/software given they did not purchase a computer?
Does it appear that purchasing a game/software depends on whether the shopper purchased a computer? Or are they independent? Use probability to support your claim.
14.
A fitness center coach kept track over the last year of whether members stretched before they exercised, and whether or not they sustained an injury. Among the 400 members, 322 stretched before they exercised, 327 did not sustain an injury, and 270 both stretched and did not sustain an injury.
Create a contingency table for the information.
| | Injury | No Injury | Total |
|---|---|---|---|
| Stretched | | | |
| Not Stretched | | | |
| Total | | | |
What is the probability that a member sustained an injury?
What is the probability that a member sustained an injury and did not stretch?
What is the probability that a member stretched or did not sustain an injury?
What is the probability that a member sustained an injury given they stretched?
What is the probability that a member sustained an injury given they did not stretch?
Does it appear that sustaining an injury depends on whether the member stretches before exercising? Or are they independent? Use probability to support your claim.
15.
Among the 95 books on a bookshelf, 72 are fiction, 28 are hardcover, and 87 are fiction or hardcover.
Create a contingency table for the information.
| | Hardcover | Paperback | Total |
|---|---|---|---|
| Fiction | | | |
| Nonfiction | | | |
| Total | | | |
What is the probability that a book is non-fiction and paperback?
What is the probability that a book is fiction given it is hardcover?
16.
After finishing the course, among the 32 students in a Math 105 class, 25 could successfully construct a contingency table, 27 passed the class, and 29 could successfully construct a contingency table or passed the class.
Create a contingency table for the information.
| | Contingency Table | No Contingency Table | Total |
|---|---|---|---|
| Pass | | | |
| No Pass | | | |
| Total | | | |
What is the probability that a student passed and could not successfully construct a contingency table?
What is the probability that a student passed given they could not successfully construct a contingency table?