In the previous section, we described methods for understanding the variation in a single categorical variable. In many situations, more than one categorical variable is recorded from each person or thing. In the used car ad described in the previous section, I recorded two categorical variables from each car: its year of manufacture and the name of its manufacturer. We refer to such data as bivariate data, which simply means that two variables are observed from each car.
When two categorical variables are recorded, we are often interested in understanding the relationship between the variables. In our car example, we may be interested in the relationship between year and manufacturer name. How many of the vintage cars in the ad are foreign? If I'm interested in purchasing a Chevrolet, are there many Chevrolets that are nearly new? Could I make a general statement like "most of the old cars are foreign and the new cars are typically American"?
Constructing a contingency table
We can learn about the relationship between two categorical variables by means of a contingency table. To illustrate the construction of a contingency table, let us consider a new example. In professional basketball, points are scored in different ways. A team scores points by making baskets during the flow of the game. Two points are scored by baskets made inside the 3-point line and three points are scored by baskets made outside the 3-point line. Teams also score points by making foul shots (or free throws) from the free throw line. Suppose that we are interested in learning about the relationship between a player's ability to make two-point shots and his ability to make free throws. If a player makes a high percentage of two-point shots, is it reasonable to believe that he will also be successful in shots from the free throw line? Likewise, if a player doesn't shoot well on two-point shots, will he also be relatively weak from the free throw line?
To investigate the relationship between two-point shooting and foul shots, we need some data. Consider the shooting statistics of players in the National Basketball Association for the 1994-95 season. To learn about a player's free throw shooting ability, he should have enough opportunities. (It is certainly hard to learn about someone's shooting ability if he has only taken 10 foul shots during the season.) So we limit our study to the 236 players who had at least 80 free throw attempts. All of these players had at least 80 two-point shot attempts. For each player, we classify his shooting ability as "poor", "average" or "good". We will say that a player is a "poor" two-point shooter if his shooting percentage is 44% or smaller, an "average" two-point shooter if his percentage is between 45% and 49%, and a "good" two-point shooter if his shooting percentage is 50% or higher. Likewise, we classify his free throw shooting ability into three groups. He is a "poor" free throw shooter if his percentage of successful shots is 69% or smaller, "average" if he makes between 70% and 79% of his free throws, and "good" if he makes at least 80% of his free throws.
For each basketball player, two categorical variables are recorded: his ability to shoot two-point shots (poor, average, or good) and his ability to shoot free throws (poor, average, or good). Because of space limitations, I can't list the data for the entire collection of 236 players, but part of the data (including some famous players) is listed below:
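The classification rule just described can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the step of rounding each percentage to a whole number before comparing it to the cutoffs is an assumption (the text states the cutoffs in whole percents and does not say how values such as 44.6% are handled).

```python
def classify(percentage, poor_max, good_min):
    """Return 'poor', 'average', or 'good' for a shooting percentage.

    Assumption: the percentage is rounded to a whole number first,
    because the cutoffs in the text are stated in whole percents.
    """
    pct = round(percentage)
    if pct <= poor_max:
        return "poor"
    if pct >= good_min:
        return "good"
    return "average"


def two_point_ability(percentage):
    # Cutoffs from the text: 44% or smaller is poor, 50% or higher is good.
    return classify(percentage, poor_max=44, good_min=50)


def free_throw_ability(percentage):
    # Cutoffs from the text: 69% or smaller is poor, 80% or higher is good.
    return classify(percentage, poor_max=69, good_min=80)


# Hypothetical player: 47% on two-point shots, 82% on free throws.
print(two_point_ability(47), free_throw_ability(82))
```

Each player then contributes one pair of labels, such as ("average", "good"), which is the form of data used in the rest of this section.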
| Name | Two-Point Shooting Ability | Free Throw Shooting Ability |
|---|---|---|
| Danny Ainge | average | good |
| Charles Barkley | average | average |
| Shawn Bradley | average | poor |
| Clyde Drexler | average | good |
| Patrick Ewing | good | average |
| Grant Hill | average | average |
| Michael Jordan | poor | good |
| Scott Kerr | good | average |
This data can be organized by means of a two-way contingency table. We construct a table with three rows and three columns, where a row corresponds to a category of two-point shooting ability and a column corresponds to a category of free throw shooting ability. We tally the data by placing a mark in the table for each observation, in the cell corresponding to its values of the two categorical variables. For example, note that Danny Ainge is an average two-point shooter and a good free throw shooter, so we place a mark in the cell in the "average" row and the "good" column. If we tally the observations listed in the table above, we obtain the following results.
| Two-Point Shooting Ability | Free throw: poor | Free throw: average | Free throw: good |
|---|---|---|---|
| poor | | | / |
| average | / | // | // |
| good | | // | |
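The hand tally can also be done by machine. The following is a minimal sketch using Python's standard `collections.Counter`, with the pairs of categories for the players listed above typed in by hand; the variable names are illustrative only.

```python
from collections import Counter

# (two-point ability, free throw ability) for the players listed above.
players = {
    "Danny Ainge": ("average", "good"),
    "Charles Barkley": ("average", "average"),
    "Shawn Bradley": ("average", "poor"),
    "Clyde Drexler": ("average", "good"),
    "Patrick Ewing": ("good", "average"),
    "Grant Hill": ("average", "average"),
    "Michael Jordan": ("poor", "good"),
    "Scott Kerr": ("good", "average"),
}

# Count how many players fall into each (row, column) combination.
tally = Counter(players.values())

levels = ["poor", "average", "good"]
for two_point in levels:
    # A Counter returns 0 for combinations that were never observed.
    row = [tally[(two_point, free_throw)] for free_throw in levels]
    print(f"{two_point:>8}", row)
```

Each printed row corresponds to one level of two-point shooting ability, with the counts ordered by free throw level (poor, average, good).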
We use the computer to tally the observations for all 236 basketball players. If these tallies are converted into counts, we obtain the following two-way contingency table.
| Two-Point Shooting Ability | Free throw: poor | Free throw: average | Free throw: good |
|---|---|---|---|
| poor | 16 | 38 | 18 |
| average | 25 | 40 | 41 |
| good | 30 | 22 | 6 |
The numbers in the table give the counts of players having each possible combination of two-point shooting ability and free throw shooting ability. For example, we see that 38 players are poor two-point shooters and average free throw shooters, and 6 players are good in both characteristics.
In the previous section, we talked about obtaining a count table for a single categorical variable to understand the proportions of individuals in the different categories. One can obtain the counts for each variable in a two-way table by computing marginal totals. For each row, we add the counts in all of the columns; we place the resulting sum in a column named "TOTAL". In a similar fashion, for each column (including the new total column), we add the counts in all of the rows and place the result in a new "TOTAL" row. We get the following table, which we'll call a two-way table with marginal totals added.
| Two-Point Shooting Ability | Free throw: poor | Free throw: average | Free throw: good | TOTAL |
|---|---|---|---|---|
| poor | 16 | 38 | 18 | 72 |
| average | 25 | 40 | 41 | 106 |
| good | 30 | 22 | 6 | 58 |
| TOTAL | 71 | 100 | 65 | 236 |
The marginal row and column totals are helpful in understanding the distribution of each categorical variable. What is the proportion of poor free throw shooters in the NBA? We see that, of the 236 players, 71 are poor free throw shooters (shoot under 70%), for a proportion of 71/236 = 30%. Is it common for an NBA player to have a two-point shooting percentage of 50% or higher? This refers to the "good" category; we note that 58 are good two-point shooters, for a proportion of 58/236 = 25%. Since 25% is a relatively small percentage, we would say that shooting at least 50% is not very common.
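The marginal totals and the two proportions quoted above can be checked with a short sketch. The nested dictionary below is just one convenient way to hold the counts (outer key: two-point ability, inner key: free throw ability); the names are illustrative.

```python
levels = ["poor", "average", "good"]

# counts[two_point_level][free_throw_level], copied from the two-way table.
counts = {
    "poor": {"poor": 16, "average": 38, "good": 18},
    "average": {"poor": 25, "average": 40, "good": 41},
    "good": {"poor": 30, "average": 22, "good": 6},
}

# Row totals: add across the columns of each row.
row_totals = {tp: sum(counts[tp][ft] for ft in levels) for tp in levels}

# Column totals: add down the rows of each column.
col_totals = {ft: sum(counts[tp][ft] for tp in levels) for ft in levels}

grand_total = sum(row_totals.values())

print("row totals:   ", row_totals)
print("column totals:", col_totals)
print("grand total:  ", grand_total)

# Marginal proportions discussed in the text, as whole percents.
pct_poor_free_throw = round(100 * col_totals["poor"] / grand_total)
pct_good_two_point = round(100 * row_totals["good"] / grand_total)
print(pct_poor_free_throw, pct_good_two_point)
```

The row totals describe the distribution of two-point shooting ability and the column totals describe the distribution of free throw shooting ability.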
Conditional row proportions
Although the two-way table can be used to learn about the marginal totals of each variable, it is most useful in learning about the relationship or the association between the two variables. In this example, we are interested in the relationship between a player's shooting ability in two-point situations and his shooting ability from the free throw line. Is it true that an excellent shooter from the field will also be excellent from the free throw line?
To understand the association in a two-way contingency table, we compute conditional proportions in the table. Suppose that we wish to compare the free throw shooting ability of the poor, average, and good two-point shooters. For the group of poor two-point shooters, represented by the first row of the table, we compute the proportion of poor, average, and good free throw shooters. There are 72 poor two-point shooters in the first row; in this group the proportion of poor free throw shooters is 16/72 = .22 or 22%. The proportion of average free throw shooters in this group is 38/72 = .53 or 53%, and the proportion of good shooters from the foul line is 18/72 = .25 or 25%. We call these numbers conditional proportions since they are computed conditional on the fact that only poor two-point shooters are considered.
Similarly, we can compute the conditional proportions for the second and third rows of the table. For the "average" field shooters, we divide the counts 25, 40, and 41 by the second-row total, 106, to get the proportions of poor, average, and good free throw shooters in this average group. Finally, we compute the proportions of the different categories of free throw shooting among the "good" field shooters in the third row. When we are finished, we obtain the following row percent form of the two-way table. The additional "TOTAL" column is displayed to emphasize the fact that we are computing row percentages, so the percentages for each row should sum to 100. (Note that the percentages in the second row don't quite add up to 100. This is due to rounding: each percentage is rounded to the nearest whole number.)
| Two-Point Shooting Ability | Free throw: poor | Free throw: average | Free throw: good | TOTAL |
|---|---|---|---|---|
| poor | 22 | 53 | 25 | 100 |
| average | 24 | 38 | 39 | 100 |
| good | 52 | 38 | 10 | 100 |
We learn about the relationship between the two types of shooting ability by inspecting the row percentages. Of the poor field shooters, represented by the first row in the table, roughly half are average free throw shooters and the rest are split about equally between the poor and good categories. The average field shooters tend to be average or good free throw shooters (81 of 106, or 76%, fall in these two categories). Looking at the third row, we see that over half (52%) of the good field shooters are poor free throw shooters; only 10% of this group are good at shooting free throws.
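As a check on the arithmetic, here is a small sketch that computes row percentages directly from the counts. The helper name `row_percentages` is made up for this illustration, and each percentage is rounded to the nearest whole number with Python's built-in `round`, so a row of rounded values need not sum to exactly 100.

```python
levels = ["poor", "average", "good"]

# counts[two_point_level][free_throw_level], copied from the two-way table.
counts = {
    "poor": {"poor": 16, "average": 38, "good": 18},
    "average": {"poor": 25, "average": 40, "good": 41},
    "good": {"poor": 30, "average": 22, "good": 6},
}


def row_percentages(counts, levels):
    """Percent of each free throw level within each two-point level."""
    result = {}
    for tp in levels:
        # Condition on the row: divide each count by that row's total.
        row_total = sum(counts[tp][ft] for ft in levels)
        result[tp] = {
            ft: round(100 * counts[tp][ft] / row_total) for ft in levels
        }
    return result


row_pct = row_percentages(counts, levels)
for tp in levels:
    print(f"{tp:>8}", row_pct[tp], "sum:", sum(row_pct[tp].values()))

# Combined share of average and good free throw shooters among the
# average two-point shooters, computed from counts rather than from
# the already-rounded percentages.
combined = round(100 * (40 + 41) / 106)
print(combined)
```

Computing a combined percentage from the original counts, rather than adding rounded percentages, avoids accumulating rounding error.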
Side-by-side bar graphs
One can see the differences between the row percentages by means of bar graphs. The figure below shows the differences in the foul shooting ability of the poor, average, and good two-point shooters by the use of three side-by-side bar charts. The left-most chart of three bars represents the foul shooting performance of the weakest shooters from the field. The vertical scale of this graph is the percentage, which is the proportion in each category multiplied by 100. The middle three bars represent the foul shooting ability of the average shooters from two-point land, and the right-most bars show the foul shooting ability of the best two-point shooters. Note that the distribution of the three categories of free throw shooters is very different for the three types of two-point shooters.
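A rough text version of such a display can be produced without any plotting library. The sketch below is not the original figure; it draws horizontal rather than vertical bars, one group per level of two-point shooting, with one `#` for every two percentage points (an arbitrary scale chosen for this illustration). The function name is hypothetical.

```python
levels = ["poor", "average", "good"]

# counts[two_point_level][free_throw_level], copied from the two-way table.
counts = {
    "poor": {"poor": 16, "average": 38, "good": 18},
    "average": {"poor": 25, "average": 40, "good": 41},
    "good": {"poor": 30, "average": 22, "good": 6},
}


def bar_chart_lines(counts, levels, scale=2):
    """Build text bars of row percentages, one group per two-point level."""
    lines = []
    for tp in levels:
        total = sum(counts[tp][ft] for ft in levels)
        lines.append(f"{tp} two-point shooters")
        for ft in levels:
            pct = round(100 * counts[tp][ft] / total)
            # One '#' per `scale` percentage points, rounded down.
            lines.append(f"  {ft:<8}{'#' * (pct // scale)} {pct}%")
    return lines


print("\n".join(bar_chart_lines(counts, levels)))
```

Reading the three groups against one another shows how the free throw distribution shifts from one level of two-point shooting to the next.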
The above discussion and graph focused on the row percentages in the table. One can also learn about the connection between the two variables by computing column percentages. We divide the players into poor, average, and good free throw shooters, which correspond to the three columns of the table. For each column, we compute the percentage of poor, average, and good two-point shooters. For example, there are 71 poor shooters from the foul line. Of these players, 16/71 = 23% are poor field shooters, 25/71 = 35% are average field shooters, and 30/71 = 42% are good field shooters. We repeat this procedure for the second and third columns of the table. The result of this computation is the table in the following column percent form. These, like the row percentages, are conditional proportions, but we are now conditioning on the columns of the table. The extra "TOTAL" row in the table is the sum of the percentages in each column. Since percentages are computed separately for each column, the sum of the percentages for each column (across rows) should be equal to 100.
| Two-Point Shooting Ability | Free throw: poor | Free throw: average | Free throw: good |
|---|---|---|---|
| poor | 23 | 38 | 28 |
| average | 35 | 40 | 63 |
| good | 42 | 22 | 9 |
| TOTAL | 100 | 100 | 100 |
From this column percentage table, we can compare the field shooting ability of the poor, average, and good free throw shooters. The poor free throw shooters are generally average or good in shooting from the field (55 of 71, or 77%, fall in the average and good categories). The average shooters from the foul line are generally poor or average from the field (total percent of 78%), and the good shooters from the charity stripe are primarily average shooters from two-point land.
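The column percentages can be computed in the same way as the row percentages, except that each count is divided by its column total. The following is a minimal sketch; the helper name `column_percentages` is illustrative, and values are rounded to the nearest whole percent.

```python
levels = ["poor", "average", "good"]

# counts[two_point_level][free_throw_level], copied from the two-way table.
counts = {
    "poor": {"poor": 16, "average": 38, "good": 18},
    "average": {"poor": 25, "average": 40, "good": 41},
    "good": {"poor": 30, "average": 22, "good": 6},
}


def column_percentages(counts, levels):
    """Percent of each two-point level within each free throw level."""
    result = {}
    for ft in levels:
        # Condition on the column: divide each count by that column's total.
        col_total = sum(counts[tp][ft] for tp in levels)
        result[ft] = {
            tp: round(100 * counts[tp][ft] / col_total) for tp in levels
        }
    return result


col_pct = column_percentages(counts, levels)
for ft in levels:
    print(f"{ft:>8} free throw shooters:", col_pct[ft])
```

Here the outer key is the free throw level, so `col_pct["poor"]` holds the first column of the column percent table.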
Summarizing the relationship between two variables
So what do we conclude? Are good field shooters (from two-point land) also good free throw shooters? Actually, the opposite appears to be true. From the table of row percentages, we see that the good field shooters are generally poor free throw shooters and the average field shooters are the best shooters from the free throw line. Why is this true? This association can be partly explained in terms of the different positions of the players. The players who play center or power forward in the NBA attempt most of their two-point shots close to the basket. It is easier to make close shots (most dunk shots are successful), and so most of these players are good two-point shooters. However, these tall players have a much harder time making shots when they are farther away from the basket. In particular, they typically are poor shooters from the free throw line. The players who are traditionally thought to be good shooters are the shooting guards in the NBA. These players have a good shooting touch, but their two-point shooting percentages are relatively low since they take most of their shots far from the basket. These same players will be good shooters from the foul line since they have a good shooting touch.