Fairness for yellow gummy bears: Hypothesis testing.

29 Fairness for yellow gummy bears: Hypothesis testing.

Slide 0

Hello and welcome to this new episode. Today we will talk about hypotheses and how you can test them. And this works easily and conclusively with gummy bears – no matter the color.

Let’s begin.

Slide 1

Descriptive and inferential statistics: What are they? We’ll look at an example, specifically, the PISA study. PISA – the Program for International Student Assessment – evaluates 15-year-old students’ scholastic proficiency at three-year intervals in many countries of the world. It is governed by the OECD, the Organisation for Economic Co-operation and Development, which presently has 38 member countries.

In 2018, a total of 5,451 randomly selected adolescents in Germany participated in the study. They scored 500 performance points on average in mathematics (SD = 95); the average of the OECD was a little lower, namely only 489 points (SD = 91). So, is the math proficiency of adolescents in Germany better than the average proficiency of their contemporaries in OECD countries? And is this seemingly rather small difference of 11 points actually meaningful in some way?

Statistics deals with questions like these and sensible processes have been developed to answer these questions.

Slide 2

An important goal is to check the reliability with which a particular claim – the hypothesis – can be derived from the observed data. Take these claims, for instance:

  • “The math proficiency of adolescents in Germany is better than the average proficiency of their contemporaries in OECD countries.”
  • “In comparison with the OECD countries, there are few high-performing students and many low-performing students in Germany.”

These claims can be examined based on the results of the 2018 PISA study.

It should be emphasized up front that we rarely come to a clear determination of “it’s true” or “it’s not true.” Rather, the point is to determine whether we can make a statement with sufficient probability. Most things in everyday life require decisions that are associated with some uncertainty. That’s exactly what we’re addressing today.

Slide 3

Incidentally, in the case of the PISA study, you can look up the answers to the questions in various publications. We’ll come back to this again at the very end of this episode.

But of course we want to understand how the experts come up with their assessments. How comprehensible and how reliable are they? In PISA, data is gathered on adolescents’ proficiency and ultimately this data is used to evaluate this random experiment.

Here’s what it boils down to:

A random experiment is conducted multiple times. The experimenter observes the results and would like to infer the probability distribution underlying the random experiment based on these results.

How is this possible? And – very important to remember – what errors are being risked?

Slide 4

The comprehensive data of an international performance assessment are not well suited to understanding the general approach.

Therefore, let’s start with a simple, fictitious Bernoulli trial.

The main person is Jala. She loves yellow gummy bears, but she thinks that the bags contain far fewer yellow ones than all other colors. She empties a bag with 30 gummy bears and finds four yellow gummy bears.

“Too few,” she thinks. “There are six colors, and four are fewer than one-sixth of 30.” Jala views her suspicion as confirmed.

“Such nonsense,” says her friend Aya. “You have to check at least five bags. Until you know that they contain fewer than 20 yellow gummy bears combined, you can’t be reasonably sure that the mix is generally unfair.”

Do the arguments sound convincing? We’ll look more closely at this.

Slide 5

You recall that this situation is also about the binomial distribution. Here is the familiar formula once again.

For a Bernoulli sequence of length n and the probability of success p, the probability of exactly k successes (given 0 ≤ k ≤ n) is:

\[ P(X = k) = \binom{n}{k} \, p^{k} \, (1-p)^{n-k} \]

First, let’s consider the situation from Jala’s point of view. In that case, we’re interested in the event that at most four gummy bears are yellow in a bag with 30 pieces, thus in P(X < 5), which is the same as P(X ≤ 4).

Here you see the elements for k = 0, 1, 2, 3, and 4 and the probability p = 1/6 ≈ 0.16 and thus 1-p = 5/6 ≈ 0.84. As a reminder: p = 1/6 because the gummy bears come in six colors.

In the last column, you can read the various results, which add up to approximately 0.463.

So, the probability is about 46 percent that there are four or fewer yellow gummy bears in a bag with 30 pieces.

That is a relatively high probability of nearly 1/2. Apparently, Jala should contemplate whether her statement is tenable. The probability of fewer than five gummy bears of her favorite color in a bag and thus the probability of an error on her part is far too high.
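If you would rather not add up the table by hand, here is a minimal Python sketch of the same sum. It assumes the rounded value p = 0.16 from the slide, and the function names are made up for this illustration. With the unrounded p = 1/6, the result comes out a little lower, at about 0.424, which does not change the conclusion.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of at most k successes, P(X <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n = 30             # gummy bears in one bag
p_rounded = 0.16   # rounded value used on the slide (assumption)
p_exact = 1 / 6    # unrounded value, for comparison

# The individual table entries for k = 0, 1, 2, 3, 4
for k in range(5):
    print(k, round(binom_pmf(k, n, p_rounded), 4))

jala_rounded = binom_cdf(4, n, p_rounded)
jala_exact = binom_cdf(4, n, p_exact)
print(round(jala_rounded, 3))  # about 0.463, as on the slide
print(round(jala_exact, 3))    # about 0.424 with p = 1/6
```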

Slide 6

In my opinion, it makes sense to calculate “by hand” now and then to better understand how things are connected.

Of course, we also could have had the numbers calculated for us, for instance, using a statistics program or suitable online applications. You arrive at the same result, which you see at the lower right: 4.63(-1), that is 4.63 × 10^(-1), is nothing other than 0.463. For now we will disregard the long string of decimal places after that.

Slide 7

Now let’s take the point of view of Aya, who would like to test 5 × 30 = 150 gummy bears and has set a limit of 5 × 4 = 20 pieces. This limit is of course arbitrary. However, Aya thinks that a number under 20 is low enough to show that there are fewer yellow gummy bears than others.

The elements are thus k between 0 and 19 and n = 150. Naturally, p and 1-p remain unchanged at 0.16 and 0.84.

This time we’ll go straight to the online calculator and receive a rounded answer of P(X ≤ 19) = 0.1579.

While this value is clearly less than 0.463, it is still quite high. In approximately one-sixth of the sets of five bags, supposedly there are relatively few yellow gummy bears. 
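The online calculator can be checked with a few lines of Python. This is only a sketch: it assumes p = 0.16 as on the slide, and the helper name binom_cdf is chosen for this example.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for a binomial distribution with parameters n and p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n = 150    # five bags with 30 gummy bears each
p = 0.16   # rounded probability for "yellow", as on the slide (assumption)

aya_probability = binom_cdf(19, n, p)  # fewer than 20 yellow ones
print(round(aya_probability, 4))

# Compare with the usual levels of 5 percent and 1 percent
for level in (0.05, 0.01):
    print(level, aya_probability <= level)
```

Both comparisons come out False, which is the formal version of "still too much randomness."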

Is there a sufficient basis to lodge a complaint with the manufacturer?

Obviously, it’s a matter of opinion. However, the statistics experts have agreed to reject this value. There is simply still too much randomness. The limit is drawn at 0.05 or preferably at 0.01 and we speak of a 5 percent level or a 1 percent level.

Expressed casually: A letter of complaint from Jala based on her data or on Aya’s suggestion will likely not succeed. A probability of error of 1/6 or even 1/2 is deemed unacceptable.

Slide 8

One more thing: If you really want to accept an error in only 1 percent of the cases, this works only if the limit is set at 13 or fewer yellow gummy bears among the 150. Calculate this for yourself using a suitable application.
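One possible way to "calculate this for yourself" is to search for the largest limit whose probability stays at or below the chosen level. The following sketch assumes p = 0.16 and n = 150 as before; largest_limit is a hypothetical helper, not a function from any statistics package.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for a binomial distribution with parameters n and p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def largest_limit(n: int, p: float, alpha: float) -> int:
    """Largest k with P(X <= k) <= alpha; returns -1 if even k = 0 is too likely."""
    k = -1
    while k + 1 <= n and binom_cdf(k + 1, n, p) <= alpha:
        k += 1
    return k


limit_1_percent = largest_limit(150, 0.16, 0.01)
print(limit_1_percent)  # the transcript states 13

# The same search for the 5 percent level
limit_5_percent = largest_limit(150, 0.16, 0.05)
print(limit_5_percent)
```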

Slide 9

Let’s look at another example, but let’s stay in the world of gummy bears. This time it’s about a different property, specifically their weight.

Gummy bears mostly come from a factory and are produced by machine. One very delicious brand comes with 20 gummy bears in a 50 gram bag, so each weighs about 2.5 g. Generally, the pieces should not deviate more than 0.2 g upwards or downwards. At most 3 percent of the pieces are allowed to deviate by more than that. The machine’s reliability is checked using these values. If the result is poor, the machine must be serviced.

Let’s assume that the check is performed manually and a sample of 100 gummy bears is drawn. The machine is viewed as needing servicing if at least four gummy bears weigh less than 2.3 g or more than 2.7 g. Is that a reasonable limit?

Slide 10

Here is the situation:

We are dealing with a Bernoulli trial because there are only two possibilities. A gummy bear lies within the set weight standard or not.

A sample of 100 gummy bears is drawn.

We now assume that the sample contains at least four gummy bears whose weight lies outside of the standard.

Does the machine need to be serviced? Or could it still work properly?

How can we check this?

Slide 11

And of course, the decision must always be made with a little uncertainty. The question is, how great is the risk of accepting the high costs for servicing without good reason?

Thus, how high is the probability that we will find more than three gummy bears with the wrong weight in a random sample of 100 gummy bears? “Wrong” means the weight deviates too far upwards or downwards from the set standard.

We’re looking for P(X ≥ 4) for 100 trials and a tolerance of 3 percent, that is, p = 0.03.

We use the calculator and determine that the sought-after probability is about 35 percent.

That is high, and it appears to be quite risky to shut down the machine and have it serviced for this small error.
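Here is a short sketch of what the calculator does in this case. It uses the complement: "at least four" is the opposite of "at most three." The values n = 100 and p = 0.03 come from the example; the helper name is an assumption of this sketch.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for a binomial distribution with parameters n and p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n = 100    # size of the sample
p = 0.03   # tolerated share of gummy bears with the wrong weight

# P(X >= 4) = 1 - P(X <= 3)
false_alarm_risk = 1 - binom_cdf(3, n, p)
print(round(false_alarm_risk, 3))  # about 0.353, i.e. roughly 35 percent
```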

Slide 12

Asked another way: 

How many gummy bears with a deviating weight do we have to tolerate if the risk of unnecessarily shutting down the machine should be at most 5 percent?

The calculation shows that up to six gummy bears with the wrong weight may occur.

Or, if you draw 1,000 gummy bears at one time, then 40 gummy bears is actually a good limit for making a wrong decision in only 5 percent of the cases. Try it out yourself. 
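If you want to try it out in Python, the sketch below searches for the smallest number of wrong-weight gummy bears at which the machine may be shut down while keeping the false-alarm risk at or below 5 percent. It assumes p = 0.03 as above and builds the probabilities step by step, so it also works for the sample of 1,000. The function name is invented for this example.

```python
def smallest_shutdown_count(n: int, p: float, alpha: float) -> int:
    """Smallest k with P(X >= k) <= alpha for a binomial(n, p) variable X."""
    pmf = (1 - p) ** n   # P(X = 0)
    below = 0.0          # P(X < k), updated as k grows
    for k in range(n + 1):
        if 1.0 - below <= alpha:   # P(X >= k) is small enough
            return k
        below += pmf
        # Step from P(X = k) to P(X = k + 1)
        pmf *= (n - k) / (k + 1) * p / (1 - p)
    return n + 1  # no count within 0..n satisfies the condition


shutdown_100 = smallest_shutdown_count(100, 0.03, 0.05)
print(shutdown_100)       # shut down from this count on
print(shutdown_100 - 1)   # largest count that is still tolerated

shutdown_1000 = smallest_shutdown_count(1000, 0.03, 0.05)
print(shutdown_1000)
```

For the sample of 100, the search returns 7, so up to six gummy bears with the wrong weight are tolerated.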

Slide 13

We once worked with real gummy bears. Do you remember?

At that time, it was suggested that there were far fewer red and green gummy bears than yellow, white, and orange gummy bears.

We count what is pictured here: Of 102 gummy bears, 47 are either dark red or light red or green and 55 are yellow, white, or orange. Until now, we intuitively felt that this was approximately equal. Now we can back this up statistically and reliably check it, and that’s exactly what we want to do now.

Slide 14

The hypothesis H0 is thus: There are the same number of gummy bears in the colors RED and GREEN as in the colors WHITE, YELLOW, and ORANGE.

To test the hypothesis, we use P(X < 48).

Slide 15

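And again we’ll let the calculator do the work. On the left side, you see that the probability of finding fewer than 48 red or green gummy bears among 102 is just under 25 percent. That is far above the 5 percent level, so the hypothesis H0 cannot be rejected: the suspicion that there are far fewer red and green gummy bears is not tenable.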

But let’s play with the numbers again. On the right side, you see the calculation with k = 42, and only with this number does the risk drop below 5 percent that the hypothesis will be wrongly rejected.
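Both calculations on this slide can be reproduced with a small sketch. It assumes that H0 translates to p = 0.5 for "red or green" and uses n = 102 from the picture; the helper names are made up for this illustration.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for a binomial distribution with parameters n and p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def largest_limit(n: int, p: float, alpha: float) -> int:
    """Largest k with P(X <= k) <= alpha; returns -1 if even k = 0 is too likely."""
    k = -1
    while k + 1 <= n and binom_cdf(k + 1, n, p) <= alpha:
        k += 1
    return k


n = 102   # gummy bears counted in the picture
p = 0.5   # under H0, both color groups are equally likely (assumption)

observed = binom_cdf(47, n, p)   # P(X < 48) = P(X <= 47)
print(round(observed, 3))        # just under 0.25

limit = largest_limit(n, p, 0.05)
print(limit)                     # the transcript uses k = 42
```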

Slide 16

We can also perform the calculation for each color individually. Then the Bernoulli trial is simply one particular color against all the others. To the right of the table you see the corresponding probabilities; in none of the cases can we assume that the distribution is unfair.

Slide 17

Today’s episode was about the following:

The goal of hypothesis testing is to make a substantiated, statistically reliable statement about whether a hypothesis will be rejected or not with respect to the general population. 

We also call the supposition – the initial state – the null hypothesis and usually write it as H0. The alternative hypothesis – thus simply the opposite statement to the null hypothesis – is usually designated H1.

Slide 18

Do you still remember the two questions at the beginning in the context of the PISA study?

We asked whether the following claims can be proven.

Number 1: “The math proficiency of adolescents in Germany is better than the average proficiency of their contemporaries in OECD countries.”

Number 2: “In comparison with the OECD countries, there are few high-performing students and many low-performing students in Germany.”

Slide 19

Here are answers that were received in 2018 based on data from PISA.

Compared to the OECD countries, with 500 points Germany lies significantly above the OECD average of 489 points. With a very large sample, even smaller differences can become statistically significant.
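To get a rough feeling for the role of the sample size, here is a deliberately simplified sketch. It treats the 5,451 adolescents as a simple random sample and the OECD average of 489 as a fixed reference value; both are assumptions made for this illustration. The actual PISA analysis uses a more complex sampling design, so these numbers are not the ones the experts computed. The second sample size of 100 is purely hypothetical.

```python
from math import erfc, sqrt


def z_statistic(sample_mean: float, reference_mean: float, sd: float, n: int) -> float:
    """Distance between the two means, measured in standard errors."""
    return (sample_mean - reference_mean) / (sd / sqrt(n))


def one_sided_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))


# Values from the transcript: mean 500, SD 95, n = 5,451, OECD average 489
z_pisa = z_statistic(500, 489, 95, 5451)
print(round(z_pisa, 2), one_sided_p_value(z_pisa))

# Same difference of 11 points, but with a hypothetical sample of only 100
z_small = z_statistic(500, 489, 95, 100)
print(round(z_small, 2), round(one_sided_p_value(z_small), 3))
```

Under these assumptions, the same 11 points are far beyond the 5 percent level with 5,451 adolescents, but not with 100.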

The top group in the OECD comprises Japan (527), Korea (526), Estonia (523), and the Netherlands (519).

In Germany, 21.1 percent of 15-year-olds demonstrate a proficiency level of level 1 or lower. The value is not significantly below the OECD average of 24.0 percent.

In Germany, 13.3 percent of 15-year-olds demonstrate a proficiency level of level 5 or 6. This value is not significantly above the OECD average of 10.9 percent.

And here, of course, the size of the sample plays a role.

Slide 20

That’s all for today. It’s nice that you were here. Until next time.
