Basics of statistics and probability: How can you meaningfully represent the results?

Slide 0

I hope you're having a wonderful day. A warm welcome to all. Today we’re going to talk about data again, specifically about how you can represent data so that they are as meaningful as possible to readers. We’ll also talk about how you should not do it. We will see that representations can not only be awkward but can also be deliberately manipulative.

Slide 1

Here is the core of the matter: We would like to show the results of statistical observations so that the representation gets to the heart of what the data say. Of course, we want to do that as well as possible and, above all, minimize the risk of misinterpretation. In addition, the data should be prepared appropriately for the audience so that they can be understood as easily as possible.

Slide 2

Clearly, we first need data. If the data are available only as a table or something similar, we call this a raw data table. You see two examples here. One is the election of a class representative that we worked with in an earlier episode. The raw data table in this case is the tally sheet with the results. You also see the likewise familiar results of 100 rolls of an ordinary die. This table of unprocessed results is again a raw data table.

Slide 3

Let’s look at another example, namely the most spoken languages in the world. In this table, you see the estimated number of native and second-language speakers in millions for six languages. Okay, only the first five are really the most spoken languages in the world. But I thought that the numbers for German might be interesting even though this language ranks only 12th, with 132 million speakers.

In first and second place you see English and Mandarin, each with more than one billion speakers. Hindi comes in third with 637 million speakers and Spanish comes in fourth with 538 million. French follows in fifth place with 280 million active speakers.

By the way, a second language is somewhat different from a foreign language. Here we talk about second-language speakers if a language is essential in daily use, for example because it is an official language of the country, as English is in India. If we sort the data by native speakers, then Mandarin comes in first place, followed by Spanish in second and English in third.

Slide 4

A bar chart is often the easiest way to represent data. In this example, we place the individual languages along the x-axis and the number of speakers on the y-axis. Here the y-axis runs from 0 to 1.4 billion so that there is enough space above the tallest bars.

Each bar immediately allows the reader to roughly recognize the numbers of speakers and especially the relative placement of the results.

Slide 5

By the way, it is always important to consider which representation is truly suitable. For example, a pie chart is less useful in this case.

It does show the proportions among the languages quite well. However, it suggests that it includes everything. That’s not true in this case because there are many more languages in the world and far more people who do not speak precisely these represented languages. In addition, the chart gives the impression that every person is taken into account exactly once here. This is certainly also not the case because, for example, many people in India can communicate equally well in Hindi and English.

Slide 6

However, caution is also advised with the simple bar chart. Do you notice anything? Somehow the proportions seem to be shifted. Can you tell why this is?

Slide 7

Quite simply, the y-axis doesn’t start at 0 in the chart on the left like it does in the chart on the right, which you already know from slide 4. As a result, the bars for the smaller values appear much shorter in proportion than they should, especially for German but also for French. Be mindful that this sort of representation frequently appears in publications and can be completely misleading when the data are interpreted.
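The effect can be checked with a little arithmetic. The following is a minimal sketch in Python that uses the speaker totals quoted on slide 3 for Hindi, Spanish, French, and German; English and Mandarin are left out because the talk only says "more than one billion" for them. The axis start of 100 million is an assumption chosen for illustration, since the value used on the slide is not stated, and the function names are hypothetical.

```python
# Speaker totals in millions, as quoted on slide 3.
speakers = {"Hindi": 637, "Spanish": 538, "French": 280, "German": 132}


def bar_height(value, axis_start=0):
    """Visible height of a bar when the y-axis begins at axis_start."""
    return max(value - axis_start, 0)


def relative_heights(data, axis_start=0):
    """Each bar's visible height as a fraction of the tallest visible bar."""
    tallest = max(bar_height(v, axis_start) for v in data.values())
    return {name: bar_height(v, axis_start) / tallest for name, v in data.items()}


honest = relative_heights(speakers)          # y-axis starts at 0
truncated = relative_heights(speakers, 100)  # assumed axis start of 100 million

for name in speakers:
    print(f"{name:8s} axis from 0: {honest[name]:.2f}   axis from 100: {truncated[name]:.2f}")
```

Under these assumptions the German bar shrinks from roughly a fifth of the Hindi bar to well under a tenth, although the data have not changed at all.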

Slide 8

Here’s another bar chart, only in this case, the languages with smaller numbers of speakers seem to have caught up considerably. They owe this to the logarithmic scaling of the y-axis.

A logarithmic axis has no 0; it starts at 1 instead. The distance between 1 and 1,000 is triple the distance between 1 and 10, and the distance between 1 and 100 is double the distance between 1 and 10.

The principle is easier to understand when it is expressed using powers, with 1 = 10⁰: The distance between 1 and 1,000 = 10³ is triple the distance between 1 and 10 = 10¹, and the distance between 1 and 100 = 10² is twice the distance between 1 and 10 = 10¹.

In the linear representation, which we know not only from slide 4, but actually from our early schooldays, the distance between 0 and 100 is naturally ten times greater than the distance between 0 and 10. And we’re very accustomed to this perspective.

We therefore don’t always recognize logarithmic representations at first glance. For this reason, it is important to make sure that data that were scaled logarithmically are interpreted correctly.
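To see how strongly a logarithmic axis compresses differences, here is a small Python sketch. It measures positions from 1, because a logarithmic axis has no 0, and it assumes that the axis on the slide starts at 1 million speakers, which the talk does not state. The speaker totals for German and Hindi are the ones from slide 3; the function name is hypothetical.

```python
import math


def log_position(value):
    """Position of a value on a base-10 logarithmic axis, measured from 1."""
    return math.log10(value)


# Each further power of ten adds the same distance on the axis.
for value in (10, 100, 1000):
    print(f"{value:5d} lies {log_position(value):.0f} unit(s) above 1")

# Speaker totals in millions from slide 3.
german, hindi = 132, 637
linear_ratio = german / hindi                            # bar heights on a linear axis
log_ratio = log_position(german) / log_position(hindi)   # bar heights on the assumed log axis

print(f"German bar relative to Hindi bar, linear axis: {linear_ratio:.2f}")
print(f"German bar relative to Hindi bar, log axis:    {log_ratio:.2f}")
```

On the linear axis the German bar is about a fifth as tall as the Hindi bar; on the logarithmic axis assumed here it reaches about three quarters of its height, which is why the smaller languages seem to have caught up.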

Slide 9

A die is rolled. The chart shows how the relative frequency of rolling a 6 develops over the course of 1,250 rolls. Do you remember? We used a computer’s random number generator in this case. Here, the raw data, a sequence of 1,250 numbers between 1 and 6, don’t lend themselves well to being written out. The sequence starts with 6, 4, 5, 4, 2, 1, 4 and ends with 2, 4, 1, 3, 2, 4, 4. We’ll spare ourselves the rest. Important: even in this case there must, of course, be concrete data behind the chart.
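A chart like this can be reproduced in a few lines. The following Python sketch simulates 1,250 rolls and computes the relative frequency of a 6 after every roll. The seed and the function name are assumptions made for illustration, so the simulated sequence is not the one shown on the slide.

```python
import random


def running_relative_frequency(rolls, target=6):
    """Relative frequency of `target` after each roll in the sequence."""
    hits = 0
    frequencies = []
    for count, roll in enumerate(rolls, start=1):
        if roll == target:
            hits += 1
        frequencies.append(hits / count)
    return frequencies


rng = random.Random(2024)  # hypothetical seed, used only for reproducibility
rolls = [rng.randint(1, 6) for _ in range(1250)]  # the raw data behind the chart
trend = running_relative_frequency(rolls)

for n in (10, 100, 1250):
    print(f"after {n:4d} rolls: {trend[n - 1]:.3f}")
```

The list `rolls` plays the role of the raw data table, and `trend` holds the values that would be plotted against the number of rolls.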

Slide 10

The last example for today deals with rice consumption in selected regions, specifically in Europe, North America, Africa, Latin America, and Asia including the Pacific region. The per capita consumption per year differs significantly in the various regions. While consumption is rather low in Europe and North America at 4.6 kilograms and 12.5 kilograms respectively, it is considerably higher in Africa at 25.1 kilograms and Latin America at 29.3 kilograms. The front-runner is Asia-Pacific with consumption of nearly 85 kilograms per capita.

Let us process the data in a chart.

Slide 11

It’s very clear that a bar chart is useful here. It provides a good overview of where things currently stand.
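As a rough illustration, the following Python sketch prints such a bar chart as plain text. The values are the per capita figures from slide 10; Asia-Pacific is given only as "nearly 85" kilograms, so 85 is used here as a rounded stand-in. The scale of 2 kilograms per character and the function name are assumptions chosen for the example.

```python
# Per capita rice consumption in kg per year, from slide 10.
# Asia-Pacific is quoted only as "nearly 85", so 85.0 is a rounded stand-in.
consumption = {
    "Europe": 4.6,
    "North America": 12.5,
    "Africa": 25.1,
    "Latin America": 29.3,
    "Asia-Pacific": 85.0,
}


def text_bar(value, kg_per_char=2.0):
    """Bar whose length is proportional to value, measured from 0."""
    return "#" * round(value / kg_per_char)


for region, kg in consumption.items():
    print(f"{region:14s} {text_bar(kg):43s} {kg:5.1f}")
```

Because every bar starts at 0 and grows in one direction only, its length stays proportional to the value it represents.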

In the press you sometimes see nice pictures that are meant to visualize the data and correlations using suitable proportions. We see an example on the next slide.

Slide 12

How do you like the representation? What does it suggest?

Now, what was done here is not all that uncommon. For instance, if we compare Europe and North America, these regions differ roughly by a factor of 3, meaning that about three times as much rice is consumed per capita in North America. The picture is tripled both along the x-axis and along the y-axis, so the areas of the pictures differ by a factor of 3 × 3 = 9, which suggests a significantly greater difference. The factor of 2 between North America and Africa becomes a perceived factor of 2 × 2 = 4. Even the not-so-large difference between consumption in Africa and Latin America is clearly visible.

And Asia? It doesn’t even fit on the page.

Slide 13

In a comparison between Latin America and Asia, the actual factor of about 3 turns into a perceived factor of 3 × 3 = 9.
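The squaring effect can be checked against the unrounded figures. This Python sketch uses the per capita values from slide 10, with 85 as a rounded stand-in for the "nearly 85" kilograms quoted for Asia-Pacific. It assumes that each picture is scaled in both width and height by the true factor, as described for slide 12; the function name is hypothetical.

```python
# Per capita rice consumption in kg per year, from slide 10.
# Asia-Pacific is quoted only as "nearly 85", so 85.0 is a rounded stand-in.
consumption = {
    "Europe": 4.6,
    "North America": 12.5,
    "Africa": 25.1,
    "Latin America": 29.3,
    "Asia-Pacific": 85.0,
}


def true_and_perceived_factor(smaller, larger):
    """True ratio of two values and the area ratio when a picture
    is scaled by that ratio in both width and height."""
    true_factor = larger / smaller
    return true_factor, true_factor ** 2


pairs = [
    ("Europe", "North America"),
    ("North America", "Africa"),
    ("Latin America", "Asia-Pacific"),
]

for low, high in pairs:
    true_factor, area_factor = true_and_perceived_factor(consumption[low], consumption[high])
    print(f"{low} -> {high}: true factor {true_factor:.1f}, area factor {area_factor:.1f}")
```

The exact ratios are a little below the rounded factors of 3 and 2 used in the talk, but the pattern is the same: the area of the picture grows with the square of the true factor.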

So pay close attention when you see representations of data. And in case of doubt, ask for the raw data table.

Slide 14

That’s all for today. Thank you for being here, and I look forward to seeing you in the next episode.
