Blog
2023-02-03
Imagine a set of 142 points on a two-dimensional graph.
The mean of the \(x\)-values of the points is 54.26.
The mean of the \(y\)-values of the points is 47.83.
The standard deviation of the \(x\)-values is 16.76.
The standard deviation of the \(y\)-values is 26.93.
What are you imagining that the data looks like?
Whatever you're thinking of, it's probably not this:
This is the datasaurus, a dataset that was created by Alberto Cairo in
2016 to remind people to look beyond the summary statistics when analysing a dataset.
Anscombe's quartet
In 1973, four datasets with a similar aim were published. Graphs in statistical analysis by Francis J Anscombe [1] contained four datasets that have become known as Anscombe's quartet: they all have the same
mean \(x\)-value, mean \(y\)-value, standard deviation of \(x\)-values, standard deviation of \(y\)-values, and linear regression line, as well as multiple other values
related to correlation and variance. But if you plot them, the four datasets look very different:

Plots of the four datasets that make up Anscombe's quartet. For each set of data:
the mean of the \(x\)-values is 9; the mean of the \(y\)-values is 7.5;
the standard deviation of the \(x\)-values is 3.32; the standard deviation of the \(y\)-values is 2.03;
the correlation coefficient between \(x\) and \(y\) is 0.816;
the linear regression line is \(y=3+0.5x\);
and the coefficient of determination of the linear regression is 0.667.
Anscombe noted that there were prevalent attitudes that:
- "Numerical calculations are exact, but graphs are rough."
- "For any particular kind of statistical data, there is just one set of calculations constituting a correct statistical analysis."
- "Performing intricate calculations is virtuous, actually looking at the data is cheating."
The four datasets were designed to counter these by showing that data exhibiting the same statistics can actually be very very different.
The datasaurus dozen
Anscombe's datasets indicate their point well, but the arrangement of the points is very regular and looks a little artificial when compared with real data sets.
In 2017, twelve sets of more realistic-looking data were published (in Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing by Justin Matejka and George Fitzmaurice [2]).
These datasets—known as the datasaurus dozen—all had the same
mean \(x\)-value, mean \(y\)-value,
standard deviation of \(x\)-values, standard deviation of \(y\)-values, and correlation coefficient (to two decimal places) as the datasaurus.

The twelve datasets that make up the datasaurus dozen. For each set of data (to two decimal places):
the mean of the \(x\)-values is 54.26; the mean of the \(y\)-values is 47.83;
the standard deviation of the \(x\)-values is 16.76; the standard deviation of the \(y\)-values is 26.93;
the correlation coefficient between \(x\) and \(y\) is -0.06.
Creating datasets like this is not trivial: if you have a set of values for the statistical properties of a dataset, it is difficult to create a dataset with those properties—especially
one that looks like a certain shape or pattern.
But if you already have one dataset with the desired properties, you can make other datasets with the same properties by very slightly moving every point in a random direction then
checking that the properties are the same—if you do this a few times, you'll eventually get a second dataset with the right properties.
The datasets in the datasaurus dozen were generated using this method: repeatedly adjusting all the points ever so slightly, checking if the properties were the same, then
keeping the updated data if it's closer to a target shape.
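The loop described above can be sketched in Python as follows. This is a simplified illustration and not the code used for the datasaurus dozen or the databet: it uses a plain "keep it if it's closer" rule rather than the simulated annealing of [2], it assumes population standard deviations, and `target_distance` is a hypothetical function (smaller means closer to the target shape) that you would have to supply.

```python
import random
import statistics


def summary(points, digits=2):
    """Means, standard deviations and correlation, rounded to `digits` places."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = statistics.fmean([(x - mx) * (y - my) for x, y in points])
    return tuple(round(v, digits) for v in (mx, my, sx, sy, cov / (sx * sy)))


def perturb(points, target_distance, steps=1000, scale=0.01, seed=0):
    """Nudge every point repeatedly, keeping a nudge only if the rounded
    statistics are unchanged and the data is closer to the target shape."""
    rng = random.Random(seed)
    current = list(points)
    goal = summary(current)
    for _ in range(steps):
        candidate = [
            (x + rng.gauss(0, scale), y + rng.gauss(0, scale)) for x, y in current
        ]
        if summary(candidate) == goal and target_distance(candidate) < target_distance(current):
            current = candidate
    return current


def distance_to_horizontal_line(points):
    """Toy target (an assumption for the demo): the line y = 0.5."""
    return sum(abs(y - 0.5) for _, y in points)


demo_rng = random.Random(1)
start = [(demo_rng.random(), demo_rng.random()) for _ in range(50)]
result = perturb(start, distance_to_horizontal_line, steps=200)
print(summary(start) == summary(result))  # True
```

Most candidates are rejected, which is why generating each shape takes a long time.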
The databet
Using the same method, I generated the databet: a collection of datasets that look like the letters of the alphabet. I started with this set
of 100 points resembling a star:
After a long time repeatedly moving points by a very small amount, my computer eventually generated these 26 datasets, all of which have the same means,
standard deviations, and correlation coefficient:

The databet. For each set of data (to two decimal places):
the mean of the \(x\)-values is 0.50; the mean of the \(y\)-values is 0.52;
the standard deviation of the \(x\)-values is 0.17; the standard deviation of the \(y\)-values is 0.18;
the correlation coefficient between \(x\) and \(y\) is 0.16.
Words
Now that we have the alphabet, we can write words using the databet.
Given two data sets with the same number of points, we can make a new larger dataset by including all the points in both the smaller sets.
It is possible to write the mean and standard deviation of the larger dataset in terms of the means and standard deviations of the smaller sets: in each case,
the statistic of the larger set depends only on the statistics of the smaller sets and not on the actual data.
Applying this to the databet, we see that the datasets that spell words of a fixed length will all have the same mean and standard deviation.
(The same is not true, sadly, for the correlation coefficient.) For example, the datasets shown in the following plot both have the same means and standard deviations:

Datasets that spell "TRUE☆" and "FALSE". For both sets of data (to two decimal places):
the mean of the \(x\)-values is 2.50; the mean of the \(y\)-values is 0.52;
the standard deviation of the \(x\)-values is 1.42; the standard deviation of the \(y\)-values is 0.18.
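The claim that the statistics of a combined dataset depend only on the statistics of its parts can be illustrated with a short sketch. It assumes population standard deviations (the post does not say which convention is used), and the function name `combine` and the two small lists are made up for the example.

```python
import math
import statistics


def combine(n1, mean1, sd1, n2, mean2, sd2):
    """Mean and population sd of two datasets pooled together,
    computed from their sizes, means and sds only."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # For each part, the mean of the squared values is sd^2 + mean^2.
    mean_of_squares = (n1 * (sd1**2 + mean1**2) + n2 * (sd2**2 + mean2**2)) / n
    return mean, math.sqrt(mean_of_squares - mean**2)


# Made-up data, used only to check the formula against a direct calculation.
a = [1.0, 2.0, 4.0, 7.0]
b = [3.0, 3.5, 8.0, 9.5]

pooled_mean, pooled_sd = combine(
    len(a), statistics.fmean(a), statistics.pstdev(a),
    len(b), statistics.fmean(b), statistics.pstdev(b),
)
print(math.isclose(pooled_mean, statistics.fmean(a + b)))  # True
print(math.isclose(pooled_sd, statistics.pstdev(a + b)))   # True
```

The function never looks at the individual points, so any two letters with matching statistics can be swapped without changing the result. No such formula exists for the correlation coefficient using only these statistics, which fits the remark that it is not preserved.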
Hopefully by now you agree with me that Anscombe was right: it's very important to plot data as well as looking at the summary statistics.
If you want to play with the databet yourself, all the letters are available on GitHub in JSON format.
The GitHub repo also includes fonts that you can download and install so you can use Databet Sans in
your next important document.
References
[1] Graphs in statistical analysis by Francis J Anscombe. American Statistician, 1973.
[2] Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing by Justin Matejka and George Fitzmaurice. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017.
Comments
Comments in green were written by me. Comments in blue were not written by me.
⭐ top comment (2023-02-03) ⭐
Very cool! Thanks for sharing
Jessica
2023-01-08
Welcome to 2023 everyone! Now that the Advent calendar has disappeared, it's time to reveal the answers and announce the winners.
But first, some good news: with your help, the drones were all destroyed in time for Santa to deliver presents and Christmas was saved!
Now that the competition is over, the questions and all the answers can be found here.
Before announcing the winners, I'm going to go through some of my favourite puzzles from the calendar and a couple of other interesting bits and pieces.
Highlights
My first highlight is the puzzle from 1 December. I like this puzzle because the lines of symmetry are not the ones that you might expect for a rectangle, although it's not too hard to
see what the lines of symmetry are, so this makes a nice gentle first puzzle.
1 December
One of the vertices of a rectangle is at the point \((9, 0)\). The \(x\)-axis and \(y\)-axis are both lines of symmetry of the rectangle.
What is the area of the rectangle?
My next highlight is the puzzle from 11 December. I always enjoy a surprise appearance of the Fibonacci sequence.
11 December
There are five 3-digit numbers whose digits are all either 1 or 2 and that do not contain
two 2s in a row: 111, 112, 121, 211, and 212.
How many 14-digit numbers are there whose digits are all either 1 or 2 and that do not contain
two 2s in a row?
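To see where the Fibonacci sequence comes from, here is a small sketch (my own illustration, not the intended pen-and-paper solution). A valid number ending in 1 can be extended with either digit, while one ending in 2 can only be extended with a 1, and this gives a Fibonacci-style recurrence. The brute-force function is there only to check the recurrence.

```python
from itertools import product


def count_brute_force(n):
    """Count n-digit strings of 1s and 2s with no two 2s in a row."""
    return sum("22" not in "".join(digits) for digits in product("12", repeat=n))


def count_recurrence(n):
    """Same count, tracking how many valid strings end in 1 and in 2."""
    ends_in_1, ends_in_2 = 1, 1  # the one-digit strings "1" and "2"
    for _ in range(n - 1):
        # Appending 1 is always allowed; appending 2 needs a string ending in 1.
        ends_in_1, ends_in_2 = ends_in_1 + ends_in_2, ends_in_1
    return ends_in_1 + ends_in_2


print([count_recurrence(n) for n in range(1, 7)])  # [2, 3, 5, 8, 13, 21]
print(count_recurrence(3) == count_brute_force(3) == 5)  # True
```

Calling `count_recurrence(14)` gives the answer to the puzzle.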
My next highlight is the puzzle from 13 December. I love a good crossnumber, and had a lot of fun making this small one up. (If you enjoyed this one, you should check out the
crossnumbers I write for Chalkdust.)
13 December
Today's number is given in this crossnumber. The across clues are given as normal, but the down clues are given in a random order: you must work out
which clue goes with each down entry and solve the crossnumber to find today's number.
No number in the completed grid starts with 0.
My final highlight is the puzzle from 24 December. You could solve this by doing a lot of expanding, but there's a neat shortcut that makes it almost trivial to solve.
24 December
The expression \((3x-1)^2\) can be expanded to give \(9x^2-6x+1\). The
sum of the coefficients in this expansion is \(9-6+1=4\).
What is the sum of the coefficients in the expansion of \((3x-1)^7\)?
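One shortcut (possibly not the one the puzzle had in mind) is that substituting \(x=1\) into a polynomial adds up its coefficients, so the sum for \((3x-1)^n\) is \((3-1)^n\). The sketch below checks this against a direct expansion; the helper name `expand_power` is made up for the example.

```python
def expand_power(a, b, n):
    """Coefficients of (a*x + b)**n, lowest power of x first."""
    coefficients = [1]
    for _ in range(n):
        product = [0] * (len(coefficients) + 1)
        for power, c in enumerate(coefficients):
            product[power] += b * c      # multiply by the constant term
            product[power + 1] += a * c  # multiply by a*x
        coefficients = product
    return coefficients


print(expand_power(3, -1, 2))       # [1, -6, 9]
print(sum(expand_power(3, -1, 2)))  # 4, which is (3 - 1)**2

# The long way and the shortcut agree for the seventh power too.
print(sum(expand_power(3, -1, 7)) == (3 * 1 - 1) ** 7)  # True
```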
Hardest and easiest puzzles
Once you've entered 24 answers, the calendar checks these and tells you how many are correct. I logged the answers that were sent
for checking and have looked at these to see which puzzles were the most and least commonly incorrect. The bar chart below shows the total number
of incorrect attempts at each question.
(Bar chart: number of incorrect attempts at each puzzle, by day, from 1 to 24 December.)
You can see that the most difficult puzzles were those on 11, 18 and 19 December; and the easiest puzzle was on 8 December.
The winners
And finally (and maybe most importantly), on to the winners: 192 people managed to destroy all three drones. That's more people than last year:
(Bar chart: number of people who completed the Advent calendar each year, from 2015 to 2022.)
From the correct answers, the following 10 winners were selected:
- Claire Metcalfe
- Shivanshi
- Gary M
- Katharine Velleman
- James Dolengewicz
- Cathy Hooper
- Alan Buck
- Yurie Ito
- Kai
- Nicholas Jackson
Congratulations! Your prizes will be on their way shortly.
The prizes this year include 2022 Advent calendar T-shirts. If you didn't win one, but would like one of these, I've made them available to buy at merch.mscroggs.co.uk alongside the T-shirts from previous years.
Additionally, well done to
Aaron Johnson, Aaron Stiff, Aidan Dodgson, Alejandro Villarreal, Alek2ander, Alex Bolton, Alex Davis, Alex Hartz, Andrew Brady, Andrew Brodie, Andrew Ennaco, Andrew Roy, Andrew Turner, Artie Smith, Ashton Lewis, Austin Antoniou, Becky Russell, Ben Baker, Ben Boxall, Ben Reiniger, Ben Tozer, Ben Weiss, Beth Jensen, Blake, Brennan Dolson, Brian Carnes, Brian Wellington, Carl Westerlund, Carmen, Charleston W, Chris Eagle, Chris Hellings, Colin Beveridge, Colin Brockley, Connie, Corbin Groothuis, CreativeCrocheter, Dan Colestock, Dan DiMillo, Dan May, Dan Swenson, Dan Whitman, Daniel Cuneo, David and Ivy Walbert, David Ault, David Berardo, David Fox, David Kendel, David Mitchell, Deborah Tayler, Deborah Tayler, Derek Perrin, Dominik Niemand, Don Anderson, Dr Lizzie, Duncan Schaafsma, Dylan Richard, Eleanor, Elizabeth Blackwell, Elizabeth Madisetti, Emilie Heidenreich, Emily Troyer, Emma, Eoin Davey, Eric Kolbusz, Eric Scotti, Erik Eklund, Fionn Woodcock, Frances, Frank Kasell, Fred Verheul, Freddie Mao, Félix Breton, Gabriella Pinter, Gary M. 
Gerken, Gerry, Gert-Jan, Greg W., Gregory Loges, Greta, Han Whiteoak, Hannah Charman, Heerpal Sahota, Helen F, Herschel, Iris, Jack, Jacob, Jacob Loader, James Chapman, James Cunnane, Jarvis9, Jean-Noël Monette, Jean-Sébastien Turcotte, Jen Sparks, Jessica Marsh, Jim Ashworth, Jon Palin, Jonathan Chaffer, Jonathan Thiele, Jorge del Castillo Tierz, Joseph Gage, Joseph Wagner, Joshua Park, Karen Climis, Kevin Docherty, Kirsty Fish, Kristen Koenigs, Kyle Allen, Lazar Ilic, Lewis Dyer, Lise Andreasen, Louis, LycanFayn, Lyra, Magnus Eklund, Marco van der Park, Mark Stambaugh, Martin Harris, Martin Holtham, Mathryn, Matt Thomson, Mels, Merrilyn, Michael DeLyser, Mihai Zsisku, Mike L, Mike R, Millie, Mr J Winfield, Nadine Chaurand, Nancy Walker, Naomi Bowler, Naomi C, Nick Keith, Niji Ranger, Pamela Docherty, Patrick, Philip Corradi, Priyesh, Pup, Qaysed, Qaysed, Rashi, Ray Arndorfer, Reid, Reuben, Riccardo Lani, Rob Dixon, Robert Brady, Roger Lipsett, Roni Malek, Rosie Paterson, Russ Collins, Ruth Franklin, Sage Robinson, Sam Drei, Sarah Brook, Scott, Sean Henderson, Seth Cohen, shadorfff, Simon English, Stephen Cappella, Stephen Jasina, Sumaya Felic, Tamara Brenner, Tarim, The Connors of York, The Steelblade, Tom Fryers, Tony Mann, tripleboleo, Tyler St Clare, UsrBinPRL, Valentin VĂLCIU, Vinayak, Vinny R, vortex, Yasha, Yuliya N., and Zoran Morrissey-Ralevic.
who all also completed the Advent calendar but were too unlucky to win prizes this time or chose to not enter the prize draw.
See you all next December, when the Advent calendar will return.
Edit: Removed myself (and a second copy of myself) from the list of solvers.
Comments
Comments in green were written by me. Comments in blue were not written by me.
You didn't mention the rate limiting you put in for the bots!
Sorry about that
(anonymous)
I fought very hard to solve the middle "here are the 6 answers, construct the 6 small problems", but I just couldn't. Hints about that one and the genre in general would be great.
Lise Andreasen
@Valentin VĂLCIU: Oops, forgot to remove my testing that everything works from the list of people! (Removing it now)
Matthew
2022-12-29
This is the 100th blog post on this website!
But if I hadn't pointed this out,
you might not have noticed: the URL of the page is mscroggs.co.uk/blog/99 and not mscroggs.co.uk/blog/100.
This is a great example of an off-by-one error.
Off-by-one errors are one of the most common errors made when programming and elsewhere, and this is an excellent opportunity to blog about them.
Fence posts and fence panels
Imagine you want to make a straight fence that is 50m long. Each fencing panel is 2m long.
How many fence posts will you need?
Have a quick think about this before reading on.
If you're currently thinking about the number 25, then you've just made an off-by-one error.
The easiest way to see why is to think about some shorter fences.
If you want to make a fence that's 2m long, then you'll need just one fence panel. But one fence
post will not be enough: you'll need a second post to put at the other end of the fence panel.
If you want to make a 4m long fence, you'll need a post before the first panel, a post between
the two panels, and a post after the second panel: that's three posts in total.
In general, you'll always need one more fence post than panel, as you need a fence post
at the start of each panel and an extra post at the end of the final panel.
(Unless, of course, you're building a fence that is a closed loop.)
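The fence post count can be written as a tiny function. This is just a sketch of the reasoning above; the function name and the `closed_loop` flag are made up for the example, and it assumes the fence length is a whole number of panels.

```python
def posts_needed(fence_length, panel_length, closed_loop=False):
    """Number of fence posts for a fence built from equal-length panels."""
    panels = fence_length // panel_length
    # A closed loop shares its first and last post, so no extra post is needed.
    return panels if closed_loop else panels + 1


print(posts_needed(2, 2))   # 2
print(posts_needed(4, 2))   # 3
print(posts_needed(50, 2))  # 26, not 25
print(posts_needed(50, 2, closed_loop=True))  # 25
```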
This fence post/fence panel issue appears surprisingly often, and can make counting things
quite difficult. For example, the first blog post
on this website was posted in 2012: ten years ago. But if you count the number of years listed in the
archive there are 11 years. If you release an issue of a magazine once a year, then issue 11 (not issue 10) will
be the issue released 10 years after you start. If, like Chalkdust,
you release issues twice a year, issue 21 (not issue 20) will be the 10 year issue.
Half-open intervals
An interval is called closed if it includes its starting and ending point, and open if it
doesn't include them. A half-open interval includes one end point and not the other.
Using half-open intervals makes counting things less difficult: including one endpoint but not the other is a bit like ignoring
the final (or first) fence post so that there are the same number of posts and panels.
In Python, the range function includes the first number but not the last
(this is the sensible choice as including the final number and not the first would be very confusing).
range(5, 8) includes the numbers 5, 6, and 7 (but not 8).
By excluding the final number, the number of numbers in a range
will be equal to the difference between the two input numbers.
Excluding the final item so that the number of items in a range is equal to the difference between the start and end is a great way to
reduce opportunities for off-by-one errors, and isn't too hard to get used to.
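A quick check of these properties in Python (the variable names are only for illustration):

```python
start, stop = 5, 8
numbers = list(range(start, stop))

print(numbers)                       # [5, 6, 7]
print(len(numbers) == stop - start)  # True

# Adjacent half-open ranges join up with no gap and no overlap.
joined = list(range(0, 5)) + list(range(5, 8))
print(joined == list(range(0, 8)))   # True
```

The last check is the fence post picture again: each range brings its own first post, and the next range supplies the post at the end.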
Why start at 0?
We've seen a couple of causes of off-by-one errors, but we've not yet seen why this page's URL
contains 99 rather than 100. This is because the numbering of blog posts started at zero.
But why is it a sensible choice to start at 0?
Using a half-open range, the first \(n\) numbers starting at 1 would be range(1, n + 1); the first \(n\) numbers starting at 0 on the other hand
would be range(0, n). The second option is neater, as you don't have to add one to the final number; the first option opens up more opportunities for
off-by-one errors.
This is one of the reasons why Python and many other programming languages start their numbering from 0.
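Applied to this website's posts, and assuming the post numbers are handed out consecutively starting from 0, a short sketch shows why the 100th post ends up at number 99:

```python
number_of_posts = 100

# Numbering from 0: the end of the range is just the number of posts.
post_numbers = range(0, number_of_posts)
print(len(post_numbers))                  # 100
print(post_numbers[0], post_numbers[-1])  # 0 99

# Numbering from 1 needs the "+ 1" that is easy to forget.
one_based = range(1, number_of_posts + 1)
print(len(one_based), one_based[-1])      # 100 100
```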
Why doesn't everyone start at 0?
Starting at 0 and using half-open intervals to represent ranges of integers seem like good ways to help people avoid making off-by-one errors, but this choice is not perfect.
If you want to write a range of numbers from 1 to 8 inclusive using this convention, you would have to write range(1, 9):
forgetting to add one to the final number in this situation is another source of off-by-one errors.
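One way to reduce this risk is to hide the "+ 1" inside a small helper. The name `inclusive_range` is made up for this sketch; it is not a built-in Python function.

```python
def inclusive_range(first, last):
    """Integers from first to last, including both endpoints."""
    return range(first, last + 1)


print(list(inclusive_range(1, 8)))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(list(range(1, 8)))            # [1, 2, 3, 4, 5, 6, 7] (the off-by-one mistake)
```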
It's also more natural to many people to start counting from 1, so some programming languages choose different conventions. The following table sums up the different possible
conventions, which desirable properties they have, and which languages use them.
Convention | Languages using this convention | Length of range is difference between endpoints | range(START, n) contains \(n\) numbers | range(START, n) contains START | range(START, n) contains \(n\) |
START=0, range includes first endpoint only | Python, JavaScript, PHP, Rust, C, C++ | ✓ | ✓ | ✓ | ✗ |
START=0, range includes last endpoint only | | ✓ | ✓ | ✗ | ✓ |
START=0, range includes both endpoints | | ✗ | ✗ | ✓ | ✓ |
START=0, range includes neither endpoint | | ✗ | ✗ | ✗ | ✗ |
START=1, range includes first endpoint only | | ✓ | ✗ | ✓ | ✗ |
START=1, range includes last endpoint only | | ✓ | ✗ | ✗ | ✓ |
START=1, range includes both endpoints | MATLAB, Julia, Fortran | ✗ | ✓ | ✓ | ✓ |
START=1, range includes neither endpoint | | ✗ | ✗ | ✗ | ✗ |
(I don't know of any languages that use any of the other conventions, but if you do, please let me know in the comments below and I'll add them.)
None of the conventions manages to remove all the possible sources of confusion, so it looks like off-by-one errors are here to stay.
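The ticks and crosses in the table can be reproduced with a short script. This is a sketch under my own naming: `make_range` is a hypothetical stand-in for "range(START, n)" under each convention, built on top of Python's real `range`.

```python
from itertools import product


def make_range(start, n, include_first, include_last):
    """The integers in 'range(start, n)' under a given endpoint convention."""
    low = start if include_first else start + 1
    high = n if include_last else n - 1
    return list(range(low, high + 1))


def properties(start, include_first, include_last, n=10):
    """The four table columns, in order, as booleans."""
    r = make_range(start, n, include_first, include_last)
    return (
        len(r) == n - start,  # length is the difference between the endpoints
        len(r) == n,          # contains n numbers
        start in r,           # contains START
        n in r,               # contains n
    )


for start, first, last in product((0, 1), (True, False), (True, False)):
    ticks = " ".join("✓" if p else "✗" for p in properties(start, first, last))
    print(f"START={start}, first={first}, last={last}: {ticks}")
```

No row comes out with four ticks, which matches the conclusion above.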
Comments
Comments in green were written by me. Comments in blue were not written by me.
Hi!!!
Love your blog posts!
They make me get out of bed in the morning.
Just wanted to show my appreciation.
Cheers.
Anonymous#3728
2022-12-04
In November, I spent some time (with help from TD) designing this year's Chalkdust puzzle Christmas card.
The card looks boring at first glance, but contains 11 puzzles. By colouring in the answers to the puzzles on the front of the card in black (each answer appears twice), then colouring remaining squares
containing 0s red, and regions containing a star brown,
you will reveal a Christmas themed picture.
If you want to try the card yourself, you can download this printable A4 pdf. Alternatively, you can find the puzzles below and type the answers in the boxes. The answers will automatically be found and coloured in black, and appropriate squares and regions will be coloured red and brown...
The puzzles
1. What is the only prime number that is both two more than a prime number and two less than a prime number?
2. Holly adds up the first 7 odd numbers. What total does she get?
3. Holly next adds up the first \(n\) odd numbers to get a total of 1089. What is \(n\)?
4. Ivy starts with 0 then adds or subtracts some multiples of 4 or 7. What is the smallest positive integer that she could have ended with?
5. Ivy again starts with 0, but this time she adds or subtracts some multiples of 240 or 400. What is the smallest positive integer that she could have ended with?
6. How many 4-digit integers are there whose digits are all non-zero and whose digits add up to 7?
7. How many positive integers are there whose digits are all non-zero and whose digits add up to 7?
8. Eve wrote down a four-digit number. Eve then removed one of the digits of her number to make a three-digit number. The sum of her two numbers is 3119. What was her four-digit number?
9. Eve wrote down a five-digit number. Eve then removed one of the digits of her number to make a four-digit number. The sum of her two numbers is 96158. What is the largest number that her five-digit number could have been?
10. Noel drew 12 points on the circumference of a circle, then drew a straight line connecting every pair of points. How many lines did he draw?
11. Noel drew some points on the circumference of a circle, then drew a straight line connecting every pair of points. He drew 2926 lines. How many points did he draw?
Comments
Comments in green were written by me. Comments in blue were not written by me.
Great fun thanks. At first they seem impossible but then a way through appears! How do I get the answers / check if I’m right?
Graeme Johnston
2022-11-25
This year, the front page of mscroggs.co.uk will once again feature an Advent calendar, just like
in each of the last seven years.
Behind each door, there will be a puzzle with a three-digit solution. The solution to each day's puzzle forms part of a logic puzzle:
It's nearly Christmas and something terrible has happened: an evil Christmas-hater has set three drones loose above Santa's stables. As long as the drones are flying around, Santa is unable to
take off to deliver presents to children all over the world.
You need to help Santa by destroying the drones so that he can deliver presents before Christmas is ruined for everyone.
Each of the three drones was programmed with four integers between 1 and 20 (inclusive): the first two of these are the drone's starting position; the last two give the drone's daily speed.
The drones have divided the sky above Santa's stables into a 20 by 20 grid. On 1 December, the drones will be at their starting position.
Each day, every drone will add the first number in their daily speed to their horizontal position, and the second
number to their vertical position. If the drone's position in either direction becomes greater than 20, the drone will subtract 20 from their position in that direction.
Midnight in Santa's special Advent timezone is at 5am GMT, and so the day will change and the drones will all move at 5am GMT.
For example, if a drone's starting position was (1, 12) and its movement was (5, 7), then:
- on day 1, it would be at (1, 12);
- on day 2, it would be at (6, 19);
- on day 3, it would be at (11, 6);
- on day 4, it would be at (16, 13);
- on day 5, it would be at (1, 20);
- on day 6, it would be at (6, 7);
- and so on.
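The movement rule can be sketched in Python. The function name `drone_position` is made up for this example; it uses modular arithmetic, which gives the same result as subtracting 20 whenever a coordinate goes above 20.

```python
GRID_SIZE = 20


def drone_position(start, speed, day):
    """Position of a drone on a given day of December (day 1 is the start)."""
    return tuple(
        # Shift to 0-based, move (day - 1) times, wrap round, shift back to 1-based.
        (s - 1 + v * (day - 1)) % GRID_SIZE + 1
        for s, v in zip(start, speed)
    )


for day in range(1, 7):
    print(day, drone_position((1, 12), (5, 7), day))
```

For the example drone, this prints the six positions listed above.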
You need to calculate each drone's starting position and daily speed, then work out where the drone currently is so you can shoot it down.
Behind each day (except Christmas Day), there is a puzzle with a three-digit answer. Each of these answers forms part of a piece of information about the locations of the drones.
You must use these clues to work out each drone's starting position and daily speed, then work out where the drone currently is so you can shoot it down.
You can use this page to fire up to 5 missiles into the sky each day.
Ten randomly selected people who solve all the puzzles, destroy all three drones, and fill in the entry form behind the door on the 25th will win prizes!
The prizes will include an mscroggs.co.uk Advent 2022 T-shirt. If you'd like one of the T-shirts from a previous Advent, they are available to order at merch.mscroggs.co.uk.
The winners will be randomly chosen from all those who submit the entry form before the end of 2022. Each day's puzzle (and the entry form on Christmas Day) will be available from 5:00am GMT. But as the winners will be selected randomly,
there's no need to get up at 5am on Christmas Day to enter!
As you solve the puzzles, your answers will be stored. To share your stored answers between multiple devices, enter your email address below the calendar and you will be emailed a magic link to visit on your other devices.
To win a prize, you must submit your entry before the end of 2022. Only one entry will be accepted per person. If you have any questions, ask them in the comments below,
on Twitter,
or on Mastodon.
So once December is here, get solving! Good luck and have a very merry Christmas!
Comments
Comments in green were written by me. Comments in blue were not written by me.
It's becoming a Christmas tradition to do your advent calendar with my partner. Loved being able to narrow down our guesses each day to pinpoint the drone this time around. Thanks for running this!
Liz
Another year of great puzzles, Matt! I really appreciate it and look forward to working these every year.
Dan Whitman
Really enjoyable this year. I "give" this advent calendar to my Year 12 and 13 Further Maths classes every year, and this has engaged more of them than in previous years. They particularly liked the shooting down of drones and the opportunity for intelligent "guess work" or in the case of some writing a computer programme which would calculate the probability distribution for each drone's position based on current information. Thank you
TAS
Thanks so much for making this, Matthew! It was a joy to solve, I found myself looking forward to every morning.
Tyler St Clare