I knew what I wanted: a way to define a “weight” column that would tell me how many times a particular pair of values appeared in my spreadsheet. I spent some time (a lot of it) searching the web using various sentences about frequency and counting. I read a lot of documentation for =FREQUENCY (which isn’t quite what I need) and =COUNTIF (which works for one column, but not for two). At some point I realized I was looking for =SUMPRODUCT. So here’s a secret: I don’t understand how or why SUMPRODUCT works. I kind of do know that you can use it for a test like this, but I was really struggling to wrap my head around why it is the right tool.
I get it now, though. What clicked for me was xldynamic’s explanation of the arrays that a SUMPRODUCT formula creates when you include comparisons.
And I want to know how many rows have the exact same values in columns D and E:
=SUMPRODUCT(($D$2:$D$224=D2)*($E$2:$E$224=E2))
Worth knowing: a boolean result can be represented as TRUE/FALSE, YES/NO or 1/0. I don’t know why that is, by which I mean, it makes sense to me intuitively but I can’t actually explain it to you. So perhaps we’ll explore that in the Ladygrammar Book Computing Club. For now, just trust me. We’re going to represent TRUE as 1 and FALSE as 0.
So. If I were dealing with a table like this:
| Value A | Value B |
| 0 | 25 |
| 0 | 25 |
| 0 | 35 |
| 0 | 37 |
| 0 | 42 |
| 0 | 0 |
| 1 | 11 |
| 1.5 | 3 |
| 1.5 | 9 |
| 1.5 | 9 |
| 1.5 | 25 |
| 2 | 2 |
| 2 | 7 |
I’d use a =SUMPRODUCT formula to create two arrays about each row, and add up the product of each array. So the first instance of the formula would first compare my first Col A value (“0″) to every other value in Col A and create an array of the results (which are TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE). Then it would compare my first Col B value (“25″) to every other value in Column B and create an array of those results. Then it iterate through those two arrays and create a third array which contains the product of each pair of values. TRUE * TRUE is 1, FALSE * TRUE is 0, FALSE * FALSE is 0.
| TRUE | TRUE | 1 * 1 = 1 |
| TRUE | TRUE | 1 * 1 = 1 |
| TRUE | FALSE | 1 * 0 = 0 |
| TRUE | FALSE | 1 * 0 = 0 |
| TRUE | FALSE | 1 * 0 = 0 |
| TRUE | FALSE | 1 * 0 = 0 |
| FALSE | FALSE | 0 * 0 = 0 |
| FALSE | FALSE | 0 * 0 = 0 |
| FALSE | FALSE | 0 * 0 = 0 |
| FALSE | FALSE | 0 * 0 = 0 |
| FALSE | TRUE | 0 * 1 = 0 |
| FALSE | FALSE | 0 * 0 = 0 |
| FALSE | FALSE | 0 * 0 = 0 |
Then it adds up the SUM of all those PRODUCTS. Get it? SUMPRODUCT.
1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 2
There are exactly two rows where Col A is 0 and Col B is 25. I repeat that formula for each row, et voila:
| Value A | Value B | =SUMPRODUCT |
| 0 | 25 | 2 |
| 0 | 25 | 2 |
| 0 | 35 | 1 |
| 0 | 37 | 1 |
| 0 | 42 | 1 |
| 0 | 0 | 1 |
| 1 | 11 | 1 |
| 1.5 | 3 | 1 |
| 1.5 | 9 | 2 |
| 1.5 | 9 | 2 |
| 1.5 | 25 | 1 |
| 2 | 2 | 1 |
| 2 | 7 | 1 |
Then, hypothetically, I could use that third column to calculate the radius of points on a scatterplot. Also handy? Libre Office Calc’s save as HTML option.
PS. If you’re clever, you’ll notice that the scatterplot doesn’t really show any kind of correlation. That’s okay — my project was just to figure out how to make it in D3.