I find Binary Classification problems everywhere in the world, and I'm surprised whenever others don't. So I thought I would write an explainer in the most [[Horsies and Doggies thinking|intuitive]] terms possible. I'll do my best to start with the intuition and introduce technical terms only as appropriate.
### Hot Dog or Not Hot Dog
You probably know this banger from Jian Yang:

In real life, these algorithms assign a hot-dog probability $0 \leq p \leq 1$ to each image. If the algorithm works well, this probability is **calibrated**: that is to say, if you look at all the pictures that the algorithm says "60% hot dog" for, 60% of them are actually hot dogs.
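If you wanted to eyeball whether a model is calibrated, one rough way (a sketch, not any particular library's procedure; the function name is mine) is to bucket the predictions and compare each bucket's average predicted probability to its actual hot-dog rate:
```python
from collections import defaultdict

def calibration_table(probs, labels, n_bins=10):
    """Bucket predictions by probability and compare to the actual hot-dog rate.

    probs:  model's hot-dog probabilities, one per image
    labels: 1 if the image really is a hot dog, else 0
    """
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    for b in sorted(bins):
        pairs = bins[b]
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        actual = sum(y for _, y in pairs) / len(pairs)
        print(f"predicted ~{avg_p:.2f}  actual {actual:.2f}  (n={len(pairs)})")

# e.g. calibration_table([0.1, 0.62, 0.58, 0.9], [0, 1, 0, 1])
```
For a well-calibrated model, the "predicted" and "actual" columns should roughly line up.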
OK, but how do we get to "hot dog" vs "not hot dog"? Something like:
```python
def label(p):
    return "hot dog" if p > 0.5 else "not hot dog"
```
The cutoff point 0.5 here is called the **threshold**.
Now let's try to visualize this. All the way on the left end of the x axis are pictures that are definitely not hot dogs. All the way on the right are pictures that are definitely hot dogs. On the y axis we have the scores. Here's how it might look:
![[Binary Classification 1.png]]
Now let's insert the 0.5 cutoff and see what happens:
![[Binary Classification 2.png]]
Let's look at what this means. Take a point on the graph that has p=0.25. If we had 100 such images, then 25 of them would be hot dogs, and 75 of them would not be. We'd predict "not hot dog" on all of them (since p<0.5), so we would "miss" 25 hot dogs. Now let's look at the point of p=0.9. If we had 100 such images, 90 would be hot dogs, and 10 of them wouldn't be. We'd predict "hot dog" on all of them (since p>0.5), and therefore mislabel 10 non-hot-dogs as hot dogs.
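To make that arithmetic concrete, here's a tiny sketch (the function is hypothetical, and it assumes the scores are calibrated) that computes the expected mistakes for a batch of images that all got the same score:
```python
def expected_errors(p, n_images, threshold=0.5):
    """Expected mistakes on n_images that all scored probability p."""
    if p <= threshold:            # we'd say "not hot dog" for all of them
        return {"false negatives": p * n_images, "false positives": 0}
    else:                         # we'd say "hot dog" for all of them
        return {"false negatives": 0, "false positives": (1 - p) * n_images}

print(expected_errors(0.25, 100))  # ~25 hot dogs missed
print(expected_errors(0.90, 100))  # ~10 non-hot-dogs mislabeled
```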
What happens if we choose a 0.8 cutoff? A lot more false negatives (things we classified as not hot dogs but are in fact hot dogs), and a lot fewer false positives (things we classified as hot dogs but are in fact not hot dogs).
![[Binary Classification 3.png]]
What about a 0.2 cutoff? We'll have a lot more false positives, and a lot fewer false negatives.
![[Binary Classification 4.png]]
Which one is right? Well, it depends on which is worse: saying that a hot dog is not a hot dog, or vice versa.
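Here's a small sketch of that trade-off with made-up scores: sweep the threshold and watch the two kinds of errors move in opposite directions.
```python
# Made-up (score, is_hot_dog) pairs standing in for the pictures above.
data = [(0.05, 0), (0.15, 0), (0.25, 0), (0.35, 1), (0.45, 0),
        (0.55, 1), (0.65, 0), (0.75, 1), (0.85, 1), (0.95, 1)]

for threshold in (0.2, 0.5, 0.8):
    fp = sum(1 for p, y in data if p > threshold and y == 0)   # said "hot dog", it wasn't
    fn = sum(1 for p, y in data if p <= threshold and y == 1)  # said "not hot dog", it was
    print(f"threshold {threshold}: FP={fp}, FN={fn}")
# As the threshold rises, false positives drop and false negatives climb.
```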
### Binary Classification In Real Life
Let's start with the classic one: criminal trials. A criminal trial is a binary classification, guilty or not guilty. The jury doesn't *know* the truth, and though they're not weirdos who walk around assigning numbers, in their heads they have a probability (or level of belief, or amount of doubt). The standard is very clear: they must believe it beyond a reasonable doubt, say 95%. So this distribution is going to look like this:
![[Binary Classification 5.png]]
This is a statement from our moral system: we would rather let ten guilty people walk the streets than put one innocent person in jail!
Now let's look at what happens at the TSA security screening, for "weapon" or "not weapon". In this case, the incentives are reversed: it's a pretty low cost to send someone to a secondary screening, but having a weapon pass through the screening could be fatal. So we will set the threshold quite low.
### False Positives and False Negatives are always a tradeoff
Whenever anyone designs a binary classification system, then, they are balancing the two. This balance is captured in metrics like sensitivity/specificity (used for medical tests), precision/recall (used in the ML community), the confusion matrix, and the F-score.
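All of these are derived from the same four counts. As a sketch, here's how they fall out of the confusion matrix, using the counts from the p=0.25 / p=0.9 example above (90 true positives, 10 false positives, 25 false negatives, 75 true negatives):
```python
def metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from the four confusion-matrix counts."""
    return {
        "sensitivity (recall)": tp / (tp + fn),  # of the real hot dogs, how many did we catch?
        "specificity":          tn / (tn + fp),  # of the non-hot-dogs, how many did we clear?
        "precision":            tp / (tp + fp),  # of our "hot dog" calls, how many were right?
        "F1":                   2 * tp / (2 * tp + fp + fn),  # balances precision and recall
    }

print(metrics(tp=90, fp=10, fn=25, tn=75))
```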
But you hopefully intuitively understand this: if you want to keep innocent people out of jail, then some criminals will walk free. If you don't want to drink spoilt milk, sometimes you'll throw out milk that was actually fine.
### Balancing False Positives and False Negatives
We've established that the cost of a false positive is not the same as the cost of a false negative. What we're always balancing is called the **Cost Ratio**, defined as:
$R = \frac{\text{cost(False Positive)}}{\text{cost(False Negative)}}$
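As a quick sanity check on how $R$ sets the threshold (a standard expected-cost argument, assuming the score $p$ is calibrated and each kind of mistake has a fixed cost): predicting "positive" on an item costs $(1-p) \cdot \text{cost(FP)}$ in expectation, while predicting "negative" costs $p \cdot \text{cost(FN)}$, so you should predict positive exactly when
$(1-p) \cdot \text{cost(FP)} < p \cdot \text{cost(FN)} \iff p > \frac{R}{1+R}$
With $R = 1$ this recovers the familiar 0.5 cutoff, and a 95% "beyond reasonable doubt" threshold corresponds to $R = 19$.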
If R is very high, then we'll set a very high threshold, because false positives are so costly. If R is very low, then we'll set a very low threshold, because false negatives are so costly. So usually, when people talk about questions like:
* How strict should we be at the TSA?
* How aggressive should we be in sentencing?
* Should CPS take this child away from a borderline-functional family?
* Should this Waymo stop or continue driving?
* How strict should we be when prescribing Adderall?
* We have some amount of alerts on our system. Is it an outage?
* We're Netflix, and we're suspecting this login is password-sharing. Do we block it?
What we're really asking is: "what's the right $R$ for this dilemma?" On the last one, for instance, you're balancing pissing off a legitimate user against lost revenue. Traditionally, Netflix preferred a very high threshold for blocking password sharing (in growth mode, a few dollars of lost revenue is not that big of a deal compared to customer goodwill). Later, when Netflix started to "crack down on password sharing", they adjusted their threshold down.
### Base rates
Another thing to take into account is base rates, or the data distribution. Let's think about password sharing again. Most sessions are *not* fraudulent, so the graph actually looks very lopsided: the model should output a prob_fraudulent close to 0 for the vast majority of sessions. This is called data imbalance in Machine Learning, and there are different techniques to deal with it.
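To see how much the base rate matters, here's a back-of-the-envelope sketch with made-up numbers (1% of sessions fraudulent, a detector that catches 95% of fraud and clears 95% of legit sessions): even that seemingly good detector's flags are mostly false positives.
```python
# Made-up numbers: 1% of sessions are fraudulent, and the detector
# catches 95% of fraud (sensitivity) and clears 95% of legit sessions (specificity).
base_rate, sensitivity, specificity = 0.01, 0.95, 0.95

flagged_fraud = base_rate * sensitivity               # true positives, per session
flagged_legit = (1 - base_rate) * (1 - specificity)   # false positives, per session

precision = flagged_fraud / (flagged_fraud + flagged_legit)
print(f"Of all flagged sessions, only {precision:.0%} are actually fraud")  # ~16%
```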
### A lot of predictive decisions are binary classification
Now consider this: anytime you're making a binary decision about an uncertain outcome, you're actually solving a binary classification problem!
* In poker, when you're deciding whether to call or fold with a mediocre (but OK) hand, you're essentially trying to predict whether the opponent is bluffing. False negative is when you fold and the opponent steals the pot with a bluff; false positive is when you call, pay off a value bet, and the opponent wins the showdown.
* At work, if you're trying to decide whether to hire a person, you're predicting if they'll be a good fit. False positive is you make the wrong hire, and false negative is you miss out on a good hire.
* In life, when you're deciding whether to go to an event or not, you're predicting whether you'll have fun there. False positive is you go and it's boring, false negative is you miss out on fun.
The problem here is that when you miss out on a good hire, you don't know that. When the opponent mucks the hand after you fold, you don't know if they had the winning hand or not. Or statistically speaking, you don't actually have the counterfactual.
### When there are no false positives
Especially in predictive decisions, what does it mean if you have **no** false positives? Well, it means that you have *a lot* of false negatives. And in a lot of cases, false negatives are invisible (because you only take action on the positives), so you'll never even find out about them!
Examples:
* If you're winning every showdown you call in poker, you're probably folding away some pots to bluffs
* If you never made a bad hire, you probably lost on some good hires (this might be OK by the way)
* Elon Musk says that when you're deleting parts of a process, if you aren't adding some of them back, you haven't deleted enough
* If you're never failing, you are not challenging yourself enough
* If you've never been to a boring event, you probably missed out on some great ones
#halfbaked
#published 2025-02-22
#essay_potential