## October 24, 2008

Didn't we already have this exact post?

You had "Optimization", and you may have read a draft elsewhere with some of this material in it. I'm posting it now by way of finally finishing it after years, and by way of taking a vacation (feeling a bit burned-out lately).

On a whimsical note, it is reminiscent of the unpredictability of the Infinite Improbability Drive :-)

"If we walk without rhythm, we won't attract the worm."

Set up a pattern-recognition system directed at your own actions, and when you fall into a predictable rut, do something differently.

Could you give a more precise definition of "calibrated"? Your example of 1/37 for each of 37 different possibilities, justified by saying that indeed one of the 37 will happen, seems facile. Do you mean that the "correct" distribution, relative to your guess, has low relative entropy?

I think what he means by "calibrated" is something like it not being possible for someone else to systematically improve the probabilities you give for the possible answers to a question just from knowing what values you've assigned (and your biases), without looking at what the question is.

I suppose the improvement would indeed be measured in terms of relative entropy of the "correct" guess with respect to the guess given.

anon: The short definition of "well-calibrated" is that if things you assign X% probability to actually do happen X% of the time, you are well-calibrated. If you go around saying that you are 90% certain of your answers on an exam, and you get a 90, you are well-calibrated. If you say that you are 90% certain and you get a 50, you are not well-calibrated.

For the long version, see A Technical Explanation of Technical Explanation". (Highly recommended!)

Anon: well-calibrated means roughly that in the class of all events you think have probability p to being true, the proportion of them that turn out to be true is p.

More formally, suppose you have a probability distribution over something you are going to observe. If the log probability of the event which actually occurs is equal to the entropy of your distribution, you are well calibrated. If it is above you are over confident, if it is below you are under confident. By this measure, assigning every possibility equal probability will always be calibrated.

This is related to relative entropy.

As with jsalvati's comment, upon reading this post I was convinced that either I was hallucinating, that almost the entirety of this post had appeared elsewhere before, or that I've read so much of Eliezer's writing that I'm anticipating his arguments in detail before he makes them.

Fortunately, it turns out that the second (and least disturbing) option was correct, in that a substantial amount of this post already existed at: [link deleted]

Not that I'm complaining, but it was seriously bothering me that I knew I'd heard this stuff before, but Google said it hadn't previously been posted on Overcoming Bias, and I was starting to doubt my sanity...

This notion of calibratedness seems to have bad properties to me. Consider a situation where I'm trying to guess a distribution for the outcomes of a coin flip with a coin which, my information tells me, lands "heads" 99% of the time. Then a guess of 50% and 50% is "calibrated" because of the 50% predictions I make, exactly half come out right. But a guess 49.9% heads and 50.1% tails is horribly calibrated; the "49.9%" predictions come out 99% correct, and the "50.1%" predictions come out 1% correct. So the concept, as defined like this, seems hypersensitive, and therefore not very useful. I think a proper definition must necessarily be in terms of relative entropy, or perhaps considering Bayesian posteriors from subsets of your information, but I still don't see how it should work. Sorry if someone already gave a robust definition that I missed.

Nick: If you don't mean *expected* log probability, then I don't know what you're talking about. And if you do, it seems to me that you're saying that well-calibratedness means that relative entropy of the "correct" distribution relative to yours is equal to your entropy. But then the uniform prior doesn't seem well-calibrated; again, consider a coin that lands "heads" 99% of the time. Then your entropy is 1, while the relative entropy of the "correct" distribution is (-log(99%)-log(1%))/2, which is >2.

Nick: Sorry, I got it backwards. What you seem to be saying is that well-calibratedness means that relative entropy of your distribution relative to the "correct" one is equal to your entropy. This does hold for the uniform guess. But once again, considering a situation where your information tells you the coin will land "heads" with 99% probability, it would seem that the only well-calibrated guesses are 99%-1% and 50%-50%. I don't yet have an intuition for why both of these guesses are strictly "better" in any way than an 80%-20% guess, but I'll think about it. It definitely avoids the sensitivity that seemed to come out of the "rough" definition, where 50% is great but 49.9% is horrible.

Please don't post a link to the draft version of this. This just means that lots of Overcoming Bias readers will be bored for the next couple of weeks, because they'll already have read the old version of something.

Anon: no, I mean the log probability. In your example, the calibratedness will generally be high: - \log 0.499 - H(p) ~= 0.00289 each time you see tails, and - log 0.501 - H(p) ~= - 0.00289 each time you come up tails. It's continuous.

Let's be specific. We have H(p) = - \sum_x p(x) \log p(x), where p is some probability distribution over a finite set. If we observe x0, the say the predictor's calibration is

C(x0)
= \sum_x p(x) \log p(x) - \log p(x0)
= - \log p(x0) - H(p)

so the expected calibration is 0 by the definition of H(p). The calibration is continuous in p. If \log p(x0) is higher then the expected value of \log p(x) then we are underconfident and C(x0) < 0; if \log p(x0) is lower than expected we are overconfident, and C>0.

With q = p(x) d(x,x0) the non-normalised probability distribution that assigns value only x0, we have

C = D(p||q)

so this is a relative entropy of sorts.

If the possibility of "creative surprises" depends on ignorance about the logical consequences of a game move, it seems that this would be best modeled as an asymmetric information problem.

It's interesting to note that the usual "Dutch-book" arguments for subjective probability break down under asymmetric information - the only way to avoid losing money to a possibly better-informed opponent is refusing to enter some bets, i.e. adopting an upper-lower probability interval.

Of course, such upper-lower spreads show up routinely in illiquid financial markets; I wonder whether any connections have been made between rational pricing conditions in such markets and imprecise probability theories like Dempster-Shafer, etc.

I think there's a sign error in my post -- C(x0) = \log p(x0) + H(p) it should be.

Nick Hay - IIRC the minus-log probability of an outcome is usually called "surprisal" or "self-information". The Shannon entropy of a random variable is just the expected value of its surprisal.

guest: right, so with those definitions you are overconfident if you are suprised more than you expected, underconfident if you are suprised less, calibration being how close your suprisal is to your expectation of it.

If you use upper-lower then you will forego combinations of bets that are sure gains. Dutch Book is inescapable, and only your presumption that the environment will contain agents who try to offer you losing bet combinations but no agents who are nice enough to take them, would lead you think that upper-lower "solves" the problem. No real agent can afford the computing power to be an ideal Bayesian, but the laws are ineluctable: whoever is not Bayesian - regardless of the reasons, regardless of whether it is unavoidable - is losing something.

whoever is not Bayesian - regardless of the reasons, regardless of whether it is unavoidable - is losing something.

And whoever is bayesian is also losing something, a quantity of easily acquirable energy. It is a loss one way or the other.

Nick: It seems like a bad idea to me to call a prediction underconfident or overconfident depending on the particular outcome. Shouldn't it depend rather on the "correct" distribution of outcomes, i.e. the Bayesian posterior taking all your information into account? I mean, with your definition, if we do the coin flip again, with 99% heads and 1% tails, and our prediction is 99% heads and 1% tails, then if it comes up heads we're slightly underconfident, and if it comes up tails we're strongly overconfident. Hence there's no such thing as an actually well-calibrated prediction for this (?). If we take into account the existence of a correct Bayesian posterior then it's clear that "expected calibration" is not at all 0. For instance if p is the "correct" probability of heads and q is your prediction then the "expected calibration" would seem to be -p*log(q)-(1-p)*log(1-q)+q*log(q)+(1-q)*log(1-q). And, for instance, if you know for a fact that a certain experiment can go one of 3 ways, and over a long period of time the proportion has been 60%-30%-10%, then not only 33.3%-33.3%-33.3%, but also 45%-45%-10% and 57%-19%-24% have "expected calibration" ~0 by this definition.

You consider the Randomized Yudkowsky-Kasparov system, but this is only one point on a continuum. There are stupider and smarter systems, generated by weaker and stronger chess players, respectively.

One interesting system is Randomized Kasparov-Kasparov, where we show arbitrary chess positions to Kasparov and ask him to assign probability distributions to what he'd do. Chess is a game of perfect information, so arbitrary chess positions can be shown without their history (of course, we have to count physically invisible state as known; a position consists of both the arrangement of the pieces and factoids like "the king has not moved, but the queen's rook has moved, so queenside castling is forbidden but kingside castling is allowed"). I assert without proof that RKK is an extremely smart system, and will turn both RYK and you yourself into chunky salsa.

Here's the question: is Kasparov smarter than RKK? We can imagine that first we develop RKK by properly incentivizing Kasparov (e.g. by telling him that we're going to use it to play against every human on the planet, and for every time that RKK comes out victorious, we'll donate a dollar to his favorite charity, or if RKK loses even once, we'll drown a kitten, or whatever), and then (after we've shown Kasparov the million zillion chess positions and recorded his assigned probability distributions) we turn the tables and properly incentivize him to beat RKK (charity donation, kitten drowning, whatever). Can Kasparov beat RKK? Does it matter if he knows he's playing against RKK?

Then the metaquestion: what if Kasparov was a weaker player? (For values of Kasparov that equal Yudkowsky...) What if he was a stronger player? One limiting case is easy. If Kasparov was very very smart, he could solve chess, and so RKK could solve chess, and they would be equally strong. (That is, if the solution says that White wins, then whoever plays White wins, and so forth.) There's a probability distribution that solves chess, after all (with 100% for the right move in any given situation and 0% for the others).

Will RYK be as good a player as Kasparov? Of course not. Sometimes the RYK system will randomly make dreadful moves which the real-life Kasparov would never make - start the game with P-KN4. I assign such moves a low probability, but sometimes the computer makes them anyway, by sheer random chance.

Don't think this is right. RYK would not make dreadful moves that you would not think Kasparov would make under any circumstances, unless you were assigning odds too high to those moves. So unless you think there's a chance that Kasparov would make that move more often than you would, you would assign a log-odd of -200 dB or something, effective 0 chance.

To clarify: what I'm specifically objecting to, I think, is "by sheer random chance". If it's the chance that's causing those bad moves, then you're not well-calibrated. If it's your uncertainty about Kasparov that's causing the moves, then my objection doesn't apply.

I swear to god I've read these Kasparov posts before...

I think the weakness of R[xx] systems is that they're only working one move at a time. Any position suggests a number of strategies to a good player. Being stuck with a randomized system that tends towards plausible moves means that a move that builds in one direction could be weakened in the next round by a move that's tending towards another good strategy.

Could R[xx] be improved (in theory--I think the calculations would be unmanageable) by considering more than one move at a time? Is there some number of moves where you don't exactly have an R[xx] anymore because the program is more like simulating the way the real chess player thinks?

Nancy, the underlying algorithms of most modern chess engines don't have any sort of persistent strategy from move to move. In fact, most don't even keep any state between moves. (That's an advanced feature.) And yet chess engines are able to compete with top chess players. While they look more than one move ahead, so does RYK. I don't think your objection actually pans out in practice.

re: calibration, it seems like what we want to do is ask ourselves what happens if an agent is asked lots of different probability questions, consider the the true probability as a function of probability stated by the agent, use some prior distribution (describing *our* uncertainty) on all such functions that the agent could have, update this prior using a finite set of answers we have seen the agent give and their correctness, and end up with a posterior distribution on functions (agent's probability -> true probability) from which we can get estimates of how over/underconfident the agent is at each probability level, and use those to determine what the agent "really means" when it says 90%; if the agent is overconfident at all probabilities then it's "overconfident" period, if it's underconfident at all probabilities then it's "underconfident" period, if it's over at some and under at some then I guess it's just "misconfident"? (an agent could be usually overconfident in an environment that usually asked it difficult questions and usually underconfident in an environment that usually asked it easy questions, or vice versa) If we keep asking an agent that doesn't learn the same question like in anon's comment then that seems like a degenerate case. On first think it doesn't seem like an agent's calibration function necessarily depends on what questions you ask it; Solomonoff induction is well-calibrated in the long run in all environments (right?), and you could imagine an agent that was like SI but with all probability outputs twice as close to 1, say. Hope this makes any sense.

pdf23ds, thanks. I wasn't sure whether the whole strategic situation could be implied in each position. Have people been able to learn anything about chess strategy by studying how programs play? Do program vs. program games make sense to people?

pdf23ds, most chess engines store analyzed subtrees into a transposition table. This is not the same as having a persistent strategy, but it does mean that much evaluation work can be saved by keeping a persistent state between moves.

This is similar to how iterative deepening works: in theory you could obtain the same result by performing a depth-first search directly, but iterative deepening provides a hint at the "best" moves which makes search more efficient.

I wasn't under the impression that transposition tables were kept between moves, but now that I think about it, they probably are. On the other hand, I'd wager that the reuse isn't very high, depending on the caching criteria. Maybe around 30%.

Have people been able to learn anything about chess strategy by studying how programs play?

I'm don't know if they've learned anything. Computer style is heavy on tactics. I'm not sure how their strategy is characterized. But it's notable that most new young grandmasters nowadays rely heavily on chess engines to help them train.

Do program vs. program games make sense to people?

Yes. Only chess engine fans are really too interested in them, though.

It'd be interesting to know if perfect checkers games (checkers has been solved) were amenable to human analysis.

A bit more on chess engines: many engines have a number of parameters you can set that determines their playing style, and some might say their "strategy". Things like aggressiveness, willingness to sacrifice, valuation of pieces, willingness to trade, etc. What many of these parameters actually do is change the static board evaluation function. A chess engine looks ahead many moves, and due to time constraints, picks a board position after X moves and runs the static evaluation function, which is a very naive estimate of the quality of the engine's position. A human's static evaluation function is much higher quality, but humans only look ahead a few moves at most. (Well, grandmasters look ahead a dozen or two moves at most but they're still evaluating many, many fewer positions (their pruning factor is higher).)

When I play against a superior player, I can't predict exactly where my opponent will move against me. [...] But I can predict the consequence of the unknown move, which is a win for the other player; and the more the player's actual action surprises me, the more confident I become of this final outcome."

Interestingly, playing an opponent which selects completely randomly from the range of possible moves gives a similar result: you never know what your opponent will do, but you can predict with a fair amount of confidence that you will win. And the more their actions surprise you (because you didn't spend any time thinking about such a remarkably dumb move as that), the more confident you become in your prediction of the result.

bah. moving too fast, forget to close an em tag. in the preceding comment, only the first paragraph should be em'd as a quote.

Will RYK be as good a player as Kasparov? Of course not. Sometimes the RYK system will randomly make dreadful moves which the real-life Kasparov would never make - start the game with P-KN4. I assign such moves a low probability, but sometimes the computer makes them anyway, by sheer random chance.

If you believe that Kasparov would never make that move, you should assign it a probability of 0.

As for calibration, RYK is not at all calibrated to Kasparov's actual behavior, making invocation of his name, and any surprise as to how badly RYK plays, absurd. Your blather about "the creative unpredictability of intelligence" is absurd; Kasparov could be a completely deterministic engine, always making the same moves in the same positions, and nothing would change -- he would still beat your ass every time and RYK would still suck.

The comments to this entry are closed.

## May 2009

Sun Mon Tue Wed Thu Fri Sat
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31