
September 29, 2008


Hrm, I'd have to say go with whichever is simpler (choose your favorite reasonable method of measuring the complexity of a hypothesis) for the usual reasons. (Fewer bits to describe it means less stuff that has to be "just so", etc... Of course, modify this a bit if one of the hypotheses has a significantly different prior than the other due to previously learned info.) But yeah, the less complex one that works is more likely to be closer to the underlying dynamic.

If you're handed the two hypotheses as black boxes, so that you can't actually see inside them and work out which is more complex, then go with the first one, since it's more likely to be less complex: at most the first ten data points could have been in some way explicitly hard-coded into it, and it still genuinely predicted the next ten. The second one could potentially have all twenty data points hard-coded into it in some way, and thus be more complex, and thus effectively less likely to actually have anything resembling the underlying dynamic encoded into it.
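The "fewer bits" intuition can be sketched numerically: give each hypothesis a prior proportional to 2^-description_length, and if both fit the data equally well, the shorter one keeps nearly all the posterior mass. The description lengths and likelihoods below are invented purely for illustration.

```python
# Toy model of the "fewer bits" intuition; the description lengths and
# likelihoods below are invented for illustration.

def posterior(desc_len_bits, likelihood, rivals):
    """Posterior mass of one hypothesis among a set of rivals,
    with prior proportional to 2**-description_length."""
    weight = 2.0 ** -desc_len_bits * likelihood
    total = sum(2.0 ** -l * lk for l, lk in rivals)
    return weight / total

# Hypothesis A: 40-bit description; hypothesis B: 55-bit description.
# Both assign the same likelihood to the 20 observed data points.
rivals = [(40, 1.0), (55, 1.0)]
p_a = posterior(40, 1.0, rivals)
p_b = posterior(55, 1.0, rivals)
assert p_a > p_b  # the simpler hypothesis keeps most of the posterior mass
```

With equal likelihoods the posterior ratio is just the prior ratio, 2^15 to 1 here, which is the sense in which the simpler black box should be trusted more.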

Is it cheating to say that it depends hugely on the content of the theories, and their prior probabilities?

The theories screen off the theorists, so if we knew the theories then we could (given enough cleverness) decide based on the theories themselves what our belief should be.

But before we even look at the theories, you ask me which theory I expect to be correct. I expect the one which was written earlier to be correct. This is not because it matters which theory came first, irrespective of their content; it is because I have different beliefs about what each of the two theories might look like.

The first theorist had less data to work with, and so had less data available to insert into the theory as parameters. This is evidence that the first theory will be smaller than the second theory. I assign greater prior probabilities to small theories than to large theories, so I think the first theory is more likely to be correct than the second one.

I rather like the 3rd answer on his blog (Doug D's). A slight elaboration on that -- one virtue of a scientific theory is its generality, and prediction is a better way of determining generality than explanation -- demanding predictive power from a theory excludes ad hoc theories of the sort Doug D mentioned, that do nothing more than re-state the data. This reasoning, note, does not require any math. :-)

(Noting that the math-ey version of that reason has just been stated by Peter and Psy-kosh.)

The first guy has demonstrated prediction, the second only hindsight. We assume the first theory is right - but of course, we do the next experiment, and then we'll know.

Assume both theories can actually produce values, i.e., are formulated in such a way that they can produce a new value from just the past values plus the environment.

The second theory runs the risk of being more descriptive than predictive. It has more potential of having been fitted to the input data, including all its noise, and of being a (maybe complex) enumeration of its values.

The first one has at least proven it could be used to predict, while the second one can only produce a new value.

I would thus give more credit to the first theory. At least it has won against ten coin flips without omniscience.

Which do we believe?

What exactly is meant here by 'believe'?
I can imagine various interpretations.

a. Which do we believe to be 'a true capturing of an underlying reality'?
b. Which do we believe to be 'useful'?
c. Which do we prefer, which seems more plausible?

a. Neither. Real scientists don't believe in theories, they just test them. Engineers believe in theories :-)

b. Utility depends on what you're trying to do. If you're an economist, then a beautifully complicated post-hoc explanation of 20 experiments may get your next grant more easily than a simple theory that you can't get published.

c. Who developed the theories? Which theory is simpler? (Ptolemy, Copernicus?) Which theory fits in best with other well-supported pre-existing theories? (Creationism, Evolution vs. theories about disease behaviour). Did any unusual data appear in the last 10 experiments that 'fitted' the original theory but hinted towards an even better theory? What is meant by 'consistent' (how well did it fit within error bands, how accurate is it)? Perhaps theory 1 came from Newton, and theory 2 was thought up by Einstein. How similar were the second sets of experiments to the original set?

How easy/difficult were the predictions? In other words, how well did they steer us through 'theory-space'? If theory 1 predicts the sun would come up each day, it's hardly as powerful as theory 2 which suggests the earth rotates around the sun.

What do we mean when we use the word 'constructs'? Perhaps the second theorist blinded himself to half of the results, constructed a theory, then tested it, placing himself in the same position as the original theorist but with the advantage of having tested his theory before proclaiming it to the world. Perhaps the constructor repeated this many times using different subsets of the data to build a predictor and test it, and chose the theory that was most consistently suggested by the data and verified by subsequent testing.

Perhaps he found that no matter how he sliced and diced and blinded himself to parts of the data, his hand unerringly fell on the same 'piece of paper in the box' (to use the metaphor from the other site).

Another issue is 'how important is the theory'? For certain important theories (development of cancer, space travel, building new types of nuclear reactors etc.), neither 10 nor 20 large experiments might be sufficient for society to confer 'belief' in an engineering sense.

Other social issues may exist. Galileo 'believed' bravely, but perhaps foolishly, depending on how he valued his freedom.

d. Setting aside these other issues, and in the absence of any other information: As a scientist, my attitude would be to believe neither, and test both. As an engineer, my attitude would be to 'prefer' the first theory (if forced to 'believe' only one), and ask a scientist to check out the other one.

Both theories fit 20 data points. That some of those are predictions is irrelevant, except for the inferences about theory simplicity that result. Since the likelihoods are the same, the posteriors are proportional to the priors.

My state of belief is then represented by a certain probability that each theory is true. If forced to pick one out of the two, I would examine the penalties and payoffs of being correct and wrong, à la Pascal's wager.
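That last step is just an expected-value calculation; the probabilities and stakes below are purely hypothetical.

```python
# Choosing which theory to act on by expected payoff; the probabilities
# and stakes are purely hypothetical.
p_t1, p_t2 = 0.6, 0.4              # my probabilities that each theory is true
payoff_right, payoff_wrong = 10.0, -3.0

def expected_value(p_true):
    return p_true * payoff_right + (1 - p_true) * payoff_wrong

assert expected_value(p_t1) > expected_value(p_t2)  # bet on the likelier theory
```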

We do ten experiments. A scientist observes the results, constructs a theory consistent with them

Huh? How did the scientist know what to observe without already having a theory? Theories arise as explanations for problems, explanations which yield predictions. When the first ten experiments were conducted, our scientist would therefore be testing predictions arising from an explanation to a problem. He wouldn't just be conducting any old set of experiments.

Similarly the second scientist's theory would be a different explanation of the problem situation, one yielding a different prediction. Before the decisive test, the theory that emerges as the best explanation under the glare of critical scrutiny would be the preferred explanation. Without knowing the problem situation and the explanations that have been advanced it cannot be determined which is to be preferred.
One theory has a track record of prediction, and what is being asked for is a prediction, so at first glance I would choose that one. But the explanation based-one is built on more data.

But it is neither prediction nor explanation that makes things happen in the real world, but causality. So I would look in to the two theories and pick the one that looks to have identified a real cause instead of simply identifying a statistical pattern in the data.

Whichever is simpler - assuming we don't know anything about the scientists' abilities or track record.

Having two different scientists seems to pointlessly confound the example with extraneous variables.

I don't think the second theory is any less "predictive" than the first. It could have been proposed at the same time or before the first, but it wasn't. Why should the predictive ability of a theory vary depending on the point in time in which it was created? David Friedman seems to prefer the first because it demonstrates more ability on the part of the scientist who created it (i.e., he got it after only 10 tries).

Unless we are given any more information on the problem, I think I agree with David.

These theories are evidence about the true distribution of the data, so I construct a new theory based on them. I could then predict the next data point using my new theory and, if I have to play this game, go back and choose whichever of the original theories gives the same prediction, based only on the prediction about this particular next data point, independently of whether the selected theory as a whole is deemed better.

Having more data is strictly better. But I could expect that there is a good chance that a particular scientist will make an error (worse than I do now, judging his result, since he himself could think about all of this and, say, construct a theory from the first 11 data points and verify the absence of this systematic error using the rest, or use a reliable methodology). Success of the first theory gives evidence for it, which, depending on my priors, can significantly outweigh the expected improvement from more data points coming through an imperfect procedure of conversion into a theory.

Here's my answer, prior to reading any of the comments here, or on Friedman's blog, or Friedman's own commentary immediately following his statement of the puzzle. So, it may have already been given and/or shot down.

We should believe the first theory. My argument is this. I'll call the first theory T1 and the second theory T2. I'll also assume that both theories made their predictions with certainty. That is, T1 and T2 gave 100% probability to all the predictions that the story attributed to them.

First, it should be noted that the two theories *should* have given the same prediction for the next experiment (experiment 21). This is because T1 *should* have been the best theory that (would have) predicted the first batch. And since T1 also correctly predicted the second batch, it should have been the best theory that would do that, too. (Here, "best" is according to whatever objective metric evaluates theories with respect to a given body of evidence.)

But we are told that T2 makes exactly the same predictions for the first two batches. So it also should have been the best such theory. It should be noted that T2 has no more information with which to improve itself. T1, for all intents and purposes, also knew the outcomes of the second batch of experiments, since it predicted them with 100% certainty. Therefore, the theories *should* have been the best possible given the first two batches. In particular, they should have been equally good.

But if "being the best, given the first two batches" doesn't determine a prediction for experiment 21, then neither of these "best" theories should be predicting the outcome of experiment 21 with certainty. Therefore, since it is given that they *are* making such predictions, they *should* be making the same one.

It follows that at least one of the theories is not the best, given the evidence that it had. That is, at least one of them was constructed using flawed methods. T2 is more likely to be flawed than T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes's theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right.

Therefore, T1 is more likely to be right than is T2 about the outcome of experiment 21.

(And, of course, first theory could be improved using the next 10 data points by Bayes' rule, which will give a candidate for being the second theory. This new theory can even disagree with the first on which value of particular data point is most likely.)

Knowing how the theories and experiments were chosen would make this a more sensible problem. Having that information would affect our expectations about the theories: as others have noted, there are a lot of theories one could form in an ad hoc manner, but the question is which of them was selected.

The first theory was selected using the first ten experiments, and it seems to have survived the second set. If the experiments in the second set were independent of the first set and of each other, this is quite unlikely, so it is strong evidence that the first theory captures the connection between the experiments.

Given a reasonable way of choosing theories I would rate both theories as likely; but given finite resources and fallible theorists, I would prefer the first theory, since we have evidence that it was chosen sensibly and that the problem is explainable by a theory of its calibre, though only to the extent that I doubt the rationality of the theorist making the second theory.

Gah, others got there first.

I would go with the first one in general. The first one has proved itself on some test data, while all the second one has done is fit a model to given data. There is always the risk that the second theory has overfitted a model with no worthwhile generalization accuracy. Even if the second theory is simpler than the first, the fact that the first theory has been proved right on unseen data makes it a slam dunk winner. Of course further experiments may cause us to update our beliefs, particularly if theory 2 is proving just as accurate.
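A toy illustration of that overfitting risk, with entirely made-up data: a 2-parameter model and a 10-parameter model both fit ten noisy samples of a straight line, but only the simple one generalizes to points it never saw.

```python
# Hypothetical illustration of overfitting: fit the same 10 noisy
# samples of a straight line with a degree-1 and a degree-9 polynomial,
# then compare errors on held-out points.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: 2.0 * x + 1.0

x_train = np.linspace(-1, 1, 10)
y_train = true_f(x_train) + rng.normal(0, 0.3, size=10)

simple = np.polyfit(x_train, y_train, deg=1)    # "first theory": 2 parameters
complex_ = np.polyfit(x_train, y_train, deg=9)  # "second theory": memorizes all 10 points

x_test = np.linspace(-0.95, 0.95, 50)           # unseen data
y_test = true_f(x_test)

def mse(coeffs):
    return np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# Both match the training data, but the flexible model has also
# memorized the noise, so it generalizes worse.
assert mse(simple) < mse(complex_)
```

The degree-9 fit passes through every training point exactly, which is the code-level analogue of "restating the data" rather than capturing the underlying dynamic.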

There are an infinite number of models that can predict 10 variables, or 20 for that matter. The only plausible way for scientist A to have picked the right model out of the infinitely many possible ones is to bring prior knowledge to the table about the nature of that model and the data. This is also true for the second scientist, but only slightly less so.

Therefore, scientist A has demonstrated a higher probability of having valuable prior knowledge.

I don't *think* there is much more to this than that. If the two scientists have equal knowledge, there is no reason the second model need be more complicated than the first, since the first fully described the extra data revealed to the second.

If it was the same scientist with both sets of data then you would pick the second model.

Tyrrell's argument seems to me to hit the nail on the head. (Although I would have liked to see that formalization -- it seems to me that while T1 will be preferred, the preference may be extremely slight, depending. No, I'm too lazy to do it myself :-))

Formalizing Vijay's answer here:

The short answer is that you should put more of your probability mass on T1's prediction because experts vary, and an expert's past performance is at least somewhat predictive of his future performance.

We need to assume that all else is symmetrical: you had equal priors over the results of the next experiment before you heard the scientists' theories; the scientists were of equal apparent caliber; P( the first twenty experimental results | T1 ) = P( the first twenty experimental results | T2); neither theorist influenced the process by which the next experiment was chosen; etc.

Suppose we have a bag of experts, each of which contains a function for generating theories from data. We draw a first expert from our bag at random and show him data points 1-10; expert 1 generates theory T1. We draw a second expert from our bag at random and show him data points 1-20: expert 2 generates theory T2.

Given the manner in which real human experts vary (some know more than others about a given domain; some aim for accuracy where others aim to support their own political factions; etc.), it is reasonable to suppose that some experts have priors that are well aligned with the problem at hand (or behave as if they do) while others have priors that are poorly aligned. Expert 1 distinguished himself by accurately predicting the results of experiments 11-20 from the results of experiments 1-10; many predictive processes would not have done so well. Expert 2 has only shown an ability to find some theory that is consistent with the results of experiments 1-20; many predictive processes put a non-zero prior on some such theory that would not have given the results of experiments 11-20 "most expected" status based only on the results from experiments 1-10. We should therefore expect better future performance from Expert 1, all else equal.

The problem at hand is complicated slightly in that we are judging, not experts, but theories, and the two experts generated their theories at different times from different amounts of information. If Expert 1 would have assigned a probability < 1 to results 11-20 (despite producing a theory that predicted those results), Expert 2 is working from more information than Expert 1, which gives Expert 2 at least a slight advantage. Still, given the details of human variability and the fact that Expert 1 did predict results 11-20, I would expect the former consideration to outweigh the latter.
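The bag-of-experts argument can be sketched with a two-type toy model (all numbers hypothetical): condition on Expert 1's ten correct predictions, and his expected accuracy on experiment 21 rises well above Expert 2's.

```python
# Hypothetical bag-of-experts calculation: half the experts have priors
# well aligned with the domain (each prediction right with probability
# 0.9), half are poorly aligned (right with probability 0.5).
p_good = 0.5                   # prior that a random expert is well aligned
hit_good, hit_bad = 0.9, 0.5   # per-experiment accuracy of each kind

# Expert 1 predicted experiments 11-20 correctly from 1-10 alone.
like_good = hit_good ** 10
like_bad = hit_bad ** 10
post_good = (p_good * like_good) / (p_good * like_good + (1 - p_good) * like_bad)

# Probability each expert's theory gets experiment 21 right:
p21_expert1 = post_good * hit_good + (1 - post_good) * hit_bad
p21_expert2 = p_good * hit_good + (1 - p_good) * hit_bad  # no track record yet

assert p21_expert1 > p21_expert2
```

Ten verified predictions push the posterior that Expert 1 is well aligned from 0.5 to nearly 1, so his theory's expected accuracy on the next experiment climbs toward 0.9 while Expert 2's stays at the prior mixture of 0.7.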

Scientist 2's theory is more susceptible to over-fitting of the data; we have no reason to believe it's particularly generalizable. His theory could, in essence, simply be restating the known results and then giving a more or less random prediction for the next one. Let's make it 100,000 trials rather than 20 (and say that Scientist 1 has based his yet-to-be-falsified theory on the first 50,000 trials), and stipulate that Scientist 2 is a neural network; then the answer seems clear.

I wrote in my last comment that "T2 is more likely to be flawed than T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes's theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right."

Benja Fallenstein asked for a formalization of this claim. So here goes :).

Define a *method* to be a map that takes in a batch of evidence and returns a theory. We have two assumptions:

ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won't contradict the evidence fed into it. More precisely,

p( M(B) predicts B ) = 1.

(A real account of hypothesis testing would need to be much more careful about what constitutes a "contradiction". For example, it would need to deal with the fact that inputs aren't absolutely reliable in the real world. But I think we can ignore these complications in this problem.)

ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then

p( M(B1) predicts B2 | M flawed ) < p( M(B1) predicts B2 ).

(Outside of toy problems like this one, we would need to stipulate that B2 is not a logical consequence of B1, and so forth.)

Now, let B1 and B2 be two disjoint and nonempty sets of input data. In the problem, B1 is the set of results of the first ten experiments, and B2 is the set of results of the next ten experiments.

My claim amounted to the following. Let

P1 := p( M is flawed | M(B1) predicts B2 ),

P2 := p( M is flawed | M(B1 union B2) predicts B2 ).

Then P1 < P2.

To prove this, note that, by Bayes's theorem, the second quantity P2 is given by

P2 = p( M(B1 union B2) predicts B2 | M is flawed ) * p(M is flawed) / p( M(B1 union B2) predicts B2 ).

Since p(X) = 1 implies p(X|Y) = 1 whenever p(Y) > 0, Assumption 1 tells us that this reduces to

P2 = p(M is flawed).

On the other hand, the first quantity P1 is

P1 = p( M(B1) predicts B2 | M is flawed ) * p( M is flawed) / p( M(B1) predicts B2 ).

By Assumption 2, this becomes

P1 < p( M is flawed ).

Hence, P1 < P2, as claimed.
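A quick numeric check of this conclusion, under one hypothetical choice of parameters consistent with the two assumptions:

```python
# Numeric check of P1 < P2, with invented parameters: methods are flawed
# with prior probability 0.3; a sound method's theory predicts the unseen
# batch B2 with probability 0.8, a flawed method's with probability 0.2
# (Assumption 2); every method's theory predicts its own input batch
# (Assumption 1).
p_flawed = 0.3
p_predict_given_sound = 0.8
p_predict_given_flawed = 0.2

# P2: B2 was part of the input, so Assumption 1 makes the prediction
# certain either way, and the posterior equals the prior.
P2 = p_flawed

# P1: B2 was unseen, so predicting it is evidence that M is sound.
p_predict = (p_flawed * p_predict_given_flawed
             + (1 - p_flawed) * p_predict_given_sound)
P1 = p_flawed * p_predict_given_flawed / p_predict

assert P1 < P2
```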

Throughout these replies there is a belief that theory 1 is 'correct through skill'. With that in mind it is hard to come to any other conclusion than 'scientist 1 is better'.

Without knowing more about the experiments, we can't determine if theory 1's 10 good predictions were simply 'good luck' or accident.

If your theory is that the next 10 humans you meet will have the same number of arms as they have legs, for example...

There's also potential for survivorship bias here. If the first scientist's results had been 5 correct, 5 wrong, we wouldn't be having this discussion about the quality of their theory-making skills. Without knowing if we are 'picking a lottery winner for this comparison' we can't tell if those ten results are chance or are meaningful predictions.

I'd use the only tool we have to sort theories: Occam's razor.
1. Weed out all the theories that do not match the experiment — keep both in that case.
2. Sort them by how simple they are.
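A minimal sketch of that two-step procedure, with invented candidate theories and description length standing in for simplicity:

```python
# Hypothetical sketch of the two-step Occam procedure: keep the theories
# consistent with the experiments, then rank by a simplicity score
# (here, just the length of each theory's written form).
candidates = [
    ("theory_1", "y = 2*x", True),          # (name, form, fits all 20 results?)
    ("theory_2", "y = 2*x + 0*x**7", True),
    ("theory_3", "y = 3*x", False),
]

consistent = [(name, form) for name, form, fits in candidates if fits]
ranked = sorted(consistent, key=lambda t: len(t[1]))  # shorter form = simpler

assert ranked[0][0] == "theory_1"  # the simplest surviving theory wins
```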

This is what many do by assuming the second is “over-fitted”. I believe a good scientist would search the literature before stating a theory, and so know about the first one; as he would also appreciate elegance, I'd expect him to come up with a simpler theory. But, as you pointed out, some time in an economics lab could easily prove me wrong, although I'm assuming that daunting complexity corresponds to patching a previous theory against disconfirming experiments, not the case that we consider here.

In one word: the second (longer references).

The barrel and box analogy hides that simplicity argument, by making all theories a ‘paper’. A stern wag of the finger to anyone who used statistical references, because there aren't enough data to do that.

Peter, your point that we have different beliefs about the theories prior to looking at them is helpful. AFAICT theories don't screen off theorists, though. My belief that the college baseball team will score at least one point in every game ("theory A"), including the next one ("experiment 21"), may reasonably be increased by a local baseball expert telling me so and by evidence about his expertise. This holds even if I independently know something about baseball.

As to the effect of "number of parameters" on the theories' probabilities, would you bet equally on the two theories if you were told that they contained an identical number of parameters? I wouldn't, given the asymmetric information contained in the two experts vouching for the theories.

Tim, I agree that if you remove the distinct scientists and have the hypotheses produced instead by a single process (drawn from the same bucket), you should prefer whichever prediction has the highest prior probability. Do you mean that the prior probability is equal to the prediction's simplicity or just that simplicity is a good rule of thumb in assigning prior probabilities? If we have some domain knowledge I don't see why simplicity should correspond exactly to our priors; even Solomonoff Inducers move away from their initial notion of simplicity with increased data. (You've studied that math and I haven't; is there a non-trivial updated-from-data notion of "simplicity" that has identical ordinal structure to an updated Solomonoff Inducer's prior?)

Tyrrell, I like your solution a lot. A disagreement anyhow: as you say, if experts 1 and 2 are good probability theorists, T1 will contain the most likely predictions given the experimental results according to Expert 1, and T2 likewise according to Expert 2. Still, if the experts have different starting knowledge and at least one cannot see the other's predictions, I don't see anything that surprising in their "highest probability predictions given the data" calculations disagreeing with one another. This part isn't in disagreement with you, but it is also relevant that if the space of outcomes is small, or if experiments 1-20 are part of some local regime that experiment 21 is not (e.g., physics at macroscopic scales, or housing prices before the bubble broke), it may not be surprising to see two theories that agree on a large body of data and diverge elsewhere. Theories that agree in one regime and disagree in others seem relatively common.

Alex, Bertil, and others, I may be wrong, but I think we should taboo "overfitting" and "ad hoc" for this problem and substitute mechanistic, probability-theory-based explanations for where phenomena like "overfitting" come from.

Tyrrell, right, thanks. :) Your formalization makes clear that P1/P2 = p(M(B1) predicts B2 | M flawed) / p(M(B1) predicts B2), which is a stronger result than I thought. Argh, I wish I were able to see this sort of thing immediately.

One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observation, whereas in Assumption 1, B ranges over all possible observations. :)

Anna, right, I think we need some sort of "other things being equal" proviso to Tyrrell's solution. If experiments 11..20 were chosen by scientist 1, experiment 21 is chosen by scientist 2, and experiments 1..10 were chosen by a third party, and scientist 2 knows scientist 1's theory, for example, we could speculate that scientist 2 has found a strange edge case in 1's formalization that 1 did not expect. I think I was implicitly taking the question to refer to a case where all 21 experiments are of the same sort and chosen independently -- say, lowest temperatures at the magnetic north pole in consecutive years, that sort of thing.

"One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observation, whereas in Assumption 1, B ranges over all possible observations. :)"

Actually, I implicitly was thinking of the "B" variables as ranging over actual observations (past, present, and future) in both assumptions. But you're right: I definitely should have made that explicit.

We know that the first researcher is able to successfully predict the results of experiments. We don't know that about the second researcher. Therefore I would bet on the first researcher's prediction (but only assuming other things are equal).

Then we'll do the experiment and know for sure.

Benja --

I disagree with Tyrrell (see below), but I can give a version of Tyrrell's "trivial" formalization:

We want to show that:

Averaging over all theories T,
P(T makes correct predictions | T passes 10 tests) >
P(T makes correct predictions)

By Bayes' rule,

P(T makes correct predictions | T passes 10 tests) =
P(T makes correct predictions)
* P(T passes 10 tests | T makes correct predictions)
/ P(T passes 10 tests)

So our conclusion is equivalent to:

Averaging over all theories T,
P(T passes 10 tests | T makes correct predictions)
/ P(T passes 10 tests)
> 1

which is equivalent to

Averaging over all theories T,
P(T passes 10 tests | T makes correct predictions) > P(T passes 10 tests)

which has to be true for any plausible definition of "makes correct predictions". The effect is only small if nearly all theories can pass the 10 tests.
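A Monte Carlo sanity check of that final inequality, under one hypothetical model of the theory population:

```python
# Monte Carlo sketch of the final inequality, under one invented model:
# each theory is correct with probability 0.1; a correct theory passes
# every test, an incorrect one passes each of the 10 tests independently
# with probability 0.6.
import random

random.seed(1)
N = 200_000
correct_and_passed = passed = correct = 0
for _ in range(N):
    is_correct = random.random() < 0.1
    passes = is_correct or all(random.random() < 0.6 for _ in range(10))
    correct += is_correct
    passed += passes
    correct_and_passed += is_correct and passes

p_correct = correct / N
p_correct_given_passed = correct_and_passed / passed

# Passing 10 tests is strong evidence of correctness in this model.
assert p_correct_given_passed > p_correct
```

As Peter notes, the effect shrinks as incorrect theories get better at passing tests; raising the per-test pass rate for incorrect theories toward 1 drives the two probabilities together.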

I disagree with Tyrrell's conclusion. I think his fallacy is to work with the undefined concept of "the best theory", and to assume that:

* If a theory consistent with past observations makes incorrect predictions then there was something wrong with the process by which that theory was formed. (Not true; making predictions is inherently an unreliable process.)

* Therefore we can assume that that process produces bad theories with a fixed frequency. (Not meaningful; the observations made so far are a varying input to the process of forming theories.)

In the math above, the fallacy shows up because the set of theories that are consistent with the first 10 observations is different from the set of theories that are consistent with the first 20 observations, so the initial statement isn't really what we wanted to show. (If that fallacy is a problem with my understanding of Tyrrell's post, he should have done the "trivial" formalization himself.)

There are lots of ways to apply Bayes' Rule, and this wasn't the first one I tried, so I also disagree with Tyrrell's claim that this is trivial.

Hi, Anna. I definitely agree with you that two equally-good theories could agree on the results of experiments 1--20 and then disagree about the results of experiment 21. But I don't think that they could both be *best-possible* theories, at least not if you fix a "good" criterion for evaluating theories with respect to given data.

What I was thinking when I claimed that in my original comment was the following:

Suppose that T1 says "result 21 will be X" and theory T2 says "result 21 will be Y".

Then I claim that there is another theory T3, which correctly predicts results 1--20, and which also predicts "result 21 will be Z", where Z is a less-precise description that is satisfied by both X and Y. (E.g., maybe T1 says "the ball will be red", T2 says "the ball will be blue", and T3 says "the ball will be visible".)

So T3 has had the same successful predictions as T1 and T2, but it requires less information to specify (in the Kolmogorov-complexity sense), because it makes a less precise prediction about result 21.

I think that's right, anyway. There's definitely still some hand-waving here. I haven't proved that a theory's being vaguer about result 21 implies that it requires less information to specify. I think it should be true, but I lack the formal information theory to prove it.

But suppose that this can be formalized. Then there is a theory T3 that requires less information to specify than do T1 and T2, and which has performed as well as T1 and T2 on all observations so far. A "good" criterion should judge T3 to be a better theory in this case, so T1 and T2 weren't best-possible.

Among the many excellent, and some inspiring, contributions to OvercomingBias, this simple post, together with its comments, is by far the most impactful for me. It's scary in almost the same way as the way the general public approaches selection of their elected representatives and leaders.

Tyrrell, um. If "the ball will be visible" is a better theory, then "we will observe some experimental result" would be an even better theory?

Solomonoff induction, the induction method based on Kolmogorov complexity, requires the theory (program) to output the precise experimental results of all experiments so far, and in the future. So your T3 would not be a single program; rather, it would be a set of programs, each encoding specifically one experimental outcome consistent with "the ball is visible." (Which gets rid of the problem that "we will observe some experimental result" is the best possible theory :))

Here is my answer without looking at the comments or indeed even at the post linked to. I'm working solely from Eliezer's post.

Both theories are supported equally well by the results of the experiments, so the experiments have no bearing on which theory we should prefer. (We can see this by switching theory A with theory B: the experimental results will not change.) Applying bayescraft, then, we should prefer whichever theory was a priori more plausible. If we could actually look at the contents of the theory we could make a judgement straight from that, but since we can't we're forced to infer it from the behavior of scientist A and scientist B.

Scientist A only needed ten experimental predictions of theory A borne out before he was willing to propose theory A, whereas scientist B needed twenty predictions of theory B borne out before he was willing to propose theory B. In the absence of other information (perhaps scientist B is very shy, or had been sick while the first nineteen experiments were being performed), this suggests that theory B is much less a priori plausible than theory A. Therefore, we should put much more weight on the prediction of theory A than on that of theory B.
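The bayescraft in this comment can be sketched numerically. The prior values below are invented purely for illustration; nothing in the post pins them down.

```python
# If both theories predict all twenty observed results equally well,
# the likelihood terms cancel and the posterior odds simply equal the
# prior odds.

def posterior_odds(prior_a, prior_b, lik_a, lik_b):
    return (prior_a * lik_a) / (prior_b * lik_b)

# Both theories fit all 20 results perfectly:
lik_a = lik_b = 1.0

# Made-up priors: scientist A committed after only ten results, which
# we read as evidence that theory A was a priori more plausible.
odds = posterior_odds(prior_a=0.10, prior_b=0.02, lik_a=lik_a, lik_b=lik_b)
print(odds)  # the data leave the prior advantage (5 to 1) untouched
```

The point of the sketch is only that equal likelihoods pass the prior odds through unchanged, so everything hinges on the priors.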

If I'm lucky this post is both right and novel. Here's hoping!

I've seen too many cases of overfitting data to trust the second theory. Trust the validated one more.

The question would be more interesting if we said that the original theory accounted for only some of the new data.

If you know a lot about the space of possible theories and "possible" experimental outcomes, you could try to compute which theory to trust, using (surprise) Bayes' law. If it were the case that the first theory applied to only 9 of the 10 new cases, you might find parameters such that you should trust the new theory more.

In the given case, I don't think there is any way to deduce that you should trust the 2nd theory more, unless you have some a priori measure of a theory's likelihood, such as its complexity.

Benja, I have never studied Solomonoff induction formally. God help me, but I've only read about it on the Internet. It definitely was what I was thinking of as a candidate for evaluating theories given evidence. But since I don't *really* know it in a rigorous way, it might not be suitable for what I wanted in that hand-wavy part of my argument.

However, I don't think I made quite so bad a mistake as highly-ranking the "we will observe some experimental result" theory. At least I didn't make that mistake in my own mind ;). What I actually wrote was certainly vague enough to invite that interpretation. But what I was thinking was more along these lines:

[looks up color spectrum on Wikipedia and juggles numbers to make things work out]

The visible wavelengths are 380 nm -- 750 nm. Within that range, blue is 450 nm -- 495 nm, and red is 620 nm -- 750 nm.

Let f(x) be the decimal expansion of (x - 380nm)/370nm. This moves the visible spectrum into the range [0,1].

I was imagining that T3 ("the ball is visible") was predicting

"The only digit to the left of the decimal point in f(color of ball in nm) is a 0 (without a negative sign)."

while T1 ("the ball is red") predicts

"The only digit to the left of the decimal point in f(color of ball in nm) is a 0 (without a negative sign), and the digit immediately to the right is a 7."

and T2 ("the ball is blue") predicts

"The only digit to the left of the decimal point in f(color of ball in nm) is a 0 (without a negative sign), and the digit immediately to the right is a 2."

So I was really thinking of all the theories T1, T2, and T3 as giving precise predictions. It's just that T3 opted not to make a prediction about something that T1 and T2 did predict on.
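The digit-reading scheme above can be spelled out in a few lines. The specific wavelengths (645 nm for red, 460 nm for blue) are my own picks, chosen so that the first decimal digits come out as the "7" and "2" used in the example.

```python
# Rescale the visible spectrum (380-750 nm) into [0, 1), as in the
# comment above, then read off the first digit after the decimal point.

def f(wavelength_nm: float) -> float:
    return (wavelength_nm - 380.0) / 370.0

def first_decimal_digit(x: float) -> int:
    return int(x * 10) % 10

# Hypothetical wavelengths: a 645 nm red ball and a 460 nm blue ball.
red, blue = f(645.0), f(460.0)

assert 0.0 <= red < 1.0 and 0.0 <= blue < 1.0  # T3: "the ball is visible"
assert first_decimal_digit(red) == 7           # T1's extra digit
assert first_decimal_digit(blue) == 2          # T2's extra digit
```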

However, I definitely take the point that Solomonoff induction might still not be suitable for my purposes. I was supposing that T3 would be a "better" theory by some criterion like Solomonoff induction. (I'm assuming, BTW, that T3 did predict everything that T1 and T2 predicted for the first 20 results. It's only for the 21st result that T3 didn't give an answer as detailed as those of T1 and T2.) But from reading your comment, I guess maybe Solomonoff induction wouldn't even compare T3 to T1 and T2, since T3 doesn't purport to answer all of the same questions.

If so, I think that just means that Solomonoff induction isn't quite general enough. There should be a way to compare two theories even if one of them answers questions that the other doesn't address. In particular, in the case under consideration, T1 and T2 are given to be "equally good" (in some unspecified sense), but they both purport to answer the same question in a different way. To my mind, that *should* mean that each of them isn't *really* justified in choosing its answer over the other. But T3, in a sense, acknowledges that there is no reason to favor one answer over the other. There should be some rigorous sense in which this makes T3 a better theory.

Tim Freeman, I hope to reply to your points soon, but I think I'm at my "recent comments" limit already, so I'll try to get to it tomorrow.

Upon first reading, I honestly thought this post was either a joke or a semantic trick (e.g., assuming the scientists were themselves perfect Bayesians which would require some "There are blue-eyed people" reasoning).

Because theories that can make accurate forecasts are a small fraction of theories that can make accurate hindcasts, the Bayesian weight has to be on the first guy.

In my mind, I see this visually as the first guy projecting a surface that contains the first 10 observations into the future and it intersecting with the actual future. The second guy just wrapped a surface around his present (which contains the first guy's future). Who says he projected it in the right direction?

But then I'm not as smart as Eliezer and could have missed something.

Both theories are equally good. Both are correct. There is no way to choose one, except to make another experiment and see which theory, if any (both might still hold up, or both might be broken), will prevail.

- Thomas

That the first theory is right seems obvious and not the least bit counterintuitive. Therefore, based on what I know about the psychology of this blog, I predict that it is false and the second one is true.

We have two theories that explain all the available data - and this is Overcoming Bias - so how come only a tiny number of people have mentioned the possibility of using Occam's razor? Surely that must be part of any sensible response.

I don't think you've given enough information to make a reasonable choice.
If the results of all 20 experiments are consistent with both theories but the second theory would not have been made without the data from the second set of experiments, then it stands to reason that the second theory makes more precise predictions.

If the theories are equally complex and the second makes more precise predictions, then it appears to be a better theory. If the second theory contains a bunch of ad hoc parameters to improve the fit, then it's likely a worse theory.

But of course the original question does not say that the second theory makes more precise predictions, nor that it would not have been made without the second set of experiments.

Hi Tyrrell,

Let T1_21 and T2_21 be the two theories' predictions for the twenty-first experiment.

As you note, if all else is equal, our prior beliefs about P(T1_21) and P(T2_21) -- the odds we would've accepted on bets before hearing T1's and T2's predictions -- are relevant to the probability we should assign after hearing T1's and T2's predictions. It takes more evidence to justify a high-precision or otherwise low-prior-probability prediction. (Of course, by the same token, high-precision and otherwise low-prior predictions are often more useful.)

The precision (or more exactly, the prior probability) of the predictions T1 and T2 assign to the first twenty experimental results is also relevant. The precision of these tested predictions, however, pulls in the opposite direction: if theory T1 made extremely precise, low-prior-probability predictions and got them right, this should more strongly increase our prior probability that T1's set of predictions is entirely true. You can formalize this with Bayes' theorem. [However, the obvious formalization only shows how the probability of the conjunction of all of T1's predictions increases; you need a model of how T1 and T2 were generated to know how indicative each theory's track record is of its future predictive accuracy, or how much your beliefs about P(T1_21) specifically should increase. If you replace the scientists with random coin-flip machines, and your prior probability for each event is 1/2, T1's past success shouldn't increase your P(T1_21) belief at all.]
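The bracketed point can be illustrated with a deliberately crude generative model. Everything here is invented for the sketch: the 0.9 accuracy of a "tracker" that has latched onto the real dynamics, and the 0.1 prior that a given theory came from such a tracker rather than a coin-flip machine.

```python
# Mixture model: a theory was produced either by a coin-flip machine
# (each prediction right with probability 0.5) or by a process tracking
# the underlying dynamic (right with probability 0.9).  Confirmed
# predictions shift belief toward the tracker, and only that shift
# raises the probability assigned to the next prediction.

def posterior_predictive(p_tracker, n_correct, acc_tracker=0.9, acc_coin=0.5):
    lik_tracker = acc_tracker ** n_correct
    lik_coin = acc_coin ** n_correct
    post = (p_tracker * lik_tracker) / (
        p_tracker * lik_tracker + (1 - p_tracker) * lik_coin)
    # Probability the *next* prediction is right, mixing the two models:
    return post * acc_tracker + (1 - post) * acc_coin

before = posterior_predictive(p_tracker=0.1, n_correct=0)    # 0.54
after = posterior_predictive(p_tracker=0.1, n_correct=10)    # about 0.89
assert after > before
```

With p_tracker = 0 (pure coin-flip machines), `posterior_predictive` returns 0.5 no matter how many predictions came true, matching the bracketed remark.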

As to whether there is a single "best" metric for evaluating theories, you are right that for any one expert, with one set of starting (prior) beliefs about the world and one set of data with which to update those beliefs, there will be exactly one best (Bayes'-score-maximizing) probability to assign to events T1_21 and T2_21. However, if the two experts are working from non-identical background information (e.g., if one has background knowledge the other lacks), there is no reason to suppose the two experts' probabilities will match even if both are perfect Bayesians. If you want to stick with the Solomonoff formalism, we can make the same point there: a given Solomonoff inducer will indeed have exactly one best (probabilistic) prediction for the next experiment. However, two different Solomonoff inducers, working from two different UTM's and associated priors (or updating to two different sets of observations) may disagree. There is no known way to construct a perfectly canonical notion of "simplicity", "prior probability" or "best" in your sense.

If you want to respond but are afraid of the "recent comments" limit, perhaps email me? We're both friends of Jennifer Mueller's (I think. I'm assuming you're the Tyrrell McAllister she knows?), so between that and our Overcoming Bias intersection I've been meaning to try talking to you sometime. annasalamon at gmail dot com.

Also, have you read A Technical Explanation ? It's brilliant on many of these points.

We believe the first(T1).

Why: Correctly predicted outcomes update its probability of being correct (Bayes).

The additional information available to the second theory is redundant since it was correctly predicted by T1.

A few thoughts.

I would like the one that:

0) Doesn't violate any useful rules of thumb, e.g., conservation of energy, or the prohibition on transmitting information faster than the speed of light in a vacuum.
1) Gives more precise predictions. To be consistent with a theory isn't hard if the theory gives a large range of uncertainty. E.g., a theory that predicts only "the ball will be visible" is consistent with far more outcomes than one that predicts its exact shade.
2) Doesn't have any infinities in its range

If all these are equal, I would prefer them equally. Otherwise I would have to think that something was special about the time they were suggested, and be money pumped.

For example: Assume that I was asked this question many times, but my memory was wiped in between. If I preferred the predicting theory, they could alternate which scientist discovered the theory first, and charge me a small amount of money for the first guy's theory while giving me the explanatory one for free. So I would be forever switching between theories, purely on their temporalness. Which seems a little weird.

As a machine-learning problem, it would be straightforward: The second learning algorithm (scientist) did it wrong. He's supposed to train on half the data and test on the other half. Instead he trained on all of it and skipped validation. We'd also be able to measure how relatively complex the theories were, but the problem statement doesn't give us that information.

As a human learning problem, it's foggier. The second guy could still have honestly validated his theory against the data, or not. And it's not straightforward to show that one human-readable theory is more complex than another.

But with the information we're given, we don't know anything about that. So ISTM the problem statement has abstracted away those elements, leaving us with learning algorithms done right and done wrong.
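The machine-learning framing above can be sketched in a few lines. The linear rule and the numbers are made up, and "theory B" is a deliberately extreme memorizer, but it shows why hindsight fit alone can't distinguish the two.

```python
# "Scientist A" fits a simple rule to the first ten results and is
# scored on the next ten; "scientist B" just memorizes every result
# he has seen.  Both look perfect in hindsight, but only A says
# anything about result 21.

data = [(x, 2 * x + 1) for x in range(20)]  # made-up underlying dynamic
train, test = data[:10], data[10:]

def theory_a(x):
    # Suppose A recovered the true rule from the ten training points.
    return 2 * x + 1

# Scientist B: a lookup table over everything seen so far.
theory_b = dict(data)

err_a_test = sum((theory_a(x) - y) ** 2 for x, y in test)
err_b_seen = sum((theory_b[x] - y) ** 2 for x, y in data)

print(err_a_test, err_b_seen)          # 0 0 -- indistinguishable in hindsight
print(theory_a(21), theory_b.get(21))  # 43 None -- only A predicts result 21
```

Validation on held-out data is exactly what separates them: A's zero error was earned on data it never trained on, B's was not.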

We should take into account the costs to a scientist of being wrong. Assume that the first scientist would pay a high price if the second ten data points didn't support his theory. In this case he would only propose the theory if he was confident it was correct. This confidence might come from his intuitive understanding of the theory and so wouldn't be captured by us if we just observed the 20 data points.

In contrast, if there will be no more data the second scientist knows his theory will never be proved wrong.

Sorry, I misread the question. Ignore my last answer.

So reviewing the other comments now I see that I am essentially in agreement with M@ (on David's blog) who posted prior to Eli. Therefore, Eli disagrees with that. Count me curious.
