
April 08, 2007

Comments

Well, it is not every day that I can cite something that occurred at a conference that both Robin Hanson and I attended. But, we were at a conference honoring the work of David Grether, a giant of the field of Bayesian decision theory and econometrics, which was held on the George Mason campus on Friday, 4/6.

Anyway, a theme of several papers was that people are slow to update their priors in reality in many situations, although details are important. It is not clear what the source of this "inertia" is.

Priors don't update. That's why they're called "priors".

Marginal posterior probabilities update; this is learning. Inductive priors over sequences don't update; they are what does the updating, they define your capability to learn. Even if you are a self-modifying AI and can rewrite your own source code, from a Bayesian perspective this is simply folded into an inductive prior over sequences of observations. I previously tried to write a post on this topic, but it got way too long and is now in my backlog of essays to finish someday.

This is exactly what I was trying to get at by distinguishing between the statement, "The marginal probability of drawing a red ball on the third round is 50%", which is true in all three scenarios above; versus the prior distributions over sequences of observations, which are different.

The inductive prior defines your responses to sequences of observations. This does not change over time; it is outside time. Learning how to learn is simply folded into the joint probability distribution.

(Apologies in advance for the sort-of-off-topic nature of this comment. As you'll see shortly, I had little choice.)

I was wondering, is there an avenue for us non-contributor readers to raise questions we think would be interesting to discuss? As far as I know, there are no public Overcoming Bias forums or mailing lists where everybody can post. One could ask questions in the comment sections of this blog, but that would be hijacking the comments for subjects other than what was actually said in the post - and I believe I've already seen at least one admonishment for a commenter to stick to the topic. Is it best to just post a question in the comments anyway, and trust one of the regular contributors to make a real post about it if it's deemed interesting enough?

(As for the specific question I had in mind - I was wondering how careful one should be to avoid generalization from fictional evidence [described as a fallacy here, but I'd interpret it as a bias as well - which raises another potentially interesting question: how much overlap is there between fallacies and biases?]. When writing about artificial intelligence, for instance, would it be acceptable to mention Metamorphosis of Prime Intellect as a fictional example of an AI whose "morality programming" breaks down when conditions shift to ones its designer had not thought about? Or would it be better to avoid fictional examples entirely and stick purely to the facts?)

Excellent suggestion, Kaj. I'm checking with Robin and Nick about putting up a post whose comments could be used for topic suggestions. (No further discussion in this thread though, please.)

In practice you don't usually know exactly how the balls got into the urn. In that case you have a set of models for what might have happened, with a prior probability distribution over them. As you observe the sequences, you update the probabilities for these models. How does that fit into this inductive bias framework?

If you start out with a maximum-entropy prior, then you never learn anything, ever, no matter how much evidence you observe. You do not even learn anything wrong - you always remain as ignorant as you began.

Can you clarify what you mean here? Are you referring specifically to the monkey example or making a more general point?

Finney, if you consider probability distributions over sequences, then - for example - a mixture of 33% first distribution, 33% second distribution, and 33% third distribution, produces a new and coherent probability distribution over sequences. This would create an inductive prior that could learn any of the three sequences, given only slightly more evidence to determine which one was most likely.
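A minimal sketch of the mixture idea in the comment above. The three sequence distributions here are stand-ins chosen for illustration (independent fair draws, all balls the same color, and an urn with equal numbers of red and white drawn without replacement); they are assumptions, not necessarily the three scenarios from the post, and all function names are hypothetical. Each one gives a 50% marginal chance of red on the third draw, yet the equal-weight mixture shifts its weight among them as draws come in.

```python
from fractions import Fraction
from itertools import product
from math import comb

N = 4  # draws per sequence; "R" = red ball, "W" = white ball
SEQS = ["".join(s) for s in product("RW", repeat=N)]

# Three illustrative priors over complete sequences of length N (assumed).
def iid_fair(seq):
    # Every draw is an independent 50/50 coin flip.
    return Fraction(1, 2) ** N

def all_same(seq):
    # The first ball is 50/50; every later ball matches it.
    return Fraction(1, 2) if len(set(seq)) == 1 else Fraction(0)

def urn_no_replacement(seq):
    # N/2 red and N/2 white balls, drawn until the urn is empty.
    if seq.count("R") != N // 2:
        return Fraction(0)
    return Fraction(1, comb(N, N // 2))

MODELS = [iid_fair, all_same, urn_no_replacement]

def mixture(seq):
    # Equal-weight mixture: itself a coherent distribution over sequences.
    return sum(m(seq) for m in MODELS) / len(MODELS)

def prefix_prob(prior, prefix):
    # Probability that the observed sequence starts with `prefix`.
    return sum(prior(s) for s in SEQS if s.startswith(prefix))

def model_posterior(prefix):
    # Bayes' rule over the three component distributions.
    joint = [prefix_prob(m, prefix) / len(MODELS) for m in MODELS]
    total = sum(joint)
    return [j / total for j in joint]

def predict_red(prior, prefix):
    # Probability that the next ball is red, given the draws so far.
    return prefix_prob(prior, prefix + "R") / prefix_prob(prior, prefix)

for prefix in ["", "R", "RR", "RW"]:
    weights = [str(w) for w in model_posterior(prefix)]
    print(repr(prefix), weights, predict_red(mixture, prefix))
```

Note that the mixture itself never changes; only the conditional probabilities computed from it do. A single mismatched pair such as "RW" is enough to rule out the all-same-color component entirely.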

Annan, I'm making a more general point. (Obviously not so general as to encompass 'maximum-entropy methods' of machine learning, which find the distribution that maximizes entropy subject to constraints; they are not literally maximum entropy.) Think of physical matter in a state of very high thermodynamic entropy, such as a black hole or radiation bath. A heat bath doesn't learn from observation, right? There's not enough order present to carry out operations of observing, or learning. Only highly ordered matter, like brains, can extract information from the environment. A probability distribution in a state of maximum entropy likewise lacks structure and does not update in any systematic direction. The marginal posteriors will resemble the marginal priors. It can't learn from experience; it doesn't do induction.
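As a rough illustration of that claim, the sketch below takes the maximum-entropy prior to be the uniform distribution over all red/white sequences of a fixed length (an assumption about what "maximum entropy" means here; the names are hypothetical). Under that prior the predicted chance of red on the next draw stays at one half no matter what has been observed.

```python
from fractions import Fraction
from itertools import product

N = 6  # draws per sequence; "R" = red ball, "W" = white ball
SEQS = ["".join(s) for s in product("RW", repeat=N)]

def maxent(seq):
    # Uniform over all 2**N sequences: no sequence is favored over another.
    return Fraction(1, len(SEQS))

def predict_red(prior, prefix):
    # Probability that the next ball is red, given the draws so far.
    seen = sum(prior(s) for s in SEQS if s.startswith(prefix))
    seen_then_red = sum(prior(s) for s in SEQS if s.startswith(prefix + "R"))
    return seen_then_red / seen

PREFIXES = ["", "R", "RRR", "RRRRR", "WRWRW"]
for prefix in PREFIXES:
    print(repr(prefix), predict_red(maxent, prefix))
```

Five reds in a row move the prediction no further than zero observations do, which is the sense in which the marginal posteriors resemble the marginal priors.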

Eliezer,

Yes, thank you for correcting my sloppy wording.

So, it is the marginal posterior probabilities that exhibit inertia, or slow updating through learning, not the eternally unvarying "priors."

Why do you refer to the difference between a prior and the uniform prior as a bias, rather than the difference from the optimal prior? This doesn't agree with how you previously defined a bias.

Simon, I don't understand your question. The optimal prior is the one that assigns probability 1 to the exact sequence that will be observed. Also, cognitive biases are not like inductive biases, despite the names; that's kinda the point.

Well then, what's the point of discussing it on the blog, if the similarity is only due to the names?

As for the optimal prior, if the universe is non-deterministic, or if there are "many worlds", or multiple universes in general, or other ways in which a given observer can have multiple different futures, then the optimal prior is a distribution over all those futures.

I shouldn't have included non-deterministic, since that only leads to one actual outcome.

Simon, the point of discussing it on the blog is to help people who were confused by the similarity of names (not a hypothetical scenario, it did happen). And yes, if you are in a many-worlds situation of any type then the optimal prior is a distribution, albeit one that you will never realistically be able to compute.

OK, that clears it up then.

The point about the optimal prior was that, to the extent that a prior can be considered biased (in the sense I understood the word "bias", not inductive bias), the optimal prior is the unbiased prior it should be compared to. I didn't mean to imply that finding the optimal prior is realistic.

Eliezer,

So, an "optimal prior" is either a subjectively guessed probability or, more optimally, a probability distribution that coincides with an objective probability or probability distribution. That is, it would equal the posterior distribution one would arrive at after the asymptotic working out of Bayes' Theorem, assuming the conditions for Bayes' Theorem hold.

But, what if those conditions do not hold? Will the "optimal prior" be equal to the "objective truth" or to the distribution that one arrives at after the infinite working out of the posterior adjustment learning process, even assuming that we do not have the sort of inertial slow learning that seems to exist in much of reality?

To give an example of such non-convergence, consider the sort of example posed by Diaconis and Freedman: with an infinite-dimensional space and a disconnected basis, one can end up in a cycle rather than on the mean.

Barkley, priors aren't meant to be detailed objective models of the world - that's why they're called "priors". :)

A good prior learns from evidence, and the more probability mass it concentrates into sequences of the sort that are actually likely to occur, the faster it will learn. In a certain sense, the "optimal prior" is the one that learns so fast that it doesn't need any evidence at all - but that's not really what a "prior" is for. Even with an excellent prior, nearly all of the information will come from the environment.

Sense data is light, the prior is a camera. Most of the information is in the light, but you need a camera to develop it; a rock won't do. A good camera needs less light to develop an accurate picture, but the detailed picture is still carried by the light's message, not factory-preprinted inside the camera.

As for the Diaconis and Freedman paper, I haven't read it, but kindly remember that I am an infinite set atheist. In any case it is easy for poor priors to not learn, or anti-learn. Every prior that assigns more mass than maxent to "plausible" sequences, does so by draining mass from "implausible" sequences. If reality falls into one of the "implausible" sequences, we will do worse than maximum entropy, anti-learn from experience, and not pass on our genes to a whole lot of offspring.
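The mass-draining trade-off can be sketched with a deliberately simple stand-in. Assume a prior that treats each draw as independently red with probability 0.9, and compare the surprisal (negative log probability, in bits) it assigns to a sequence against the maximum-entropy prior, which treats each draw as 50/50. The 0.9 figure, the two example sequences, and the function name are all assumptions for illustration; a fixed independent-draws prior does no learning of its own, so this only shows the "worse than maximum entropy" half of the point.

```python
import math

def surprisal_bits(p_red, seq):
    # Bits of surprise under a prior with independent draws and P(red) = p_red.
    return -sum(math.log2(p_red if ball == "R" else 1 - p_red) for ball in seq)

plausible = "RRRRRRRRRW"    # mostly red: the kind of sequence the 0.9 prior expects
implausible = "WWWWWWWWWR"  # mostly white: the kind it drained mass from

for name, seq in [("plausible", plausible), ("implausible", implausible)]:
    confident = surprisal_bits(0.9, seq)
    maxent = surprisal_bits(0.5, seq)
    print(f"{name}: confident prior {confident:.2f} bits, maxent {maxent:.2f} bits")
```

The confident prior beats maximum entropy on the sequence it expected and loses badly on the one it did not, because the extra mass given to red-heavy sequences had to come from white-heavy ones.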

Eliezer,

Ah, so you are a constructivist, perhaps even an intuitionist? Even so, the point of such theorems is that they can happen in a long transient within finite constraints, with the biggie here being the non-connectedness of the support. One can get stuck in a cycle going nowhere for a long time, just as in such phenomena as transient chaos. With a suitably large, but finite, dimensionality and a disconnected support, one can wander in a wilderness with not much serious convergence for a very long time.

I find the idea of a "prior learning" to be a bit weird. It is an agent who learns, although the prior the agent walks in with will certainly play a role in the ability of the agent to learn. But the problem of inertia that I raised has more to do with the nature of agents than with their priors.

Getting to the raison d'etre of this blog, the question here is whether bias arises from the nature of the prior an agent brings to a decision or analytical process, or whether open-mindedness, the willingness to adjust posteriors in the face of evidence, is more important. Presumably both play at least some role.

Why would anyone use a prior so strong that, when presented with data, they would be unable to learn from it? If your prior is that strong, did you really have any intention of attempting to learn from new data?

Barkley,

I think that the concept of a prior deserves more attention as the strength of your current beliefs in the face of new evidence.

Presumably, if you have a subjective prior, you brought some "prior" experience or knowledge to the problem... so philosophically, where does the original prior come from, and if it comes from your experience, is it really a prior, or have you actually reasoned your way to a posterior without even realizing it? Perhaps more time should actually be spent justifying your prior if you are going to bring a subjective prior to the problem. If you have good reasons and a lot of quality evidence, then the prior should receive a lot of weight... deciding how much weight and how strongly you believe in your prior is a tough question.

I think that any time you create a prior without objective evidence able to support it, you have the potential to bias your results. But then again, if you truly believe in your subjective prior, do you really care about the potential to "bias" your results?

Thanks for this magnificent post. My only concern is that the point seems slightly overstated when you write: "All learning is induction, and all induction takes place through inductive bias." I wish this had been phrased slightly differently. The definition of learning seems a bit narrow. Is there no such thing as deductive learning? But even considering only the realm of inductive learning (based on observation), let's assume I see a swan for the first time, and the swan is white. Wouldn't it be correct to say that I've learned that at least one swan is white? (This may be slow learning, given the context, but wouldn't it still be learning?) And isn't the "inductive bias" in this case so minimal that it's not really properly called "bias" at all, since the assumption cannot be false?

Why is inductive bias called "bias"?

Because it represents a divergence from an imagined mind of pure emptiness that can learn equally well in any environment.

It is a bias because it is a prior assumption rather than something that is learned in the course of training. Mitchell's Machine Learning has a very clear explanation of inductive bias and why it is necessary for learning to occur. There are some examples of inductive bias at Wikipedia: http://en.wikipedia.org/wiki/Inductive_bias

