The Art and Science of Cause and Effect (Part 3)




Things brings us to part-3 of the lecture, where I will demonstrate how the ideas presented thus far can be used to solve new problems of practical importance.


Consider the century old debate concerning the effect of smoking on lung cancer. In 1964, the Surgeon General issued a report linking cigarette smoking to death, cancer and most particularly, lung cancer. The report was based on non-experimental studies, in which a strong correlation was found between smoking and lung cancer, and the claim was that the correlation found is causal, namely: If we ban smoking, the rate of cancer cases will be roughly the same as the one we find today among non-smokers in the population.

These studies came under severe attacks from the tobacco industry, backed by some very prominent statisticians, among them Sir Ronald Fisher. The claim was that the observed correlations can also be explained by a model in which there is no causal connection between smoking and lung cancer. Instead, an unobserved genotype might exist which simultaneously

causes cancer and produces an inborn craving for nicotine. Formally, this claim would be written in our notation as: P(cancer | do(smoke)) = P(cancer) stating that making the population smoke or stop smoking would have no effect on the rate of cancer cases. Controlled experiment could decide between the two models, but these are impossible, and now also illegal to conduct.

This is all history. Now we enter a hypothetical era where representatives of both sides decide to meet and iron out their differences. The tobacco industry concedes that there might be some weak causal link between smoking and cancer and representatives of the health group concede that there might be some weak links to genetic factors. Accordingly, they draw this combined model, and the question boils down to assessing, from the data, the strengths of the various links. They submit the query to a statistician and the answer comes back immediately: IMPOSSIBLE. Meaning: there is no way to estimate the strength from the data, because any data whatsoever can perfectly fit either one of these two extreme models. So they give up, and decide to continue the political battle as usual.

Before parting, a suggestion comes up: perhaps we can resolve our differences, if we measure some auxiliary factors, For example, since the causal link model is based on the understanding that smoking affects lung cancer through the accumulation of tar deposits in the lungs, perhaps we can measure the amount of tar deposits in the lungs of sampled individuals, and this might provide the necessary information for quantifying the links? Both sides agree that this is a reasonable suggestion, so they submit a new query to the statistician: Can we find the effect of smoking on cancer assuming that an intermediate measurement of tar deposits is available??? The statistician comes back with good news: IT IS COMPUTABLE and, moreover, the solution is given in close mathematical form. HOW?


The statistician receives the problem, and treats it as a problem in High School ALGEBRA: We need to compute P(cancer) under hypothetical action, from non-experimental data, namely, from expressions involving NO ACTIONS. Or: we need to eliminate the "do" symbol from the initial expression. The elimination proceeds like ordinary solution of algebraic equation - in each stage, a new rule is applied, licensed by some subgraph of the diagram, until eventually leading to a formula involving only WHITE SYMBOLS, meaning expression computable from non-experimental data.

You are probably wondering whether this derivation solves the smoking-cancer debate. The answer is NO. Even if we could get the data on tar deposits, the model above is

quite simplistic, as it is based on certain assumptions which both parties might not agree to. For instance, that there is no direct link between smoking and lung cancer, immediated by tar deposits. The model would need to be refined then, and we might end up with a graph containing 20 variables or more. There is no need to panic when someone tells us: "you did not take this or that factor into account". On the contrary, the graph welcomes such new ideas, because it is so easy to add factors and measurements into the model. Simple tests are now available that permit an investigator to merely glance at the graph and decide if we can compute the effect of one variable on another.

Our next example illustrates how a long-standing problem is solved by purely graphical means - proven by the new algebra. The problem is called THE ADJUSTMENT PROBLEM or "the covariate selection problem" and represents the practical side of Simpson's paradox.


Simpson's paradox, first noticed by Karl Pearson in 1899, concerns the disturbing observation that every statistical relationship between two variables may be REVERSED by including additional factors in the analysis. For example, you might run a study and find that students who smoke get higher grades, however, if you adjust for AGE, the opposite is true in every AGE GROUP, namely, smoking predicts lower grades. If you further adjust for PARENT INCOME, you find that smoking predicts higher grades again, in every AGE-INCOME group, and so on.

Equally disturbing is the fact that no one has been able to tell us which factors SHOULD be included in the analysis. Such factors can now be identified by simple graphical means.

The classical case demonstrating Simpson's paradox took place in 1975, when UC Berkeley was investigated for sex bias in graduate admission. In this study, overall data showed a higher rate of admission among male applicants, but, broken down by departments, data showed a slight bias in favor of admitting female applicants. The explanation is simple: female applicants tended to apply to more competitive departments than males, and in these departments, the rate of admission was low for both males and females.


To illustrate this point, imagine a fishing boat with two different nets, a large mesh and a small net. A school of fish swim towards the boat and seek to pass it. The female fish try for the small-mesh challenge, while the male fish try for the easy route. The males go through and only females are caught. Judging by the final catch, preference toward female is clearly evident. However, if analyzed separately, each individual net would surely trap males more easily than females.

Another example involves a controversy called "reverse regression", which occupied the social science literature in the 1970's. Should we, in salary discrimination cases, compare salaries of equally qualified men and women, or, instead, compare qualifications of equally paid men and women? Remarkably, the two choices led to

opposite conclusions. It turned out that men earned a higher salary than equally qualified women, and SIMULTANEOUSLY, men were more qualified than equally paid women. The moral is that all conclusions are extremely sensitive to which variables we choose to hold constant when we are comparing, and that is why the adjustment problem is so critical in the analysis of observational studies.


Consider an observational study where we wish to find the effect of X on Y, for example, treatment on response. We can think of many factors that are relevant to the problem; some are affected by the treatment, some are affecting the treatment and some are affecting both treatment and response. Some of these factors may be unmeasurable, such as genetic trait or life style, others are measurable, such as gender, age, and salary level. Our problem is to select a subset of these factors for measurement and adjustment, namely, that if we compare subjects under the same value of those measurements and average, we get the right result.


Let us follow together the steps that would be required to test if two candidate measurements, Z1 and Z2, would be sufficient. The steps are rather simple, and can be performed manually, even on large graphs. However, to give you the feel of their mechanizability, I will go through them rather quickly. Here we go.


At the end of these manipulations, we end up with the answer to our question: "IF X is disconnected from Y, then Z1 and Z2 are appropriate measurements."


I now wish to summarize briefly the central message of this lecture. It is true that testing for cause and effect is difficult. Discovering causes of effects is even more difficult. But causality is not MYSTICAL OR METAPHYSICAL. It can be understood in terms of simple processes, and it can be expressed in a friendly mathematical language, ready for computer analysis.


What I have presented to you today is a sort of pocket calculator, an ABACUS, to help us investigate certain problems of cause and effect with mathematical precision. This does not solve all the problems of causality, but the power of SYMBOLS and mathematics should not be underestimated.


Many scientific discoveries have been delayed over the centuries for the lack of a mathematical language that can amplify ideas and let scientists communicate results. And I am convinced that many discoveries have been delayed in our century for lack of a mathematical language that can handle causation. For example, I am sure that Karl Pearson could have thought up the idea of RANDOMIZED EXPERIMENT in 1901, if he had allowed causal diagrams into his mathematics.

But the really challenging problems are still ahead: We still do not have a causal understanding of POVERTY and CANCER and INTOLERANCE, and only the accumulation of data and the insight of great minds will eventually lead to such understanding. The data is all over the place, the insight is yours, and now an abacus is at your disposal too. I hope the combination amplifies each of these components. Thank you.

Remarks: technical details can be found in

  • J. Pearl, "The new challenge: From a century of statistics to an age of causation." Presented at the IASC Second World Congress, Pasadena, CA, February 1997.
  • J. Pearl, "Causal diagrams for experimental research," (with discussion), Biometrika, 82(4), 669-710, December 1995,
  • J. Pearl, "Structural and probabilistic causality,'' In D.R. Shanks, K.J. Holyoak, and D.L. Medin (Eds.), The Psychology of Learning and Motivation, Vol. 34 Academic Press, San Diego, CA, 393-435, 1996.

    Hard copies of these and other related publications can be obtained from

    Prof. Judea Pearl
    UCLA Computer Science Department
    4532 Boelter Hall
    Los Angeles, CA 90095-1596

    or download a postscript file from http://singapore.cs.ucla.edu/frl_papers.html.