EB Notes

Automate Scientific Discovery

“From Causal Inference and Data Fusion to an Automated Scientist”, Elias Bareinboim, IEEE Intelligent Systems, 2016.

Graphical models are the bee’s knees:

The advent of graphical methods of causal and counterfactual inference has made it possible to tackle some of the most challenging problems in scientific methodology.

To take advantage of big data, we need to use two more things on top of graphical models:

the ability to distinguish causal from associational relationships, and the ability to integrate data from multiple, heterogeneous sources.

Apparently, this is called the “data fusion” problem. Once we use his theoretical framework, we can let AIs generalize their causal models like human scientists.

Note how the overall theme echoes that of the 1987 book “Scientific Discovery” by the AI researchers Langley, Simon, et al. We all want to automate scientific discovery.

Causal Inference and the Data-Fusion Problem

Hypothesis: Confounding bias -> P(y|x) != P(y|do(x)), because something could cause both x and y (including the cases where one is the ancestor of the other).

Admissible sets: “Z contains no descendant of X” means that you can’t accidentally condition on a collider (or a mediator) and create a spurious correlation. “Z blocks all back-door paths from X to Y” means that it shuts off association coming from Y being an ancestor of X or from a confounder causing both X and Y. Together, these block the other two sources of correlation, so the only correlation left between X and Y is the causative one. If I’m right, you’ve converted a question of intervention into a question of correlation simply by averting the other causes of correlation.

Corollary: Admissible set -> remove confounding bias relative to the effect of X on Y.
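A minimal numeric sketch of back-door adjustment on an invented binary model (Z -> X, Z -> Y, X -> Y), where Z by itself is an admissible set; the parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary model: Z -> X, Z -> Y, X -> Y (Z is an admissible set for X -> Y).
n = 200_000
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)            # X depends on the confounder Z
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)  # Y depends on both X and Z

def p(event):  # empirical probability of a boolean mask
    return event.mean()

# Naive association: P(y=1 | x=1) - P(y=1 | x=0)  (confounded)
naive = p(y[x == 1] == 1) - p(y[x == 0] == 1)

# Back-door adjustment: P(y | do(x)) = Sum_z P(y | x, z) P(z)
def p_y_do_x(xv):
    return sum(p(y[(x == xv) & (z == zv)] == 1) * p(z == zv) for zv in (0, 1))

adjusted = p_y_do_x(1) - p_y_do_x(0)
print(f"naive={naive:.3f}  adjusted={adjusted:.3f}  true effect=0.3")
```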

Front-door criterion:

   u
  / \
 /   \
x->z->y

Note that x d-separates u and z because y is a collider, and u and z d-separate x and y. So, P(u|z,x) = P(u|x) and P(y|x,z,u) = P(y|z,u). Substitute them into the formula for P(y|do(x))

Ah! Front-door criterion = back-door criterion applied twice!

For the causal effect of x on z, the collider y blocks all back-door paths from x to z and makes the admissible set {}. So, P(z|do(x)) = P(z|x).

For the causal effect of z on y, x blocks all back-door paths from z to y. So, P(y|do(z)) = Sum_x’ P(y|x’,z)P(x’)

So, overall effect of x on y is Sum_z (P(z|x) (Sum_x' P(y|x',z)P(x'))).
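A minimal numeric sketch of the same formula on an invented front-door model (U -> X, U -> Y, X -> Z, Z -> Y); the estimator never touches the unobserved U, and the parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model with an unobserved confounder U:
#   U -> X, U -> Y, X -> Z, Z -> Y   (front-door graph; U is never used by the estimator)
n = 500_000
u = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * u)
z = rng.binomial(1, 0.1 + 0.7 * x)
y = rng.binomial(1, 0.1 + 0.4 * z + 0.4 * u)

def p(mask):
    return mask.mean()

def p_y_do_x(xv):
    # Front-door: Sum_z P(z|x) * Sum_x' P(y|x',z) * P(x')
    total = 0.0
    for zv in (0, 1):
        inner = sum(p(y[(x == xp) & (z == zv)] == 1) * p(x == xp) for xp in (0, 1))
        total += p(z[x == xv] == zv) * inner
    return total

print("front-door estimate:", round(p_y_do_x(1) - p_y_do_x(0), 3), " true effect: 0.28")
```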

Back-Door Paths

Hypothesis: Back-door path => confounder. No back-door path => no confounder.

Papers I need to Understand

Pearl conference presentation on Data Fusion problem

Causal inference and the data-fusion problem

Causality the book

Recovering from Selection Bias in Causal and Statistical Inference

Recovering Causal Effects From Selection Bias (?)

My Core Ideas

Locality of causality

Expected value of information

Entropy

EB CS590 Past Project Ideas and Papers

My Own Research Interests

Scientific discovery.

Programming.

How humans learn. The causal models that are inherent in the textbooks we study and in the categories we have of the world around us.

Hypothesis: I’m interested in how to scale models from the micro-level to the macro-level. How do we form higher-level concepts?

For example, climate science (as in the thesis). Or going from our understanding of individual functions in a program to the overall properties of the program. Ditto for a car.

Hypothesis: I believe that we shouldn’t limit ourselves to the variables given in the data (as you keep mentioning, there could be millions of such low-level variables). How do we get to the high-level variables we care about? How would an automated program identify high-level “variables”?

Feature extraction: Instead of just feature selection, I want category creation. Build high level abstractions out of low level variables. I believe that will reduce our uncertainty about the joint distribution and make it cheaper to calculate the queries we are interested in. We are interested in high level variables.

Hypothesis: I think that is related to the decisions that we face - what interventions are we uncertain about? Also depends on the granularity of our tools. (example: macro-economics works with things like interest rates and tax rates, not individual firms.)

Inference / Data-fusion

Causal inference and the data-fusion problem.

Bounding

[T]he assessment of treatment efficacy in the face of imperfect compliance.

Methodologically, the message of this chapter has been to demonstrate that, even in cases where causal quantities are not identifiable, reasonable assumptions about the structure of causal relationships in the domain can be harnessed to yield useful quantitative information about the strengths of those relationships. Once such assumptions are articulated in graphical form and re-encoded in terms of canonical partitions, they can be submitted to algebraic methods that yield informative bounds on the quantities of interest. The canonical partition further allows us to supplement structural assumptions with prior beliefs about the population under study and invite Gibbs sampling techniques to facilitate Bayesian estimation of the target quantities.

– Chapter 8, Causality
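This is not the canonical-partition linear programming the chapter describes; as the simplest illustration of bounding without identification, here are the no-assumption (“natural”) bounds P(y=1, x=1) <= P(y=1 | do(x=1)) <= P(y=1, x=1) + P(x=0), computed on an invented model with an unmeasured confounder. The bound follows from splitting Sum_u P(y|x,u)P(u) into the observed part (where X = x) and an unobserved part that lies between 0 and P(x=0).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data with an unmeasured confounder U (invented numbers).
n = 300_000
u = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.15 + 0.7 * u)
y = rng.binomial(1, 0.2 + 0.3 * x + 0.4 * u)

p_yx = np.mean((y == 1) & (x == 1))   # P(y=1, x=1)
p_not_x = np.mean(x == 0)             # P(x=0)

# No-assumption ("natural") bounds on the interventional quantity.
lower, upper = p_yx, p_yx + p_not_x
true_value = 0.2 + 0.3 + 0.4 * 0.5    # = 0.7 in this invented model
print(f"bounds on P(y=1|do(x=1)): [{lower:.3f}, {upper:.3f}], true value {true_value:.2f}")
```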

Instrumental Variables / Linear Models

In this paper, we extend graph-based identification methods by allowing background knowledge in the form of non-zero parameter values. Such information could be obtained, for example, from a previously conducted randomized experiment, from substantive understanding of the domain, or even an identification technique. To incorporate such information systematically, we propose the addition of auxiliary variables to the model, which are constructed so that certain paths will be conveniently cancelled. This cancellation allows the auxiliary variables to help conventional methods of identification (e.g., single-door criterion, instrumental variables, half-trek criterion), as well as model testing (e.g., d-separation, over-identification). Moreover, by iteratively alternating steps of identification and adding auxiliary variables, we can improve the power of existing identification methods via a bootstrapping approach that does not require external knowledge. We operationalize this method for simple instrumental sets (a generalization of instrumental variables) and show that the resulting method is able to identify at least as many models as the most general identification method for linear systems known to date. We further discuss the application of auxiliary variables to the tasks of model testing and z-identification.

– Incorporating Knowledge into Structural Equation Models using Auxiliary Variables
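The quoted abstract builds on instrumental variables; as background, here is a minimal sketch of the plain IV estimate in a linear SEM, where for a valid instrument Z the coefficient of X on Y is cov(Z, Y) / cov(Z, X). The model and numbers are invented, and this is not the paper’s auxiliary-variable construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear SEM with unobserved confounder U and instrument Z:
#   Z -> X, U -> X, U -> Y, X -> Y (coefficient beta = 2.0, which we try to recover)
n = 200_000
beta = 2.0
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 1.5 * z + 1.0 * u + rng.normal(size=n)
y = beta * x + 1.0 * u + rng.normal(size=n)

ols = np.cov(x, y)[0, 1] / np.var(x)          # biased by the confounder U
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # instrumental-variable ratio
print(f"OLS estimate: {ols:.3f}   IV estimate: {iv:.3f}   true beta: {beta}")
```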

Adjustment / Algorithms

We address the problem of finding a minimal separator in a directed acyclic graph (DAG), namely, finding a set Z of nodes that d-separates a given pair of nodes, such that no proper subset of Z d-separates that pair. We analyze several versions of this problem and offer polynomial algorithms for each. These include: finding a minimal separator from a restricted set of nodes, finding a minimum-cost separator, and testing whether a given separator is minimal. We confirm the intuition that any separator which cannot be reduced by a single node must be minimal.

– Finding Minimal D-separators
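The quoted paper is about finding minimal d-separators; as a self-contained building block, here is a sketch of the standard moralization test for whether a given set Z d-separates X from Y (not the paper’s minimality algorithms). The graph encoding and names are my own.

```python
from collections import defaultdict

def d_separated(parents, xs, ys, zs):
    """Test whether zs d-separates xs from ys in a DAG given as {node: [parents]}.

    Moralization criterion: restrict to ancestors of xs, ys, zs, marry parents,
    drop directions, delete zs, then check that xs and ys are disconnected."""
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Ancestors (including the nodes themselves) of xs, ys, zs.
    anc, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, []))

    # 2. Moral graph on the ancestral set: parent-child and parent-parent edges.
    adj = defaultdict(set)
    for v in anc:
        ps = [p for p in parents.get(v, []) if p in anc]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])

    # 3. Remove the conditioning set and check reachability from xs to ys.
    reached, stack = set(), [v for v in xs if v not in zs]
    while stack:
        v = stack.pop()
        if v in ys:
            return False          # some x reaches some y, so not d-separated
        if v in reached:
            continue
        reached.add(v)
        stack.extend(w for w in adj[v] if w not in zs)
    return True

# Front-door graph from above: U -> X, U -> Y, X -> Z, Z -> Y.
dag = {"X": ["U"], "Z": ["X"], "Y": ["U", "Z"], "U": []}
print(d_separated(dag, {"U"}, {"Z"}, {"X"}))   # True: U independent of Z given X
print(d_separated(dag, {"X"}, {"Y"}, {"Z"}))   # False: back-door path X <- U -> Y is open
```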

Covariate adjustment is a widely used approach to estimate total causal effects from observational data. Several graphical criteria have been developed in recent years to identify valid covariates for adjustment from graphical causal models. These criteria can handle multiple causes, latent confounding, or partial knowledge of the causal structure; however, their diversity is confusing and some of them are only sufficient, but not necessary. In this paper, we present a criterion that is necessary and sufficient for four different classes of graphical causal models: directed acyclic graphs (DAGs), maximum ancestral graphs (MAGs), completed partially directed acyclic graphs (CPDAGs), and partial ancestral graphs (PAGs). Our criterion subsumes the existing ones and in this way unifies adjustment set construction for a large set of graph classes.

– A Complete Generalized Adjustment Criterion

Structural Learning

Randomized controlled experiments are often described as the most reliable tool available to scientists for discovering causal relationships among quantities of interest. However, it is often unclear how many and which different experiments are needed to identify the full (possibly cyclic) causal structure among some given (possibly causally insufficient) set of variables. Recent results in the causal discovery literature have explored various identifiability criteria that depend on the assumptions one is able to make about the underlying causal process, but these criteria are not directly constructive for selecting the optimal set of experiments. Fortunately, many of the needed constructions already exist in the combinatorics literature, albeit under terminology which is unfamiliar to most of the causal discovery community. In this paper we translate the theoretical results and apply them to the concrete problem of experiment selection. For a variety of settings we give explicit constructions of the optimal set of experiments and adapt some of the general combinatorics results to answer questions relating to the problem of experiment selection.

– Experiment Selection for Causal Discovery

We show that if any number of variables are allowed to be simultaneously and independently randomized in any one experiment, log2(N) + 1 experiments are sufficient and in the worst case necessary to determine the causal relations among N >= 2 variables when no latent variables, no sample selection bias and no feedback cycles are present. For all K, 0 < K < N/2, we provide an upper bound on the number of experiments required to determine causal structure when each experiment simultaneously randomizes K variables. For large N, these bounds are significantly lower than the N - 1 bound required when each experiment randomizes at most one variable. For k_max < N/2, we show that (N/k_max - 1) + (N/(2*k_max)) * log2(k_max) experiments are sufficient and in the worst case necessary. We offer a conjecture as to the minimal number of experiments that are in the worst case sufficient to identify all causal relations among N observed variables that are a subset of the vertices of a DAG.

– On the Number of Experiments Sufficient and in the Worst Case Necessary to Identify All Causal Relations Among N Variables
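The log2(N) idea in the quoted abstract is essentially a separating system: give each variable a distinct binary code, and in experiment j randomize exactly the variables whose j-th bit is 1; any two variables then differ in some bit, so some experiment intervenes on one but not the other. A small sketch of that construction (not the paper’s full procedure; the extra “+1” experiment in the quoted bound is not shown here):

```python
from math import ceil, log2

def experiment_plan(variables):
    """Assign each variable a distinct binary code; experiment j randomizes
    the variables whose j-th bit is 1. Uses ceil(log2(N)) experiments."""
    n_bits = max(1, ceil(log2(len(variables))))
    return [
        {v for i, v in enumerate(variables) if (i >> j) & 1}
        for j in range(n_bits)
    ]

variables = ["A", "B", "C", "D", "E"]
plan = experiment_plan(variables)
print(plan)

# Check the separating property: for every pair, some experiment contains
# exactly one of the two variables.
for i, a in enumerate(variables):
    for b in variables[i + 1:]:
        assert any((a in e) != (b in e) for e in plan), (a, b)
print("every pair is separated by some experiment")
```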

Reinforcement Learning / Machine Learning

Reinforcement learning (RL) agents have been deployed in complex environments where interactions are costly, and learning is usually slow. One prominent task in these settings is to reuse interactions performed by other agents to accelerate the learning process. Causal inference provides a family of methods to infer the effects of actions from a combination of data and qualitative assumptions about the underlying environment. Despite its success of transferring invariant knowledge across domains in the empirical sciences, causal inference has not been fully realized in the context of transfer learning in interactive domains. In this paper, we use causal inference as a basis to support a principled and more robust transfer of knowledge in RL settings. In particular, we tackle the problem of transferring knowledge across bandit agents in settings where causal effects cannot be identified by do-calculus [Pearl, 2000] and standard learning techniques. Our new identification strategy combines two steps - first, deriving bounds over the arms distribution based on structural knowledge; second, incorporating these bounds in a dynamic allocation procedure so as to guide the search towards more promising actions. We formally prove that our strategy dominates previously known algorithms and achieves orders of magnitude faster convergence rates than these algorithms. Finally, we perform simulations and empirically demonstrate that our strategy is consistently more efficient than the current (non-causal) state-of-the-art methods.

– Transfer Learning in Multi-Armed Bandits

Transfer Learning / Machine Learning

We provide a formal definition of the notion of “transportability,” or “external validity,” which we view as a license to transfer causal information learned in experimental studies to a different environment, in which only observational studies can be conducted. We introduce a formal representation called “selection diagrams” for expressing knowledge about differences and commonalities between populations of interest and, using this representation, we derive procedures for deciding whether causal effects in the target environment can be inferred from experimental findings in a different environment. When the answer is affirmative, the procedures identify the set of experimental and observational studies that need be conducted to license the transport. We further demonstrate how transportability analysis can guide the transfer of knowledge among non-experimental studies to minimize re-measurement cost and improve prediction power. We further provide a causally principled definition of “surrogate endpoint” and show that the theory of transportability can assist the identification of valid surrogates in a complex network of cause-effect relationships.

– Transportability Across Studies

Fairness-Discrimination / Machine Learning

Anti-discrimination is an increasingly important task in data science. In this paper, we investigate the problem of discovering both direct and indirect discrimination from the historical data, and removing the discriminatory effects before the data is used for predictive analysis (e.g., building classifiers). We make use of the causal network to capture the causal structure of the data. Then we model direct and indirect discrimination as the path-specific effects, which explicitly distinguish the two types of discrimination as the causal effects transmitted along different paths in the network. Based on that, we propose an effective algorithm for discovering direct and indirect discrimination, as well as an algorithm for precisely removing both types of discrimination while retaining good data utility. Different from previous works, our approaches can ensure that the predictive models built from the modified data will not incur discrimination in decision making. Experiments using real datasets show the effectiveness of our approaches.

Discrimination refers to unjustified distinctions in decisions against individuals based on their membership in a certain group. Federal Laws and regulations (e.g., the Equal Credit Opportunity Act of 1974) have been established to prohibit discrimination on several grounds, such as gender, age, sexual orientation, race, religion or belief, and disability or illness, which are referred to as the protected attributes. Nowadays various predictive models have been built around the collection and use of historical data to make important decisions like employment, credit and insurance. If the historical data contains discrimination, the predictive models are likely to learn the discriminatory relationship present in the data and apply it when making new decisions. Therefore, it is imperative to ensure that the data that goes into the predictive models and the decisions made with its assistance are not subject to discrimination.

– A Causal Framework for Discovering and Removing Direct and Indirect Discrimination

Interference

The term “interference” has been used to describe any setting in which one subject’s exposure may affect another subject’s outcome. We use causal diagrams to distinguish among three causal mechanisms that give rise to interference. The first causal mechanism by which interference can operate is a direct causal effect of one individual’s treatment on another individual’s outcome; we call this direct interference. Interference by contagion is present when one individual’s outcome may affect the outcomes of other individuals with whom he comes into contact. Then giving treatment to the first individual could have an indirect effect on others through the treated individual’s outcome. The third pathway by which interference may operate is allocational interference. Treatment in this case allocates individuals to groups; through interactions within a group, individuals may affect one another’s outcomes in any number of ways. In many settings, more than one type of interference will be present simultaneously. The causal effects of interest differ according to which types of interference are present, as do the conditions under which causal effects are identifiable. Using causal diagrams for interference, we describe these differences, give criteria for the identification of important causal effects, and discuss applications to infectious diseases.

– Causal Diagrams for Interference

Effect Restoration

This paper highlights several areas where graphical techniques can be harnessed to address the problem of measurement errors in causal inference. In particular, it discusses the control of unmeasured confounders in parametric and nonparametric models and the computational problem of obtaining bias-free effect estimates in such models. We derive new conditions under which causal effects can be restored by observing proxy variables of unmeasured confounders with/without external studies.

– Measurement bias and effect restoration in causal inference

Mediation Analysis

Recent advances in causal inference have given rise to a general and easy-to-use formula for assessing the extent to which the effect of one variable on another is mediated by a third. This Mediation Formula is applicable to nonlinear models with both discrete and continuous variables, and permits the evaluation of path-specific effects with minimal assumptions regarding the data-generating process. We demonstrate the use of the Mediation Formula in simple examples and illustrate why parametric methods of analysis yield distorted results, even when parameters are known precisely. We stress the importance of distinguishing between the necessary and sufficient interpretations of “mediated-effect” and show how to estimate the two components in nonlinear systems with continuous and categorical variables.

– The Causal Mediation Formula
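A minimal sketch of the Mediation Formula for binary X, M, Y under the no-unobserved-confounding assumptions it requires; the conditional probability tables are invented:

```python
# Hypothetical conditional probability tables for X -> M -> Y with X -> Y.
P_m_given_x = {0: 0.2, 1: 0.7}                      # P(M=1 | X=x)
E_y_given_xm = {(0, 0): 0.1, (0, 1): 0.4,           # E[Y | X=x, M=m]
                (1, 0): 0.3, (1, 1): 0.6}

def p_m(m, x):
    return P_m_given_x[x] if m == 1 else 1 - P_m_given_x[x]

# Mediation Formula: natural direct and indirect effects.
NDE = sum((E_y_given_xm[1, m] - E_y_given_xm[0, m]) * p_m(m, 0) for m in (0, 1))
NIE = sum(E_y_given_xm[0, m] * (p_m(m, 1) - p_m(m, 0)) for m in (0, 1))
TE = (sum(E_y_given_xm[1, m] * p_m(m, 1) for m in (0, 1))
      - sum(E_y_given_xm[0, m] * p_m(m, 0) for m in (0, 1)))

print(f"NDE={NDE:.3f}  NIE={NIE:.3f}  total effect={TE:.3f}")
# In general TE decomposes as NDE minus the reverse-transition NIE; with no
# X-M interaction (as in these made-up numbers) TE = NDE + NIE.
```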

Explanation / Actual Causation

This chapter offers a formal explication of the notion of “actual cause,” an event recognized as responsible for the production of a given outcome in a specific scenario, as in: “Socrates drinking hemlock was the actual cause of Socrates’ death.” Human intuition is extremely keen in detecting and ascertaining this type of causation and hence is considered the key to constructing explanations (Section 7.2.3) and the ultimate criterion (known as “cause in fact”) for determining legal responsibility.

Statements of the type “a car accident was the cause of Joe’s death,” made relative to a specific scenario, are classified as “singular,” “single-event,” or “token-level” causal statements. Statements of the type “car accidents cause deaths,” when made relative to a type of events or a class of individuals, are classified as “generic” or “type-level” causal claims (see Section 7.5.4). We will call the cause in a single-event statement an actual cause and the one in a type-level statement a general cause.

– Chapter 10, Causality

Constraints / Theory

(Paraphrasing) Help you identify functional constraints and thus help you test causal models as well as infer them from data.

Cyclic Models

We show that the d-separation criterion constitutes a valid test for the conditional independence relationships that are induced by feedback systems involving discrete variables.

– Identifying Independencies in Causal Graphs with Feedback

Databases

Provenance is often used to validate data, by verifying its origin and explaining its derivation. When searching for “causes” of tuples in the query results or in general observations, the analysis of lineage becomes an essential tool for providing such justifications. However, lineage can quickly grow very large, limiting its immediate use for providing intuitive explanations to the user. The formal notion of causality is a more refined concept that identifies causes for observations based on user-defined criteria, and that assigns to them gradual degrees of responsibility based on their respective contributions. In this paper, we initiate a discussion on causality in databases, give some simple definitions, and motivate this formalism through a number of example applications.

– Causality in Databases

Theses

Automated Macro-scale Causal Hypothesis Formation Based on Micro-scale Observation (2016), K. Chalupka [link]

Not Interested

Causal Macrovariables

These abstractions are particularly useful when one can establish causal relations amongst macrovariables that hold independent of the micro-variable instantiations of the macrostates.

Abstract Interface to the Possible Instances

Each macrovariable state must have a consistent, well-defined causal effect. This effect can be probabilistic and highly variable, but must not depend on the microvariable instantiation of the macrovariable, just as the specifics of gas molecule momenta do not change the effects of temperature so long as their mean is the same.

Hypothesis: The effect of a macrovariable should depend on the interface provided by the microvariables not on the actual configurations.

Total cholesterol fails because low LDL + high HDL and high LDL + low HDL have different effects even though they present the same interface (sum).

Hypothesis: “equivalence relation” = interface satisfied by the microvariables. If two configurations satisfy the same interface (say mean of their velocities), then we consider them equivalent.

Chapter 2

The visual cause is distinguished from other macro-variables in that it contains all the causal information about the target behavior that is available in the image.

While we are interested in identifying the visual causes of a target behavior, the functional relation between the image pixels and the visual cause should not itself be interpreted as causal. Pixels do not cause the features of an image, they constitute them, just as the atoms of a table constitute the table (and its features).

Hypothesis: A category is just a label you give to a set of causal branches that have the same output (for your purposes).

The probability distribution over the visual cause is induced by the probability distribution over the pixels in the image and the functional mapping from the image to the visual cause.

Definition 1 (Observational Partition, Observational Class). The observational partition Pi_o(T, I) of the set of images I w.r.t. behavior T is the partition induced by the equivalence relation ~ such that i ~ j if and only if P(T | I = i) = P(T | I = j). We will denote it as Pi_o when the context is clear. A cell of an observational partition is called an observational class.

Thus, knowing the observational class of an image allows us to predict the value of T . However, the predictive probability assigned to an image does not tell us the causal effect of the image on T . For example, a barometer is widely taken to be an excellent predictor of the weather. But changing the barometer needle does not cause an improvement of the weather. It is not a (visual or otherwise) cause of the weather. In contrast, seeing a particular barometer reading may well be a visual cause of whether we pack an umbrella.
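A toy sketch of Definition 1: group “images” (tiny tuples of pixel values here) into observational classes by their empirical P(T | I = i). The data is invented:

```python
from collections import Counter, defaultdict

# Hypothetical data: each record is (image, target), where an "image" is a
# tuple of binary pixels and the target T is binary.
data = [((0, 0), 0), ((0, 0), 0), ((0, 0), 1),
        ((0, 1), 0), ((0, 1), 1), ((0, 1), 1), ((0, 1), 1),
        ((1, 0), 0), ((1, 0), 0), ((1, 0), 1),
        ((1, 1), 1), ((1, 1), 1)]

counts = defaultdict(Counter)
for image, t in data:
    counts[image][t] += 1

# Observational partition: images i, j fall in the same class iff P(T|I=i) = P(T|I=j)
# (here compared on rounded empirical estimates).
classes = defaultdict(list)
for image, c in counts.items():
    p_t1 = round(c[1] / sum(c.values()), 2)
    classes[p_t1].append(image)

for p_t1, images in sorted(classes.items()):
    print(f"P(T=1|I) = {p_t1}: {images}")
```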

Our notion of a visual cause depends on the ability to manipulate the image.


The underlying idea is that images are considered causally equivalent with respect to T if they have the same causal effect on T .

The Causal Coarsening Theorem

The main theorem of this book relates the causal and observational partitions for a given I and T . It turns out that under appropriate, intuitive assumptions, the causal partition is a coarsening of the observational partition. That is, the causal partition aligns with the observational partition, but the observational partition may subdivide some of the causal classes.

Two points are worth noting here: First, the CCT is interesting inasmuch as the visual causes of a behavior do not contain all the information in the image that predict the behavior. Such information, though not itself a cause of the behavior, can be informative about the state of other non-visual causes of the target behavior. Second, the CCT allows us to take any classification problem in which the data is divided into observational classes, and assume that the causal labels do not change within each observational class.

The previous chapter develops a method to discover from micro-variable data the macro-variable cause of a pre-defined macro-variable “target behavior”. In this chapter, we do not assume that the macro-level effect is already specified. Instead, in a generalization of the CFL framework, we simultaneously recover the macro- level cause C and macro-level effect E from micro-variable data.

Causal Consistency of SEMs

Key variable: Models of the same system at different levels of detail, their consistency

Different models <- large number of variables with irrelevant or unobserved variables marginalized out; micro-level and macro-level models where the macrovariables are aggregates of the microvariables; dynamic time series model vs stationary behaviour models

A: blood cholesterol but not stress

B: 100 billion neurons vs average neuronal activity in functional brain regions

C: time-evolving chemical reaction - just final ratio of reactants and products

Definition: Consistency - Given the same interventions, both models should agree in their predictions.

Structure of the paper: Section 2 - SEMs; Section 3 - SEMs for causal modeling; Section 4 - exact transformation between two SEMs;

Hypothesis: Exact transformation between two SEMs - when can two models be thought of as causal descriptions of the same system?

Hypothesis: Novel idea of the paper - explicitly make use of a “natural ordering on the set of interventions”.

Hypothesis: Questions answered - When can we model only a subsystem of a more complex system? When does a micro-level system admit a causal description in terms of macro-level features? How do cyclic SEMs arise?

Observation: Historically, TC (total cholesterol level) was thought to be important in determining risk of HD (heart disease). Experiments - diets to raise or lower TC. Some found that higher TC lowered HD, others found the opposite.

Hypothesis: Seemingly conflicting observations <- perform an “invalid” transformation of the true underlying model.

Hypothesis: (Mine) “Transformation” = abstraction of the system. (Or abstraction of the set of interventions you care about.)

Observation: Actual hypothesis - LDL and HDL both increase TC, but have different effects on HD.

Hypothesis: Can’t “transform” (diet -> LDL and HDL -> HD) into (diet -> TC -> HD).
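A tiny numeric illustration of why (diet -> LDL and HDL -> HD) need not transform into (diet -> TC -> HD): two micro-states with the same TC but different HD risk. The structural equation and numbers are made up:

```python
# Hypothetical structural equation for heart-disease risk from the two
# micro-variables; total cholesterol TC = LDL + HDL is the proposed macro-variable.
def hd_risk(ldl, hdl):
    return 0.002 * ldl - 0.0015 * hdl   # LDL raises risk, HDL lowers it (made-up numbers)

config_a = {"ldl": 100, "hdl": 60}      # TC = 160
config_b = {"ldl": 130, "hdl": 30}      # TC = 160, same macro-state

for name, c in [("A", config_a), ("B", config_b)]:
    print(name, "TC =", c["ldl"] + c["hdl"], " risk =", round(hd_risk(**c), 3))

# Same TC, different risk: the effect of the macro-variable TC on HD depends on
# the micro-variable instantiation, so interventions phrased at the TC level are
# ill-defined and the coarser model is not an exact transformation.
```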


SEM => distributions over interventions

Observation: “The partial ordering of X corresponds to the ability to compose physical implementations of interventions.”

Hypothesis: He’s describing how we compose categories as per the granularity of our tools.

Expected Interventions

Hypothesis: Categories formed <- variables on which interventions are possible.

Hypothesis: Whether Mx and My are exact transformations depends on the set of interventions you care about.

Otherwise, you wouldn’t be justified in losing the detailed information behind the variables.

Hypothesis: Ignore variables <- you never expect to intervene on them.

Hypothesis: Simpler model <- you don’t expect to intervene on certain micro-variables.

Hypothesis: Static model <- you don’t expect to change things over time.

Corollary: That’s why we care about hotspots in software engineering - the places where we expect our code to change. That’s the set of interventions for which we want to design our system.

Question: What about just the observations? How coarse can your macrovariables be if all you do is observe?

Transportability and Causal Consistency

Hypothesis: You can transport an intervention across populations iff the target model is an exact transformation of the original model for that intervention.

So, you may not be able to transport certain interventions.

(Not at all sure about this.)

Observation: The current transportability formula uses a mixture of data from the source and target population.

Hypothesis: Maybe you could use a mixture of data in the exact-transformation setting by exploiting the commutative diagram and working in the original model or the target model as desired.

Hypothesis: Maybe it will help you consider multiple heterogeneous conditions at the same time.

Transportability across studies: A formal approach (2011)

Motivating Examples

Hypothesis: “Selection diagram” :: differences and commonalities between populations of interest

Hypothesis: Transportability problem :: causal effect in different environment -> Maybe causal effect in target environment

Hypothesis: Question - When can we assume that age-specific effects are invariant across cities (but not the age distribution itself)?

Observation: This seems to be transportable.

Now, toggle variables to see when the effect stops being transportable.

Problems:

Question: In the hypertension example, is the change in P(Z) due to differences in P(X) or due to differences in the way Z is affected by X? (RCT would distinguish between the two.)

Hypothesis: Transportable effect = if the mechanism is the same but it’s just the observational probabilities that are different.

i.e., P(z|x) vs P(z|do(x)).

Corollary: You can’t use statistics because they don’t distinguish between the two. Only interventions can distinguish between observational probabilities and mechanisms.

Corollary: Transportability is a causal notion, not a statistical notion.

Hypothesis: Transportability <- same mechanisms, but potentially different probabilities.

Hypothesis: Population differences = difference in probabilities, but not mechanisms.

Hypothesis: Key variable - What causes the difference between P(z) and P*(z)?

Observation: Language proficiency - distribution in age could be the same in LA and NY, but language proficiency could be caused differently by age in the two cities (maybe LA teaches Spanish in school). Since age distribution is the same, causal effects are just the same in both cities.

Observation: Language proficiency - if proficiency is caused the same way in both cities (P(z|age) = P*(z|age)), then the difference in P(z) was caused by difference in P(age) and so the causal effect won’t be the z-specific effect.

Observation: We assumed that P(y|do(x), z) = P*(y|do(x), z)! We’re assuming that the mechanism is the same.

Hypothesis: Transportability <- causal mechanisms between X and Y are the same, auxiliary mechanisms may be different, probabilities may be different.

Observation: They’re considering cases where some auxiliary variable Z is different in two populations.

Question: What if Z is the same? Do you assume that the causal effect is transportable?

Hypothesis: Assumption - the causal graph is the same in both populations.

Corollary: Difference in some auxiliary variable <- difference in mechanism or difference in probabilities.

Corollary: Same mechanism <- auxiliary variable has same probability in both populations, same causal graph, stability assumption

Hypothesis: Transportability <- same causal graph, causal mechanisms between X and Y are the same, auxiliary mechanisms may be different, probabilities may be different.

Basically, do they differ in f or U?

Observation: X -> Z -> Y; the mechanism from X -> Z could be different in the two populations, but you can adjust for that with P*(Z|X). The mechanism in P(y|do(x),z) is assumed to be the same.
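A minimal sketch of the transport formula for this case (chain X -> Z -> Y with the selection node pointing only at Z): P*(y|do(x)) = Sum_z P*(z|x) P(y|do(x), z), i.e., reweight the source population’s z-specific experimental effect by the target population’s observational P*(z|x). The conditional tables below are invented:

```python
# Source-population experimental quantity P(y=1 | do(x), z)  (assumed invariant)
P_y_doX_z = {(0, 0): 0.10, (0, 1): 0.40,
             (1, 0): 0.25, (1, 1): 0.70}

# Observational P(z=1 | x) in the source and target populations (they differ).
P_z_given_x_source = {0: 0.30, 1: 0.60}
P_z_given_x_target = {0: 0.50, 1: 0.90}

def effect(p_z_given_x):
    def p_y_dox(x):
        pz1 = p_z_given_x[x]
        return sum(P_y_doX_z[x, z] * (pz1 if z == 1 else 1 - pz1) for z in (0, 1))
    return p_y_dox(1) - p_y_dox(0)

print("effect in source population:", round(effect(P_z_given_x_source), 3))
print("transported effect in target:", round(effect(P_z_given_x_target), 3))
```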

Hypothesis: Transportability <- same causal graph, effects of mechanisms that are different can be identified or experimentally measured, some probabilities may be different.

Question: Why did they assume that P(y|do(x), z) is the same as P*(y|do(x), z) in the Markov chain example above?

Transportability Assumptions

Hypothesis: Use do-calculus on causal effects <- common causal graph.

Hypothesis: Assume P(A) = P*(A) <- empirical information.

For example, in X -> Z -> Y, P(y|do(x), z) = P(y|z) in the causal graph; so it holds for both P and P*. Then, if we are given that P(y|z) = P*(y|z), we can remove the star.

Observation: Assume that S _|_ Y | do(x) => P*(y|do(x)) = P(y|do(x)).

Basically, assume that the mechanisms with the selector are the same in both populations.

Hypothesis: Variables without the selector have the same mechanism in both populations.

Surrogate Endpoints from a Causality Viewpoint

Hypothesis: Problem statement - relationship between two treatments considered (say drug A and drug B), where you have data on Z and Y for the drug A treatment, but only data on Z for the drug B treatment.

Hypothesis: If drug A = drug B, then knowing about Z alone in the second case should be enough to infer things about Y.

Question: What if the population changes the mechanism for Z?

Question: Is it the same drug in both the studies or different drugs? What do you mean by “if the two treatments are identical”?

Shared by most workers in the field, this description lacks a key ingredient: the relationship between the two treatments considered, one prevailing when data is available on both the surrogate (Z) and the endpoint (Y), and the second, when data on the surrogate alone are available. Clearly, if the two treatments are identical, the problem is trivially solved, for then, strong correlation between Z and Y in the first study should suffice; all accurate predictions that Z provides about Y in the first study, both across and within treatment arms, will remain valid in the follow-up study. We must conclude therefore, that the collective intuition against the sufficiency of correlation stems from a tacit understanding that the two studies are conducted under two different conditions, and that “strong correlation,” should mean not merely accuracy of prediction but also robustness to the new condition. It follows that any formal definition of surrogacy must specify how the new conditions differ from those prevailing in the original study and incorporate this specification in the body of the definition. We will now propose such specification.

Specifically, for Z to be a surrogate for Y, Z must be a good predictor of P(y|do(x)) and also remain a good predictor under new settings in which Z is directly controlled, as in Fig. 2(c). The reason we should concern ourselves with such settings is that, once Z is proclaimed a “surrogate endpoint,” it invites efforts (e.g., by drug manufacturers) to find direct means of controlling Z.

Hypothesis: That’s the key problem. Can Z remain a good predictor of the effect on Y even when people try to game it?

Observation: Positive exemplar - a perfect mediator works. (example: X,S -> Z -> Y)

Observation: Negative exemplar - descendant of a mediator doesn’t work.

TODO: Question: What does S mean in 7(b)? Maybe S is the drug. [No. I think S is the selector for the new mechanism that decides Z.]

In 7(a): P(y|do(x), z) = P(y|z).

P(y|do(x), z, s) = P(y|z).

This means that whatever predictions (about the true endpoint Y ) we can make from observing Z in the initial study, those same predictions will remain valid in the followup study, irrespective of the new condition created by S.

This is no longer the case in Fig. 7(b). Here the true endpoint Y can be highly correlated with the putative surrogate Z, for both treatment and control, but in the follow up study this correlation may be severely altered, even destroyed.

In 7(b): P(y|do(x), z) = P(y|x, z)

P(y|do(x), z, s) = P(y|x, z, s)

A sharper difference emerges between the two models when we consider the role of Z in evaluating the efficacy of new treatments, say S.

I’m confused. If we are now considering a new treatment S, what were we considering earlier?

Hypothesis: Condition - P* terms must not involve Y (since Y is not measured in the follow-up studies).

In Fig. 7(b), on the other hand, the effect of S on Y cannot be thus decomposed into P-terms involving Y and P*-terms not involving Y, which means that we cannot forego measurement of Y under the new environment, thus losing surrogacy.

Hypothesis: “Surrogate endpoint” problem = decompose effect of S on Y into P-terms involving Y and P*-terms not involving Y (so that we can forego measurement of Y under the new environment).

Hypothesis: “Measurements of P(y|do(x), z) and P(z|do(x), s) [should be] sufficient for assessing P(y|do(x), z, s) or P(y|do(x), s)”.

This is a direct consequence of the non-robustness of the Z-specific effect P(y|do(x), z, s) to new treatments represented by S; indeed, if Z is merely a symptom of U, its correlation with Y may be deceptive; a new drug (S) may cure Z without having any effect on Y and without altering the effect of X on Y.

TODO: Inadequacy of “principal surrogacy” - false positive and false negative.

What we may lose in going from perfect to imperfect mediation is the ability to control the effect of X on Y through interventions on Z, but we do not lose the ability to properly assess the effectiveness of such interventions by measuring their effect on Z. The first views surrogates as targets for control, the second as sources of information. It is the latter quality that has been the traditional motivation behind the quest for surrogates, and we will adhere to this tradition in this paper.

Question: Examples of good and bad surrogate endpoints?

Data Fusion: Type Signatures

Hypothesis: Type of dataset = (population (?), experimental vs observation, sampling, measured)

Hypothesis: Data fusion problem :: heterogeneous datasets -> Maybe target dataset of some type.

Observation: Causal inference :: dataset of type (population a, observational, sampling s, measured v) -> dataset of type (population a, experimental, sampling s, measured v).

For example, P(y|do(x)) is of the type (population a, experimental, sampling s (whatever it may be), {x,y}).

Hypothesis: It is a dataset. You could use it to infer P(y,z|do(x)) (depending on the graph).

Observation: Z-identifiability :: dataset of type (population a, experimental, sampling s, measured z) -> dataset of type (population a, experimental, sampling s, measured x)

Observation: Transportability :: dataset of type (population a, experimental regime e, sampling s, measured v) -> dataset of type (population b, experimental regime e, sampling s, measured v).

Observation: Sampling selection bias :: dataset of type (population a, experimental regime e, sampling s, measured v) -> dataset of type (population a, experimental regime e, {}, measured v).

Hypothesis: Type of dataset = (population (?), experimental vs observation, sampling, measured, causal graph, mechanisms).
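A rough sketch of the “type of a dataset” hypothesis above as a data structure; the field names and the Regime enum are my own invention, not notation from the papers:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import FrozenSet, Optional

class Regime(Enum):
    OBSERVATIONAL = auto()
    EXPERIMENTAL = auto()

@dataclass(frozen=True)
class DatasetType:
    """Rough 'type' of a dataset for data-fusion bookkeeping (a sketch of the
    hypothesis above, not Bareinboim and Pearl's formal machinery)."""
    population: str
    regime: Regime
    sampling: Optional[str]            # None = no selection mechanism recorded
    measured: FrozenSet[str]
    intervened: FrozenSet[str] = field(default_factory=frozenset)

# The data-fusion question, phrased as a type query: can datasets of these types
# be combined to produce a dataset of the target type?
sources = [
    DatasetType("LA", Regime.OBSERVATIONAL, "hospital-sample", frozenset({"X", "Z", "Y"})),
    DatasetType("LA", Regime.EXPERIMENTAL, None, frozenset({"Z", "Y"}), frozenset({"Z"})),
]
target = DatasetType("NYC", Regime.EXPERIMENTAL, None, frozenset({"X", "Y"}), frozenset({"X"}))
print(target)
```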

Question: What about selector diagrams? How do you handle the fact that you have multiple models (but with the same graph)?

Bayesian Inference vs Data Fusion

Question: What about general inference?

Hypothesis: Probabilistic inference :: dataset of type (population a, observational, sampling s, measured v) -> dataset of type (population a, observational, sampling s, measured v + y).

Observation: Bayesian inference :: prior distribution over hypotheses, evidence -> posterior distribution over hypotheses.

Question: What is the difference between Bayesian inference and data fusion?

Potential Tests

Hypothesis: Assumptions <- areas where they jumped from P to P*.

Question: From where do you get feedback for your transportability formula?

Test: Can you handle the age and skill examples using your idea of two partially-equivalent models?

Surrogate Endpoints Book (2005)

Introduction

“True” endpoint can be costly, difficult, or time-consuming to measure.

Question: Difference between surrogate markers and surrogate endpoints?

If you can measure stress through urine samples, why not just use that?

Hypothesis: Could take too long (like in cancer survival).

Hypothesis: Surrogate endpoint is for a clinical study.

Hypothesis: You can also use surrogate endpoints in clinical trials to screen out unpromising treatments.

Hypothesis: Shorter trials also limit non-compliance and missing data.

Hypothesis: Surrogate endpoints can help detect rare or adverse late effects of treatment, which a clinical trial might not do.

Why not the true endpoints? In cancer, survival “may lack sensitivity to true therapeutic advances, it may be confounded with competing risks and second-line treatments, and it is observed late”.

Motivating Example

The FDA approved two drugs - encainide and flecainide - believing that they would reduce the rate of death due to cardiac complications. However, a clinical trial showed that the death rate was more than twice that of placebo. (Another drug - moricizine - also increased risk.)

Successful surrogate endpoints - HIV - CD4 blood count, HAART, and viral load.

Evaluation of Surrogate Endpoints

evaluation :: true endpoints, surrogate endpoints -> Bool

Accelerated approval - for diseases where no effective therapies exist

example: cancer - “If the achievement of a complete remission has indeed a major impact on prognosis in hematological malignancies (Armitage 1993, The International Non-Hodgkin’s Lymphoma Prognostic Factors Project 1993, Kantarjian et al. 1995), the relationship between tumor response and survival duration is far less clear in solid tumors, even though the shrinkage of metastatic measurable masses has long been the cornerstone of the development of cytotoxic therapies”

Idea: “proportional explained” - proportion of treatment effect “mediated” by the surrogate.

Hypothesis: Use surrogate endpoint along with eventual measurement of true endpoint.

Hypothesis: Look at joint distribution of surrogate endpoint and clinical endpoint.

Types of surrogate endpoints: binary, categorical, continuous, censored continuous, longitudinal, multiple longitudinal.

Surrogate Markers: Statistics

Support for surrogate endpoints: biological plausibility, success in clinical trials (how well it can predict), risk-benefit and public health considerations. (Has to have all three.)

In epidemiological studies, a useful surrogate marker is a causal factor for the disease of interest, not merely a correlated factor. As Fleming (1996) stated, “a correlate does not a surrogate make.”

Definition: Sensitivity (aka recall) = T good S good / (T good S bad + T good S good)

i.e., true positive / actual positives.

Definition: Specificity = true negative / actual negatives

aka negative recall. Of all the cases where treatment would have failed, how many did you catch as true negative?

Hypothesis: For the surrogate to be “useful”, both sensitivity and specificity have to be close to 1 (area under curve should be 1).

Definition: Relative risk = (a/(a+b)) / (c/(c+d))

(a/(a+b)) = precision, i.e., how many of your positive labels were right? Another way: % of correct positive labels.

(c/(c+d)) = false omission rate (1 - negative predictive value), i.e., how many of your negative labels were wrong? Another way: % of wrong negative labels.

If % of correct positive labels increases, then relative risk will increase. If % of wrong negative labels decreases, then relative risk will increase.

Hypothesis: We want relative risk to be high.

So, we want 1/RR to be low.

Definition: Attributable proportion AP = SE / (1 - 1/RR). (???)

Hypothesis: For the surrogate marker to be “successful”, AP should be close to 1.
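A small worked example of the quantities above from a 2x2 table of surrogate calls (S) against the true endpoint (T), with invented counts, using a = S+T+, b = S+T-, c = S-T+, d = S-T- as in the relative-risk definition:

```python
# 2x2 table of surrogate call (S) vs true endpoint (T), invented counts:
#               T benefit   T no benefit
# S predicts +      a=80         b=20
# S predicts -      c=10         d=90
a, b, c, d = 80, 20, 10, 90

sensitivity = a / (a + c)            # recall: P(S+ | T+)
specificity = d / (b + d)            # negative recall: P(S- | T-)
precision = a / (a + b)              # P(T+ | S+), the numerator of RR
false_omission = c / (c + d)         # P(T+ | S-), the denominator of RR
relative_risk = precision / false_omission

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"RR={relative_risk:.1f} (1/RR={1/relative_risk:.2f})")
```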

Observation: Surrogate validation: “Schatzkin, Freedman, and colleagues proposed strategies for determining whether a biomarker is a valid surrogate for a disease of interest, for instance whether human papillomavirus infection is a valid surrogate for cervical dysplasia.”

Biomarkers

Types of biomarkers: valid surrogates (blood pressure); candidate surrogates thought to reflect the pathologic process (brain appearance in Alzheimer’s disease); reflect drug action but of uncertain relation to clinical outcome; still more remote from clinical benefit endpoint.

Under intervention:

  1. Unreliable interaction between biomarker and the treatment intervention - Biomarker is of no value as a surrogate endpoint - Prostate-specific antigen (PSA) is a useful biomarker for prostate cancer detection but unreliable as an indicator of treatment response

  2. The full effect of the intervention is observed through the biomarker assessment - Biomarker is an ideal surrogate endpoint - None known at present

  3. Intervention affects the endpoint and the biomarker independently; only a proportion of the treatment effect is captured by the surrogate endpoint - Biomarker has value as a surrogate endpoint but explains only a part of the treatment effect - Most established surrogate endpoints (e.g., development of opportunistic infections with HIV anti-virals and mortality)

  4. Intervention affects the biomarker favorably but the well-state and disease unfavorably - Biomarker is of little practical use as a surrogate endpoint but may have utility in exploratory studies - Suppression of ventricular ectopy as a biomarker of fatal arrhythmia following myocardial infarctions (CAST trial)

Global intervention assessment = testing biomarkers for efficacy and toxicity and thus narrowing down the surrogate markers, and then evaluating patient benefit for them.

Surrogate Endpoints

Ideas

Hypothesis: The surrogate end-points don’t have to be mediators. They could be effects of the target variable.

For example, the final answer to a math problem (say, 74.5) is strong evidence that you’ve actually understood the relevant technique.

(But a post-outcome variable will take too long for a target variable like death. Defeats the whole point of the surrogate endpoint.)

http://blogs.sciencemag.org/pipeline/archives/2017/10/13/a-painful-unacceptable-lack-of-DATA

https://www.healthline.com/health-news/why-dont-more-new-cancer-drugs-help-patients-live-longer#5 - could have data sets

http://markets.businessinsider.com/news/stocks/Intercept-Announces-Positive-Results-from-Phase-2-AESOP-Trial-Evaluating-OCA-for-the-Treatment-of-Patients-with-Primary-Sclerosing-Cholangitis-at-The-Liver-Meeting-2017-1005365523

How the Massachusetts government is arguing for a closed formulary based on surrogate endpoints: https://www.huffingtonpost.com/entry/a-bold-step-to-control-prescription-drug-prices_us_59f1e5d0e4b09812b938c71e

https://dial.uclouvain.be/pr/boreal/en/object/boreal%3A38354

Created: June 22, 2017
Last modified: November 3, 2017
Status: in-progress notes
Tags: notes, eb
