Data: big, small, and meta

When I read this New York Times piece back in August, I was in the midst of preparation and training for data collection at rural health facilities in Zambia. The Times piece profiles a group called Global Pulse that is doing good work on the 'big data' side of global health:

The efforts by Global Pulse and a growing collection of scientists at universities, companies and nonprofit groups have been given the label “Big Data for development.” It is a field of great opportunity and challenge. The goal, the scientists involved agree, is to bring real-time monitoring and prediction to development and aid programs. Projects and policies, they say, can move faster, adapt to changing circumstances and be more effective, helping to lift more communities out of poverty and even save lives.

Since I was gearing up for 'field work' (more on that here; I'll get to it soon), I was struck at the time by the very different challenges one faces at the other end of the spectrum. Call it small data? And I connected the Global Pulse profile with this, by Wayan Vota, from just a few days before:

The Sneakernet Reality of Big Data in Africa

When I hear people talking about “big data” in the developing world, I always picture the school administrator I met in Tanzania and the reality of sneakernet data transmissions processes.

The school-level administrator has more data than he knows what to do with. Years and years of student grades recorded in notebooks – the hand-written on paper kind of notebooks. Each teacher records her student attendance and grades in one notebook, which the principal then records in his notebook. At the local district level, each principal’s notebook is recorded into a master dataset for that area, which is then aggregated at the regional, state, and national level in even more hand-written journals... Finally, it reaches the Minister of Education as a printed-out computer-generated report, compiled by ministerial staff from those journals that finally make it to the ministry, and are not destroyed by water, rot, insects, or just plain misplacement or loss. Note that nowhere along the way is this data digitized, and even at the ministerial level, the data isn’t necessarily deeply analyzed or shared widely....

And to be realistic, until countries invest in this basic, unsexy, and often ignored level of infrastructure, we’ll never have “big data” nor Open Data in Tanzania or anywhere else. (Read the rest here.)

Right on. And sure enough two weeks later I found myself elbow-deep in data that looked like this -- "Sneakernet" in action:

In many countries quite a lot of data -- of varying quality -- exists, but it's often formatted like the above. Optimistically, it may get used for local decisions, and eventually for high-level policy decisions when it's months or years out of date. There's a lot of hard, good work being done to improve these systems (more often by residents of low-income countries, sometimes by foreigners), but still far too little. This data is certainly primary, in the sense that it was collected on individuals, or by facilities, or about communities, but there are huge problems with quality, and with the sneakernet by which it gets back to policymakers, researchers, and (sometimes) citizens.

For the sake of quick reference, I keep a folder on my computer that has -- for each of the countries I work in -- most of the major recent ultimate sources of nationally-representative health data. All too often the only high-quality ultimate source is the most recent Demographic and Health Survey, surely one of the greatest public goods provided by the US government's aid agency. (I think I'm paraphrasing Angus Deaton here, but can't recall the source.) When I spent a summer doing epidemiology research with the New York City Department of Health and Mental Hygiene, I was struck by just how many rich data sources there were to draw on, at least compared to low-income countries. Very often there just isn't much primary data on which to build.

On the other end of the spectrum is what you might call the metadata of global health. When I think about the work the folks I know in global health -- classmates, professors, acquaintances, and occasionally though not often me -- do day to day, much of it is generating metadata. This is research or analysis derived from the primary data, and thus reliant on its quality. It's usually smart, almost always well-intentioned, and often well-packaged, but this towering edifice of effort is erected over a foundation of primary data; the metadata sometimes gives the appearance of being primary, but when you dig down, the sources often point back to the same one or three ultimate data sources.

That's not to say that generating this metadata is bad: for instance, modeling impacts of policy decisions given the best available data is still the best way to sift through competing health policy priorities if you want to have the greatest impact. Or a more cynical take: the technocratic nature of global health decision-making requires that we either have this data or, in its absence, impute it. But regardless of the value of certain targeted bits of the metadata, there's the question of the overall balance of investment in primary vs. secondary-to-meta data, and my view -- somewhat ironically derived entirely from anecdotes -- is that we should be investing a lot more in the former.

One way to frame this trade-off is to ask, when considering a research project or academic institute or whatnot, whether the money spent on that project would yield more value if it were spent instead on training data collectors and statistics offices, or on supporting primary data collection (e.g., funding household surveys) in low-income countries. I think in many cases the answer will be clear, perhaps to everyone except those directly generating the metadata.

That does not mean that none of this metadata is worthwhile. On the contrary, some of it is absolutely essential. But a lot isn't, and there are opportunity costs to any investment: a choice between funding data collection and statistics systems in low-income countries, and funding research projects where most of the money ultimately stays in high-income countries and the causal pathway to impact is much less direct.

Looping back to the original link, one way to think of the 'big data' efforts like Global Pulse is that they're not metadata at all, but an attempt to find new sources of primary data. Because there are so few good sources of data that get funded, or that filter through the sneakernet, the hope is that mobile phone usage and search terms and whatnot can be mined to give us entirely new primary data, on which to build new pyramids of metadata, and with which to make policy decisions, skipping the sneakernet altogether. That would be pretty cool if it works out.

Slow down there

Max Fisher has a piece in the Washington Post presenting "The amazing, surprising, Africa-driven demographic future of the Earth, in 9 charts". While he notes that the numbers are "just projections and could change significantly under unforeseen circumstances," the graphs don't give any sense of the huge uncertainty involved in projecting trends out 90 years into the future. Here's the first graph:

 

The population growth in Africa here is a result of much higher fertility rates, and a projected slower decline in those rates.

But those projected rates have huge margins of error. Here's the total fertility rate, or "the average number of children that would be born to a woman over her lifetime" for Nigeria, with confidence intervals that give you a sense of just how little we know about the future:

That's a lot of uncertainty! (Image from here, which I found thanks to a commenter on the WaPo piece.)

It's also worth noting that if you had made similar projections 87 years ago, in 1926, it would have been hard to anticipate World War II, hormonal birth control, and AIDS, amongst other things.

(Not) knowing it all along

David McKenzie is one of the guys behind the World Bank's excellent and incredibly wonky Development Impact blog. He came to Princeton to present on a new paper with Gustavo Henrique de Andrade and Miriam Bruhn, "A Helping Hand or the Long Arm of the Law? Experimental evidence on what governments can do to formalize firms" (PDF). The subject matter -- trying to get small, informal companies to register with the government -- is outside my area of expertise. But I thought there were a couple of methodologically interesting bits. First, there's an ethical dimension, as one of the several interventions tested was increasing the likelihood that a firm would be visited by a government inspector (i.e., that the law would be enforced). From page 10:

In particular, if a firm owner were interviewed about their formality status, it may not be considered ethical to then use this information to potentially assign an inspector to visit them. Even if it were considered ethical (since the government has a right to ask firm owners about their formality status, and also a right to conduct inspections), we were still concerned that individuals who were interviewed in a baseline survey and then received an inspection may be unwilling to respond to a follow-up. Therefore a listing stage was done which did not involve talking to the firm owner.

In other words, all their baseline data was collected without actually talking to the firms they were studying -- check out the paper for more on how they did that.

Second, they did something that could (and maybe should) be incorporated into many evaluations with relative ease. Because findings often seem obvious after we hear them, McKenzie et al. asked the government staff whose program they were evaluating to estimate what the impact would be before the results were in. Here's that section (emphasis added):

A standard question with impact evaluations is whether they deliver new knowledge or merely formally confirm the beliefs that policymakers already have (Groh et al, 2012). In order to measure whether the results differ from what was anticipated, in January 2012 (before any results were known) we elicited the expectations of the Descomplicar [government policy] team as to what they thought the impacts of the different treatments would be. Their team expected that 4 percent of the control group would register for SIMPLES [the formalization program] between the baseline and follow-up surveys. We see from Table 7 that this is an overestimate...

They then expected the communication only group to double this rate, so that 8 percent would register, that the free cost treatment would lead to 15 percent registering, and that the inspector treatment would lead to 25 percent registering.... The zero or negative impacts of the communication and free cost treatments therefore are a surprise. The overall impact of the inspector treatment is much lower than expected, but is in line with the IV estimates, suggesting the Descomplicar team have a reasonable sense of what to expect when an inspection actually occurs, but may have overestimated the amount of new inspections that would take place. Their expectation of a lack of impact for the indirect inspector treatment was also accurate.

This establishes exactly what in the results was a surprise and what wasn't. It might also make sense for researchers to ask both the policymakers they're working with and some group of researchers who study the same subject to give such responses; it would certainly help make a case for the value of (some) studies.

This beautiful graphic is not really that useful

This beautiful infographic from the excellent blog Information is Beautiful has been making the rounds. You can see a bigger version here, and it's worth poking around for a bit. The creators take all deaths from the 20th century (drawing from several sources) and represent their relative contribution with circles:

I appreciate their footnote that says the graphic has "some inevitable double-counting, broad estimation and ball-park figures." That's certainly true, but the inevitably approximate nature of these numbers isn't my beef.

The problem is that I don't think raw numbers of deaths tell us very much, and can actually be quite misleading. Someone who saw only this infographic might well end up less well-informed than if they didn't see it. Looking at the red circles you get the impression that non-communicable and infectious diseases were roughly equivalent in importance in the 20th century, followed by "humanity" (war, murder, etc) and cancer.

The root problem is that mortality is inevitable for everyone, everywhere. This graphic lumps together pneumonia deaths at age 1 with car accidents at age 20, and cancer deaths at 50 with heart disease deaths at 80. We typically don't (and I would argue shouldn't) assign the same weight to a death in childhood or the prime of life as to one that comes at the end of a long, satisfying life. The end result is that this graphic greatly overemphasizes the importance of non-communicable diseases in the 20th century -- that's the impression most laypeople will walk away with.

A more useful graphic might use the same circles to show the years of life lost (or something like DALYs or QALYs), because those get a bit closer to what we care about. No single number is all that great on its own, so we get a better understanding by looking at several different outcomes (which is one limitation of any single visualization). But I think raw mortality numbers are particularly misleading.
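
To make the distinction concrete, here's a minimal sketch with invented numbers (not the infographic's data) of how raw death counts and years of life lost can rank causes very differently. The reference age of 86 is just a stand-in for a standard reference life expectancy.

```python
# Invented numbers, purely for illustration -- not the infographic's data.
# Years of life lost (YLL) weight each death by how early it occurred,
# relative to a reference life expectancy.
REFERENCE_AGE = 86  # stand-in for a standard reference life expectancy

# (cause, deaths, typical age at death) -- all hypothetical
causes = [
    ("childhood pneumonia", 1_000_000, 1),
    ("road injuries",         500_000, 20),
    ("cancer",              2_000_000, 55),
    ("heart disease",       3_000_000, 80),
]

for cause, deaths, age in causes:
    yll = deaths * max(REFERENCE_AGE - age, 0)
    print(f"{cause:20s} deaths: {deaths:>9,}   YLL: {yll:>13,}")
```

By raw deaths, heart disease dominates; weighted by years of life lost, the childhood deaths loom far larger. That reordering is exactly the point.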

To be fair, this graphic was commissioned by Wellcome as "artwork" for a London exhibition, so maybe it should be judged by a different standard...

First responses to DEVTA roll in

In my last post I highlighted the findings from the DEVTA trial of deworming and Vitamin A supplementation in India, noting that the Vitamin A results would be more controversial. I said I expected commentaries over the coming months, but we didn't have to wait that long after all. First is a BBC Health Check program featuring a discussion of DEVTA with Richard Peto, one of the study's authors. It's for a general audience so it doesn't get very technical, and because of that it really grated when they described this as a "clinical trial," as that has certain connotations of rigor that aren't reflected in the design of the study. If DEVTA is a clinical trial, then so was

Peto also says there were two reasons for the massive delay in publishing the trial: 1) time to check things and "get it straight," and 2) that they were "afraid of putting up a trial with a false negative." [An aside for those interested in publication bias issues: can you imagine an author with strong positive findings ever saying the same thing about avoiding false positives?!]

Peto ends by sounding fairly neutral re: Vitamin A (portraying himself in a middle position between advocates in favor and skeptics opposed) but acknowledges that with their meta-analysis results Vitamin A is still "cost-effective by many criteria."

Second is a commentary in The Lancet by Al Sommer, Keith West, and Reynaldo Martorell. A little history: Sommer ran the first big Vitamin A trials in Sumatra (published in 1986) and is the former dean of the Johns Hopkins School of Public Health. (Sommer's long-term friendship with Michael Bloomberg, who went to Hopkins as an undergrad, is also one reason the latter is so big on public health.) For more background, here's a recent JHU story on Sommer receiving a $1 million research prize in part for his work on Vitamin A.

Part of their commentary is excerpted below, with my highlights in bold:

But this was neither a rigorously conducted nor acceptably executed efficacy trial: children were not enumerated, consented, formally enrolled, or carefully followed up for vital events, which is the reason there is no CONSORT diagram. Coverage was ascertained from logbooks of overworked government community workers (anganwadi workers), and verified by a small number of supervisors who periodically visited randomly selected anganwadi workers to question and examine children who these workers gathered for them. Both anganwadi worker self-reports, and the validation procedures, are fraught with potential bias that would inflate the actual coverage.

To achieve 96% coverage in Uttar Pradesh in children found in the anganwadi workers' registries would have been an astonishing feat; covering 72% of children not found in the anganwadi workers' registries seems even more improbable. In 2005—06, shortly after DEVTA ended, only 6·1% of children aged 6—59 months in Uttar Pradesh were reported to have received a vitamin A supplement in the previous 6 months according to results from the National Family Health Survey, a national household survey representative at national and state level.... Thus, it is hard to understand how DEVTA ramped up coverage to extremely high levels (and if it did, why so little of this effort was sustained). DEVTA provided the anganwadi workers with less than half a day's training and minimal if any incentive.

They also note that the study funding was minimal compared with more rigorous studies, which may be an indication of its quality. And as an indication that there will almost certainly be alternative meta-analyses that weight the different studies differently:

We are also concerned that Awasthi and colleagues included the results from this study, which is really a programme evaluation, in a meta-analysis in which all of the positive studies were rigorously designed and conducted efficacy trials and thus represented a much higher level of evidence. Compounding the problem, Awasthi and colleagues used a fixed-effects analytical model, which dramatically overweights the results of their negative findings from a single population setting. The size of a study says nothing about the quality of its data or the generalisability of its findings.
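
To see mechanically why a fixed-effects model lets one very large trial dominate, here's a toy inverse-variance pooling sketch. The effect sizes and standard errors below are invented, not the actual trial results.

```python
import math

# Invented log mortality ratios and standard errors -- not the real studies.
# Three smallish positive efficacy trials plus one very large null trial.
studies = [
    ("efficacy trial A", math.log(0.70), 0.15),
    ("efficacy trial B", math.log(0.75), 0.15),
    ("efficacy trial C", math.log(0.70), 0.20),
    ("huge null trial",  math.log(0.96), 0.04),
]

# Fixed-effects pooling: each study is weighted by 1 / SE^2, so the most
# precise (usually the largest) study dominates the pooled estimate.
weights = [1 / se ** 2 for _, _, se in studies]
total = sum(weights)
pooled = sum(w * log_rr for (_, log_rr, _), w in zip(studies, weights)) / total

for (name, log_rr, _), w in zip(studies, weights):
    print(f"{name:16s} RR = {math.exp(log_rr):.2f}   weight = {100 * w / total:4.1f}%")
print(f"pooled RR = {math.exp(pooled):.2f}")
```

In this toy example the single large trial carries roughly 85% of the weight and pulls the pooled ratio most of the way toward 1; a random-effects model, which allows the true effect to vary across settings, would give the smaller trials relatively more influence.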

I'm sure there will be more commentaries to follow. In my previous post I noted that I'm still trying to wrap my head around the findings, and I think that's still right. If I had time I'd dig into this a bit more, especially the relationship with the Indian National Family Health Survey. But for now I think it's safe to say that two parsimonious explanations for how to reconcile DEVTA with the prior research are emerging:

1. DEVTA wasn't all that rigorous and thus never achieved the high population coverage levels necessary to have a strong mortality impact; the mortality impact was attenuated by poor coverage, resulting in the absence of a statistically significant effect in line with prior results. Thus it shouldn't move our priors all that much. (Sommer et al. seem to be arguing for this.) Or,

2. There's some underlying change in the populations between the older studies and these newer studies that causes the effect of Vitamin A to decline -- this could be nutrition, vaccination status, shifting causes of mortality, etc. If you believe this, then you might discount the older studies accordingly.

(h/t to @karengrepin for the Lancet commentary.)

A massive trial, a huge publication delay, and enormous questions

It's been called the "largest clinical* trial ever": DEVTA (Deworming and Enhanced ViTamin A supplementation), a study of Vitamin A supplementation and deworming in over 2 million children in India, just published its results. "DEVTA" may mean "deity" or "divine being" in Hindi, but some global health experts and advocates will probably think these results come straight from the devil. Why? Because they call into question -- or at least attenuate -- our estimates of the effectiveness of some of the easiest, best "bang for the buck" interventions out there. Data collection was completed in 2006, but the results were just published in The Lancet. Why the massive delay? According to the accompanying discussion paper, it sounds like the delay was rooted in very strong resistance to the results after preliminary outcomes were presented at a conference in 2007. If it weren't for the repeated and very public shaming by the authors of recent Cochrane Collaboration reviews, we might not have the results even today. (Bravo again, Cochrane.)

So, about DEVTA. In short, this was a randomized 2x2 factorial trial, like so:

The results were published as two separate papers, one on Vitamin A and one on deworming, with an additional commentary piece:

The controversy is going to be more about what this trial didn't find than about what it did: the confidence interval on the Vitamin A study's mortality estimate (mortality ratio 0.96, 95% confidence interval of 0.89 to 1.03) is consistent with a mortality reduction as large as 11%, or as much as a 3% increase. The consensus from previous Vitamin A studies was mortality reductions of 20-30%, so this is a big surprise. Here's the abstract to that paper:

Background

In north India, vitamin A deficiency (retinol <0·70 μmol/L) is common in pre-school children and 2–3% die at ages 1·0–6·0 years. We aimed to assess whether periodic vitamin A supplementation could reduce this mortality.

Methods

Participants in this cluster-randomised trial were pre-school children in the defined catchment areas of 8338 state-staffed village child-care centres (under-5 population 1 million) in 72 administrative blocks. Groups of four neighbouring blocks (clusters) were cluster-randomly allocated in Oxford, UK, between 6-monthly vitamin A (retinol capsule of 200 000 IU retinyl acetate in oil, to be cut and dripped into the child’s mouth every 6 months), albendazole (400 mg tablet every 6 months), both, or neither (open control). Analyses of retinol effects are by block (36 vs 36 clusters).

The study spanned 5 calendar years, with 11 6-monthly mass-treatment days for all children then aged 6–72 months.  Annually, one centre per block was randomly selected and visited by a study team 1–5 months after any trial vitamin A to sample blood (for retinol assay, technically reliable only after mid-study), examine eyes, and interview caregivers. Separately, all 8338 centres were visited every 6 months to monitor pre-school deaths (100 000 visits, 25 000 deaths at ages 1·0–6·0 years [the primary outcome]). This trial is registered at ClinicalTrials.gov, NCT00222547.

Findings

Estimated compliance with 6-monthly retinol supplements was 86%. Among 2581 versus 2584 children surveyed during the second half of the study, mean plasma retinol was one-sixth higher (0·72 [SE 0·01] vs 0·62 [0·01] μmol/L, increase 0·10 [SE 0·01] μmol/L) and the prevalence of severe deficiency was halved (retinol <0·35 μmol/L 6% vs 13%, decrease 7% [SE 1%]), as was that of Bitot’s spots (1·4% vs 3·5%, decrease 2·1% [SE 0·7%]).

Comparing the 36 retinol-allocated versus 36 control blocks in analyses of the primary outcome, deaths per child-care centre at ages 1·0–6·0 years during the 5-year study were 3·01 retinol versus 3·15 control (absolute reduction 0·14 [SE 0·11], mortality ratio 0·96, 95% CI 0·89–1·03, p=0·22), suggesting absolute risks of death between ages 1·0 and 6·0 years of approximately 2·5% retinol versus 2·6% control. No specific cause of death was significantly affected.

Interpretation

DEVTA contradicts the expectation from other trials that vitamin A supplementation would reduce child mortality by 20–30%, but cannot rule out some more modest effect. Meta-analysis of DEVTA plus eight previous randomised trials of supplementation (in various different populations) yielded a weighted average mortality reduction of 11% (95% CI 5–16, p=0·00015), reliably contradicting the hypothesis of no effect.

Note that instead of just publishing these no-effect results and leaving the meta-analysis to a separate publication, the authors go ahead and do their own meta-analysis of DEVTA plus previous studies and report that -- much attenuated, but still positive -- effect in their conclusion. I think that's a fair approach, but it also reveals that the study's authors very much believe there are large Vitamin A mortality effects despite the outcome of their own study!

[The only media coverage I've seen of these results so far comes from the Times of India, which includes quotes from the authors and Abhijit Banerjee.]

To be honest, I don't know what to make of the inconsistency between these findings and previous studies, and am writing this post in part to see what discussion it generates. I imagine there will be more commentaries on these findings over the coming months, with some decrying the results and methodologies and others seeing vindication in them. In my view the best possible outcome is an ongoing concern for issues of external validity in biomedical trials.

What do I mean? Epidemiologists tend to think that external validity is less of an issue in randomized trials of biomedical interventions -- as opposed to behavioral, social, or organizational trials -- but this isn't necessarily the case. Trials of vaccine efficacy have shown quite different efficacy for the same vaccine (see BCG and rotavirus) in different locations, possibly due to differing underlying nutritional status or disease burdens. Our ability to interpret discrepant findings can only be as sophisticated as the available data allows, or as sophisticated as allowed by our understanding of the biological and epidemiologic mechanisms that matter on the pathway from intervention to outcome. We can't go back in time and collect additional information (think nutrition, immune response, baseline mortality, and so forth) on studies far in the past, but we can keep such issues in mind when designing trials moving forward.

All that to say, these results are confusing, and I look forward to seeing the global health community sort through them. Also, while the outcomes here (health outcomes) are different from those in the Kremer deworming study (education outcomes), I've argued before that lack of effect or small effects on the health side should certainly influence our judgment of the potential education outcomes of deworming.

*I think given the design it's not that helpful to call this a 'clinical' trial at all - but that's another story.

Alwyn Young just broke your regression

Alwyn Young -- the same guy whose paper carefully accounting for growth in East Asia was popularized by Krugman and sparked an enormous debate -- has been circulating a paper on African growth rates. Here's the 2009 version (PDF) and October 2012 version. The abstract of the latter paper:

Measures of real consumption based upon the ownership of durable goods, the quality of housing, the health and mortality of children, the education of youth and the allocation of female time in the household indicate that sub-Saharan living standards have, for the past two decades, been growing about 3.4 to 3.7 percent per annum, i.e. three and a half to four times the rate indicated in international data sets. (emphasis added)

The Demographic and Health Surveys are large-scale, nationally-representative surveys of health, family planning, and related topics that tend to ask the same questions across different countries and over long periods of time. They have major limitations, but in the absence of high-quality data from governments they're often the best source for national health data. The DHS doesn't collect much economic data, but they do ask about ownership of certain durable goods (like TVs, toilets, etc), and the answers to these questions are used to construct a wealth index that is very useful for studies of health equity -- something I'm taking advantage of in my current work. (As an aside, this excellent report from Measure DHS (PDF) describes the history of the wealth index.)
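
As a rough illustration of how those asset questions become a wealth index, here's a minimal sketch of the usual approach: take the first principal component of a set of household asset indicators and cut it into quintiles. The data below are simulated, and the actual DHS procedure involves many more variables and refinements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_households = 500

# Simulate a latent 'wealth' variable and 6 binary asset indicators
# (owns a TV, has a flush toilet, owns a radio, ...) that are more likely
# to equal 1 for wealthier households. Purely illustrative data.
latent_wealth = rng.normal(size=n_households)
cutoffs = rng.normal(size=6)
own_prob = 1 / (1 + np.exp(-(latent_wealth[:, None] - cutoffs)))
assets = (rng.random((n_households, 6)) < own_prob).astype(float)

# Wealth index = first principal component of the (centered) asset matrix.
centered = assets - assets.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
index = centered @ vt[0]
if np.corrcoef(index, assets.sum(axis=1))[0, 1] < 0:
    index = -index  # orient the index so that owning more assets means a higher score

# Split households into wealth quintiles, as the DHS-style index is typically used.
quintile = np.digitize(index, np.quantile(index, [0.2, 0.4, 0.6, 0.8])) + 1

print("correlation with the latent wealth:", np.corrcoef(index, latent_wealth)[0, 1])
print("households per quintile:", np.bincount(quintile)[1:])
```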

What Young has done is to take this durable asset data from many DHS surveys and try to estimate a measure of GDP growth from actually-measured data, rather than the (arguably) sketchier methods typically used to get national GDP numbers in many African countries. Not all countries are represented at any given point in time in the body of DHS data, which is why he ends up with a very-unbalanced panel data set for "Africa," rather than being able to measure growth rates in individual countries. All the data and code for the paper are available here.

Young's methods themselves are certain to spark ongoing debate (see commentary and links from Tyler Cowen and Chris Blattman), so this is far from settled -- and may well never be. The takeaway is probably not that Young's numbers are right so much as that there's a lot of data out there that we shouldn't trust very much, and that transparency about the sources and methodology behind data, official or not, is very helpful. I just wanted to raise one question: if Young's data is right, just how many published papers are wrong?

There is a huge literature on cross-country growth empirics. A Google Scholar search for "cross-country growth Africa" turns up 62,400 results. While not all of these papers are using African countries' GDPs as an outcome, a lot of them are. This literature has many failings, which have been duly pointed out by Bill Easterly and many others, to the extent that an up-and-coming economist is likely to steer away from this sort of work for fear of being mocked. Relatedly, in Acemoglu and Robinson's recent and entertaining take-down of Jeff Sachs, one of their criticisms is that Sachs only knows something because he's been running "kitchen sink growth regressions."

Young's paper just adds more fuel to that fire. If African GDP growth has really been three and a half to four times the rate the official data indicate, then every single paper that uses the old GDP numbers is now even more suspect.

Why we should lie about the weather (and maybe more)

Nate Silver (who else?) has written a great piece on weather prediction -- "The Weatherman is Not a Moron" (NYT) -- that covers both the proliferation of data in weather forecasting, and why the quantity of data alone isn't enough. What intrigued me though was a section at the end about how to communicate the inevitable uncertainty in forecasts:

...Unfortunately, this cautious message can be undercut by private-sector forecasters. Catering to the demands of viewers can mean intentionally running the risk of making forecasts less accurate. For many years, the Weather Channel avoided forecasting an exact 50 percent chance of rain, which might seem wishy-washy to consumers. Instead, it rounded up to 60 or down to 40. In what may be the worst-kept secret in the business, numerous commercial weather forecasts are also biased toward forecasting more precipitation than will actually occur. (In the business, this is known as the wet bias.) For years, when the Weather Channel said there was a 20 percent chance of rain, it actually rained only about 5 percent of the time.

People don’t mind when a forecaster predicts rain and it turns out to be a nice day. But if it rains when it isn’t supposed to, they curse the weatherman for ruining their picnic. “If the forecast was objective, if it has zero bias in precipitation,” Bruce Rose, a former vice president for the Weather Channel, said, “we’d probably be in trouble.”

My thought when reading this was that there are actually two different reasons why you might want to systematically adjust reported percentages (i.e., fib a bit) when trying to communicate the likelihood of bad weather.

But first, an aside on what public health folks typically talk about when they talk about communicating uncertainty: I've heard a lot (in classes, in blogs, and in Bad Science, for example) about reporting absolute risks rather than relative risks, and about avoiding other ways of communicating risks that generally mislead. What people don't usually discuss is whether the point estimates themselves should ever be adjusted; rather, we concentrate on how to best communicate whatever the actual values are.

Now, back to weather. The first reason you might want to adjust the reported probability of rain is that people are rain averse: they care more strongly about getting rained on when it wasn't predicted than vice versa. It may be perfectly reasonable for people to feel this way, and so why not cater to their desires? This is the reason described in the excerpt from Silver's article above.

Another way to describe this bias is that most people would prefer to minimize Type II Error (false negatives) at the expense of having more Type I error (false positives), at least when it comes to rain. Obviously you could take this too far -- reporting rain every single day would completely eliminate Type II error, but it would also make forecasts worthless. Likewise, with big events like hurricanes the costs of Type I errors (wholesale evacuations, cancelled conventions, etc) become much greater, so this adjustment would be more problematic as the cost of false positives increases. But generally speaking, the so-called "wet bias" of adjusting all rain prediction probabilities upwards might be a good way to increase the general satisfaction of a rain-averse general public.

The second reason one might want to adjust the reported probability of rain -- or some other event -- is that people are generally bad at understanding probabilities. Luckily though, people tend to be bad at estimating probabilities in surprisingly systematic ways! Kahneman's excellent (if too long) book Thinking, Fast and Slow covers this at length. The best summary of these biases that I could find through a quick Google search was from Lee Merkhofer Consulting:

 Studies show that people make systematic errors when estimating how likely uncertain events are. As shown in [the graph below], likely outcomes (above 40%) are typically estimated to be less probable than they really are. And, outcomes that are quite unlikely are typically estimated to be more probable than they are. Furthermore, people often behave as if extremely unlikely, but still possible outcomes have no chance whatsoever of occurring.

The graph from that link is a helpful if somewhat stylized visualization of the same biases:

In other words, people think that likely events (in the 30-99% range) are less likely to occur than they are in reality, that unlikely events (in the 1-30% range) are more likely to occur than they are in reality, and that extremely unlikely events (very close to 0%) won't happen at all.

My recollection is that these biases can be a bit different depending on whether the predicted event is bad (getting hit by lightning) or good (winning the lottery), and that the familiarity of the event also plays a role. Regardless, with something like weather, where most events are within the realm of lived experience and most of the probabilities lie within a reasonable range, the average bias could probably be measured pretty reliably.

So what do we do with this knowledge? Think about it this way: we want to increase the accuracy of communication, but there are two different points in the communications process where you can measure accuracy. You can care about how accurately the information is communicated from the source, or how well the information is received. If you care about the latter, and you know that people have systematic and thus predictable biases in perceiving the probability that something will happen, why not adjust the numbers you communicate so that the message -- as received by the audience -- is accurate?

Now, some made up numbers: Let's say the real chance of rain is 60%, as predicted by the best computer models. You might adjust that up to 70% if that's the reported risk that makes people perceive a 60% objective probability (again, see the graph above). You might then adjust that percentage up to 80% to account for rain aversion/wet bias.
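
Purely as a toy illustration of that two-step adjustment (both the perception correction and the rain-aversion bump below are invented magnitudes, not estimates from any study):

```python
def reported_rain_probability(true_p, aversion_bump=0.10):
    """Toy adjustment with made-up magnitudes: step 1 counteracts the tendency
    to under-weight likely events and over-weight unlikely ones; step 2 adds
    a 'wet bias' bump for rain aversion."""
    if true_p >= 0.30:       # likely events tend to be perceived as less likely
        perceived = true_p + 0.10
    else:                    # unlikely events tend to be perceived as more likely
        perceived = true_p - 0.05
    return round(min(max(perceived + aversion_bump, 0.0), 1.0), 2)

print(reported_rain_probability(0.60))  # 0.8, matching the made-up example above
```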

Here I think it's important to distinguish between technical and popular communication channels: if you're sharing raw data about the weather or talking to a group of meteorologists or epidemiologists then you might take one approach, whereas another approach makes sense for communicating with a lay public. For folks who just tune in to the evening news to get tomorrow's weather forecast, you want the message they receive to be as close to reality as possible. If you insist on reporting the 'real' numbers, you actually draw your audience further from understanding reality than if you fudged them a bit.

The major and obvious downside to this approach is that if people know this is happening, it won't work, or they'll be mad that you lied -- even though you were only lying to better communicate the truth! One possible way of getting around this is to describe the numbers as something other than percentages: using some made-up index that sounds enough like a percentage to convince the layperson, while also being open to detailed examination by those who are interested.

For instance, we all know the heat index and wind chill aren't the same as temperature, but rather represent just how hot or cold the weather actually feels. Likewise, we could report something like "Rain Risk" or a "Rain Risk Index" that accounts for known biases in risk perception and rain aversion. The weatherman would report a Rain Risk of 80% when the actual probability of rain is just 60%. This would give recipients more useful information, while also maintaining technical honesty and some level of transparency.

I care a lot more about health than about the weather, but I think predicting rain is a useful device for talking about the same issues of probability perception in health, for several reasons. First off, the probabilities in rain forecasting are much closer to the realm of lived experience than the probabilities of rare events that come up so often in epidemiology. Secondly, the ethical stakes feel a bit lower when writing about lying about the weather than when, say, suggesting physicians should systematically mislead their patients, even if the ultimate aim of the adjustment is to better inform them.

I'm not saying we should walk back all the progress we've made in terms of letting patients and physicians make decisions together, rather than the latter withholding information and paternalistically making decisions for patients based on the physician's preferences rather than the patient's. (That would be silly in part because physicians share their patients' biases.) The idea here is to come up with better measures of uncertainty -- call it adjusted risk or risk indexes or weighted probabilities or whatever -- that help us bypass humans' systematic flaws in understanding uncertainty.

In short: maybe we should lie to better tell the truth. But be honest about it.

When randomization is strategic

Here's a quote from Tom Yates on his blog Sick Populations about a speech he heard by Rachel Glennerster of J-PAL:

Glennerster pointed out that the evaluation of PROGRESA, a conditional cash transfer programme in Mexico and perhaps the most famous example of randomised evaluation in social policy, was instigated by a Government who knew they were going to lose the next election. It was a way to safeguard their programme. They knew the next Government would find it hard to stop the trial once it was started and were confident the evaluation would show benefit, again making it hard for the next Government to drop the programme. Randomisation can be politically advantageous.

I think I read this about Progresa / Oportunidades before but had forgotten it, and thus it's worth re-sharing. The way in which Progresa was randomized (different areas were stepped into the program, so there was a cohort of folks who got it later than others, but all the high need areas got it within a few years) made this more politically feasible as well. I think this situation, in which a government institutes a study of a program to keep it alive through subsequent changes of government, will probably be a less common tactic than its opposite, in which a government designs an evaluation of a popular program that a) it thinks doesn't work, b) it wants to cut, and c) the public otherwise likes, just to prove that it should be cut -- but only time will tell.

A misuse of life expectancy

Jared Diamond is going back and forth with Acemoglu and Robinson over his review of their new book, Why Nations Fail. The exchange is interesting in and of itself, but I wanted to highlight one passage from Diamond's response:

The first point of their four-point letter is that tropical medicine and agricultural science aren’t major factors shaping national differences in prosperity. But the reasons why those are indeed major factors are obvious and well known. Tropical diseases cause a skilled worker, who completes professional training by age thirty, to look forward to, on the average, just ten years of economic productivity in Zambia before dying at an average life span of around forty, but to be economically productive for thirty-five years until retiring at age sixty-five in the US, Europe, and Japan (average life span around eighty). Even while they are still alive, workers in the tropics are often sick and unable to work. Women in the tropics face big obstacles in entering the workforce, because of having to care for their sick babies, or being pregnant with or nursing babies to replace previous babies likely to die or already dead. That’s why economists other than Acemoglu and Robinson do find a significant effect of geographic factors on prosperity today, after properly controlling for the effect of institutions.

I've added the bolding to highlight an interpretation of what life expectancy means that is wrong, but all too common.

It's analogous to something you may have heard about ancient Rome: since life expectancy was somewhere in the 30s, the Romans who lived to be 40 or 50 or 60 were incredibly rare and extraordinary. The problem is that life expectancy -- by which we typically mean life expectancy at birth -- is heavily skewed by infant mortality, or deaths under one year of age. Once you get to age five you're generally out of the woods -- compared to the super-high mortality rates common for infants (less than one year old) and children (less than five years old). While it's true that there were fewer old folks in ancient Roman society, or -- to use Diamond's example -- modern Zambian society, the difference isn't nearly as pronounced as you might think given the differences in life expectancy.

Does this matter? And if so, why? One area where it's clearly important is Diamond's usage in the passage above: examining the impact of changes in life expectancy on economic productivity. Despite the life expectancy at birth of 38 years, a Zambian male who reaches the age of thirty does not just have eight years of life expectancy left -- it's actually 23 years!

Here it's helpful to look at life tables, which show mortality and life expectancy at different intervals throughout the lifespan. This WHO paper by Alan Lopez et al. (PDF) examining mortality between 1990-9 in 191 countries provides some nice data: page 253 is a life table for Zambia in 1999. We see that males have a life expectancy at birth of just 38.01 years, versus 38.96 for females (this was one of the lowest in the world at that time). If you look at that single number you might conclude, like Diamond, that a 30-year-old worker only has ~10 years of life left. But the life expectancy for those males remaining alive at age 30 (64.2% of the original birth cohort remains alive at this age) is actually 22.65 years. Similarly, the 18% of Zambians who reach age 65, retirement age in the US, can expect to live an additional 11.8 years, despite already having lived 27 years past the life expectancy at birth.
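
For anyone who hasn't worked with a life table, remaining life expectancy at a given age is just the person-years lived above that age divided by the number of survivors at that age, so heavy infant and child mortality drags down the at-birth figure without saying much about adults. Here's a stripped-down sketch with invented numbers (roughly in the neighborhood of the Zambia figures above, but not the actual WHO table):

```python
# Stripped-down abridged life table with invented values -- not the actual
# WHO/Zambia table. l_x = survivors at the start of each age interval,
# out of 100,000 live births; the last interval is open-ended.
ages = [0, 1, 5, 15, 30, 45, 60, 75]
l_x  = [100_000, 90_000, 83_000, 80_000, 64_000, 45_000, 25_000, 8_000]
mean_years_in_final_interval = 6   # assumed average years lived after age 75

def remaining_life_expectancy(i):
    """Person-years lived above the start of interval i, divided by survivors
    at that age, assuming deaths fall midway through each interval."""
    person_years = 0.0
    for j in range(i, len(ages) - 1):
        width = ages[j + 1] - ages[j]
        person_years += width * (l_x[j] + l_x[j + 1]) / 2
    person_years += l_x[-1] * mean_years_in_final_interval
    return person_years / l_x[i]

print(f"life expectancy at birth: {remaining_life_expectancy(0):.1f} years")
print(f"remaining life expectancy at 30: {remaining_life_expectancy(ages.index(30)):.1f} years")
```

With these made-up values, life expectancy at birth comes out just under 40 years, yet a 30-year-old still has roughly 25 more years -- the gap Diamond's reading misses.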

These numbers are still, of course, dreadful -- there's room for decreasing mortality at all stages of the lifespan. Diamond's correct in the sense that low life expectancy results in a much smaller economically active population. But he's incorrect when he estimates much more drastic reductions in the productive years that workers can expect once they reach their 20s, 30s, and 40s.

----

[Some notes: 1. The figures might be different if you limit it to "skilled workers" who aren't fully trained until age 30, as Diamond does; 2. I've also assumed that Diamond is working from general life expectancy, which was around 40 years, rather than from a particular study that showed 10 years of life expectancy at age 30 for some subset of skilled workers, possibly due to high HIV prevalence -- that seems possible but unlikely; 3. In these Zambia estimates, about 10% of males die before reaching one year of age, and over 17% before reaching five years of age. By contrast, between the ages of 15-20 only 0.6% of surviving males die, and you don't see mortality rates higher than the under-5 ones until above age 85!; and 4. Zambia is an unusual case because much of the poor life expectancy there is due to very high HIV/AIDS prevalence and mortality -- which actually does affect adult mortality rates and not just infant and child mortality rates. Despite this caveat, it's still true that Diamond's interpretation is off.]

Mimicking success

If you don't know what works, there can be an understandable temptation to try to create a picture that more closely resembles things that work. In some of his presentations on the dire state of student learning around the world, Lant Pritchett invokes the zoological concept of isomorphic mimicry: the adoption of the camouflage of organizational forms that are successful elsewhere to hide their actual dysfunction. (Think, for example, of a harmless snake that has the same size and coloring as a very venomous snake -- potential predators might not be able to tell the difference, and so they assume both have the same deadly qualities.) For our illustrative purposes here, this could mean in practice that some leaders believe that, since good schools in advanced countries have lots of computers, it will follow that, if computers are put into poor schools, they will look more like the good schools. The hope is that, in the process, the poor schools will somehow (magically?) become good, or at least better than they previously were. Such inclinations can nicely complement the "edifice complex" of certain political leaders who wish to leave a lasting, tangible, physical legacy of their benevolent rule. Where this once meant a gleaming monument soaring towards the heavens, in the 21st century this can mean rows of shiny new computers in shiny new computer classrooms.

That's from this EduTech post by Michael Trucano. It's about the recent evaluations showing no impact from the One Laptop per Child (OLPC) program, but I think the broader idea can be applied to health programs as well. For a moment let's apply it to interventions designed to prevent maternal mortality. Maternal mortality is notoriously hard to measure because it is -- in the statistical sense -- quite rare. While many 'rates' (which are often not actual rates, but that's another story) in public health are expressed with denominators of 1,000 (live births, for example), maternal mortality uses a denominator of 100,000 to make the numerators a similar order of magnitude.

That means that you can rarely measure maternal mortality directly -- even with huge sample sizes you get massive confidence intervals that make it difficult to say whether things are getting worse, staying the same, or improving. Instead we typically measure indirect things, like the coverage of interventions that have been shown (in more rigorous studies) to reduce maternal morbidity or mortality. And sometimes we measure health systems things that have been shown to affect coverage of interventions... and so forth. The worry is that at some point you're measuring the sort of things that can be improved -- at least superficially -- without having any real impact.
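
A back-of-the-envelope sketch (with made-up numbers) shows why: even a survey covering ten thousand live births observes only a few dozen maternal deaths, so the confidence interval around the ratio is enormous.

```python
import math

# Hypothetical numbers: a true maternal mortality ratio of 400 per 100,000
# live births, measured in a survey that captures 10,000 live births.
births = 10_000
true_mmr = 400 / 100_000

expected_deaths = births * true_mmr                 # about 40 maternal deaths
# Crude 95% interval for a Poisson count: deaths +/- 1.96 * sqrt(deaths)
half_width = 1.96 * math.sqrt(expected_deaths)
low  = (expected_deaths - half_width) / births * 100_000
high = (expected_deaths + half_width) / births * 100_000

print(f"maternal deaths observed: ~{expected_deaths:.0f}")
print(f"approximate 95% CI for the MMR: {low:.0f} to {high:.0f} per 100,000")
```

That interval runs from roughly 275 to 525 per 100,000, wide enough that even a large real improvement could easily look like no change at all, which is why measurement falls back on intervention coverage and other indirect indicators.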

All that to say: 1) it's important to measure the right thing, 2) determining what that 'right thing' is will always be difficult, and 3) it's good to step back every now and then and think about whether the thing you're funding or promoting or evaluating is really the thing you care about or if you're just measuring "organizational forms" that camouflage the thing you care about.

(Recent blog coverage of the OLPC evaluations here and here.)

Stats lingo in econometrics and epidemiology

Last week I came across an article I wish I'd found a year or two ago: "Glossary for econometrics and epidemiology" (PDF from JSTOR, ungated version here) by Gunasekara, Carter, and Blakely. Statistics is to some extent a common language for the social sciences, but there are also big variations in language that can cause problems when students and scholars try to read literature from outside their fields. I first learned epidemiology and biostatistics at a school of public health, and now this year I'm taking econometrics from an economist, as well as other classes that draw heavily on the economics literature.

Friends in my economics-centered program have asked me "what's biostatistics?" Likewise, public health friends have asked "what's econometrics?" (or just commented that it's a silly name). In reality both fields use many of the same techniques with different language and emphases. The Gunasekara, Carter, and Blakely glossary linked above covers the following terms, amongst others:

  • confounding
  • endogeneity and endogenous variables
  • exogenous variables
  • simultaneity, social drift, social selection, and reverse causality
  • instrumental variables
  • intermediate or mediating variables
  • multicollinearity
  • omitted variable bias
  • unobserved heterogeneity

If you've only studied econometrics or biostatistics, chances are at least some of these terms will be new to you, even though most have roughly equivalent forms in the other field.

Outside of differing language, another difference is in the frequency with which techniques are used. For instance, instrumental variables seem (to me) to be under-used in public health / epidemiology applications. I took four terms of biostatistics at Johns Hopkins and don't recall instrumental variables being mentioned even once! On the other hand, economists just recently discovered randomized trials. (Now they're more widely used.)

But even within a given statistical technique there are important differences. You might think that all social scientists doing, say, multiple linear regression to analyze observational data or critiquing the results of randomized controlled trials would use the same language. In my experience they not only use different vocabulary for the same things, they also emphasize different things. About a third to half of my epidemiology coursework involved establishing causal models (often with directed acyclic graphs) in order to understand which confounding variables to control for in a regression, whereas in econometrics we (very!) briefly discussed how to decide which covariates might cause omitted variable bias. These discussions were basically about the same thing, but they differed in terms of language and in terms of emphasis.
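
A small simulation makes the shared concern concrete: leave the common cause out of the regression and the estimate is biased, whether you call that confounding or omitted variable bias, and a valid instrument can recover the true effect. Everything below is simulated, and the coefficient values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

u = rng.normal(size=n)                    # unobserved confounder / omitted variable
z = rng.normal(size=n)                    # instrument: shifts x, affects y only through x
x = 0.8 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true causal effect of x on y is 2.0

def slope(outcome, regressor):
    """OLS slope from a regression of outcome on a constant and one regressor."""
    X = np.column_stack([np.ones(len(regressor)), regressor])
    return np.linalg.lstsq(X, outcome, rcond=None)[0][1]

naive_ols = slope(y, x)                              # biased: absorbs part of u's effect
iv_wald   = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # simple instrumental-variables estimate

print(f"naive OLS estimate: {naive_ols:.2f}  (true effect is 2.00)")
print(f"IV (Wald) estimate: {iv_wald:.2f}")
```

The naive regression lands near 3.1 because it attributes part of the confounder's effect to x, while the IV estimate sits near 2.0; the same logic underlies both a DAG-based confounding adjustment and an econometrician's worry about omitted variable bias.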

I think an understanding of how and why researchers from different fields talk about things differently helps you to understand the sociology and motivations of each field.  This is all related to what Marc Bellemare calls the ongoing "methodological convergence in the social sciences." As research becomes more interdisciplinary -- and as any applications of research are much more likely to require interdisciplinary knowledge -- understanding how researchers trained in different academic schools think and talk will become increasingly important.

Group vs. individual uses of data

Andrew Gelman notes that, on the subject of value-added assessments of teachers, "a skeptical consensus seems to have arisen..." How did we get here? Value-added assessments grew out of the push for more emphasis on measuring success through standardized tests in education -- simply looking at test scores isn't OK because some teachers are teaching in better schools or are teaching better-prepared students. The solution was to look at how teachers' students improve in comparison to other teachers' students. Wikipedia has a fairly good summary here.

Back in February New York City released (over the opposition of teachers' unions) the value-added scores of some 18,000 teachers. Here's coverage from the Times on the release and reactions.

Gary Rubinstein, an education blogger, has done some analysis of the data contained in the reports and published five posts so far: part 1, part 2, part 3, part 4, and part 5. He writes:

For sure the 'reformers' have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added.

I suggest reading his analysis in full, or at least the first two parts.

For me one early take-away from this -- building off comments from Gelman and others -- is that an assessment might be a useful tool for improving education quality overall, while simultaneously being a very poor metric for individual performance. When you're looking at 18,000 teachers you might be able to learn what factors lead to test score improvement on average, and use that information to improve policies for teacher education, recruitment, training, and retention. But that doesn't mean one can necessarily use the same data to make high-stakes decisions about individual teachers.
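
A quick simulation (invented parameters, not the NYC data) illustrates the distinction: with noisy per-teacher estimates, individual rankings are unreliable even when averages across thousands of teachers recover real differences.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 18_000

# Hypothetical setup: teachers in 'program 1' are genuinely a bit better on
# average, and each teacher's measured value-added is their true effect plus
# a lot of noise (small classes, a single year of test scores, etc.).
program = rng.integers(0, 2, n_teachers)
true_effect = rng.normal(0.0, 1.0, n_teachers) + 0.2 * program
measured = true_effect + rng.normal(0.0, 2.0, n_teachers)

# Individual level: measured scores are only weakly related to true effects.
individual_corr = np.corrcoef(true_effect, measured)[0, 1]

# Group level: averaging over ~9,000 teachers per program washes out the noise.
true_gap = true_effect[program == 1].mean() - true_effect[program == 0].mean()
measured_gap = measured[program == 1].mean() - measured[program == 0].mean()

print(f"individual-level correlation (true vs measured): {individual_corr:.2f}")
print(f"program gap, true: {true_gap:.2f}   measured: {measured_gap:.2f}")
```

In this toy setup the individual-level correlation is only around 0.45, so many teachers with above-average true effects get below-average scores, while the measured gap between the two hypothetical programs tracks the true gap closely.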

On food deserts

Gina Kolata, writing for the New York Times, has sparked some debate with this article: "Studies Question the Pairing of Food Deserts and Obesity". In general I often wish that science reporting focused more on how the new studies fit in with the old, rather than just the (exciting) new ones. On first reading I noticed that one study is described as having explored the association of "the type of food within a mile and a half of their homes" with what people eat. This raised a little question mark in my mind, as I know that prior studies have often looked at distances much shorter than 1.5 miles, but it was mostly a vague hesitation. And if you didn't know that before reading the article, then you've missed a major difference between the old and new results (and one that could have been easily explained). Also, describing something as "an article of faith" when it's arguably something more like "the broad conclusion drawn from most prior research"... that certainly established an editorial tone from the beginning.

Intrigued, I sent the piece to a friend (and former public health classmate) who has worked on food deserts, to get a more informed reaction. I'm sharing her thoughts here (with permission) because this is an area of research that I don't follow as closely, and her reactions helped me to situate this story in the broader literature:

1. This quote from the article is so good!

"It is always easy to advocate for more grocery stores,” said Kelly D. Brownell, director of Yale University’s Rudd Center for Food Policy and Obesity, who was not involved in the studies. “But if you are looking for what you hope will change obesity, healthy food access is probably just wishful thinking.”

The "unhealthy food environment" has a much bigger impact on diet than the "healthy food environment", but it's politically more viable to work from an advocacy standpoint than a regulatory standpoint. (On that point, you still have to worry about what food is available - you can't just take out small businesses in impoverished neighborhoods and not replace it with anything.)

2. The article is too eager to dismiss the health-food access relationship. There's good research out there, but there's constant difficulty with tightening methods/definitions and deciding what to control for. The thing that I think is really powerful about the "food desert" discourse is that it opens doors to talk about race, poverty, community, culture, and more. At the end of the day, grocery stores are good for low-income areas because they bring in money and raise property values. If the literature isn't perfect on health effects, I'm still willing to advocate for them.

3. I want to know more about the geography of the study that found that low-income areas had more grocery stores than high-income areas. Were they a mix of urban, peri-urban, and rural areas? Because that's a whole other bear. (Non-shocker shocker: rural areas have food deserts... rural poverty is still a problem!)

4. The article does a good job of pointing to how difficult it is to study this. Hopkins (and the Baltimore Food Czar) are doing some work with healthy food access scores for neighborhoods. This would take into account how many healthy food options there are (supermarkets, farmers' markets, arabbers, tiendas) and how many unhealthy food options there are (fast food, carry out, corner stores).

5. The studies they cite are with kids, but the relationship between food insecurity (which is different, but related to food access) and obesity is only well-established among women. (This, itself, is not talked about enough.) The thinking is that kids are often "shielded" from the effects of food insecurity by their mothers, who eat a yo-yo diet depending on the amount of food in the house.

My friend also suggested the following articles for additional reading:

More on microfoundations

Last month I wrote a long-ish post describing the history of the "microfounded" approaches to macroeconomics. For a while I was updating that post with links to recent blog posts as the debate continued, but I stopped after the list grew too long. Now Simon Wren-Lewis has written two more posts that I think are worth highlighting because they come from someone who is generally supportive of the microfoundations approach (I've found his defense of the general approach quite helpful), but who still has some specific critiques. The end of his latest post puts these critiques in context:

One way of reading these two posts is a way of exploring Krugman’s Mistaking Beauty for Truth essay. I know the reactions of colleagues, and bloggers, to this piece have been quite extreme: some endorsing it totally, while others taking strong exception to its perceived targets. My own reaction is very similar to Karl Smith here. I regard what has happened as a result of the scramble for austerity in 2010 to be in part a failure of academic macroeconomics. It would be easy to suggest that this was only the result of unfortunate technical errors, or political interference, and that otherwise the way we do macro is basically fine. I think Krugman was right to suggest otherwise. Given the conservative tendency in any group, an essay that said maybe there might just be an underlying problem here would have been ignored. The discipline needed a wake-up call from someone with authority who knew what they were talking about. Identifying exactly what those problems are, and what to do about them, seems to me an important endeavour that has only just begun.

Here are his two posts:

  1. The street light problem: "I do think microfoundations methodology is progressive. The concern is that, as a project, it may tend to progress in directions of least resistance rather than in the areas that really matter – until perhaps a crisis occurs."
  2. Ideological bias: "In RBC [Real Business Cycle] models, all changes in unemployment are voluntary. If unemployment is rising, it is because more workers are choosing leisure rather than work. As a result, high unemployment in a recession is not a problem at all.... If anyone is reading this who is not familiar with macroeconomics, you might guess that this rather counterintuitive theory is some very marginal and long forgotten macroeconomic idea. You would be very wrong."

Up to speed: microfoundations

[Admin note: this is the first of a new series of "Up to speed" posts which will draw together information on a subject that's either new to me or has been getting a lot of play lately in the press or in some corner of the blogosphere. The idea here is that folks who are experts on this particular subject might not find anything new; I'm synthesizing things for those who want to get up to speed.]

Microfoundations (Wikipedia) are quite important in modern macroeconomics. Modern macroeconomics really started with Keynes. His landmark General Theory of Employment, Interest and Money (published in 1936) set the stage for pretty much everything that has come since. Basically nothing that came before Keynes could explain the Great Depression -- or, worse yet, how the world might get out of it -- and Keynes' theories (rightly or wrongly) became popular because they addressed that central failing.

One major criticism was that macroeconomic models like Keynes' were top-down, looking only at aggregate measures like output and investment. That may not seem too bad, but when you tried to break things down into the underlying individual behaviors that would add up to those aggregates, wacky stuff happened. At that point microeconomic models were much better fleshed out, and the micro models all started with individual rational actors maximizing their utility -- assumptions that macroeconomists just couldn't recover by breaking down their aggregate models.

The most influential criticism came from Robert Lucas, in what became known as the Lucas Critique (here's a PDF of his 1976 paper). Lucas basically argued that aggregate models weren't that helpful because they were only looking at surface-level parameters without understanding the underlying mechanisms. If something -- like the policy environment -- changes drastically then the old relationships that were observed in the aggregate data may no longer apply. An example from Wikipedia:

One important application of the critique is its implication that the historical negative correlation between inflation and unemployment, known as the Phillips Curve, could break down if the monetary authorities attempted to exploit it. Permanently raising inflation in hopes that this would permanently lower unemployment would eventually cause firms' inflation forecasts to rise, altering their employment decisions.
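
To make that example concrete, here's a toy sketch of my own (made-up parameters, and simple adaptive expectations in the Friedman-Phelps spirit rather than anything from the Lucas paper itself): only surprise inflation moves unemployment, so permanently raising inflation buys a dip that vanishes as soon as firms' forecasts catch up.

```python
# Expectations-augmented Phillips curve, toy version:
#   u_t = u* - a * (pi_t - E_t[pi_t]) + noise,  with E_t[pi_t] = pi_{t-1}
# The historical inflation-unemployment trade-off exists only while
# inflation surprises people; exploit it and it disappears.
import numpy as np

rng = np.random.default_rng(1)
T, shift = 400, 200            # inflation target jumps at period 200
u_star, a = 5.0, 0.8           # natural rate (%) and Phillips-curve slope

target = np.where(np.arange(T) < shift, 2.0, 8.0)    # 2% forever, then 8% forever
inflation = target + rng.normal(0.0, 0.3, T)

expected = np.empty(T)
expected[0] = inflation[0]
expected[1:] = inflation[:-1]                         # adaptive expectations

unemployment = u_star - a * (inflation - expected) + rng.normal(0.0, 0.2, T)

print("avg unemployment, old 2% regime:", round(unemployment[:shift].mean(), 2))
print("unemployment when the target jumps:", round(unemployment[shift], 2))
print("avg unemployment, new 8% regime:", round(unemployment[shift + 20:].mean(), 2))
```

Unemployment dips sharply in the period where inflation outruns expectations, then settles right back at the natural rate even though inflation is now permanently higher -- which is the sense in which the old aggregate correlation "breaks down" when policy leans on it.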

Economists responded by developing "micro-founded" macroeconomic models, ones that built up from the sum of microeconomic models. The most commonly used of these models is called, awkwardly, dynamic stochastic general equilibrium (DSGE). Much of my study time this semester involves learning the math behind this. What's the next step forward from DSGE? Are these models better than the old Keynesian models? How do we even define "better"? These are all hot topics in macro at the moment. There's been a recent spat in the economics blogosphere that illustrates this -- what follows are a few highlights.
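
For a flavor of what that math looks like, here is the standard log-linearized three-equation New Keynesian skeleton that most introductory treatments of DSGE start from (this is the generic textbook version, not anything specific to my course):

```latex
% x_t = output gap, \pi_t = inflation, i_t = nominal interest rate;
% E_t is the expectation formed at time t, r_t^n is the natural real rate,
% and \sigma, \beta, \kappa, \phi_\pi, \phi_x are parameters derived from
% households' and firms' optimization problems.
\begin{aligned}
x_t   &= \mathbb{E}_t x_{t+1} - \tfrac{1}{\sigma}\bigl(i_t - \mathbb{E}_t \pi_{t+1} - r_t^{n}\bigr)
        && \text{(IS curve, from the household's Euler equation)}\\
\pi_t &= \beta\,\mathbb{E}_t \pi_{t+1} + \kappa\, x_t
        && \text{(New Keynesian Phillips curve, from firms' pricing)}\\
i_t   &= \phi_\pi \pi_t + \phi_x x_t + \varepsilon_t
        && \text{(a Taylor-type monetary policy rule)}
\end{aligned}
```

The "micro-founded" label refers to the fact that the first two equations are derived from explicit optimization by households and firms, rather than being posited directly as aggregate relationships.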

Back in 2009 Paul Krugman (NYT columnist, Nobel winner, and Woodrow Wilson School professor) wrote an article titled "How Did Economists Get It So Wrong?" that included this paragraph:

As I see it, the economics profession went astray because economists, as a group, mistook beauty, clad in impressive-looking mathematics, for truth. Until the Great Depression, most economists clung to a vision of capitalism as a perfect or nearly perfect system. That vision wasn’t sustainable in the face of mass unemployment, but as memories of the Depression faded, economists fell back in love with the old, idealized vision of an economy in which rational individuals interact in perfect markets, this time gussied up with fancy equations. The renewed romance with the idealized market was, to be sure, partly a response to shifting political winds, partly a response to financial incentives. But while sabbaticals at the Hoover Institution and job opportunities on Wall Street are nothing to sneeze at, the central cause of the profession’s failure was the desire for an all-encompassing, intellectually elegant approach that also gave economists a chance to show off their mathematical prowess.

Last month Stephen Williamson wrote this:

[Because of the financial crisis] There was now a convenient excuse to wage war, but in this case a war on mainstream macroeconomics. But how can this make any sense? The George W era produced a political epiphany for Krugman, but how did that ever translate into a war on macroeconomists? You're right, it does not make any sense. The tools of modern macroeconomics are no more the tools of right-wingers than of left-wingers. These are not Republican tools, Libertarian tools, Democratic tools, or whatever.

A bit of a sidetrack, but this prompted Noah Smith to write a long post (that is generally more technical than I want to get in to here) defending the idea that modern macro models (like DSGE) are in fact ideologically biased, even if that's not their intent. Near the end:

So what this illustrates is that it's really hard to make a DSGE model with even a few sort-of semi-realistic features. As a result, it's really hard to make a DSGE model in which government policy plays a useful role in stabilizing the business cycle. By contrast, it's pretty easy to make a DSGE model in which government plays no useful role, and can only mess things up. So what ends up happening? You guessed it: a macro literature where most papers have only a very limited role for government.

In other words, a macro literature whose policy advice is heavily tilted toward the political preferences of conservatives.

Back on the main track, Simon Wren-Lewis, writing at Mainly Macro, comes to Krugman's defense, sort of, by saying that it's conceivable that an aggregate model might actually be more defensible than a micro-founded one in certain circumstances.

This view [Krugman's view that aggregate models may still be useful] appears controversial. If the accepted way of doing macroeconomics in academic journals is to almost always use a ‘fancier optimisation’ model, how can something more ad hoc be more useful? Coupled with remarks like ‘the economics profession went astray because economists, as a group, mistook beauty, clad in impressive-looking mathematics, for truth’ (from the 2009 piece) this has got a lot of others, like Stephen Williamson, upset. [skipping several paragraphs]

But suppose there is in fact more than one valid microfoundation for a particular aggregate model. In other words, there is not just one, but perhaps a variety of particular worlds which would lead to this set of aggregate macro relationships....Furthermore, suppose that more than one of these particular worlds was a reasonable representation of reality... It would seem to me that in this case the aggregate model derived from these different worlds has some utility beyond just one of these microfounded models. It is robust to alternative microfoundations.

Krugman then followed up with an argument for why it's OK to use both aggregate and microfounded models.

And here's Noah Smith writing again, "Why bother with microfoundations?"

Using wrong descriptions of how people behave may or may not yield aggregate relationships that really do describe the economy. But the presence of the incorrect microfoundations will not give the aggregate results a leg up over models that simply started with the aggregates....

When I look at the macro models that have been constructed since Lucas first published his critique in the 1970s, I see a whole bunch of microfoundations that would be rejected by any sort of empirical or experimental evidence (on the RBC side as well as the Neo-Keynesian side). In other words, I see a bunch of crappy models of individual human behavior being tossed into macro models. This has basically convinced me that the "microfounded" DSGE models we now use are only occasionally superior to aggregate-only models. Macroeconomists seem to have basically nodded in the direction of the Lucas critique and in the direction of microeconomics as a whole, and then done one of two things: either A) gone right on using aggregate models, while writing down some "microfoundations" to please journal editors, or B) drawn policy recommendations directly from incorrect models of individual behavior.

The most recent is from Krugman, wherein he says (basically) that models that make both small and big predictions should be judged more on the big than the small.

This is just a sampling, and likely a biased one as there are many who dismiss the criticism of microfoundations out of hand and thus aren't writing detailed responses. Either way, the microfoundations models are dominant in the macro literature now, and the macro-for-policy-folks class I'm taking at the moment focuses on micro-founded models (because they're "how modern macro is done").

So what to conclude? My general impression is that microeconomics is more heavily 'evolved' than macroeconomics. (You could say that in macro the generation times are much longer, and the DNA replication bits are dodgier, so evolving from something clearly wrong towards something clearly better is taking longer.)

Around the same time that micro was getting problematized by Kahneman and others who questioned the rational utility-maximizing nature of humans -- thus launching the behavioral economics revolution, which tries to complicate micro theory with a bit of reality -- the macroeconomists were just getting around to incorporating the original microeconomic emphasis on rationality. Just how much micro will change in the next decades in response to the behavioral revolution is unclear, so expecting troglodytesque macro to have already figured this out is unrealistic.

A number of things are unclear to me: just how deep the dissatisfaction with the current models is, how broadly these critiques (vs. others from different directions) are endorsed, and what actually drives change in fields of inquiry. Looking back in another 30-40 years we might see this moment in time as a pivotal shift in the history of the development of macroeconomics -- or it may be a little hiccup that no one remembers at all. It's too soon to tell.

Updates: since writing this I've noticed several more additions to the discussion:

Coincidence or consequence?

Imagine there's a pandemic flu virus on the loose, and a vaccine has just been introduced. Then come reports of dozens of cases of Guillain-Barré syndrome (GBS), a rare type of paralysis. Did the new vaccine cause it? How would you even begin to know? One first step (though certainly not the only one) is to think about the background rate of disease:

Inappropriate assessment of vaccine safety data could severely undermine the effectiveness of mass campaigns against pandemic H1N1 2009 influenza. Guillain-Barré syndrome is a good example to consider. Since the 1976–77 swine influenza vaccination campaign was associated with an increased number of cases of Guillain-Barré syndrome, assessment of such cases after vaccination will be a high priority. Therefore, it is important to know the background rates of this syndrome and how this rate might vary with regard to population demographics. The background rate of the syndrome in the USA is about 1–2 cases per 1 million person-months of observation. During a pandemic H1N1 vaccine campaign in the USA, 100 million individuals could be vaccinated. For a 6-week follow-up period for each dose, this corresponds to 150 million person-months of observation time during which a predicted 200 or more new cases of Guillain-Barré syndrome would occur as background coincident cases. The reporting of even a fraction of such a large number of cases as adverse events after immunisation, with attendant media coverage, would probably give rise to intense public concern, even though the occurrence of such cases was completely predictable and would have happened in the absence of a mass campaign.

That's from a paper by Steven Black et al. in 2009, "Importance of background rates of disease in assessment of vaccine safety during mass immunisation with pandemic H1N1 influenza vaccines". They also calculate background rates for spontaneous abortion, preterm delivery, and spontaneous death among other things.
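
The arithmetic behind those numbers is worth making explicit; here's the back-of-the-envelope version in Python, using the figures stated in the quote and taking the midpoint of the 1-2 per million rate:

```python
# Expected *background* GBS cases during a mass H1N1 vaccination campaign,
# i.e. cases that would occur even if the vaccine caused none at all.
# Inputs are the ones quoted above; the rate midpoint is my choice.
background_rate = 1.5e-6     # GBS cases per person-month (~1-2 per million)
vaccinated = 100e6           # people vaccinated during the campaign
follow_up_months = 1.5       # 6-week follow-up window per dose

person_months = vaccinated * follow_up_months
expected_background_cases = background_rate * person_months

print(f"person-months of observation: {person_months:,.0f}")
print(f"expected background GBS cases: {expected_background_cases:.0f}")
# -> 150,000,000 person-months and ~225 coincident cases, in line with
#    the paper's "200 or more".
```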

Platform evaluation

Cesar Victora, Bob Black, Ties Boerma, and Jennifer Bryce (three of the four are with the Hopkins Department of International Health and I took a course with Prof Bryce) wrote this article in The Lancet in January 2011: "Measuring impact in the Millennium Development Goal era and beyond: a new approach to large-scale effectiveness evaluations." The abstract:

Evaluation of large-scale programmes and initiatives aimed at improvement of health in countries of low and middle income needs a new approach. Traditional designs, which compare areas with and without a given programme, are no longer relevant at a time when many programmes are being scaled up in virtually every district in the world. We propose an evolution in evaluation design, a national platform approach that: uses the district as the unit of design and analysis; is based on continuous monitoring of different levels of indicators; gathers additional data before, during, and after the period to be assessed by multiple methods; uses several analytical techniques to deal with various data gaps and biases; and includes interim and summative evaluation analyses. This new approach will promote country ownership, transparency, and donor coordination while providing a rigorous comparison of the cost-effectiveness of different scale-up approaches.

Discarding efficacy?

Andrew Grove, former CEO of Intel, writes an editorial in Science:

We might conceptualize an “e-trial” system along similar lines. Drug safety would continue to be ensured by the U.S. Food and Drug Administration. While safety-focused Phase I trials would continue under their jurisdiction, establishing efficacy would no longer be under their purview. Once safety is proven, patients could access the medicine in question through qualified physicians. Patients' responses to a drug would be stored in a database, along with their medical histories. Patient identity would be protected by biometric identifiers, and the database would be open to qualified medical researchers as a “commons.” The response of any patient or group of patients to a drug or treatment would be tracked and compared to those of others in the database who were treated in a different manner or not at all.

Alex Tabarrok of Marginal Revolution (who is a big advocate for FDA reform, running this site) really likes the idea. I hate it. While the current system has some problems, Grove's proposal would be much, much worse. The biggest problem is that we would have no good data about whether a drug is truly efficacious, because all of the results in the database would be confounded by selection bias. Getting a large sample size and having subgroups tells you nothing about why someone got the treatment in the first place.
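
To see why a huge sample doesn't rescue you, here's a minimal simulation sketch (entirely made-up numbers) of confounding by indication: if sicker patients are more likely to be given the drug, the naive treated-versus-untreated comparison in an enormous observational database can make a genuinely beneficial drug look harmful.

```python
# Toy illustration of confounding by indication in "e-trial"-style data.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000                                    # a huge observational database

severity = rng.normal(0.0, 1.0, n)               # unmeasured illness severity
p_treated = 1 / (1 + np.exp(-2.0 * severity))    # sicker -> more likely treated
treated = rng.random(n) < p_treated

true_drug_effect = -0.5                          # the drug genuinely helps
# Higher outcome = worse health; severity drives outcomes far more than the drug.
outcome = 2.0 * severity + true_drug_effect * treated + rng.normal(0.0, 1.0, n)

naive_estimate = outcome[treated].mean() - outcome[~treated].mean()
print(f"true effect of the drug : {true_drug_effect:+.2f}")
print(f"naive database estimate : {naive_estimate:+.2f}  (sign flipped)")
```

No amount of additional data fixes this, because the bias comes from why patients got the drug in the first place, not from sampling error -- which is exactly the problem randomization exists to solve.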

Would physicians pay attention to peer-reviewed articles and reviews identifying the best treatments for specific groups? Or would they just run their own analyses? I think there would be a lot of the latter, which is scary since many clinicians can’t even define selection bias or properly interpret statistical tests. The current system has limitations, but Grove's idea would move us even further from any sort of evidence-based medicine.

Other commenters at Marginal Revolution rightly note that it's difficult to separate safety from efficacy, because recommending a drug is always based on a balance of risks and benefits. Debilitating nausea or strong likelihood of heart attack would never be OK in a drug for mild headaches, but if it cures cancer the standards are (and should be) different.

Derek Lowe, a fellow Arkansan who writes the excellent chemistry blog In The Pipeline, has more extensive (and informed) thoughts here.

Update (1/5/2012): More criticism, summarized by Derek Lowe.

What does social science know?

Marc Bellemare wrote a post "For Fellow Teachers: Revised Primers on Linear Regression and Causality." Good stuff for students too -- not just teachers. The primers are PDFs on linear regression (6 pages) and causality (3 pages), and they're either 1) a concise summary if you're studying this stuff already, or 2) something you should really read if you don't have any background in quantitative methods. I also really enjoyed an essay by Jim Manzi that Marc links to, titled "What Social Science Does -- and Doesn't -- Know." Manzi reviews the history of experimentation in the natural sciences, and then in the social sciences. He discusses why it's more difficult to extrapolate from randomized trials in the social sciences due to greater 'causal density,' amongst other reasons. Manzi summarizes a lot of research in criminology (a field I didn't even know used many field trials) and ends with some conclusions that seem sharp (emphasis added):

...After reviewing experiments not just in criminology but also in welfare-program design, education, and other fields, I propose that three lessons emerge consistently from them.

First, few programs can be shown to work in properly randomized and replicated trials. Despite complex and impressive-sounding empirical arguments by advocates and analysts, we should be very skeptical of claims for the effectiveness of new, counterintuitive programs and policies, and we should be reluctant to trump the trial-and-error process of social evolution in matters of economics or social policy.

Second, within this universe of programs that are far more likely to fail than succeed, programs that try to change people are even more likely to fail than those that try to change incentives. A litany of program ideas designed to push welfare recipients into the workforce failed when tested in those randomized experiments of the welfare-reform era; only adding mandatory work requirements succeeded in moving people from welfare to work in a humane fashion. And mandatory work-requirement programs that emphasize just getting a job are far more effective than those that emphasize skills-building. Similarly, the list of failed attempts to change people to make them less likely to commit crimes is almost endless—prisoner counseling, transitional aid to prisoners, intensive probation, juvenile boot camps—but the only program concept that tentatively demonstrated reductions in crime rates in replicated RFTs was nuisance abatement, which changes the environment in which criminals operate....

I'd note here that many researchers and policymakers who are interested in health-related behavior change have been moving away from simply providing information or attempting to persuade people to change their behavior, and moving towards changing the unhealthy environments in which we live. NYC Health Commissioner Thomas Farley spoke explicitly about this shift in emphasis when he addressed us summer interns back in June. That approach is a direct response to frustration with the small returns from many behavioral intervention approaches, and an acknowledgment that we humans are stubborn creatures whose behavior is shaped (more than we'd like to admit) by our environments.

Manzi concludes:

And third, there is no magic. Those rare programs that do work usually lead to improvements that are quite modest, compared with the size of the problems they are meant to address or the dreams of advocates.

Right, no pie in the sky. If programs or policies had huge effects they'd be much easier to measure, for one. Read it all.