Advocates and scientists

Nina Munk has a new book coming out: The Idealist: Jeffrey Sachs and the Quest to End Poverty. The blurbs on Amazon are fascinating because they indicate either that the reviewers didn't actually read the book (which wouldn't be all that surprising) or that Munk's book paints a nuanced enough picture that readers can come away with very different views on what it actually proves. Here are two examples:

Amartya Sen: “Nina Munk’s book is an excellent – and moving – tribute to the vision and commitment of Jeffrey Sachs, as well as an enlightening account of how much can be achieved by reasoned determination.”

Robert Calderisi: "A powerful exposé of hubris run amok, drawing on touching accounts of real-life heroes fighting poverty on the front line."

The publisher's description seems to encompass both of those points of view: "The Idealist is the profound and moving story of what happens when the abstract theories of a brilliant, driven man meet the reality of human life." That sounds like a good read to me -- I look forward to reading it when it comes out in September.

Munk's previous reporting strikes a similar tone. For example, here's an excerpt of her 2007 Vanity Fair profile of Sachs:

Leaving the region of Dertu, sitting in the back of an ancient Land Rover, I'm reminded of a meeting I had with Simon Bland, head of Britain's Department for International Development in Kenya. Referring to the Millennium Villages Project, and to Sachs in particular, Bland laid it out for me in plain terms: "I want to say, 'What concept are you trying to prove?' Because I know that if you spend enough money on each person in a village you will change their lives. If you put in enough resources—enough foreigners, technical assistance, and money—lives change. We know that. I've been doing it for years. I've lived and worked on and managed [development] projects.

"The problem is," he added, "when you walk away, what happens?"

Someone -- I think it was Chris Blattman, but I can't find the specific post -- wondered a while back whether too much attention has been given to the Millennium Villages Project. After all, the line of thinking goes, the MVPs have really just gotten more press and aren't that different from the many other projects with even less rigorous evaluation designs. That's certainly true: when journalists and aid bloggers debate the MVPs, part of what they're debating is Sachs himself, because he's such a polarizing personality. If you really care about aid policy, and the uses of evidence in that policy, then that can all feel like an unhelpful distraction. Most aid efforts don't get book-length profiles, and interest in Sachs' personality and persona will probably drive the interest in Munk's book.

But I also think the MVP debates have been healthy and interesting -- and ultimately deserving of most of the heat generated -- because they're about a central tension within aid and development, as well as other fields where research intersects with activism. If you think we already generally know what to do, then it makes sense to push forward with it at all costs. The naysayers who doubt you are unhelpful skeptics who are on some level ethically culpable for blocking good work. If you think the evidence is not yet in, then it makes more sense to act like a scientist, collecting the evidence needed to make good decisions in the longer term. The naysayers opposing the scientists are then utopian advocates who throw millions at unproven projects. I've seen a similar tension within the field of public health, between those who see themselves primarily as advocates and those who see themselves as scientists, and I'm sure it exists elsewhere as well.

That is, of course, a caricature -- few people fall completely on one side of the advocates vs. scientists divide. But I think the caricature is a useful one for framing arguments. The fundamental disagreement is usually not about whether evidence should be used to inform efforts to end poverty or improve health or advance any other goal. Instead, the disagreement is often over what the current state of knowledge is. And on that note, if you harbor any doubts about where Sachs has positioned himself on that spectrum, here's the beginning of Munk's 2007 profile:

In the respected opinion of Jeffrey David Sachs.... the problem of extreme poverty can be solved. In fact, the problem can be solved "easily." "We have enough on the planet to make sure, easily, that people aren't dying of their poverty. That's the basic truth," he tells me firmly, without a doubt.

...To Sachs, the end of poverty justifies the means. By hook or by crook, relentlessly, he has done more than anyone else to move the issue of global poverty into the mainstream—to force the developed world to consider his utopian thesis: with enough focus, enough determination, and, especially, enough money, extreme poverty can finally be eradicated.

Once, when I asked what kept him going at this frenzied pace, he snapped back, "If you haven't noticed, people are dying. It's an emergency."

----

via Gabriel Demombynes.

If you're new to the Millennium Villages debate, here's some background reading: a recent piece in Foreign Policy by Paul Starobin, some good posts by Chris Blattman (one, two, three), this gem from Owen Barder, and Michael Clemens.

On deworming

GiveWell's Alexander Berger just posted a more in-depth blog review of the (hugely impactful) Miguel and Kremer deworming study. Here's some background: the Cochrane review, GiveWell's first response to it, and IPA's very critical response. I've been meaning to blog on this since the new Cochrane review came out, but haven't had time to do the subject justice by really digging into all the papers. So I hope you'll forgive me for just sharing the comment I left at the latest GiveWell post, as it's basically what I was going to blog anyway:

Thanks for this interesting review — I especially appreciate that the authors [Miguel and Kremer] shared the material necessary for you [GiveWell] to examine their results in more depth, and that you talk through your thought process.

However, one thing you highlighted in your post on the new Cochrane review that isn’t mentioned here, and which I thought was much more important than the doubts about this Miguel and Kremer study, was that there have been so many other studies that did not find large effects on health outcomes! I’ve been meaning to write a long blog post about this when I really have time to dig into the references, but since I’m mid-thesis I’ll disclaim that this quick comment is based on recollection of the Cochrane review and your and IPA’s previous blog posts, so forgive me if I misremember something.

The Miguel and Kremer study gets a lot of attention in part because it had big effects, and in part because it measured outcomes that many (most?) other deworming studies hadn’t measured — but it’s not as if we believe these outcomes to be completely unrelated. This is a case where our beliefs about the underlying causal mechanism for the social effects matter a great deal. For the epidemiologists reading, imagine this as a DAG (a directed acyclic graph) where the mechanism is “deworming -> better health -> better school attendance and cognitive function -> long-term social/economic outcomes.” That’s at least how I assume the mechanism is hypothesized.

So while the other studies don’t measure the social outcomes, it’s harder for me to imagine how deworming could have a very large effect on school and social/economic outcomes without first having an effect on (some) health outcomes — since the social outcomes are ‘downstream’ from the health ones. Maybe different people are assuming that something else is going on — that the health and social outcomes are somehow independent, or that you just can’t measure the health outcomes as easily as the social ones, which seems backwards to me. (To me this was the missing gap in the IPA blog response to GiveWell’s criticism as well.)

So continuing to give so much attention to this study, even if it’s critical, misses what I took to be the biggest takeaway from that review — there have been a bunch of studies that showed only small effects or none at all. They were looking at health outcomes, yes, but those aren’t unrelated to the long-term development, social, and economic effects. You [GiveWell] try to get at the external validity of this study by looking for different-sized effects in areas with different prevalence, which is good but limited. Ultimately, if you consider all of the studies that looked at various outcomes, I think the most plausible explanation for how you could get huge (social) effects in the Miguel and Kremer study while seeing little to no (health) effects in the others is not that the other studies just didn’t measure the social effects, but that the Miguel and Kremer study’s external validity is questionable because of its unique study population.

(Emphasis added throughout)
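To make the causal chain in that comment concrete, here's a minimal sketch of the hypothesized DAG in Python. The node names and the single-chain structure are just my reading of the hypothesis, not anything taken from the studies themselves:

```python
# A minimal sketch of the hypothesized causal chain, using networkx.
# The node names and the chain itself are my assumptions about how the
# mechanism is usually described, not something from the papers.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("deworming", "better health"),
    ("better health", "school attendance & cognition"),
    ("school attendance & cognition", "long-term social/economic outcomes"),
])

assert nx.is_directed_acyclic_graph(dag)

# Every path from deworming to the long-term outcomes passes through the
# health node, so a large downstream (social/economic) effect with no
# upstream health effect would be hard to square with this model.
paths = list(nx.all_simple_paths(dag, "deworming",
                                 "long-term social/economic outcomes"))
print(all("better health" in p for p in paths))  # True
```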

 

Aid, paternalism, and skepticism

Bill Easterly, the ex-blogger who just can't stop, writes about a conversation he had with GiveWell, a charity reviewer/giving guide that relies heavily on rigorous evidence to pick programs to invest in. I've been meaning to write about GiveWell's approach -- which I generally think is excellent. Easterly, of course, is an aid skeptic in general and a critic of planned, technocratic solutions in particular. Here's an excerpt from his notes on his conversation with GiveWell:

...a lot of things that people think will benefit poor people (such as improved cookstoves to reduce indoor smoke, deworming drugs, bed nets and water purification tablets) {are things} that poor people are unwilling to buy for even a few pennies. The philanthropy community’s answer to this is “we have to give them away for free because otherwise the take-up rates will drop.” The philosophy behind this is that poor people are irrational. That could be the right answer, but I think that we should do more research on the topic. Another explanation is that the people do know what they’re doing and that they rationally do not want what aid givers are offering. This is a message that people in the aid world are not getting.

Later, in the full transcript, he adds this:

We should try harder to figure out why people don’t buy health goods, instead of jumping to the conclusion that they are irrational.

Also:

It's easy to catch people doing irrational things. But it's remarkable how fast and unconsciously people get things right, solving really complex problems at lightning speed.

I'm with Easterly, up to a point: aid and development institutions need much better feedback loops, but are unlikely to develop them for reasons rooted in their nature and funding. The examples of bad aid he cites are often horrendous. But I think this critique is limited, especially on health, where the RCTs and all other sorts of evidence really do show that we can have massive impact -- reducing suffering and death on an epic scale -- with known interventions. [Also, a caution: the notes above are just notes and may have been worded differently if they were a polished, final product -- but I think they're still revealing.]

Elsewhere Easterly has been more positive about the likelihood of benefits from health aid/programs in particular, so I find it quite curious that his examples above of things that poor people don't always price rationally are all health-related. Instead, in the excerpts above he falls back on that great foundational argument of economists: if people are rational, why have all this top-down institutional interference? Well, I couldn't help contrasting that argument with this quote highlighted by another economist, Tyler Cowen, at Marginal Revolution:

Just half of those given a prescription to prevent heart disease actually adhere to refilling their medications, researchers find in the Journal of American Medicine. That lack of compliance, they estimate, results in 113,000 deaths annually.

Let that sink in for a moment. Residents of a wealthy country, the United States, do something very, very stupid. All of the RCTs show that taking these medicines will help them live longer, yet people fail to overcome the barriers at hand and take something that is proven to extend their lives. As a consequence, more than a hundred thousand of them die every single year. Humans may make remarkably fast unconscious decisions correctly in some spheres, sure, but it's hard to look at this result and see any way in which it makes much sense.

Now think about applying Easterly's argument against paternalism in philanthropy (he doesn't specifically call it that here, but has done so elsewhere) to this case: if people in the US really want to live, why don't they take these medicines? Who are we to say they're irrational? That's one answer, but maybe we don't understand their preferences and should avoid top-down solutions until we have more research.

Reductio ad absurdum? Maybe. On the one hand, we do need more research on many things, including medication uptake in high- and low-income countries. On the other hand, aid skepticism that goes far enough to oppose proven health interventions just because people don't always value those interventions rationally lines up pretty well with the anti-paternalism-above-all streak in conservatism that opposes government intervention in pretty much every area. Maybe it's a good idea to try out some nudge-y (libertarian paternalism, if you will) policies to encourage people to take their medicine, or to require people to have health insurance they would not choose to buy on their own.

Do you want to live longer? I bet you do, and it's safe to assume that people in low-income countries do as well. Do you always do exactly what will help you do so? Of course not: observe the obesity pandemic. Do poor people really want to suffer from worms or have their children die from diarrhea? Again, of course not. While poor people in low-income countries aren't always willing to invest a lot of time or pay a lot of money for things that would clearly help them stay alive for longer, that shouldn't be surprising to us. Why? Because the exact same thing is true of rich people in wealthy countries.

People everywhere -- rich and poor -- make dumb decisions all the time, often because those decisions are easier in the moment, thanks to our many irrational cognitive and behavioral tics. Those seemingly dumb decisions usually reveal the non-optimal decision-making environments in which we live, but you'd think we could overcome those environments to choose interventions that are very clearly beneficial. We don't always. The result is that sometimes people in low-income countries don't pay out of pocket for deworming medicine or bednets, and sometimes people in high-income countries don't take their medicine -- these are different sides of the same coin.

Now, to a more general discussion of aid skepticism: I agree with Easterly (in the same post) that aid skeptics are a "feature of the system" that ultimately make it more robust. But it's an iterative process that is often frustrating in the moment for those who are implementing or advocating for specific programs (in my case, health) because we see the skeptics as going too far. I'm probably one of the more skeptical implementers out there -- I think the majority of aid programs probably do more harm than good, and chose to work in health in part because I think that is less true in this sector than in others. I like to think that I apply just the right dose of skepticism to aid skepticism itself, wringing out a bit of cynicism to leave the practical core.

I also think that there are clear wins, supported by the evidence, especially in health, and thus that Easterly goes too far here. Why does he? Because his aid skepticism isn't simply pragmatic, but also rooted in an ideological opposition to all top-down programs. That's a nice way to put it, one that I think he might even agree with. But ultimately that leads to a place where you end up lumping things together that are not the same, and I'll argue that that does some harm. Here are two examples of aid, both more or less from Easterly's post:

  • Giving away medicines or bednets free, because otherwise people don't choose to invest in them; and,
  • A World Bank project in Uganda that "ended up burning down farmers’ homes and crops and driving the farmers off the land."

These are both, in one sense, paternalistic, top-down programs, because they are based on the assumption that sometimes people don't choose to do what is best for themselves. But are they the same otherwise? I'd argue no. One might argue that they come from the same place, and that an institution that funds the first will inevitably mess up and do the second -- but I don't buy that strong form of aid skepticism. And being able to lump the apparently good program together with the obviously bad one is what makes Easterly's rhetorical stance powerful.

If you so desire, you could label these two approaches as weak coercion and strong coercion. They are both coercive in the sense that they reshape the situations in which people live to help achieve an outcome that someone -- a planner, if you will -- has decided is better. All philanthropy and much public policy is coercive in this sense, and those who are ideologically opposed to it have a hard time seeing the difference. But to many of us, it's really only the latter, obvious harm that we dislike, whereas free medicines don't seem all that bad. I think that's why aid skeptics like Easterly group these two together, because they know we'll be repulsed by the strong form. But when they argue that all these policies are ultimately the same because they ignore people's preferences (as demonstrated by their willingness to pay for health goods, for example), the argument doesn't sit right with a broader audience. And then ultimately it gets ignored, because these things only really look the same if you look at them through certain ideological lenses.

That's why I wish Easterly would take a more pragmatic approach to aid skepticism; such a form might harp on the truly coercive aspects without lumping them in with the mildly paternalistic. Condemning the truly bad things is very necessary, and folks "on the inside" of the aid-industrial complex aren't generally well-positioned to make those arguments publicly. However, I think people sometimes need a bit of the latter policies, the mildly paternalistic ones like giving away medicines and nudging people's behavior -- in high- and low-income countries alike. Why? Because we're generally the same everywhere, doing what's easiest in a given situation rather than what we might choose were the circumstances different. Having skeptics on the outside where they can rail against wrongs is incredibly important, but they must also be careful to yell at the right things lest they be ignored altogether by those who don't share their ideological priors.

Off by a factor of 100

GiveWell is an "independent, nonprofit charity evaluator" that finds "outstanding giving opportunities and publish[es] the full details of [their] analysis to help donors decide where to give." Their Giving 101 page is a good place to start regarding their methodology and conclusions. I want to highlight a recent blog post of theirs titled "Errors in DCP2 Cost Effectiveness Estimate for Deworming". DCP2 stands for "Disease Control Priorities in Developing Countries," a report funded by the Gates Foundation and produced for many partners including the World Bank.

The DCP2 blog post and its comments are wonky but worth reading in full because of their implications. It's a pretty strong argument for why calculations need to be as transparent as possible if we're going to make decisions based on them:

Over the past few months, GiveWell has undertaken an in-depth investigation of the cost-effectiveness of deworming, a treatment for parasitic worms that are very common in some parts of the developing world. While our investigation is ongoing, we now believe that one of the key cost-effectiveness estimates for deworming is flawed, and contains several errors that overstate the cost-effectiveness of deworming by a factor of about 100. This finding has implications not just for deworming, but for cost-effectiveness analysis in general: we are now rethinking how we use published cost-effectiveness estimates for which the full calculations and methods are not public...

Eventually, we were able to obtain the spreadsheet that was used to generate the $3.41/DALY [Disability-adjusted life year] estimate. That spreadsheet contains five separate errors that, when corrected, shift the estimated cost effectiveness of deworming from $3.41 to $326.43.

From later in the post:

Whether or not the long-term effects are taken into account, the corrected DCP2 estimate of STH treatment falls outside of the $100/DALY range that the World Bank initially labeled as highly cost-effective (see page 36 of the DCP2.) With the corrections, a variety of interventions, including vaccinations and insecticide-treated bednets, become substantially more cost-effective than deworming.
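For a quick back-of-the-envelope check on the figures quoted above, here's a minimal sketch that uses only the numbers GiveWell reports (the variable names and the explicit $100/DALY threshold comparison are mine):

```python
# Rough check of the figures quoted above (GiveWell's numbers, my arithmetic).
original_estimate = 3.41     # $/DALY, the erroneous DCP2 estimate
corrected_estimate = 326.43  # $/DALY, after GiveWell's corrections
wb_threshold = 100.0         # $/DALY cutoff labeled "highly cost-effective"

print(round(corrected_estimate / original_estimate, 1))  # ~95.7, i.e. roughly a factor of 100
print(corrected_estimate > wb_threshold)                 # True: no longer "highly cost-effective"
```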

Best practices of ranking aid best practices

Aid Watch has a post up by Claudia Williamson (a post-doc at DRI) about the "Best and Worst of Official Aid 2011". As Claudia summarizes, their paper looks at "five dimensions of agency ‘best practices’: aid transparency, minimal overhead costs, aid specialization, delivery to more effective channels, and selectivity of recipient countries based on poverty and good government" and calculates an overall agency score. Williamson notes that the "scores only reflect the above practices; they are NOT a measure of whether the agency’s aid is effective at achieving good results." Very true -- but I think this can be easily overlooked. In their paper Easterly and Williamson say:

We acknowledge that there is no direct evidence that our indirect measures necessarily map into improved impact of aid on the intended beneficiaries. We will also point out specific occasions where the relationship between our measures and desirable outcomes could be non-monotonic or ambiguous.

But still, grouping these things together into a single index may obscure more than it enlightens. Transparency seems like an unambiguous good, whereas low overhead percentages are less clearly so. Some other criticisms from the comments section that I'd like to highlight include this one from a commenter named Bula:

DfID scores high and USAID scores low because they have fundamentally different missions. I doubt anyone at USAID or State would attempt to say with a straight face that AID is anything other than a public diplomacy tool. DfID as a stand alone ministry has made a serious effort in all of the areas you’ve measured because it’s mission aligns more closely to ‘doing development’ and less with ‘public diplomacy’. Seems to be common sense.

And a comment from Tom that starts with a quote from the Aid Watch post:

“These scores only reflect the above practices; they are NOT a measure of whether the agency’s aid is effective at achieving good results.”

Seriously? How can you possibly give an aid agency a grade based solely on criteria that have no necessary relationship with aid effectiveness? It is your HYPOTHESIS that transparency, overhead, etc, significantly affect the quality of aid, but without looking at actual effeciveness that hypothesis is completely unproven. An A or an F means absolutely nothing in this context. Without looking at what the agency does with the aid (i.e. is it effective), why should we care whether an aid agency has low or high overhead? To take another example, an aid agency could be the least transparent but achieve the best results; which matters more, your ideological view of how an agency “should” function, or that they achieve results? In my mind it’s the ends that matter, and we should then determine what the best means are to achieve that result. You approach it with an a priori belief that those factors are the most important, and therefore risk having ideology overrule effectiveness. Isn’t that criticism the foundation of this blog and Dr. Easterly’s work more generally?

Terence at Waylaid Dialectic has three specific criticisms worth reading and then ends with this:

I can see the appeal, and utility of such indices, and the longitudinal data in this one are interesting, but still think the limitations outweigh the merits, at least in the way they’re used here. It’s an interesting paper but ultimately more about heat than light.

I'm not convinced the limitations outweigh the merits, but there are certainly problems. One is that the results quickly get condensed to "Britain, Japan and Germany do pretty well and the U.S. doesn’t."

Another problem is that without having some measure of aid effectiveness, it seems that this combined metric may be misleading -- analogous to a process indicator in a program evaluation. In that analogy, Program A might procure twice as many bednets as Program B, but that doesn't mean it's necessarily better, and for that you'd need to look at the impact on health outcomes. Maybe more nets is better. Or maybe the program that procures fewer bednets distributes them more intelligently and has a stronger impact. In the absence of data on health outcomes, is the process indicator useful or misleading? Well, it depends. If there's a strong correlation (or even a good reason to believe) that the process and impact indicators go together, then it's probably better than nothing. But if some of the aid best practices lead to better aid effectiveness, and some don't, then it's at best not very useful, and at worst will prompt agencies to move in the wrong direction.

As Easterly and Williamson note in their paper, they're merely looking at whether aid agencies do what aid agencies say should be their best practices. However, without a better idea of the correlation between those aid practices and outcomes for the people who are supposed to benefit from the programs, it's really hard to say whether this metric is (using Terence's words) "more about heat than light."

It's a Catch-22: without information on the correlation between best aid practices and real aid effectiveness it's hard to say whether the best aid practices "process indicator" is enlightening or obfuscating, but if we had that data on actual aid effectiveness we would be looking at that rather than best practices in the first place.

Modelling Stillbirth

William Easterly and Laura Freschi go after "Inception Statistics" in the latest post on AidWatch. They criticize -- in typically hyperbolic style, with bonus points for the pun in the title -- both the estimates of stillbirth and their coverage in the news media. I left a comment on their blog outlining my thoughts but thought I'd re-post them here with a little more explanation. Here's what I said:

Thanks for this post (it’s always helpful to look at quality of estimates critically) but I think the direction of your criticism needs to be clarified. Which of the following are you upset about (choose all that apply)?

a) The fact that the researchers used models at all? I don’t know the researchers personally, but I would imagine that they are concerned with data quality in general and would have much preferred to have reliable data from all the countries they work with. But in the absence of that data (and while working towards it), isn’t it helpful to have the best possible estimates on which to set global health policy, while acknowledging their limitations? Based on the available data, is there a better way to estimate these, or do you think we’d be better off without them (in which case stillbirth might be getting even less attention)?

b) A misrepresentation of their data as something other than a model? If so, could you please specify where you think that mistake occurred — to me it seems like they present it in the literature as what it is and nothing more.

c) The coverage of these data in the media? On that I basically agree. It’s helpful to have critical viewpoints on articles where there is legitimate disagreement.

I get the impression your main beef is with (c), in which case I agree that press reports should be more skeptical. But I think calling the data “made up” goes too far too. Yes, it’d be nice to have pristine data for everything, but in the meantime we should try for the best possible estimates because we need something on which to base policy decisions. Along those lines, I think this commentary by Neff Walker (full disclosure: my advisor) in the same issue is worthwhile. Walker asks these five questions – noting areas where the estimates need improvement:

- “Do the estimates include time trends, and are they geographically specific?” (because these allow you to crosscheck numbers for credibility)
- “Are modelled results compared with previous estimates and differences explained?”
- “Is there a logical and causal relation between the predictor and outcome variables in the model?”
- “Do the reported measures of uncertainty around modelled estimates show the amount and quality of available data?”
- “How different are the settings from which the datasets used to develop the model were drawn from those to which the model is applied?” (here Walker says further work is needed)

I'll admit to being in over my head in evaluating these particular models. As Easterly and Freschi note, "the number of people who actually understand these statistical techniques well enough to judge whether a certain model has produced a good estimate or a bunch of garbage is very, very small." Very true. But in the absence of better data, we need models on which to base decisions -- otherwise we're basing our decisions on uninformed guesswork rather than informed guesswork.

I think the criticism of media coverage is valid. Even if these models are the best ever they should still be reported as good estimates at best. But when Easterly calls the data "made up" I think the hyperbole is counterproductive. There's an incredibly wide spectrum of data quality, from completely pulled-out-of-the-navel to comprehensive data from a perfectly-functioning vital registration system. We should recognize that the data we work with aren't perfect. And there probably is a cut-off point at which estimates are based on so many models-within-models that they are hurtful rather than helpful in making informed decisions. But are these particular estimates at that point? I would need to see a much more robust criticism than AidWatch has provided so far to be convinced that these estimates aren't helpful in setting priorities.

"Small Changes, Big Results"

The Boston Review has a whole new set of articles on the movement of development economics towards randomized trials. The main article is Small Changes, Big Results: Behavioral Economics at Work in Poor Countries and the companion and criticism articles are here. They're all worth reading, of course. I found them through Chris Blattman's new post "Behavioral Economics and Randomized Trials: Trumpeted, Attacked, and Parried." I want to re-state a point I made in the comments there, because I think it's worth re-wording to get it right. It's this: I often see the new randomized trials in economics compared to clinical trials in the medical literature. There are many parallels to be sure, but the medical literature is huge, and there's really one subset of it that offers better parallels.

Within global health research there is a slew of large (and not-so-large), randomized (and otherwise rigorously designed), controlled (placebo or not) trials that are done in "field" or "community" settings. The distinction is that clinical trials usually draw their study populations from a hospital or other clinical setting, so their results generalize to the broader population (external validity) only to the extent that the clinical population is representative of the whole; community trials, by contrast, are designed to draw from everyone in a given community.

Because these trials draw their subjects from whole communities -- and they're often cluster-randomized, so that whole villages or clinic catchment areas, rather than individuals, are the units being randomized -- they are typically larger, more expensive, and more complicated, and they pose distinctive analytical and ethical problems. There's also often room for nesting smaller studies within the big trials, because the big trials are already recruiting large numbers of people meeting certain criteria and there are always other questions that can be answered using a subset of that same population. [All this is fresh on my mind since I just finished a class called "Design and Conduct of Community Trials," which is taught by several Hopkins faculty who run very large field trials in Nepal, India, and Bangladesh.]
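For readers less familiar with the design, here's a minimal sketch of what cluster randomization looks like, with made-up village names and a simple 50/50 split; it's an illustration, not the design of any particular trial:

```python
# A minimal sketch of cluster randomization: whole villages, not
# individuals, are the unit assigned to treatment or control.
# Village names and the 50/50 split are hypothetical.
import random

random.seed(42)  # for reproducibility

villages = [f"village_{i:02d}" for i in range(20)]  # hypothetical clusters
random.shuffle(villages)
treatment = set(villages[:10])  # half the clusters get the intervention
control = set(villages[10:])

# Every household in a treated village receives the intervention; the
# analysis then has to account for within-village correlation.
assignment = {v: ("treatment" if v in treatment else "control") for v in villages}
print(assignment["village_00"])
```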

Blattman is right to argue for registration of experimental trials in economics research, as is done with medical studies. (For nerdy kicks, you can browse registered trials at ISRCTN.) But many of the problems he quotes Eran Bendavid describing in economics trials--"Our interventions and populations vary with every trial, often in obscure and undocumented ways"--can also be true of community trials in health.

Likewise, these trials -- which can take years and hundreds of thousands of dollars to run -- often yield a lot of knowledge about the process of how things are done. Essential elements include doing good preliminary studies (such as validating your instruments), getting continuous qualitative feedback on how the study is going, and gathering extra data on "process" questions so you'll know why something worked or not, and not just whether it did (a lot of this is addressed in Blattman's "Impact Evaluation 2.0" talk). I think the best parallels for what that research should look like in practice will be found in the big community trials of health interventions in the developing world, rather than in clinical trials in US and European hospitals.

Evaluation in education (and elsewhere)

Jim Manzi has some fascinating thoughts on evaluating teachers at the American Scene. Some summary outtakes:

1. Remember that the real goal of an evaluation system is not evaluation. The goal of an employee evaluation system is to help the organization achieve an outcome....

2. You need a scorecard, not a score. There is almost never one number that can adequately summarize the performance of complex tasks like teaching that are executed as part of a collective enterprise....

3. All scorecards are temporary expedients. Beyond this, no list of metrics can usually adequately summarize performance, either....

4. Effective employee evaluation is not fully separable from effective management

When you zoom out to a certain point, all complex systems in need of reform start to look alike, because they all combine social, political, economic, and technical challenges, and the complexity, irrationality, and implacability of human behavior rear their ugly heads at each step of the process. The debates about tactics and strategy and evaluation for reforming American education or US aid policy or improving health systems or fostering economic development start to blend together, so that Manzi's conclusions sound oddly familiar:

So where does this leave us? Without silver bullets.

Organizational reform is usually difficult because there is no one, simple root cause, other than at the level of gauzy abstraction. We are faced with a bowl of spaghetti of seemingly inextricably interlinked problems. Improving schools is difficult, long-term scut work. Market pressures are, in my view, essential. But, as I’ve tried to argue elsewhere at length, I doubt that simply “voucherizing” schools is a realistic strategy...

Read the rest of his conclusions here.

Microfinance Miscellany

I had a conversation yesterday with a PhD student friend (also in international health) about the evaluation of microcredit programs. I was trying to summarize -- off the top of my head, never a good idea! -- recent findings, and wasn't able to communicate much. But I did note that, as with many aid and development programs, you get a pretty rosy picture when you're using case studies or cherry-picked before-and-after evaluations without comparison groups. So I was trying to describe what it looks like to do rigorous impact evaluations that account for the selection bias you get if you just compare people who self-select into taking out loans with controls. After that discussion, I was quite happy to come across this new resource on David Roodman's blog: yesterday DFID released a literature review of microfinance impacts in Africa.

On a related note, Innovations for Poverty Action hosted a conference on microfinance evaluation last October, and many of the presentations and papers are available here. The "What Are We Learning About Impacts?" section includes presentations given by Abhijit Banerjee (PDF) and by Dean Karlan (PDF) of Yale. Worth reading.

Randomizing in the USA, ctd

[Update: There's quite a bit of new material on this controversy if you're interested. Here's a PDF of Seth Diamond's testimony in support of (and extensive description of) the evaluation at a recent hearing, along with letters of support from a number of social scientists and public health researchers. Also, here's a separate article on the City Council hearing at which Diamond testified, and an NPR story that basically rehashes the Times one. Michael Gechter argues that the testing is wrong because there isn't doubt about whether the program works, but, as noted in the comments there, he doesn't address the fact that denial of service was already part of the program because it was underfunded.] A couple weeks ago I posted a link to this NYTimes article on a program of assistance for the homeless that's currently being evaluated by a randomized trial. The Poverty Action Lab blog had some discussion on the subject that you should check out too.

The short version is that New York City has a housing assistance program that is supposed to keep people from becoming homeless, but they never gave it a truly rigorous evaluation. It would have been better to evaluate it up front (before the full program was rolled out) but they didn't do that, and now they are.  The policy isn't proven to work, and they don't have resources to give it to everyone anyway, so instead of using a waiting list (arguably a fair system) they're randomizing people into receiving the assistance or not, and then tracking whether they end up homeless. If that makes you a little uncomfortable, that's probably a good thing -- it's a sticky issue, and one that might wrongly be easier to brush aside when working in a different culture. But I think on balance it's still a good idea to evaluate programs when we don't know if they actually do what they're supposed to do.

The thing I want to highlight for now is how much the tone and presentation of an article shape your reactions to the issue being discussed. There's obviously an effect, but I thought this would be a good example because I noticed that the Times article contains both valid criticisms of the program and a good defense of why it makes sense to test it.

I reworked the article by rearranging the presentation of those sections. Mostly I just shifted paragraphs, but in a few cases I rearranged some clauses as well. I changed the headline, but otherwise I didn't change a single word, other than clarifying some names when they were introduced in a different order than in the original. And by leading with the rationale for the policy instead of with the emotional appeal against it, I think the article gives a much different impression. Let me know what you think:

City Department Innovates to Test Policy Solutions

By CARA BUCKLEY with some unauthorized edits by BRETT KELLER

It has long been the standard practice in medical testing: Give drug treatment to one group while another, the control group, goes without.

Now, New York City is applying the same methodology to assess one of its programs to prevent homelessness. Half of the test subjects — people who are behind on rent and in danger of being evicted — are being denied assistance from the program for two years, with researchers tracking them to see if they end up homeless.

New York City is among a number of governments, philanthropies and research groups turning to so-called randomized controlled trials to evaluate social welfare programs.

The federal Department of Housing and Urban Development recently started an 18-month study in 10 cities and counties to track up to 3,000 families who land in homeless shelters. Families will be randomly assigned to programs that put them in homes, give them housing subsidies or allow them to stay in shelters. The goal, a HUD spokesman, Brian Sullivan, said, is to find out which approach most effectively ushered people into permanent homes.

The New York study involves monitoring 400 households that sought Homebase help between June and August. Two hundred were given the program’s services, and 200 were not. Those denied help by Homebase were given the names of other agencies — among them H.R.A. Job Centers, Housing Court Answers and Eviction Intervention Services — from which they could seek assistance.

The city’s Department of Homeless Services said the study was necessary to determine whether the $23 million program, called Homebase, helped the people for whom it was intended. Homebase, begun in 2004, offers job training, counseling services and emergency money to help people stay in their homes.

The department, added commissioner Seth Diamond, had to cut $20 million from its budget in November, and federal stimulus money for Homebase will end in July 2012.

Such trials, while not new, are becoming especially popular in developing countries. In India, for example, researchers using a controlled trial found that installing cameras in classrooms reduced teacher absenteeism at rural schools. Children given deworming treatment in Kenya ended up having better attendance at school and growing taller.

“It’s a very effective way to find out what works and what doesn’t,” said Esther Duflo, an economist at the Massachusetts Institute of Technology who has advanced the testing of social programs in the third world. “Everybody, every country, has a limited budget and wants to find out what programs are effective.”

The department is paying $577,000 for the study, which is being administered by the City University of New York along with the research firm Abt Associates, based in Cambridge, Mass. The firm’s institutional review board concluded that the study was ethical for several reasons, said Mary Maguire, a spokeswoman for Abt: because it was not an entitlement, meaning it was not available to everyone; because it could not serve all of the people who applied for it; and because the control group had access to other services.

The firm also believed, she said, that such tests offered the “most compelling evidence” about how well a program worked.

Dennis P. Culhane, a professor of social welfare policy at the University of Pennsylvania, said the New York test was particularly valuable because there was widespread doubt about whether eviction-prevention programs really worked.

Professor Culhane, who is working as a consultant on both the New York and HUD studies, added that people were routinely denied Homebase help anyway, and that the study was merely reorganizing who ended up in that pool. According to the city, 5,500 households receive full Homebase help each year, and an additional 1,500 are denied case management and rental assistance because money runs out.

But some public officials and legal aid groups have denounced the study as unethical and cruel, and have called on the city to stop the study and to grant help to all the test subjects who had been denied assistance.

“They should immediately stop this experiment,” said the Manhattan borough president, Scott M. Stringer. “The city shouldn’t be making guinea pigs out of its most vulnerable.”

But, as controversial as the experiment has become, Mr. Diamond said that just because 90 percent of the families helped by Homebase stayed out of shelters did not mean it was Homebase that kept families in their homes. People who sought out Homebase might be resourceful to begin with, he said, and adept at patching together various means of housing help.

Advocates for the homeless said they were puzzled about why the trial was necessary, since the city proclaimed the Homebase program as “highly successful” in the September 2010 Mayor’s Management Report, saying that over 90 percent of families that received help from Homebase did not end up in homeless shelters. One critic of the trial, Councilwoman Annabel Palma, is holding a General Welfare Committee hearing about the program on Thursday.

“I don’t think homeless people in our time, or in any time, should be treated like lab rats,” Ms. Palma said.

“This is about putting emotions aside,” [Mr. Diamond] said. “When you’re making decisions about millions of dollars and thousands of people’s lives, you have to do this on data, and that is what this is about.”

Still, legal aid lawyers in New York said that apart from their opposition to the study’s ethics, its timing was troubling because nowadays, there were fewer resources to go around.

Ian Davie, a lawyer with Legal Services NYC in the Bronx, said Homebase was often a family’s last resort before eviction. One of his clients, Angie Almodovar, 27, a single mother who is pregnant with her third child, ended up in the study group denied Homebase assistance. “I wanted to cry, honestly speaking,” Ms. Almodovar said. “Homebase at the time was my only hope.”

Ms. Almodovar said she was told when she sought help from Homebase that in order to apply, she had to enter a lottery that could result in her being denied assistance. She said she signed a letter indicating she understood. Five minutes after a caseworker typed her information into a computer, she learned she would not receive assistance from the program.

With Mr. Davie’s help, she cobbled together money from the Coalition for the Homeless and a public-assistance grant to stay in her apartment. But Mr. Davie wondered what would become of those less able to navigate the system. “She was the person who didn’t fall through the cracks,” Mr. Davie said of Ms. Almodovar. “It’s the people who don’t have assistance that are the ones we really worry about.”

Professor Culhane said, “There’s no doubt you can find poor people in need, but there’s no evidence that people who get this program’s help would end up homeless without it.”

Randomizing in the USA

The NYTimes posted this article about a randomized trial in New York City:

It has long been the standard practice in medical testing: Give drug treatment to one group while another, the control group, goes without.

Now, New York City is applying the same methodology to assess one of its programs to prevent homelessness. Half of the test subjects — people who are behind on rent and in danger of being evicted — are being denied assistance from the program for two years, with researchers tracking them to see if they end up homeless.

Dean Karlan at Innovations for Poverty Action responds:

It always amazes me when people think resources are unlimited. Why is "scarce resource" such a hard concept to understand?

I think two of the most important points here are that a) there weren't enough resources for everyone to get the services anyway, so they're just changing the decision-making process for who gets the service from first-come-first-served (presumably) to randomized, and b) studies like this can be ethical when there is reasonable doubt about whether a program actually helps or not. If it were firmly established that the program is beneficial, then it would be unethical to test it, which is why you can't keep testing a proven drug against placebo.

However, this is good food for thought for those who are interested in doing randomized trials of development initiatives in other countries. It shows how individuals react to being treated as "test subjects" here in the US -- and why should we expect people in other countries to feel differently? That said, a lot of randomized trials don't get this sort of pushback. I'm not familiar with this program beyond what I read in this article, but it's possible that more could have been done to communicate the purpose of the trial to the community, activists, and the media.

There are some interesting questions raised in the IPA blog comments as well.

Results-Based Aid

Nancy Birdsall writes "On Not Being Cavalier About Results" about a recent critique of the UK's DFID (Department for International Development):

The fear about an insistence on results arises from confusion about what “results” are. A legitimate typical concern is that aid bureaucracies pressed for “results” will resort, more than already is the case, to projects that provide inputs that seem to add up to easily measured “wins” (bednets delivered, books distributed, paramedics trained, vehicles or computers purchased, roads built) while neglecting “system” issues and “institution building”. But bednets and books and vehicles and roads are not results in any meaningful sense, and the connection between these inputs and real outcomes (healthier babies, better educated children, higher farmer income) goes through systems and institutions and is often lost....

Let us define results as measured gains in what children have learned by the end of primary school, or measured reductions in infant mortality or deforestation, or measured increases in the hours of electricity available, or annual increases in revenue from taxes paid by rich households in poor countries – or a host of other indicators that ultimately add up to the transformation of societies and the end of their dependence on outside aid. For a country to get results might not require more money but a reconfiguration of local politics, the cleaning up of bureaucratic red tape, local leadership in setting priorities or simply more exposure to the force of local public opinion. Let aid be more closely tied to well-defined results that recipient countries are aiming for; let donors and recipients start measuring and reporting those results to their own citizens; let there be continuous evaluation and learning about the mechanics of how recipient countries and societies get those results (their institutional shifts, their system reforms, their shifting politics and priorities), built on the transparency that Secretary Mitchell is often emphasizing.

(Emphasis added)

I'd also like to note that Birdsall is the founding director of the Center for Global Development, a nonprofit in DC that does a lot of work related to evidence-based aid. I relied fairly heavily on their report "Closing the Evaluation Gap" for a recent dual-degree application. The full report is worth the read.