Group vs. individual uses of data

Andrew Gelman notes that, on the subject of value-added assessments of teachers, "a skeptical consensus seems to have arisen..." How did we get here? Value-added assessments grew out of the push to measure educational success through standardized tests -- simply comparing raw test scores isn't fair, because some teachers teach in better schools or teach better-prepared students. The solution was to look at how much teachers' students improve in comparison to other teachers' students. Wikipedia has a fairly good summary here.

Back in February New York City released (over the opposition of teachers' unions) the value-added scores of some 18,000 teachers. Here's coverage from the Times on the release and reactions.

Gary Rubinstein, an education blogger, has done some analysis of the data contained in the reports and published five posts so far: part 1, part 2, part 3, part 4, and part 5. He writes:

For sure the 'reformers' have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added.

I suggest reading his analysis in full, or at least the first two parts.

For me one early take-away from this -- building off comments from Gelman and others -- is that an assessment might be a useful tool for improving education quality overall, while simultaneously being a very poor metric for individual performance. When you're looking at 18,000 teachers you might be able to learn what factors lead to test score improvement on average, and use that information to improve policies for teacher education, recruitment, training, and retention. But that doesn't mean one can necessarily use the same data to make high-stakes decisions about individual teachers.
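To make that distinction concrete, here is a toy simulation (invented numbers, not the NYC data): each teacher gets a small "true" effect plus a lot of year-to-year noise, so a single score is a weak guide to any individual teacher, while a contrast between two large groups of teachers is estimated fairly precisely.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 18_000

# Hypothetical setup: teachers with some credential have a small true advantage
# (+0.5 test-score points); every measured score also carries large year-to-year noise.
has_credential = rng.random(n) < 0.5
true_effect = 0.5 * has_credential + rng.normal(0, 2, n)
measured = true_effect + rng.normal(0, 10, n)   # one noisy year of value-added scores

# Group level: the credential effect is visible against its standard error.
diff = measured[has_credential].mean() - measured[~has_credential].mean()
se = np.sqrt(measured[has_credential].var() / has_credential.sum()
             + measured[~has_credential].var() / (~has_credential).sum())
print(f"group difference: {diff:.2f} points (SE {se:.2f})")

# Individual level: the same scores barely track any one teacher's true effect.
print(f"corr(measured score, true effect): {np.corrcoef(measured, true_effect)[0, 1]:.2f}")
```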

On food deserts

Gina Kolata, writing for the New York Times, has sparked some debate with this article: "Studies Question the Pairing of Food Deserts and Obesity". In general I often wish that science reporting focused more on how the new studies fit in with the old, rather than just the (exciting) new ones. On first reading I noticed that one study is described as having explored the association of "the type of food within a mile and a half of their homes" with what people eat. This raised a little question mark in my mind, as I know that prior studies have often looked at distances much shorter than 1.5 miles, but it was mostly a vague hesitation. And if you didn't know that before reading the article, then you've missed a major difference between the old and new results (and one that could have been easily explained). Also, describing something as "an article of faith" when it's arguably something more like "the broad conclusion drawn from most prior research"... that certainly established an editorial tone from the beginning.

Intrigued, I sent the piece to a friend (and former public health classmate) who has worked on food deserts, to get a more informed reaction. I'm sharing her thoughts here (with permission) because this is an area of research that I don't follow as closely, and her reactions helped me to situate this story in the broader literature:

1. This quote from the article is so good!

"It is always easy to advocate for more grocery stores,” said Kelly D. Brownell, director of Yale University’s Rudd Center for Food Policy and Obesity, who was not involved in the studies. “But if you are looking for what you hope will change obesity, healthy food access is probably just wishful thinking.”

The "unhealthy food environment" has a much bigger impact on diet than the "healthy food environment", but it's politically more viable to work from an advocacy standpoint than a regulatory standpoint. (On that point, you still have to worry about what food is available - you can't just take out small businesses in impoverished neighborhoods and not replace it with anything.)

2. The article is too eager to dismiss the health-food access relationship. There's good research out there, but there's constant difficulty with tightening methods/definitions and deciding what to control for. The thing that I think is really powerful about the "food desert" discourse is that it opens doors to talk about race, poverty, community, culture, and more. At the end of the day, grocery stores are good for low-income areas because they bring in money and raise property values. If the literature isn't perfect on health effects, I'm still willing to advocate for them.

3. I want to know more about the geography of the study that found that low-income areas had more grocery stores than high-income areas. Were they a mix of urban, peri-urban, and rural areas? Because that's a whole other bear. (Non-shocker shocker: rural areas have food deserts... rural poverty is still a problem!)

4. The article does a good job of pointing to how difficult it is to study this. Hopkins (and the Baltimore Food Czar) are doing some work with healthy food access scores for neighborhoods. This would take into account how many healthy food options there are (supermarkets, farmers' markets, arabbers, tiendas) and how many unhealthy food options there are (fast food, carry out, corner stores).

5. The studies they cite are with kids, but the relationship between food insecurity (which is different, but related to food access) and obesity is only well-established among women. (This, itself, is not talked about enough.) The thinking is that kids are often "shielded" from the effects of food insecurity by their mothers, who eat a yo-yo diet depending on the amount of food in the house.

My friend also suggested the following articles for additional reading:

More on microfoundations

Last month I wrote a long-ish post describing the history of the "microfounded" approaches to macroeconomics. For a while I was updating that post with links to recent blog posts as the debate continued, but I stopped after the list grew too long. Now Simon Wren-Lewis has written two more posts that I think are worth highlighting because they come from someone who is generally supportive of the microfoundations approach (I've found his defense of the general approach quite helpful), but who still has some specific critiques. The end of his latest post puts these critiques in context:

One way of reading these two posts is a way of exploring Krugman’s Mistaking Beauty for Truth essay. I know the reactions of colleagues, and bloggers, to this piece have been quite extreme: some endorsing it totally, while others taking strong exception to its perceived targets. My own reaction is very similar to Karl Smith here. I regard what has happened as a result of the scramble for austerity in 2010 to be in part a failure of academic macroeconomics. It would be easy to suggest that this was only the result of unfortunate technical errors, or political interference, and that otherwise the way we do macro is basically fine. I think Krugman was right to suggest otherwise. Given the conservative tendency in any group, an essay that said maybe there might just be an underlying problem here would have been ignored. The discipline needed a wake-up call from someone with authority who knew what they were talking about. Identifying exactly what those problems are, and what to do about them, seems to me an important endeavour that has only just begun.

Here are his two posts:

  1. The street light problem: "I do think microfoundations methodology is progressive. The concern is that, as a project, it may tend to progress in directions of least resistance rather than in the areas that really matter – until perhaps a crisis occurs."
  2. Ideological bias: "In RBC [Real Business Cycle] models, all changes in unemployment are voluntary. If unemployment is rising, it is because more workers are choosing leisure rather than work. As a result, high unemployment in a recession is not a problem at all.... If anyone is reading this who is not familiar with macroeconomics, you might guess that this rather counterintuitive theory is some very marginal and long forgotten macroeconomic idea. You would be very wrong."

Non-representative sample of Hunger Games responses

I had the idea for the Hunger Games survival analysis post Tuesday afternoon and published it about 24 hours later (and yes, in the meantime I did sleep, eat, and do a bit of real work as well). I thought it might hit a nerdy nerve by meshing pop culture and stats, and I was right. Three days later it's been read by over 12,000 people on my site alone, and the average time on page is long enough that I think folks are actually reading it and not just looking at the pretty pictures. It was picked up by Andrew Gelman and Jezebel (a Venn diagram with only this in the middle, I bet) and everyone from Stata to Discover Magazine shared it on Twitter.

All that to say, I think there's a market for explaining statistics and concepts from social science (I tried to work in some political science, economics, and psychology research) using pop culture tie-ins, so I may do some more of this.

For now I want to share some of the humorous reactions I've seen:

  • A classmate who is familiar with survival analysis but hasn't read the books saw the graphs, and her immediate response was "Oh no, what happened on the first day? Those poor children!"
  • Richard Williams, commenting on the Stata listserv discussion: "If Stata can win over the Hunger Games crowd, SAS & SPSS are finished."
  • One of the comments on Metafilter kind of misses the point: "I love statistics, but come on: The major finding here is that Suzanne Collins did a good job creating a fictional dataset that shows some significant differences between groups. Yes, that's because statistics measures deviations from randomness, and Collins *made up the data* as part of her novel's plot." Shocking.
  • A friend to a friend of mine on Gchat: "he used Stata for a Good Thing. It was Interesting. That's im-[ahem]-Possible and he Did It."
  • Finally, the Teaching Assistant from last semester's generalized linear models class threatened to grade it as an assignment. Next time I'll use data where the model's key assumption (proportional hazards) isn't clearly violated...

An application of survival analysis to the Hunger Games (seriously)

I just finished what is quite possibly the nerdiest thing I've ever written:  "Hunger Games survival analysis." I manage to pull in articles from Matt Yglesias and Erik Kain and discuss tesserae inflation, Prospect Theory, demographics, research by Acemoglu and Robinson and by Michael Clemens, game theory, coordination failures, arguments for open data, and of course the namesake survival analysis. Complete with Kaplan-Meier survival estimator graphs and all:

I posted it as a page rather than a blog post to make some of the formatting easier, so please click through to read the real thing.
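If you're curious what the Kaplan-Meier estimator behind those graphs actually does, here's a minimal hand-rolled sketch on made-up survival times (not the tribute data from the post): at each observed death time, you multiply the running survival probability by one minus the fraction of those still at risk who died.

```python
import numpy as np

# Made-up survival data: day of death (or last day observed) for 12 hypothetical subjects,
# and whether a death was actually observed (1) or the time is censored (0).
time = np.array([1, 1, 1, 2, 3, 5, 7, 7, 10, 14, 18, 18])
event = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0])

# Kaplan-Meier estimator: at each distinct event time, multiply the running
# survival probability by (1 - deaths / number still at risk).
surv = 1.0
for t in np.unique(time[event == 1]):
    at_risk = np.sum(time >= t)
    deaths = np.sum((time == t) & (event == 1))
    surv *= 1 - deaths / at_risk
    print(f"day {t:>2}: at risk = {at_risk:>2}, deaths = {deaths}, S(t) = {surv:.3f}")
```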

Up to speed: microfoundations

[Admin note: this is the first of a new series of "Up to speed" posts which will draw together information on a subject that's either new to me or has been getting a lot of play lately in the press or in some corner of the blogosphere. The idea here is that folks who are experts on this particular subject might not find anything new; I'm synthesizing things for those who want to get up to speed.]

Microfoundations (Wikipedia) are quite important in modern macroeconomics. Modern macroeconomics really started with Keynes. His landmark General Theory of Employment, Interest and Money (published in 1936) set the stage for pretty much everything that has come since. Basically everything that came before Keynes couldn't explain the Great Depression -- or worse yet how the world might get out of it -- and Keynes' theories (rightly or wrongly) became popular because they addressed that central failing.

One major criticism was that these modern macroeconomic models, like Keynes's, were top-down, looking only at aggregate measures like output and investment. That may not seem too bad, but when you try to break things down into the underlying individual behaviors that would add up to those aggregates, wacky stuff happens. At that point microeconomic models were much better fleshed out, and the micro models all started with individual rational actors maximizing their utility, assumptions that macroeconomists just couldn't recover by breaking down their aggregate models.

The most influential criticism came from Robert Lucas, in what became known as the Lucas Critique (here's a PDF of his 1976 paper). Lucas basically argued that aggregate models weren't that helpful because they were only looking at surface-level parameters without understanding the underlying mechanisms. If something -- like the policy environment -- changes drastically then the old relationships that were observed in the aggregate data may no longer apply. An example from Wikipedia:

One important application of the critique is its implication that the historical negative correlation between inflation and unemployment, known as the Phillips Curve, could break down if the monetary authorities attempted to exploit it. Permanently raising inflation in hopes that this would permanently lower unemployment would eventually cause firms' inflation forecasts to rise, altering their employment decisions.
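A toy simulation makes that example concrete (this is purely illustrative, not a real macro model): when inflation forecasts lag actual inflation, inflation surprises trace out the familiar negative relationship with unemployment, but permanently raising inflation simply raises forecasts and leaves average unemployment where it was.

```python
import numpy as np

rng = np.random.default_rng(3)
natural, k = 5.0, 1.5   # assumed natural unemployment rate and short-run trade-off slope

def simulate(mean_inflation, periods=200):
    """Toy expectations-augmented Phillips curve:
    unemployment = natural - k * (inflation - expected inflation) + noise."""
    inflation = mean_inflation + rng.normal(0, 1, periods)
    expected = np.roll(inflation, 1)          # forecasts simply lag inflation by one period
    expected[0] = mean_inflation
    unemployment = natural - k * (inflation - expected) + rng.normal(0, 0.3, periods)
    return inflation, unemployment

low_infl, low_unemp = simulate(2.0)    # the historical, low-inflation regime
high_infl, high_unemp = simulate(8.0)  # policy permanently raises inflation

# Within a regime, inflation surprises produce the familiar negative correlation...
print(f"within-regime corr(inflation, unemployment): {np.corrcoef(low_infl, low_unemp)[0, 1]:.2f}")
# ...but once forecasts adjust to the new regime, average unemployment is unchanged.
print(f"mean unemployment at ~2% inflation: {low_unemp.mean():.2f}")
print(f"mean unemployment at ~8% inflation: {high_unemp.mean():.2f}")
```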

Economists responded by developing "micro-founded" macroeconomic models, ones that built up from the sum of microeconomic models. The most commonly used of these models is called, awkwardly, dynamic stochastic general equilibrium (DSGE). Much of my study time this semester involves learning the math behind this. What's the next step forward from DSGE? Are these models better than the old Keynesian models? How do we even define "better"? These are all hot topics in macro at the moment. There's been a recent spat in the economics blogosphere that illustrates this -- what follows are a few highlights.

Back in 2009 Paul Krugman (NYT columnist, Nobel winner, and Woodrow Wilson School professor) wrote an article titled "How Did Economists Get It So Wrong?" that included this paragraph:

As I see it, the economics profession went astray because economists, as a group, mistook beauty, clad in impressive-looking mathematics, for truth. Until the Great Depression, most economists clung to a vision of capitalism as a perfect or nearly perfect system. That vision wasn’t sustainable in the face of mass unemployment, but as memories of the Depression faded, economists fell back in love with the old, idealized vision of an economy in which rational individuals interact in perfect markets, this time gussied up with fancy equations. The renewed romance with the idealized market was, to be sure, partly a response to shifting political winds, partly a response to financial incentives. But while sabbaticals at the Hoover Institution and job opportunities on Wall Street are nothing to sneeze at, the central cause of the profession’s failure was the desire for an all-encompassing, intellectually elegant approach that also gave economists a chance to show off their mathematical prowess.

Last month Stephen Williamson wrote this:

[Because of the financial crisis] There was now a convenient excuse to wage war, but in this case a war on mainstream macroeconomics. But how can this make any sense? The George W era produced a political epiphany for Krugman, but how did that ever translate into a war on macroeconomists? You're right, it does not make any sense. The tools of modern macroeconomics are no more the tools of right-wingers than of left-wingers. These are not Republican tools, Libertarian tools, Democratic tools, or whatever.

A bit of a sidetrack, but this prompted Noah Smith to write a long post (generally more technical than I want to get into here) defending the idea that modern macro models (like DSGE) are in fact ideologically biased, even if that's not their intent. Near the end:

So what this illustrates is that it's really hard to make a DSGE model with even a few sort-of semi-realistic features. As a result, it's really hard to make a DSGE model in which government policy plays a useful role in stabilizing the business cycle. By contrast, it's pretty easy to make a DSGE model in which government plays no useful role, and can only mess things up. So what ends up happening? You guessed it: a macro literature where most papers have only a very limited role for government.

In other words, a macro literature whose policy advice is heavily tilted toward the political preferences of conservatives.

Back on the main track, Simon Wren-Lewis, writing at Mainly Macro, comes to Krugman's defense, sort of, by saying that it's conceivable that an aggregate model might actually be more defensible than a micro-founded one in certain circumstances.

This view [Krugman's view that aggregate models may still be useful] appears controversial. If the accepted way of doing macroeconomics in academic journals is to almost always use a ‘fancier optimisation’ model, how can something more ad hoc be more useful? Coupled with remarks like ‘the economics profession went astray because economists, as a group, mistook beauty, clad in impressive-looking mathematics, for truth’ (from the 2009 piece) this has got a lot of others, like Stephen Williamson, upset. [skipping several paragraphs]

But suppose there is in fact more than one valid microfoundation for a particular aggregate model. In other words, there is not just one, but perhaps a variety of particular worlds which would lead to this set of aggregate macro relationships....Furthermore, suppose that more than one of these particular worlds was a reasonable representation of reality... It would seem to me that in this case the aggregate model derived from these different worlds has some utility beyond just one of these microfounded models. It is robust to alternative microfoundations.

Krugman then followed up with an argument for why it's OK to use both aggregate and microfounded models.

And here's Noah Smith writing again, "Why bother with microfoundations?"

Using wrong descriptions of how people behave may or may not yield aggregate relationships that really do describe the economy. But the presence of the incorrect microfoundations will not give the aggregate results a leg up over models that simply started with the aggregates....

When I look at the macro models that have been constructed since Lucas first published his critique in the 1970s, I see a whole bunch of microfoundations that would be rejected by any sort of empirical or experimental evidence (on the RBC side as well as the Neo-Keynesian side). In other words, I see a bunch of crappy models of individual human behavior being tossed into macro models. This has basically convinced me that the "microfounded" DSGE models we now use are only occasionally superior to aggregate-only models. Macroeconomists seem to have basically nodded in the direction of the Lucas critique and in the direction of microeconomics as a whole, and then done one of two things: either A) gone right on using aggregate models, while writing down some "microfoundations" to please journal editors, or B) drawn policy recommendations directly from incorrect models of individual behavior.

The most recent is from Krugman, wherein he says (basically) that models that make both small and big predictions should be judged more on the big than the small.

This is just a sampling, and likely a biased one as there are many who dismiss the criticism of microfoundations out of hand and thus aren't writing detailed responses. Either way, the microfoundations models are dominant in the macro literature now, and the macro-for-policy-folks class I'm taking at the moment focuses on micro-founded models (because they're "how modern macro is done").

So what to conclude? My general impression is that microeconomics is more heavily 'evolved' than macroeconomics. (You could say that in macro the generation times are much longer, and the DNA replication bits are dodgier, so evolving from something clearly wrong towards something clearly better is taking longer.)

Around the same time that micro was getting problematized by Kahneman and others who questioned the rational utility-maximizing nature of humans, thus launching the behavioral economics revolution -- which tries to complicate micro theory with a bit of reality -- the macroeconomists were just getting around to incorporating the original microeconomic emphasis on rationality. Just how much micro will change in the next decades in response to the behavioral revolution is unclear, so expecting troglodytesque macro to have already figured this out is unrealistic.

A number of things are unclear to me: just how deep the dissatisfaction with the current models is, how broadly these critiques (vs. others from different directions) are endorsed, and what actually drives change in fields of inquiry. Looking back in another 30-40 years we might see this moment in time as a pivotal shift in the history of the development of macroeconomics -- or it may be a little hiccup that no one remembers at all. It's too soon to tell.

Updates: since writing this I've noticed several more additions to the discussion:

Coincidence or consequence?

Imagine there's a pandemic flu virus on the loose, and a vaccine has just been introduced. Then come reports of dozens of cases of Guillain-Barré syndrome (GBS), a rare type of paralysis. Did the new vaccine cause it? How would you even begin to know? One first step (though certainly not the only one) is to think about the background rate of disease:

Inappropriate assessment of vaccine safety data could severely undermine the effectiveness of mass campaigns against pandemic H1N1 2009 influenza. Guillain-Barré syndrome is a good example to consider. Since the 1976–77 swine influenza vaccination campaign was associated with an increased number of cases of Guillain-Barré syndrome, assessment of such cases after vaccination will be a high priority. Therefore, it is important to know the background rates of this syndrome and how this rate might vary with regard to population demographics. The background rate of the syndrome in the USA is about 1–2 cases per 1 million person-months of observation. During a pandemic H1N1 vaccine campaign in the USA, 100 million individuals could be vaccinated. For a 6-week follow-up period for each dose, this corresponds to 150 million person-months of observation time during which a predicted 200 or more new cases of Guillain-Barré syndrome would occur as background coincident cases. The reporting of even a fraction of such a large number of cases as adverse events after immunisation, with attendant media coverage, would probably give rise to intense public concern, even though the occurrence of such cases was completely predictable and would have happened in the absence of a mass campaign.

That's from a paper by Steven Black et al. in 2009, "Importance of background rates of disease in assessment of vaccine safety during mass immunisation with pandemic H1N1 influenza vaccines". They also calculate background rates for spontaneous abortion, preterm delivery, and spontaneous death among other things.
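The arithmetic behind that "200 or more" figure is easy to reproduce with the paper's round numbers:

```python
# Back-of-the-envelope version of the Black et al. calculation (their round numbers).
vaccinated = 100_000_000      # people vaccinated during the campaign
follow_up_months = 1.5        # a 6-week follow-up window per dose
background_rate = 1.5e-6      # GBS cases per person-month (paper: roughly 1-2 per million)

person_months = vaccinated * follow_up_months
expected_cases = person_months * background_rate
print(f"{person_months:,.0f} person-months of observation")
print(f"~{expected_cases:.0f} coincident GBS cases expected with no vaccine effect at all")
# At 1-2 cases per million person-months this works out to 150-300 cases,
# consistent with the paper's "200 or more".
```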

Generalized linear models resource

The lectures are over, the problem sets are submitted -- all that's left for the fall semester are finals in a couple weeks. One of the courses I'm taking is Germán Rodríguez's "Generalized Linear Statistical Models" and it occurred to me that I should highlight the course website for blog readers. Princeton does not have a school of public health (nor a medical school, business school, or law school, amongst other things) but it does have a program in demography and population research, and Professor Rodríguez teaches in that program.

The course website includes Stata logs, exams, datasets, and problem sets based on those data sets. The lectures have closely followed the lecture notes on the website, covering the following models: linear models (continuous data), logit models (binary data), Poisson models (count data), overdispersed count data, log-linear models (contingency tables), multinomial responses, survival analysis, and panel data, along with some appendices on likelihood and GLM theory. Enjoy.
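The course materials are all in Stata, but if you want to experiment with the same model families elsewhere, here is a rough sketch (not course material) of fitting one of them, a Poisson model for count data, with Python's statsmodels; swapping the family gives you the logit, linear, and other GLMs on that list.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated count data: event counts with a log-linear dependence on an exposure.
rng = np.random.default_rng(1)
df = pd.DataFrame({"exposure": rng.uniform(0, 2, 500)})
df["events"] = rng.poisson(np.exp(0.3 + 0.8 * df["exposure"]))

# Poisson regression for count data, one of the GLM families covered in the notes.
fit = smf.glm("events ~ exposure", data=df, family=sm.families.Poisson()).fit()
print(fit.summary())

# Swapping the family gives the other models on the list, e.g.
# sm.families.Binomial() for logit models or sm.families.Gaussian() for linear ones.
```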

Platform evaluation

Cesar Victora, Bob Black, Ties Boerma, and Jennifer Bryce (three of the four are with the Hopkins Department of International Health and I took a course with Prof Bryce) wrote this article in The Lancet in January 2011: "Measuring impact in the Millennium Development Goal era and beyond: a new approach to large-scale effectiveness evaluations." The abstract:

Evaluation of large-scale programmes and initiatives aimed at improvement of health in countries of low and middle income needs a new approach. Traditional designs, which compare areas with and without a given programme, are no longer relevant at a time when many programmes are being scaled up in virtually every district in the world. We propose an evolution in evaluation design, a national platform approach that: uses the district as the unit of design and analysis; is based on continuous monitoring of different levels of indicators; gathers additional data before, during, and after the period to be assessed by multiple methods; uses several analytical techniques to deal with various data gaps and biases; and includes interim and summative evaluation analyses. This new approach will promote country ownership, transparency, and donor coordination while providing a rigorous comparison of the cost-effectiveness of different scale-up approaches.

Pick your model

I enjoyed this piece by Dani Rodrik at Project Syndicate:

Indeed, though you may be excused for skepticism if you have not immersed yourself in years of advanced study in economics, coursework in a typical economics doctoral program produces a bewildering variety of policy prescriptions depending on the specific context. Some of the frameworks economists use to analyze the world favor free markets, while others don’t. In fact, much economic research is devoted to understanding how government intervention can improve economic performance. And non-economic motives and socially cooperative behavior are increasingly part of what economists study.

As the late great international economist Carlos Diaz-Alejandro once put it, “by now any bright graduate student, by choosing his assumptions….carefully, can produce a consistent model yielding just about any policy recommendation he favored at the start.” And that was in the 1970’s! An apprentice economist no longer needs to be particularly bright to produce unorthodox policy conclusions.

Nevertheless, economists get stuck with the charge of being narrowly ideological, because they are their own worst enemy when it comes to applying their theories to the real world. Instead of communicating the full panoply of perspectives that their discipline offers, they display excessive confidence in particular remedies – often those that best accord with their own personal ideologies.

Is it that bad? Well, statistician Kaiser Fung (of the blog Numbers Rule Your World) says that it's actually much worse and that Rodrik doesn't go far enough; he compares Rodrik's point with a critique of economic modeling in Emanuel Derman's new book Models Behaving Badly (which I haven't read yet):

My own view, informed by years of building statistical models for businesses, is more sympathetic with Derman than Rodrik. There is no way that economic (by extension, social science) models can ever be similar to physics models. Derman draws the comparison in order to disparage economics models. I prefer to avoid the comparison entirely.

The insurmountable challenge of social science models, which constrains their effectiveness, is that the real drivers of human behavior are not measurable. What causes people to purchase goods, or vote for a particular candidate, or become obese, or trade stocks is some combination of desire, impulse, guilt, greed, gullibility, inattention, curiosity, etc. We can't measure any of those quantities accurately.

Discarding efficacy?

Andrew Grove, former CEO of Intel, writes an editorial in Science:

We might conceptualize an “e-trial” system along similar lines. Drug safety would continue to be ensured by the U.S. Food and Drug Administration. While safety-focused Phase I trials would continue under their jurisdiction, establishing efficacy would no longer be under their purview. Once safety is proven, patients could access the medicine in question through qualified physicians. Patients' responses to a drug would be stored in a database, along with their medical histories. Patient identity would be protected by biometric identifiers, and the database would be open to qualified medical researchers as a “commons.” The response of any patient or group of patients to a drug or treatment would be tracked and compared to those of others in the database who were treated in a different manner or not at all.

Alex Tabarrok of Marginal Revolution (who is a big advocate for FDA reform, running this site) really likes the idea. I hate it. The current system has its problems, but Grove's system would be much, much worse. The biggest problem is that we would have no good data about whether a drug is truly efficacious, because all of the results in the database would be confounded by selection bias. Getting a large sample size and having subgroups tells you nothing about why someone got the treatment in the first place.
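A toy simulation (entirely invented numbers) shows why sample size doesn't rescue this: if sicker patients are more likely to receive a drug that in truth does nothing, a naive comparison in Grove's database makes the drug look harmful.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000   # even a huge "e-trial" database doesn't fix the problem

severity = rng.normal(0, 1, n)                # unmeasured illness severity
p_treated = 1 / (1 + np.exp(-2 * severity))   # sicker patients are likelier to get the drug
treated = rng.random(n) < p_treated

# The drug has NO true effect; outcomes depend only on severity plus noise.
outcome = -severity + rng.normal(0, 1, n)     # higher = better health

print(f"mean outcome, treated:   {outcome[treated].mean():.2f}")
print(f"mean outcome, untreated: {outcome[~treated].mean():.2f}")
# The treated group looks substantially worse even though the drug did nothing,
# purely because of who ended up getting treated.
```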

Would physicians pay attention to peer-reviewed articles and reviews identifying the best treatments for specific groups? Or would they just run their own analyses? I think there would be a lot of the latter, which is scary since many clinicians can’t even define selection bias or properly interpret statistical tests. The current system has limitations, but Grove's idea would move us even further from any sort of evidence-based medicine.

Other commenters at Marginal Revolution rightly note that it's difficult to separate safety from efficacy, because recommending a drug is always based on a balance of risks and benefits. Debilitating nausea or strong likelihood of heart attack would never be OK in a drug for mild headaches, but if it cures cancer the standards are (and should be) different.

Derek Lowe, a fellow Arkansan who writes the excellent chemistry blog In The Pipeline, has more extensive (and informed) thoughts here.

Update (1/5/2012): More criticism, summarized by Derek Lowe.

What does social science know?

Marc Bellemare wrote a post "For Fellow Teachers: Revised Primers on Linear Regression and Causality." Good stuff for students too -- not just teachers. The primers are PDFs on linear regression (6 pages) and causality (3 pages), and they're either 1) a concise summary if you're studying this stuff already, or 2) something you should really read if you don't have any background in quantitative methods. I also really enjoyed an essay by Jim Manzi that Marc links to, titled "What Social Science Does -- and Doesn't -- Know." Manzi reviews the history of experimentation in natural sciences, and then in social sciences. He discusses why it's more difficult to extrapolate from randomized trials in the social sciences due to greater 'causal density,' amongst other reasons. Manzi summarizes a lot of research in criminology (a field I didn't even know used many field trials) and ends with some conclusions that seem sharp (emphasis added):

...After reviewing experiments not just in criminology but also in welfare-program design, education, and other fields, I propose that three lessons emerge consistently from them.

First, few programs can be shown to work in properly randomized and replicated trials. Despite complex and impressive-sounding empirical arguments by advocates and analysts, we should be very skeptical of claims for the effectiveness of new, counterintuitive programs and policies, and we should be reluctant to trump the trial-and-error process of social evolution in matters of economics or social policy.

Second, within this universe of programs that are far more likely to fail than succeed, programs that try to change people are even more likely to fail than those that try to change incentives. A litany of program ideas designed to push welfare recipients into the workforce failed when tested in those randomized experiments of the welfare-reform era; only adding mandatory work requirements succeeded in moving people from welfare to work in a humane fashion. And mandatory work-requirement programs that emphasize just getting a job are far more effective than those that emphasize skills-building. Similarly, the list of failed attempts to change people to make them less likely to commit crimes is almost endless—prisoner counseling, transitional aid to prisoners, intensive probation, juvenile boot camps—but the only program concept that tentatively demonstrated reductions in crime rates in replicated RFTs was nuisance abatement, which changes the environment in which criminals operate....

I'd note here that many researchers and policymakers who are interested in health-related behavior change have been moving away from simply providing information or attempting to persuade people to change their behavior, and moving towards changing the unhealthy environments in which we live. NYC Health Commissioner Thomas Farley spoke explicitly about this shift in emphasis when he addressed us summer interns back in June. That approach is a direct response to frustration with the small returns from many behavioral intervention approaches, and an acknowledgment that we humans are stubborn creatures whose behavior is shaped (more than we'd like to admit) by our environments.

Manzi concludes:

And third, there is no magic. Those rare programs that do work usually lead to improvements that are quite modest, compared with the size of the problems they are meant to address or the dreams of advocates.

Right, no pie in the sky. If programs or policies had huge effects they'd be much easier to measure, for one. Read it all.

Miscellany: Epidemic City and life expectancy

In 8 days I'll be done with my first year of graduate studies and will have a chance to write a bit more. I've been keeping notes all year on things to write about when I have more time, so I should have no shortage of material! In the meantime, two links to share: 1) Just in time for my summer working with the New York City Department of Health comes Epidemic City: The Politics of Public Health in New York. The Amazon / publisher's blurb:

The first permanent Board of Health in the United States was created in response to a cholera outbreak in New York City in 1866. By the mid-twentieth century, thanks to landmark achievements in vaccinations, medical data collection, and community health, the NYC Department of Health had become the nation's gold standard for public health. However, as the city's population grew in number and diversity, new epidemics emerged, and the department struggled to balance its efforts between the treatment of diseases such as AIDS, multi-drug resistant tuberculosis, and West Nile Virus and the prevention of illness-causing factors like lead paint, heroin addiction, homelessness, smoking, and unhealthy foods. In Epidemic City, historian of public health James Colgrove chronicles the challenges faced by the health department in the four decades following New York City's mid-twentieth-century peak in public health provision.

This insightful volume draws on archival research and oral histories to examine how the provision of public health has adapted to the competing demands of diverse public needs, public perceptions, and political pressure.

Epidemic City delves beyond a simple narrative of the NYC Department of Health's decline and rebirth to analyze the perspectives and efforts of the people responsible for the city's public health from the 1960s to the present. The second half of the twentieth century brought new challenges, such as budget and staffing shortages, and new threats like bioterrorism. Faced with controversies such as needle exchange programs and AIDS reporting, the health department struggled to maintain a delicate balance between its primary focus on illness prevention and the need to ensure public and political support for its activities.

In the past decade, after the 9/11 attacks and bioterrorism scares partially diverted public health efforts from illness prevention to threat response, Mayor Michael Bloomberg and Department of Health Commissioner Thomas Frieden were still able to work together to pass New York's Clean Indoor Air Act restricting smoking and significant regulations on trans-fats used by restaurants. Because of Bloomberg's willingness to exert his political clout, both laws passed despite opposition from business owners fearing reduced revenues and activist groups who decried the laws' infringement upon personal freedoms. This legislation, preventative in nature much like the 1960s lead paint laws and the department's original sanitary code, reflects a return to the 19th-century roots of public health, when public health measures were often overtly paternalistic. The assertive laws conceived by Frieden and executed by Bloomberg demonstrate how far the mandate of public health can extend when backed by committed government officials.

Epidemic City provides a compelling historical analysis of the individuals and groups tasked with negotiating the fine line between public health and political considerations during the latter half of the twentieth century. By examining the department's successes and failures during the ambitious social programs of the 1960s, the fiscal crisis of the 1970s, the struggles with poverty and homelessness in the 1980s and 1990s, and in the post-9/11 era, Epidemic City shows how the NYC Department of Health has defined the role and scope of public health services, not only in New York, but for the entire nation.

2) Aaron Carroll at the Incidental Economist writes about the subtleties of life expectancy. His main point is that infant mortality skews life expectancy figures so much that if you're talking about end-of-life expectations for adults who have already passed those (historically) most perilous times as a youngster, you really need to look at different data altogether.

The blue points on the graph below show life expectancy for all races in the US at birth, while the red line shows life expectancy amongst those who have reached the age of 65. That is, if you're a 65-year-old who wants to know your chances of dying (on average!) in a certain period of time, it's best to consult a more complete life table rather than life expectancy at birth, because you've already dodged the bullet for 65 years.

(from the Incidental Economist)
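Carroll's point is easy to see with a crude worked example (invented numbers, not a real life table): life expectancy at birth averages over infant deaths, while remaining life expectancy at 65 is computed only from the people who survived to 65.

```python
import numpy as np

# Invented cohort of ages at death with heavy infant mortality (not real data).
rng = np.random.default_rng(7)
adult_deaths = rng.normal(75, 10, 9_000).clip(1, 100)   # most deaths cluster around age 75
infant_deaths = rng.uniform(0, 1, 1_000)                # 10% of the cohort dies in infancy
ages_at_death = np.concatenate([adult_deaths, infant_deaths])

e0 = ages_at_death.mean()                               # life expectancy at birth
survivors_65 = ages_at_death[ages_at_death >= 65]
e65 = survivors_65.mean() - 65                          # remaining years for a 65-year-old

print(f"life expectancy at birth: {e0:.1f} years")
print(f"remaining life expectancy at 65: {e65:.1f} years (to about age {65 + e65:.0f})")
```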

Modelling Stillbirth

William Easterly and Laura Freschi go after "Inception Statistics" in the latest post on AidWatch. They criticize -- in typically hyperbolic style, with bonus points for the pun in the title -- both the estimates of stillbirth and their coverage in the news media. I left a comment on their blog outlining my thoughts but thought I'd re-post them here with a little more explanation. Here's what I said:

Thanks for this post (it’s always helpful to look at quality of estimates critically) but I think the direction of your criticism needs to be clarified. Which of the following are you upset about (choose all that apply)?

a) the fact that the researchers used models at all? I don’t know the researchers personally, but I would imagine that they are concerned with data quality in general and would have much preferred to have reliable data from all the countries they work with. But in the absence of that data (and while working towards it) isn’t it helpful to have the best possible estimates on which to set global health policy, while acknowledging their limitations? Based on the available data, is there a better way to estimate these, or do you think we’d be better off without them (in which case stillbirth might be getting even less attention)?

b) a misrepresentation of their data as something other than a model? If so, could you please specify where you think that mistake occurred — to me it seems like they present it in the literature as what it is and nothing more.

c) the coverage of these data in the media? On that I basically agree. It’s helpful to have critical viewpoints on articles where there is legitimate disagreement.

I get the impression your main beef is with (c), in which case I agree that press reports should be more skeptical. But I think calling the data “made up” goes too far too. Yes, it’d be nice to have pristine data for everything, but in the meantime we should try for the best possible estimates because we need something on which to base policy decisions. Along those lines, I think this commentary by Neff Walker (full disclosure: my advisor) in the same issue is worthwhile. Walker asks these five questions – noting areas where the estimates need improvement:

- “Do the estimates include time trends, and are they geographically specific?” (because these allow you to crosscheck numbers for credibility)
- “Are modelled results compared with previous estimates and differences explained?”
- “Is there a logical and causal relation between the predictor and outcome variables in the model?”
- “Do the reported measures of uncertainty around modelled estimates show the amount and quality of available data?”
- “How different are the settings from which the datasets used to develop the model were drawn from those to which the model is applied?” (here Walker says further work is needed)

I'll admit to being in over my head in evaluating these particular models. As Easterly and Freschi note, "the number of people who actually understand these statistical techniques well enough to judge whether a certain model has produced a good estimate or a bunch of garbage is very, very small." Very true. But in the absence of better data, we need models on which to base decisions -- if not we're basing our decisions on uninformed guesswork, rather than informed guesswork.

I think the criticism of media coverage is valid. Even if these models are the best ever they should still be reported as good estimates at best. But when Easterly calls the data "made up" I think the hyperbole is counterproductive. There's an incredibly wide spectrum of data quality, from completely pulled-out-of-the-navel to comprehensive data from a perfectly-functioning vital registration system. We should recognize that the data we work with aren't perfect. And there probably is a cut-off point at which estimates are based on so many models-within-models that they are hurtful rather than helpful in making informed decisions. But are these particular estimates at that point? I would need to see a much more robust criticism than AidWatch has provided so far to be convinced that these estimates aren't helpful in setting priorities.

"Small Changes, Big Results"

The Boston Review has a whole new set of articles on the movement of development economics towards randomized trials. The main article is Small Changes, Big Results: Behavioral Economics at Work in Poor Countries and the companion and criticism articles are here. They're all worth reading, of course. I found them through Chris Blattman's new post "Behavioral Economics and Randomized Trials: Trumpeted, Attacked, and Parried." I want to re-state a point I made in the comments there, because I think it's worth re-wording to get it right. It's this: I often see the new randomized trials in economics compared to clinical trials in the medical literature. There are many parallels to be sure, but the medical literature is huge, and there's really one subset of it that offers better parallels.

Within global health research there are a slew of large (and not so large), randomized (and other rigorous designs), controlled (placebo or not) trials that are done in "field" or "community" settings. The distinction is that clinical trials usually draw their study populations from a hospital or other clinical setting and their results are thus only generalizable to the broader population (external validity) to the extent that the clinical population is representative of the whole population; while community trials are designed to draw from everyone in a given community.

Because these trials draw their subjects from whole communities -- and they're often cluster-randomized so that whole villages or clinic catchment areas are the unit that's randomized, rather than individuals -- they are typically larger, more expensive, more complicated and pose distinctive analytical and ethical problems. There's also often room for nesting smaller studies within the big trials, because the big trials are already recruiting large numbers of people meeting certain criteria and there are always other questions that can be answered using a subset of that same population. [All this is fresh on my mind since I just finished a class called "Design and Conduct of Community Trials," which is taught by several Hopkins faculty who run very large field trials in Nepal, India, and Bangladesh.]
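One concrete reason those community trials get so large is the design effect of cluster randomization: randomizing villages rather than individuals inflates the required sample size by roughly 1 + (m - 1) × ICC, where m is the average cluster size and ICC is the within-cluster correlation of the outcome. A quick sketch with assumed numbers:

```python
# Design effect for a cluster-randomized trial (standard formula; the numbers are assumptions).
def design_effect(cluster_size: int, icc: float) -> float:
    """Sample-size inflation relative to randomizing individuals."""
    return 1 + (cluster_size - 1) * icc

n_individual = 2_000    # what an individually-randomized trial might need
cluster_size = 100      # e.g., children per village or clinic catchment area
icc = 0.02              # modest within-cluster correlation of outcomes

deff = design_effect(cluster_size, icc)
total_n = n_individual * deff
print(f"design effect: {deff:.2f}")
print(f"cluster-randomized trial needs ~{total_n:,.0f} participants, "
      f"or about {total_n / cluster_size:.0f} villages in total")
```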

Blattman is right to argue for registration of experimental trials in economics research, as is done with medical studies. (For nerdy kicks, you can browse registered trials at ISRCTN.) But many of the problems he quotes Eran Bendavid describing in economics trials--"Our interventions and populations vary with every trial, often in obscure and undocumented ways"--can also be true of community trials in health.

Likewise, these trials -- which often take years and hundreds of thousands of dollars to run -- often yield a lot of knowledge about the process of how things are done. Essential elements include doing good preliminary studies (such as validating your instruments), having continuous qualitative feedback on how the study is going, and gathering extra data on "process" questions so you'll know why something worked or not, and not just whether it did (a lot of this is addressed in Blattman's "Impact Evaluation 2.0" talk). I think the best parallels for what that research should look like in practice will be found in the big community trials of health interventions in the developing world, rather than in clinical trials in US and European hospitals.

Evaluation in education (and elsewhere)

Jim Manzi has some fascinating thoughts on evaluating teachers at the American Scene. Some summary outtakes:

1. Remember that the real goal of an evaluation system is not evaluation. The goal of an employee evaluation system is to help the organization achieve an outcome....

2. You need a scorecard, not a score. There is almost never one number that can adequately summarize the performance of complex tasks like teaching that are executed as part of a collective enterprise....

3. All scorecards are temporary expedients. Beyond this, no list of metrics can usually adequately summarize performance, either....

4. Effective employee evaluation is not fully separable from effective management

When you zoom out to a certain point, all complex systems in need of reform start to look alike, because they all combine social, political, economic, and technical challenges, and the complexity, irrationality, and implacability of human behavior rear their ugly heads at each step of the process. The debates about tactics and strategy and evaluation for reforming American education or US aid policy or improving health systems or fostering economic development start to blend together, so that Manzi's conclusions sound oddly familiar:

So where does this leave us? Without silver bullets.

Organizational reform is usually difficult because there is no one, simple root cause, other than at the level of gauzy abstraction. We are faced with a bowl of spaghetti of seemingly inextricably interlinked problems. Improving schools is difficult, long-term scut work. Market pressures are, in my view, essential. But, as I’ve tried to argue elsewhere at length, I doubt that simply “voucherizing” schools is a realistic strategy...

Read the rest of his conclusions here.

Hangman for Stata

Yes, you can load a .do file and play Hangman in Stata. But only true stats nerds are allowed to play.

And on a related note, have you ever wondered how a game as morbid as hangman became so popular? Can you imagine visiting another culture where everyone -- adults and children -- knew how to play a word game based on the electric chair or decapitation? Would you judge them? Wikipedia tells me its origins are obscure...

Academic vs. Applied... Everything

When I posted on Academic vs. Applied Epi I included the following chart:

Then I realized that this breakdown likely works pretty well for other fields too. I sent a link to an economist friend, who responded: "No doubt this is similar with econ. The theoreticians live in a world of (wrong) assumptions, while the practitioners are facing the tough policy challenges. And there are quite a few similarities with the below...such as urgency etc."

You can replace "physicians" with "economists" or many other professions and the chart holds up. Contrasting academic economics researchers with policymakers, the fields for Timeline, Data quality, Scientific values, Outputs, and Competencies needed all hold up pretty well.

Many positions that are basically epidemiological in nature are filled by physicians with clinical training but very little formal public health and epidemiology training, which is strongly paralleled in the policy realm. Some sort of graduate training is generally necessary for many jobs, so those aiming for the applied track tend to get multipurpose 'public policy' degrees often viewed as weak by the more purist academics, while those studying public policy deride the inapplicability of the theoretical work done by academics. And the orientation of many academic fields towards a set of skills primarily useful in pursuits that aren't highly valued by the more applied practitioners may go a long way in explaining animosity between the two camps.