The State Department’s employee evaluation process is worse than terrible. It is no better than a gamble.
Consider this career choice: you are a second-tour FS-04 consular officer serving your country in an American embassy abroad. You have tenure, which means that you’re more or less guaranteed employment with the State Department for the next 15 years. You are now being considered for promotion to FS-03. In a bid to speed up the process, the Management Bureau Office of Performance Evaluation approaches you with this offer: you may proceed with the promotion board selection process or, more simply, flip a coin. Heads, you will be promoted. Tails, you will not.
Even odds are not great, but they’re better than the odds the 2018 promotion statistics suggest you have, which are closer to one in three. Still, you’re told promotion is a merit-based process decided after rational, impartial evaluation of your accomplishments. You’re a hard worker and have made no mistakes. Your peers and supervisors like you and your Employee Evaluation Reports (EERs) have all been stellar. So you decide to go with the promotion panel.
Everybody in the department believes that promotion is merit-based or, at the very least, cynically behaves as if it were true. And it may well be true. But here’s the problem: social scientific research argues strongly that the State Department’s promotion process produces decisions little better than random chance. The department has the data available to determine whether panel decisions are indeed correlated with future performance rather than blind luck, but it has not used that information to improve the process.
This should alarm everyone and not just those being considered for promotion. If the department cannot accurately select future high-performers, then we are rewarding people who do not deserve it and punishing those who don’t deserve that, either. And it means, when it comes to the future leadership of the premier executive department, we are doing no better than throwing dice and hoping for the best.
The EER has a long history. The Rogers Act of 1924 set up the modern Foreign Service, replacing a diplomatic corps populated by prodigal heirs with a professional service and a meritocratic promotion system. The 1980 Foreign Service Act instituted new reforms, most significantly eliminating the evaluation of a diplomat’s spouse (spouses were then almost entirely women) as part of the employee’s evaluation. The Foreign Service now uses a narrative system primarily devised in 2002 and partially amended in 2015, resulting in the goals-oriented, short-narrative EER in use today.
Each year the Performance Evaluation office convenes around 20 boards made up of about six panelists, including an outside civilian. For six to eight weeks each panelist reviews 40 candidate folders per day, which include the past five years’ EERs. That accounts for some 30,000 EERs reviewed or more than 200,000 pages in total for each board, a daunting task by any measure.
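As a rough check on that workload, the figures above can be turned into a daily reading load for a single panelist. The sketch below uses only the numbers in this paragraph plus two assumptions of my own that the department does not state: that each folder holds one EER per year across the five-year window, and that a panelist works an eight-hour day. The function name `daily_workload` is hypothetical.

```python
# Figures taken from the paragraph above.
FOLDERS_PER_DAY = 40        # candidate folders each panelist reviews per day
EERS_PER_BOARD = 30_000     # "some 30,000 EERs" reviewed per board
PAGES_PER_BOARD = 200_000   # "more than 200,000 pages" per board (a lower bound)

# Assumptions for illustration only; they are not stated in the article.
EERS_PER_FOLDER = 5         # one EER per year over the five-year window
HOURS_PER_DAY = 8           # a standard working day


def daily_workload(folders_per_day=FOLDERS_PER_DAY,
                   eers_per_folder=EERS_PER_FOLDER,
                   eers_per_board=EERS_PER_BOARD,
                   pages_per_board=PAGES_PER_BOARD,
                   hours_per_day=HOURS_PER_DAY):
    """Estimate one panelist's daily reading load from the board-level totals."""
    # Average EER length implied by the board totals.
    pages_per_eer = pages_per_board / eers_per_board
    eers_per_day = folders_per_day * eers_per_folder
    pages_per_day = eers_per_day * pages_per_eer
    # Time available per folder if the whole day is spent reading.
    minutes_per_folder = hours_per_day * 60 / folders_per_day
    return {
        "pages_per_eer": pages_per_eer,
        "eers_per_day": eers_per_day,
        "pages_per_day": pages_per_day,
        "minutes_per_folder": minutes_per_folder,
    }


workload = daily_workload()
```

Under those assumptions a panelist faces about 200 EERs and more than 1,300 pages a day, with roughly 12 minutes available for each five-year folder.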
The best thing to be said about this promotion process is that it is worse than a similar system the Israeli army abandoned in 1955. Daniel Kahneman, then a 22-year-old psychology graduate, was asked to assess how the nascent Israeli army selected its officers. The army used a process adapted from the British following World War II. It involved evaluating a group of officer candidates working together to physically bridge an obstacle. Here, Kahneman writes in his Nobel Prize biography,
[w]e were looking for manifestations of the candidates’ characters, and we saw plenty: true leaders, loyal followers, empty boasters, wimps – there were all kinds. Under the stress of the event, we felt, the soldiers’ true nature would reveal itself, and we would be able to tell who would be a good leader and who would not.
But the trouble was that, in fact, we could not tell. Every month or so we had a “statistics day,” during which we would get feedback from the officer-training school, indicating the accuracy of our ratings of candidates’ potential. The story was always the same: our ability to predict performance at the school was negligible. … I was so impressed by the complete lack of connection between the statistical information and the compelling experience of insight that I coined a term for it: “the illusion of validity.”
Kahneman, along with his partner Amos Tversky, documented another cognitive fallacy in this exercise: the human tendency to make extreme forecasts based on very little data. Overconfidence in predicting future performance has been documented in fields as diverse as football, economics, the weather, and the stock market. It should come as no surprise, then, that predicting future performance of diplomats would be shot through with overconfidence as well.
Closely related to the illusion of validity was another feature of our discussions about the candidates we observed: our willingness to make extreme predictions about their future performance on the basis of a small sample of behavior. … As I understood clearly only when I taught statistics some years later, the idea that predictions should be less extreme than the information on which they are based is deeply counterintuitive.
Kahneman subsequently drew up a more objective measure of performance using a numerical scale to rank candidates based on certain attributes correlated with future success in officer training. It was not perfect but it has remained, more or less unaltered, with the Israeli army ever since.
The similarities between the two evaluation systems are striking. Like the Israeli army of 1955, the State Department panels rank candidates into three tiers. Like the Israeli army, the Foreign Service Officers serving as judge and jury over lower-ranking strangers are not human resources professionals or executives with hiring and firing experience. They are beneficiaries of the same system that selected them. They have no reason to doubt the process that promoted them and in fact have quite a psychological incentive to perpetuate it. Kahneman and his disciples were skeptical of expert judgment. Kahneman cites as an influence the work of psychologist Paul Meehl, who famously determined the superiority of actuarial prediction to clinical judgment in medical prognosis. Kahneman eventually came to understand the value of training and experience in improving professional judgment. But the one-time panel members receive only minimal training for their task and rarely serve on a panel again.
The only fundamental difference between the two selection processes is that the Israeli army actually observed the individuals being ranked. The department, relying on a paper record, is basing its decision on hearsay. There is no interaction with the candidates themselves. The selection process is done by consensus, which means the entire panel must agree on every individual file. Indeed, officers I have talked to say there is rarely any dissent, that they all see the same things the same way, and that what they are seeing is blindingly obvious on its face. Kahneman saw the same thing, too. As he told the author Michael Lewis, “[t]he impression we had of each candidate’s character was as direct and compelling as the color of the sky.” But he could not correlate these judgments to their outcomes.
The Foreign Service promotion system is explicitly designed to predict future performance. It is projecting: EERs should tell us who will succeed at the next level and who will fail. The Procedural Precepts for the 2018 Foreign Service Selection Boards specifically state, “[p]romotion is recognition that an employee has demonstrated readiness to successfully perform at the next highest level.” It continues:
A recommendation for promotion is not a reward for prior service. Boards should recommend for immediate advancement only those employees whose records indicate superior long-range potential and a present ability to perform at a higher level.
But as Kahneman demonstrated, and as economists like Richard Thaler and statisticians like Scott Armstrong have amply documented, people are very poor predictors of future human behavior.
In addition to overconfidence, Armstrong outlines two ways people often get predictions wrong. First, he notes that agreement or consensus is not the same thing as accuracy. The board process requires each member to rank the top promotable candidates but then, strangely, come to a group agreement about the order of those candidates (as well as those mid-ranked and low-ranked). Second, Armstrong notes that increasing complexity makes prediction correspondingly more difficult. Variables and uncertainty, which perfectly describe the Foreign Service career assignment system, make predictions of future performance much less tenable.
All this perpetuates a system governed by subjectivity and intuition, both of which Kahneman and his partisans have exposed as no better than chance. There is very little objective measurement of performance or potential. Not even the straightforward U.S. military requirement to tick a box recommending promotion or not is available in the EER. State Department panel members, already unable to read the entirety of each individual EER because of sheer volume, must parse a hundred different ways raters recommend promotion, looking for evidence of faint praise or lack of enthusiasm.
Performance Evaluation considered these issues, including objectivity and bias, in an unclassified 2013 cable. The cable argued that the Defense Department’s system of numerical ranking was more appropriate to the task of evaluating hundreds of thousands of members of the armed forces. But if it were simply a matter of scale, shouldn’t there be enough military officers reviewing their peers in similar ratios to Foreign Service boards? More importantly, it is difficult to argue that the Pentagon is concerned more with efficiency than efficacy when lives are literally at risk if the system fails.
In another unclassified cable in 2015 accompanying the modestly reformed EER, the department attempted to address rank-and-file requests for a more objective performance evaluation system. The department responded by saying it had conducted external consultations and concluded metric evaluation fed bias and grade inflation. But these are precisely the problems with the current system as evidenced by the department’s concern with unconscious bias on the one hand and inflation of performance evaluation on the other hand. A 2017 unclassified cable noted specifically how difficult board members found it to rank candidates who were all, to borrow a phrase, above average. There is no question about the problem of bias and grade inflation. The question to ask is which system – narrative/subjective or quantitative/objective – best reduces these biases and has the highest correlation with future performance of employees.
The consequence is a collective, department-wide effort to game the promotion system. Conventional wisdom among the department rank and file accepts that there is a proper way to write or couch the EER, that deviating from this unwritten rule spells career doom, and that there is absolutely no quarter to be allowed to document poor performance, individual struggle, or personal failing. The logical outcome of this conventional wisdom should be no surprise. It produces highly inflated and hyperbolic EERs crammed with superlative performance and exceptional personal achievement, not any rational or reasonable measure of performance or readiness for promotion. It is absurd, especially at the lower ranks, to expect that all officers will be equally excellent at all things, which is the picture the EERs in the aggregate indicate and about which promotion panels perennially complain.
Given this, the lack of quantitative measurement of performance is especially glaring. This may not seem important given the depth and rigor of a professional, narrative evaluation. But quantitative data is the standard measure of performance we get in high school, university, and graduate and professional schools. High school grades are more strongly correlated with university success than Scholastic Aptitude Test scores are. Performance in individual subjects can be aggregated in a grade point average, and that grade point average along with standardized test scores can be correlated with future performance as measured in future grades and earnings.
The 2013 cable notes eight systemic biases without addressing explicitly how the EER and promotion board system are designed to minimize bias. It is worth quoting in full:
Boards can help to cut through overt or subtle biases at the line manager level. In addition to invidious general biases (age, gender, race, ethnic origin, sexual orientation, religion), more nuanced types of bias are also workplace hazards. The [Foreign Service] Board system, by having an impartial objective group, mitigate possible bias. The Foreign Service grievance system adds additional safeguards. Bias is simply a personality-based tendency, either toward or against something.
In other words, the boards are impartial and objective because the bureau says they are impartial and objective.
The department has recently required unconscious bias training for selection panel members and has long screened EER drafts for inadmissible information. This is an important start. But the panels themselves are structured in a way to protect, not remove, bias. Panel members are not required to recuse themselves unless they are reviewing a family member. The panel decision itself cannot be appealed; employees can only grieve language they have already consented to send to the boards.
While the department expresses concern about bias and forbids most explicit (race, sexual orientation, age) and implicit (“staffing” to replace “manpower,” for example) bias, quite a lot of information is still not scrubbed from standard EERs. Most importantly, gender – a legally suspect class specifically mentioned in Equal Employment Opportunity laws and regulations – is permitted. Names are also allowed, and in a diverse country like the United States one can use them to make reasonable guesses about a candidate’s religion, national origin, ethnic group, mother tongue, and race.
The department explicitly dismissed calls for a more quantitative or objective evaluation system. The 2015 cable dismissing a proposal for quantitative performance evaluation specifically endorses a narrative approach and the 2017 cable reinforces it. Unfortunately, social science research has repeatedly identified the “narrative fallacy” as a flawed heuristic people apply to interpret the past and anticipate the future. The panels agree, in other words, to build a third cognitive fallacy right into the structure of the evaluation.
The lack of objective measurement lends itself to all sorts of abuses, including watered-down and “coded” language that experienced raters and reviewers use to communicate sub rosa to the panel, while lying to the candidate, that they are not promotable. I am stunned that the use of coded language in EERs is not only an acknowledged practice but in fact accepted as the price of promotion in the department. We would not tolerate secret communications from our children’s teachers or guidance counselors in letters of recommendation for jobs or university admission. If such a practice were discovered, it would garner the same kind of legal and press attention the university admission-bribing scandal recently did. Raters and reviewers who communicate in coded language are committing fraud. This subterfuge would not survive an Inspector General investigation, Congressional scrutiny, or class-action lawsuit.
The EER, as a narrative instrument, lacks objective utility at a minimum and otherwise eschews any quantitative measurement of performance. But this data is not entirely missing. The process has the statistical tools to correlate prior evaluations with future performance and promotion rates but has not used these tools. Promotion panels routinely rank EERs into three tiers and then rank order every EER in the top tier. This could very easily be correlated with prior and future evaluations and promotion rates, but these ranks are not permanently assigned to individual officers or their EER files. This is a mistake, the equivalent of grading classwork without issuing a transcript at the end of high school.
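To make the idea concrete, here is a minimal sketch of the kind of check the department could run if panel rank orders were retained. It computes a Spearman rank correlation between a panel's ordering of officers and some later outcome measure. Everything here is an assumption for illustration: the function names are hypothetical, the choice of Spearman's coefficient is mine, and the department would have to decide what counts as a later outcome (a subsequent panel's ranking, time to next promotion, or something else).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (spread_x * spread_y)


def spearman(panel_ranks, later_outcomes):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(panel_ranks), average_ranks(later_outcomes))
```

Calling `spearman(panel_ranks, later_outcomes)` on two equal-length lists returns a value between -1 and 1. If rank 1 means the panel's top candidate and the outcome is scored so that higher is better, a strongly negative value would suggest the panel is predicting something real, while a value near zero would be consistent with the coin flip.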
The panel’s purview is guided by two massive documents: the “Procedural Precepts” for Foreign Service Selection Boards (31 pages) and the “Decision Criteria for Tenure and Promotion in the Foreign Service,” confusingly also known as the Core Precepts (15 pages). The Procedural Precepts outline the duties and procedures the panel must follow while the Core Precepts outline the skills and objectives an officer must demonstrate. The Core Precepts refer to these as “guidelines by which Tenure and Selection Boards determine the tenure and promotability” of officers. But nowhere in either document are instructions that explain how the board should apply the Core Precepts. That’s the equivalent of applying the A-F grading system without explaining what distinguishes an A from an F.
There are six Core Precepts containing 31 subsections. This is simply too much for any one narrative evaluation – the line count is just above 100 – to consider in any depth. Even if several narratives over the last five years are evaluated, that leaves little more than 500 lines total. So employees, raters and reviewers must select their most important accomplishments and skimp on covering all facets of the Core Precepts.
Panelists are left to parse nuances of language, which is simply more noise above the randomness of assignment, location, experience, work product, and outcome, not to mention unconscious bias and writing ability. How can a panel rationally evaluate a 25-year-old consular officer with no prior job experience working in a nonimmigrant visa mill in Mexico against a 50-year-old second career officer handling American Citizen Services cases following an earthquake in Japan? How can panels distinguish between competent performance and sheer circumstance?
Fortunately, the Core Precepts actually provide the solution to this otherwise subjective mind game. Instead of the laborious negotiation over the narrative evaluation, the employee, rater, reviewer, and a peer or local staff member would grade the employee against their peers in a given post on each of the Core Precepts subsections. The employee would have no control over the grading but would be allowed to see it. The individual grades and aggregate would be assigned to the employee’s file for future panels to view.
A numerical evaluation has more utility than its narrative counterpart. It is more objective because in this case the reviewer is rating the employee directly against their immediate peers, whereas under the current system the panel measures performance against an unseen global population. The global view introduces much more noise, subjectivity, and randomness to the process. A graded system can be averaged over a year, a grade, or a precept; it can be correlated to future performance; it can track improvement; and it eliminates sub rosa sabotage (using regression analysis, we could determine whether a particular low rank was within the standard deviation of the other reviewers or prior evaluations).
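The sabotage check in particular is easy to sketch. The example below is a simpler stand-in for the regression analysis mentioned above: a leave-one-out comparison that flags any grader whose score falls more than a chosen number of standard deviations from the mean of the other graders. The function name, the grader labels, and the default threshold of one standard deviation are all assumptions for illustration, as is the idea that each grader submits a single numeric score per precept.

```python
from statistics import mean, stdev


def flag_outlier_grades(grades, threshold=1.0):
    """Return graders whose score sits far from the other graders' scores.

    grades:    mapping of grader label -> numeric grade for one precept
    threshold: how many standard deviations (of the other graders' scores)
               a grade may differ from their mean before it is flagged
    """
    flagged = []
    for grader, grade in grades.items():
        # Compare each grade only against the remaining graders.
        others = [g for name, g in grades.items() if name != grader]
        if len(others) < 2:
            continue  # a standard deviation needs at least two other scores
        gap = abs(grade - mean(others))
        if gap > 0 and gap > threshold * stdev(others):
            flagged.append(grader)
    return flagged
```

A flag would not prove bad faith; it would only mark a grade that deserves a second look, for instance when three graders cluster near the top of the scale and a fourth sits far below them. With only four graders per employee the spread estimate is crude, so comparing against the employee's prior evaluations as well would be a natural extension.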
Any kind of human evaluation system is prone to bias, error, and luck. The best we can do is find the tools that limit our bias, reduce error, and account for luck. The current State Department promotion system makes no systematic attempt to do that. The result is the opposite of meritocratic: it is pure fortune.
The opinions and characterizations in this piece are those of the author and do not necessarily represent those of the U.S. government.