Going to War with WAR
James decides, after 20 years, to speak up about it


James has an article up today - in front of the paywall, I'm pretty sure - in which he uses Judge vs Altuve to "ask the questions of a child" about sabermetrics.  You know what the truth is?  That I understand micro sabermetrics a lot better than I understand the macro questions.  I might be able to tell you whether a -1 inch horizontal movement is better than a -3 inch movement, but I probably can't tell you how many more or fewer games you'd win because Ryon Healy played 3B instead of 1B.  In other words, I don't consider any sabe observation too basic to discuss.

In this morning's piece, James reduces all of sabermetrics to 2 'moral' issues:

(1) No stat means anything outside of its connection to W's and L's, and

(2) You have to "normalize" everything in its context, like HR's in Fenway vs Safeco.

Do those two things and you've forwarded your understanding of baseball.


For example, on Twitter today he asks, addressing WPA and WAR, 


You would not use situational stats to measure FUTURE value. But suppose as hypothetical extreme a 21-year-old hit .700 with a 4.000 OPS in 100 Games at AAA. Assessing his FUTURE, that would make him the most prized property in baseball. But would you vote for him for MVP?


In his more developed BJOL piece he says, regarding Altuve and Judge,


Baseball-Reference WAR shows the little guy at 8.3, and the big guy at 8.1.   But in reality, they are nowhere near that close.   I am not saying that WAR is a bad statistic or a useless statistic, but it is not a perfect statistic, and in this particular case it is just dead wrong.   It is dead wrong because the creators of that statistic have severed the connection between performance statistics and wins, thus undermining their analysis.


James moves on to an interesting concept, the idea of a "general" (normal) relationship between runs and wins versus a "specific" relationship for this particular player and this particular team:


Look, there is a general relationship between runs and wins, a normal relationship, and there is a specific relationship, based on this specific player and this specific team.   If you evaluate Altuve and Judge by the general and normal relationship of runs to wins, then it appears that Judge is almost even with Altuve.  But if you evaluate them by the specific relationship of Altuve’s runs to the Astros wins and Judge’s runs to the Yankees wins, then Altuve moves up and Judge moves down, and a significant gap opens up between—large enough, in fact, that Judge drops out of the #2 spot, dropping behind Eric Hosmer of Kansas City.


He goes on to point out that the Yankee$ "pythag'ed" 102 wins but only won 91.  He gets passionate here:  "IT IS NOT RIGHT TO GIVE THE YANKEES PLAYERS CREDIT FOR WINNING 102 GAMES WHEN IN FACT THEY ONLY WON 91.  This is not a choice.  It is not an option.  It is an error."
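For anyone who wants the arithmetic behind "pythag'ed", here is a minimal sketch of the classic Pythagorean expectation, assuming the simple exponent-2 form. The function name `pythag_wins` and the run totals fed into it are hypothetical placeholders, not the 2017 Yankees' actual figures, and this is not necessarily the exact variant James used.

```python
def pythag_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from run totals (classic Pythagorean form)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


# Hypothetical run totals and win total, for illustration only.
expected = pythag_wins(800, 650)
actual = 90
print(f"expected {expected:.1f} wins, actual {actual}, gap {actual - expected:+.1f}")
```

The gap between the two numbers is exactly what James is arguing about: whether players get credit for the expected total or the actual one.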

This has always been the difference between James and, say, Dave Cameron.  "Pure" algebra guys think in terms of crediting players for theoretical skills they showed.  James (and I) think ALSO in terms of crediting players for what occurred on the field.  Part of the reason for this is --- > many "invisible" things happen between the theoretical and the real.


And so he continues,


When you express Judge’s RUNS. . .his run contributions. . . when you express his runs as a number of wins, you have to adjust for the fact that there are only 91 wins there, when there should be 102.  (The Astros should have won 101 games and did win 101 games, so that’s not an issue with Altuve.)  But back to the Yankees, one way to do that is to say that the Yankee win contributions, rather than being allowed to add up to 102, must add up to 91.  That’s a good way to do it, and, of course, if you do that, it reduces Judge’s win contribution by 11%    Using WAR, it reduces his win contribution by MORE THAN 11%, because the replacement level remains the same while his win contribution diminishes, so the wins ABOVE THE REPLACEMENT LEVEL are decreased by more like 16%.   Judge drops from 8.1 WAR to 6.8. 
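James's arithmetic here can be sketched in a few lines. The 8.1 WAR, the 91 actual wins and the 102 expected wins come from the quote; the `replacement_wins` figure of 4.0 is an assumption, back-solved so the result lands near his 6.8, and `scale_war` is a hypothetical helper rather than anything from Win Shares or Baseball-Reference.

```python
def scale_war(war, replacement_wins, actual_wins, expected_wins):
    """Shrink a player's total win credit, then re-subtract replacement level."""
    total = war + replacement_wins              # win credit before the adjustment
    scaled_total = total * actual_wins / expected_wins
    return scaled_total - replacement_wins      # replacement level stays fixed


# replacement_wins=4.0 is an assumed value, not a figure James gives.
judge = scale_war(8.1, 4.0, 91, 102)
drop_pct = 100 * (8.1 - judge) / 8.1
print(f"adjusted WAR {judge:.1f}, a drop of {drop_pct:.0f}%")
```

Because the replacement level does not shrink along with the total, the percentage drop in WAR is bigger than the 11% drop in total win credit, which is the point James is making.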


James also considers it important that Judge hit poorly in high-leverage situations, while Altuve hit well.  Look, if you want to ignore that because you think it puts you in a better PREDICTIVE position, great.  Often - not always - it WILL make your predictions more accurate.  But let's not pretend the Astros didn't win the World Series this year, what?

But the fun starts here:


I have been silent on this issue for more than 20 years, and let me explain why.  In the 1990s I developed Win Shares, while younger analysts developed WAR.   At that time it was my policy not to argue with younger analysts.  I was much more well-known, at that time, than they were, and it’s a one-way street.   When you are at the top of a profession, you don’t speak ill of those who are coming along behind you.   It’s petty, and it’s just not done.   Some of those people did take pot shots at me and some didn’t, but. . .well, it’s a one-way street.  I’ve got mine; I’m not pulling up the ladder behind me.  

But that was a long time ago.  We’re not there anymore. WAR is not an upstart statistic; it is the dominant statistic.   We can debate its merits on an equal footing.  

The logic for applying the normal and usual relationship is that deviations from the normal and usual relationship should be attributed to luck.  There is no such thing as an "ability" to hit better when the game is on the line, goes the argument; it is just luck.   It’s not a real ability.   

But. . . I have held my peace on this for 20-some years. . .that argument is just dead wrong.   There are five reasons why it is wrong.


Perhaps in Part II we'll look at those 5 things.  But for the moment:  let Dr. Detecto cast his vote, this day, with James:  that theoretical skills are not more important than outcomes on the field.



1.  We do not "know" that there is no such thing as clutch hitting.  We haven't yet measured it, true.  (False:  provide such a player, and they'll tell you it's a fluke - Dr D)  But I (James) was wrong to join this "consensus."  We can't prove that deviations in clutch hitting are 100% due to chance.  We simply assume they are, because we can't prove that they're not.  We should be agnostic about clutch skill.

2.  It doesn't matter whether it's luck or skill.

3.  There is "luck" everywhere, and sabes don't account for most of it.  A guy draws 90 walks instead of 60, because of the different umpires behind the plate.  Are you going to "normalize" out that luck? But what about BIG luck?  A player is in a car accident and it wrecks his season.  Are you going to normalize that luck?  How do you normalize Mitch Haniger's shoulder issues?

4.  The connection between W's and stats is the only reason you do stats analysis.  When you amputate that connection, you're doing 1962 beat writer analysis - saying that Johnny Bench isn't great because he doesn't hit .300.  Sabes will say a 700-run team in 1965 is a better offense than one that scores 700 runs in 1975.  Why?  Win impact and NO other reason.  But!  If a team wins 80 when they pythag'ed 90 ... we'll pretend they won 90.  It makes no sense.

5.  James uses the example of a 1.400 OPS kid in AAA who played only 80, 100 games.  He's the most valuable property in baseball - but where is he on a WAR chart?


At which point Dr. D might innocently bring up James Paxton's 4.6 WAR in 136 innings.



In this video, Jim Callis (BA's #1 guy) opines that the Vieira money doesn't matter much -- because Ohtani won't care about the money much.  Worth a listen.

Congrats to the Champs,

Dr D



If in awarding an MVP we're judging past individual performance in the context of team success, I actually agree with James here.


...to stop assuming that there are invisible things. :)

What I mean is that, with all of the information we have at our disposal, there's no reason we can't get a lot closer to explaining what happened on the field that caused the Yankees to lose 11 more games than the numbers say they should have. We'll probably never find all of it...but we can micro-analyze each game of the 2017 season and figure out what determined the difference between winning and losing and also figure out which team should have won or lost based on bases gained/lost and look for patterns for each team and for all teams together.

I refuse to believe there aren't any.

Of course, until we do that, my stopgap answer is to rate MVP by what did happen (and rate historical player greatness similarly by what did happen) and to rate player future projections by what should have happened.

It seems like a reasonable compromise to me.


That we should give 2017 awards MOSTLY based on what they did, and make predictions MOSTLY on what they might have theoretically done with "average luck and circumstances."



1.  I agree that WAR is and always will be a soft science, as much witch-doctory as it is mathematical reality.  dWAR is especially iffy, but even oWAR assumes that every team replaces a player with the exact same player (or value) and that just isn't the case.  To use a football analogy, when Drew Bledsoe went down in 2001, the New England Patriots didn't replace him with a replacement QB, they rolled out Tom Brady.  Bledsoe actually would have rated a negative WAR, despite being a pretty good QB.  If I think for a bit, I'm pretty sure that I can come up with such an MLB account, as well. 

2.  All the way back in 1979, there was the discussion about whether the MVP Award winner should really be the most VALUABLE player or the BEST player that year.  Wilver Dornell Stargell (.281-.352-.552/2.5 WAR) tied Keith Hernandez (.344-.417-.513/7.6 WAR) for the MVP that year.  Of course, the We are Family Pirates won the World Series and Pops was the spiritual soul and guru of that team.  But he wasn't the player that Hernandez was, or that Dave Winfield was (8.3 WAR) or even that teammate Dave Parker (6.7 WAR) was.  But undoubtedly, he was the ringmaster of the WS champs.

BTW, give Parker the MVP that season, he was .310-.380-.526 with a RF GG for the WS winners, and he's likely in the Hall.  He won the MVP in '78, has 2 WS rings, 3 GG's, 2 batting titles, etc.  I'm sure he would have told you in October of '79 that Stargell was the MVP on the team, being gracious, but you wonder where he would be if the voters went with the best player on the WS champions.  BTW, he's on the ballot for the Dec. vote.

Anyway, I try not to get too worried about WAR, for whatever that is worth.


If James is talking about player value from a GM's perspective (as he seems to be), then that value needs to be completely divorced from how the team as a whole performs. What a GM is concerned about is answering the question "How many dollars is he worth?", and the answer to that has nothing to do with how many wins the team has, or whether it has more or less than the Pythag estimate. If the Yankees' bullpen cost the team a bunch of games, does that mean they should pay Judge less? Of course not! Judge should and will be paid based on his own performance. It's not as if other teams have suddenly devalued Yankee players by 10%.

As for his five points, they also have problems:

1. "We don't know if clutch hitting exists"- As Tangotiger has pointed out numerous times, we need to stop with the binary thinking whereby something either exists or doesn't exist. Everything affects how a player performs, from their diet to how long they grow their hair. Therefore, the question is how much it matters. In the case of "clutchness", it is part of every player's skillset, but the evidence strongly suggests that it isn't nearly as big a factor as the people who talk about it suggest. So while it should be included in player evaluations for completeness' sake, it isn't going to make a huge difference when talking about a single season (over a long career it will add up).

2. "It doesn't matter whether it's luck or skill"- Of course it does. In order to properly assess a player's value, you have to figure out how much a player's stats are reflective of his skill.

3. "There is 'luck' everywhere, and sabes don't account for most of it."- WAR is solely concerned with how well a player performed, it is not concerned with why a player performed as he did. So ideally in-game factors such as umpires, opposing players, weather, etc. should be included in WAR, but health shouldn't because that's simply not what WAR is trying to measure.

4. "The connection between W's and stats is the only reason you do stats analysis"- True, but James' method of adjusting a player's value based on a team's win total is extremely ham-fisted. We now have detailed play-by-play data which allows us to measure how much a player helped his team win each game. There is no reason to be looking at a team's overall record when we have the info to grade every single pitch.

5. "What about a player with monster numbers in AAA?"- If James is concerned only with how many wins a player contributes to a major league team, then he must think a minor leaguer is worthless no matter what his stats are. His own Win Shares system would assign such a player zero value.


To a certain extent we have a philosophical topic here.  It is one thing to ask, "What did a player accomplish in 2017" and a different thing to ask "what do we think a player's 2017 component-skills record predicts he will do in 2018."  James is emphasizing the former; Fangraphs is emphasizing the latter -- to the point where 2018's projection becomes, de facto, their historical record of what he did in 2017.

Take xFIP, for example.  Fangraphs can and will argue Cy Young based on xFIP, whether or not a pitcher actually threw a bunch of poorly-located gopher balls that created a large gap between his ERA and his xFIP.  They can, and will, assume that any variation between 11% and a pitcher's actual HR/F was nonsense and should be discarded.

James is merely arguing that if a pitcher gave up 7%, or 14%, homers per fly then that is the history of the matter.  I would add that there are times, a minority of times, in which that HR/F rate was skill-based (over the course of a season or two) and that GM's certainly will consider that.
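To make the HR/F point concrete, here is a rough sketch of how FIP and xFIP differ, assuming the commonly published form of the FIP formula. The pitcher's line, the additive `constant` (which varies by season) and the function names are all hypothetical; the 11% league HR/F figure is the one quoted above.

```python
def fip(hr, bb, hbp, k, ip, constant):
    """Fielding Independent Pitching from the pitcher's actual home runs."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fly_balls, bb, hbp, k, ip, constant, league_hr_per_fb=0.11):
    """Same formula, but home runs are replaced by fly balls times the league rate."""
    expected_hr = fly_balls * league_hr_per_fb
    return fip(expected_hr, bb, hbp, k, ip, constant)


# Hypothetical pitcher: 200 IP, 200 fly balls, 28 HR (a 14% HR/F), 50 BB, 5 HBP, 180 K.
# constant=3.10 is an assumed league constant, for illustration only.
actual = fip(hr=28, bb=50, hbp=5, k=180, ip=200, constant=3.10)
normalized = xfip(fly_balls=200, bb=50, hbp=5, k=180, ip=200, constant=3.10)
print(f"FIP {actual:.2f} vs xFIP {normalized:.2f}")
```

The six extra homers this made-up pitcher allowed simply vanish from the xFIP line. That is fine for projecting next year, and it is the very thing James objects to when the question is what happened this year.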


Could go line-by-line on the rest of it but of course you are right, in terms of the point you are making CPB.  Thanks.


In 1972, Steve Carlton had an amazingly great season for an amazingly bad team.  His Phillies went 59-97, but Lefty was 27-10!  In 41 starts he threw 30 complete games, 8 shutouts, 346 innings and had an ERA of 1.97! He won 27 games despite getting more than 4 runs of support in only 9 starts.  In 4 of those games he got exactly 5 runs.  It was an amazing season.  Philadelphia went 22-4 over his last 26 starts, this from a team that won 38% of all its games and just 26% of the games that Carlton did not start.

Philadelphia was 29-12 in games he did start and 30-85 in the games he did not.  That’s 71% vs 26%.  He is credited with a 12 WAR season, but something is amiss.

If Lefty was worth just 12 wins to Philly, then they should have been a WAR-hypothetical 17-24 team in his 41 starts, were those starts covered by a replacement level MLB pitcher.  17-24 is a 41% rate, but Philadelphia won just 26% of his non-starts that season.

Something doesn’t add up:  Reynolds, Chapman, Fryman and Twitchell were the 4 other most used starters that season.  They ran FIPs of 3.89, 3.91, 3.90 and 2.60, combining for 77 starts. They were somewhat worse than the league average that year, but they weren’t replacement level throwers.   Philly was 23-54 in their 77 starts, a 30% win rate.  Yet Carlton’s WAR assumes that Philadelphia would win about 41% of Lefty’s 41 starts were he replaced by a AAAA arm.

Assuming that the Phillies win just 26% of those games (the actual rate) then they would have gone 11-30.  That is an 18 game difference from their actual record, yet Carlton gets credit for only 12 WAR.  He’s been ripped off.  
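The back-of-envelope numbers in this comment can be checked with a short sketch. The records are the ones quoted above; the record in games Carlton did not start is derived by subtracting his starts from the season total, and the helper name `pct` is just for illustration.

```python
def pct(wins, losses):
    """Winning percentage for a W-L record."""
    return wins / (wins + losses)


team = (59, 97)      # 1972 Phillies season record
carlton = (29, 12)   # team record in Carlton's 41 starts
others = (team[0] - carlton[0], team[1] - carlton[1])  # every other game

starts = sum(carlton)
war_credit = 12

# Wins in those 41 starts that a 12-WAR credit implies for a replacement pitcher.
war_baseline_wins = carlton[0] - war_credit
# Wins in those 41 starts at the team's actual rate without Carlton.
observed_baseline_wins = round(pct(*others) * starts)

print(f"without Carlton: {others[0]}-{others[1]} ({pct(*others):.0%})")
print(f"WAR baseline: {war_baseline_wins}-{starts - war_baseline_wins}")
print(f"observed baseline: {observed_baseline_wins}-{starts - observed_baseline_wins}")
print(f"wins above observed baseline: {carlton[0] - observed_baseline_wins}")
```

This only restates the comment's own reasoning. It treats the team's record without Carlton as the baseline, which is a different assumption from the league-wide replacement level that WAR uses.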

WAR simply demands a bit of eye of newt and toe of frog.  


Carlton winning almost half the games for a 100-loss team ... have always boggled at that.  Glad you brought that into it.  :- )

One of the things that 'doesn't add up' is that sabermetricians have decided to set "replacement level" at much more than 26%.  They proceed from the idea that the 1972 Phillies could have pulled a AAA starter up to win a lot more than 26% of his starts.  Here's an obvious case in which that decision doesn't pass the chuckle test.

I don't mind that they pull a "replacement level" winning percentage out of their ears -- you do need to work with some constant or other -- but I do mind that they proceed dogmatically from that point on.


Like James said, WAR is a great stat.  He feels Win Shares are better; I could see either case.  Where you and I part ways with Fangraphs, is when WAR becomes the end of the discussion, rather than the start of it.

I DO think WAR (or Win Shares) should indeed be the start of the discussion.  Baseball performance analysis is in great shape.  We can come up with a great number, that being WAR, to "put a player on a hand" and then go from there.  Problem is, a lot of people become hostile at the first suggestion of deviation from that starting number.


Leaving you and I, Keith, to ask the question:  is Carlton 1972 a 12-win starter?  If you add that guy, 346 innings and 1.97, to an 81-win team do they then win 93?


In '72, the closest NL team to .500 was the St. Louis Cardinals. They went 75-81 (everybody played only 155 or 156 games that season, but I can't remember why). At 6 games under, they were the closest in the league. 6 teams that season were well over .500 and 6 were well under.  

If you substitute Lefty's 41 starts for the 33 that Reggie Cleveland threw, plus the 8 from Don Durham, maybe you get close to your question.  St. Louis was 2-6 in Durham's 8 starts, 45 innings worth. They were 16-17 in Cleveland's 33 outings, 231 innings.  They were clearly better than replacement throwers as their FIPs were better than the league average, although B-R gives them a combined WAR of fractionally below 0.

So St. Louis was 18-23 in their 41 starts; they combined for 276 innings.  

For Lefty to have been worth just 12 games to St. Louis that year by replacing the 41 Cleveland/Durham starts, the Cards would have had to go 30-11 in his hypothetical replacement outings.  Remember that Philly was 29-12 in his 41 starts there.  Philly was 11th in the league in runs scored in '72, St. Louis 8th.  St. Louis scored 67 more runs that year than Philadelphia.  Philadelphia scored 0 or 1 runs in 8 of Carlton's 41 starts and just 2 runs in 9 more.  In 40% of his starts he had support of 2 runs or less. His average support was 3.83.  Throwing for the Cardinals, Bob Gibson got 3.93 runs of support per game, with 2 runs or less in 12 of his 34 starts, about 36% of the time. 

Looking at all that, and considering the extra quality innings that Carlton threw, I would say it is a safe bet that he would lift St. Louis to 87 wins, a 12 game improvement. It is likely a bit more.  If you figure the extra 70 innings, then you probably get even more than that.

Interestingly, Carlton had, of course, played for the Cards prior to '72.  He was shipped to Philadelphia because of a contract dispute, the Cards getting Rick Wise in return.  Wise was a 110 OPS+, 2.92 FIP, 16-16 pitcher for St. Louis in '72, throwing 269 innings in 35 starts.  The Cards were 18-17 in his outings, he averaged 3.71 runs of support.  B-R says he was a 5 WAR player.  This suggests that the Cards would have won 7 more games with Carlton than they did with Wise.  Considering the extra 80 innings and the 2.01 FIP that Carlton had, I suggest it would have been one or two more than that. The Cards would have had to go 25-10 in those 35 starts to pick up 7 games. That's a 71% rate, the Phillies were almost exactly that in Carlton's 41 starts.  Considering the Cardinals were less than a .500 team, it is "unlikely" that they would have won 4 of the 6 starts that Carlton would have got, above and beyond those of Wise.  So I think it is fair to say that Carlton was likely 9 games better than Wise that year.  Carlton doesn't get St. Louis to the playoffs that season (Pittsburgh's 96 wins takes the division) but it does get them to the 3rd best record in the league, up from 7th.
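The "would have had to go" records in the St. Louis comparison follow from one small calculation, sketched here with the W-L records quoted above. `required_record` is a hypothetical helper name, and the sketch assumes every extra win is a loss flipped within the same set of starts.

```python
def required_record(wins, losses, extra_wins):
    """Record needed in the same starts to gain `extra_wins` over what happened."""
    new_wins = wins + extra_wins
    new_losses = losses - extra_wins
    return new_wins, new_losses, new_wins / (new_wins + new_losses)


# Cleveland + Durham: 18-23 in 41 starts, with Carlton worth 12 extra wins.
print(required_record(18, 23, 12))
# Wise: 18-17 in 35 starts, with Carlton worth 7 extra wins.
print(required_record(18, 17, 7))
```

Both cases need a winning percentage above .700 in those starts, roughly what Philadelphia actually managed behind Carlton.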

Weirdly, all that St. Louis/Philly comparison works pretty well.

And I will admit, there is some conjecture, wool of bat and tongue of dog in my analysis, too.  
