The World's #1 Starting Pitcher (and Football Team)
All we want is a Star for Xmas


Since the 1960's, the idea of the "Elo Rating System" has dominated tournament chess.  It has percolated from there into soccer, into college football, into video games, and so on.

It's a rating system that is both (1) mathematically as strong as a python, and (2) intuitively simple.  The average player has a rating around 1400 ("class C"), within a range that is about 800-2400.  If he sits down on Friday night across from a player with a rating of 1600 ("class B"), he's got about a 25% chance of winning: 3 to 1 against.

That 3-to-1 ratio is also how the "rating points" at stake are split.  If the class C player wins, his rating goes up 12, to 1412; if the class B player wins, his rating goes up 4, to 1604.  Thusly:

Winner     Class C new rating    Class B new rating
Class B    1396 (class "D")      1604
Class C    1412                  1588 (class "C")
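
For anybody who wants to check the arithmetic, here is a minimal sketch in Python.  The 400-point logistic curve is the standard Elo convention; the K-factor of 16 is an assumption, picked only because it reproduces the +12 / +4 swing in the table.

```python
def expected_score(rating, opponent_rating):
    """Win expectancy on the standard 400-point logistic Elo scale."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update(rating, opponent_rating, score, k=16):
    """New rating after one game; score is 1 for a win, 0 for a loss.

    k=16 is an assumption chosen to match the +12 / +4 example.
    """
    return rating + k * (score - expected_score(rating, opponent_rating))


class_c, class_b = 1400, 1600
print(round(expected_score(class_c, class_b), 2))  # 0.24, i.e. roughly 3 to 1 against

# Class C pulls the upset: prints 1412 1588
print(round(update(class_c, class_b, 1)), round(update(class_b, class_c, 0)))
# Class B wins as expected: prints 1396 1604
print(round(update(class_c, class_b, 0)), round(update(class_b, class_c, 1)))
```

The exchange is zero-sum: whatever the winner gains, the loser gives up.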


Each "class" interval is about 200 points and, as it turns out, corresponds neatly with the idea of a "standard deviation."  And, interestingly, by the time you get to 2 SD's -- 400 rating points' difference -- victory for the lower player is a real long shot, roughly 1 chance in 11 on the standard Elo curve.
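
To put numbers on how fast the underdog's chances fall off, here is a small sketch that tabulates win expectancy by rating gap.  It assumes the standard 400-point logistic Elo curve; the function name is just for illustration.

```python
def underdog_expectancy(gap):
    """Underdog's win expectancy when trailing by `gap` rating points."""
    return 1.0 / (1.0 + 10 ** (gap / 400.0))


# One "class" is about 200 points, so these are gaps of 0 to 4 classes.
for gap in (0, 200, 400, 600, 800):
    print(f"{gap:>4} points behind: {underdog_expectancy(gap):.3f}")
```

On this curve a 400-point gap works out to exactly 1 chance in 11, and each further class shrinks it by roughly another factor of three.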

(Christopher Langan, with an IQ between 195 and 210, once said that if you get to 2 SD's difference in IQ, the people begin to have trouble exchanging information.  Although Langan himself enjoys being hard to understand...

Now imagine the communication between two beings with (say) 10 SDs' difference.  Or imagine chessgames between computers with 3300 ratings, vs. human chessmasters with ratings of 2300. We literally do not understand what computers are doing when winning certain endgames, like Rook and Bishop against Rook.  Nobody understands the computers' play, not with a year's study.  Which is a scary thought if you've ever seen the Matrix.)

This system was so simple and powerful that it exceeded Arpad Elo's wildest dreams.  Elo had said, "rating systems will be accepted according to the players' confidence in their intuitive accuracy."  How right he was.  When a 1500 player now sits down across from a 1700, he has a near-perfect idea of where that 1700 player is ... in tactical combinations, in positional play, in Rook endings, etc.  

"1700" describes a chessplayer better than his name does.


Seattle Seahawks

Nate Silver has an article out, musing about how sky-high the Seahawks' Elo rating is.  Exec Sum:

  • The Seahawks' 35-6 crushing of Arizona put them #1 again ...
  • (... similar to how the Spurs' crushing of the Heat would make them "scariest" team)
  • The Seahawks' current 1755 rating -- different scale than chess -- has them #21 all time ...
  • Adjacent to the Montana 49ers, Aikman Cowboys, and Brady Patriots
  • If the Seahawks win the Super Bowl, they'll be #3 in history, behind two Brady teams
  • The Seahawks' record against high-rated teams is unbelievable
  • Russell Wilson has never, in 52 games, been "out of a game" late in the 4th quarter

There y'go Rick.  That's the HBDI way of capturing the Seahawks' roll.  ;- )


The World's #1 Starting Pitcher

Bill James has his own "Elo system" for rating Opening Day aces.  It's actually an improvement in one way:  it subtly penalizes inactivity.  (In 1977, Fischer hadn't played in five years, but his rating was mathematically the same as in 1972.)

At the end of the 2014 season, here was the list of top Aces:

World's #1 Starting Pitcher
September 28, 2014

  • #1 Clayton Kershaw - 626.0
  • #2 Felix Hernandez - 578.4
  • #3 Max Scherzer - 564.4
  • #4 Chris Sale - 562.5
  • #5 David Price - 560.8
  • #6 Johnny Cueto - 550.8
  • #7 Cole Hamels - 550.7
  • #8 Jon Lester - 550.5
  • #9 Adam Wainwright - 546.9
  • #10 Madison Bumgarner - 545.3
These kinds of rating systems are awesome when you use them to put into perspective the career arcs of a Hisashi Iwakuma or a Corey Kluber.  It doesn't matter that Kluber had 7.2 WAR to Felix's 6.1 last year.  Kluber does not go into 2015 with the same standing or "fear factor."  Hisashi Iwakuma is beautiful to watch, a sight to behold, when he's healthy.  It doesn't exactly make him Cole Hamels.
HBDI, baby.
After his amazing run through the playoffs, Madison Bumgarner skyrocketed to #2.  This sparked a series of howls that other pitchers -- Sale, for example -- hadn't had the chance to improve their ratings, so it wasn't "fair" to include the postseason.
Ratings are not reward and punishment.  They are a description of reality.  Bumgarner didn't get his new 572 rating as an incentive to pitch well going forward; he got it as a natural consequence of what he did.  
Supposing Roger Federer's rating is 600 and Rafael Nadal's is 550 -- but Nadal goes out and wins a Grand Slam by 6-0, 6-0, 6-0 scores all the way through while Federer swims in the Bahamas.  Nadal is "scarier" after his rampage than he was before it, correct?  Ratings reflect that.
Madison Bumgarner is now "scarier" than he was before the playoffs.  His rating is, necessarily, higher after his exploits than it was before them.  That's the truth.  Get over it.
There is a life lesson here.  No Central Agency of Fairness exists in the real world; somebody notify the NEA of this, please.  Mariners blogs aren't bigger or smaller based on what the NEA thinks will build our self-esteem.
HBDI practical application:  if the Mariners had some way to shed Taijuan Walker and nab Cole Hamels or David Price, I'm in.  I do not know if the radar gun agrees, and don't much care.
Felix Hernandez is a "supergrandmaster," a guy who (in these terms) is a full standard deviation, maybe two SD's, ahead of some other Opening Day starters.  Here's SABRMatt's idea of "supermarginal" players.  Chessplayers understand the concept of a "supergrandmaster."  One of them is worth three lesser stars.  One Lionel Messi is worth any number of Danny Welbecks.
Kyle Seager's rating is Master Class and, though most people don't realize it, his rating is still rising nicely.
Justin Ruggiano's "rating" would be bottom of the barrel, whatever his platoon splits.  
Logan Morrison's would be rather low, but climbing in an eyebrow-raising way, like that of a talented 13-year-old chessplayer.  Brad Miller's would be similar.  
Youse guys demand a right fielder with a respectable "rating."  Y'understand that Dr. D is not philosophically opposed to this.  :- )
Stars & Scrubs fo'eva,





His thirst for battle, and desire to pulverize his opponents, fuels so much of what this team does. There is this "How dare you enter the field of battle with us. Who do you think you are?" attitude. Remember the fit the media had when those Marines whizzed on dead Taliban? That is what Lynch's crotch grab was in football terms. The NFL fines him, and the football paparazzi allow them to collect by asking him questions he already made clear he won't answer. They aren't worthy of it, and should respectfully step aside and go talk to his teammates. But they don't, and so, when he does speak, it's to fellow warriors turned journalists who he respects and knows understand. When I watch this guy NOT take the easy step out of bounds in favor of staying in and hitting a few more football Orcs...well, let's just say I watched the Hobbit this morning and thought often of Marshawn and the men he fights with and for. May he always remain a Seahawk. We cannot spare this man. He fights.
THAT is what ELO is measuring. And that is what takes it off the charts. Yeah, ESPN and Virginia, there is a hangover effect.


...I have long considered developing an elo rating system that measures player performance in baseball by matchup to matchup results. Would there be interest here in seeing that? And following it?

misterjonez's picture

but I suspect it would be quite the undertaking. I remember seeing some strength-of-schedule stuff for hitters once, a few years ago, but it was pretty simplistic.


Basically, I would compute a seasonal elo and a career elo...the seasonal, everyone starts at 1000 and goes up or down as the plate appearances take place...the career elo is calibrated with very old play by play data from say...1946...and then calculated going forward for the rest of the play by play era.
But the math wouldn't be so hard.
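
A rough sketch of what that seasonal loop might look like, in Python.  Everything here beyond "everyone starts at 1000" is an assumption: the K-factor of 16, the 400-point Elo curve, the idea that each plate appearance arrives as a batter score between 0 and 1, and the made-up player names.

```python
from collections import defaultdict

K = 16          # assumed K-factor; the thread does not specify one
START = 1000.0  # everyone opens the season here


def expected_score(rating, opponent_rating):
    """Win expectancy on the standard 400-point logistic Elo scale."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def run_season(plate_appearances):
    """plate_appearances: iterable of (batter, pitcher, batter_score),
    with batter_score in [0, 1]. Returns a dict of end-of-season ratings."""
    ratings = defaultdict(lambda: START)
    for batter, pitcher, batter_score in plate_appearances:
        delta = K * (batter_score - expected_score(ratings[batter], ratings[pitcher]))
        ratings[batter] += delta
        ratings[pitcher] -= delta  # zero-sum: the pitcher loses what the batter gains
    return dict(ratings)


# Made-up plate appearances, for illustration only
sample = [
    ("Batter A", "Pitcher X", 0.9),
    ("Batter B", "Pitcher X", 0.0),
    ("Batter A", "Pitcher Y", 0.0),
]
print(run_season(sample))
```

A career version would presumably be the same loop run over every season without resetting the ratings.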

bsr's picture

If I read you right, you are talking about a "matchup adjusted" rating that, e.g. for hitters factors in the quality of pitchers faced (and vice versa). I have been wanting to see something like that for a while now. Quantify who is getting it done vs the top opponents (ie, playoff conditions), vs feasting on scrubs. I think it would fill in some blind spots from the one size fits all WAR type stats.
So...yes please!


For me that would be one of the Prime Attractions of a SABR Shtick subdomain.  Those would be the first two things I thought of:  Inside Scoop and SABR Elo ratings.  
Not sure how you'd apply them to things other than Starting Pitchers and Teams, though.
Hey Benihana.  What's my next step on the premium space station?  :- )


Rather than going by SP game results, I'd be doing it PA by PA. Take the average run value of an event, find the run value of each PA outcome. Calculate a single event W% for each outcome and call that the win score part of the elo.

bsr's picture

Would this be factoring in the quality of the opponent? I was envisioning something that basically replicates a team "strength of schedule" type formula but for each player at the PA/individual opponent level (aggregating PA by PA results). So a HR vs Felix is worth way more to a hitter's rating than a HR vs whoever. Not sure if that is what you were describing.


What elo ratings do is figure out how good each player is from the sequence of matches. If two players are separated by a lot on their current elo ratings when they face each other, the much worse player is rewarded a lot more for a "win" against the much better player than the better player would be for beating the worse one. Let's say, for example, that a 1200 elo batter is facing Felix, whose elo is, say, 2000. If the average run value of a PA in the AL is .35 and the batter homers (run value of 1.4), the single event W% is around .900, so the worse player is given .9 of a win in that matchup. And because Felix is so much better, making this result an upset win, the number of points that the batter takes from Felix is larger than if he had been facing a pitcher with a similar elo to himself.
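
That example can be sketched in a few lines of Python.  The 1200 and 2000 ratings and the .9 win score come from the comment above; the K-factor of 16 and the standard 400-point Elo curve are assumptions.

```python
def expected_score(rating, opponent_rating):
    """Win expectancy on the standard 400-point logistic Elo scale."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def points_gained(batter, pitcher, batter_score, k=16):
    """Rating points the batter takes from the pitcher for one PA.

    batter_score is the fractional "win" for the PA (0.9 for the homer).
    k=16 is an assumed K-factor.
    """
    return k * (batter_score - expected_score(batter, pitcher))


vs_felix = points_gained(1200, 2000, 0.9)  # the upset against an 800-point favorite
vs_peer = points_gained(1200, 1200, 0.9)   # same homer off an equally rated pitcher
print(f"off Felix: {vs_felix:.1f}   off a peer: {vs_peer:.1f}")
```

Under these assumptions the homer off Felix is worth a bit more than twice as many points as the same homer off a pitcher rated 1200.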


When last I considered live-tracking strength of schedule, I ran into the problem of needing daily play by play data as a season unfolded. I can, fairly readily, calculate elo ratings for players in all past completed seasons, but I have never been able to figure out how to get play by play data in a data frame for each day's games during a current season. So to track elo ratings live, I will need to find some way to do that.


If a player is a standard deviation worse, he benefits from 3:1 odds, both in reality and in the Elo points they exchange.  Don't doubt that the math would be fairly cumbersome as it applied to individual pitcher-batter matchups, but now that you put it in these terms, I see what you're getting at with Elo ratings for batters and pitchers.  Good stuff.  
It never occurred to me to use the Elo system in "handicap" contests.  By definition, in chess that 3:1 odds situation applies to a simple, fair, 50-50 game situation.  Now you introduce the idea of a 70% game in favor of the pitcher -- similar to Pawn odds in chess -- I have no idea how you would adapt the system.
I'm curious.  Where did you become familiar with the Elo system?


I did have an unofficial elo rating when I was in club chess in high school (was around 1500 or 1600).
The way to fix the problem in baseball of the matchup favoring the pitcher is to convert event run probabilities into winning percentage. The average at bat carries a standard run value. You're familiar with the concept of offensive winning percentage (James invented that), I assume. I propose to do that to the outcome of each PA. Once you're in W% scale, the matchup is fair again.


The one problem with my idea is that a negative event carries a negative run value. So the run values of all events will have to be shifted so that the worst events for the batter carry a zero W%. So rather than using a basic RC estimate for each event, I'd be using a marginal RC over an out. Not a big deal to calculate that though.
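
One plausible reading of that conversion, sketched in Python.  The thread never spells out the formula, so the Pythagorean form used here (borrowed from how offensive winning percentage is usually computed) and the exponent of 2 are assumptions.  The .35 and 1.4 run values are the ones quoted earlier in the thread, treated here as if they were already on the shifted, over-an-out scale.

```python
def single_event_wpct(marginal_runs, league_avg_runs, exponent=2.0):
    """Turn one PA's run value into a 0-to-1 'win score' for the batter.

    marginal_runs: run value of the event measured above an out, so an
                   out is 0.0 and nothing goes negative.
    league_avg_runs: run value of the average PA on the same scale.
    exponent: Pythagorean exponent (assumed to be 2 for this sketch).
    """
    if marginal_runs <= 0:
        return 0.0
    batter_side = marginal_runs ** exponent
    return batter_side / (batter_side + league_avg_runs ** exponent)


LEAGUE_AVG = 0.35  # average PA run value quoted in the thread

print(round(single_event_wpct(0.0, LEAGUE_AVG), 3))   # an out
print(round(single_event_wpct(0.35, LEAGUE_AVG), 3))  # a league-average PA
print(round(single_event_wpct(1.4, LEAGUE_AVG), 3))   # the homer
```

With these inputs an out scores 0, an average PA scores .500, and the homer lands a little above the "around .900" figure quoted above; a smaller exponent would pull it closer.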
