Monthly Archives: August 2013

My new favorite package: Shiny!

In the last post I said that I would be doing some data analysis on the triathlon results I scraped off the web, but I recently decided to delve into the brave new world of web apps! Shiny, designed by the creators of RStudio (my preferred R IDE) is a really easy and simple to use package for develop a web interface for your R scripts. I have some larger ideas regarding what I’d like to develop in the future, but for now I wanted to try out Shiny myself and see if I could develop a simple app.

Being a big fan of Ichiro Suzuki (who just surpassed 4,000 combined hits between the MLB and Japan!), I decided to make an app that allows a user to compare the cumulative hit trajectory of a player (past or present) against the hit trajectory for the 28 players that have more than 3,000 hits in the major leagues. Owing to the top-notch documentation of Shiny, I was able to put the app together in only a few hours.

This post is broken down into 3 parts: 1) getting the data together (which involves a little web scrapping, so see my previous post about if your interested in doing some yourself), 2) writing the Shiny ui.R file and 3) writing the Shiny server.R file. As usual, I’m not trying to recreate the wheel here, so I highly recommend reading the Shiny tutorial before you start since I won’t be covering too much of the basics of Shiny.  The tutorial was about all the introduction I needed to start building my app.

Part 1. Getting the data…

So I did a little googling, and found that there are 28 players who have more that 3,000 hits in the MLB.  Being lazy, I wrote a little script to scrape the names of these players off the Baseball-reference.com webpage that list the all-time hits leaders in descending order:

### Scrape 3000 hit club from www.baseball-reference.com
b = readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]","", members)

### Get members first and last name to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"),
                      function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"),
                     function(x) x[2])

What I’ve done is download the HTML code and put it in a format that is easy for R to read.  Then, using Firebug for Firefox, I was able to locate the HTML/XML path to the names on the all time hits list.  Finally, I extracted the player names from the HTML code, cleaned it up, and saved it as a vector to be used later on (notice I only extracted the first 28 players on the list since these are the players with >=3,000 hits).

Next, I needed to find the actual hit data, by year so I can cumulatively sum it.  Of course I could have manually entered hit data into a spreadsheet, saved it as a .txt and read it back into R, but what fun is that??? So, I did some more googling and found this amazing baseball statistics website, created by Sean Lahman where you can download insane amounts of data as .csv files or even as database files.  Since I’m only interested in batting stats (hits), I only need to use the Batting.csv file (which contains the batting stats) and Master.csv file (which contains both the player names and playerIDs, which are needed to sort through the mountains of data):

setwd("C:/LocationOfYourDataHere")
master <- read.csv("Master.csv", header = TRUE, sep = ",",
                   quote = "\"", dec = ".", fill = TRUE,
                   comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",",
                    quote = "\"", dec = ".", fill = TRUE, 
                    comment.char = "")

### extract playerIDs from Master.csv and 
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master,
                      as.character(master$nameLast) == memberLast[[i]] &
                      as.character(master$nameFirst) == memberFirst[[i]]) 
  
  if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
  
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

What I did above was to use the players first and last names to extract the playerID out of the Master.csv file, then use the playerID to extract out the hitting data from Batting.csv (I plan on cleaning this up some day, but for now,  I just wanted to get it to work).

Now that I had all the hit data for the 28 players, I can sum each season cumulatively so I can plot the data nicely:

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

Phew!  That’s actually the hard part. Now, its just doing a little copying a pasting and a some referencing of the tutorial and we have our first web app.

Part 2. Creating the Shiny ui.R file

Now for the fun part, designing your web app.  I wanted to keep my first app simple, so I decided to have a sidebar with only two way of sending data to R, and have one graph as the main output.  The first way a user can interact with the app is to highlight the hit trajectory of a particular member of the 3,000 hit club.  To do this, I made a drop down list containing the names of all 28 players.  When a name is selected, the hit trajectory is highlighted on the graph.  This was accomplished using the selectInput() function.

The other bit of interaction the user can do is plot the hit trajectory of ANY player, past or present, simply by entering in the name of the player and letting R look up the player in the master and batting data frames and plot the data on the graph.  This was accomplished using the textInput() function.

Finally, to display the graph, we tell shiny to plot the hit trajectories in the main panel (mainPanel()) by using the plotOutput() function.

library(shiny)

# Define UI for miles per gallon application
shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("The 3000 Hit Club"),
  
  # Sidebar with controls to select a member of the 3,000 hit club
  # and input a non-member and plot their hit trajectory
  sidebarPanel(
    
    ### Dropdown menu to select a member of 3,000 hit club to highlight on 
    ### plot
    selectInput("member", "Member of 3000 hit Club:",
                list( "Pete Rose" = "rosepe01",
                      "Ty Cobb" = "cobbty01",
                      "Hank Aaron" = "aaronha01",
                      "Stan Musial" = "musiast01",    
                      "Tris Speaker" = "speaktr01",  
                      "Cap Anson" = "ansonca01",
                      "Honus Wagner" = "wagneho01",   
                      "Carl Yastrzemski" = "yastrca01",
                      "Paul Molitor" = "molitpa01",     
                      "Eddie Collins" = "collied01",
                      "Derek Jeter" = "jeterde01",     
                      "Willie Mays" = "mayswi01",    
                      "Eddie Murray" = "murraed02",
                      "Nap Lajoie" = "lajoina01",      
                      "Cal Ripken" = "ripkeca01",     
                      "George Brett" = "brettge01",   
                      "Paul Waner" = "wanerpa01",      
                      "Robin Yount" = "yountro01",     
                      "Tony Gwynn" = "gwynnto01",   
                      "Dave Winfield" = "winfida01",  
                      "Craig Biggio" = "biggicr01",
                      "Rickey Henderson" = "henderi01",
                      "Rod Carew" = "carewro01",      
                      "Lou Brock" = "brocklo01",    
                      "Rafael Palmeiro" = "palmera01",
                      "Wade Boggs" = "boggswa01",
                      "Al Kaline" = "kalinal01",
                      "Roberto Clemente" = "clemero01")),

    
    # To text input to select non-3000 hit member to plot hit trajectory
    textInput("player", "Player Name:", value = ""),
    
    # Button to update plot output
    submitButton("Update View")
    
  ),
  
  # Show the output plot of the hit trajectory
  mainPanel(
    #tableOutput("view"),    
    
    plotOutput("cumsumPlot")
  )
))

Part 3.  server.R file

the server.R file is the file that does the heavy lifting behind the scenes. It contains the R scripts that takes user inputs (called input variables), does data manipulation, then spits out the results (called output variables) to ui.R, which then displays the outputs of server.R in your web browser.

If you look closely, you will see that I included the script from Part 1 at the beginning of the server.R file.  This code is outside the shinyServer() function, so is run ONCE when the app is loaded into the browser, and then all data frames, matrices, vectors, etc. can be used by shinyServer().

After the data is loaded into R, we run the shinyServer() function, which contains reactive functions.  Reactive functions are run any time a user changes one of the input variables.  You will see there is a reactive function called currentMemberHits(), which simply selects the desired 3,000 hit member for plotting, and there is another reactive function called currentPlayerHits() which get the non-members hit data from the master and batting data frames and calculates the cumulative hits trajectory.  Finally there is reactive function called renderPlot() which is run whenever currentMemberHits() or currentPlayerHits() changes.  renderPlot() just wraps normal R plotting functions and sends the plot back to ui.R to display in the web browser.

library(shiny)
library(XML)

### OVERHEAD
### Scrape 3000 hit club from www.baseball-reference.com
b = readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]","", members)

### Get members first and last name to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[2])

### Read in local files downloaded from...
setwd("C:/chitchat/data")
master <- read.csv("Master.csv", header = TRUE, sep = ",", quote = "\"",
                  dec = ".", fill = TRUE, comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",", quote = "\"",
                   dec = ".", fill = TRUE, comment.char = "")

### extract playerIDs from Master.csv and 
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master, as.character(master$nameLast) == memberLast[[i]] &
                        as.character(master$nameFirst) == memberFirst[[i]]) 
  
  if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
  
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

### For plotting
maxYears <- max(unlist(lapply(hitsMember, function(x) length(x))))
maxHits <- max(unlist(lapply(mHitsCumSum, function(x) max(x))))

### Define server logic required to plot various players against 3000 hit club
shinyServer(function(input, output) {
  
  ### get hits for chosen 3000 club member
  currentMemberHits <- reactive({ 
    
    ### Calculate cumulative summation of hits
    cumsum(hitsMember[[match(as.character(input$member), memberId)]])
    
  })
  
  ### get hits for non-3000 club player
  currentPlayerHits <- reactive({ 
    
    playerFirst <- lapply(strsplit(input$player, split = "[[:space:]]"),
                          function(x) x[1])
    playerLast <- lapply(strsplit(input$player, split = "[[:space:]]"),
                         function(x) x[2])
    
    ### extract hits and other batting data from Batting table
    masterPlayer <- list()
    playerId <- vector()
    battingPlayer <- list()
    hitsPlayer <- list()
    for(i in 1:length(playerLast)){
      masterSub <- subset(master, 
                          as.character(master$nameLast) == playerLast &
                          as.character(master$nameFirst) == playerFirst) 
      
      if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
      
      playerId <- as.character(masterSub$playerID)  
      battingPlayer <- batting[as.character(batting$playerID) == playerId, ]
      hitsPlayer <- battingPlayer$H
    }
    
    ### Calculate cumulative summation of hits for non-member
    cumsum(hitsPlayer)
    
  })
  
  ### Output table comparing currentMemberHits() and currentPlayerHits()
  ### NOT IMPLEMENTED!
  output$view <- renderTable({
    data.frame("X" = currentMemberHits())#, "Y" = currentPlayerHits())
  })
  
  ### Output xy-scatterplot
  output$cumsumPlot <- renderPlot({
    plot(seq(1, maxYears, 1), rep(0, maxYears), type = "n",
         lwd = 2, xlim = c(0, maxYears), ylim = c(0, maxHits),
         xlab = "Year", ylab = "Hits")
    segments(x0 = -100, x1 = 1000, y0 = 3000, y1 = 3000, lty = 2, lwd = 2,
             col = "black")
    for(i in 1:length(mHitsCumSum)){
      lines(seq(1, length(mHitsCumSum[[i]]), 1), mHitsCumSum[[i]], lwd = 2,
            col = "grey70")
    }
    lines(seq(1, length(currentMemberHits()), 1), currentMemberHits(), lwd = 2, 
        col = "magenta")
    lines(seq(1, length(currentPlayerHits()), 1), currentPlayerHits(), lwd = 2, 
          col = "blue")
  })
})

That’s it. Now, to run the app we need to put both the ui.R and server.R into a directory, then in a R session, we can run the follow code to run the app LOCALLY:

library(shiny)
runApp("C:/LocationOfYourShinyAppDirectory")
screenshot of my first web app!

screenshot of my first web app!

I was really impressed with how professional the app looks. This is hopefully only the beginning for a really promising package. I also plan on getting these scripts onto Github soon. When I do, I’ll add a link. Cheers!

UPDATE: I’ve added a new tab at the top of the page which includes the entire script for this app (3000 Hit Analyzer). It was a lot easier to add a new tab at the top of this website then create Github respository to share the scripts!

Advertisements