HydroApp Part 1: Using Shiny and googleVis to download and display US water quality data

Update: I’ve added a link to the HydroApp under the Web App tab at the top of the page!  It’s a work in progress, so you might (probably will) get some weird behavior.  Some things I’ve already noticed:

1) It can take some time to load the Google Map with the sampling locations. I’m guessing it has to do with calling the Google Maps server, but I’m not totally sure.

2) The map doesn’t display all the sites; I think it just displays the first 400.  This is something I’m working on.  Maybe someone out there has some suggestions?

3) I’m still adding functionality, so right now, all it does is display the sampling locations.  I hope to add data visualization and download capabilities soon…

Anyway, enjoy!

It’s been a busy fall season for me, so I haven’t updated the blog as much as I was hoping.  But the good news is I’ve been using R quite a bit in my research, and I have continued playing around with the Shiny package.  Recently, I’ve been working on an app that’s (arguably) more useful than my first Shiny app, and that has some potential for people interested in water quality data (like myself!).  I’d like to extend a big thank you again to RStudio for creating Shiny and specifically for starting a free beta program for hosting Shiny apps on their servers.  I don’t know how long this beta program will last, but it has worked great so far. Keep up the good work.  Anyway, on to the app…

For those who have ever attempted to examine water quality data collected by the United States Geological Survey (USGS) or the Environmental Protection Agency (EPA), their websites are not the most intuitive, to say the least.  Though the USGS and EPA have begun to address this (their Water Quality Portal is actually pretty slick), it’s still quite laborious to quickly and efficiently parse and display the data, particularly if you’re interested in looking at many sites.  Even with the improved website, you still need to download and analyze the data in separate steps, and just organizing all the data from one station can be quite time consuming in a spreadsheet program.

So, I decided to do something about it.  The USGS and EPA have implemented a nice web service for querying their data through the web (details can be found here).  I wanted to create a simple app where you enter the US state and a water quality parameter of interest.  You would then be presented with a map of all the sampling locations within the state where that parameter has been measured. Finally, you would be able to view the data and download it as a .csv or .txt file for offline analysis. I haven’t had time to add the plotting or download features yet, but I hope to have them up soon.

As in my previous post, I will briefly go through the scripts, describing some key points and the general layout. This is just part 1, so the app is still a work in progress.  Let me know what you think!

First, the entire HydroApp script can be seen here. And, if you’d like to try it out, I’ve hosted it on the Rstudio server here!

As you can see, the server side is composed of three reactive functions (makeURL(), testURL(), and stateStationsInput()). Essentially, the inputs from the State and Parameter selection boxes in the ui.R script are combined with the base Water Quality Portal URL to download a .csv file that contains all the stations that have measured the desired water quality parameter. The makeURL() function handles the correct formatting of the URL so we can download the data:

  makeURL <- reactive({
    inStateID <- input$stateID
    inStateID.url <- URLencode(paste("US", inStateID, sep=":"),
                               reserved = T)
    inParmID <- input$parmID
    inParmID.url <- urlEncode(inParmID)

    url <- paste("http://www.waterqualitydata.us/Station/search?countrycode=US",
                 "&statecode=", inStateID.url,
                 "&characteristicName=", inParmID.url,
                 "&siteType=Stream&mimeType=csv&zip=no", sep="")
    print(url)
    return(url)
  })

You will notice that both the state ID (inStateID) and parameter ID (inParmID) need to be appropriately encoded to work within a URL. For the parameter, I used a small helper function called urlEncode() to convert reserved characters into a % followed by a two-digit hexadecimal representation, which can be read by a web browser:

# requires the gsubfn package
urlEncode <- function(URL) {
  gsubfn(".", list("_" = "%5F", "." = "%2E", "+" = "%2B", "!" = "%21",
                   "*" = "%2A", "'" = "%27"), c(URL))
}
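For what it’s worth, base R’s URLencode() with reserved = TRUE percent-encodes these characters (and more) on its own. A quick sketch of the alternative, using a hypothetical parameter name (whether the Water Quality Portal prefers %20 or + for spaces is an assumption worth testing):

```r
# base-R alternative to the gsubfn helper above
parm <- "Inorganic nitrogen (nitrate + nitrite)"  # hypothetical parameter name
URLencode(parm, reserved = TRUE)
# "Inorganic%20nitrogen%20%28nitrate%20%2B%20nitrite%29"
```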

Once the URL is properly encoded, I then test the URL to see if any sites match the criteria. This is done with testURL():

  testURL <- reactive({
    url <- makeURL()
    urlHeaders <- HEAD(url)$header
    warnings <- which(names(urlHeaders) %in% "warning")

    if(!length(warnings)){
      noData = FALSE
      summary <- cbind(paste("Total Site Count: ",
                             urlHeaders$'total-site-count'))
      print(summary)
      print(noData)
    }
    if(length(warnings) > 0){
      noData = TRUE
      summary <- paste("WARNING :", unlist(urlHeaders)[warnings])
      print(summary)
    }
    return(list(noData, summary))

  })

Using the httr library, I read in the header information for the website using HEAD() and check whether there are any warnings; specifically, whether no stations match both of the input values (i.e., state and water quality parameter). The header information also tells you how many sites match the criteria, something I print out later for the user to see.
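In isolation, the warning check boils down to looking for a "warning" entry among the header names. A minimal local sketch, with made-up headers standing in for what HEAD() returns:

```r
# toy stand-in for the named list of headers returned by httr's HEAD()
urlHeaders <- list("content-type" = "text/csv", "total-site-count" = "123")
warnings <- which(names(urlHeaders) %in% "warning")
noData <- length(warnings) > 0
noData  # FALSE: no warning header, so matching sites exist
```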

Finally, assuming there are no warnings, I use the URL to download the requested data, and from the downloaded .csv file I create a new data frame with two columns, stationLatLong and stationTip. The stationLatLong column is needed by googleVis to correctly display the location of the data. The stationTip column is used for the descriptive text that pops up whenever you click on a station location.

  stateStationsInput <- reactive({
    if(testURL()[[1]] == F){
      url <- makeURL()
      stateStations = read.csv(url, colClasses = "character",
                               sep =",", na.strings = "", fill = F,
                               allowEscapes = F, header = T)

      stationLatLong <- paste(stateStations$LatitudeMeasure,
                              stateStations$LongitudeMeasure, sep=":")
      stationTip <- paste(stateStations$MonitoringLocationName,
                          stateStations$MonitoringLocationIdentifier,
                          sep="<BR>")

      data.frame(stationLatLong, stationTip)
    }
  })

Finally, for the one output (right now), I use the awesome gvisMap() function in the googleVis package to display a Google map with all the stations that match the desired criteria (I think only the first 400 stations are displayed; I’m working on figuring out a way to see more).

  output$stationMap <- renderGvis({
    if(testURL()[[1]] == F){
      mapData <- stateStationsInput()
      gvisMap(mapData, "stationLatLong", "stationTip",
             options=list(showTip=TRUE, showLine=FALSE,
             enableScrollWheel=TRUE, mapType='terrain',
             useMapTypeControl=TRUE))
    }

  })
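Until I find a better fix for the 400-site cap, one hedged workaround (my own sketch, not yet in the app) is to truncate the data frame before handing it to gvisMap(), so at least the behavior is predictable:

```r
# toy stand-in for the station data frame built in stateStationsInput()
mapData <- data.frame(stationLatLong = paste(1:600, -(1:600), sep = ":"),
                      stationTip     = paste("site", 1:600))
mapData <- head(mapData, 400)  # keep only the first 400 rows
nrow(mapData)  # 400
```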

My new favorite package: Shiny!

In the last post I said that I would be doing some data analysis on the triathlon results I scraped off the web, but I recently decided to delve into the brave new world of web apps! Shiny, designed by the creators of RStudio (my preferred R IDE), is a really simple, easy-to-use package for developing a web interface for your R scripts. I have some larger ideas about what I’d like to develop in the future, but for now I wanted to try out Shiny myself and see if I could develop a simple app.

Being a big fan of Ichiro Suzuki (who just surpassed 4,000 combined hits between the MLB and Japan!), I decided to make an app that allows a user to compare the cumulative hit trajectory of a player (past or present) against the hit trajectory for the 28 players that have more than 3,000 hits in the major leagues. Owing to the top-notch documentation of Shiny, I was able to put the app together in only a few hours.

This post is broken down into 3 parts: 1) getting the data together (which involves a little web scraping, so see my previous post if you’re interested in doing some yourself), 2) writing the Shiny ui.R file, and 3) writing the Shiny server.R file. As usual, I’m not trying to reinvent the wheel here, so I highly recommend reading the Shiny tutorial before you start, since I won’t be covering too much of the basics of Shiny.  The tutorial was about all the introduction I needed to start building my app.

Part 1. Getting the data…

So I did a little googling and found that there are 28 players who have more than 3,000 hits in the MLB.  Being lazy, I wrote a little script to scrape the names of these players off the Baseball-Reference.com page that lists the all-time hits leaders in descending order:

### Scrape 3000 hit club from www.baseball-reference.com
library(XML)
b <- readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]","", members)

### Get members first and last name to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"),
                      function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"),
                     function(x) x[2])

What I’ve done is download the HTML code and put it in a format that is easy for R to read.  Then, using Firebug for Firefox, I was able to locate the HTML/XML path to the names on the all-time hits list.  Finally, I extracted the player names from the HTML code, cleaned them up, and saved them as a vector to be used later on (notice I only extracted the first 28 players on the list, since these are the players with >= 3,000 hits).
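The cleanup steps, run in isolation on a couple of made-up entries (Baseball-Reference marks some names with a trailing “+”):

```r
members <- c("Ty Cobb+", "Pete Rose")
members <- gsub("[+]", "", members)                          # strip the "+" markers
memberFirst <- sapply(strsplit(members, "[[:space:]]"), `[`, 1)
memberLast  <- sapply(strsplit(members, "[[:space:]]"), `[`, 2)
memberFirst  # "Ty"   "Pete"
memberLast   # "Cobb" "Rose"
```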

Next, I needed to find the actual hit data by year, so I could cumulatively sum it.  Of course I could have manually entered hit data into a spreadsheet, saved it as a .txt, and read it back into R, but what fun is that? So I did some more googling and found this amazing baseball statistics website, created by Sean Lahman, where you can download insane amounts of data as .csv files or even as database files.  Since I’m only interested in batting stats (hits), I only need the Batting.csv file (which contains the batting stats) and the Master.csv file (which contains both the player names and playerIDs, which are needed to sort through the mountains of data):

setwd("C:/LocationOfYourDataHere")
master <- read.csv("Master.csv", header = TRUE, sep = ",",
                   quote = "\"", dec = ".", fill = TRUE,
                   comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",",
                    quote = "\"", dec = ".", fill = TRUE, 
                    comment.char = "")

### extract playerIDs from Master.csv and 
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master,
                      as.character(master$nameLast) == memberLast[[i]] &
                      as.character(master$nameFirst) == memberFirst[[i]]) 
  
  if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
  
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

What I did above was use the players’ first and last names to extract the playerID out of Master.csv, then use the playerID to extract the hitting data from Batting.csv (I plan on cleaning this up some day, but for now I just wanted to get it to work).
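A vectorized alternative to the lookup loop is merge(), which joins on the name columns in one call. A sketch with toy stand-ins for the Master.csv rows (hypothetical data, assuming the column names above):

```r
master <- data.frame(nameFirst = c("Pete", "Ty", "Hank"),
                     nameLast  = c("Rose", "Cobb", "Aaron"),
                     playerID  = c("rosepe01", "cobbty01", "aaronha01"),
                     stringsAsFactors = FALSE)
wanted <- data.frame(nameFirst = c("Ty", "Hank"),
                     nameLast  = c("Cobb", "Aaron"),
                     stringsAsFactors = FALSE)
# one row per matched player, with playerID attached
merge(wanted, master, by = c("nameFirst", "nameLast"))$playerID
```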

Now that I had all the hit data for the 28 players, I can sum each season cumulatively so I can plot the data nicely:

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

Phew!  That’s actually the hard part. Now it’s just a little copying and pasting, and some referencing of the tutorial, and we have our first web app.

Part 2. Creating the Shiny ui.R file

Now for the fun part: designing the web app.  I wanted to keep my first app simple, so I decided to have a sidebar with only two ways of sending data to R, and one graph as the main output.  The first way a user can interact with the app is to highlight the hit trajectory of a particular member of the 3,000 hit club.  To do this, I made a drop-down list containing the names of all 28 players.  When a name is selected, that hit trajectory is highlighted on the graph.  This was accomplished using the selectInput() function.

The other bit of interaction is plotting the hit trajectory of ANY player, past or present, simply by entering the player’s name and letting R look the player up in the master and batting data frames and plot the data on the graph.  This was accomplished using the textInput() function.

Finally, to display the graph, we tell shiny to plot the hit trajectories in the main panel (mainPanel()) by using the plotOutput() function.

library(shiny)

# Define UI for the 3000 Hit Club application
shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("The 3000 Hit Club"),
  
  # Sidebar with controls to select a member of the 3,000 hit club
  # and input a non-member and plot their hit trajectory
  sidebarPanel(
    
    ### Dropdown menu to select a member of 3,000 hit club to highlight on 
    ### plot
    selectInput("member", "Member of 3000 hit Club:",
                list( "Pete Rose" = "rosepe01",
                      "Ty Cobb" = "cobbty01",
                      "Hank Aaron" = "aaronha01",
                      "Stan Musial" = "musiast01",    
                      "Tris Speaker" = "speaktr01",  
                      "Cap Anson" = "ansonca01",
                      "Honus Wagner" = "wagneho01",   
                      "Carl Yastrzemski" = "yastrca01",
                      "Paul Molitor" = "molitpa01",     
                      "Eddie Collins" = "collied01",
                      "Derek Jeter" = "jeterde01",     
                      "Willie Mays" = "mayswi01",    
                      "Eddie Murray" = "murraed02",
                      "Nap Lajoie" = "lajoina01",      
                      "Cal Ripken" = "ripkeca01",     
                      "George Brett" = "brettge01",   
                      "Paul Waner" = "wanerpa01",      
                      "Robin Yount" = "yountro01",     
                      "Tony Gwynn" = "gwynnto01",   
                      "Dave Winfield" = "winfida01",  
                      "Craig Biggio" = "biggicr01",
                      "Rickey Henderson" = "henderi01",
                      "Rod Carew" = "carewro01",      
                      "Lou Brock" = "brocklo01",    
                      "Rafael Palmeiro" = "palmera01",
                      "Wade Boggs" = "boggswa01",
                      "Al Kaline" = "kalinal01",
                      "Roberto Clemente" = "clemero01")),

    
    # Text input to select a non-3000 hit member and plot their hit trajectory
    textInput("player", "Player Name:", value = ""),
    
    # Button to update plot output
    submitButton("Update View")
    
  ),
  
  # Show the output plot of the hit trajectory
  mainPanel(
    #tableOutput("view"),    
    
    plotOutput("cumsumPlot")
  )
))

Part 3.  server.R file

The server.R file does the heavy lifting behind the scenes. It contains the R scripts that take user inputs (called input variables), do the data manipulation, then spit out the results (called output variables) to ui.R, which displays the outputs of server.R in your web browser.

If you look closely, you will see that I included the script from Part 1 at the beginning of the server.R file.  This code is outside the shinyServer() function, so it is run ONCE when the app is loaded into the browser, and then all the data frames, matrices, vectors, etc. can be used by shinyServer().

After the data is loaded into R, we run the shinyServer() function, which contains the reactive functions.  Reactive functions are re-run any time a user changes one of the input variables.  You will see there is a reactive function called currentMemberHits(), which simply selects the desired 3,000 hit member for plotting, and another called currentPlayerHits(), which gets the non-member’s hit data from the master and batting data frames and calculates the cumulative hit trajectory.  Finally, there is renderPlot(), which is re-run whenever currentMemberHits() or currentPlayerHits() changes.  renderPlot() just wraps normal R plotting functions and sends the plot back to ui.R to display in the web browser.
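The reactive pattern in miniature (a toy sketch with a made-up input$n, not the app’s code): a reactive() re-runs when the inputs it reads change, and any render function that calls it re-runs in turn.

```r
library(shiny)

# minimal sketch of the reactive dependency chain used in this app
server <- function(input, output) {
  squared <- reactive({ input$n ^ 2 })                # re-runs when input$n changes
  output$msg <- renderText({                          # re-runs when squared() changes
    paste("n squared is", squared())
  })
}
```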

library(shiny)
library(XML)

### OVERHEAD
### Scrape 3000 hit club from www.baseball-reference.com
b = readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]","", members)

### Get members first and last name to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[2])

### Read in local files downloaded from...
setwd("C:/chitchat/data")
master <- read.csv("Master.csv", header = TRUE, sep = ",", quote = "\"",
                  dec = ".", fill = TRUE, comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",", quote = "\"",
                   dec = ".", fill = TRUE, comment.char = "")

### extract playerIDs from Master.csv and 
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master, as.character(master$nameLast) == memberLast[[i]] &
                        as.character(master$nameFirst) == memberFirst[[i]]) 
  
  if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
  
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

### For plotting
maxYears <- max(unlist(lapply(hitsMember, function(x) length(x))))
maxHits <- max(unlist(lapply(mHitsCumSum, function(x) max(x))))

### Define server logic required to plot various players against 3000 hit club
shinyServer(function(input, output) {
  
  ### get hits for chosen 3000 club member
  currentMemberHits <- reactive({ 
    
    ### Calculate cumulative summation of hits
    cumsum(hitsMember[[match(as.character(input$member), memberId)]])
    
  })
  
  ### get hits for non-3000 club player
  currentPlayerHits <- reactive({ 
    
    playerFirst <- lapply(strsplit(input$player, split = "[[:space:]]"),
                          function(x) x[1])
    playerLast <- lapply(strsplit(input$player, split = "[[:space:]]"),
                         function(x) x[2])
    
    ### extract hits and other batting data from Batting table
    masterPlayer <- list()
    playerId <- vector()
    battingPlayer <- list()
    hitsPlayer <- list()
    for(i in 1:length(playerLast)){
      masterSub <- subset(master, 
                          as.character(master$nameLast) == playerLast &
                          as.character(master$nameFirst) == playerFirst) 
      
      if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
      
      playerId <- as.character(masterSub$playerID)  
      battingPlayer <- batting[as.character(batting$playerID) == playerId, ]
      hitsPlayer <- battingPlayer$H
    }
    
    ### Calculate cumulative summation of hits for non-member
    cumsum(hitsPlayer)
    
  })
  
  ### Output table comparing currentMemberHits() and currentPlayerHits()
  ### NOT IMPLEMENTED!
  output$view <- renderTable({
    data.frame("X" = currentMemberHits())#, "Y" = currentPlayerHits())
  })
  
  ### Output xy-scatterplot
  output$cumsumPlot <- renderPlot({
    plot(seq(1, maxYears, 1), rep(0, maxYears), type = "n",
         lwd = 2, xlim = c(0, maxYears), ylim = c(0, maxHits),
         xlab = "Year", ylab = "Hits")
    segments(x0 = -100, x1 = 1000, y0 = 3000, y1 = 3000, lty = 2, lwd = 2,
             col = "black")
    for(i in 1:length(mHitsCumSum)){
      lines(seq(1, length(mHitsCumSum[[i]]), 1), mHitsCumSum[[i]], lwd = 2,
            col = "grey70")
    }
    lines(seq(1, length(currentMemberHits()), 1), currentMemberHits(), lwd = 2, 
        col = "magenta")
    lines(seq(1, length(currentPlayerHits()), 1), currentPlayerHits(), lwd = 2, 
          col = "blue")
  })
})

That’s it! Now, to run the app, we need to put both ui.R and server.R into a directory; then, in an R session, we can run the following code to run the app LOCALLY:

library(shiny)
runApp("C:/LocationOfYourShinyAppDirectory")
Screenshot of my first web app!

I was really impressed with how professional the app looks. This is hopefully only the beginning for a really promising package. I also plan on getting these scripts onto Github soon. When I do, I’ll add a link. Cheers!

UPDATE: I’ve added a new tab at the top of the page which includes the entire script for this app (3000 Hit Analyzer). It was a lot easier to add a new tab at the top of this website than to create a Github repository to share the scripts!

How to access data in html tables through R!

This post is going to be about web scraping, a technique that I’ve only used a few times, but which can be very useful. It’s essentially the process of accessing the underlying HTML code or other “behind the scenes” data from a website. From reading the Wikipedia page on web scraping, you should be aware that not all webmasters will be happy about someone “scraping” their data (and in some cases it may be illegal), so proceed at your own risk!

First, before I begin: the data I’m going to be using are from a race management company located in the northeast called F.I.R.M. I recently completed a race organized by them (I’m part of the data I will be using for this demo!). The data is accessible by anyone with an internet connection (not just participants), so I felt it was OK to use race result data for this post.

Secondly, the method I will be discussing below is used to access data embedded in an HTML table on a particular website, and is therefore website-specific. If you want to repeat what I show you for another website, you will need to 1) be able to view its HTML code, 2) figure out where in the HTML code the data of interest is stored, and 3) have a basic understanding of HTML and HTML/XML structure so you can use the correct path (more on this later) to tell R where the data is located. (1) isn’t too hard to do. I use the Firefox browser, and it has a really robust and useful plug-in called Firebug that lets you view the HTML code of any website you visit. (2) and (3) are trickier and will require some time studying HTML, but there are some good websites out there to help you learn the basics of HTML and XPath/XML path syntax.

So, first we need to know the URL of the website with the data we want to scrape. We can use the base readLines() function to read in the HTML code (which includes the data we are interested in). This creates a character vector of the HTML code. Unfortunately, this isn’t very useful on its own, since it would be VERY difficult to extract the data we are interested in. So, we create an XMLInternalDocument object using htmlParse() from the XML library to more easily access the XML node that has our data:

library(XML)
b <- readLines("http://www.firm-racing.com/result_report.asp?RID=792&type=1")
bdoc <- htmlParse(b, asText = T)

Now that we have an XMLInternalDocument, we access the XML node that has our data using getNodeSet() and retrieve the raw data using xmlValue().

result.table <- getNodeSet(bdoc, path = "//table/td/div")
racer.rslt <- matrix(unlist(lapply(result.table, 
                     function(x) c(xmlValue(x)))),
                     ncol = 16, byrow = T)

You can see there is an argument “path” in getNodeSet(), which tells R where to look in the XML document for our desired data. Defining the correct path requires knowledge of XPath syntax, which I’m not going to cover here, but using Firebug and some trial and error, I was able to narrow down the location of the data to “//table/td/div” fairly quickly. You can see that getNodeSet() returns a list of the XML nodes R found when it followed the “path” we defined in the function call. We then use xmlValue() in an lapply() loop to extract the actual values in the nodes.
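The same pattern on a tiny inline document, so you can see the pieces without hitting the web (note: on this toy snippet I use the more forgiving path "//td/div", since well-formed HTML nests td inside tr):

```r
library(XML)
doc <- htmlParse("<table><tr><td><div>0:59:59</div></td></tr></table>", asText = TRUE)
nodes <- getNodeSet(doc, path = "//td/div")  # list of matching XML nodes
xmlValue(nodes[[1]])                          # "0:59:59"
```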

racer.rslt is our desired matrix containing the data we want (race time, age group, finishing order, etc.). Now we can convert the matrix to a data frame, add headers (obtained from the website), and start analyzing the data!

result.df <- as.data.frame(racer.rslt,stringsAsFactors = FALSE)
header <- c("bib", "category", "swim_cat", "swim_ov", "swim_time",
            "TT1", "bike_cat", "bike_ov", "bike_time", "TT2",
            "run_cat", "run_ov", "run_time", "overall_cat",
            "overall_ov", "overall_time")
names(result.df) <- header
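With named columns in place, the time strings are easy to work with. For instance, a small helper (my own sketch, not from the original post) to turn "H:MM:SS"-style times into seconds for analysis:

```r
# convert "H:MM:SS" or "MM:SS" strings into total seconds
to_seconds <- function(t) {
  parts <- as.numeric(strsplit(t, ":")[[1]])
  sum(parts * 60 ^ ((length(parts) - 1):0))
}
to_seconds("1:23:45")  # 5025
```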
Final data frame of race results, ready for analysis.

For my next post I’m going to do some analyzing of the data. Stay tuned!

Add error bars to a plot in R

So, it’s been a while since I last posted to this blog, but I haven’t forgotten about it.  I’ve just been really busy (aren’t we all?).  I’ve had a few blog posts in mind for a while now,  so I decided to post one so I don’t get too backlogged.

A common feature of scientific figures is the addition of error bars to data points, typically representing 2 standard deviations (SD), 2 standard errors (SE), etc.  When you include them in a graph, they should be easy to read, but not overwhelm the graph.  Here I’ll show you how you can scale the size of the error bars (by size, I’m referring to the cap width of the bars relative to the size of the axis). I’ve implemented it as a function so that you can reuse it (and improve it!).  If anyone has any suggestions, let me know.  Here goes:

First we will start by creating the function. I’ve added a couple of arguments to make the bars “pretty”, but it’s pretty easy to add more. Note the default values for arguments in the function definition (i.e., xcolor = "black" means that if no xcolor is specified when the function is called, it will default to “black”). You can also see how I calculate the width of the error bar caps: I take the difference between the largest and smallest values on the relevant axis and divide by the cap.scaling argument. 50 seems to make a nicely sized error bar cap, so I set that as the default, but you can use whatever you think is best.
Lastly, and most importantly, for Plot_ErrorBars() to work correctly, each argument that doesn’t have a default value MUST be passed either by name or in the EXACT order given in the function definition. I think there are ways to get around this, but that is for another post 🙂
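The cap-width arithmetic in isolation: with an axis running from 0 to 12 and the default cap.scaling of 50,

```r
usr <- c(0, 12, 0, 12)   # stand-in for par("usr"): xmin, xmax, ymin, ymax
cap.scaling <- 50
xcap.width <- (usr[2] - usr[1]) / cap.scaling
xcap.width  # 0.24 data units
```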

### Plotting function to plot y- or x-axis error bars
### Filename: Plot_ErrorBars.R
### Notes:
############################################################################

# INPUTS:
# x: x-axis data
# y: y-axis data
# x.err: error bars for x-axis data
# y.err: error bars for y-axis data
# xbar: Whether error bars for x-axis values should be plotted (Default is F)
# ybar: Whether error bars for y-axis values should be plotted (Default is F)
# cap.scaling: scaling factor for size of error bar caps. Default is 50.
# xcolor: line color. (Default is black)
# ycolor: line color. (Default is black)
# lwidth: line width (Default is 1)

# OUTPUTS:
# error bars for x-axis data, y-axis data, or both.

# FUNCTION:
Plot_ErrorBars<-function(x, y, x.err, y.err, xbar = F,
                         ybar = F, cap.scaling = 50, xcolor = "black",
                         ycolor = "black", lwidth = 1, ...){
  ycap.width <- (par("usr")[4] - par("usr")[3])/cap.scaling
  xcap.width <- (par("usr")[2] - par("usr")[1])/cap.scaling
  if(xbar == T){
    for(i in 1:length(x.err)){
      segments(x0 = x[i] + x.err[i], y0 = y[i],
               x1= x[i] - x.err[i],  y1 = y[i],
               lwd = lwidth, col = xcolor)
      segments(x0 = x[i] + x.err[i], y0 = y[i] + ycap.width,
               x1=x[i] + x.err[i],   y1 = y[i] - ycap.width,
               lwd = lwidth, col = xcolor)
      segments(x0 = x[i] - x.err[i], y0 = y[i] + ycap.width,
               x1=x[i] - x.err[i],   y1 = y[i] - ycap.width,
               lwd = lwidth, col = xcolor)
    }
  }
  if(ybar == T){
    for(i in 1:length(y.err)){
      segments(x0 = x[i], y0 = y[i] + y.err[i],
               x1= x[i],  y1 = y[i] - y.err[i],
               lwd = lwidth, col = ycolor)
      segments(x0 = x[i] + xcap.width, y0 = y[i] + y.err[i],
               x1=x[i] - xcap.width,   y1 = y[i] + y.err[i],
               lwd = lwidth, col = ycolor)
      segments(x0 = x[i] + xcap.width, y0 = y[i] - y.err[i],
               x1=x[i] - xcap.width,   y1 = y[i] - y.err[i],
               lwd = lwidth, col = ycolor)
    }
  }
}
# END OF FUNCTION

Now we will use Plot_ErrorBars() with some fake data. I’ll also demonstrate how changing the “cap.scaling” argument changes the cap widths. First I’ll just use the default arguments, and only plot error bars for the y values:

# Create fake data
x1 <- seq(1, 10, 1) + rnorm(length(seq(1, 10, 1)), mean = 0, sd = 0.5)
y1 <- seq(1, 10, 1) + rnorm(length(seq(1, 10, 1)), mean = 0, sd = 0.5)
x.se <- abs(rnorm(length(seq(1, 10, 1)), mean = 0, sd = 0.75))
y.se <- abs(rnorm(length(seq(1, 10, 1)), mean = 0, sd = 0.75))
# Plot
plot(x1, y1, pch = 20, cex = 1.5, xlim = c(0,12), ylim = c(0, 12))
Plot_ErrorBars(x1, y1, y.err=y.se, ybar = T)

Here are the results:
(figure: error bars with the default settings)

Now I’m going to change some of the default parameters and plot both x and y error bars:

# Plot
plot(x1, y1, pch = 20, cex = 1.5, xlim = c(0,12), ylim = c(0, 12))
Plot_ErrorBars(x1, y1, x.err=x.se, y.err=y.se, xbar = T, ybar = T,
               xcolor = "red", ycolor = "blue", lwidth = 2,
               cap.scaling = 100)

Here are the results. Note that increasing cap.scaling decreases the cap width, and all arguments that don’t have a default need to be specified! Finally, observant R users might notice that the error bars are plotted on top of the points!  This is an easy fix: re-plot the points after calling Plot_ErrorBars() using the points() function.

(figure: error bars with user-specified colors, line widths, and cap.scaling)

Hope you enjoy this post, and let me know if you have improvements or questions. Also, look for some more posts soon!

par(“usr”) is my new friend for inserting legends in plots!

I’ve decided to write a post about a solution to a problem I’ve been dealing with pretty much since I started using R to make nice graphics 5+ years ago. Many times while exploring datasets, I’d want to make many similar plots (such as xy scatterplots) but with different variables for the x and y axes. This was all fine and well, but when it came time to annotate the graph with legends, text, etc., I would have to manually adjust the positions of the legend and text for each plot so that they wouldn’t cover up the data, the text would always plot in the upper left corner, and so on.

I’ve always known about the “usr” parameter of the par() function (I even used it once in a previous post about plotting dates along the x-axis), but I never really investigated it fully. Well, recently I had to make many xy scatterplots and wanted to add a legend in the same location in each plot. A perfect application for “usr”!

For this example I will show my solution for always placing the legend in the lower right corner (since I knew ahead of time the data would not fall in that region). For a full explanation of the “usr” parameter and the par() function in general, check out ?par next time you are in R. I’m sure there are many other applications!

First, I’ll create some fake data and make an xy scatterplot. Next, I’ll use par("usr") to get the range of the x and y axes in the plot.  Then I will treat the legend as a variable to access the properties of the legend object (namely the width and the height of the legend rectangle).  Finally, I will anchor the legend in the lower right corner of the plot using information from the "lgd" variable and par("usr"):

# Create fake data
x1 <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
y1 <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)

# Plot
plot(x1, y1, pch = 20, cex = 2.5)

# Determine plot boundaries, in units of the data
xmin <- par("usr")[1]
xmax <- par("usr")[2]
ymin <- par("usr")[3]
ymax <- par("usr")[4]

# Now determine the size of the legend you would like to plot.  Right now the exact
# location is not important, we just want to know the dimensions!  Note that we are
# treating the legend as a variable and we are NOT plotting the legend on the figure!
lgd <- legend(x = mean(c(xmin, xmax)), y = mean(c(ymin, ymax)),
              c("Your data name"), pch = c(20), col = c("black"), plot = FALSE)

# Add legend in the lower right corner:
legend(x = xmax - lgd$rect$w, y = ymin + lgd$rect$h,
       c("Your data name"), pch = c(20), col = c("black"), plot = TRUE)

This has turned out to be a huge time saver, especially when I need to produce lots of quick plots with lots of different data!

“Automated” placement of the legend exactly in the lower right corner.
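For what it’s worth, the same measurement trick should generalize to any corner: lgd$rect gives the legend box’s width and height in user coordinates, so shifting the anchor repositions the box. Here is a sketch with the same kind of fake data (the corner offsets are my own illustration, not from the original post):

```r
# Fake data and base plot
set.seed(1)
x1 <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
y1 <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
plot(x1, y1, pch = 20, cex = 2.5)

# Plot boundaries in data units
xmin <- par("usr")[1]; xmax <- par("usr")[2]
ymin <- par("usr")[3]; ymax <- par("usr")[4]

# Measure the legend box without drawing it
lgd <- legend("center", "Your data name", pch = 20, plot = FALSE)

# legend(x, y) positions the TOP-LEFT of the box, so:
legend(x = xmin, y = ymax, "Your data name", pch = 20)               # upper left
legend(x = xmin, y = ymin + lgd$rect$h, "Your data name", pch = 20)  # lower left
```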

Using the col parameter in plot() to display a third variable

I’ve been working on a project recently where I needed to produce an xy scatterplot, and I wanted to plot each of the data points in a different color based on a third variable (let’s call it “z”). I developed a quick and easy utility to do this. It seems to work pretty well! I’m still working out how to add a color ramp legend to plot() to display the values of “z”, but in the meantime, I’ll show you what I’ve come up with:

First, I create a vector of colors. Below is a nice color scale I use often; it goes from black to blue to purple, orange and finally a deep red:

plotclr <- c(colorRampPalette(c("black", "blue"))(50),
             colorRampPalette(c("blue", "purple", "orange"))(55),
             colorRampPalette(c("orange", "red", "darkred"))(70))
 

Here’s an image of the ramp (in case you were wondering):

Color ramp produced from the above script.

Next, I create a utility to scale the third variable, z, to between 0 and 1. Then I multiply z_scl by the length of plotclr (50 + 55 + 70 = 175) and make sure that the smallest value in color_scl (i.e., 0) is set to 1, so that I can use color_scl as an index into plotclr:

z_scl <- (z - min(z, na.rm = TRUE))/(max(z, na.rm = TRUE) - min(z, na.rm = TRUE))
color_scl <- round(z_scl*length(plotclr))
color_scl[color_scl == 0] <- 1
 

Now, let’s plot something up with it:

### Fake data
x <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
y <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
z <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)

### Plotting
plot(x, y, type = "n") # create new plot
z_scl <- (z - min(z, na.rm = TRUE))/(max(z, na.rm = TRUE) - min(z, na.rm = TRUE))
color_scl <- round(z_scl*length(plotclr))
color_scl[color_scl == 0] <- 1

# Loop to plot each point
for(i in 1:length(x)){
  points(x[i], y[i], pch = 20, col = plotclr[color_scl[i]], cex = 2.5)
}
### End of Plotting
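As a side note, col in plot() and points() is vectorized, so the loop above can be collapsed into a single call. This is just an equivalent sketch of the same plot:

```r
# Color ramp and fake data, as above
set.seed(1)
plotclr <- c(colorRampPalette(c("black", "blue"))(50),
             colorRampPalette(c("blue", "purple", "orange"))(55),
             colorRampPalette(c("orange", "red", "darkred"))(70))

x <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
y <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)
z <- seq(0.1, 10, 0.5) + rnorm(length(seq(0.1, 10, 0.5)), 0.25)

# Scale z to an index into plotclr
z_scl <- (z - min(z, na.rm = TRUE))/(max(z, na.rm = TRUE) - min(z, na.rm = TRUE))
color_scl <- round(z_scl*length(plotclr))
color_scl[color_scl == 0] <- 1

# col is applied element-wise, so one call replaces the loop
plot(x, y, pch = 20, col = plotclr[color_scl], cex = 2.5)
```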
 

Here’s the final result:

xy scatterplot with the color of each point based on a third variable.

I should mention that by using a color ramp, the reader may have some difficulty determining the exact value of the third variable for each point, but as long as that is OK, I think this is a pretty nifty way of plotting three variables together! Till next time…
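Since I mentioned the missing color ramp legend: one possible sketch (only my own rough idea, not a polished solution) is to draw the ramp as a one-column image() beside the main plot with layout():

```r
# Color ramp and a simple z variable (made up for the sketch)
plotclr <- c(colorRampPalette(c("black", "blue"))(50),
             colorRampPalette(c("blue", "purple", "orange"))(55),
             colorRampPalette(c("orange", "red", "darkred"))(70))
z <- seq(0.1, 10, 0.5)
z_scl <- (z - min(z))/(max(z) - min(z))
color_scl <- pmax(1, round(z_scl*length(plotclr)))

# Split the device into the main plot and a narrow strip for the ramp
layout(matrix(1:2, ncol = 2), widths = c(4, 1))
plot(z, z, pch = 20, cex = 2, col = plotclr[color_scl])

# Draw the ramp as a one-column image, labeled with the range of z
par(mar = c(5, 1, 4, 3))
image(x = 1, y = seq(min(z), max(z), length.out = length(plotclr)),
      z = matrix(seq_along(plotclr), nrow = 1),
      col = plotclr, axes = FALSE, xlab = "", ylab = "")
axis(4)     # z values along the right side of the strip
layout(1)   # reset the layout
```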

R scripting in WordPress made easy

Another quick post here…

I thought I’d share a nice feature of WordPress.com that lets you display R code nicely. WordPress.com has an R-specific “sourcecode” tag that you can place at the beginning and end of your script to have it display in a readable form.

before R script add: (open bracket)sourcecode language=”r”(close bracket)
…put R script here…
after R script add: (open bracket)/sourcecode(close bracket)

where open bracket = [
and close bracket = ]

If you’re interested in starting an R WordPress blog, give it a try!