Plotting DEMs in 3D using persp()

A little background… I got the idea to plot DEMs in 3D from a great R website that seemed quite popular a few years ago but appears to have been taken down. This site helped me tremendously when I was first learning R. Anyway, one script demonstrated how to plot a DEM in 3D using persp(), but I always felt the script was more complicated than it needed to be.

Fast forward a couple of years, and with the creation of the great raster package, working with rasters has gotten a lot easier. The raster package has a function (also called persp()) that allows you to plot Raster* objects in 3D without having to convert any data. It works well but seems rather limited. Coloring the cells (or facets, as they are referred to in the help page) according to the z value seemed particularly difficult.

So, I decided to go back and see if I could simplify plotting DEMs using the graphics-package version of persp(). I'll demonstrate how I've plotted DEMs using persp() in an upcoming post, but I thought it'd be good to first review how persp() works, since I've had some battles getting data to plot correctly in the past.

Here are some key points to remember when using persp() to plot raster (or any other type of) data. I'll then demonstrate them using a simple data matrix.

Key Point 1) The z matrix represents the "wire frame" of the plot, NOT the cell centers.
Key Point 2) The x and y vectors give the (x, y) coordinates at which each value in the z matrix is plotted, and both vectors MUST be increasing. As a result, the z matrix appears rotated 90 degrees counterclockwise when plotted. The length of the x vector must equal nrow(z), and the length of the y vector must equal ncol(z) (a quick check of these requirements is sketched below).
Key Point 3) When an nr x nc z matrix is plotted, it creates an (nr-1) x (nc-1) matrix of facets (cells) that can be colored using the "col" argument. This matrix is also rotated 90 degrees counterclockwise when plotted.
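Before moving on, here's a quick sanity check you can run before calling persp() (a minimal helper of my own, not anything persp() requires); it stops with an error if the x and y vectors don't line up with the z matrix:

# Quick check (my own helper, not part of persp()): the x and y vectors
# must be increasing and must match the dimensions of the z matrix.
check_persp_inputs <- function(x, y, z) {
  stopifnot(is.matrix(z),
            length(x) == nrow(z),              # x runs along the rows of z
            length(y) == ncol(z),              # y runs along the columns of z
            !is.unsorted(x), !is.unsorted(y))  # both vectors must be increasing
  invisible(TRUE)
}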

To demonstrate these points, I will create a simple 4×4 matrix, along with x and y coordinate vectors that increase:

zmat <- matrix(seq(1, 16, 1), ncol = 4)
x <- seq(1, 4, 1)
y <- seq(1, 4, 1)
print(zmat)
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
persp(x, y, zmat, axes= T, nticks = 4,
      ticktype = "detailed", theta = 320)

persp() plot of our simple DEM.


See that zmat[1,1] of our persp() surface is plotted at (1, 1, 1), which is the lower-left corner of the x-y plane. persp() converts the x and y vectors into (x, y) coordinate pairs using something similar to:

n = 1
coords <- list()
for(i in 1:length(x)){
  for(j in 1:length(y)){
    coords[[n]] <- c(x[i], y[j])
    n <- n+1
  }
}

You can see that zmat[1,1] gets plotted at (1, 1, zmat[1,1]), zmat[1,2] gets plotted at (1, 2, zmat[1,2]), and so on. You'll also notice in the image above that the wireframe created a 3×3 matrix of facets (cells) that can be colored according to a 3×3 matrix of colors. To borrow some code from the persp() help page:

# create 3x3 matrix which averages the value of
# the 4 points bounding each facet.
nrzmat <- nrow(zmat)
nczmat <- ncol(zmat)
facetValues <- (zmat[-1, -1] + zmat[-1, -nczmat] + 
          zmat[-nrzmat, -1] + zmat[-nrzmat, -nczmat])/4

#Color Ramp
jet.colors <- colorRampPalette( c("blue", "green") )
# Generate the desired number of colors from this palette
nbcol <- 100
color <- jet.colors(nbcol)
# Recode facetValues into color indices
facetcol <- cut(facetValues, nbcol)

persp(x, y, zmat, axes= T, nticks = 4, ticktype = "detailed",
      theta = 320, col = color[facetcol])
persp() plot of DEM colored according to facetValue matrix.

The cool trick I learned from the help page example is the equation used to calculate the facetValue matrix. This will come in handy when plotting our own DEM. The equation simply calculates the average value of the four points that make up each facet, and this is what determines the facet color!
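If you end up doing this for more than one surface, the averaging-plus-coloring step is easy to wrap in a small helper. Here's a sketch of one way to do it (my own wrapper around the help-page recipe above, with the palette and number of colors as arguments):

# Wrap the help-page recipe: average the 4 corners of each facet and
# map those averages onto a color ramp.
facetColors <- function(z, pal = colorRampPalette(c("blue", "green")),
                        nbcol = 100) {
  nr <- nrow(z)
  nc <- ncol(z)
  fv <- (z[-1, -1] + z[-1, -nc] + z[-nr, -1] + z[-nr, -nc]) / 4
  pal(nbcol)[cut(fv, nbcol)]  # an (nr-1) x (nc-1) set of facet colors
}

persp(x, y, zmat, theta = 320, col = facetColors(zmat))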

I’ve decided to break this post up into two parts. I’ll have part 2 up shortly. Cheers!

Creating DEMs from contour lines using R

UPDATE: I made a slight change to the script, because I was getting an error when I ran the script on Windows, but not on Linux.  Previously, the script read:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], 
                        xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], 
                        ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"),
                        resolution = c(12,12))

But I found that the resolution = c(12,12) argument was giving an error on my Windows computer. Fortunately, there is an easy fix: just remove the argument and add the line res(contourRaster) <- 12:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], 
                        xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], 
                        ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"))
res(contourRaster) <- 12
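To confirm the grid came out as intended, a quick check doesn't hurt (a small optional sketch using the contourRaster object above):

# Optional sanity check on the empty grid
res(contourRaster)  # should report 12 12
dim(contourRaster)  # number of rows, columns and layers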

Everything should work smoothly now!

Quick Note: A few years ago I did a post on using R for spatial analyses. I recently thought I’d return to the topic, mainly to show how versatile R is and how many of the things you can do in dedicated GIS software (which can cost thousands of dollars to license!) can be re-created easily and quickly using R.

DEMs, or digital elevation models, are one of the most common types of data used in spatial analysis. DEMs are raster files that contain elevation information for each cell in the raster. I was recently given a shapefile of contour lines for a particular area and needed to create a DEM from them. As such, I thought I'd demonstrate how I created a DEM from contour lines stored in a vector shapefile (.shp), one of the most common GIS file types.

I won’t go into all the differences between rasters and vectors (ArcGIS has some good online resources in case you need to brush up) but the basic steps are:

1. Read in the files to R.
2. Create a raster “grid” to interpolate the elevation data onto.
3. Convert the contour lines to points so you can interpolate between elevation points.
4. Interpolate the elevation data onto the raster grid.
5. Crop the DEM raster to the boundary of the area of interest.

Before I begin: I obtained my data from the MassGIS web portal, which is a really nice web-based GIS for querying all types of spatial data within the Commonwealth of Massachusetts. Since I'm interested in elevation data only, I looked for a sub-basin (i.e. watershed) that had a nice topographic range and shape. I then downloaded the watershed boundaries and the 3m contour lines that covered the watershed as separate shapefiles.

Step 1. This is quite simple, since there are some really nice R packages for dealing with spatial data. To read in our data, we will use the maptools library. You'll notice that readShapeLines() and readShapePoly() have a proj4string argument. This is where you tell R what the projection and datum are for your data. MassGIS actually asks which projection you'd like your data in, so I knew ahead of time which EPSG code I needed to input:

library(sp)
library(maptools)
library(rgdal)
library(raster)
library(rgeos)

#Read in shape files
setwd("C:/LocationOfYourContourData")
contour3m <- readShapeLines("GISDATA_CONTOURS5K_ARC",
                            proj4string = CRS("+init=epsg:26918"))

setwd("C:/LocationOfYourBoundaryData")
ws_bound <- readShapePoly("GISDATA_SUBBASINS_POLY",
                           proj4string = CRS("+init=epsg:26918"))

Let's plot the files to see what we are working with. I'm interested in the watershed in the center, but you can see that the downloaded data include adjacent watersheds. Also, the contours extend beyond the watershed of interest. We'll deal with this later when we crop and clean up our data:

# plot of ws_bound with watershed of interest highlighted
spplot(ws_bound, zcol = "SUB_ID", col = "black", colorkey = F,
       fill = "grey80", 
       panel = function(x, y, ...){
         panel.polygonsplot(x,y, ...)
         sp.polygons(ws_bound[as.character(ws_bound$SUB_ID) == 1007, ],
                    fill = "magenta")
       })

# same plot as above, but include contour3m lines
spplot(ws_bound, zcol = "SUB_ID", col = "black", colorkey = F,
       fill = "grey80", 
       panel = function(x, y, ...){
         panel.polygonsplot(x,y, ...)
         sp.polygons(ws_bound[as.character(ws_bound$SUB_ID) == 1007, ],
                     fill = "magenta")
         sp.lines(contour3m, col = "black")
       })
Watershed boundaries downloaded from MassGIS. The watershed of interest is highlighted in magenta.

Added 3m contour lines to plot to show extent of contour lines outside watershed of interest.

Step 2. I need to create an empty raster grid that spans the range of my sub-basin. Since I’m using the contour lines to create the DEM, I used them as the basis for my raster grid. Notice that I manually set the resolution (size) of each cell in the raster to be square (12m x 12m). This is required if you ever want to import your data into ArcGIS (as far as I’m aware). I also set the resolution so that I didn’t have too many cells. The resolution should depend on your needs and computing power:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], 
                        xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], 
                        ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"),
                        resolution = c(12,12))

Step 3. Since interpolation methods were conceived to work with point data, we need to convert the elevation contour lines to elevation points. Fortunately, this is quite easy to do in R using as(). Essentially, we are creating points along each contour line, each carrying the elevation of the associated contour line:

# Create points from contour lines so we can
# interpolate between points.
p <- as(contour3m, 'SpatialPointsDataFrame')
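It's worth a quick look at the result to confirm the points carry the elevation attribute we're about to interpolate (a small optional check; the attribute is called ELEV_M in my shapefile, as used in the next step):

# Optional check: the converted points should keep the ELEV_M attribute
class(p)           # SpatialPointsDataFrame
summary(p$ELEV_M)  # elevations carried over from the contour lines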

Step 4. Now we have points from which we can interpolate an elevation raster. There are probably many different interpolation packages out there, but I'm going to use gstat, since it seems pretty robust for spatial statistics, and there are nice examples using the gstat and raster packages together for interpolating point data to rasters. We will use gstat() to create a model for interpolating our point elevation data using inverse distance weighting. Then, we use interpolate() to predict the elevation value at each cell in contourRaster. Note that this step can take some time depending on your computer's processing power:

#interpolate between points using IDW into contourRaster
library(gstat)
g <- gstat(id="ELEV_M", formula = ELEV_M~1, data=p,
           nmax=7, set=list(idp = .5))
interpDEM <- interpolate(contourRaster, g)
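Before moving on to the clipping, it can be worth a quick visual check that the interpolated surface looks sensible (a simple optional sketch using the raster package's plot() method):

# Quick visual check of the interpolated surface (optional)
plot(interpDEM, main = "IDW-interpolated elevation (m)")
lines(contour3m)  # overlay the source contours for comparison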

Step 5. Finally, we can clean up the data by clipping our DEM to the boundaries of the watershed. We can also clip our contour lines to the boundaries of our watershed:

# extract the watershed boundary of interest
HudsonWS_bound <- ws_bound[as.character(ws_bound$SUB_ID) == 1007, ]
# clip raster to watershed boundaries
interpDEM_mask <- mask(interpDEM, HudsonWS_bound)
# clip contours to watershed boundary
HudsonWS_contour <- crop(contour3m,HudsonWS_bound)

That wasn’t too bad. We can plot it up nicely using spplot() like I did in a previous post:

# subset 3m contours for nice plotting
HudsonWS_contourSub <- HudsonWS_contour[(HudsonWS_contour$ELEV_M) %in% 
                                          seq(354, 993, 30), ]
# Plot it up!
spplot(interpDEM_mask, col.regions = heat.colors(100),
       xlim=c(bbox(interpDEM_mask)[1,1] + 900,
              bbox(interpDEM_mask)[1,2] - 900),
       ylim=c(bbox(interpDEM_mask)[2,1] + 1150,
              bbox(interpDEM_mask)[2,2] - 900),
       colorkey = list(space = "top", width = 1), main = "Elev. (m)",
       panel = function(x, y, ...){
         panel.gridplot(x,y, ...)
         sp.lines(HudsonWS_contourSub, col = "black")
       })
Final DEM with 30m contour lines.

Not too shabby! Anyway, keep an eye on Chit Chat R for more spatial analysis in near future…

Fitting a (non-linear) curve to data points

Anyone who has taken a high school math class is probably familiar with fitting a straight line to data points:

1. graph the points
2. draw a best fit straight line through the points
3. calculate the slope, m = (y2 – y1)/(x2 – x1), where (x1, y1) and (x2, y2) are two different points on the best fit line
4. Continue the best fit line through the y-axis to determine b, the y-intercept.

Voila! You have an equation for a straight line in the form y = mx + b! If you're like me and are lazy, you can use R to calculate the two parameters (m and b) using the lm() function along with the summary() function.
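For example, with some made-up data (a quick sketch; any x/y pair will do), lm() hands you m and b directly:

# Fit y = m*x + b to some dummy data with lm()
x <- 1:20
y <- 2.5 * x + 3 + rnorm(20, sd = 2)  # "true" slope of 2.5 and intercept of 3
fit <- lm(y ~ x)
coef(fit)      # intercept (b) and slope (m)
summary(fit)   # full summary, with standard errors and R-squared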

The above steps are all well and good if your data approximates a straight line, but what if your data are non-linear (ie. curved) and you want to fit a line to it?

Well, (not surprisingly) R has you covered: it has a nice built-in function called nls() that can fit a non-linear equation to your data. To use this function, all you need to do is decide on a model to fit to your data. This may seem simple, but choosing the correct model can be quite important. Typically in the natural sciences, we try to choose models that make sense physically (i.e. the parameters of the model can be related back to real-world phenomena) or a model that is well accepted in a particular area of research. Nonetheless, just keep in mind that model selection is a critical step.

With that out of the way, I've decided to demonstrate nls() using some real-world data, namely daily maximum temperature data (TMAX) taken from a great source of free climate data, NOAA. They have a really nice user interface for searching all different types of climate data.

On the National Climatic Data Center website, I was quickly able to locate daily MAX and MIN temperature data from Logan International Airport in Boston, MA. Within a few hours of submitting my data request, I received an email telling me my data was ready to be downloaded. Let us begin!

The email had a link to a .dat file, so I used R to grab the data from the web and load it into a data frame (I don't know how long NOAA data links stay active, so I suggest requesting your own data if you are following my script…). We are just interested in the date and TMAX columns, so we extract those:

library(chron)
url <- "http://www1.ncdc.noaa.gov/pub/orders/cdo/294541.dat"
dat <- read.table(url, sep = "", skip = 2, header = F)
bostonTemp <- data.frame(dat[c("V11", "V13")])
headers <- read.table(url, colClasses = "character", sep = "",
                      nrow = 1, header = F)
names(bostonTemp) <- c(headers[c("V6", "V8")])

Now that we have the data in a usable form, we can do some cleaning up. First, the temperature is given in tenths of a degree Celsius (I have no idea why…), so we can easily convert that into more familiar values by dividing by 10. Next, to make plotting easier, we can convert the dates into a time format that R likes, using the as.chron() function in the chron package. Finally, we can plot the data to see what it looks like:

bostonTemp$TMAX <- bostonTemp$TMAX/10
bostonTemp$DATE <- as.chron(paste(substr(bostonTemp$DATE, 5,6),
                                  substr(bostonTemp$DATE, 7,8),
                         substr(bostonTemp$DATE, 1,4), sep="/"))

par(font=2, font.axis = 2, tck = 0.01)
plot(bostonTemp$DATE, bostonTemp$TMAX)
2000-2014 TMAX data from Logan International Airport.

As you can see, I downloaded data spanning 1/1/2000 through the present. Clearly the temperature fluctuates in a repeating fashion, cold in the winter and hot in the summer (obviously). Let's look at the most recent year for which we have a full year of data, 2013:

bostonTemp2013 <- subset(bostonTemp, 
                         bostonTemp$DATE >= as.chron("01/01/2013") & 
                         bostonTemp$DATE <= as.chron("12/31/2013"))
plot(bostonTemp2013$DATE, bostonTemp2013$TMAX) 
2013 TMAX data from Logan International Airport.

It appears that the TMAX data have a convex upward shape, with the highest values of TMAX in the middle of the year and the lowest values at the beginning and the end. You could say that the data has the shape of a parabola! Hmm, let’s fit it.

Knowing the vertex form of the parabola equation, y = a*(x – h)^2 + k, we can use this as our model. We also need to supply nls() with a starting value for each of the unknown parameters that we are going to estimate (a, h and k). Knowing the physical meaning behind each of the model parameters (see this website for a quick refresher), we can guess that h is approximately the date where TMAX is greatest, k is roughly the maximum TMAX, and a should be a negative value (since we want the parabola to open downwards), let's say -1. Finally, we need to convert bostonTemp2013$DATE to a numeric format using as.numeric() so that we can fit an equation to the data:

x <- as.numeric(bostonTemp2013$DATE)
y <- bostonTemp2013$TMAX
h1 <- mean(x)
k1 <- max(y)
a1 <- -1

yfit<- nls(y ~ a*(x-h)^2 + k, start = list(a = a1, h = h1, k = k1))
yfitParm <- summary(yfit)$para[,1]

If we compare the parameter estimates nls() returned with our initial guesses, the starting value for h was quite close, while a and k were rougher guesses but close enough for the fit to converge:

> cbind(h1, k1, a1)
        h1   k1 a1
[1,] 15888 37.2  -1
> yfitParm
            a             h             k 
-8.329497e-04  1.590282e+04  2.480478e+01 

Now let's see how nls() did when fitting our data with a parabola:

ymod <- yfitParm[1]*(x-yfitParm[2])^2 + yfitParm[3]

par(font=2, font.axis = 2, tck = 0.01)
plot(bostonTemp2013$DATE, bostonTemp2013$TMAX)
lines(x, ymod)
2013 TMAX data from Logan International Airport fitted with a parabola.
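As a side note, instead of typing the fitted equation back out by hand, you could also let predict() evaluate the fitted model for you (a small alternative sketch using the same yfit object):

# Equivalent to building ymod manually: predict() evaluates the fitted
# model at the original x values.
lines(x, predict(yfit), col = "red", lty = 2)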

Not too bad! There may be more appropriate models out there for fitting temperature data, but this model seems to do the trick. Till next time…

Back to basics: Plotting two y-axes on the same plot

After another lengthy hiatus (PhD work isn’t all fun and games!), I’m back and decided to make a post about a particular problem that I keep coming across when I’m making my figures.  Many times, you want to plot two variables, say y1 and y2, on the same graph, against another variable (say x1), but the y variables have very different units/scales/etc.  This comes up a lot with time series data, when y1 and y2 span the same time but are measured on different days and/or represent different variables.

To overcome this issue, we can take advantage of a built-in feature of R, namely par(new=T). This tells R not to clear the existing plot before the next high-level plot() call, so the previous plot isn't erased. Essentially, what we are going to do is draw two figures on top of one another, with a different y-axis range specified in the second plot() call.
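Stripped down to its bare bones, the overlay trick looks something like this (a minimal sketch with made-up x1, y1 and y2 vectors; the full, polished version follows below):

# Bare-bones overlay: the second plot is drawn on top of the first,
# with its own scale shown on the right-hand axis.
plot(x1, y1, pch = 20)                           # first series, left axis
par(new = TRUE)                                  # don't clear the existing plot
plot(x1, y2, pch = 20, col = "red", axes = FALSE, ann = FALSE)
axis(4, col = "red", col.axis = "red")           # right-hand axis for y2
mtext("y2", side = 4, line = 2, col = "red")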

Let’s work through the code:

First we create our dummy data. For this example, both y1 and y2 are assumed to have the same x variable ("dates"):

### Create Dummy Data
library(chron)  # for seq.dates()
dates <- seq.dates("1/1/2014", "3/1/2014", by = "days")
y1 <- seq(1, length(dates), 1) + rnorm(length(dates), mean = 1, sd = 0.5)
y2 <- rev(seq(1+100, length(dates)+100, 1)) + rnorm(length(dates),
                                                    mean = 1, sd = 0.5)

Next we calculate the range for our dates and y1 variable and plot it up. Note we are NOT actually plotting axes or points at this time (ie. plot(..., type = "n", ...)):

par(mar = c(5,4,3,4), mgp = c(1,0.3,0), font=2,
    font.axis = 2, tck = 0.01)
date.range = sort(dates)
y1.range <- c(min(floor(range(y1))), max(ceiling(range(y1))))
plot(dates, y1, ylim = c(y1.range), ann = F, axes = F, type = "n",
     xlim = c(date.range[1], date.range[length(date.range)]))

We then add the box around the plotting region, and both the x axis (which shows the dates) and the y axis. We also add the points (finally!):

### Add axes for dates and y1
box()
axis.Date(1, at = seq(date.range[1], date.range[length(date.range)],
                      4), format="%m/%d/%y", labels = F, tck = 0.01)
label=seq(date.range[1], date.range[length(date.range)],4)
x.len=length(label)
text(x=label[1:x.len],  par("usr")[3]-1.35, srt = 45, adj = 1,
     labels = label, xpd = T, cex.axis  = 1.5)
axis(2, at = round(seq(y1.range[1], y1.range[2], 5)), las = 2,
     cex.axis = 1.25)
mtext(side = 2, "Y variable 1", at = 35, line = 2.3, cex = 1.5,
      col = "black")
mtext(side = 1, "Date", cex = 1.5, line = 3.2)

### Add Y1 points
points(dates, y1, pch = 20, cex = 1.5, col = "black")

Finally, we can add the second variable. But first don’t forget to add par(new=T) BEFORE you add the second plot! Also note that the second plot uses the y2 variable, not y1, and the ylim is set to match the range of the y2 variable:

### Now we add second Y axis and points----------------------------------------#
par(new=T)
y2.range <- c(min(floor(range(y2))), max(ceiling(range(y2))))
plot(dates, y2, ylim = c(y2.range), ann = F, axes = F, type = "n",
     xlim = c(date.range[1], date.range[length(date.range)]))
points(dates, y2, col="red", pch = 20)

### Add 2nd y axis
axis(4, at = round(seq(y2.range[1], y2.range[2], 5)), las = 2,
     cex.axis = 1.25,
     col = "red", col.axis = "red")
mtext(side = 4, "Y variable 2", at = 135, line = 2.3, cex = 1.5,
      col = "red")
Example of xy-plot with two y variables that span different ranges

The above case works when both y1 and y2 have the same x variable. But it's not hard to imagine that y1 and y2 were measured on different days. This is actually quite easy to handle. The only difference from the above code is that when we add the points for the second y variable (in the second points() function above), we would use the correct x variable. To add a 3rd variable to the plot above, which was measured on different days:

### add third Y variable----------------------------------------#
y3 <- rep(mean(y2),20) + rnorm (20, mean =10, sd= 5)
y3.dates <- dates[seq(2, 40, 2)] + 13
par(new=T)
y3.range <- c(min(floor(range(y3))), max(ceiling(range(y3))))
plot(y3.dates, y3, ylim = c(y3.range), ann = F, axes = F, type = "n",
     xlim = c(date.range[1], date.range[length(date.range)]))
points(y3.dates, y3, col="magenta", pch = 20)
lines(y3.dates, y3, col="magenta", lwd=2, pch = 20)

Here's the final product!

Added a 3rd y variable, but with a different length x variable.

HydroApp Part 1: Using Shiny and googleVis to download and display US water quality data

Update: I’ve added a link to the HydroApp under the Web App tab at the top of the page!  It’s a work in progress, so you might (probably) get some weird behavior.  Somethings I’ve already noticed:

1) It can take some time to load the Google Map with sampling locations. I’m guessing it has to do with calling the Google Map server, but not totally sure.

2) The map doesn’t display all the sites, I think it just displays the first 400.  This is something that I’m working on.  Maybe someone out there has some suggestions???

3) I’m still adding functionality, so right now, all it does is display the sampling locations.  I hope to add data visualization and download capabilities soon…

Anyway, enjoy!

It’s been a busy fall season for me, so I haven’t updated the blog as much as I was hoping.  But, the good news is I’ve been using R quite a bit in my research, and have continued playing around with the Shiny package.  Recently, I’ve been working on an app that’s (arguably) more useful than my first Shiny app, and has some potential for people interested in water quality data (like myself!)  I’d like to extend a big thank you again to Rstudio for creating Shiny and specifically starting a free beta program for hosting Shiny apps on their servers.  I don’t know how long this beta program will last, but it has worked great so far. Keep up the good work.  Anyway, on to the app…

For anyone who has ever attempted to examine water quality data collected by the United States Geological Survey (USGS) or the Environmental Protection Agency (EPA), their websites are not the most intuitive, to say the least.  Though the USGS and the EPA have begun to address this (their Water Quality Portal is actually pretty slick), it's still quite laborious to quickly and efficiently parse and display the data, particularly if you're interested in looking at many sites.  Even with the improved website, you still need to download and analyze the data in separate steps, and just organizing all the data from one station can be quite time consuming in a spreadsheet program.

So, I decided to do something about it.  The USGS and EPA have implemented a nice web service for querying their data through the web (details can be found here).  I wanted to create a simple app where you enter the US state and a water quality parameter of interest.  You would then be presented with a map of all the sampling locations within the state where the parameter of interest has been measured. Finally, you would be able to view the data and download it as a .csv or .txt file for offline analysis. I haven't had time to add the plotting or download features, but I hope to have them up soon.

As in my previous post, I will briefly go through the scripts, describing some key points and the general layout. This is just part 1, so the app is still a work in progress.  Let me know what you think!

First, the entire HydroApp script can be seen here. And, if you’d like to try it out, I’ve hosted it on the Rstudio server here!

As you can see, the server side is made up of 3 reactive functions (makeURL(), testURL(), and stateStationsInput()). Essentially, the inputs from the State and Parameter selection boxes in the ui.R script are combined with the base Water Quality Portal URL to download a .csv file that contains all the stations that have the desired water quality parameter. The makeURL() function handles the correct formatting of the URL so we can download the data:

  makeURL <- reactive({
    inStateID <- input$stateID
    inStateID.url <- URLencode(paste("US", inStateID, sep=":"),
                               reserved = T)
    inParmID <- input$parmID
    inParmID.url <- urlEncode(inParmID)

    url <- paste("http://www.waterqualitydata.us/Station/search?countrycode=US",
                 "&statecode=", inStateID.url,
                 "&characteristicName=", inParmID.url,
                 "&siteType=Stream&mimeType=csv&zip=no", sep="")
    print(url)
    return(url)
  })

You will notice that both the State ID (inStateID) and parameter ID (inParmID) need to be appropriately encoded to work within a URL. To do this I wrote a small function called urlEncode() that converts certain characters into a % plus a two-digit hexadecimal representation, which can be read by a web browser:

library(gsubfn)  # urlEncode() uses gsubfn()
urlEncode <- function(URL) {
  gsubfn(".", list("_" = "%5F", "." = "%2E", "+" = "%2B", "!" = "%21",
                   "*" = "%2A", "'" = "%27"), c(URL))
}

Once the URL is properly encoded, I then test the URL to see if any sites match the criteria. This is done with testURL():

  testURL <- reactive({
    url <- makeURL()
    urlHeaders <- HEAD(url)$header
    warnings <- which(names(urlHeaders) %in% "warning")

    if(!length(warnings)){
      noData = FALSE
      summary <- cbind(paste("Total Site Count: ",
                             urlHeaders$'total-site-count'))
      print(summary)
      print(noData)
    }
    if(length(warnings) > 0){
      noData = TRUE
      summary <- paste("WARNING :", unlist(urlHeaders)[warnings])
      print(summary)
    }
    return(list(noData, summary))

  })

Using the httr library, I read in the header information for the website using HEAD() and check to see if there are any warnings, specifically whether no stations match both of the input values (i.e. State and water quality parameter). The header information also tells you how many sites match the criteria; something I print out later for the user to see.

Finally, assuming there are no warnings, I use the URL to download the requested data, and from the downloaded .csv file I create a new data frame with two columns, stationLatLong and stationTip. The stationLatLong column is needed by googleVis to correctly display the location of the data. The stationTip column is used for the descriptive text that pops up whenever you click on a station location.

  stateStationsInput <- reactive({
    if(testURL()[[1]] == F){
      url <- makeURL()
      stateStations = read.csv(url, colClasses = "character",
                               sep =",", na.strings = "", fill = F,
                               allowEscapes = F, header = T)

      stationLatLong <- paste(stateStations$LatitudeMeasure,
                              stateStations$LongitudeMeasure, sep=":")
      stationTip <- paste(stateStations$MonitoringLocationName,
                          stateStations$MonitoringLocationIdentifier,
                          sep="<BR>")

      data.frame(stationLatLong, stationTip)
    }
  })

Finally, for the one output (right now), I use the awesome gvisMap() function in the googleVis package to display a Google map with all the stations that match the desired criteria (I think only the first 400 stations are displayed… I'm working on figuring out a way to see more).

  output$stationMap <- renderGvis({
    if(testURL()[[1]] == F){
      mapData <- stateStationsInput()
      gvisMap(mapData, "stationLatLong", "stationTip",
             options=list(showTip=TRUE, showLine=FALSE,
             enableScrollWheel=TRUE, mapType='terrain',
             useMapTypeControl=TRUE))
    }

  })

My new favorite package: Shiny!

In the last post I said that I would be doing some data analysis on the triathlon results I scraped off the web, but I recently decided to delve into the brave new world of web apps! Shiny, designed by the creators of RStudio (my preferred R IDE), is a really easy-to-use package for developing a web interface for your R scripts. I have some larger ideas regarding what I'd like to develop in the future, but for now I wanted to try out Shiny myself and see if I could develop a simple app.

Being a big fan of Ichiro Suzuki (who just surpassed 4,000 combined hits between MLB and Japan!), I decided to make an app that allows a user to compare the cumulative hit trajectory of any player (past or present) against the trajectories of the 28 players with 3,000 or more hits in the major leagues. Owing to the top-notch documentation of Shiny, I was able to put the app together in only a few hours.

This post is broken down into 3 parts: 1) getting the data together (which involves a little web scraping, so see my previous post about it if you're interested in doing some yourself), 2) writing the Shiny ui.R file and 3) writing the Shiny server.R file. As usual, I'm not trying to reinvent the wheel here, so I highly recommend reading the Shiny tutorial before you start, since I won't be covering too much of the basics of Shiny.  The tutorial was about all the introduction I needed to start building my app.

Part 1. Getting the data…

So I did a little googling and found that there are 28 players who have 3,000 or more hits in MLB.  Being lazy, I wrote a little script to scrape the names of these players off the Baseball-reference.com webpage that lists the all-time hits leaders in descending order:

### Scrape 3000 hit club from www.baseball-reference.com
library(XML)  # for htmlParse(), getNodeSet() and xmlValue()
b <- readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]","", members)

### Get members first and last name to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"),
                      function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"),
                     function(x) x[2])

What I’ve done is download the HTML code and put it in a format that is easy for R to read.  Then, using Firebug for Firefox, I was able to locate the HTML/XML path to the names on the all time hits list.  Finally, I extracted the player names from the HTML code, cleaned it up, and saved it as a vector to be used later on (notice I only extracted the first 28 players on the list since these are the players with >=3,000 hits).

Next, I needed to find the actual hit data by year so I could cumulatively sum it.  Of course I could have manually entered hit data into a spreadsheet, saved it as a .txt and read it back into R, but what fun is that??? So, I did some more googling and found this amazing baseball statistics website, created by Sean Lahman, where you can download insane amounts of data as .csv files or even as database files.  Since I'm only interested in batting stats (hits), I only need the Batting.csv file (which contains the batting stats) and the Master.csv file (which contains both the player names and playerIDs, which are needed to sort through the mountains of data):

setwd("C:/LocationOfYourDataHere")
master <- read.csv("Master.csv", header = TRUE, sep = ",",
                   quote = "\"", dec = ".", fill = TRUE,
                   comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",",
                    quote = "\"", dec = ".", fill = TRUE, 
                    comment.char = "")

### extract playerIDs from Master.csv and 
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master,
                      as.character(master$nameLast) == memberLast[[i]] &
                      as.character(master$nameFirst) == memberFirst[[i]]) 
  
  if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
  
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

What I did above was to use the players' first and last names to extract the playerID out of the Master.csv file, then use the playerID to extract the hitting data from Batting.csv (I plan on cleaning this up some day, but for now I just wanted to get it to work).

Now that I have all the hit data for the 28 players, I can sum each season cumulatively so the data plot nicely:

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

Phew!  That's actually the hard part. Now it's just a little copying and pasting, and some referencing of the tutorial, and we have our first web app.

Part 2. Creating the Shiny ui.R file

Now for the fun part, designing your web app.  I wanted to keep my first app simple, so I decided to have a sidebar with only two ways of sending data to R, and one graph as the main output.  The first way a user can interact with the app is to highlight the hit trajectory of a particular member of the 3,000 hit club.  To do this, I made a drop-down list containing the names of all 28 players.  When a name is selected, that hit trajectory is highlighted on the graph.  This was accomplished using the selectInput() function.

The other bit of interaction the user can do is plot the hit trajectory of ANY player, past or present, simply by entering in the name of the player and letting R look up the player in the master and batting data frames and plot the data on the graph.  This was accomplished using the textInput() function.

Finally, to display the graph, we tell shiny to plot the hit trajectories in the main panel (mainPanel()) by using the plotOutput() function.

library(shiny)

  # Define UI for the 3000 Hit Club application
shinyUI(pageWithSidebar(
  
  # Application title
  headerPanel("The 3000 Hit Club"),
  
  # Sidebar with controls to select a member of the 3,000 hit club
  # and input a non-member and plot their hit trajectory
  sidebarPanel(
    
    ### Dropdown menu to select a member of 3,000 hit club to highlight on 
    ### plot
    selectInput("member", "Member of 3000 hit Club:",
                list( "Pete Rose" = "rosepe01",
                      "Ty Cobb" = "cobbty01",
                      "Hank Aaron" = "aaronha01",
                      "Stan Musial" = "musiast01",    
                      "Tris Speaker" = "speaktr01",  
                      "Cap Anson" = "ansonca01",
                      "Honus Wagner" = "wagneho01",   
                      "Carl Yastrzemski" = "yastrca01",
                      "Paul Molitor" = "molitpa01",     
                      "Eddie Collins" = "collied01",
                      "Derek Jeter" = "jeterde01",     
                      "Willie Mays" = "mayswi01",    
                      "Eddie Murray" = "murraed02",
                      "Nap Lajoie" = "lajoina01",      
                      "Cal Ripken" = "ripkeca01",     
                      "George Brett" = "brettge01",   
                      "Paul Waner" = "wanerpa01",      
                      "Robin Yount" = "yountro01",     
                      "Tony Gwynn" = "gwynnto01",   
                      "Dave Winfield" = "winfida01",  
                      "Craig Biggio" = "biggicr01",
                      "Rickey Henderson" = "henderi01",
                      "Rod Carew" = "carewro01",      
                      "Lou Brock" = "brocklo01",    
                      "Rafael Palmeiro" = "palmera01",
                      "Wade Boggs" = "boggswa01",
                      "Al Kaline" = "kalinal01",
                      "Roberto Clemente" = "clemero01")),

    
    # To text input to select non-3000 hit member to plot hit trajectory
    textInput("player", "Player Name:", value = ""),
    
    # Button to update plot output
    submitButton("Update View")
    
  ),
  
  # Show the output plot of the hit trajectory
  mainPanel(
    #tableOutput("view"),    
    
    plotOutput("cumsumPlot")
  )
))

Part 3.  server.R file

The server.R file is the one that does the heavy lifting behind the scenes. It contains the R code that takes the user inputs (the input variables), does the data manipulation, then passes the results (the output variables) back to ui.R, which displays the output of server.R in your web browser.
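Stripped of all the baseball-specific code, the structure looks roughly like this (a minimal sketch of the input-to-output flow described above, not the actual file):

library(shiny)

shinyServer(function(input, output) {
  # A reactive expression re-runs whenever an input it uses changes
  selectedMember <- reactive({ input$member })

  # renderPlot() re-runs whenever the inputs/reactives it uses change,
  # and the resulting plot is sent back to ui.R for display
  output$cumsumPlot <- renderPlot({
    plot(1:10, main = selectedMember())
  })
})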

If you look closely, you will see that I included the script from Part 1 at the beginning of the server.R file.  This code sits outside the shinyServer() function, so it is run ONCE when the app is loaded into the browser, after which all of the data frames, matrices, vectors, etc. can be used by shinyServer().

After the data are loaded into R, we run the shinyServer() function, which contains the reactive functions.  Reactive functions are re-run any time a user changes one of the input variables.  You will see there is a reactive function called currentMemberHits(), which simply selects the desired 3,000 hit club member for plotting, and another reactive function called currentPlayerHits(), which gets the non-member's hit data from the master and batting data frames and calculates the cumulative hit trajectory.  Finally, there is a call to renderPlot(), which is re-run whenever currentMemberHits() or currentPlayerHits() changes.  renderPlot() just wraps normal R plotting code and sends the plot back to ui.R to be displayed in the web browser.

library(shiny)
library(XML)

### OVERHEAD
### Scrape 3000 hit club from www.baseball-reference.com
b = readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]","", members)

### Get members first and last name to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[2])

### Read in local files downloaded from...
setwd("C:/chitchat/data")
master <- read.csv("Master.csv", header = TRUE, sep = ",", quote = "\"",
                  dec = ".", fill = TRUE, comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",", quote = "\"",
                   dec = ".", fill = TRUE, comment.char = "")

### extract playerIDs from Master.csv and 
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master, as.character(master$nameLast) == memberLast[[i]] &
                        as.character(master$nameFirst) == memberFirst[[i]]) 
  
  if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
  
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

### For plotting
maxYears <- max(unlist(lapply(hitsMember, function(x) length(x))))
maxHits <- max(unlist(lapply(mHitsCumSum, function(x) max(x))))

### Define server logic required to plot various players against 3000 hit club
shinyServer(function(input, output) {
  
  ### get hits for chosen 3000 club member
  currentMemberHits <- reactive({ 
    
    ### Calculate cumulative summation of hits
    cumsum(hitsMember[[match(as.character(input$member), memberId)]])
    
  })
  
  ### get hits for non-3000 club player
  currentPlayerHits <- reactive({ 
    
    playerFirst <- lapply(strsplit(input$player, split = "[[:space:]]"),
                          function(x) x[1])
    playerLast <- lapply(strsplit(input$player, split = "[[:space:]]"),
                         function(x) x[2])
    
    ### extract hits and other batting data from Batting table
    masterPlayer <- list()
    playerId <- vector()
    battingPlayer <- list()
    hitsPlayer <- list()
    for(i in 1:length(playerLast)){
      masterSub <- subset(master, 
                          as.character(master$nameLast) == playerLast &
                          as.character(master$nameFirst) == playerFirst) 
      
      if(nrow(masterSub) > 1){ masterSub = masterSub[1, ] }
      
      playerId <- as.character(masterSub$playerID)  
      battingPlayer <- batting[as.character(batting$playerID) == playerId, ]
      hitsPlayer <- battingPlayer$H
    }
    
    ### Calculate cumulative summation of hits for non-member
    cumsum(hitsPlayer)
    
  })
  
  ### Output table comparing currentMemberHits() and currentPlayerHits()
  ### NOT IMPLEMENTED!
  output$view <- renderTable({
    data.frame("X" = currentMemberHits())#, "Y" = currentPlayerHits())
  })
  
  ### Output xy-scatterplot
  output$cumsumPlot <- renderPlot({
    plot(seq(1, maxYears, 1), rep(0, maxYears), type = "n",
         lwd = 2, xlim = c(0, maxYears), ylim = c(0, maxHits),
         xlab = "Year", ylab = "Hits")
    segments(x0 = -100, x1 = 1000, y0 = 3000, y1 = 3000, lty = 2, lwd = 2,
             col = "black")
    for(i in 1:length(mHitsCumSum)){
      lines(seq(1, length(mHitsCumSum[[i]]), 1), mHitsCumSum[[i]], lwd = 2,
            col = "grey70")
    }
    lines(seq(1, length(currentMemberHits()), 1), currentMemberHits(), lwd = 2, 
        col = "magenta")
    lines(seq(1, length(currentPlayerHits()), 1), currentPlayerHits(), lwd = 2, 
          col = "blue")
  })
})

That’s it. Now, to run the app we need to put both the ui.R and server.R into a directory, then in a R session, we can run the follow code to run the app LOCALLY:

library(shiny)
runApp("C:/LocationOfYourShinyAppDirectory")
screenshot of my first web app!

I was really impressed with how professional the app looks. This is hopefully only the beginning for a really promising package. I also plan on getting these scripts onto Github soon. When I do, I’ll add a link. Cheers!

UPDATE: I’ve added a new tab at the top of the page which includes the entire script for this app (3000 Hit Analyzer). It was a lot easier to add a new tab at the top of this website then create Github respository to share the scripts!

How to access data in HTML tables through R!

This post is going to be on web scraping, a technique that I've only used a few times but which can be very useful. It's essentially the process of accessing the underlying HTML code or other "behind the scenes" data from a website. From reading the Wikipedia page on web scraping, you should be aware that not all webmasters will be happy about someone "scraping" their data (and in some cases it may be illegal), so proceed at your own risk!!!

First, before I begin: the data I'm going to be using are from a race management company located in the northeast called F.I.R.M. I recently completed a race organized by them (I'm part of the data I will be using for this demo!). The data are accessible by anyone with an internet connection (not just participants), so I felt it was OK to use the race result data for this post.

Secondly, the method I will be discussing below is used to access data embedded in an HTML table from a particular website, and therefore it's website-specific. If you want to repeat what I show you for another website, you will need to 1) be able to view its HTML code, 2) figure out where in the HTML code the data of interest is stored, and 3) have a basic understanding of HTML and HTML/XML structure so you can use the correct path (more on this later) to tell R where the data is located. (1) isn't too hard to do. I use the Firefox browser, which has a really robust and useful plug-in called Firebug that lets you view the HTML code of any website you visit. (2) and (3) are trickier and will require some time studying HTML, but there are some good websites out there to help you learn the basics of HTML and XPath/XML path syntax.

So, first we need to know the URL of the website with the data we want to scrape. We can use the base readLines() function to read in the HTML code (which includes the data we are interested in); this creates a character vector of the HTML code. Unfortunately, that isn't very useful on its own, since it would be VERY difficult to extract the data we are interested in from raw text. So, we create an XMLInternalDocument object using htmlParse() from the XML library to more easily access the XML node that has our data:

library(XML)
b <- readLines("http://www.firm-racing.com/result_report.asp?RID=792&type=1")
bdoc <- htmlParse(b, asText = T)

Now that we have an XMLInternalDocument, we access the XML node that has our data using getNodeSet() and retrieve the raw data using xmlValue().

result.table <- getNodeSet(bdoc, path = "//table/td/div")
racer.rslt <- matrix(unlist(lapply(result.table, 
                     function(x) c(xmlValue(x)))),
                     ncol = 16, byrow = T)

You can see there is an argument "path" in getNodeSet(), which tells R where to look for our desired data in the XML document. Defining the correct path requires knowledge of XPath syntax that's not going to be covered here, but using Firebug and some trial and error, I was able to narrow down the location of the data to "//table/td/div" fairly quickly. You can see that getNodeSet() returns a list of the XML nodes that R found when it followed the path we defined in the function call. We then use xmlValue() inside an lapply() loop to extract the actual values from the nodes.

racer.rslt is our desired matrix which contains the data we want (race time, age group, finishing order, etc.). Now we can convert the matrix to a data frame, add headers (obtained from the website) and start analyzing the data!

result.df <- as.data.frame(racer.rslt,stringsAsFactors = FALSE)
header <- c("bib", "category", "swim_cat", "swim_ov", "swim_time",
            "TT1", "bike_cat", "bike_ov", "bike_time", "TT2",
            "run_cat", "run_ov", "run_time", "overall_cat",
            "overall_ov", "overall_time")
names(result.df) <- header
final data frame of race results ready for analysis.
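Just as a tiny teaser before the real analysis, even something as simple as table() gives a useful first look at the field (a hypothetical example using the headers defined above):

# e.g. how many finishers are there in each race category?
table(result.df$category)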

For my next post I’m going to do some analyzing of the data. Stay tuned!