`install.packages()` function.
First I’ll create some dummy data, in this case, it will be a simple box (so we can make sure that the area is correct):

box.coords <- matrix(c(1, 1, 4, 1, 1, 4, 4, 4), nrow = 4, ncol = 2, byrow = T)
plot(box.coords[,1], box.coords[,2], pch = 20)

We have a square with sides of length 3, so we know beforehand that we should get an area of 9. Next, we will use the `chull()` function to extract the coordinates that make up the convex hull (in this case, all of the points, since we only have 4 points!). We’ll also need to close the convex hull by repeating the first row in the last row; the reason for this will be apparent in a minute. Finally, we create a new matrix with the convex hull points that includes the repeated coordinates to close off the convex hull:

box.hpts <- chull(x = box.coords[,1], y = box.coords[,2])
box.hpts <- c(box.hpts, box.hpts[1])
box.chull.coords <- box.coords[box.hpts,]

Finally, we load the “sp” package so we can use the function `Polygon()`. If you read the help file for `Polygon()`, you’ll see the first argument is a two-column matrix of (x, y) coordinates, with the coordinates in the first row repeated again in the last row (aha!). This closes off the polygon and allows `Polygon()` to calculate an area. To extract the area of the polygon, you access the `area` slot of the resulting Polygon object using the `@` operator:

chull.poly <- Polygon(box.chull.coords, hole=F)
chull.area <- chull.poly@area
> chull.area
[1] 9

Not too hard! You can do this with any polygon, as long as it is a valid Polygon object. Here is a quick example using more realistic data:

library(sp)
x1 <- rnorm(100, 0.8, 0.3)
y1 <- rnorm(100, 0.8, 0.3)
hpts <- chull(x = x1, y = y1)
hpts <- c(hpts, hpts[1])
xy.coords <- cbind(x1, y1)
chull.coords <- xy.coords[hpts,]
chull.poly <- Polygon(chull.coords, hole=F)
chull.area <- chull.poly@area

Just remember that calculating the area of a convex hull only makes sense if the units of the x and y axes are the same. Hope this answers the question!
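If you want to sanity-check the result without sp, the area of a closed ring can also be computed directly with the shoelace formula. Here is a minimal sketch (`shoelace_area` is my own helper, not part of the original post):

```r
# Shoelace formula for the area of a simple closed polygon.
# Expects a two-column (x, y) matrix with the first point repeated last.
shoelace_area <- function(xy) {
  x <- xy[, 1]; y <- xy[, 2]; n <- nrow(xy)
  abs(sum(x[-n] * y[-1] - x[-1] * y[-n])) / 2
}
ring <- matrix(c(1,1, 4,1, 4,4, 1,4, 1,1), ncol = 2, byrow = TRUE)
shoelace_area(ring)  # 9, matching the Polygon() area above
```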

There’s no downloading code, running it locally, etc… Just click on this link to try it out. I still haven’t figured out why it’s slow to load at certain times and fast at others. It seems to happen randomly, but often enough that it’s annoying. It’s probably something with the code, but it could have to do with sending commands to the server; the server might be dealing with other people’s requests (it is a free service, after all!). Anyway, try it out, I still recommend it. And happy new year!

`persp()` function, and how to shade the facets of a `persp()` plot using a desired color ramp. I’m going to build off this earlier post to show how you can take a 2D DEM and plot it in 3 dimensions. I’m going to be using the output from the script from this post, so don’t forget to run that script first before you start running the script below.

First, I’m going to extract some summary information from our final DEM, “interpDEM_mask”, particularly the minimum elevation value, `summary.elev[1]`. I then create a simple function, `setBaseElev`, that sets all NAs in our raster to the minimum elevation value. This allows us to plot a base surface around our DEM and gives our 3D plot some spatial context:

summary.elev <- summary(interpDEM_mask)
setBaseElev <- function(x) {
  x[is.na(x)] <- as.numeric(summary.elev[1])
  return(x)
}
elev3D <- calc(interpDEM_mask, setBaseElev)

Next, we convert our Raster object to a regular matrix so that we can extract the `nrow` and `ncol`, and eventually use `persp()` to plot it. You’ll also notice that I’ve trimmed a few rows and columns from my matrix. This isn’t necessary, but I did it because I realized that my bounding box for the raster was waaay larger than the actual DEM:

zData <- round(as.matrix(elev3D), 1)
zData <- zData[c(50:540), c(50:300)]

The next lines of code are required so that we can plot our DEM using the graphics package version of `persp()`. If you read the `persp()` help page, you’ll find that both the x and y vectors need to be INCREASING in value. Since we are only plotting for the visual effect, and not for analyzing data, we can simply create vectors that have the same lengths as the dimensions of zData. You’ll notice that the x and y vectors are multiplied by 12; this is to match the resolution of the raster we created earlier:

x = (12 * (1:nrow(zData)))
y = (12 * (1:ncol(zData)))

We want to color our DEM based on the elevation data of our DEM (duh!), so we create a matrix of elevation values derived from our DEM data, which will tell R how to color the facets. This matrix must have one fewer row and one fewer column than the elevation matrix (zData). We use the same equation we introduced in Part 1 to calculate the elevation values that go into the zfacet matrix:

nrz <- nrow(zData)
ncz <- ncol(zData)
zfacet <- (zData[-1, -1] + zData[-1, -ncz] +
           zData[-nrz, -1] + zData[-nrz, -ncz])/4

We then create a vector of colors that we will use to color the DEM facets. Here I am creating a vector with 100 colors, the first being “grey” and the rest given in hexadecimal (hex) notation from the built-in `terrain.colors()` color ramp. We use the `cut()` function to divide the zfacet elevation data into 100 equal ranges. Each of these ranges is then assigned the corresponding color from the color vector, starting with the lowest elevations being “grey”, and so on and so forth:

nbcol <- 99
color <- c("grey", terrain.colors(nbcol))
facetcol <- cut(zfacet, nbcol + 1)

Finally, we are ready to plot! I’ve added a bunch of arguments to the `persp()` function to make the output look pretty, but you can play around with different values to see how they change the final product:

res = persp(x, y, z = zData*2, theta = 90, phi = 30, col = color[facetcol],
            scale = FALSE, expand = 0.75, ltheta = 75, shade = 0.75,
            border = NA, box = F, ticktype = "detailed")

Pretty nice DEM! I’m working on a few new posts, so I hope to get them up soon. Ciao!

`persp()`, but I always felt the script was more complicated than it needed to be.

Fast forward a couple years, and with the creation of the great raster package, working with rasters seems to have gotten a lot easier. raster has a function (also called `persp()`) that allows you to plot Raster* objects in 3D without having to convert any data. It works well, but seems rather limited. Coloring the cells (or facets, as they are referred to in the help page) according to the z value seemed particularly difficult.

So, I decided to go back and see if I could simplify plotting DEMs using the graphics package version of `persp()`. I’ll demonstrate how I’ve plotted DEMs using `persp()` in an upcoming post, but I thought it’d be good to go back and review how `persp()` works, since I’ve had some battles getting data to plot correctly in the past.

Here are some key points to remember when using `persp()` to plot raster data (or any type of data, really). I’ll then demonstrate them using a simple data matrix.

**Key Point 1)** The *z* matrix represents the “wire frame” of the plot, NOT the cell centers.

**Key Point 2)** The *x* and *y* vectors represent the (x,y) coordinates where each value in the *z* matrix is plotted. Also, the *x* and *y* vectors MUST be increasing. This means that the *z* matrix is rotated 90 degrees counterclockwise when plotted. The length of the *x* vector must equal `nrow(z)`, and the length of the *y* vector must equal `ncol(z)`.

**Key Point 3)** When a *nr x nc* z matrix is plotted, it creates a *(nr-1) x (nc-1)* matrix of facets (cells) that can be colored using the “col” argument. This matrix is also rotated 90 degrees counterclockwise when plotted.

To demonstrate these points, I will create a simple 4×4 matrix, along with x and y coordinate vectors that increase:

zmat <- matrix(seq(1,16,1), nc = 4)
x <- seq(1, 4, 1)
y <- seq(1, 4, 1)
print(zmat)
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
persp(x, y, zmat, axes= T, nticks = 4, ticktype = "detailed", theta = 320)

See that `zmat[1,1]` of our `persp()` surface is plotted at the origin (1,1,1), which is the lower left corner (of the x-y plane). `persp()` converts the x and y vectors to (x,y) coordinates using something similar to:

n = 1
coords <- list()
for(i in 1:length(x)){
  for(j in 1:length(y)){
    coords[[n]] <- c(x[i], y[j])
    n <- n + 1
  }
}

You can see that `zmat[1,1]` gets plotted at (1,1,zmat[1,1]), `zmat[1,2]` gets plotted at (1,2,zmat[1,2]), and so on. You’ll also notice in the image above that the wireframe created a 3×3 matrix of facets (cells) that can be colored according to a 3×3 matrix of colors. To borrow some code from the `persp()` help page:

# create 3x3 matrix which averages the value of
# the 4 points bounding each facet.
nrzmat <- nrow(zmat)
nczmat <- ncol(zmat)
facetValues <- (zmat[-1, -1] + zmat[-1, -nczmat] +
                zmat[-nrzmat, -1] + zmat[-nrzmat, -nczmat])/4
# Color Ramp
jet.colors <- colorRampPalette( c("blue", "green") )
# Generate the desired number of colors from this palette
nbcol <- 100
color <- jet.colors(nbcol)
# Recode facetValues into color indices
facetcol <- cut(facetValues, nbcol)
persp(x, y, zmat, axes= T, nticks = 4, ticktype = "detailed",
      theta = 320, col = color[facetcol])

The cool trick I learned from the help page example is the equation used to calculate the `facetValues` matrix. This will come in handy when plotting our own DEM. The equation simply calculates the average value of the four points that make up each facet, and this average is what determines the facet color!

I’ve decided to break this post up into two parts. I’ll have part 2 up shortly. Cheers!

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"),
                        resolution = c(12,12))

But I found that the `resolution = c(12,12)` argument was giving an error on my Windows computer. Fortunately, there is an easy fix: just remove the argument and add the line `res(contourRaster) <- 12`:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"))
res(contourRaster) <- 12

Everything should work smoothly now!

*Quick Note: A few years ago I did a post on using R for spatial analyses. I recently thought I’d return to the topic, mainly to show how versatile R is and how many of the things you can do in dedicated GIS software (which can cost thousands of dollars to license!) can be re-created easily and quickly using R.*

DEMs, or digital elevation models, are one of the most common types of data used in spatial analysis. DEMs are raster files that contain elevation information for each cell in the raster. I was recently given a shapefile of contour lines for a particular area and needed to create a DEM from them. As such, I thought I’d demonstrate how I created a DEM from the contour lines that were in a vector shapefile (.shp), one of the most common GIS file types.

I won’t go into all the differences between rasters and vectors (ArcGIS has some good online resources in case you need to brush up), but the basic steps are:

1. Read in the files to R.

2. Create a raster “grid” to interpolate the elevation data onto.

3. Convert the contour lines to points so you can interpolate between elevation points.

4. Interpolate the elevation data onto the raster grid.

5. Crop the DEM raster to the boundary of the area of interest.

Before I begin, I obtained my data from the Mass GIS webportal, which is a really nice web-based GIS for querying all types of spatial data within the Commonwealth of Massachusetts. Since I’m interested in elevation data only, I looked for a sub-basin (ie. watershed) that had a nice topographic range and shape. I then downloaded the watershed boundaries, and the 3m contour lines that covered the watershed, as separate shapefiles.

Step 1. This is quite simple, since there are some really nice R packages for dealing with spatial data. To read in our data, we will use the maptools library. You’ll notice that `readShapeLines()` and `readShapePoly()` have a proj4string argument. This is where you tell R what the projection and datum are for your data. MassGIS actually asks which projection you’d like your data in, so I knew ahead of time what EPSG code I needed to input:

library(sp)
library(maptools)
library(rgdal)
library(raster)
library(rgeos)

# Read in shape files
setwd("C:/LocationOfYourContourData")
contour3m <- readShapeLines("GISDATA_CONTOURS5K_ARC",
                            proj4string = CRS("+init=epsg:26918"))
setwd("C:/LocationOfYourBoundaryData")
ws_bound <- readShapePoly("GISDATA_SUBBASINS_POLY",
                          proj4string = CRS("+init=epsg:26918"))

Let’s plot up the files to see what we are working with. I’m interested in the watershed in the center, but you can see that the downloaded data included adjacent watersheds. Also, the contours extend beyond the watershed of interest. We’ll deal with this later when we crop and clean up our data:

# plot of ws_bound with watershed of interest highlighted
spplot(ws_bound, zcol = "SUB_ID", col = "black", colorkey = F,
       fill = "grey80",
       panel = function(x, y, ...){
         panel.polygonsplot(x, y, ...)
         sp.polygons(ws_bound[as.character(ws_bound$SUB_ID) == 1007, ],
                     fill = "magenta")
       })

# same plot as above, but include contour3m lines
spplot(ws_bound, zcol = "SUB_ID", col = "black", colorkey = F,
       fill = "grey80",
       panel = function(x, y, ...){
         panel.polygonsplot(x, y, ...)
         sp.polygons(ws_bound[as.character(ws_bound$SUB_ID) == 1007, ],
                     fill = "magenta")
         sp.lines(HudsonWS_contourSub, col = "black")
       })

Step 2. I need to create an empty raster grid that spans the range of my sub-basin. Since I’m using the contour lines to create the DEM, I used them as the basis for my raster grid. Notice that I manually set the resolution (size) of each cell in the raster to be square (12m x 12m). This is required if you ever want to import your data into ArcGIS (as far as I’m aware). I also set the resolution so that I didn’t have too many cells. The resolution should depend on your needs and computing power:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"),
                        resolution = c(12,12))

Step 3. Since interpolation methods were conceived to work with point data, we need to convert the elevation contour lines to elevation points. Fortunately, this is quite easy to do in R using `as()`. Essentially, we are creating points along each contour line that have as their value the elevation of the associated contour line:

# Create points from contour lines so we can
# interpolate between points.
p <- as(contour3m, 'SpatialPointsDataFrame')

Step 4. Now we have points from which we can interpolate an elevation raster. There are probably many different interpolation packages out there, but I’m going to use gstat, since it seems pretty robust for spatial statistics, and there are nice examples using the gstat and raster packages together for interpolating point data to rasters. We will use `gstat()` to create a model for interpolating our point elevation data using inverse distance weighting. Then, we use `interpolate()` to predict the elevation value at each cell in contourRaster. Note that this step can take some time depending on the computing power of your computer:

# interpolate between points using IDW into contourRaster
library(gstat)
g <- gstat(id="ELEV_M", formula = ELEV_M~1, data=p, nmax=7,
           set=list(idp = .5))
interpDEM <- interpolate(contourRaster, g)

Step 5. Finally, we can clean up the data by clipping our DEM to the boundaries of the watershed. We can also clip our contour lines to the boundaries of our watershed:

# extract the watershed boundary of interest
HudsonWS_bound <- ws_bound[as.character(ws_bound$SUB_ID) == 1007, ]
# clip raster to watershed boundaries
interpDEM_mask <- mask(interpDEM, HudsonWS_bound)
# clip contours to watershed boundary
HudsonWS_contour <- crop(contour3m, HudsonWS_bound)

That wasn’t too bad. We can plot it up nicely using `spplot()` like I did in a previous post:

# subset 3m contours for nice plotting
HudsonWS_contourSub <- HudsonWS_contour[(HudsonWS_contour$ELEV_M) %in%
                                          seq(354, 993, 30), ]
# Plot it up!
spplot(interpDEM_mask, col.regions = heat.colors(100),
       xlim=c(bbox(interpDEM_mask)[1,1] + 900,
              bbox(interpDEM_mask)[1,2] - 900),
       ylim=c(bbox(interpDEM_mask)[2,1] + 1150,
              bbox(interpDEM_mask)[2,2] - 900),
       colorkey = list(space = "top", width = 1),
       main = "Elev. (m)",
       panel = function(x, y, ...){
         panel.gridplot(x, y, ...)
         sp.lines(HudsonWS_contourSub, col = "black")
       })

Not too shabby! Anyway, keep an eye on Chit Chat R for more spatial analysis in near future…

1. graph the points

2. draw a best fit straight line through the points

3. calculate the slope, m = (y2 – y1)/(x2 – x1), where (x1, y1) and (x2, y2) are two different points on the best fit line

4. Continue the best fit line through the y-axis to determine b, the y-intercept.

Voila! You have an equation for a straight line in the form y = mx + b! If you’re like me and are lazy, you can use R to calculate the two parameters (m and b) using the `lm()` function along with the `summary()` function.
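For instance, here’s a minimal sketch with made-up data (the true slope and intercept are my own choices for illustration):

```r
# Fit y = m*x + b to noisy synthetic data with lm();
# summary() reports the estimates and their standard errors.
set.seed(1)
x <- 1:20
y <- 2.5 * x + 4 + rnorm(20, sd = 1)   # true m = 2.5, b = 4
fit <- lm(y ~ x)
summary(fit)$coefficients              # table of estimates and std. errors
b <- coef(fit)[1]                      # intercept (b), close to 4
m <- coef(fit)[2]                      # slope (m), close to 2.5
```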

The above steps are all well and good if your data approximates a straight line, but what if your data are non-linear (ie. curved) and you want to fit a line to it?

Well, (not surprisingly) R has you covered. R has a nice built-in function called `nls()` that can fit a non-linear equation to your data. To use this function, all you need to do is decide on a model to fit to your data. This may seem simple, but choosing the correct model can be quite important. Typically in the natural sciences, we try to choose models that make sense physically (ie. the parameters of the model can be related back to real-world phenomena) or a model that is well accepted in a particular area of research. Nonetheless, just keep in mind that model selection is a critical step.
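As a hedged aside (my own sketch, not from the original analysis): when you do have competing candidate models, one quick numerical check is to compare them with `AIC()`, where the lower value indicates the better-supported model:

```r
# Hypothetical data with genuine curvature; compare a straight line
# against a quadratic fit. Lower AIC favors the quadratic here.
set.seed(2)
x <- 1:50
y <- -0.02 * (x - 25)^2 + 20 + rnorm(50, sd = 1)
lin  <- lm(y ~ x)
quad <- lm(y ~ x + I(x^2))
AIC(lin, quad)
```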

With that out of the way, I’ve decided to demonstrate `nls()` using some real-world data, namely daily maximum temperature data (TMAX) taken from a great source of free climate data: NOAA. They have a really nice user interface for searching all different types of climate data.

On the National Climate Data Center website, I was quickly able to locate daily MAX and MIN temperature data from Logan International Airport in Boston, MA. Within a few hours of submitting my data request, I received an email telling me my data was ready to be downloaded. Let us begin!

The email had a link to a .dat file, so I used R to grab the data from the web and load it into a data frame. (I don’t know how long NOAA data links stay active, so I suggest requesting your own data if you are following my script…) We are just interested in the date and TMAX data columns, so we extract those:

library(chron)
url <- paste("http://www1.ncdc.noaa.gov/pub/orders/cdo/294541.dat", sep="")
dat <- read.table(url, sep = "", skip = 2, header = F)
bostonTemp <- data.frame(dat[c("V11", "V13")])
headers <- read.table(url, colClasses = "character", sep = "",
                      nrow = 1, header = F)
names(bostonTemp) <- c(headers[c("V6", "V8")])

Now that we have the data in a usable form, we can do some cleaning up. First, the temperature is given in tenths of a degree Celsius (I have no idea why…), so we can easily convert it into more familiar values by dividing by 10. Next, to make plotting easier, we can convert the dates into a time format that R likes, using the `chron()` function in the “chron” package. Finally, we can plot the data to see what it looks like:

bostonTemp$TMAX <- bostonTemp$TMAX/10
bostonTemp$DATE <- as.chron(paste(substr(bostonTemp$DATE, 5,6),
                                  substr(bostonTemp$DATE, 7,8),
                                  substr(bostonTemp$DATE, 1,4), sep="/"))
par(font=2, font.axis = 2, tck = 0.01)
plot(bostonTemp$DATE, bostonTemp$TMAX)

As you can see, I downloaded data spanning 1/1/2000 through the present. Clearly the temperature fluctuates in a repeating fashion: cold in the winter and hot in the summer (obviously). Let’s look at the most recent year where we have a full year of data, 2013:

bostonTemp2013 <- subset(bostonTemp,
                         bostonTemp$DATE >= as.chron("01/01/2013") &
                         bostonTemp$DATE <= as.chron("12/31/2013"))
plot(bostonTemp2013$DATE, bostonTemp2013$TMAX)

It appears that the TMAX data have a convex upward shape, with the highest values of TMAX in the middle of the year and the lowest values at the beginning and the end. You could say that the data has the shape of a parabola! Hmm, let’s fit it.

Knowing the vertex form of the parabola equation, y = a*(x – h)^2 + k, we can use this as our model. We also need to supply `nls()` with a starting value for each of the unknown parameters that we are going to estimate (a, h, and k). Knowing the physical meaning behind each of the model parameters (see this website for a quick refresher), we can guess that h is approximately the date where TMAX is greatest and k is the maximum TMAX, and we can assume a is a negative value (since we want the parabola to open downwards), let’s say -1. Finally, we need to convert `bostonTemp2013$DATE` to a numeric format using `as.numeric()` so that we can fit an equation to the data:

x <- as.numeric(bostonTemp2013$DATE)
y <- bostonTemp2013$TMAX
h1 <- mean(x)
k1 <- max(y)
a1 <- -1
yfit <- nls(y ~ a*(x-h)^2 + k, start = list(a = a1, h = h1, k = k1))
yfitParm <- summary(yfit)$para[,1]

If we look at the parameter estimates `nls()` gave us, they weren’t too far off from our initial guesses!

> cbind(h1, k1, a1)
        h1   k1 a1
[1,] 15888 37.2 -1
> yfitParm
            a             h             k
-8.329497e-04  1.590282e+04  2.480478e+01

Now let’s see how `nls()` did when fitting our data with a parabola:

ymod <- yfitParm[1]*(x-yfitParm[2])^2 + yfitParm[3]
par(font=2, font.axis = 2, tck = 0.01)
plot(bostonTemp2013$DATE, bostonTemp2013$TMAX)
lines(x, ymod)

Not too bad! There may be more appropriate models out there for fitting temperature data, but this model seems to do the trick. Till next time…

To overcome this issue, we can take advantage of a built-in feature of R, namely `par(new=T)`. This allows you to create a new plot on top of the current plotting device (as R likes to call plots), without erasing the previous plot. Essentially, what we are going to do is plot two figures on top of one another, but with a different y-axis range specified in the second `plot()` function.

Let’s work through the code:

First we create our dummy data. For this example, both y1 and y2 are assumed to share the same x variable (“dates”):

### Create Dummy Data
library(chron)  # for seq.dates()
dates <- seq.dates("1/1/2014", "3/1/2014", by = "days")
y1 <- seq(1, length(dates), 1) + rnorm(length(dates), mean = 1, sd = 0.5)
y2 <- rev(seq(1 + 100, length(dates) + 100, 1)) +
      rnorm(length(dates), mean = 1, sd = 0.5)

Next we calculate the range for our dates and y1 variable and plot it up. Note we are NOT actually plotting axes or points at this time (ie. `plot(..., type = "n", ...)`):

par(mar = c(5,4,3,4), mgp = c(1,0.3,0), font=2, font.axis = 2, tck = 0.01)
date.range = sort(dates)
y1.range <- c(min(floor(range(y1))), max(ceiling(range(y1))))
plot(dates, y1, ylim = c(y1.range), ann = F, axes = F, type = "n",
     xlim = c(date.range[1], date.range[length(date.range)]))

We then add the box around the plotting region, and both the x axis (which shows dates) and the y axis. We also add the points (finally!):

### Add axes for dates and y1
box()
axis.Date(1, at = seq(date.range[1], date.range[length(date.range)], 4),
          format="%m/%d/%y", labels = F, tck = 0.01)
label=seq(date.range[1], date.range[length(date.range)],4)
x.len=length(label)
text(x=label[1:x.len], par("usr")[3]-1.35, srt = 45, adj = 1,
     labels = label, xpd = T, cex.axis = 1.5)
axis(2, at = round(seq(y1.range[1], y1.range[2], 5)), las = 2,
     cex.axis = 1.25)
mtext(side = 2, "Y variable 1", at = 35, line = 2.3, cex = 1.5, col = "black")
mtext(side = 1, "Date", cex = 1.5, line = 3.2)

### Add Y1 points
points(dates, y1, pch = 20, cex = 1.5, col = "black")

Finally, we can add the second variable. But first, don’t forget to add `par(new=T)` BEFORE you add the second plot! Also note that the second plot uses the y2 variable, not y1, and the `ylim` is set to match the range of the y2 variable:

### Now we add second Y axis and points --------------------------------------#
par(new=T)
y2.range <- c(min(floor(range(y2))), max(ceiling(range(y2))))
plot(dates, y2, ylim = c(y2.range), ann = F, axes = F, type = "n",
     xlim = c(date.range[1], date.range[length(date.range)]))
points(dates, y2, col="red", pch = 20)

### Add 2nd y axis
axis(4, at = round(seq(y2.range[1], y2.range[2], 5)), las = 2,
     cex.axis = 1.25, col = "red", col.axis = "red")
mtext(side = 4, "Y variable 2", at = 135, line = 2.3, cex = 1.5, col = "red")

The above case works when both y1 and y2 share the same x variable. But it’s not hard to imagine that y1 and y2 were measured on different days. This is actually quite easy to handle: the only difference from the above code is that when we add the points for the second y variable (in the second `points()` call above), we would use the correct x variable. To add a 3rd variable, measured on different days, to the plot above:

### add third Y variable -----------------------------------------------------#
y3 <- rep(mean(y2), 20) + rnorm(20, mean = 10, sd = 5)
y3.dates <- dates[seq(2, 40, 2)] + 13
par(new=T)
y3.range <- c(min(floor(range(y3))), max(ceiling(range(y3))))
plot(dates, y3, ylim = c(y3.range), ann = F, axes = F, type = "n",
     xlim = c(date.range[1], date.range[length(date.range)]))
points(y3.dates, y3, col="magenta", pch = 20)
lines(y3.dates, y3, col="magenta", lwd=2, pch = 20)

Here’s the final product!

1) It can take some time to load the Google Map with sampling locations. I’m guessing it has to do with calling the Google Map server, but not totally sure.

2) The map doesn’t display all the sites, I think it just displays the first 400. This is something that I’m working on. Maybe someone out there has some suggestions???

3) I’m still adding functionality, so right now, all it does is display the sampling locations. I hope to add data visualization and download capabilities soon…

Anyway, enjoy!

It’s been a busy fall season for me, so I haven’t updated the blog as much as I was hoping. But, the good news is I’ve been using R quite a bit in my research, and have continued playing around with the Shiny package. Recently, I’ve been working on an app that’s (arguably) more useful than my first Shiny app, and has some potential for people interested in water quality data (like myself!) I’d like to extend a big thank you again to Rstudio for creating Shiny and specifically starting a free beta program for hosting Shiny apps on their servers. I don’t know how long this beta program will last, but it has worked great so far. Keep up the good work. Anyway, on to the app…

For those who have ever attempted to examine water quality data collected by the United States Geological Survey (USGS) or the Environmental Protection Agency (EPA), it’s not the most intuitive website, to say the least. Though the USGS and the EPA have begun to address this (their Water Quality Portal is actually pretty slick), it’s still quite laborious to quickly and efficiently parse and display the data, particularly if you’re interested in looking at many sites. Even with the improved website, you still need to download and analyze the data in separate steps, and just organizing all the data from one station can be quite time consuming in a spreadsheet program.

So, I decided to do something about it. The USGS and EPA have implemented a nice web service for querying their data through the web (details can be found here). I wanted to create a simple app where you enter a US state and a water quality parameter of interest. You would then be presented with a map of all the sampling locations within the state where the parameter of interest has been measured. Finally, you would be able to view the data and download it as a .csv or .txt file for offline analysis. I haven’t had time to add the plotting or download features, but I hope to have them up soon.

As in my previous post, I will briefly go through the scripts, describing some key points and the general layout. This is just part 1, so the app is still a work in progress. Let me know what you think!

First, the entire HydroApp script can be seen here. And, if you’d like to try it out, I’ve hosted it on the Rstudio server here!

As you can see, the server side consists of 3 reactive functions (makeURL(), testURL(), and stateStationsInput()). Essentially, the inputs from the State and Parameter selection boxes in the ui.R script are combined with the base Water Quality Portal URL to download a .csv file that contains all the stations that have the desired water quality parameter. The makeURL() function handles the correct formatting of the URL so we can download the data:

makeURL <- reactive({
  inStateID <- input$stateID
  inStateID.url <- URLencode(paste("US", inStateID, sep=":"), reserved = T)
  inParmID <- input$parmID
  inParmID.url <- urlEncode(inParmID)
  url <- paste("http://www.waterqualitydata.us/Station/search?countrycode=US",
               "&statecode=", inStateID.url,
               "&characteristicName=", inParmID.url,
               "&siteType=Stream&mimeType=csv&zip=no", sep="")
  print(url)
  return(url)
})

You will notice that both the state ID (inStateID) and parameter ID (inParmID) need to be appropriately encoded to work within a URL. To do this, I used a function called urlEncode() to convert certain characters into a % plus a two-digit hexadecimal representation, which can be read by a web browser:

urlEncode <- function(URL) {
  # requires the gsubfn package
  gsubfn(".", list("_" = "%5F", "." = "%2E", "+" = "%2B", "!" = "%21",
                   "*" = "%2A", "'" = "%27"), c(URL))
}
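As an aside, base R’s `URLencode()` (already used for the state code above) can percent-encode reserved characters itself when called with `reserved = TRUE`, which may cover many parameter names; a quick sketch (the parameter name below is just an illustrative example):

```r
# utils::URLencode() percent-encodes spaces and reserved characters
# (e.g. "+", parentheses) when reserved = TRUE.
URLencode("Inorganic nitrogen (nitrate + nitrite)", reserved = TRUE)
```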

Once the URL is properly encoded, I then test the URL to see if any sites match the criteria. This is done with testURL():

testURL <- reactive({
  url <- makeURL()
  urlHeaders <- HEAD(url)$header
  warnings <- which(names(urlHeaders) %in% "warning")
  if(!length(warnings)){
    noData = FALSE
    summary <- cbind(paste("Total Site Count: ",
                           urlHeaders$'total-site-count'))
    print(summary)
    print(noData)
  }
  if(length(warnings) > 0){
    noData = TRUE
    summary <- paste("WARNING :", unlist(urlHeaders)[warnings])
    print(summary)
  }
  return(list(noData, summary))
})

Using the httr library, I read in the header information for the website using HEAD() and check to see if there are any warnings; specifically, whether no stations match both of the input values (ie. state and water quality parameter). The header information also tells us how many sites match the criteria, something I print out later for the user to see.

Finally, assuming there are no warnings, I use the URL to download the requested data, and from the downloaded .csv file I create a new data frame with two columns, stationLatLong and stationTip. The stationLatLong column is needed by googleVis to correctly display the location of the data. The stationTip column is used for the descriptive text that pops up whenever you click on a station location.

stateStationsInput <- reactive({
  if(testURL()[[1]] == F){
    url <- makeURL()
    stateStations = read.csv(url, colClasses = "character", sep =",",
                             na.strings = "", fill = F,
                             allowEscapes = F, header = T)
    stationLatLong <- paste(stateStations$LatitudeMeasure,
                            stateStations$LongitudeMeasure, sep=":")
    stationTip <- paste(stateStations$MonitoringLocationName,
                        stateStations$MonitoringLocationIdentifier,
                        sep="<BR>")
    data.frame(stationLatLong, stationTip)
  }
})

Finally, for the one output (right now), I use the awesome gvisMap() function in the googleVis package to display a Google map with all the stations that match the desired criteria (I think only the first 400 stations are displayed… I’m working on figuring out a way to see more).

output$stationMap <- renderGvis({
  if(testURL()[[1]] == F){
    mapData <- stateStationsInput()
    gvisMap(mapData, "stationLatLong", "stationTip",
            options = list(showTip = TRUE, showLine = FALSE,
                           enableScrollWheel = TRUE, mapType = 'terrain',
                           useMapTypeControl = TRUE))
  }
})

Being a big fan of Ichiro Suzuki (who just surpassed 4,000 combined hits between the MLB and Japan!), I decided to make an app that allows a user to compare the cumulative hit trajectory of a player (past or present) against the hit trajectory for the 28 players that have more than 3,000 hits in the major leagues. Owing to the top-notch documentation of Shiny, I was able to put the app together in only a few hours.

This post is broken down into 3 parts: 1) getting the data together (which involves a little web scraping, so see my previous post about it if you’re interested in doing some yourself), 2) writing the Shiny ui.R file, and 3) writing the Shiny server.R file. As usual, I’m not trying to reinvent the wheel here, so I highly recommend reading the Shiny tutorial before you start, since I won’t be covering too much of the basics of Shiny. The tutorial was about all the introduction I needed to start building my app.

**Part 1. Getting the data…**

So I did a little googling and found that there are 28 players who have more than 3,000 hits in the MLB. Being lazy, I wrote a little script to scrape the names of these players off the Baseball-reference.com webpage that lists the all-time hits leaders in descending order:

### Scrape 3000 hit club from www.baseball-reference.com
b <- readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]", "", members)

### Get members' first and last names to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[2])

What I’ve done is download the HTML code and put it in a format that is easy for R to read. Then, using Firebug for Firefox, I was able to locate the HTML/XML path to the names on the all-time hits list. Finally, I extracted the player names from the HTML code, cleaned them up, and saved them as a vector to be used later on (notice I only extracted the first 28 players on the list, since these are the players with >= 3,000 hits).
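The cleanup and name-splitting steps can be checked on a couple of hand-typed names (the site appends a trailing "+" to some names, which is what the gsub() strips):

```r
# Two hand-typed names standing in for the scraped vector
members <- c("Pete Rose+", "Ty Cobb+")
members <- gsub("[+]", "", members)   # strip the trailing "+" markers

# Split each name on whitespace into first/last components
memberFirst <- sapply(strsplit(members, "[[:space:]]"), function(x) x[1])
memberLast  <- sapply(strsplit(members, "[[:space:]]"), function(x) x[2])
memberFirst
# "Pete" "Ty"
memberLast
# "Rose" "Cobb"
```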

Next, I needed to find the actual hit data by year so I can cumulatively sum it. Of course, I could have manually entered the hit data into a spreadsheet, saved it as a .txt, and read it back into R, but what fun is that??? So, I did some more googling and found this amazing baseball statistics website, created by Sean Lahman, where you can download insane amounts of data as .csv files or even as database files. Since I’m only interested in batting stats (hits), I only need to use the Batting.csv file (which contains the batting stats) and the Master.csv file (which contains both the player names and playerIDs, which are needed to sort through the mountains of data):

setwd("C:/LocationOfYourDataHere")
master <- read.csv("Master.csv", header = TRUE, sep = ",", quote = "\"",
                   dec = ".", fill = TRUE, comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",", quote = "\"",
                    dec = ".", fill = TRUE, comment.char = "")

### extract playerIDs from Master.csv and
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master, as.character(master$nameLast) == memberLast[[i]] &
                              as.character(master$nameFirst) == memberFirst[[i]])
  if(nrow(masterSub) > 1){
    masterSub = masterSub[1, ]
  }
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

What I did above was to use the players’ first and last names to extract the playerID out of the Master.csv file, then use the playerID to extract the hitting data from Batting.csv (I plan on cleaning this up some day, but for now, I just wanted to get it to work).
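One way the lookup loop could eventually be cleaned up is with match() on a combined name key. A sketch on toy stand-ins for the Master and Batting tables (the hit totals here are made up, not real stats):

```r
# Toy stand-ins for Master.csv and Batting.csv (numbers are made up)
master  <- data.frame(playerID  = c("cobbty01", "rosepe01"),
                      nameFirst = c("Ty", "Pete"),
                      nameLast  = c("Cobb", "Rose"),
                      stringsAsFactors = FALSE)
batting <- data.frame(playerID = c("cobbty01", "cobbty01", "rosepe01"),
                      H        = c(225, 216, 170),
                      stringsAsFactors = FALSE)

# Vectorized ID lookup: first match wins, like the masterSub[1, ] guard above
id   <- master$playerID[match("Ty Cobb",
                              paste(master$nameFirst, master$nameLast))]
hits <- batting$H[batting$playerID == id]
hits
# 225 216
```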

Now that I have all the hit data for the 28 players, I can sum each season cumulatively so I can plot the data nicely:

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))
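cumsum() does exactly what the plot needs, turning per-season totals into a running career total. A quick check with made-up season hit counts:

```r
# Hypothetical per-season hit totals for one player
seasonHits <- c(233, 167, 175)
careerHits <- cumsum(seasonHits)   # running total across seasons
careerHits
# 233 400 575
```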

Phew! That’s actually the hard part. Now it’s just a little copying and pasting and some referencing of the tutorial, and we have our first web app.

**Part 2. Creating the Shiny ui.R file**

Now for the fun part, designing your web app. I wanted to keep my first app simple, so I decided to have a sidebar with only two ways of sending data to R, and have one graph as the main output. The first way a user can interact with the app is to highlight the hit trajectory of a particular member of the 3,000 hit club. To do this, I made a drop down list containing the names of all 28 players. When a name is selected, the hit trajectory is highlighted on the graph. This was accomplished using the selectInput() function.

The other bit of interaction the user can do is plot the hit trajectory of ANY player, past or present, simply by entering in the name of the player and letting R look up the player in the master and batting data frames and plot the data on the graph. This was accomplished using the textInput() function.

Finally, to display the graph, we tell shiny to plot the hit trajectories in the main panel (mainPanel()) by using the plotOutput() function.

library(shiny)

### Define UI for the 3000 Hit Club application
shinyUI(pageWithSidebar(

  # Application title
  headerPanel("The 3000 Hit Club"),

  # Sidebar with controls to select a member of the 3,000 hit club
  # and input a non-member and plot their hit trajectory
  sidebarPanel(

    ### Dropdown menu to select a member of the 3,000 hit club to
    ### highlight on the plot
    selectInput("member", "Member of 3000 hit Club:",
                list("Pete Rose" = "rosepe01",
                     "Ty Cobb" = "cobbty01",
                     "Hank Aaron" = "aaronha01",
                     "Stan Musial" = "musiast01",
                     "Tris Speaker" = "speaktr01",
                     "Cap Anson" = "ansonca01",
                     "Honus Wagner" = "wagneho01",
                     "Carl Yastrzemski" = "yastrca01",
                     "Paul Molitor" = "molitpa01",
                     "Eddie Collins" = "collied01",
                     "Derek Jeter" = "jeterde01",
                     "Willie Mays" = "mayswi01",
                     "Eddie Murray" = "murraed02",
                     "Nap Lajoie" = "lajoina01",
                     "Cal Ripken" = "ripkeca01",
                     "George Brett" = "brettge01",
                     "Paul Waner" = "wanerpa01",
                     "Robin Yount" = "yountro01",
                     "Tony Gwynn" = "gwynnto01",
                     "Dave Winfield" = "winfida01",
                     "Craig Biggio" = "biggicr01",
                     "Rickey Henderson" = "henderi01",
                     "Rod Carew" = "carewro01",
                     "Lou Brock" = "brocklo01",
                     "Rafael Palmeiro" = "palmera01",
                     "Wade Boggs" = "boggswa01",
                     "Al Kaline" = "kalinal01",
                     "Roberto Clemente" = "clemero01")),

    # Text input to select a non-3000 hit member and plot their hit trajectory
    textInput("player", "Player Name:", value = ""),

    # Button to update plot output
    submitButton("Update View")
  ),

  # Show the output plot of the hit trajectory
  mainPanel(
    #tableOutput("view"),
    plotOutput("cumsumPlot")
  )
))

**Part 3. server.R file**

The server.R file is the file that does the heavy lifting behind the scenes. It contains the R scripts that take user inputs (called input variables), do the data manipulation, then spit out the results (called output variables) to ui.R, which then displays the outputs of server.R in your web browser.

If you look closely, you will see that I included the script from Part 1 at the beginning of the server.R file. This code is outside the shinyServer() function, so it is run ONCE when the app is loaded into the browser, after which all of the data frames, matrices, vectors, etc. can be used by shinyServer().

After the data is loaded into R, we run the shinyServer() function, which contains reactive functions. Reactive functions are run any time a user changes one of the input variables. You will see there is a reactive function called currentMemberHits(), which simply selects the desired 3,000 hit club member for plotting, and another reactive function called currentPlayerHits(), which gets the non-member’s hit data from the master and batting data frames and calculates the cumulative hits trajectory. Finally, there is a reactive function called renderPlot(), which is run whenever currentMemberHits() or currentPlayerHits() changes. renderPlot() just wraps normal R plotting functions and sends the plot back to ui.R to display in the web browser.

library(shiny)
library(XML)

### OVERHEAD
### Scrape 3000 hit club from www.baseball-reference.com
b <- readLines("http://www.baseball-reference.com/leaders/H_career.shtml")
bdoc <- htmlParse(b, asText = T)
pathResult <- getNodeSet(bdoc, path = "//td/a")[1:28]
members <- unlist(lapply(pathResult, function(x) c(xmlValue(x))))
members <- gsub("[+]", "", members)

### Get members' first and last names to match with Master.csv playerID
memberFirst <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[1])
memberLast <- lapply(strsplit(members, split = "[[:space:]]"), function(x) x[2])

### Read in local files downloaded from...
setwd("C:/chitchat/data")
master <- read.csv("Master.csv", header = TRUE, sep = ",", quote = "\"",
                   dec = ".", fill = TRUE, comment.char = "")
batting <- read.csv("Batting.csv", header = TRUE, sep = ",", quote = "\"",
                    dec = ".", fill = TRUE, comment.char = "")

### extract playerIDs from Master.csv and
### extract hits and other batting data from Batting.csv
memberId <- vector()
battingMember <- list()
hitsMember <- list()
for(i in 1:length(memberLast)){
  masterSub <- subset(master, as.character(master$nameLast) == memberLast[[i]] &
                              as.character(master$nameFirst) == memberFirst[[i]])
  if(nrow(masterSub) > 1){
    masterSub = masterSub[1, ]
  }
  memberId[i] <- as.character(masterSub$playerID)
  battingMember[[i]] <- batting[as.character(batting$playerID) == memberId[i], ]
  hitsMember[[i]] <- battingMember[[i]]$H
}

### Calculate cumulative summation of hits for all members
mHitsCumSum <- lapply(hitsMember, function(x) cumsum(x))

### For plotting
maxYears <- max(unlist(lapply(hitsMember, function(x) length(x))))
maxHits <- max(unlist(lapply(mHitsCumSum, function(x) max(x))))

### Define server logic required to plot various players against the 3000 hit club
shinyServer(function(input, output) {

  ### get hits for chosen 3000 club member
  currentMemberHits <- reactive({
    ### Calculate cumulative summation of hits
    cumsum(hitsMember[[match(as.character(input$member), memberId)]])
  })

  ### get hits for non-3000 club player
  currentPlayerHits <- reactive({
    playerFirst <- lapply(strsplit(input$player, split = "[[:space:]]"),
                          function(x) x[1])
    playerLast <- lapply(strsplit(input$player, split = "[[:space:]]"),
                         function(x) x[2])
    ### extract hits and other batting data from Batting table
    masterPlayer <- list()
    playerId <- vector()
    battingPlayer <- list()
    hitsPlayer <- list()
    for(i in 1:length(playerLast)){
      masterSub <- subset(master, as.character(master$nameLast) == playerLast &
                                  as.character(master$nameFirst) == playerFirst)
      if(nrow(masterSub) > 1){
        masterSub = masterSub[1, ]
      }
      playerId <- as.character(masterSub$playerID)
      battingPlayer <- batting[as.character(batting$playerID) == playerId, ]
      hitsPlayer <- battingPlayer$H
    }
    ### Calculate cumulative summation of hits for non-member
    cumsum(hitsPlayer)
  })

  ### Output table comparing currentMemberHits() and currentPlayerHits()
  ### NOT IMPLEMENTED!
  output$view <- renderTable({
    data.frame("X" = currentMemberHits())#, "Y" = currentPlayerHits())
  })

  ### Output xy-scatterplot
  output$cumsumPlot <- renderPlot({
    plot(seq(1, maxYears, 1), rep(0, maxYears), type = "n", lwd = 2,
         xlim = c(0, maxYears), ylim = c(0, maxHits),
         xlab = "Year", ylab = "Hits")
    segments(x0 = -100, x1 = 1000, y0 = 3000, y1 = 3000,
             lty = 2, lwd = 2, col = "black")
    for(i in 1:length(mHitsCumSum)){
      lines(seq(1, length(mHitsCumSum[[i]]), 1), mHitsCumSum[[i]],
            lwd = 2, col = "grey70")
    }
    lines(seq(1, length(currentMemberHits()), 1), currentMemberHits(),
          lwd = 2, col = "magenta")
    lines(seq(1, length(currentPlayerHits()), 1), currentPlayerHits(),
          lwd = 2, col = "blue")
  })
})

That’s it. Now, to run the app, we need to put both ui.R and server.R into a directory; then, in an R session, we can run the following code to run the app LOCALLY:

library(shiny)
runApp("C:/LocationOfYourShinyAppDirectory")

I was really impressed with how professional the app looks. This is hopefully only the beginning of a really promising package. I also plan on getting these scripts onto GitHub soon. When I do, I’ll add a link. Cheers!

UPDATE: I’ve added a new tab at the top of the page which includes the entire script for this app (3000 Hit Analyzer). It was a lot easier to add a new tab at the top of this website than to create a GitHub repository to share the scripts!

First, before I begin: the data I’m going to be using are from a race management company located in the northeast called F.I.R.M. I recently completed a race organized by them (so I’m part of the data I will be using for this demo!). The data is accessible by anyone with an internet connection (not just participants), so I felt it was OK to use race result data for this post.

Secondly, the method I will be discussing below is used to access data embedded in an HTML table from a particular website, and therefore it’s website-specific. If you want to repeat what I show you for another website, you will need to 1) be able to view their HTML code, 2) figure out where in the HTML code the data of interest is stored, and 3) have a basic understanding of HTML and HTML/XML structure so you can use the correct path (more on this later) to tell R where the data is located. (1) isn’t too hard to do. I use the Firefox browser, which has a really robust and useful plug-in called Firebug that lets you view the HTML code of any website you visit. (2) and (3) are more tricky and will require some time studying HTML, but there are some good websites out there to help you learn the basics of HTML and XPath/XML path syntax.

So, first we need to know the URL of the website with the data we want to scrape. We can use the base `readLines()` function to read in the HTML code (which includes the data we are interested in). This creates a character vector of the HTML code. Unfortunately, this isn’t very useful to us on its own, since it would be VERY difficult to extract the data we are interested in directly. So, we create an XMLInternalDocument object using `htmlParse()` from the XML library to more easily access the XML node that has our data:

library(XML)
b <- readLines("http://www.firm-racing.com/result_report.asp?RID=792&type=1")
bdoc <- htmlParse(b, asText = T)

Now that we have an XMLInternalDocument, we access the XML node that has our data using `getNodeSet()` and retrieve the raw data using `xmlValue()`:

result.table <- getNodeSet(bdoc, path = "//table/td/div")
racer.rslt <- matrix(unlist(lapply(result.table, function(x) c(xmlValue(x)))),
                     ncol = 16, byrow = T)

You can see there is an argument "path" in `getNodeSet()`, which tells R where to look for our desired data in the XML document. Defining the correct path requires knowledge of XPath syntax that’s not going to be covered here, but using Firebug and some trial and error, I was able to narrow down the location of the data itself to "//table/td/div" fairly quickly. You can see that `getNodeSet()` returns a list of XML nodes that R found when it followed the "path" we defined in the function call. We then use `xmlValue()` in an `lapply()` loop to extract the actual values in the nodes.
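The extraction pattern can be tested offline on a scrap of hand-written HTML, so no request to the race site is needed. This sketch assumes the XML package is installed; for the stub the shorter path "//td/div" is enough, since it matches any `div` inside a `td` regardless of intermediate elements:

```r
library(XML)

# Hand-written HTML stub mimicking a results table (made-up values)
html <- "<table><tr><td><div>9:59</div></td><td><div>1st</div></td></tr></table>"
doc  <- htmlParse(html, asText = TRUE)

# Extract the text value of every div nested inside a td
vals <- sapply(getNodeSet(doc, path = "//td/div"), xmlValue)
vals
# "9:59" "1st"
```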

`racer.rslt` is our desired matrix, which contains the data we want (race time, age group, finishing order, etc.). Now we can convert the matrix to a data frame, add headers (obtained from the website), and start analyzing the data!

result.df <- as.data.frame(racer.rslt, stringsAsFactors = FALSE)
header <- c("bib", "category", "swim_cat", "swim_ov", "swim_time", "TT1",
            "bike_cat", "bike_ov", "bike_time", "TT2", "run_cat", "run_ov",
            "run_time", "overall_cat", "overall_ov", "overall_time")
names(result.df) <- header
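As a first analysis step, you would likely want the time strings as numbers. Here is a small hypothetical helper (not part of the original scrape, and it assumes every time comes formatted as hours:minutes:seconds):

```r
# Hypothetical helper: convert "H:MM:SS" strings to total seconds
to_seconds <- function(x) {
  sapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

secs <- to_seconds(c("1:02:03", "0:59:30"))
secs
# 3723 3570
```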

For my next post I’m going to do some analyzing of the data. Stay tuned!
