Monthly Archives: March 2014

Creating DEMs from contour lines using R

UPDATE: I made a slight change to the script because I was getting an error on Windows but not on Linux. Previously, the script read:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], 
                        xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], 
                        ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"),
                        resolution = c(12,12))

But I found that the resolution = c(12,12) argument was throwing an error on my Windows machine. Fortunately, there is an easy fix: just remove the argument and add the line res(contourRaster) <- 12 afterwards:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], 
                        xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], 
                        ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"))
res(contourRaster) <- 12

Everything should work smoothly now!

Quick Note: A few years ago I did a post on using R for spatial analyses. I recently thought I’d return to the topic, mainly to show how versatile R is and how many of the things you can do in dedicated GIS software (which can cost thousands of dollars to license!) can be re-created easily and quickly using R.

DEMs, or digital elevation models, are one of the most common types of data used in spatial analysis. DEMs are raster files that contain elevation information for each cell in the raster. I was recently given a shapefile of contour lines for a particular area and needed to create a DEM from them. As such, I thought I'd demonstrate how I created a DEM from contour lines stored in a vector shapefile (.shp), one of the most common GIS file types.

I won’t go into all the differences between rasters and vectors (ArcGIS has some good online resources in case you need to brush up) but the basic steps are:

1. Read in the files to R.
2. Create a raster “grid” to interpolate the elevation data onto.
3. Convert the contour lines to points so you can interpolate between elevation points.
4. Interpolate the elevation data onto the raster grid.
5. Crop the DEM raster to the boundary of the area of interest.

Before I begin: I obtained my data from the MassGIS web portal, which is a really nice web-based GIS for querying all types of spatial data within the Commonwealth of Massachusetts. Since I'm interested in elevation data only, I looked for a sub-basin (i.e. watershed) with a nice topographic range and shape. I then downloaded the watershed boundaries, and the 3m contour lines covering the watershed, as separate shapefiles.

Step 1. This is quite simple, since there are some really nice R packages for dealing with spatial data. To read in our data, we will use the maptools library. You'll notice that readShapeLines() and readShapePoly() have a proj4string argument. This is where you tell R what the projection and datum are for your data. MassGIS actually asks which projection you'd like your data in, so I knew ahead of time which EPSG code I needed to input:

library(sp)
library(maptools)
library(rgdal)
library(raster)
library(rgeos)

#Read in shape files
setwd("C:/LocationOfYourContourData")
contour3m <- readShapeLines("GISDATA_CONTOURS5K_ARC",
                            proj4string = CRS("+init=epsg:26918"))

setwd("C:/LocationOfYourBoundaryData")
ws_bound <- readShapePoly("GISDATA_SUBBASINS_POLY",
                           proj4string = CRS("+init=epsg:26918"))

Let's plot up the files to see what we are working with. I'm interested in the watershed in the center, but you can see that the downloaded data include adjacent watersheds. Also, the contours extend beyond the watershed of interest. We'll deal with this later when we crop and clean up our data:

# plot of ws_bound with watershed of interest highlighted
spplot(ws_bound, zcol = "SUB_ID", col = "black", colorkey = F,
       fill = "grey80", 
       panel = function(x, y, ...){
         panel.polygonsplot(x,y, ...)
         sp.polygons(ws_bound[as.character(ws_bound$SUB_ID) == 1007, ],
                    fill = "magenta")
       })

# same plot as above, but include contour3m lines
spplot(ws_bound, zcol = "SUB_ID", col = "black", colorkey = F,
       fill = "grey80", 
       panel = function(x, y, ...){
         panel.polygonsplot(x,y, ...)
         sp.polygons(ws_bound[as.character(ws_bound$SUB_ID) == 1007, ],
                     fill = "magenta")
         sp.lines(contour3m, col = "black")
       })
Watershed boundaries downloaded from MassGIS. The watershed of interest is highlighted in magenta.

Added 3m contour lines to the plot to show the extent of the contour lines outside the watershed of interest.

Step 2. I need to create an empty raster grid that spans the range of my sub-basin. Since I’m using the contour lines to create the DEM, I used them as the basis for my raster grid. Notice that I manually set the resolution (size) of each cell in the raster to be square (12m x 12m). This is required if you ever want to import your data into ArcGIS (as far as I’m aware). I also set the resolution so that I didn’t have too many cells. The resolution should depend on your needs and computing power:

# Create raster grid where we are going to interpolate our
# elevation data.
contourBBox <- bbox(contour3m)
contourRaster <- raster(xmn=contourBBox[1,1], 
                        xmx=ceiling(contourBBox[1,2]),
                        ymn=contourBBox[2,1], 
                        ymx=ceiling(contourBBox[2,2]),
                        crs = CRS("+init=epsg:26918"),
                        resolution = c(12,12))

Step 3. Since interpolation methods were conceived to work with point data, we need to convert the elevation contour lines to elevation points. Fortunately, this is quite easy to do in R using as(). Essentially, we are creating points along each contour line, with each point taking the elevation of its parent contour line as its value:

# Create points from contour lines so we can
# interpolate between points.
p <- as(contour3m, 'SpatialPointsDataFrame')

Step 4. Now we have points from which we can interpolate an elevation raster. There are probably many different interpolation packages out there, but I'm going to use gstat, since it seems pretty robust for spatial statistics, and there are nice examples of using the gstat and raster packages together to interpolate point data onto rasters. We will use gstat() to create a model for interpolating our point elevation data using inverse distance weighting. Then, we use interpolate() to predict the elevation value at each cell in contourRaster. Note that this step can take some time, depending on your computing power:

#interpolate between points using IDW into contourRaster
library(gstat)
g <- gstat(id="ELEV_M", formula = ELEV_M~1, data=p,
           nmax=7, set=list(idp = .5))
interpDEM <- interpolate(contourRaster, g)

Step 5. Finally, we can clean up the data by clipping our DEM to the boundaries of the watershed. We can also clip our contour lines to the boundaries of our watershed:

# extract the watershed boundary of interest
HudsonWS_bound <- ws_bound[as.character(ws_bound$SUB_ID) == 1007, ]
# clip raster to watershed boundaries
interpDEM_mask <- mask(interpDEM, HudsonWS_bound)
# clip contours to watershed boundary
HudsonWS_contour <- crop(contour3m,HudsonWS_bound)
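
Since the masked DEM is a regular raster layer at this point, it can also be written out to disk, which is handy if you want to bring it into ArcGIS as mentioned earlier. A minimal sketch using writeRaster() from the raster package (the filename here is just an example):

writeRaster(interpDEM_mask, filename = "HudsonWS_DEM.tif",
            format = "GTiff", overwrite = TRUE)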

That wasn’t too bad. We can plot it up nicely using spplot() like I did in a previous post:

# subset 3m contours for nice plotting
HudsonWS_contourSub <- HudsonWS_contour[(HudsonWS_contour$ELEV_M) %in% 
                                          seq(354, 993, 30), ]
# Plot it up!
spplot(interpDEM_mask, col.regions = heat.colors(100),
       xlim=c(bbox(interpDEM_mask)[1,1] + 900,
              bbox(interpDEM_mask)[1,2] - 900),
       ylim=c(bbox(interpDEM_mask)[2,1] + 1150,
              bbox(interpDEM_mask)[2,2] - 900),
       colorkey = list(space = "top", width = 1), main = "Elev. (m)",
       panel = function(x, y, ...){
         panel.gridplot(x,y, ...)
         sp.lines(HudsonWS_contourSub, col = "black")
       })
Final DEM with 30m contour lines.

Not too shabby! Anyway, keep an eye on Chit Chat R for more spatial analysis in the near future…

Fitting a (non-linear) curve to data points

Anyone who has taken a high school math class is probably familiar with fitting a straight line to data points:

1. graph the points
2. draw a best fit straight line through the points
3. calculate the slope, m = (y2 – y1)/(x2 – x1), where (x1, y1) and (x2, y2) are two different points on the best fit line
4. Continue the best fit line through the y-axis to determine b, the y-intercept.

Voila! You have an equation for a straight line in the form y = mx + b! If you're like me and lazy, you can use R to calculate the two parameters (m and b) using the lm() function along with the summary() function.
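
In R, those four steps collapse to one line. A minimal sketch with some made-up points (the numbers are invented purely for illustration):

# Fit a straight line y = mx + b to a handful of made-up points
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)
fit <- lm(y ~ x)
coef(fit)      # "(Intercept)" is b; "x" is the slope m
summary(fit)   # adds standard errors, R-squared, etc.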

The above steps are all well and good if your data approximate a straight line, but what if your data are non-linear (i.e. curved) and you want to fit a curve to them?

Well, not surprisingly, R has you covered. It has a nice built-in function called nls() that can fit a non-linear equation to your data. To use this function, all you need to do is decide on a model to fit to your data. This may seem simple, but choosing the correct model can be quite important. Typically in the natural sciences, we try to choose models that make sense physically (i.e. the parameters of the model can be related back to real-world phenomena) or a model that is well accepted in a particular area of research. Nonetheless, just keep in mind that model selection is a critical step.

With that out of the way, I've decided to demonstrate nls() using some real world data, namely daily maximum temperature data (TMAX) taken from a great source of free climate data, NOAA. They have a really nice user interface for searching all different types of climate data.

On the National Climate Data Center website, I was quickly able to locate daily MAX and MIN temperature data from Logan International Airport in Boston, MA. Within a few hours of submitting my data request, I received an email telling me my data was ready to be downloaded. Let us begin!

The email had a link to a .dat file, so I used R to grab the data from the web and load it into a data frame. (I don't know how long NOAA data links stay active, so I suggest requesting your own data if you are following along with my script.) We are just interested in the date and TMAX columns, so we extract those:

library(chron)
url <- paste("http://www1.ncdc.noaa.gov/pub/orders/cdo/294541.dat", sep="")
dat <- read.table(url, sep = "", skip = 2, header = F)
bostonTemp <- data.frame(dat[c("V11", "V13")])
headers <- read.table(url, colClasses = "character", sep = "",
                      nrow = 1, header = F)
names(bostonTemp) <- c(headers[c("V6", "V8")])

Now that we have the data in a usable form, we can do some cleaning up. First, the temperature is given in tenths of a degree Celsius (I have no idea why…), so we can easily convert it into more familiar values by dividing by 10. Next, to make plotting easier, we can convert the dates into a time format that R likes, using as.chron() from the "chron" package. Finally, we can plot the data to see what it looks like:

bostonTemp$TMAX <- bostonTemp$TMAX/10
bostonTemp$DATE <- as.chron(paste(substr(bostonTemp$DATE, 5, 6),
                                  substr(bostonTemp$DATE, 7, 8),
                                  substr(bostonTemp$DATE, 1, 4), sep = "/"))

par(font=2, font.axis = 2, tck = 0.01)
plot(bostonTemp$DATE, bostonTemp$TMAX)
2000-2014 TMAX data from Logan International Airport.

As you can see, I downloaded data spanning 1/1/2000 through the present. Clearly the temperature fluctuates in a repeating fashion: cold in the winter and hot in the summer (obviously). Let's look at the most recent year with a full year of data, 2013:

bostonTemp2013 <- subset(bostonTemp, 
                         bostonTemp$DATE >= as.chron("01/01/2013") & 
                         bostonTemp$DATE <= as.chron("12/31/2013"))
plot(bostonTemp2013$DATE, bostonTemp2013$TMAX) 
2013 TMAX data from Logan International Airport.

It appears that the TMAX data have a convex upward shape, with the highest values of TMAX in the middle of the year and the lowest values at the beginning and the end. You could say that the data has the shape of a parabola! Hmm, let’s fit it.

Knowing the vertex form of the parabola equation, y = a*(x – h)^2 + k, we can use this as our model. We also need to supply nls() with a starting value for each of the unknown parameters we are going to estimate (a, h and k). Knowing the physical meaning behind each of the model parameters (see this website for a quick refresher), we can guess that h is approximately the date where TMAX is greatest, k is the maximum TMAX, and a is some negative value (since we want the parabola to open downwards), let's say -1. Finally, we need to convert bostonTemp2013$DATE to a numeric format using as.numeric() so that we can fit an equation to the data:

x <- as.numeric(bostonTemp2013$DATE)
y <- bostonTemp2013$TMAX
h1 <- mean(x)
k1 <- max(y)
a1 <- -1

yfit<- nls(y ~ a*(x-h)^2 + k, start = list(a = a1, h = h1, k = k1))
yfitParm <- coef(yfit)  # extract the fitted parameter estimates

If we look at the parameter estimates nls() gave us, they weren’t too far off from our initial guesses!

> cbind(h1, k1, a1)
        h1   k1 a1
[1,] 15888 37.2  -1
> yfitParm
            a             h             k 
-8.329497e-04  1.590282e+04  2.480478e+01 

Now let's see how nls() did when fitting our data with a parabola:

ymod <- yfitParm[1]*(x-yfitParm[2])^2 + yfitParm[3]

par(font=2, font.axis = 2, tck = 0.01)
plot(bostonTemp2013$DATE, bostonTemp2013$TMAX)
lines(x, ymod)
2013 TMAX data from Logan International Airport fitted with a parabola.

Not too bad! There may be more appropriate models out there for fitting temperature data, but this model seems to do the trick. Till next time…
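
If you want a number to go with "not too bad", one rough check is to compare the residual sum of squares to the total sum of squares, a pseudo-R-squared (keep in mind there is no universally agreed R-squared for non-linear fits, so treat it as a rough guide only):

# Rough goodness-of-fit check for the nls() parabola fit
rss <- sum(residuals(yfit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
1 - rss / tss                   # fraction of variance captured by the fit
summary(yfit)$sigma             # residual standard error, in deg. C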