Anyone who has taken a high school math class is probably familiar with fitting a straight line to data points:
1. graph the points
2. draw a best fit straight line through the points
3. calculate the slope, m = (y2 – y1)/(x2 – x1), where (x1, y1) and (x2, y2) are two different points on the best fit line
4. Continue the best fit line through the y-axis to determine b, the y-intercept.
Voila! You have a equation for a straight line in the form y = mx + b! If you’re like me and are lazy, you can use R to calculate the two parameters (m and b), using the
lm() function along with the
The above steps are all well and good if your data approximates a straight line, but what if your data are non-linear (ie. curved) and you want to fit a line to it?
Well, (not surprisingly) R has you covered. R has a nice built in function called
nls(), that can fit a non-linear equation to your data. To use this function, all you need to do is decide on a model to fit to your data. This may seem simple, but choosing the correct model can be quite important. Typically in the natural sciences, we try and choose models that make sense physically (ie. the parameters of the model can be related back to real world phenomena) or a model that is well accepted in a particular area of research. Nonetheless, just keep in mind that model selection is a critical step.
With that out of the way, I’ve decided to demonstrate
nls() using some real world data, namely daily maximum temperature data (TMAX) taken from a great of source free climate data, NOAA. They have a really nice user interface to search for all different types of climate data.
On the National Climate Data Center website, I was quickly able to locate daily MAX and MIN temperature data from Logan International Airport in Boston, MA. Within a few hours of submitting my data request, I received an email telling me my data was ready to be downloaded. Let us begin!
They email had a link to a .dat file to download the data from the web, so I used R to grab the data and load it into a data frame ((I don’t know how long NOAA data links are active, so I suggest requesting your own data if you are following my script…) We are just interested in the date and TMAX data columns so we extract those:
library(chron) url <- paste("http://www1.ncdc.noaa.gov/pub/orders/cdo/294541.dat", sep="") dat <- read.table(url, sep = "", skip = 2, header = F) bostonTemp <- data.frame(dat[c("V11", "V13")]) headers <- read.table(url, colClasses = "character", sep = "", nrow = 1, header = F) names(bostonTemp) <- c(headers[c("V6", "V8")])
Now that we have the data in a usable form, we can do some cleaning up. First, the temperature is given in tenths of a degree Celcius (I have no idea why…) so we can easily convert that into more familiar values by dividing by 10. Next, to make plotting easier, we can convert the dates into a time format that R likes, using the
chron() function in the “chron” package. Finally, we can plot the data to see what it looks like:
bostonTemp$TMAX <- bostonTemp$TMAX/10 bostonTemp$DATE <- as.chron(paste(substr(bostonTemp$DATE, 5,6), substr(bostonTemp$DATE, 7,8), substr(bostonTemp$DATE, 1,4), sep="/")) par(font=2, font.axis = 2, tck = 0.01) plot(bostonTemp$DATE, bostonTemp$TMAX)
As you can see I downloaded data spanning 1/1/2000 through the present. Clearly the temperature flucuates in a repeating fashion, cold in the winter and hot in the summer (obviously). Let’s look at the most recent year where we have a full year of data, 2013:
bostonTemp2013 <- subset(bostonTemp, bostonTemp$DATE >= as.chron("01/01/2013") & bostonTemp$DATE <= as.chron("12/31/2013")) plot(bostonTemp2013$DATE, bostonTemp2013$TMAX)
It appears that the TMAX data have a convex upward shape, with the highest values of TMAX in the middle of the year and the lowest values at the beginning and the end. You could say that the data has the shape of a parabola! Hmm, let’s fit it.
Knowing the vertex form of the parabola equation: y = a*(x – h)^2 + k, we can use this as our model. We also need to supply
nls() with a starting value for each of the unknown parameters that we are going to estimate (a, h and k). Knowing the physical meaning behind each of the model parameters (see this website for a quick refresher), we can guess that h is approximately the date where TMAX is the greatest, and k is the maximum TMAX, and we can assume a is a negative value (since we want the parabola to open downwards), let’s say -1. Finally, we need to convert
bostonTemp2013$DATE to a numeric format using
as.numeric() so that we can fit a equation to the data:
x <- as.numeric(bostonTemp2013$DATE) y <- bostonTemp2013$TMAX h1 <- mean(x) k1 <- max(y) a1 <- -1 yfit<- nls(y ~ a*(x-h)^2 + k, start = list(a = a1, h = h1, k = k1)) yfitParm <- summary(yfit)$para[,1]
If we look at the parameter estimates
nls() gave us, they weren’t too far off from our initial guesses!
> cbind(h1, k1, a1) h1 k1 a1 [1,] 15888 37.2 -1 > yfitParm a h k -8.329497e-04 1.590282e+04 2.480478e+01
Now lets see how
nls() did when fitting our data with a parabola:
ymod <- yfitParm*(x-yfitParm)^2 + yfitParm par(font=2, font.axis = 2, tck = 0.01) plot(bostonTemp2013$DATE, bostonTemp2013$TMAX) lines(x, ymod)
Not too bad! There may be more appropriate models out there for fitting temperature data, but this model seems to do the trick. Till next time…