This post is going to be on web scraping, a technique that I’ve only used a few times, but which can be very useful. It’s essentially the process of accessing the underlying HTML code or other “behind the scenes” data from a website. From reading the Wikipedia page on web scraping, you should be aware that not all webmasters will be happy about someone “scraping” their data (and in some cases it may be illegal), so proceed at your own risk!
First, before I begin: the data I’m going to be using are from a race management company in the Northeast called F.I.R.M. I recently completed a race organized by them (so I’m part of the data I’ll be using for this demo!). The data are accessible to anyone with an internet connection, not just participants, so I feel it’s OK to use race result data for this post.
Secondly, the method I discuss below is used to access data embedded in an HTML table on a particular website, and is therefore website-specific. If you want to repeat what I show you on another website, you will need to 1) be able to view its HTML code, 2) figure out where in the HTML code the data of interest is stored, and 3) have a basic understanding of HTML and HTML/XML structure so you can use the correct path (more on this later) to tell R where the data are located. (1) isn’t too hard to do. I use the Firefox browser, which has a really robust and useful plug-in called Firebug that lets you view the HTML code of any website you visit. (2) and (3) are trickier and will require some time studying HTML, but there are some good websites out there to help you learn the basics of HTML and XPath/XML path syntax.
So, first we need to know the URL of the page with the data we want to scrape. We can use the base R readLines() function to read in the HTML code (which includes the data we are interested in). This creates a character vector of the HTML code. Unfortunately, that isn’t very useful on its own, since it would be very difficult to extract the data we want from raw text. So we create an XMLInternalDocument object using htmlParse() from the XML package to more easily access the XML node that has our data:
library(XML)
b <- readLines("http://www.firm-racing.com/result_report.asp?RID=792&type=1")
bdoc <- htmlParse(b, asText = TRUE)
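If you want to convince yourself of what each of those calls returned, a couple of quick checks help (a sketch; it assumes the URL is still reachable):

```r
# readLines() gives a plain character vector, one element per line of HTML
class(b)    # "character"
head(b, 3)  # the first few lines of raw HTML

# htmlParse() gives a parsed document object we can query with XPath;
# its class vector includes "XMLInternalDocument"
class(bdoc)
```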
Now that we have an XMLInternalDocument, we access the XML node that has our data using getNodeSet() and retrieve the raw data:

result.table <- getNodeSet(bdoc, path = "//table/td/div")
racer.rslt <- matrix(unlist(lapply(result.table, function(x) c(xmlValue(x)))),
                     ncol = 16, byrow = TRUE)
You can see there is an argument “path” in getNodeSet(), which tells R where to look for our desired data inside the XML document. Defining the correct path requires knowledge of XPath syntax that’s not going to be covered here, but using Firebug and some trial and error, I was able to narrow down the location of the data to “//table/td/div” fairly quickly. getNodeSet() returns a list of the XML nodes R found when it followed the “path” we defined in the function call. We then use xmlValue() in an lapply() loop to extract the actual values from the nodes. racer.rslt is our desired matrix containing the data we want (race time, age group, finishing order, etc.). Now we can convert the matrix to a data frame, add headers (obtained from the website), and start analyzing the data!
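To see how the getNodeSet()/xmlValue() combination works without hitting the live site, here is a tiny self-contained sketch on a made-up HTML snippet. The snippet and the simplified path “//td/div” are my own illustration, not the structure of the F.I.R.M. page:

```r
library(XML)

# A toy HTML fragment: each value sits in a <div> inside a table cell,
# loosely mimicking the structure the post scrapes
toy <- "<html><body><table><tr>
          <td><div>101</div></td>
          <td><div>00:25:43</div></td>
        </tr></table></body></html>"

doc   <- htmlParse(toy, asText = TRUE)
nodes <- getNodeSet(doc, path = "//td/div")  # all <div>s inside table cells
sapply(nodes, xmlValue)
# should return the character vector c("101", "00:25:43")
```

The same pattern scales up: on the real page the path matches hundreds of <div> nodes, and reshaping their values into a 16-column matrix recovers one row per racer.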
result.df <- as.data.frame(racer.rslt, stringsAsFactors = FALSE)
header <- c("bib", "category", "swim_cat", "swim_ov", "swim_time", "TT1",
            "bike_cat", "bike_ov", "bike_time", "TT2",
            "run_cat", "run_ov", "run_time",
            "overall_cat", "overall_ov", "overall_time")
names(result.df) <- header
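One small cleaning step worth doing before any analysis: everything scraped from HTML arrives as character strings. A possible sketch for converting the numeric and time columns, using the column names defined in the header above (the to_seconds() helper is my own illustration, not code from the original analysis):

```r
# Placement/bib columns are really numbers
ov_cols <- c("bib", "swim_ov", "bike_ov", "run_ov", "overall_ov")
result.df[ov_cols] <- lapply(result.df[ov_cols], as.numeric)

# Times like "1:23:45" or "25:43" can be converted to seconds
# by weighting each ":"-separated piece with the right power of 60
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) sum(as.numeric(p) * 60^((length(p) - 1):0)))
}
result.df$overall_sec <- to_seconds(result.df$overall_time)
```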
In my next post I’m going to analyze the data. Stay tuned!