Introduction

This article walks through the process of how I scraped a website. It relies heavily on the {tidyverse} ecosystem and therefore assumes that you are familiar with data wrangling with {dplyr} and programming with {purrr}.
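
If you want to follow along, the code below assumes that at least the following packages are attached (they also show up in the session info at the end of this post):

## packages used throughout this post ----

library(tidyverse)   # wrangling with {dplyr}, mapping with {purrr}
library(rvest)       # extracting information from HTML
library(RSelenium)   # talking to the browser running in Docker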

Scraping a website means retrieving unstructured information - in this case from a website - and saving it in a structured format (e.g. a data frame). Because some websites are rendered through JavaScript, scraping them is not as straightforward as simply taking bits and pieces out of a static HTML document. In order to scrape a JavaScript-rendered website, you need to perform the following steps:


  1. Are we allowed to scrape?

Before starting to scrape, one should always check the robots.txt file to see whether scraping the website is allowed:

robotstxt::paths_allowed("https://www.meesterbaan.nl")
##  www.meesterbaan.nl
## No encoding supplied: defaulting to UTF-8.
## [1] TRUE
  2. Set up a session

In order to set up a session, we start a Docker machine from the terminal. If you haven't installed Docker yet, check out this tutorial on how to install Docker on your device.

docker-machine start default


After Docker is installed and a Docker machine is running, you start a container that runs a Selenium server with a Chrome browser inside the Docker machine:

docker run -d -p 4445:4444 selenium/standalone-chrome


This Docker container is used to set up the connection with the website via the {RSelenium} package.

## open server via docker ----

remDr <- remoteDriver(
        remoteServerAddr = "192.168.99.100",
        port = 4445L,
        browserName = "chrome"
)

remDr$open()
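
If everything went well, an optional sanity check is to ask the Selenium server for its status before navigating anywhere:

## optional: check that the Selenium server responds ----

remDr$getStatus()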


In this example, we scrape a Dutch website with vacancies for teachers in primary and secondary education in Amsterdam. But before we do this, we need to take a look at the website and see which information we want to capture.



## website url ----

url <- "https://www.meesterbaan.nl/onderwijs/vacatures.aspx?functie=alle%20functies&regio=amsterdam&doelgroep=2&id_sector=-1&filter=&s=Datum"


## direct to website ----

remDr$navigate(url)



  3. Understand the website's structure


After setting up a 'live' connection with the website, the HTML/JavaScript structure of the website needs to be unravelled. For this quest, I used the browser add-in SelectorGadget. With SelectorGadget you can see that the links are nested within the .vacancyText > h2 > a part of the HTML. Additionally, I saved the id code contained in the href to check for unique vacancies. To extract the information, I used the workflow provided by the {tidyverse} ecosystem and especially the {rvest} package.

## scrape hrefs from first page ----

hrefs <- tibble(url = read_html(remDr$getPageSource()[[1]]) %>%
                        html_nodes(".vacancyText") %>%
                        html_nodes('h2') %>% 
                        html_nodes('a') %>% 
                        html_attr("href")) %>% 
        mutate(id = str_extract(url,'(?<!\\d)[0-9]{6}(?!\\d)'))


  4. Map through the pages


The trickiest part - for me at least - was to scroll through the different pages of a website that is rendered through JavaScript. I ended up finding a solution in the documentation of {RSelenium} here. I will walk through the solution in the following three subsections.


On the first page that is visited, the number of vacancies is shown in the #ctl00_plhControl_lblMessage node. Each page shows ten vacancies, so we divide the number of vacancies by ten and round up, which makes sure that a final, partially filled page is also counted; that is what the line mutate(. = ceiling(./10) * 10), followed by the integer division by ten, does. For example, 83 vacancies become ceiling(83 / 10) = 9 pages.

## Check how many pages to scrape ----

page <- read_html(remDr$getPageSource()[[1]]) %>%
        html_nodes("#ctl00_plhControl_lblMessage") %>% 
        html_children() %>% 
        html_text() %>% 
        as.numeric() %>%
        tibble() %>% 
        mutate(. = ceiling(./10) * 10) %/%
        10



After every five pages, the pagination control uses a > button to move to the next set of five pages. In order to simulate this behaviour, we use the following piece of code.

## create tibble with page id's to scroll through ----

scroller <- tibble(x = c(1:page$.)) %>%
        transmute(
                arrow =
                        case_when(x %in% c(seq(6, page$., 5)) ~ ">",
                                  TRUE ~ NA_character_),
                arrow = coalesce(arrow, as.character(x))
        )
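
As an illustration with a hypothetical count of twelve pages, entries 6 and 11 become the > arrow and every other entry stays a plain page number:

## illustration only: the scroller for a hypothetical twelve pages ----

tibble(x = 1:12) %>%
        transmute(arrow = case_when(x %in% seq(6, 12, 5) ~ ">",
                                    TRUE ~ as.character(x)))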


This piece of code is a bit longer because several parts are put together. The function consists of two parts:

I. locating the button that brings us to a new page

In this part I use the solution that I found in the {RSelenium} vignette. The call remDr$findElements(using = "css", ".PageNumbers") identifies all page buttons, the next lines pick out the button that matches the requested page, and webElem$clickElement() clicks it.

II. scraping the links of all the individual vacancies on that page

The second part of the code repeats the scraping from before and saves all the links and ids in a tibble (i.e. a data frame).

## create function which can scroll through javascript post content ----

scrol_ler <- function(scroll) {
  
  ## give the page some time to load
  Sys.sleep(3)
  
  ## part I: locate all page buttons and click the one matching `scroll`
  webElems <- remDr$findElements(using = "css", ".PageNumbers")
  
  resHeaders <- webElems %>%
    map_chr(~ unlist(.x$getElementText()))
  
  webElem <- webElems[[which(resHeaders == scroll)]]
  
  webElem$clickElement()
  
  Sys.sleep(3)
  
  ## part II: scrape the vacancy links and ids from the newly loaded page
  tibble(
    url = xml2::read_html(remDr$getPageSource()[[1]]) %>%
      rvest::html_nodes(".vacancyText") %>%
      html_nodes('h2') %>%
      html_nodes('a') %>%
      html_attr("href")
  ) %>%
    mutate(id = str_extract(url, '(?<!\\d)[0-9]{6}(?!\\d)'))
  
}
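
Called on its own, the function handles a single page; the (hypothetical) call below would click the button labelled 2, wait for the page to load, and return the links and ids found there:

## example: scrape the links and ids of page 2 only ----

scrol_ler("2")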


  5. Extract the information


After figuring this out, I extracted the information in two steps. First, I scrolled through the different pages to obtain the links of the vacancies listed on the website.

## scrape hrefs from first page ----

hrefs <- tibble(url = read_html(remDr$getPageSource()[[1]]) %>%
                        html_nodes(".vacancyText") %>%
                        html_nodes('h2') %>% 
                        html_nodes('a') %>% 
                        html_attr("href")) %>% 
        mutate(id = str_extract(url,'(?<!\\d)[0-9]{6}(?!\\d)'))

## map through the other pages and scrape the href from those pages ----

hrefs2 <- scroller %>% 
        mutate(hrefs = map(arrow, possibly(scrol_ler, NA)))

## Bind the two tibbles together ----

hrefs_all <- hrefs2 %>% 
        unnest(cols = c(hrefs)) %>% 
        select(-arrow) %>% 
        bind_rows(hrefs)
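
Since the same vacancy can show up more than once across the scraped pages, the id column - which was saved exactly to check for unique vacancies - can be used to deduplicate. A small sketch:

## keep each vacancy only once, based on its id ----

hrefs_all <- hrefs_all %>% 
        distinct(id, .keep_all = TRUE)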


Second, I visit each of the links individually to get the actual content that I am looking for. To obtain the wanted information, I use a longer bit of code, which is wrapped in the go_school function below. With this function I map over all the links that we obtained in the previous code block (the tibble hrefs_all).

## function to get vacancy information ----

go_school <- function(url) {
        
        remDr$navigate(glue::glue("{url}"))
        
        ## wait a random number of seconds so the website is not overloaded
        Sys.sleep(
                sample(3:6, 1)
        )
        
        ## parse the page once and reuse it for every field below
        page_source <- read_html(remDr$getPageSource()[[1]])
        
        ### school name
        
        school_name <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblSchool') %>% 
                html_text()
        
        ### street name
        
        street_name <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblStraatnaam') %>% 
                html_text()
        
        ### postal code
        
        postal <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblPostalcode') %>% 
                html_text()
        
        ### city
        
        city <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblPlaats') %>% 
                html_text()
        
        ### type of education
        
        edu_type <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblTypeOnderwijs') %>% 
                html_text()
        
        ### date of vacancy posting
        
        entry_date <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblPlaatsing2') %>% 
                html_text()
        
        ### date of vacancy closing
        
        closing_date <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblSluitingsDatum') %>% 
                html_text()
        
        ### starting date
        
        starting_date <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblMetIngang') %>% 
                html_text()
        
        ### type of contract
        
        contract <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblDienstverband') %>% 
                html_text()
        
        ### job description
        
        description <- page_source %>% 
                html_nodes('#ctl00_plhControl_lblOmschrijving') %>% 
                html_text()
        
        ## return one row per vacancy; fields missing on the page become NA
        tibble(school_name = if (length(school_name) == 0) NA_character_ else school_name,
               street_name = if (length(street_name) == 0) NA_character_ else street_name,
               postal = if (length(postal) == 0) NA_character_ else postal,
               city = if (length(city) == 0) NA_character_ else city,
               edu_type = if (length(edu_type) == 0) NA_character_ else edu_type,
               entry_date = if (length(entry_date) == 0) NA_character_ else entry_date,
               closing_date = if (length(closing_date) == 0) NA_character_ else closing_date,
               starting_date = if (length(starting_date) == 0) NA_character_ else starting_date,
               contract = if (length(contract) == 0) NA_character_ else contract,
               description = if (length(description) == 0) NA_character_ else description)
        
}

### map over all the links with the go_school function

vacancies <- hrefs_all %>% 
        mutate(info = map(url, go_school))
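
Because info is a list column, the result still has to be flattened into one table. A sketch, assuming every call to go_school returned a one-row tibble:

## flatten the nested results into a single table ----

vacancies_flat <- vacancies %>% 
        unnest(cols = c(info))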



  6. Additional information

In a following post, I will analyse the descriptions of the vacancies with a natural language processing technique.

Below, my session information is shown. You can take a look at the entire code in my GitHub repo. Remember to shut down your Docker container after you are done with your analysis.
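
Before stopping the container from the terminal, the browser session itself can be closed from R - a minimal sketch:

## close the browser session when you are done ----

remDr$close()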

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RSelenium_1.7.7    rvest_0.3.5        xml2_1.2.2         forcats_0.5.0     
##  [5] stringr_1.4.0      dplyr_0.8.5        purrr_0.3.4        readr_1.3.1       
##  [9] tidyr_1.0.2        tibble_3.0.1       ggplot2_3.3.0.9000 tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6       lubridate_1.7.4    lattice_0.20-40    listenv_0.8.0     
##  [5] binman_0.1.1       assertthat_0.2.1   digest_0.6.25      R6_2.4.1          
##  [9] cellranger_1.1.0   backports_1.1.7    reprex_0.3.0       evaluate_0.14     
## [13] httr_1.4.1         pillar_1.4.4       rlang_0.4.6        curl_4.3          
## [17] readxl_1.3.1       rstudioapi_0.11    rmarkdown_2.1      selectr_0.4-2     
## [21] wdman_0.2.5        munsell_0.5.0      broom_0.5.5        spiderbar_0.2.2   
## [25] compiler_3.6.2     modelr_0.1.6       xfun_0.14          pkgconfig_2.0.3   
## [29] askpass_1.1        globals_0.12.5     htmltools_0.4.0    openssl_1.4.1     
## [33] tidyselect_1.1.0   codetools_0.2-16   XML_3.99-0.3       future_1.17.0     
## [37] fansi_0.4.1        crayon_1.3.4       dbplyr_1.4.2       withr_2.2.0       
## [41] bitops_1.0-6       grid_3.6.2         nlme_3.1-144       jsonlite_1.6.1    
## [45] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5      
## [49] semver_0.2.0       scales_1.1.1       future.apply_1.5.0 cli_2.0.2         
## [53] stringi_1.4.6      fs_1.4.1           robotstxt_0.6.2    ellipsis_0.3.1    
## [57] generics_0.0.2     vctrs_0.3.0        tools_3.6.2        glue_1.4.1        
## [61] hms_0.5.3          parallel_3.6.2     yaml_2.2.1         colorspace_1.4-1  
## [65] caTools_1.18.0     knitr_1.28         haven_2.2.0