scrapeR.Rmd
In this tutorial we will scrape data from the Cat and Dog pages on Simple English Wikipedia.
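The examples below assume that scrapeR is attached along with rvest (for [read_html()], [html_nodes()], and [html_text()]) and magrittr (for the %>% pipe); if scrapeR re-exports these you may not need the extra calls. A minimal setup might look like this:

library(scrapeR)
library(rvest)     # read_html(), html_nodes(), html_text()
library(magrittr)  # the %>% pipe

We start by creating an empty spider and giving it a name: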
wiki_scraper <- spider("wiki scraper")
wiki_scraper
#> # A spider: wiki scraper
#> ### Queue: 0 item(s)
#> ### Steps: 0
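Next, we add the two pages we want to scrape to the spider's queue with [add_queue()]: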
wiki_scraper %>%
  add_queue(
    c(
      "https://simple.wikipedia.org/wiki/Cat",
      "https://simple.wikipedia.org/wiki/Dog"
    )
  ) -> wiki_scraper
wiki_scraper
#> # A spider: wiki scraper
#> ### Queue: 2 item(s)
#> ### Steps: 0
Let's add our first step to the scraper. Steps are applied to each item in the queue. Here, we will extract all of the headers from each Wikipedia page. Note that instead of passing functions to [add_parser()], we pass formulas. This is because each step is run with [purrr::map()], with the queue item as the first argument (.x) and the spider as the second (.y). The short sketch below illustrates the convention before we apply it to our spider.
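To make that concrete, here is a rough illustration (not scrapeR's actual internals, just the purrr formula convention it builds on): the formula becomes a function of .x, and [purrr::map()] calls it once per queue item.

library(purrr)

queue <- c(
  "https://simple.wikipedia.org/wiki/Cat",
  "https://simple.wikipedia.org/wiki/Dog"
)

# .x stands in for the current queue item on each call
map(queue, ~ paste("visiting", .x))
#> [[1]]
#> [1] "visiting https://simple.wikipedia.org/wiki/Cat"
#>
#> [[2]]
#> [1] "visiting https://simple.wikipedia.org/wiki/Dog"

With that convention in mind, let's add the header parser to our spider: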
wiki_scraper %>%
  add_parser(
    ~ read_html(.x) %>%              # .x is the queue item (a page URL)
      html_nodes(".mw-headline") %>% # select the section heading elements
      html_text(),                   # keep just the heading text
    name = "get headers"
  ) -> wiki_scraper
wiki_scraper
#> # A spider: wiki scraper
#> ### Queue: 2 item(s)
#> ### Steps: 1
#> ( 1 ) A parser: get headers
All evaluation in scrapeR is delayed until [run()] is called. We can do so now to see our output.
run(wiki_scraper) %>% unlist()  # flatten the results into a single character vector
#> Executing parser get headers
#> [1] "History" "Cat anatomy"
#> [3] "Behaviour" "Mating"
#> [5] "Birth and after" "Grooming"
#> [7] "Food" "Health concerns"
#> [9] "References" "Other websites"
#> [11] "Appearance and behaviour" "Lifespan"
#> [13] "Origin of dogs" "Dogs and humans"
#> [15] "Dog breeds" "Photogallery"
#> [17] "Related pages" "References"