Robert's Data Science Blog

Download Lots of Data

library(magrittr)

I need to download lots of files that share a specific format. Each file holds one month of measurements and sits at a URL determined by its year and month:

http://some.site.com/data/{year}/{month}/interesting_file_{year}{month}.zip

I want to download the files to the data/zip subfolder of my current project, so I split the URL into a base part and a file name:

url_base_pattern <- "http://some.site.com/data/{year}/{month}/"
file_pattern <- "interesting_file_{year}{month}.zip"

To generate all combinations of year and month, I use the tidyr::crossing function. If January through September should be prefixed with a “0”, stringr::str_pad takes care of the padding:

tidyr::crossing(year = 2010:2011, month = stringr::str_pad(1:12, width = 2, pad = "0"))
## # A tibble: 24 x 2
##     year month
##    <int> <chr>
##  1  2010 01   
##  2  2010 02   
##  3  2010 03   
##  4  2010 04   
##  5  2010 05   
##  6  2010 06   
##  7  2010 07   
##  8  2010 08   
##  9  2010 09   
## 10  2010 10   
## # … with 14 more rows
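Roughly the same grid can be sketched in base R, which may help to see what crossing and str_pad are doing. This is an illustrative alternative only, not what the rest of the post uses; the names months and combos are made up for the sketch.

```r
# Zero-pad the month numbers: "01", "02", ..., "12"
months <- sprintf("%02d", 1:12)

# expand.grid varies its first argument fastest, so month is listed first
# to get the same row order as tidyr::crossing(year, month)
combos <- expand.grid(month = months, year = 2010:2011, stringsAsFactors = FALSE)

# Put the columns in the same order as the tibble above
combos <- combos[, c("year", "month")]

head(combos, 3)
```

The tidyr version has the advantage of returning a tibble directly, which fits the pipeline that follows.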

From the year and month, the full URL and the local destination are built with the wonderful glue and here packages:

urls <- tidyr::crossing(
    year = 2010:2011, 
    month = stringr::str_pad(1:12, width = 2, pad = "0")
) %>% 
    dplyr::mutate(
        file_name = glue::glue_data(., file_pattern),
        destfile = here::here("data", "zip", file_name),
        url = paste0(glue::glue_data(., url_base_pattern), file_name)
    )

The glue package really shines here compared to an ordinary paste, because year and month each appear several times in the URL.
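To make the comparison concrete, here is a minimal sketch that builds the URL for a single month both ways. The values of year and month are example inputs, and url_paste and url_glue are names chosen for the sketch.

```r
library(glue)

# Example values for a single month
year <- 2010
month <- "01"

# paste0: every piece of the URL is a separate argument
url_paste <- paste0(
    "http://some.site.com/data/", year, "/", month, "/",
    "interesting_file_", year, month, ".zip"
)

# glue: the pattern stays readable and the placeholders are filled in by name
url_glue <- glue("http://some.site.com/data/{year}/{month}/interesting_file_{year}{month}.zip")

# Both approaches produce the same string
identical(url_paste, as.character(url_glue))
```

With paste0 the shape of the URL is hard to see among the quotes and commas, while the glue pattern reads almost like the URL itself.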

To make things a little more automatic I also create the download folder in my script:

download_path <- here::here("data", "zip")
if (isFALSE(fs::dir_exists(download_path)))
    fs::dir_create(download_path)

Finally, I iterate over the rows of the urls tibble to download each of the files.

urls %>% 
    dplyr::select(url, destfile) %>% 
    purrr::pwalk(download.file, quiet = TRUE)

Since I have no interest in the output of download.file, but only in its side effect (downloading the file), I use a walk function (here pwalk) instead of its map counterpart.

Multiple runs

The form above gets the job done. But if I re-run the script all the files will be downloaded again – a waste if all I want is to add a few months.

This can be remedied by making a custom download function to replace download.file above:

download_once <- function(url, destfile, ...) {
    if (isFALSE(fs::file_exists(destfile)))
        download.file(url = url, destfile = destfile, ...)
}
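In the real script the only change is to pass download_once to pwalk in place of download.file. The skipping behaviour can also be tried offline with the sketch below. Everything in it apart from the existence check is an assumption made for illustration: fake_download is a hypothetical stand-in for download.file, download_once_with is a variant that takes the downloader as an argument, and the two URLs are made up.

```r
# Hypothetical stand-in for download.file: writes the URL into destfile
# and reports each call, so no network access is needed
fake_download <- function(url, destfile, ...) {
    message("downloading ", url)
    writeLines(url, destfile)
}

# Same existence check as download_once, with the downloader passed in
download_once_with <- function(url, destfile, downloader, ...) {
    if (isFALSE(fs::file_exists(destfile)))
        downloader(url = url, destfile = destfile, ...)
}

# A fresh temporary folder and a made-up two-row table of downloads
demo_dir <- tempfile("download_once_demo")
dir.create(demo_dir)
demo_urls <- data.frame(
    url = c("http://some.site.com/a.zip", "http://some.site.com/b.zip"),
    destfile = file.path(demo_dir, c("a.zip", "b.zip")),
    stringsAsFactors = FALSE
)

# Pretend that a.zip was already fetched by an earlier run
file.create(demo_urls$destfile[1])

# Only b.zip is "downloaded"; a.zip is left untouched
purrr::pwalk(demo_urls, download_once_with, downloader = fake_download)
```

Note that the check only looks at whether the file exists, so a partially downloaded file from an interrupted run would also be skipped.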

Read data

In my case each zip file contains a single CSV file. To read all the files I first need a vector of file names (alternatively, the destfile column of the urls tibble can be used).

zip_files <- fs::dir_ls(here::here("data", "zip"))

The old-school way of loading all of them into one big tibble would be with the purrr package:

tbl <- purrr::map_dfr(zip_files, readr::read_csv, col_types = readr::cols(...))

Since a potentially large number of files have been downloaded, I prefer to list the column types explicitly with col_types so that I am warned about any unexpected deviations.
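As a small illustration of what an explicit column specification buys, the sketch below reads two tiny files, one of which contains a value that does not match the declared type. The columns station and value, the file names, and the use of plain CSV files instead of zip files are all assumptions made for the example; the real column specification is not shown in this post.

```r
# Two tiny CSV files with hypothetical columns in a temporary folder
csv_dir <- tempfile("csv_demo")
dir.create(csv_dir)
writeLines(c("station,value", "A,1.5", "B,2.5"), file.path(csv_dir, "m01.csv"))
writeLines(c("station,value", "A,3.5", "B,oops"), file.path(csv_dir, "m02.csv"))

csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Explicit column types: anything that does not parse as declared is flagged
col_spec <- readr::cols(
    station = readr::col_character(),
    value = readr::col_double()
)

# The "oops" entry cannot be parsed as a double and ends up as NA
tbl_demo <- purrr::map_dfr(csv_files, readr::read_csv, col_types = col_spec)
tbl_demo
```

The malformed entry should surface as a parsing warning from readr rather than silently turning the whole column into text, which is the kind of deviation worth hearing about when many files are read in one go.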

The new-school way would be with the vroom package:

tbl <- vroom::vroom(zip_files, col_types = vroom::cols(...))