This document demonstrates two methods of accessing big (or confidential) data. The first seems, at first glance, to be the most efficient way of doing it, but once you factor in time or cost constraints, problems appear. The second method therefore breaks the data acquisition and processing stream into separate components, allowing us to separate the time-consuming part from the other, reproducible parts.
We will use U.S. Census Bureau data (American Community Survey data) accessed through an API. Downloading the entire corpus of ACS data would consume several hundred gigabytes, but the API allows us to slice and dice the data. The example artificially reduces the amount of data for illustrative purposes, but we point out how this simple example turns into a “big data” example.
In principle, you should be using packages such as checkpoint
to provide a fully reproducible environment. In the interest of time, we will skip that here; please consult the checkpoint vignette for details.
library(checkpoint)
checkpoint("2021-01-27")
Other than a number of R packages for general processing (tidyverse), we will rely on the tidycensus
package.
library(tidycensus)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Whether we work with actual confidential data or just need a place to store login credentials, we need to think hard about where those secrets live. In this case, the Census API requires us to request a key. Where and how do we store it? There is no real consensus - in other words, there are lots of solutions. Here, we use the ~/.Renviron
file, though I won’t show you my actual file. So that others can reproduce this, I provide a template:
cat(readLines(file.path(basepath,".Renviron.template")))
## censusapikey = "adfafafda"
To reproduce this, obtain a Census API key, create a file called .Renviron
containing your key (following the template above), and restart R. We can then set the API key for the tidycensus
package.
census_api_key(Sys.getenv("censusapikey"))
## To install your API key for use in future sessions, run this function with `install = TRUE`.
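If you prefer to script that setup rather than edit files by hand, something along the following lines would work. This is only a sketch of one possible approach; the file-copy step and the use of readRenviron() are my suggestion, not part of the original workflow.
# Copy the template to your home directory, then edit it so that
# censusapikey contains your actual key.
file.copy(file.path(basepath, ".Renviron.template"), "~/.Renviron")
# Re-read the environment file without restarting R.
readRenviron("~/.Renviron")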
We’re going to follow the “Basic Usage” tutorial from the tidycensus
website and compute the median age by state for 2010:
time.age10 <- system.time(age10 <- get_decennial(geography = "state",
                                                 variables = "P013001",
                                                 year = 2010))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# time it took to run this
time.age10
## user system elapsed
## 0.080 0.016 0.741
# inspect the data
head(age10)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P013001 37.9
## 2 02 Alaska P013001 33.8
## 3 04 Arizona P013001 35.9
## 4 05 Arkansas P013001 37.4
## 5 06 California P013001 35.2
## 6 22 Louisiana P013001 35.8
# some stats
summary(age10)
## GEOID NAME variable value
## Length:52 Length:52 Length:52 Min. :29.20
## Class :character Class :character Class :character 1st Qu.:36.20
## Mode :character Mode :character Mode :character Median :37.45
## Mean :37.49
## 3rd Qu.:38.80
## Max. :42.70
That was fast: 0.741 seconds. But it only generated 52 observations! What if we requested a bit more detail? Let’s look at block-level data for Tompkins County, NY (FIPS code “36109”):
start.time <- Sys.time()
age10block <- get_decennial(geography = "block",
                            state = "NY",
                            county = "109",
                            show_call = TRUE,
                            variables = "P013001",
                            year = 2010)
## Getting data from the 2010 decennial Census
## Census API call: https://api.census.gov/data/2010/dec/sf1?get=P013001%2CNAME&for=block%3A%2A&in=state%3A36%2Bcounty%3A109
## Using Census Summary File 1
time.block <- Sys.time() - start.time
# time it took to run this
time.block
## Time difference of 1.139917 secs
# inspect the data
head(age10block)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 361090001001… Block 1000, Block Group 1, Census Tract 1, Tompk… P013001 30.3
## 2 361090001001… Block 1002, Block Group 1, Census Tract 1, Tompk… P013001 0
## 3 361090001001… Block 1001, Block Group 1, Census Tract 1, Tompk… P013001 30.1
## 4 361090001001… Block 1006, Block Group 1, Census Tract 1, Tompk… P013001 0
## 5 361090001001… Block 1021, Block Group 1, Census Tract 1, Tompk… P013001 28.5
## 6 361090001001… Block 1003, Block Group 1, Census Tract 1, Tompk… P013001 43.5
# some stats
summary(age10block)
## GEOID NAME variable value
## Length:3082 Length:3082 Length:3082 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median :33.00
## Mean :28.07
## 3rd Qu.:45.50
## Max. :91.50
# set the parameter for the next code chunk
counties.to.query <- 30
# if we wanted all of this, we would replace the number with "nrow(fips_codes)"
# counties.to.query <- nrow(fips_codes)
That took about 1.14 seconds and generated 3082 observations, for ONE county. There are 3237 counties in total. Let’s see how long this takes for the first 30 counties.
# tidycensus / the Census API forces us to loop over counties
start.time <- Sys.time()
blocks <- NA
for (row in 1:counties.to.query) {
  county <- fips_codes[row, "county_code"]
  state  <- fips_codes[row, "state"]
  thisblock <- get_decennial(geography = "block",
                             state = state,
                             county = county,
                             variables = "P013001",
                             year = 2010)
  if (row == 1) {
    blocks <- thisblock
    rm(thisblock)
  } else {
    blocks <- bind_rows(blocks, thisblock)
    rm(thisblock)
  }
}
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## (these two messages repeat for each of the remaining 29 counties)
end.time <- Sys.time()
elapsed.time <- end.time - start.time
elapsed.scale <- elapsed.time / counties.to.query * nrow(fips_codes)
That took 24.8 seconds for 30 of the 3237 counties, yielding 84042 records. Scaling that up, you could estimate the total time for the entire country at roughly 44.6 minutes.
Would you want to incur that time every time you run the code for the entire country?
We already implemented the first solution, which is useful while you are developing the code: we reduced the number of counties to a feasible subset and estimated the total runtime. Ideally, we would use two values for the parameter that controls this: a very small number to test the code for functionality, and a larger number to get meaningful results. For the final run, we would set it to the full scope of the problem.
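A minimal sketch of how that switch might look; the testing flag below is purely illustrative and not part of the original code:
# Toggle between a quick functionality test and the full-scope run.
testing <- TRUE
if (testing) {
  counties.to.query <- 2                  # tiny subset: just check that the code runs
} else {
  counties.to.query <- nrow(fips_codes)   # full scope for the final run
}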
The second solution is to break the problem apart. Let’s see how long it takes to save and to read the resulting file. First, let’s be clear about the directory structure here:
basedata  <- file.path(basepath, "data")
rawdata   <- file.path(basedata, "raw")
cache     <- file.path(basedata, "cache")
generated <- file.path(basedata, "generated")
We’ve defined a base data directory and three subdirectories:
- basedata to house all the data
- rawdata to house any data we may have (manually) downloaded from somewhere else
- cache to house intermediate data which can be programmatically downloaded but may need a temporary home
- generated to house any data we generate by modifying either rawdata or cache data

Our README should describe this structure, and could also specify that all data in cache and generated can be recreated, given enough time.
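For instance, the data section of the README might read roughly like this (the wording is only a suggestion):
data/raw/        manually obtained inputs; not recreated by the code
data/cache/      intermediate data pulled from the Census API; can be
                 recreated by re-running the download code (slow)
data/generated/  data derived from raw/ and cache/ by the analysis code;
                 can be recreated quickly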
We’re going to use the cache to speed up processing for subsequent runs during testing and possibly for demonstration purposes.
Let’s make sure these directories exist:
for (dir in list(basedata, rawdata, cache, generated)) {
  if (file.exists(dir)) {
    message(paste0("Directory ", dir, " already present!"))
  } else {
    dir.create(file.path(dir), recursive = TRUE)
    message(paste0("Directory ", dir, " created!"))
  }
}
## Directory /cloud/project/data already present!
## Directory /cloud/project/data/raw already present!
## Directory /cloud/project/data/cache already present!
## Directory /cloud/project/data/generated already present!
Those steps would normally go into the header of our reproducible document!
Let’s move to the timing:
system.time(saveRDS(blocks,file = file.path(cache,"block_median.Rds")))
## user system elapsed
## 0.214 0.000 0.312
rm(blocks)
start.time <- Sys.time()
blocks <- readRDS(file=file.path(cache,"block_median.Rds"))
read.time <- Sys.time() - start.time
read.scaled <- read.time / counties.to.query * nrow(fips_codes)
Assuming that scaling up to the full file size is linear, it would take about 14.81 seconds to read back the entire universe of blocks from a cached file, compared to 2677.51 seconds when using the API each time.
How could this be even more refined? For one, we could test whether the cache file has already been generated in the download section:
# not evaluated
cache.blocks <- file.path(cache, "block_median.Rds")
if (file.exists(cache.blocks)) {
  # cached file is present: read it back instead of hitting the API
  blocks <- readRDS(cache.blocks)
} else {
  # no cache yet: readin_from_api() stands in for the download code above,
  # which should write its result to `outfile`
  readin_from_api(outfile = cache.blocks)
  blocks <- readRDS(cache.blocks)
}
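For completeness, here is one possible sketch of what the readin_from_api() placeholder could look like, assuming it simply wraps the county loop from above and saves the combined result to the requested file. The function body and its n.counties argument are my illustration, not code from the original project.
# Illustrative sketch: download block-level median age for n.counties counties
# via the Census API and cache the combined result in `outfile`.
readin_from_api <- function(outfile, n.counties = nrow(fips_codes)) {
  blocks <- NULL
  for (row in 1:n.counties) {
    thisblock <- get_decennial(geography = "block",
                               state     = fips_codes[row, "state"],
                               county    = fips_codes[row, "county_code"],
                               variables = "P013001",
                               year      = 2010)
    blocks <- bind_rows(blocks, thisblock)
  }
  saveRDS(blocks, file = outfile)
  invisible(blocks)
}
Having the function also return the data invisibly means the caller could assign its result directly instead of re-reading the cache file.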
Now we can routinely process this, without having to worry about that pesky download part.1
Note that the above approach works with any programming language (Stata, SPSS, etc.). It also applies to, and should be used for, any large-ish file, and the idea may percolate through an entire job stream.
So what happens when the API (inevitably) changes, breaks, is shut down, or otherwise stops working?
Nothing.
Because we have the cached file, we are insulated from such breaks in the API. In fact, when providing our replication package, we should (if the license allows) include the cached file, while keeping the code that downloads it.
1. Note that R Markdown also allows you to declare a code chunk as cached. However, that cache is automatically invalidated when any code or text above it is modified, leading to potentially undesirable re-downloads. In this case, it may be better to work with a manually defined cache.