This document demonstrates two methods of accessing big (or confidential) data. The first seems, at first glance, to be the most efficient way of doing it, but once you factor in time or cost constraints, problems appear. The second method therefore breaks the data acquisition and processing stream into separate components, allowing us to separate the time-consuming part from the other, reproducible parts.
We will use U.S. Census Bureau data (American Community Survey data) accessed through an API. Downloading the entire corpus of ACS data would consume several hundred gigabytes, but the API allows us to slice and dice the data. The example artificially reduces the amount of data for illustrative purposes, but we point out how this simple example turns into a “big data” example.
In principle, you should be using packages such as checkpoint
to provide a fully reproducible environment. In the interest of time, we will skip that here; please consult the checkpoint vignette for details.
library(checkpoint)
checkpoint("2021-01-27")
Other than a number of R packages for general processing (tidyverse), we will rely on the tidycensus
package.
library(tidycensus)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Whether we work with actual confidential data or just need a place to store login credentials, we need to think hard about where those secrets live. In this case, the Census API requires us to request a key. Where and how do we store it? There is no real consensus - in other words, there are lots of solutions. Here, we use the ~/.Renviron
file, though I won’t show you my actual file. So that others can reproduce this, I provide a template:
cat(readLines(file.path(basepath,".Renviron.template")))
## censusapikey = "adfafafda"
To reproduce this, obtain a Census API key, create a file called .Renviron
containing your key (following the template above), and restart R. We can then set the API key for the tidycensus
package.
census_api_key(Sys.getenv("censusapikey"))
## To install your API key for use in future sessions, run this function with `install = TRUE`.
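If you prefer to script that setup rather than edit files by hand, something along the following lines would work. This is only a sketch of one possible approach; the file-copy step and the use of readRenviron() are my suggestion, not part of the original workflow.
# Copy the template to your home directory, then edit it so that
# censusapikey contains your actual key.
file.copy(file.path(basepath, ".Renviron.template"), "~/.Renviron")
# Re-read the environment file without restarting R.
readRenviron("~/.Renviron")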
We’re going to follow the “Basic Usage” tutorial from the tidycensus
website and compute the median age by state for 2010:
time.age10 <- system.time(age10 <- get_decennial(geography = "state",
                                                 variables = "P013001",
                                                 year = 2010))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# time it took to run this
time.age10
## user system elapsed
## 0.080 0.016 0.741
# inspect the data
head(age10)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P013001 37.9
## 2 02 Alaska P013001 33.8
## 3 04 Arizona P013001 35.9
## 4 05 Arkansas P013001 37.4
## 5 06 California P013001 35.2
## 6 22 Louisiana P013001 35.8
# some stats
summary(age10)
## GEOID NAME variable value
## Length:52 Length:52 Length:52 Min. :29.20
## Class :character Class :character Class :character 1st Qu.:36.20
## Mode :character Mode :character Mode :character Median :37.45
## Mean :37.49
## 3rd Qu.:38.80
## Max. :42.70
That was fast: 0.741 seconds. But it only generated 52 observations! What if we requested a bit more detail? Let’s look at block-level data for Tompkins County, NY (FIPS code “36109”):
start.time <- Sys.time()
age10block <- get_decennial(geography = "block",
                            state = "NY",
                            county = "109",
                            show_call = TRUE,
                            variables = "P013001",
                            year = 2010)
## Getting data from the 2010 decennial Census
## Census API call: https://api.census.gov/data/2010/dec/sf1?get=P013001%2CNAME&for=block%3A%2A&in=state%3A36%2Bcounty%3A109
## Using Census Summary File 1
time.block <- Sys.time() - start.time
# time it took to run this
time.block
## Time difference of 1.139917 secs
# inspect the data
head(age10block)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 361090001001… Block 1000, Block Group 1, Census Tract 1, Tompk… P013001 30.3
## 2 361090001001… Block 1002, Block Group 1, Census Tract 1, Tompk… P013001 0
## 3 361090001001… Block 1001, Block Group 1, Census Tract 1, Tompk… P013001 30.1
## 4 361090001001… Block 1006, Block Group 1, Census Tract 1, Tompk… P013001 0
## 5 361090001001… Block 1021, Block Group 1, Census Tract 1, Tompk… P013001 28.5
## 6 361090001001… Block 1003, Block Group 1, Census Tract 1, Tompk… P013001 43.5
# some stats
summary(age10block)
## GEOID NAME variable value
## Length:3082 Length:3082 Length:3082 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median :33.00
## Mean :28.07
## 3rd Qu.:45.50
## Max. :91.50
# set the parameter for the next code chunk
counties.to.query <- 30
# if we wanted all of this, we would replace the number with "nrow(fips_codes)"
# counties.to.query <- nrow(fips_codes)
That took about 1.14 seconds and generated 3082 observations, for ONE county. There are 3237 counties in total. Let’s see how long this takes for the first 30 counties.
# tidycensus / the Census API forces us to loop over counties
start.time <- Sys.time()
blocks <- NA
for (row in 1:counties.to.query) {
  county <- fips_codes[row, "county_code"]
  state  <- fips_codes[row, "state"]
  thisblock <- get_decennial(geography = "block",
                             state = state,
                             county = county,
                             variables = "P013001",
                             year = 2010)
  if (row == 1) {
    blocks <- thisblock
    rm(thisblock)
  } else {
    blocks <- bind_rows(blocks, thisblock)
    rm(thisblock)
  }
}
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## (these two messages repeat for each of the remaining 29 counties)
end.time <- Sys.time()
elapsed.time <- end.time - start.time
elapsed.scale <- elapsed.time / counties.to.query * nrow(fips_codes)
That took 24.8 seconds for 30 of the 3237 counties, yielding 84042 records. Scaling that up, you could estimate the total time for the entire country at roughly 44.6 minutes.
Would you want to incur that time every time you run the code for the entire country?
We already implemented the first solution, which is useful while you are developing the code: we reduced the number of counties to a feasible subset and estimated the total runtime. Ideally, we would use two values for the parameter that controls this: a very small number to test the code for functionality, and a larger number to get meaningful results. For the final run, we would set it to the full scope of the problem.
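A minimal sketch of how that switch might look; the testing flag below is purely illustrative and not part of the original code:
# Toggle between a quick functionality test and the full-scope run.
testing <- TRUE
if (testing) {
  counties.to.query <- 2                  # tiny subset: just check that the code runs
} else {
  counties.to.query <- nrow(fips_codes)   # full scope for the final run
}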
The second solution is to break the problem apart. Let’s see how long it takes to save and to read the resulting file. First, let’s be clear about the directory structure here:
basedata  <- file.path(basepath, "data")
rawdata   <- file.path(basedata, "raw")
cache     <- file.path(basedata, "cache")
generated <- file.path(basedata, "generated")
We’ve defined a base data directory and three subdirectories:
- basedata to house all the data
- rawdata to house any data we may have (manually) downloaded from somewhere else
- cache to house intermediate data which can be programmatically downloaded but may need a temporary home
- generated to house any data we generate by modifying either rawdata or cache data

Our README should describe this structure, and could also specify that all data in cache and generated can be recreated, given enough time.
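For instance, the data section of the README might read roughly like this (the wording is only a suggestion):
data/raw/        manually obtained inputs; not recreated by the code
data/cache/      intermediate data pulled from the Census API; can be
                 recreated by re-running the download code (slow)
data/generated/  data derived from raw/ and cache/ by the analysis code;
                 can be recreated quickly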
We’re going to use the cache to speed up processing for subsequent runs during testing and possibly for demonstration purposes.
Let’s make sure these directories exist:
for (dir in list(basedata, rawdata, cache, generated)) {
  if (file.exists(dir)) {
    message(paste0("Directory ", dir, " already present!"))
  } else {
    dir.create(file.path(dir), recursive = TRUE)
    message(paste0("Directory ", dir, " created!"))
  }
}
## Directory /cloud/project/data already present!
## Directory /cloud/project/data/raw already present!
## Directory /cloud/project/data/cache already present!
## Directory /cloud/project/data/generated already present!
Those steps would normally go into the header of our reproducible document!
Let’s move to the timing:
system.time(saveRDS(blocks,file = file.path(cache,"block_median.Rds")))
## user system elapsed
## 0.214 0.000 0.312
rm(blocks)
start.time <- Sys.time()
blocks <- readRDS(file=file.path(cache,"block_median.Rds"))
read.time <- Sys.time() - start.time
read.scaled <- read.time / counties.to.query * nrow(fips_codes)
Assuming that scaling up to the full file size is linear, it would take about 14.81 seconds to read back the entire universe of blocks from a cached file, compared to 2677.51 seconds when using the API each time.
How could this be even more refined? For one, we could test whether the cache file has already been generated in the download section:
# not evaluated
cache.blocks <- file.path(cache, "block_median.Rds")
if (file.exists(cache.blocks)) {
  # cached file is present: read it back instead of hitting the API
  blocks <- readRDS(cache.blocks)
} else {
  # no cache yet: readin_from_api() stands in for the download code above,
  # which should write its result to `outfile`
  readin_from_api(outfile = cache.blocks)
  blocks <- readRDS(cache.blocks)
}
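For completeness, here is one possible sketch of what the readin_from_api() placeholder could look like, assuming it simply wraps the county loop from above and saves the combined result to the requested file. The function body and its n.counties argument are my illustration, not code from the original project.
# Illustrative sketch: download block-level median age for n.counties counties
# via the Census API and cache the combined result in `outfile`.
readin_from_api <- function(outfile, n.counties = nrow(fips_codes)) {
  blocks <- NULL
  for (row in 1:n.counties) {
    thisblock <- get_decennial(geography = "block",
                               state     = fips_codes[row, "state"],
                               county    = fips_codes[row, "county_code"],
                               variables = "P013001",
                               year      = 2010)
    blocks <- bind_rows(blocks, thisblock)
  }
  saveRDS(blocks, file = outfile)
  invisible(blocks)
}
Having the function also return the data invisibly means the caller could assign its result directly instead of re-reading the cache file.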
Now we can routinely process this, without having to worry about that pesky download part.1
Note that the above approach works with any programming language (Stata, SPSS, etc.). It also applies to, and should be used for, any large-ish file, and the idea may percolate through an entire job stream.
So what happens when the API (inevitably) changes, breaks, is shut down, or otherwise stops working?
Nothing.
Because we have the cached file, we are insulated from such breaks in the API. In fact, when providing our replication package, we should (if the license allows) include the cached file, while keeping the code that downloads it.
1. Note that R Markdown also allows you to declare a code chunk as cached. However, that cache is automatically invalidated when any code or text above it is modified, leading to potentially undesirable re-downloads. In this case, it may be better to work with a manually defined cache.