Introduction

Application programming interfaces (APIs) are popular, and often a very convenient way to get just the data one needs.

However, they pose reproducibility challenges:

This brief tutorial will discuss some safeguards that can be done at relatively low cost by the researcher to improve reproducibility.

The setting

We will use the St. Louis Fed’s FRED data service, frequently used by economists. In this example, we will use a single time-series – GNPCA – Real Gross National Product – but the example can be easily extended to a whole set of series.

Often, researchers will use the default landing page for a time-series, in this case, https://fred.stlouisfed.org/series/GNPCA, and then download a CSV or Excel file.

GNPCA landing page
GNPCA landing page

However, they could instead use the API that is offered. This tutorial uses R and the fredr to describe the use of that API.1

Using the API

The FRED API, as most other APIs, requires an API key - a kind of password. One typical technique is to store the API key as an environment variable, or (less securely) hard-coding it in the code (this makes the code not easily shareable). For this tutorial, obtain an API key from the FRED profile page (a login is required), then store in the file .Renviron as

# This is an example, not a valid API key!
FRED_API_KEY="78862231cc0bd7a7f3b84eb9e19d4b7e"

Save it, and restart R. The environment variable will now be available to the R package fredr.

To use the API, we load the necessary libraries.

if (!require("fredr")) install.packages("fredr")
if (!require("dplyr")) install.packages("dplyr")
if (!require("knitr")) install.packages("knitr")
if (!require("ggplot2")) install.packages("ggplot2")
library(fredr)
library(dplyr)
library(knitr)

# if no .Renviron is present, try and read FRED_API_KEY from Sys.getenv
if (Sys.getenv("FRED_API_KEY") == "") {
  stop("FRED_API_KEY environment variable not set. See https://fred.stlouisfed.org/docs/api/api_key.html")
} else {
  fredr_set_key(Sys.getenv("FRED_API_KEY"))
}

DEFAULT_DATE=as.Date("2016-01-01")
PLOT_DATE=as.Date("2012-01-01")

Simple usage

We can obtain the entire series for GNPCA as follows, focussing on the value for `r :

data_current <- fredr(
  series_id = "GNPCA"
)
names(data_current)
## [1] "date"           "series_id"      "value"          "realtime_start"
## [5] "realtime_end"
nrow(data_current)
## [1] 96
print(head(data_current))
## # A tibble: 6 × 5
##   date       series_id value realtime_start realtime_end
##   <date>     <chr>     <dbl> <date>         <date>      
## 1 1929-01-01 GNPCA     1203. 2025-09-25     2025-09-25  
## 2 1930-01-01 GNPCA     1101. 2025-09-25     2025-09-25  
## 3 1931-01-01 GNPCA     1029. 2025-09-25     2025-09-25  
## 4 1932-01-01 GNPCA      896. 2025-09-25     2025-09-25  
## 5 1933-01-01 GNPCA      884. 2025-09-25     2025-09-25  
## 6 1934-01-01 GNPCA      978. 2025-09-25     2025-09-25
current_value <- round(data_current$value[data_current$date == DEFAULT_DATE],0)
max_current_date <- max(data_current$date)
min_date <- min(data_current$date)
today <- Sys.Date()

So the value of GNPCA, as of 2025-10-15 when we ran this, install

GNPCA(2016-01-01) = 19373, as of 2025-10-15

Two observations

Two things are of note:

Clipping the time series

We pulled the entire time series, even though we only were interested in a subset. While the start date (for GNPCA, 1929-01-01) will never change, the length of the time series will change if we pull in the future, or if we had pulled in the past. This might affect the rest of the code. So a good practice is to define precise start and end dates.

# Set some parameters we want to re-use
# - the date range we want
DATE_START <- as.Date("2000-01-01")
DATE_END <- as.Date("2016-01-01")

data_clipped <- fredr(
  series_id = "GNPCA",
  observation_start = DATE_START,
  observation_end = DATE_END
)
nrow(data_clipped)
## [1] 17
print(head(data_clipped))
## # A tibble: 6 × 5
##   date       series_id  value realtime_start realtime_end
##   <date>     <chr>      <dbl> <date>         <date>      
## 1 2000-01-01 GNPCA     14145. 2025-10-15     2025-10-15  
## 2 2001-01-01 GNPCA     14295. 2025-10-15     2025-10-15  
## 3 2002-01-01 GNPCA     14530. 2025-10-15     2025-10-15  
## 4 2003-01-01 GNPCA     14949. 2025-10-15     2025-10-15  
## 5 2004-01-01 GNPCA     15543. 2025-10-15     2025-10-15  
## 6 2005-01-01 GNPCA     16075. 2025-10-15     2025-10-15
clipped_value <- round(data_clipped$value[data_clipped$date == DEFAULT_DATE],0)
max_clipped_date <- max(data_clipped$date)

Thus, if we are primarily interested in the time series between, say, 2000-01-01 and 2016-01-01, whenever we pull the data in the future, the length of the time series will be the same: nrow(data_clipped) rows. Note that this should not change the value obtained for 2016-01-01:

GNPCA(2016-01-01) = 19373, when clipped between 2000-01-01 and 2016-01-01

but it would affect other (naively computed) values such as the mean of the time-series, or a computed linear trend.

Revisions

The second issue is that the measures of GNP, both specific data points, as well as occassionally the entire time series, are revised by the Bureau of Economic Analysis. At the time of writing this in 2025, GNPCA was expressed in “Billions of Chained 2017 Dollars”. Obviously, prior to 2017, it would have been expressed differently. This also leads to revisions of historical values.

The FRED API (although not all other APIs), allows to extract data with an as-if date, which they call a vintage. Thus, suppose we had pulled the time series as of a specific date:

VINTAGE <- as.Date("2017-06-01")

data_vintage <- fredr(
  series_id = "GNPCA",
  observation_start = DATE_START,
  observation_end = DATE_END,
  vintage_dates = VINTAGE
)
nrow(data_vintage)
## [1] 17
print(head(data_vintage))
## # A tibble: 6 × 5
##   date       series_id  value realtime_start realtime_end
##   <date>     <chr>      <dbl> <date>         <date>      
## 1 2000-01-01 GNPCA     12609. 2017-06-01     2017-06-01  
## 2 2001-01-01 GNPCA     12748. 2017-06-01     2017-06-01  
## 3 2002-01-01 GNPCA     12970. 2017-06-01     2017-06-01  
## 4 2003-01-01 GNPCA     13352. 2017-06-01     2017-06-01  
## 5 2004-01-01 GNPCA     13877. 2017-06-01     2017-06-01  
## 6 2005-01-01 GNPCA     14338. 2017-06-01     2017-06-01
vintaged_value <- round(data_vintage$value[data_vintage$date == DEFAULT_DATE],0)

So what does the value for GNPCA look like as of 2017-06-01?

As of 2025-10-15, the two values for GNPCA that we have obtained are:

  • 19373, when not specifying a vintage, as of 2025-10-15
  • 16835, when specifying vintage 2017-06-01

That is a substantial difference in absolute value! In part, this is due to the rebase-lining of the time series, in part this is due to revisions of the annual value as additional data becomes available.

Pulling Multiple Vintages for Comparison

Now let’s see how long this can matter, by obtaining a number of additional vintages.

# - for testing, other as-of dates
ALTVINTAGES <- as.Date(c("2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01"))
# Combine all vintage dates
all_vintages <- c(VINTAGE, ALTVINTAGES)

# Pull data for each vintage
vintage_data_list <- lapply(all_vintages, function(vintage) {
  data <- fredr(
    series_id = "GNPCA",
    observation_start = DATE_START,
    observation_end = DATE_END,
    vintage_dates = vintage
  )
  data$vintage <- as.character(vintage)
  return(data)
})

# Combine all vintage data
vintage_comparison <- do.call(rbind, vintage_data_list)

# Reshape for comparison
vintage_wide <- vintage_comparison %>%
  select(date, value, vintage) %>%
  filter(date == DEFAULT_DATE) 

Comparison of GNPCA values across different vintages

Focussing on the value for 2016-01-01, we see how the value has changed over time:

A tibble: 7 × 3

date value vintage

1 2016-01-01 16835. 2017-06-01 2 2016-01-01 16879. 2018-01-01 3 2016-01-01 17868. 2019-01-01 4 2016-01-01 17902. 2020-01-01 5 2016-01-01 17955. 2021-01-01 6 2016-01-01 17902. 2022-01-01 7 2016-01-01 17902. 2023-01-01

date value vintage
2016-01-01 16835.20 2017-06-01
2016-01-01 16879.01 2018-01-01
2016-01-01 17867.77 2019-01-01
2016-01-01 17902.23 2020-01-01
2016-01-01 17955.44 2021-01-01
2016-01-01 17901.89 2022-01-01
2016-01-01 17901.89 2023-01-01

The effect on the time series is shown in the following graph (focussing only on the period 2012-2016 for clarity):

library(ggplot2)
vintage_comparison %>% 
  filter(date > PLOT_DATE) %>%
  ggplot(aes(x = date, y = value, color = vintage)) +
  geom_line() +
  labs(title = "GNPCA over time for different vintages",
       x = "Date",
       y = "GNPCA (Billions of Chained 2017 Dollars)",
       color = "Vintage Date") +
  theme_minimal()

Data Persistence Strategy

APIs in general have one additional “feature”: At some point, they may break, because the hosting institution makes decisions that affects its availability. While the above sections show how data can be obtained that precisely reflect the intended range and as-of date, they cannot compensate for the disappearance or breaking changes to an API.

The solution is to save the data pulled through the API as an intermediate dataset (“cache”) upon first use, and henceforth use the cached data. If redistribution is permissible by the license (check!), this also allows to provide future users with the same data, in case that the API is deprecated and won’t work in the future.

Example implementation of caching

# Create directories if they don't exist
dir.create("data", showWarnings = FALSE)
dir.create("data/fred", showWarnings = FALSE)

# Check if file exists
if (file.exists("data/fred/fred_gnpca.rds")) {
  fred_data <- readRDS("data/fred/fred_gnpca.rds")
  # get the vintage id from the file
  VINTAGE_READ <- max(as.Date(fred_data$realtime_start))
  message(NOTE, "Re-using existing file with vintage =", as.character(VINTAGE_READ),"\n")
} else {
  # Code if the file does not exist
  # You could do the full API pull
  # conditional on the intermediate
  # file NOT being there
  message(NOTE, "Reading in data from FRED API with vintage =", as.character(VINTAGE), "\n")
  fred_data <- fredr(
    series_id = "GNPCA",
    observation_start = DATE_START,
    observation_end = DATE_END,
    vintage_dates = VINTAGE
  )
  saveRDS(fred_data, "data/fred/fred_gnpca.rds")
}
## README ::::Re-using existing file with vintage =2017-06-01

When first run, the code will output

README ::::Reading in data from FRED API with vintage =2017-06-01

but subsequent runs will show the output

README :::: Re-using existing file with vintage =2017-06-01

Impressum


  1. For a version using Stata, see this other document. Other interfaces include MATLAB and of course Python.↩︎