Application programming interfaces (APIs) are popular, and often a very convenient way to get just the data one needs.
However, they pose reproducibility challenges: the data behind an API can be revised or disappear, and repeated pulls need not return the same results.
This brief tutorial discusses some safeguards that a researcher can implement at relatively low cost to improve reproducibility.
We will use the St. Louis Fed’s FRED data service, frequently used by
economists. In this example, we will use a single time-series –
GNPCA – Real Gross National Product – but the example can
be easily extended to a whole set of series.
Often, researchers will use the default landing page for a time-series, in this case, https://fred.stlouisfed.org/series/GNPCA, and then download a CSV or Excel file.
However, they could instead use the API that is offered. This
tutorial uses R and the fredr package
to describe the use of that API.1
The FRED API, like most other APIs, requires an API key - a kind of
password. One typical technique is to store the API key as an
environment variable; a less secure alternative is to hard-code it in the code,
which makes the code not easily shareable. For this tutorial, obtain an
API key from the FRED profile
page (a login is required), then store it in the file
.Renviron as
# This is an example, not a valid API key!
FRED_API_KEY="78862231cc0bd7a7f3b84eb9e19d4b7e"
Save it, and restart R. The environment variable will now be
available to the R package fredr.
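If you prefer not to restart R, the `.Renviron` file can also be re-read into the current session. A minimal sketch, assuming the file lives in your home directory:

```r
# Re-read ~/.Renviron into the current session without restarting R
readRenviron("~/.Renviron")
# The key should now be visible to fredr
Sys.getenv("FRED_API_KEY")
```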
To use the API, we load the necessary libraries.
if (!require("fredr")) install.packages("fredr")
if (!require("dplyr")) install.packages("dplyr")
if (!require("knitr")) install.packages("knitr")
if (!require("ggplot2")) install.packages("ggplot2")
library(fredr)
library(dplyr)
library(knitr)
# Stop if FRED_API_KEY is not set in the environment (e.g., via .Renviron)
if (Sys.getenv("FRED_API_KEY") == "") {
stop("FRED_API_KEY environment variable not set. See https://fred.stlouisfed.org/docs/api/api_key.html")
} else {
fredr_set_key(Sys.getenv("FRED_API_KEY"))
}
DEFAULT_DATE <- as.Date("2016-01-01")
PLOT_DATE <- as.Date("2012-01-01")
We can obtain the entire series for GNPCA as follows,
focussing on the value for 2016-01-01 (DEFAULT_DATE):
data_current <- fredr(
series_id = "GNPCA"
)
names(data_current)
## [1] "date" "series_id" "value" "realtime_start"
## [5] "realtime_end"
nrow(data_current)
## [1] 96
print(head(data_current))
## # A tibble: 6 × 5
## date series_id value realtime_start realtime_end
## <date> <chr> <dbl> <date> <date>
## 1 1929-01-01 GNPCA 1203. 2025-09-25 2025-09-25
## 2 1930-01-01 GNPCA 1101. 2025-09-25 2025-09-25
## 3 1931-01-01 GNPCA 1029. 2025-09-25 2025-09-25
## 4 1932-01-01 GNPCA 896. 2025-09-25 2025-09-25
## 5 1933-01-01 GNPCA 884. 2025-09-25 2025-09-25
## 6 1934-01-01 GNPCA 978. 2025-09-25 2025-09-25
current_value <- round(data_current$value[data_current$date == DEFAULT_DATE],0)
max_current_date <- max(data_current$date)
min_date <- min(data_current$date)
today <- Sys.Date()
So the value of GNPCA, as of 2025-10-15 when we ran this, is
GNPCA(2016-01-01) = 19373.
Two things are of note:
We pulled the entire time series, even though we were only interested in a subset. While the start date (for GNPCA, 1929-01-01) will never change, the length of the time series will change if we pull in the future, or would have been different had we pulled in the past. This might affect the rest of the code. So a good practice is to define precise start and end dates.
# Set some parameters we want to re-use
# - the date range we want
DATE_START <- as.Date("2000-01-01")
DATE_END <- as.Date("2016-01-01")
data_clipped <- fredr(
series_id = "GNPCA",
observation_start = DATE_START,
observation_end = DATE_END
)
nrow(data_clipped)
## [1] 17
print(head(data_clipped))
## # A tibble: 6 × 5
## date series_id value realtime_start realtime_end
## <date> <chr> <dbl> <date> <date>
## 1 2000-01-01 GNPCA 14145. 2025-10-15 2025-10-15
## 2 2001-01-01 GNPCA 14295. 2025-10-15 2025-10-15
## 3 2002-01-01 GNPCA 14530. 2025-10-15 2025-10-15
## 4 2003-01-01 GNPCA 14949. 2025-10-15 2025-10-15
## 5 2004-01-01 GNPCA 15543. 2025-10-15 2025-10-15
## 6 2005-01-01 GNPCA 16075. 2025-10-15 2025-10-15
clipped_value <- round(data_clipped$value[data_clipped$date == DEFAULT_DATE],0)
max_clipped_date <- max(data_clipped$date)
Thus, if we are primarily interested in the time series between, say,
2000-01-01 and 2016-01-01, whenever we pull the data in the future, the
length of the time series will be the same:
17 rows. Note that this should not change
the value obtained for 2016-01-01:
GNPCA(2016-01-01) = 19373, when clipped between 2000-01-01 and 2016-01-01
but it would affect other (naively computed) values such as the mean of the time-series, or a computed linear trend.
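As a quick sketch (reusing the data_current and data_clipped objects from above), the mean and a simple linear trend illustrate how such naively computed statistics depend on the pulled range:

```r
# The mean over the full series changes whenever new years are appended;
# over the clipped series, it is stable across future pulls.
mean_full    <- mean(data_current$value)
mean_clipped <- mean(data_clipped$value)

# A naively fitted linear trend is affected in the same way
trend_full    <- coef(lm(value ~ date, data = data_current))[["date"]]
trend_clipped <- coef(lm(value ~ date, data = data_clipped))[["date"]]

c(mean_full = mean_full, mean_clipped = mean_clipped)
```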
The second issue is that the measures of GNP, both specific data
points, as well as occasionally the entire time series, are revised by
the Bureau of Economic Analysis. At the time of writing this in 2025,
GNPCA was expressed in “Billions of Chained 2017 Dollars”.
Obviously, prior to 2017, it would have been expressed differently. This
also leads to revisions of historical values.
The FRED API (though not all other APIs) allows one to extract data with an as-of date, which FRED calls a vintage. Thus, suppose we had pulled the time series as of a specific date:
VINTAGE <- as.Date("2017-06-01")
data_vintage <- fredr(
series_id = "GNPCA",
observation_start = DATE_START,
observation_end = DATE_END,
vintage_dates = VINTAGE
)
nrow(data_vintage)
## [1] 17
print(head(data_vintage))
## # A tibble: 6 × 5
## date series_id value realtime_start realtime_end
## <date> <chr> <dbl> <date> <date>
## 1 2000-01-01 GNPCA 12609. 2017-06-01 2017-06-01
## 2 2001-01-01 GNPCA 12748. 2017-06-01 2017-06-01
## 3 2002-01-01 GNPCA 12970. 2017-06-01 2017-06-01
## 4 2003-01-01 GNPCA 13352. 2017-06-01 2017-06-01
## 5 2004-01-01 GNPCA 13877. 2017-06-01 2017-06-01
## 6 2005-01-01 GNPCA 14338. 2017-06-01 2017-06-01
vintaged_value <- round(data_vintage$value[data_vintage$date == DEFAULT_DATE],0)
So what does the value for GNPCA look like as of
2017-06-01?
As of 2025-10-15, the two values for GNPCA(2016-01-01) that we have obtained are 19373 (current) and 16835 (vintage 2017-06-01).
That is a substantial difference in absolute value! In part, this is due to the re-baselining of the time series; in part, it is due to revisions of the annual value as additional data become available.
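The size of the gap can be computed directly from the two values obtained above (current_value and vintaged_value):

```r
# Relative difference between the current value and the 2017-06-01 vintage
pct_diff <- 100 * (current_value - vintaged_value) / vintaged_value
round(pct_diff, 1)  # about 15 percent at the time of writing
```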
Now let’s see for how long this can matter, by obtaining a number of additional vintages.
# - for testing, other as-of dates
ALTVINTAGES <- as.Date(c("2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01"))
# Combine all vintage dates
all_vintages <- c(VINTAGE, ALTVINTAGES)
# Pull data for each vintage
vintage_data_list <- lapply(all_vintages, function(vintage) {
data <- fredr(
series_id = "GNPCA",
observation_start = DATE_START,
observation_end = DATE_END,
vintage_dates = vintage
)
data$vintage <- as.character(vintage)
return(data)
})
# Combine all vintage data
vintage_comparison <- do.call(rbind, vintage_data_list)
# Reshape for comparison
vintage_wide <- vintage_comparison %>%
select(date, value, vintage) %>%
filter(date == DEFAULT_DATE)
Focussing on the value for 2016-01-01, we see how the value has changed over time:
| date | value | vintage |
|---|---|---|
| 2016-01-01 | 16835.20 | 2017-06-01 |
| 2016-01-01 | 16879.01 | 2018-01-01 |
| 2016-01-01 | 17867.77 | 2019-01-01 |
| 2016-01-01 | 17902.23 | 2020-01-01 |
| 2016-01-01 | 17955.44 | 2021-01-01 |
| 2016-01-01 | 17901.89 | 2022-01-01 |
| 2016-01-01 | 17901.89 | 2023-01-01 |
The effect on the time series is shown in the following graph (focussing only on the period 2012-2016 for clarity):
library(ggplot2)
vintage_comparison %>%
filter(date > PLOT_DATE) %>%
ggplot(aes(x = date, y = value, color = vintage)) +
geom_line() +
labs(title = "GNPCA over time for different vintages",
x = "Date",
y = "GNPCA (Billions of Chained 2017 Dollars)",
color = "Vintage Date") +
theme_minimal()
APIs in general have one additional “feature”: at some point, they may break, because the hosting institution makes decisions that affect their availability. While the above sections show how to obtain data that precisely reflects the intended range and as-of date, they cannot compensate for the disappearance of, or breaking changes to, an API.
The solution is to save the data pulled through the API as an intermediate dataset (a “cache”) upon first use, and henceforth use the cached data. If redistribution is permitted by the license (check!), this also allows providing future users with the same data, in case the API is deprecated and no longer works in the future.
# Create directories if they don't exist
dir.create("data", showWarnings = FALSE)
dir.create("data/fred", showWarnings = FALSE)
# Prefix used in the messages below
NOTE <- "README :::: "
# Check if file exists
if (file.exists("data/fred/fred_gnpca.rds")) {
fred_data <- readRDS("data/fred/fred_gnpca.rds")
# get the vintage id from the file
VINTAGE_READ <- max(as.Date(fred_data$realtime_start))
message(NOTE, "Re-using existing file with vintage =", as.character(VINTAGE_READ),"\n")
} else {
# Code if the file does not exist
# You could do the full API pull
# conditional on the intermediate
# file NOT being there
message(NOTE, "Reading in data from FRED API with vintage =", as.character(VINTAGE), "\n")
fred_data <- fredr(
series_id = "GNPCA",
observation_start = DATE_START,
observation_end = DATE_END,
vintage_dates = VINTAGE
)
saveRDS(fred_data, "data/fred/fred_gnpca.rds")
}
## README :::: Re-using existing file with vintage =2017-06-01
When first run, the code will output
README :::: Reading in data from FRED API with vintage =2017-06-01
but subsequent runs will show the output
README :::: Re-using existing file with vintage =2017-06-01
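One caveat of this caching pattern is that the cached file silently wins even if VINTAGE is later changed in the code. A small safeguard (a sketch, reusing the VINTAGE_READ variable from the code above) is to compare the cached vintage against the requested one:

```r
# Warn if the cached file was pulled with a different vintage than requested
if (exists("VINTAGE_READ") && VINTAGE_READ != VINTAGE) {
  warning("Cached vintage (", as.character(VINTAGE_READ),
          ") differs from requested vintage (", as.character(VINTAGE),
          "); delete data/fred/fred_gnpca.rds to re-pull.")
}
```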
For a version using Stata, see this other document. Other interfaces include MATLAB and of course Python.↩︎