Time | January 19, 2025 |
---|---|
8:00 | Breakfast |
9:00 | Introduction |
10:00 | Reproducible practices, A template README |
10:50 | Coffee break |
11:00 | Data provenance, data citations |
12:00 | Lunch Break |
2025-01-19
One of the most vexing issues is how to cite data. Many researchers have never been trained in data citation, and get it wrong.
Many authors initially neglect to add data citations, or do not know how to add a data citation. Often, we see authors cite papers with supplementary data, but not databases or other data:
We use data acquired from the NHL, dates of power outages collected by Tremblay et al (2018), augmented with information on the language and grammar skills of hockey players provided by the Ethnologue database.
Note the absence of citations for the NHL and Ethnologue data: three datasets are used, but only one is cited in some fashion. A complete version cites all three:
We use data acquired from the NHL (NHL, 2018), dates of power outages collected by Tremblay et al (2018, 2019), augmented with information on the language and grammar skills of hockey players provided by the Ethnologue database (Eberhard et al, 2019).
with the reference list having the following entries:
The Data Citation Principles note that (emphasis added):
Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.
The AEA follows the Chicago Manual of Style (CMOS), with several additions on the AEA website.
One of the criteria for a useful citation is conveying authority and permanence (CMOS):
Electronic content presented without formal ties to a publisher or sponsoring body has the authority equivalent to that of unpublished or self-published material in other media.
They also note that
Authors should note that anything posted on the internet is “published” in the sense of copyright and must be treated as such for the purposes of complete citation and clearance of permissions, if relevant.
When citing information from websites, including data downloaded from websites, use the general website citation style for data:
Google. 2017. “Privacy Policy.” Privacy & Terms. Last modified April 17, 2017. https://www.google.com/policies/privacy/.
Yale University. n.d. “About Yale: Yale Facts.” Accessed May 1, 2017. https://www.yale.edu/about-yale/yale-facts.
Note that this does NOT apply when the data have a permanent URL, a DOI, or a suggested citation!
Many researchers now use Github to store code or documents. How should Github repositories (or, for that matter, Gitlab or Bitbucket repositories) be cited?
Any online repository can be deleted by its owner at any time. In that sense, it is no more or less robust than a website, and the same rules apply:
Vilhuber, Lars, Connolly, Marie, Koren, Miklós, Llull, Joan, & Morrow, Peter. (2020). “A template README for social science replication packages.” Github. https://github.com/social-science-data-editors/template_README/. Accessed 2021-07-01.
However, online repositories do provide the ability to make releases, providing a better way to pin changes made over time. While this does not change the fundamental treatment as a website, it is a more robust reference:
Vilhuber, Lars, Connolly, Marie, Koren, Miklós, Llull, Joan, & Morrow, Peter. (2020). “A template README for social science replication packages.” Github. Release v1.0.0. https://github.com/social-science-data-editors/template_README/releases/tag/v1.0.0. Accessed 2021-07-01.
However, Github is currently unique among online repositories in enabling an automatic integration with Zenodo (making code citable).
Zenodo-preserved versions are no longer subject to arbitrary deletion, and get a DOI:
Vilhuber, Lars, Connolly, Marie, Koren, Miklós, Llull, Joan, & Morrow, Peter. (2020). A template README for social science replication packages (Version v1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4319999
Of note: while Github is unique in its automatic integration, any release of an online code repository can be manually preserved on Zenodo, then cited in the above fashion.
What is a Digital Object Identifier?
A DOI (e.g. 10.25490/a97f-egyk or https://doi.org/10.25490/a97f-egyk) is a permanent identifier, and comes with a commitment by the issuer to preserve the underlying object and its landing page.
It redirects to a URL, which is what your browser shows in the address bar.
If a DOI is shown on the website, you should NEVER use the URL to refer to the object; use the DOI.
A DOI never shows in your address bar, except when it fails to resolve (try https://doi.org/10.25490/a97f-egyk-gobbledygook).
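Because the same DOI can be written bare, with an `http://` prefix, or with an `https://` prefix, it can help to normalize it before putting it in a reference list. The following is a minimal sketch, not an official resolver client: the function name `normalize_doi` and the regular expression are assumptions made for illustration, and the pattern only covers the common `10.<registrant>/<suffix>` shape.

```python
import re

# Assumed pattern for the common DOI shape "10.<registrant>/<suffix>".
# It is a heuristic, not a complete DOI validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(text: str) -> str:
    """Return the https://doi.org/ form of the first DOI found in `text`."""
    match = DOI_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no DOI found in {text!r}")
    # Heuristic: drop punctuation that often trails a DOI at the end of a sentence.
    doi = match.group(0).rstrip(".,;)")
    return f"https://doi.org/{doi}"


# Both spellings used in this document map to the same canonical form.
print(normalize_doi("10.25490/a97f-egyk"))
print(normalize_doi("http://doi.org/10.5281/zenodo.4319999"))
```

The sketch only rewrites the string; it does not check that the DOI actually resolves.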
CMOS guidance for online databases:
NASA/IPAC Extragalactic Database (object name IRAS F00400+4059; accessed April 6, 2016). http://ned.ipac.caltech.edu/.
which would be cited in the text as NASA/IPAC Extragalactic Database.
The CMOS provides examples of how to cite supplementary materials that are attached to a specific article:
Suárez-Rodríguez, M. and C. Macías Garcia. 2014. “There Is No Such a Thing as a Free Cigarette: Lining Nests with Discarded Butts Brings Short-Term Benefits, but Causes Toxic Damage.” Journal of Evolutionary Biology 27, no. 12 (December 2014): 2719–26, https://doi.org/10.1111/jeb.12531, data deposited at Dryad Digital Repository, https://doi.org/10.5061/dryad.4t5rt.
Romer, Christina D., and David H. Romer. 2010. “Replication data for: The Macroeconomic Effects of Tax Changes: Estimates Based on a New Measure of Fiscal Shocks.” American Economic Association [publisher], *Inter-university Consortium for Political and Social Research [distributor]*, https://doi.org/10.3886/E112357V1
with the article also cited as:
Romer, Christina D., and David H. Romer. 2010. “The Macroeconomic Effects of Tax Changes: Estimates Based on a New Measure of Fiscal Shocks.” American Economic Review 100, no. 3 (June 2010): 763–801. https://doi.org/10.1257/aer.100.3.763.
Several data distributors already request or require citation as per their terms of use, or as a note on their websites. Often, following traditional usage, they request citation of a published article, not of the data. For example, the widely used “Penn World Tables” request that the article that introduced this version of the Penn tables be cited:
Feenstra, Robert C, Robert Inklaar, and Marcel P Timmer. 2015. “The next generation of the Penn World Table.” American Economic Review, 105(10): 3150-82.
For accurate citations as well as compliance with data citation principles, in addition to citing the Feenstra et al (2015) article as requested by the data provider, authors should also correctly cite the data:
Robert C. Feenstra, Robert Inklaar and Marcel P. Timmer. 2016. “Penn World Table 9.0.” Groningen Growth and Development Centre. https://doi.org/10.15141/S5J01T.
The key to data citations is that the creator, the name, the location, and the date last accessed for a data source should be clear. This pertains to online data, offline data, physical data, whether it is in boxes or on tapes, or in a corporate database behind a firewall.
ICPSR notes that a citation should include the following items:
Note that all but the URN would apply also for an offline database. Consider the citation of objects in archives:
Often, the creator of a dataset is an organization. The same way that an organization as a work’s author can be cited:
ISO (International Organization for Standardization). 1997. Information and Documentation—Rules for the Abbreviation of Title Words and Titles of Publications. ISO 4:1997. Paris: ISO.
an organization can be cited as the creator of a dataset:
Standard and Poor’s (S&P). 2017. Compustat-Capital IQ. S&P Global Market Intelligence.
Feenstra, Robert C, Robert Inklaar, and Marcel P Timmer. 2015. “The next generation of the Penn World Table.” American Economic Review, 105(10): 3150-82.
The distributor is the journal.
In many cases, the data are not distributed by the creator. This means the distributor takes on the role of a publisher (of a book, of data). So if using Compustat through the Wharton Research Data Services, one might cite as
Standard and Poor’s (S&P). 2017. Compustat-Capital IQ. Wharton Research Data Services. https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/sp-global-market-intelligence/
or maybe
Standard and Poor’s (S&P). 2017. “Compustat-Capital IQ.” Wharton Research Data Services. https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/sp-global-market-intelligence/
S&P Dow Jones Indices LLC, S&P 500 [SP500], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/SP500, January 24, 2020.
S&P Dow Jones Indices LLC, S&P 500, provided via Haver Analytics Data Subscription, February 24, 2018.
with hopefully the same content. Note that often, such data is subject to copyright and redistribution restrictions (see the page at FRED on SP500).
In some cases, it isn’t clear when the dataset was published, though it may be clear what time period the dataset covers. One way to address this may be by using the “n.d.” abbreviation for the date of publication:
Standard and Poor’s (S&P). n.d. Compustat-Capital IQ (1982-2017). Wharton Research Data Services. Accessed April 6, 2018. https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/sp-global-market-intelligence/
Many datasets are available
Traditional offline archives, for instance manuscript collections (CMOS)
Kallen, Horace. Papers. YIVO Institute for Jewish Research, New York.
and usage in the text as
Alvin Johnson, in a memorandum prepared sometime in 1937 (Kallen Papers, file 36), observed that …
Bloom, Nick. 2019. Confidential survey data on Cameroon business processes. Stanford Secure Access Center (file “cameroon-bloom.zip”). Stanford University.
Similar forms may be used for confidential databases when no DOI exists:
Internal Revenue Service. (YEAR). Corporate Income Tax Returns [database]. Department of Treasury, Washington DC, accessed YYYY-MM-DD.
If the same data had been accessed via a secure research data center, the reference should have instead noted that access mechanism:
Internal Revenue Service. (YEAR). Corporate Income Tax Returns [database]. Federal Research Data Centers [distributor], last accessed YYYY-MM-DD.
If multiple databases are used and combined, they should be cited (within reason) separately.
Guidance here may be:
Can and do researchers combine various extracts in different ways?
For instance, do some combine the IRS 1040 database with death records, and others merge elements from the IRS 1040 database with information returns? Then the information returns, and the 1040 file should be cited separately.
In some cases, governments publish lists of their (named) registers. For instance, Statistics Denmark provides the full list of registers at http://www.dst.dk/extranet/forskningvariabellister/Oversigt%20over%20registre.html. These can be used to craft data citations, for instance
Statistics Denmark. 2020. “Døde i Danmark (DOD, Deaths in Denmark), 1970-2019 [database]”, Danmarks Statistiks Forskningsservice, accessed YYYY-MM-DD.
where the “author” is Statistics Denmark, but the “[publisher]” is the research service of Statistics Denmark.
If a DOI exists, the formal citation generated from that DOI should be used:
Forschungsdatenzentrum der Bundesagentur für Arbeit. 2020. “Betriebs-Historik-Panel (BHP) – Version 7518 v1.” Institut für Arbeitsmarkt- und Berufsforschung (IAB). https://doi.org/10.5164/IAB.BHP7518.DE.EN.V1.
In some cases (not infrequently), access to data is through informal means. The CMOS allows for citation of such information, without inclusion in the references.
In keeping with the Data Citation Principles (DCP), we would deviate from that suggestion: we ask for inclusion in the reference list, and simply suggest using “Unpublished data” as the locator, similar to a URN:
Møller, A. P. n.d. “Data on Crocodile Sightings in Manhattan.” Unpublished data. Accessed February 29, 2019.
In some cases, the data provider (often a firm) must remain anonymous. This does not prevent citation, and the provider should be mentioned in much the same way as when there is no formal access mechanism:
Anonymous Firm. 1999. “Personnel records of windshield installers.” Unpublished data. Accessed February 29, 2000.
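The guidance in this section amounts to an order of preference for the locator at the end of a reference: a DOI if one exists, otherwise a URL with an access date, otherwise the words “Unpublished data” with an access date. The sketch below encodes that ordering. The function name `locator_text`, its parameters, and the rule that an access date is mandatory without a DOI are assumptions made for illustration.

```python
from typing import Optional


def locator_text(doi: Optional[str] = None,
                 url: Optional[str] = None,
                 accessed: Optional[str] = None) -> str:
    """Pick the locator for a data citation, preferring the most permanent one."""
    if doi:
        # A DOI is permanent, so no access date is appended here.
        return f"https://doi.org/{doi}"
    if accessed is None:
        # Assumption: without a DOI, the date last accessed must be recorded.
        raise ValueError("an access date is needed when no DOI exists")
    if url:
        return f"Accessed {accessed}. {url}"
    # No DOI and no URL: informal or confidential access.
    return f"Unpublished data. Accessed {accessed}."


print(locator_text(doi="10.5164/IAB.BHP7518.DE.EN.V1"))
print(locator_text(url="https://www.yale.edu/about-yale/yale-facts",
                   accessed="May 1, 2017"))
print(locator_text(accessed="February 29, 2000"))
```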
https://social-science-data-editors.github.io/guidance/addtl-data-citation-guidance.html#try-it-out
Time | January 19, 2025 |
---|---|
13:00 | What will you be doing in the Lab |
13:30 | Tech break |
14:00 | Command line/Git/Markdown/Version control |
15:00 | A prototypical replication report |
15:30 | Break |
16:00 | A walkthrough of the workflow |
17:00 | How to run Stata code |
18:00 | End |