Ingesting author materials
We will now ingest the authors’ materials, and run a few statistics. Typically, the materials will be on a (private) openICPSR repository. Sometimes, the materials will be at Dataverse, Zenodo, or elsewhere.
If at openICPSR, the fields Replication package URL, openICPSR alternate URL, and openICPSR Project Number will be filled.
If at Zenodo or Dataverse, the Replication package URL field will contain the DOI of the replication package, and the openICPSR alternate URL and openICPSR Project Number fields will be empty.
Note
This currently works reliably only for openICPSR. This documentation will be updated when it works for Dataverse and Zenodo as well.
Inspect the deposit
First, click on the openICPSR alternate URL (or on the Replication package URL, if it contains a DOI and the other fields are empty), and inspect the deposit.
The information may be in different locations at other repositories.
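The rule for choosing which link to open can be sketched in a few lines of Python. This is only an illustration, not part of the actual tooling: the function name `deposit_url`, the dictionary of field values, and the DOI check (a simple pattern match on the `10.NNNN/...` prefix) are all assumptions.

```python
import re
from typing import Mapping, Optional

# Heuristic (assumption): a DOI contains a "10.<digits>/<suffix>" part.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def deposit_url(fields: Mapping[str, str]) -> Optional[str]:
    """Return the URL to open for inspecting the deposit, or None if unclear."""
    alternate = (fields.get("openICPSR alternate URL") or "").strip()
    project = (fields.get("openICPSR Project Number") or "").strip()
    package = (fields.get("Replication package URL") or "").strip()

    if alternate:
        # openICPSR deposit: the alternate URL is the one to click.
        return alternate
    if package and not project and DOI_PATTERN.search(package):
        # Other repository: the package URL carries a DOI and the
        # openICPSR-specific fields are empty.
        return package
    # Anything else needs a human to look at the issue.
    return None
```

If the function returns `None`, the fields do not match either case described above, and the issue should be checked by hand.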
Note
Make a note of the size of the deposit!
Running the pipeline
You will now run what is called a Bitbucket Pipeline. Similar tools on other sites might be called Continuous integration, GitHub Actions, etc. If you have encountered these before, this will not be news to you, but it isn’t hard even if this is new to you.
Note
This is where the information about the size of the deposit matters! Choose the option that best matches the size of the deposit.
If the deposit is less than 3 GB…
If the deposit is more than 3 GB…
Note that if you choose this pipeline, certain information is not generated (Stata scan, R package scan), and you may need to add it manually. Try to avoid this pipeline if possible, and make a note in the Jira comments if you had to run it.
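The size-based choice can be written down as a small sketch. Several things here are assumptions: the threshold is taken as decimal gigabytes, a deposit of exactly 3 GB is sent to the regular pipeline (the guide does not say which applies), the big-deposit pipeline is assumed to be the “w-big populate from ICPSR” one mentioned under pipeline failures, and `REGULAR_PIPELINE` is a placeholder rather than a real pipeline name.

```python
# Assumption: "3 GB" means 3 * 10^9 bytes.
THRESHOLD_BYTES = 3 * 1000**3

# Name quoted elsewhere in this guide; assumed to be the big-deposit pipeline.
BIG_PIPELINE = "w-big populate from ICPSR"
# Placeholder only: substitute the name shown in the Bitbucket interface.
REGULAR_PIPELINE = "<regular pipeline>"


def choose_pipeline(size_bytes: int) -> str:
    """Pick a pipeline from the deposit size noted during inspection."""
    if size_bytes > THRESHOLD_BYTES:
        # The big pipeline skips the Stata scan and the R package scan.
        return BIG_PIPELINE
    return REGULAR_PIPELINE
```

Defaulting to the regular pipeline at the boundary follows the advice above to avoid the big pipeline when possible.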
Monitoring the pipeline
Possible errors for pipeline failure
Files too big
Bitbucket might complain in the Commit everything back step that
remote: Your push has been blocked because it includes at least one file that is 100 MB or larger.
Solution
Investigate which files are being captured that are too big. The list of file endings that Git should ignore is kept in the .gitignore file. Once you have figured out which files are causing the problem, you should exclude them:
in your repository, by adding them to the repository-specific .gitignore
in the template .gitignore file, by suggesting an edit. Click on this link, then choose “Edit”, and add the extension to the file (you will need a GitHub account to create a pull request).
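One way to find the offending files is to scan a local copy of the repository for anything at or above the limit. The sketch below is not part of the pipeline; the function names are made up for illustration, and the limit is assumed to be 100 decimal megabytes, which is slightly stricter than a binary interpretation of the Bitbucket message.

```python
import os
from pathlib import Path

# Assumption: "100 MB" is read as 100 * 10^6 bytes (the stricter reading).
LIMIT_BYTES = 100 * 1000**2


def oversized_files(root, limit=LIMIT_BYTES):
    """Return (path, size) pairs for files at or above `limit`, largest first."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own storage directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue
            size = path.stat().st_size
            if size >= limit:
                found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)


def gitignore_patterns(files):
    """Suggest one '*.ext' pattern per file extension among `files`."""
    suffixes = {path.suffix.lower() for path, _ in files if path.suffix}
    return [f"*{suffix}" for suffix in sorted(suffixes)]
```

Calling `oversized_files(".")` from the top of the cloned repository lists the candidates, and `gitignore_patterns(...)` turns them into lines that could be added to .gitignore. Review the suggestions before committing them: ignoring an entire extension also hides small files of that type.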
Memory or CPU usage too high
If your pipeline fails in the Stata step, click on the failed step, and scroll to the error message. If you see this:
./automations/10_run_stata_scanner.sh: line 64: 122 Killed stata-mp -b do ../PII_stata_scan.do
It is likely that the PII scan failed because the in-memory dataset is too large (too much memory was used, and the process was killed). Try running the pipeline again with the “w-big populate from ICPSR” pipeline (see above).
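If you have saved the log of the failed step as text, a short script can spot this kind of message. The sketch below assumes the log lines follow the same shape as the example above (script name, line number, process ID, the word Killed, then the command); the function name `find_killed_command` is hypothetical. A match only shows that a process was killed; it does not prove that memory was the cause.

```python
import re
from typing import Optional

# Matches lines shaped like the example in this guide:
#   <script>: line <n>: <pid> Killed <command>
KILLED_LINE = re.compile(
    r"^(?P<script>\S+): line (?P<lineno>\d+):\s+(?P<pid>\d+)\s+Killed\s+(?P<command>.+)$",
    re.MULTILINE,
)


def find_killed_command(log_text: str) -> Optional[str]:
    """Return the first command reported as Killed in the log, or None."""
    match = KILLED_LINE.search(log_text)
    return match.group("command").strip() if match else None
```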