# Bitbucket Pipelines Configuration

> **Warning:** This documentation was AI-generated by Claude Code and should be reviewed for accuracy. Please report any errors or inconsistencies.

**File:** `bitbucket-pipelines.yml`
## Purpose

This file defines the CI/CD pipelines for automated replication package analysis and verification. It orchestrates downloading deposits from various repositories, analyzing code and data, scanning for dependencies and PII, and generating comprehensive reports.

## Custom Pipelines

The configuration defines several custom pipelines that can be manually triggered via Bitbucket's web interface.
### 1. `1-populate-from-icpsr`

**Purpose:** Full analysis pipeline with parallel processing for optimal performance.

**Parameters:**

- `openICPSRID` - openICPSR project ID (or read from `config.yml`)
- `jiraticket` - JIRA ticket identifier
- `ZenodoID` - Zenodo deposit ID (alternative to openICPSR)
- `ProcessStata` - Enable/disable Stata scanning (default: yes)
- `ProcessR` - Enable/disable R scanning (default: yes)
- `ProcessPython` - Enable/disable Python scanning (default: yes)
- `ProcessJulia` - Enable/disable Julia scanning (default: no)
- `ProcessPii` - Enable/disable PII scanning (default: yes)
- `SkipProcessing` - Skip all processing steps (default: no)
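In `bitbucket-pipelines.yml`, parameters like these are declared as custom pipeline variables. The fragment below is an illustrative sketch of that declaration pattern, not a copy of the actual file; only the variable names and defaults come from the list above.

```yaml
pipelines:
  custom:
    1-populate-from-icpsr:
      - variables:
          - name: openICPSRID
          - name: jiraticket
          - name: ZenodoID
          - name: ProcessStata
            default: "yes"
          - name: ProcessJulia
            default: "no"
          - name: SkipProcessing
            default: "no"
      # ... steps follow
```

Bitbucket prompts for these values in the "Run pipeline" dialog; variables without a `default` must be filled in manually.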
**Pipeline Steps:**

#### Step 1: Download

**Image:** `python:3.12`

- Downloads deposit from openICPSR or Zenodo
- Unpacks ZIP archives
- Lists data and program files
- Creates manifests with checksums
- Checks for ZIP files, duplicates, zero-byte files
- Validates file paths
- Compares manifests

**Artifacts:** `generated/**`, `cache/**`
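The checksum manifest and zero-byte check in Step 1 can be sketched as follows. This is an illustrative reconstruction, not the actual `04_create_manifest.sh`; the directory and file names are hypothetical.

```shell
# Sketch of Step 1's manifest creation (names illustrative).
# Build a tiny sample deposit, then record a SHA-256 checksum per file.
mkdir -p deposit/data generated
printf 'id,value\n1,2\n' > deposit/data/table.csv
: > deposit/data/empty.dat                      # zero-byte file

# One checksum line per file, sorted by path for stable diffs.
find deposit -type f -exec sha256sum {} + | sort -k2 > generated/manifest.sha256

# Zero-byte files are flagged separately.
find deposit -type f -empty > generated/zero-byte-files.txt
```

Sorting by path makes two manifests directly comparable with `diff`, which is how a re-download can be verified against the original.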
#### Step 2: Check downloads

- Verifies that the deposit ZIP file exists in the cache
- Fails the pipeline if the download was unsuccessful

#### Step 3: Parallel Processing

Runs multiple scanners concurrently for maximum efficiency:
**3a. Run Stata parser**

**Image:** `larsvilhuber/bitbucket-stata:latest`

- Scans Stata code for package dependencies
- Generates `candidatepackages.csv`

**Artifacts:** `generated/**`
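One common way to surface candidate Stata packages is to scan `.do` files for `ssc install` lines. This is a rough sketch of that idea, not the actual parser; file names and the output file are illustrative.

```shell
# Rough sketch of candidate-package detection (not the real parser):
# collect package names from "ssc install <pkg>" lines in .do files.
mkdir -p code generated
cat > code/main.do <<'EOF'
ssc install estout
ssc install reghdfe, replace
use data/table, clear
EOF

grep -rhoE 'ssc install [A-Za-z0-9_]+' code --include='*.do' \
  | awk '{print $3}' | sort -u > generated/candidatepackages.txt
```

The real scanner is more thorough (it also catches `net install` and bare commands that imply a package), but the grep-and-dedupe shape is the same.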
**3b. Run Stata PII scanner**

**Image:** `larsvilhuber/bitbucket-stata:latest`

- Scans for Personally Identifiable Information
- Generates PII reports

**Artifacts:** `generated/**`

**3c. Run R parser**

**Image:** `aeadataeditor/verification-r:latest`

- Checks R dependencies and data files
- Generates `r-deps.csv` and `r-data-checks.csv`

**3d. Run Python parser**

**Image:** `python:3.12`

- Scans Python code for package dependencies
- Generates `python-deps.csv`

**Artifacts:** `generated/**`

**3e. Run Julia parser**

**Image:** `julia:latest`

- Scans Julia code for package dependencies

**Artifacts:** `generated/**`

**3f. Count lines and comments**

**Image:** `aldanial/cloc`

- Counts lines of code by language
- Generates code statistics

**Artifacts:** `generated/**`
#### Step 4: Add info to REPLICATION.md

**Image:** `python:3.12`

- Consolidates all findings into the report
- Runs `24_amend_report.sh`

**Artifacts:** `generated/**`

#### Step 5: Commit everything back

**Image:** `python:3.12`

- Re-extracts the deposit
- Commits analyzed code to the repository
- Cleans up the deposit directory
- Replaces report sections
- Updates `config.yml`
- Pushes to Git with tags

**Use Case:** Standard replication package analysis with optimal performance.
### 2. `w-big-populate-from-icpsr`

**Purpose:** Single-step pipeline for large deposits requiring more resources.

**Parameters:**

- `openICPSRID` - openICPSR project ID
- `ZenodoID` - Zenodo deposit ID
- `jiraticket` - JIRA ticket identifier

**Pipeline Steps:**

#### Step: Download and commit

**Image:** `python:3.12`
**Size:** `2x` (double resources)

- Installs `cloc` for line counting
- Downloads and analyzes the deposit sequentially
- All processing in a single step (no parallelization)
- Commits and pushes results

**Key Differences from Pipeline 1:**

- All processing is sequential (slower but more reliable)
- Double resources (2x memory and CPU)
- No artifact passing between steps
- Includes `cloc` installation

**Use Case:** Large deposits that may time out or fail with standard resources.
### 3. `z-run-stata`

**Purpose:** Execute Stata replication code.

**Parameters:**

- `openICPSRID` - Project ID
- `MainFile` - Main script to execute (default: main.do)
- `RunCommand` - Command to run (default: run.sh)

**Pipeline Steps:**

#### Step: Run Stata code

**Image:** `larsvilhuber/bitbucket-stata:latest`
**Size:** `2x`

- Downloads the deposit
- Executes replication code
- Commits outputs
- Pushes results

**Use Case:** Running actual Stata replication code (not just analysis).
### 4. `z-run-any-big`

**Purpose:** Execute replication code with maximum resources.

**Parameters:**

- `openICPSRID` - Project ID
- `MainFile` - Main script to execute
- `RunCommand` - Command to run (default: run.sh)

**Pipeline Steps:**

#### Step: Run R or Stata code

**Image:** `larsvilhuber/bitbucket-stata:latest`
**Size:** `8x` (maximum available)

- Downloads the deposit
- Executes replication code
- Commits outputs
- Pushes results

**Use Case:** Large, resource-intensive replications requiring maximum compute.
### 5. `2-merge-report`

**Purpose:** Combine Part A and Part B of a split report.

**Parameters:**

- `jiraticket` - JIRA ticket identifier

**Pipeline Steps:**

- Runs `50_merge-parts.sh`
- Pushes the merged report

**Use Case:** After separate completion of report sections.
### 6. `3-split-report`

**Purpose:** Split REPLICATION.md into Part A and Part B.

**Parameters:**

- `jiraticket` - JIRA ticket identifier

**Pipeline Steps:**

- Runs `51_split-parts.sh`
- Pushes the split reports

**Use Case:** When the report needs to be worked on in sections.
### 7. `4-refresh-tools`

**Purpose:** Update pipeline tools from the master template.

**Pipeline Steps:**

- Downloads the latest `update_tools.sh`
- Executes the update
- Pushes the updated tools

**Use Case:** Keeping tools synchronized with the template repository.
### 8. `5-rename-directory`

**Purpose:** Rename a deposit directory in the repository.

**Parameters:**

- `oldName` - Current directory name
- `newName` - New directory name
- `jiraticket` - JIRA ticket identifier

**Pipeline Steps:**

- Runs `git mv $oldName $newName`
- Commits with `[skip ci]` to avoid triggering pipelines
- Pushes changes

**Use Case:** Correcting directory names or reorganizing deposits.
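The rename step amounts to a `git mv` followed by a `[skip ci]` commit. A self-contained demonstration in a throwaway repository (directory names and commit author are illustrative):

```shell
# Demonstrate the rename pattern in a throwaway local repository.
mkdir renamedemo && cd renamedemo
git init -q .
mkdir olddir && echo data > olddir/file.txt
git add . && git -c user.name=ci -c user.email=ci@example.com commit -qm "init"

oldName=olddir
newName=newdir
git mv "$oldName" "$newName"   # stages the rename in one operation
git -c user.name=ci -c user.email=ci@example.com \
    commit -qm "[skip ci] Rename $oldName to $newName"
cd ..
```

`git mv` keeps the file's history attached to the new path, and the `[skip ci]` marker in the message stops Bitbucket from launching another pipeline run on the resulting push.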
### 9. `6-convert-eps-pdf`

**Purpose:** Convert EPS and PDF graphics to PNG format.

**Parameters:**

- `path` - Directory containing graphics
- `jiraticket` - JIRA ticket identifier
- `ProcessEPS` - Convert EPS files (default: yes)
- `ProcessPDF` - Convert PDF files (default: no)
- `DockerImg` - Docker image to use (default: dpokidov/imagemagick)

**Pipeline Steps:**

- Uses Docker-in-Docker (`services: docker`)
- Mounts the current directory into the container
- Runs `52_convert_eps_pdf.sh`
- Commits converted files
- Pushes with `[skip ci]`

**Use Case:** Converting graphics for better diff visualization or compatibility.
### 10. `7-download-box-manifest`

**Purpose:** Download restricted data from Box and generate manifest files.

**Parameters:**

- `jiraticket` - JIRA ticket identifier

**Pipeline Steps:**

#### Step: Download Box and create manifests

**Image:** `python:3.12`
**Caches:** pip packages

- Installs Python requirements
- Runs `download_box_private.py` to download restricted data from Box
- Executes `04_create_manifest.sh restricted` twice to generate checksums
- Force-adds all files in the `generated/` directory
- Commits with `[skip ci]` to avoid triggering pipelines
- Pushes changes

**Use Case:** Downloading and documenting restricted data stored on Box for replication packages that include confidential data.

**Note:** Requires Box API credentials to be configured in the environment.
### 11. `x-run-python`

**Purpose:** Execute custom Python scripts.

**Parameters:**

- `Script` - Script to run (default: run-python.sh)

**Pipeline Steps:**

**Image:** `python:3.11`

- Executes the specified script
- Runs a post-run script if present
- Pushes results

**Use Case:** Custom Python processing tasks.
## Docker Images

The pipeline uses several specialized Docker images:

| Image | Purpose | Used In |
|---|---|---|
| `python:3.12` | Python analysis, downloads | Download, Python scanner |
| `larsvilhuber/bitbucket-stata:latest` | Stata scanning/execution | Stata scanner, PII scanner, execution |
| `aeadataeditor/verification-r:latest` | R dependency checking | R scanner |
| `julia:latest` | Julia dependency checking | Julia scanner |
| `aldanial/cloc` | Line counting | Line counter |
| `dpokidov/imagemagick` | Image conversion | Graphics conversion |
## Artifact Management

Artifacts are files passed between pipeline steps:

```yaml
artifacts:
  - generated/**  # Analysis outputs
  - cache/**      # Downloaded deposits
```

Only specified steps preserve artifacts, which reduces storage and transfer time.
## Caching

The pipeline uses Bitbucket's caching for pip packages:

```yaml
caches:
  - pip
```

This speeds up subsequent runs by reusing downloaded Python packages.
## Configuration Integration

All pipelines read from `config.yml`:

```shell
. ./tools/parse_yaml.sh
eval $(parse_yaml config.yml)
```

This allows parameters to be stored in the repository rather than entered manually.
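The pattern here is that `parse_yaml` emits shell variable assignments which `eval` then brings into the environment. A minimal illustration of that idea, under the assumption of flat `key: value` lines (the real `tools/parse_yaml.sh` handles nesting and more):

```shell
# Minimal illustration of the parse_yaml pattern, not the real script:
# turn flat "key: value" lines into key="value" assignments, then eval.
cat > config.yml <<'EOF'
openICPSRID: 123456
jiraticket: AEAREP-1234
EOF

parse_yaml() {
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\): *\(.*\)$/\1="\2"/p' "$1"
}

eval "$(parse_yaml config.yml)"
echo "$openICPSRID"   # 123456
```

After the `eval`, pipeline scripts can refer to `$openICPSRID` and `$jiraticket` exactly as if they had been entered in the Bitbucket run dialog.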
## Conditional Processing

Scripts check environment variables to skip processing:

```shell
[[ "$SkipProcessing" == "yes" ]] && exit 0
[[ "$ProcessStata" == "no" ]] && exit 0
```

This provides fine-grained control over which analyses run.
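The guard pattern generalizes to a small helper each scanner can call before doing any work. The function name and variable values below are illustrative, not taken from the actual scripts:

```shell
# Illustrative guard a scanner script could run first (names hypothetical).
should_run() {
  [ "$SkipProcessing" = "yes" ] && return 1   # global off-switch
  [ "$1" = "no" ] && return 1                 # per-language toggle
  return 0
}

SkipProcessing=no
ProcessStata=no
ProcessR=yes
should_run "$ProcessStata" && echo "stata: run" || echo "stata: skip"
should_run "$ProcessR"     && echo "r: run"     || echo "r: skip"
```

Because the toggle defaults live in the pipeline variables, a run with everything left at defaults scans Stata, R, Python, and PII, while `SkipProcessing=yes` short-circuits all of them.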
## Git Integration

Most pipelines end with:

```shell
git status
git push
git push --tags  # Some pipelines
```

Some use `[skip ci]` in commit messages to prevent recursive pipeline triggers:

```shell
git commit -m "[skip ci] Rename $oldName to $newName"
```
## Resource Sizing

Bitbucket provides different resource tiers:

- **Default:** 4 GB RAM, 2 vCPU
- **2x:** 8 GB RAM, 4 vCPU (`size: 2x`)
- **8x:** 32 GB RAM, 16 vCPU (`size: 8x`)

Larger sizes cost more build minutes but prevent timeouts on big deposits.
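The tier is selected per step with the `size` key. A minimal fragment of that shape (the `script` line is illustrative; `run.sh` mirrors the `RunCommand` default mentioned above):

```yaml
- step:
    name: Run R or Stata code
    size: 8x          # 32 GB RAM, 16 vCPU; consumes build minutes faster
    script:
      - bash run.sh
```

Steps without a `size` key run on the default 4 GB / 2 vCPU tier.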
## Parallel vs Sequential

**Pipeline 1 (`1-populate-from-icpsr`):**

- Parallel processing of language scanners
- Faster completion
- More efficient use of build minutes
- Requires artifact passing

**Pipeline w (`w-big-populate-from-icpsr`):**

- Sequential processing
- Simpler (no artifact coordination)
- Better for large files
- Higher resource allocation
## YAML Anchors

The configuration uses YAML anchors for reusability:

```yaml
- step: &z-run-any-anchor
    name: Run R or Stata code
    script: [...]
```

Referenced later:

```yaml
- step:
    <<: *z-run-any-anchor
    name: Run Stata code
    size: 2x
```
## Environment Variables

Available in pipelines:

- `$CI` - Set in CI environments
- `$openICPSRID` - From parameters or config
- `$ZenodoID` - From parameters or config
- `$W_DOCKER_USERNAME` - Docker Hub credentials (secured)
- `$W_DOCKER_PAT` - Docker Hub PAT (secured)
## Best Practices

- Use Pipeline 1 for standard deposits
- Use Pipeline w for deposits over 1 GB or with many files
- Use Pipeline `z-run-any-big` for compute-intensive replications
- Set `ProcessX="no"` for languages not in the deposit (faster)
- Use `SkipProcessing="yes"` to download without running any analysis
- Use `[skip ci]` commits to avoid recursive triggers
- Check artifacts in the Bitbucket UI when debugging
## Troubleshooting

| Issue | Solution |
|---|---|
| Timeout during download | Use the `w-big-populate-from-icpsr` pipeline |
| Out of memory during analysis | Use a larger step size (`2x` or `8x`) |
| Parallel steps fail | Check individual step logs in Bitbucket |
| Artifacts not found | Verify the previous step completed successfully |
| Git push fails | Check repository permissions and credentials |