Bitbucket Pipelines Configuration#

Warning

This documentation was AI-generated by Claude Code and should be reviewed for accuracy. Please report any errors or inconsistencies.

File: bitbucket-pipelines.yml

Purpose#

This file defines the CI/CD pipelines for automated replication package analysis and verification. It orchestrates downloading deposits from various repositories, analyzing code and data, scanning for dependencies and PII, and generating comprehensive reports.

Custom Pipelines#

The configuration defines several custom pipelines that can be manually triggered via Bitbucket’s web interface.
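
For orientation, the file follows Bitbucket’s standard pipelines → custom layout. A minimal, abbreviated sketch (the script contents below are illustrative placeholders, not the file’s actual commands):

pipelines:
  custom:
    1-populate-from-icpsr:
      - variables:                 # parameters prompted for when triggering manually
          - name: openICPSRID
          - name: jiraticket
      - step:
          name: Download
          image: python:3.12
          script:
            - ./tools/download_deposit.sh   # placeholder for the actual download script
          artifacts:
            - generated/**
            - cache/**
      # ... further steps; the other custom pipelines follow the same pattern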

1. 1-populate-from-icpsr#

Purpose: Full analysis pipeline with parallel processing for optimal performance.

Parameters:

  • openICPSRID - openICPSR project ID (or read from config.yml)

  • jiraticket - JIRA ticket identifier

  • ZenodoID - Zenodo deposit ID (alternative to openICPSR)

  • ProcessStata - Enable/disable Stata scanning (default: yes)

  • ProcessR - Enable/disable R scanning (default: yes)

  • ProcessPython - Enable/disable Python scanning (default: yes)

  • ProcessJulia - Enable/disable Julia scanning (default: no)

  • ProcessPii - Enable/disable PII scanning (default: yes)

  • SkipProcessing - Skip all processing steps (default: no)
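
These parameters correspond to custom pipeline variables in bitbucket-pipelines.yml. A sketch of how the defaults listed above might be declared (variable names and defaults taken from this section):

- variables:
    - name: openICPSRID        # left empty to fall back to config.yml
    - name: jiraticket
    - name: ZenodoID
    - name: ProcessStata
      default: "yes"
    - name: ProcessR
      default: "yes"
    - name: ProcessPython
      default: "yes"
    - name: ProcessJulia
      default: "no"
    - name: ProcessPii
      default: "yes"
    - name: SkipProcessing
      default: "no"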

Pipeline Steps:

Step 1: Download#

  • Image: python:3.12

  • Downloads deposit from openICPSR or Zenodo

  • Unpacks ZIP archives

  • Lists data and program files

  • Creates manifests with checksums

  • Checks for ZIP files, duplicates, zero-byte files

  • Validates file paths

  • Compares manifests

  • Artifacts: generated/**, cache/**

Step 2: Check downloads#

  • Verifies that deposit ZIP file exists in cache

  • Fails pipeline if download was unsuccessful
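
A hedged sketch of such a guard step (the archive location under cache/ is assumed):

- step:
    name: Check downloads
    script:
      # fail the pipeline early if no deposit archive reached the cache
      - test -n "$(ls cache/*.zip 2>/dev/null)" || (echo "No deposit ZIP found in cache" && exit 1)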

Step 3: Parallel Processing#

Runs multiple scanners concurrently for maximum efficiency (a structural sketch of the parallel block follows the list below):

3a. Run Stata parser

  • Image: larsvilhuber/bitbucket-stata:latest

  • Scans Stata code for package dependencies

  • Generates candidatepackages.csv

  • Artifacts: generated/**

3b. Run Stata PII scanner

  • Image: larsvilhuber/bitbucket-stata:latest

  • Scans for Personally Identifiable Information

  • Generates PII reports

  • Artifacts: generated/**

3c. Run R parser

  • Image: aeadataeditor/verification-r:latest

  • Checks R dependencies and data files

  • Generates r-deps.csv and r-data-checks.csv

3d. Run Python parser

  • Image: python:3.12

  • Scans Python code for package dependencies

  • Generates python-deps.csv

  • Artifacts: generated/**

3e. Run Julia parser

  • Image: julia:latest

  • Scans Julia code for package dependencies

  • Artifacts: generated/**

3f. Count lines and comments

  • Image: aldanial/cloc

  • Counts lines of code by language

  • Generates code statistics

  • Artifacts: generated/**
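
A structural sketch of the parallel block (step names and images from this section; scripts abbreviated to placeholders):

- parallel:
    - step:
        name: Run Stata parser
        image: larsvilhuber/bitbucket-stata:latest
        script:
          - ./tools/scan_stata.sh   # placeholder
        artifacts:
          - generated/**
    - step:
        name: Run R parser
        image: aeadataeditor/verification-r:latest
        script:
          - ./tools/scan_r.sh       # placeholder
        artifacts:
          - generated/**
    - step:
        name: Count lines and comments
        image: aldanial/cloc
        script:
          - cloc --report-file=generated/cloc.txt .   # illustrative invocation
        artifacts:
          - generated/**
    # ... remaining scanners follow the same shape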

Step 4: Add info to REPLICATION.md#

  • Image: python:3.12

  • Consolidates all findings into report

  • Runs 24_amend_report.sh

  • Artifacts: generated/**

Step 5: Commit everything back#

  • Image: python:3.12

  • Re-extracts deposit

  • Commits analyzed code to repository

  • Cleans up deposit directory

  • Replaces report sections

  • Updates config.yml

  • Pushes to Git with tags

Use Case: Standard replication package analysis; the recommended pipeline for most deposits.


2. w-big-populate-from-icpsr#

Purpose: Single-step pipeline for large deposits requiring more resources.

Parameters:

  • openICPSRID - openICPSR project ID

  • ZenodoID - Zenodo deposit ID

  • jiraticket - JIRA ticket identifier

Pipeline Steps:

Step: Download and commit#

  • Image: python:3.12

  • Size: 2x (double resources)

  • Installs cloc for line counting

  • Downloads and analyzes deposit sequentially

  • All processing in single step (no parallelization)

  • Commits and pushes results

Key Differences from Pipeline 1:

  • All processing is sequential (slower but more reliable)

  • Double resources (2x memory and CPU)

  • No artifact passing between steps

  • Includes cloc installation

Use Case: Large deposits that may timeout or fail with standard resources.


3. z-run-stata#

Purpose: Execute Stata replication code.

Parameters:

  • openICPSRID - Project ID

  • MainFile - Main script to execute (default: main.do)

  • RunCommand - Command to run (default: run.sh)

Pipeline Steps:

Step: Run Stata code#

  • Image: larsvilhuber/bitbucket-stata:latest

  • Size: 2x

  • Downloads deposit

  • Executes replication code

  • Commits outputs

  • Pushes results

Use Case: Running actual Stata replication code (not just analysis).


4. z-run-any-big#

Purpose: Execute replication code with maximum resources.

Parameters:

  • openICPSRID - Project ID

  • MainFile - Main script to execute

  • RunCommand - Command to run (default: run.sh)

Pipeline Steps:

Step: Run R or Stata code#

  • Image: larsvilhuber/bitbucket-stata:latest

  • Size: 8x (maximum available resources)

  • Downloads deposit

  • Executes replication code

  • Commits outputs

  • Pushes results

Use Case: Large, resource-intensive replications requiring maximum compute.


5. 2-merge-report#

Purpose: Combine Part A and Part B of a split report.

Parameters:

  • jiraticket - JIRA ticket identifier

Pipeline Steps:

  • Runs 50_merge-parts.sh

  • Pushes merged report

Use Case: Recombining the report after Part A and Part B have been completed separately.


6. 3-split-report#

Purpose: Split REPLICATION.md into Part A and Part B.

Parameters:

  • jiraticket - JIRA ticket identifier

Pipeline Steps:

  • Runs 51_split-parts.sh

  • Pushes split reports

Use Case: When report needs to be worked on in sections.


7. 4-refresh-tools#

Purpose: Update pipeline tools from master template.

Pipeline Steps:

  • Downloads latest update_tools.sh

  • Executes update

  • Pushes updated tools

Use Case: Keeping tools synchronized with template repository.


8. 5-rename-directory#

Purpose: Rename a deposit directory in the repository.

Parameters:

  • oldName - Current directory name

  • newName - New directory name

  • jiraticket - JIRA ticket identifier

Pipeline Steps:

  • Runs git mv $oldName $newName

  • Commits with [skip ci] to avoid triggering pipelines

  • Pushes changes

Use Case: Correcting directory names or reorganizing deposits.


9. 6-convert-eps-pdf#

Purpose: Convert EPS and PDF graphics to PNG format.

Parameters:

  • path - Directory containing graphics

  • jiraticket - JIRA ticket identifier

  • ProcessEPS - Convert EPS files (default: yes)

  • ProcessPDF - Convert PDF files (default: no)

  • DockerImg - Docker image to use (default: dpokidov/imagemagick)

Pipeline Steps:

  • Uses Docker-in-Docker (services: docker)

  • Mounts current directory into container

  • Runs 52_convert_eps_pdf.sh

  • Commits converted files

  • Pushes with [skip ci]

Use Case: Converting graphics for better diff visualization or compatibility.
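
A hedged sketch of the Docker-in-Docker step (mount path, entrypoint handling, and commit message are illustrative):

- step:
    name: Convert EPS and PDF graphics
    services:
      - docker                      # Docker-in-Docker: makes the docker daemon available
    script:
      # mount the repository into the conversion container and run the helper script
      - docker run --rm -v "$(pwd)":/work -w /work --entrypoint /bin/sh "$DockerImg" tools/52_convert_eps_pdf.sh "$path"
      - git add "$path"
      - git commit -m "[skip ci] Convert graphics in $path"
      - git push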


10. 7-download-box-manifest#

Purpose: Download restricted data from Box and generate manifest files.

Parameters:

  • jiraticket - JIRA ticket identifier

Pipeline Steps:

Step: Download Box and create manifests#

  • Image: python:3.12

  • Caches: pip packages

  • Installs Python requirements

  • Runs download_box_private.py to download restricted data from Box

  • Executes 04_create_manifest.sh restricted twice to generate checksums

  • Force-adds all files in generated/ directory

  • Commits with [skip ci] to avoid triggering pipelines

  • Pushes changes

Use Case: Downloading and documenting restricted data stored on Box for replication packages that include confidential data.

Note: Requires Box API credentials to be configured in the environment.


11. x-run-python#

Purpose: Execute custom Python scripts.

Parameters: none listed.

Pipeline Steps:

  • Image: python:3.11

  • Executes specified script

  • Runs post-run script if present

  • Pushes results

Use Case: Custom Python processing tasks.


Docker Images#

The pipeline uses several specialized Docker images:

| Image | Purpose | Used In |
|-------|---------|---------|
| python:3.12 | Python analysis, downloads | Download, Python scanner |
| larsvilhuber/bitbucket-stata:latest | Stata scanning/execution | Stata scanner, PII scanner, execution |
| aeadataeditor/verification-r:latest | R dependency checking | R scanner |
| julia:latest | Julia dependency checking | Julia scanner |
| aldanial/cloc | Line counting | Line counter |
| dpokidov/imagemagick | Image conversion | Graphics conversion |

Artifact Management#

Artifacts are files passed between pipeline steps:

artifacts:
  - generated/**   # Analysis outputs
  - cache/**       # Downloaded deposits

Only steps that declare artifacts preserve them, and only files matching the listed patterns are passed to later steps. This reduces storage and transfer time.
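
Artifacts are declared per step; for example, the download step keeps both its analysis outputs and the cached deposit (sketch, script placeholder):

- step:
    name: Download
    image: python:3.12
    script:
      - ./tools/download_deposit.sh   # placeholder
    artifacts:
      - generated/**                  # analysis outputs
      - cache/**                      # downloaded deposit, reused by later steps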

Caching#

The pipeline uses Bitbucket’s caching for pip packages:

caches:
  - pip

This speeds up subsequent runs by reusing downloaded Python packages.
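
The cache is declared per step, next to the image and script; for example (requirements file name and script path assumed):

- step:
    name: Download Box and create manifests
    image: python:3.12
    caches:
      - pip                                   # reuses downloaded wheels across runs
    script:
      - pip install -r requirements.txt       # requirements file name assumed
      - python tools/download_box_private.py  # script path assumed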

Configuration Integration#

All pipelines read from config.yml:

. ./tools/parse_yaml.sh
eval $(parse_yaml config.yml)

This allows parameters to be stored in the repository rather than entered manually.
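
parse_yaml.sh turns the top-level keys of config.yml into shell variables, so a config.yml along these lines supplies the values (keys assumed to mirror the parameter names; values are hypothetical):

# hypothetical config.yml
openICPSRID: "123456"
ZenodoID: ""
jiraticket: "AEAREP-1234"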

Conditional Processing#

Scripts check environment variables to skip processing:

[[ "$SkipProcessing" == "yes" ]] && exit 0
[[ "$ProcessStata" == "no" ]] && exit 0

This provides fine-grained control over which analyses run.
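
Custom pipeline variables are exported to step scripts as environment variables, so the wiring needs nothing more than the following (sketch; the guard normally lives inside the scanner script itself):

- variables:
    - name: ProcessStata
      default: "yes"
- step:
    name: Run Stata parser
    image: larsvilhuber/bitbucket-stata:latest
    script:
      - if [ "$ProcessStata" = "no" ]; then echo "Stata scanning disabled"; exit 0; fi
      - ./tools/scan_stata.sh   # placeholder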

Git Integration#

Most pipelines end with:

git status
git push
git push --tags  # Some pipelines

Some use [skip ci] in commit messages to prevent recursive pipeline triggers:

git commit -m "[skip ci] Rename $oldName to $newName"

Resource Sizing#

Bitbucket provides different resource tiers:

  • Default: 4GB RAM, 2 vCPU

  • 2x: 8GB RAM, 4 vCPU (size: 2x)

  • 8x: 32GB RAM, 16 vCPU (size: 8x)

Larger sizes consume build minutes at a proportionally higher rate, but prevent timeouts and out-of-memory failures on big deposits.
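
The size is set per step; for example (script placeholder):

- step:
    name: Download and commit
    image: python:3.12
    size: 2x                         # double memory; omit for the default allocation
    script:
      - ./tools/full_analysis.sh     # placeholder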

Parallel vs Sequential#

Pipeline 1 (1-populate-from-icpsr):

  • Parallel processing of language scanners

  • Faster completion

  • More efficient use of build minutes

  • Requires artifact passing

Pipeline w (w-big-populate-from-icpsr):

  • Sequential processing

  • Simpler (no artifact coordination)

  • Better for large files

  • Higher resource allocation

YAML Anchors#

The configuration uses YAML anchors for reusability:

- step: &z-run-any-anchor
    name: Run R or Stata code
    script: [...]

Referenced later:

- step:
    <<: *z-run-any-anchor
    name: Run Stata code
    size: 2x

Environment Variables#

Available in pipelines:

  • $CI - Set in CI environment

  • $openICPSRID - From parameters or config

  • $ZenodoID - From parameters or config

  • $W_DOCKER_USERNAME - Docker Hub credentials (secured)

  • $W_DOCKER_PAT - Docker Hub PAT (secured)
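
A common use of secured Docker Hub credentials is authenticating image pulls to avoid rate limits. Whether this file does so is an assumption; the Bitbucket syntax for it would be:

- step:
    name: Run Stata parser
    image:
      name: larsvilhuber/bitbucket-stata:latest
      username: $W_DOCKER_USERNAME       # secured repository variables
      password: $W_DOCKER_PAT
    script:
      - ./tools/scan_stata.sh            # placeholder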

Best Practices#

  1. Use Pipeline 1 for standard deposits

  2. Use Pipeline w for deposits >1GB or with many files

  3. Use Pipeline z-run-any-big for compute-intensive replications

  4. Set ProcessStata, ProcessR, ProcessPython, or ProcessJulia to "no" for languages not present in the deposit (faster)

  5. Use SkipProcessing="yes" to download only, without running any analysis

  6. Use [skip ci] commits to avoid recursive triggers

  7. Check artifacts in the Bitbucket UI when debugging

Troubleshooting#

| Issue | Solution |
|-------|----------|
| Timeout during download | Use w-big-populate-from-icpsr |
| Out of memory during analysis | Use w-big-populate-from-icpsr or z-run-any-big |
| Parallel steps fail | Check individual step logs in Bitbucket |
| Artifacts not found | Verify that the previous step completed successfully |
| Git push fails | Check repository permissions and credentials |