Bitbucket Pipeline Overview#

Warning

This documentation was AI-generated by Claude Code and should be reviewed for accuracy. Please report any errors or inconsistencies.

Introduction#

This guide helps you understand and use the automated replication verification pipeline. The pipeline downloads replication packages from data repositories, analyzes code and data files, scans for dependencies and issues, and generates comprehensive reports.

Supported Repositories#

The pipeline can automatically download from:

  • openICPSR - Most common for AEA replications

  • Zenodo - General-purpose research repository

  • Dataverse - Harvard and other Dataverse instances

Support is planned, but not yet available, for:

  • OSF - Open Science Framework

  • World Bank

  • Box - For restricted data (the 7-download-box-manifest pipeline covers the current manual workflow)

Available Pipelines#

Standard Pipelines#

1-populate-from-icpsr#

What it does:

  • Downloads deposit from openICPSR, Zenodo, or other repository

  • Lists all data and program files

  • Generates file checksums for integrity verification

  • Scans code for package dependencies (Stata, R, Python, Julia)

  • Checks for Personally Identifiable Information (PII)

  • Identifies duplicate files and potential issues

  • Counts lines of code

  • Generates comprehensive markdown report

  • Commits results to repository

Parameters:

  • openICPSRID or ZenodoID - Repository identifier (or from config.yml)

  • jiraticket - Your JIRA ticket number

  • ProcessStata/R/Python/Julia - Enable/disable specific language scanners

  • ProcessPii - Enable/disable PII scanning

When it’s done:

  • ✅ Deposit downloaded and analyzed

  • ✅ All findings in generated/ directory

  • ✅ Code committed with tags
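
Custom pipelines like this one are normally started from the Bitbucket web UI (Pipelines → Run pipeline → custom), but they can also be triggered through the Bitbucket REST API. A sketch, in which the workspace, repository slug, and variable values are placeholders to adapt to your own deposit:

```shell
# Sketch: trigger the 1-populate-from-icpsr custom pipeline via the
# Bitbucket REST API. WORKSPACE, REPO, and the variable values are
# placeholders; BITBUCKET_TOKEN must hold a valid access token.
WORKSPACE="my-workspace"
REPO="aearep-1234"
PAYLOAD='{
  "target": {
    "type": "pipeline_ref_target",
    "ref_type": "branch",
    "ref_name": "main",
    "selector": { "type": "custom", "pattern": "1-populate-from-icpsr" }
  },
  "variables": [
    { "key": "openICPSRID", "value": "123456" },
    { "key": "jiraticket", "value": "AEAREP-1234" }
  ]
}'
echo "$PAYLOAD"
# Only call the API when a token is actually configured:
if [ -n "${BITBUCKET_TOKEN:-}" ]; then
  curl -s -X POST \
    -H "Authorization: Bearer $BITBUCKET_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" \
    "https://api.bitbucket.org/2.0/repositories/$WORKSPACE/$REPO/pipelines/"
fi
```

The `pattern` field selects which custom pipeline to run, and `variables` supplies the same parameters you would otherwise type into the Run pipeline dialog.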


w-big-populate-from-icpsr (Large Deposit Pipeline)#

When to use:

  • Deposits larger than 1GB

  • Many files (>5,000)

  • Previous timeouts with standard pipeline

What it does: Same as the standard pipeline, but optimized for large deposits

How it’s different:

  • Uses 2x computing resources (8GB RAM vs 4GB)

  • Sequential processing instead of parallel

  • More reliable for large files

  • Takes longer but is less likely to time out

  • Uses more build minutes (costlier)


Utility Pipelines#

2-merge-report (Merge Report Sections)#

When to use: After completing split report sections separately

What it does:

  • Combines REPLICATION-PartA.md and REPLICATION-PartB.md

  • Creates unified REPLICATION.md

  • Removes split files
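
The merge itself is essentially concatenation. A self-contained sketch (throwaway files stand in for the real report parts; the actual pipeline may also clean up the split marker):

```shell
# Sketch of the merge step: concatenate the two parts into the
# unified report, then remove the part files.
printf '# Part A\n' > REPLICATION-PartA.md   # stand-in content
printf '# Part B\n' > REPLICATION-PartB.md   # stand-in content
cat REPLICATION-PartA.md REPLICATION-PartB.md > REPLICATION.md
rm REPLICATION-PartA.md REPLICATION-PartB.md
```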


3-split-report (Split Report)#

When to use: Need to work on report in separate sections

What it does:

  • Splits REPLICATION.md at marker comment

  • Creates Part A and Part B files

  • Removes original unified report
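
A minimal sketch of splitting at a marker comment. The marker string used here is an assumption for illustration; the real pipeline defines its own marker:

```shell
# Sketch of the split step. The MARKER value is illustrative only.
MARKER='<!-- SPLIT -->'
printf 'intro\n%s\nappendix\n' "$MARKER" > REPLICATION.md  # stand-in report
awk -v m="$MARKER" '
  $0 == m  { part = 1; next }                 # switch output at the marker
  part == 0 { print > "REPLICATION-PartA.md" }
  part == 1 { print > "REPLICATION-PartB.md" }
' REPLICATION.md
rm REPLICATION.md
```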


4-refresh-tools (Update Tools)#

When to use:

  • After template repository updates

  • Periodically to stay current

What it does:

  • Downloads latest tools from master template

  • Updates all automation scripts

  • Commits changes


5-rename-directory (Rename Deposit)#

When to use: Need to rename a deposit directory

What it does:

  • Renames directory using git mv

  • Preserves history

  • Updates references

Parameters:

  • oldName - Current directory name

  • newName - New directory name


6-convert-eps-pdf (Convert Graphics)#

When to use: To convert graphics for easier viewing or comparison

What it does:

  • Converts EPS files to PNG

  • Converts PDF graphics to PNG

  • Preserves originals

Parameters:

  • path - Directory containing graphics

  • ProcessEPS - Convert EPS (yes/no)

  • ProcessPDF - Convert PDF (yes/no)
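
One common way to do EPS-to-PNG conversion is with Ghostscript; whether this pipeline uses Ghostscript or another tool is not documented here, so treat the following as a sketch. The `gs` invocation is commented out so the example degrades to a dry run where Ghostscript is not installed:

```shell
# Sketch: convert every EPS file in the current directory to PNG.
touch figure1.eps   # stand-in file so the loop has something to do
for f in *.eps; do
  [ -e "$f" ] || continue
  out="${f%.eps}.png"
  echo "convert $f -> $out"
  # Uncomment to actually convert at 300 dpi with Ghostscript:
  # gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -o "$out" "$f"
done
```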


7-download-box-manifest (Download Box Data)#

When to use: Download restricted data from Box and create manifests

What it does:

  • Downloads restricted data from Box using credentials

  • Generates manifest files with checksums (runs twice for verification)

  • Commits all generated files to repository

  • Uses [skip ci] to avoid triggering other pipelines

Parameters:

  • jiraticket - Your JIRA ticket number

Requirements:

  • Box API credentials must be configured

  • download_box_private.py script must be set up

Note: This pipeline is specifically for handling restricted/confidential data that cannot be stored in public repositories.
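
The "runs twice for verification" step can be understood as generating the manifest in two passes and comparing them: if both passes agree, the files did not change or corrupt between passes. A self-contained sketch (file names are illustrative):

```shell
# Sketch: build a sha256 manifest twice over the same tree and diff
# the two runs. Identical manifests suggest a stable, uncorrupted copy.
mkdir -p data
printf 'restricted contents\n' > data/file1.txt   # stand-in data file
find data -type f -print0 | sort -z | xargs -0 sha256sum > manifest.run1.sha256
find data -type f -print0 | sort -z | xargs -0 sha256sum > manifest.run2.sha256
diff manifest.run1.sha256 manifest.run2.sha256 && echo "manifests match"
```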


Execution Pipelines#

z-run-stata (Run Stata Code) (BETA)#

When to use: Execute Stata replication code

What it does:

  • Downloads deposit

  • Runs your specified Stata script

  • Captures all output and logs

  • Commits results and generated files

Requirements:

  • Set MainFile parameter (e.g., “main.do”)

  • Or ensure run.sh exists in deposit

Resources: 2x (8GB RAM)

Typical duration: Varies based on code (30 minutes to several hours)
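
If the deposit has no run.sh, a minimal one might look like the sketch below. The Stata executable name (stata-mp here) depends on the pipeline's image and is an assumption; the guard keeps the script runnable even where Stata is absent:

```shell
#!/bin/sh
# Minimal run.sh sketch for z-run-stata. MainFile defaults to main.do.
MAIN_FILE="${MainFile:-main.do}"
STATA_CMD="stata-mp -b do $MAIN_FILE"
echo "Would run: $STATA_CMD"
# Only execute when Stata is actually on the PATH:
if command -v stata-mp >/dev/null 2>&1; then
  $STATA_CMD
  # Stata batch mode writes a .log file next to the do-file
  tail -n 20 "${MAIN_FILE%.do}.log"
fi
```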


z-run-any-big (BETA)#

When to use:

  • Memory-intensive replications

  • Large datasets that need maximum RAM

  • Previous out-of-memory errors

What it does: Same as z-run-stata, but also supports R jobs and runs with maximum resources

Resources: 8x (32GB RAM, 16 vCPU)

Cost: Uses significantly more build minutes


Understanding Pipeline Output#

After a successful pipeline run, you’ll find:

Main Report#

  • REPLICATION.md - Prefilled with automatically identified results. Still needs your input!

Generated Directory#

All analysis outputs in generated/:

  • File inventories:

    • data-list.txt - All data files found

    • programs-list.txt - All code files found

    • manifest.YYYY-MM-DD.sha256 - Checksums for verification

  • Dependency reports:

    • candidatepackages.md - Stata packages needed

    • r-deps-summary.md - R packages needed

    • python-deps.md - Python packages needed

  • Quality checks:

    • duplicate-files-report.md - Identical files

    • zero-byte-files-report.md - Empty files

    • large-file-report.md - Files over size threshold

    • PII_stata_scan_summary.txt - Potential PII found

  • Code statistics:

    • cloc-results.txt - Lines of code by language
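
The checksum manifest lets you re-verify file integrity at any later point. A sketch, assuming the manifest uses the standard sha256sum format (hash, two spaces, path); the date in the file name is illustrative:

```shell
# Sketch: verify files against a sha256 manifest with sha256sum -c,
# which prints "OK" per file and fails on any mismatch.
printf 'example data\n' > data-file.txt   # stand-in file
sha256sum data-file.txt > manifest.2024-01-01.sha256
sha256sum -c manifest.2024-01-01.sha256
```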

Deposit Directory#

  • {projectID}/ - Downloaded deposit (code only; no data!)