Bitbucket Pipeline Overview#
Warning
This documentation was AI-generated by Claude Code and should be reviewed for accuracy. Please report any errors or inconsistencies.
Introduction#
This guide helps you understand and use the automated replication verification pipeline. The pipeline downloads replication packages from data repositories, analyzes code and data files, scans for dependencies and issues, and generates comprehensive reports.
Supported Repositories#
The pipeline can automatically download from:
openICPSR - Most common for AEA replications
Zenodo - General-purpose research repository
Dataverse - Harvard and other Dataverse instances
OSF - Open Science Framework (not yet supported)
World Bank (not yet supported)
Box - For restricted data (not yet supported)
Available Pipelines#
Standard Pipelines#
1-populate-from-icpsr#
What it does:
Downloads deposit from openICPSR, Zenodo, or other repository
Lists all data and program files
Generates file checksums for integrity verification
Scans code for package dependencies (Stata, R, Python, Julia)
Checks for Personally Identifiable Information (PII)
Identifies duplicate files and potential issues
Counts lines of code
Generates comprehensive markdown report
Commits results to repository
Parameters:
openICPSRID or ZenodoID - Repository identifier (or from config.yml)
jiraticket - Your JIRA ticket number
ProcessStata/R/Python/Julia - Enable/disable specific language scanners
ProcessPii - Enable/disable PII scanning
When it’s done:
✅ Deposit downloaded and analyzed
✅ All findings in generated/ directory
✅ Code committed with tags
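The checksum step can be pictured with a minimal Python sketch. This is not the pipeline's own script: the function names `sha256_of` and `build_manifest` are hypothetical, and the output format (digest, two spaces, relative path, as produced by `sha256sum`) is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(deposit_dir: str) -> list[str]:
    """Return one '<digest>  <relative path>' line per file, sorted by path.

    The two-space separator mimics sha256sum output (an assumption, not a
    documented property of the pipeline's manifest files).
    """
    root = Path(deposit_dir)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    return lines
```

Reading in chunks keeps memory use flat even for large data files, which matters when a deposit contains multi-gigabyte datasets.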
w-big-populate-from-icpsr (Large Deposit Pipeline)#
When to use:
Deposits larger than 1GB
Many files (>5,000)
Previous timeouts with standard pipeline
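The criteria above can be written down as a small decision helper. This is only an illustrative sketch: `suggest_pipeline` is a hypothetical function, and treating 1GB as 10^9 bytes is an assumption.

```python
SIZE_LIMIT_BYTES = 1_000_000_000  # "larger than 1GB" (decimal GB assumed)
FILE_COUNT_LIMIT = 5_000          # "many files (>5,000)"


def suggest_pipeline(total_bytes: int, file_count: int, had_timeout: bool = False) -> str:
    """Return the pipeline name suggested by the 'When to use' criteria."""
    if had_timeout or total_bytes > SIZE_LIMIT_BYTES or file_count > FILE_COUNT_LIMIT:
        return "w-big-populate-from-icpsr"
    return "1-populate-from-icpsr"
```

For example, `suggest_pipeline(2_500_000_000, 300)` returns `"w-big-populate-from-icpsr"` because the deposit exceeds the size limit.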
What it does: Same as standard pipeline, but optimized for large deposits
How it’s different:
Uses 2x computing resources (8GB RAM vs 4GB)
Sequential processing instead of parallel
More reliable for large files
Takes longer but is less likely to time out
Uses more build minutes (costlier)
Utility Pipelines#
2-merge-report (Merge Report Sections)#
When to use: After completing split report sections separately
What it does:
Combines REPLICATION-PartA.md and REPLICATION-PartB.md
Creates unified REPLICATION.md
Removes split files
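The merge amounts to concatenating the two parts and deleting them afterwards. A minimal sketch, assuming the parts are joined in order with a single newline between them (the actual pipeline script may differ; `merge_report` is a hypothetical name):

```python
from pathlib import Path


def merge_report(workdir: str) -> Path:
    """Combine Part A and Part B into REPLICATION.md and remove the parts."""
    root = Path(workdir)
    part_a = root / "REPLICATION-PartA.md"
    part_b = root / "REPLICATION-PartB.md"
    merged = root / "REPLICATION.md"

    text_a = part_a.read_text(encoding="utf-8")
    text_b = part_b.read_text(encoding="utf-8")

    # Make sure exactly one newline separates the end of Part A from Part B.
    merged.write_text(text_a.rstrip("\n") + "\n" + text_b, encoding="utf-8")

    # Remove the split files only after the unified report has been written.
    part_a.unlink()
    part_b.unlink()
    return merged
```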
3-split-report (Split Report)#
When to use: Need to work on report in separate sections
What it does:
Splits REPLICATION.md at marker comment
Creates Part A and Part B files
Removes original unified report
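The split can be sketched in the same way. The marker text below is hypothetical, since the real marker comment is not given here, and keeping the marker at the top of Part B is likewise an assumption.

```python
from pathlib import Path

# Hypothetical marker; substitute the comment actually used in REPLICATION.md.
MARKER = "<!-- SPLIT-MARKER -->"


def split_report(workdir: str, marker: str = MARKER) -> tuple[Path, Path]:
    """Split REPLICATION.md at the first marker and remove the original."""
    root = Path(workdir)
    unified = root / "REPLICATION.md"
    text = unified.read_text(encoding="utf-8")

    head, found, tail = text.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in {unified}")

    part_a = root / "REPLICATION-PartA.md"
    part_b = root / "REPLICATION-PartB.md"
    part_a.write_text(head, encoding="utf-8")
    # The marker stays at the top of Part B so the report can be split again
    # after a later merge.
    part_b.write_text(found + tail, encoding="utf-8")

    unified.unlink()
    return part_a, part_b
```

Raising an error when the marker is missing avoids silently deleting the unified report without producing usable parts.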
4-refresh-tools (Update Tools)#
When to use:
After template repository updates
Periodically to stay current
What it does:
Downloads latest tools from master template
Updates all automation scripts
Commits changes
5-rename-directory (Rename Deposit)#
When to use: Need to rename a deposit directory
What it does:
Renames directory using git mv
Preserves history
Updates references
Parameters:
oldName - Current directory name
newName - New directory name
6-convert-eps-pdf (Convert Graphics)#
When to use: Convert graphics for better viewing/comparison
What it does:
Converts EPS files to PNG
Converts PDF graphics to PNG
Preserves originals
Parameters:
path - Directory containing graphics
ProcessEPS - Convert EPS (yes/no)
ProcessPDF - Convert PDF (yes/no)
7-download-box-manifest (Download Box Data)#
When to use: Download restricted data from Box and create manifests
What it does:
Downloads restricted data from Box using credentials
Generates manifest files with checksums (runs twice for verification)
Commits all generated files to repository
Uses [skip ci] to avoid triggering other pipelines
Parameters:
jiraticket - Your JIRA ticket number
Requirements:
Box API credentials must be configured
download_box_private.py script must be set up
Note: This pipeline is specifically for handling restricted/confidential data that cannot be stored in public repositories.
Execution Pipelines#
z-run-stata (Run Stata Code) (BETA)#
When to use: Execute Stata replication code
What it does:
Downloads deposit
Runs your specified Stata script
Captures all output and logs
Commits results and generated files
Requirements:
Set MainFile parameter (e.g., “main.do”)
Or ensure run.sh exists in deposit
Resources: 2x (8GB RAM)
Typical duration: Varies based on code (30 minutes to several hours)
z-run-any-big (BETA)#
When to use:
Memory-intensive replications
Large datasets that need maximum RAM
Previous out-of-memory errors
What it does: Same as z-run-stata, but with maximum resources; also supports R jobs
Resources: 8x (32GB RAM, 16 vCPU)
Cost: Uses significantly more build minutes
Understanding Pipeline Output#
After a successful pipeline run, you’ll find:
Main Report#
REPLICATION.md - Prefilled with automatically identified results. Still needs your input!
Generated Directory#
All analysis outputs in generated/:
File inventories:
data-list.txt - All data files found
programs-list.txt - All code files found
manifest.YYYY-MM-DD.sha256 - Checksums for verification
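A checksum manifest is useful mainly because files can later be re-checked against it. The sketch below assumes the manifest uses the `sha256sum` text format (hex digest, two spaces, relative path); that format and the function name `verify_manifest` are assumptions, not documented properties of the pipeline.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest_path: str, deposit_dir: str) -> list[str]:
    """Return relative paths that are missing or whose digest has changed."""
    root = Path(deposit_dir)
    problems = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed line format: "<hex digest>  <relative path>"
        expected, _, rel = line.partition("  ")
        target = root / rel
        if not target.is_file():
            problems.append(rel)
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(rel)
    return problems
```

An empty return value means every listed file is present and unchanged. On the command line, `sha256sum -c` performs the same check for manifests in this format.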
Dependency reports:
candidatepackages.md - Stata packages needed
r-deps-summary.md - R packages needed
python-deps.md - Python packages needed
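To give a sense of how a dependency scan can work, here is a rough regex-based sketch for R and Python sources. It is not the scanner the pipeline uses; the patterns and the function names `r_packages` and `python_packages` are illustrative assumptions, and the results are only candidates (commented-out code is also matched, for instance).

```python
import re

# library(pkg), require(pkg), with or without quotes around the package name.
R_PATTERN = re.compile(r"\b(?:library|require)\(\s*[\"']?([A-Za-z][A-Za-z0-9.]*)")

# "import pkg" or "from pkg import ...": capture the top-level module name.
PY_PATTERN = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)


def r_packages(source: str) -> set[str]:
    """Return candidate R package names referenced in R source text."""
    return set(R_PATTERN.findall(source))


def python_packages(source: str) -> set[str]:
    """Return candidate top-level modules imported in Python source text."""
    return set(PY_PATTERN.findall(source))
```

A real scanner would also need to separate standard-library modules from third-party packages and handle constructs such as `pkg::function()` in R.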
Quality checks:
duplicate-files-report.md - Identical files
zero-byte-files-report.md - Empty files
large-file-report.md - Files over size threshold
PII_stata_scan_summary.txt - Potential PII found
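The first three quality checks can be approximated in a few lines of Python. This is a sketch under stated assumptions: duplicates are detected by comparing SHA-256 digests, the 100 MiB default threshold is hypothetical (the actual threshold is not stated here), and `quality_checks` is not a function from the pipeline.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def quality_checks(deposit_dir: str, large_bytes: int = 100 * 1024 * 1024) -> dict:
    """Report duplicate, zero-byte, and large files under deposit_dir."""
    root = Path(deposit_dir)
    by_digest = defaultdict(list)
    zero_byte, large = [], []

    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        size = path.stat().st_size
        if size == 0:
            zero_byte.append(rel)
            continue  # empty files are listed separately, not as duplicates
        if size > large_bytes:
            large.append(rel)
        # Whole-file read keeps the sketch short; hash in chunks for big files.
        by_digest[hashlib.sha256(path.read_bytes()).hexdigest()].append(rel)

    duplicates = [group for group in by_digest.values() if len(group) > 1]
    return {"duplicates": duplicates, "zero_byte": zero_byte, "large": large}
```

Each entry in `duplicates` is a group of relative paths whose contents are byte-for-byte identical.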
Code statistics:
cloc-results.txt - Lines of code by language
Deposit Directory#
{projectID}/ - Downloaded deposit (code only; no data!)