download_dv.py - Download datasets from Dataverse repositories#
Description#
This script downloads complete datasets from Dataverse instances (such as Harvard Dataverse) by DOI (Digital Object Identifier) through the Dataverse API. It creates one output directory per dataset and automatically extracts the downloaded ZIP archive.
Usage#
python3 tools/download_dv.py --doi DOI [--server_url URL] [--output PATH]
Arguments#
--doi (Required) - DOI of the dataset (e.g., "doi:10.7910/DVN/ABC123")
--server_url (Optional) - Dataverse server URL (default: https://dataverse.harvard.edu)
--output (Optional) - Output directory (default: current directory)
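The three arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual source: the function name `build_parser` and the help strings are assumptions; only the flag names and defaults come from the list above.

```python
import argparse

DEFAULT_SERVER = "https://dataverse.harvard.edu"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Download a dataset from a Dataverse repository."
    )
    # --doi is the only required argument.
    parser.add_argument("--doi", required=True,
                        help='Dataset DOI, e.g. "doi:10.7910/DVN/ABC123"')
    # Defaults to Harvard Dataverse when omitted.
    parser.add_argument("--server_url", default=DEFAULT_SERVER,
                        help="Dataverse server URL")
    # "." stands for the current directory.
    parser.add_argument("--output", default=".",
                        help="Output directory")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["--doi", "doi:10.7910/DVN/ABC123"])
print(args.doi, args.server_url, args.output)
```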
Examples#
# Download from Harvard Dataverse (default)
python3 tools/download_dv.py --doi "doi:10.7910/DVN/ABC123"
# Download from custom Dataverse instance
python3 tools/download_dv.py --doi "doi:10.7910/DVN/ABC123" --server_url "https://dataverse.example.edu"
# Specify custom output directory
python3 tools/download_dv.py --doi "doi:10.7910/DVN/ABC123" --output "./downloads"
Features#
Downloads entire dataset as ZIP archive using Dataverse API
Creates organized directory structure:
dv-[PUBLISHER]-[DATASET_ID]
Automatic extraction of downloaded ZIP files
Progress feedback during download
Automatic git integration in CI environments
Skips re-extraction if target directory already exists
API Usage#
Uses Dataverse Native API to get dataset metadata
Downloads files in original format via dataset access API
Supports public datasets (no authentication required)
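As an illustration of the two API calls listed above, the sketch below builds the request URLs without sending anything. The endpoint paths follow the public Dataverse Native API and Data Access API; whether the script uses exactly these paths and the `format=original` parameter is an assumption, and `dataset_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def dataset_urls(doi, server_url="https://dataverse.harvard.edu"):
    """Return (metadata_url, download_url) for a dataset DOI.

    Assumed endpoints:
      - Native API metadata:  /api/datasets/:persistentId/
      - Data Access bulk ZIP: /api/access/dataset/:persistentId/
    """
    base = server_url.rstrip("/")
    # urlencode percent-encodes the ':' and '/' characters in the DOI.
    query = urlencode({"persistentId": doi})
    metadata_url = f"{base}/api/datasets/:persistentId/?{query}"
    # format=original requests files as uploaded rather than converted.
    download_url = f"{base}/api/access/dataset/:persistentId/?{query}&format=original"
    return metadata_url, download_url


metadata_url, download_url = dataset_urls("doi:10.7910/DVN/ABC123")
print(metadata_url)
print(download_url)
```

For public datasets both URLs can be fetched without an API token, which matches the "no authentication required" note above.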
Output Structure#
Input DOI: doi:10.7910/DVN/ABC123
Output directory: ./dv-DVN-ABC123/
Downloaded file: ./dv-DVN-ABC123/ABC123.zip (extracted automatically)
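The mapping from DOI to paths shown above can be sketched as below. The parsing rule (take the last two `/`-separated segments as publisher and dataset ID) is inferred from the single example and may not match the script in every case; `output_paths` is a hypothetical name.

```python
from pathlib import Path


def output_paths(doi, base="."):
    """Derive (output_dir, zip_path) from a DOI such as doi:10.7910/DVN/ABC123."""
    # Drop the optional "doi:" scheme prefix.
    identifier = doi[4:] if doi.startswith("doi:") else doi
    parts = identifier.split("/")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"Unexpected DOI shape: {doi!r}")
    # Assumption: the last two segments are the publisher and the dataset ID.
    publisher, dataset_id = parts[-2], parts[-1]
    output_dir = Path(base) / f"dv-{publisher}-{dataset_id}"
    zip_path = output_dir / f"{dataset_id}.zip"
    return output_dir, zip_path


out_dir, zip_file = output_paths("doi:10.7910/DVN/ABC123")
print(out_dir, zip_file)
```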
Error Handling#
Validates DOI format and dataset availability
Handles download failures gracefully
Reports API errors and connection issues
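One way the checks above might look is sketched here. The DOI pattern and the exception class are assumptions made for illustration; the response shape (`status` of `"OK"` or `"ERROR"`, with a `message` on errors) follows the JSON envelope the Dataverse Native API returns.

```python
import re

# Assumed pattern: "doi:" + "10." + registrant code + "/" + suffix.
DOI_PATTERN = re.compile(r"^doi:10\.\d{4,9}/\S+$")


class DataverseError(RuntimeError):
    """Raised when the API reports a failure (hypothetical class)."""


def validate_doi(doi):
    """Return the DOI unchanged, or raise ValueError if it looks malformed."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Invalid DOI format: {doi!r}")
    return doi


def check_api_response(status_code, payload):
    """Return the 'data' member of a successful response, else raise."""
    if status_code != 200 or payload.get("status") != "OK":
        message = payload.get("message", "no message provided")
        raise DataverseError(f"Dataverse API error (HTTP {status_code}): {message}")
    return payload["data"]
```

With `requests`, the same idea would typically be wrapped in a `try` block that also catches `requests.exceptions.RequestException` to report connection problems.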
Requirements#
Python 3.x
requests: HTTP client library
How It Works#
DOI Parsing: Extracts publisher and dataset ID from DOI
API Query: Connects to Dataverse API to get dataset metadata
Directory Creation: Creates organized output directory structure
Download: Downloads complete dataset as ZIP file
Extraction: Automatically extracts ZIP contents
Cleanup: Maintains organized file structure for further processing
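The extraction step, including the skip when the target directory already exists, could be sketched with the standard library as follows. The choice of target directory is an assumption, and `extract_once` is a hypothetical helper rather than a function taken from the script.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    if target_dir.exists():
        # Mirrors the documented behaviour: do not re-extract.
        return False
    with zipfile.ZipFile(zip_path) as archive:
        # extractall creates target_dir if it is missing.
        archive.extractall(target_dir)
    return True
```

Checking for the directory rather than for individual files keeps reruns cheap, at the cost of not noticing a partially extracted archive.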
This tool supports reproducible research workflows that depend on datasets hosted in Dataverse repositories.