Preparing to Run Code (Part B)#

Signal that you are working on Part B#

To signal that you are starting to run the code, you will now transition the JIRA subtask to “In Progress”.

Move to Part B to In Progress

Prepare your working area#

Before we can verify code, data, and documentation, we need to first get the code, then the data into your “working area”. Let’s start with the code.

Note

You now need to decide on which computer you will do the data analysis; do the next few steps on that computer. See Access Computer for details. The reason is that the git setup we use does not allow data files to be included in the Bitbucket repository, so when you download the replication package from openICPSR or elsewhere, the data files are not added to the Bitbucket repository.

Get the code via the existing Bitbucket repository#

  • Access the CCSS environment: shortcut to Cloud. For other access, see the Appendix.

  • Ensure that you have set up your CCSS environment (see appendix)

  • Get the code: Clone the Bitbucket repository onto the CCSS server you are working on

    git clone https://yourname@bitbucket.org/aeaverification/aearep-xxx.git
    
  • Alternatively: If you have done the full Bash setup, you can run

    aeagit xxx
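
The clone URL above follows a fixed pattern, so it can be assembled from your Bitbucket user name and the case number. The following is a minimal sketch only: the helper name clone_url is hypothetical and is not part of the LDI tooling, and yourname and 123 are placeholders.

```shell
# Hypothetical helper (not part of the LDI tooling): print the clone URL
# for a Bitbucket user name ($1) and a case number ($2).
clone_url() {
  printf 'https://%s@bitbucket.org/aeaverification/aearep-%s.git\n' "$1" "$2"
}

# Print the URL for the placeholder user "yourname" and case 123.
clone_url yourname 123

# To clone for real, pass the URL to git:
#   git clone "$(clone_url yourname 123)"
```

Printing the URL first lets you confirm the user name and case number before git asks you to authenticate.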
    

All actions on BioHPC will be performed in a terminal. Depending on whether you connect with SSH, VNC, or Visual Studio Code, details may differ (see Access Computer for details). We suggest connecting via SSH or using Visual Studio Code (which uses SSH in the background), for simplicity. It is assumed that you have done Bash setup.

  • If using SSH, you are already at the “terminal”. If using VNC, choose Terminal from the application menu.

    • If using Visual Studio Code to connect to BioHPC, follow instructions from the Github Codespaces tab!

  • Change directory to the common workarea:

    cd /home2/ecco_lv39/Workspace
    
  • Populate BioHPC with the Bitbucket repo

    • Use the LDI short-cut command

      aeagit 123 http
      

      to clone the Bitbucket repository for aearep-123 onto BioHPC (this executes git clone (URL)/aearep-123 behind the scenes).

      • If this is the first repository you clone this way, you may need to configure authentication. Follow the instructions from the aeagit command.

      • Once aeagit 123 has completed, it may show some error messages, followed by a command cd aearep-123. Copy and paste that command to enter the directory where you just cloned the repo. All subsequent actions should be done in that window.

All actions in Github Codespaces (CS for short) will be performed in Visual Studio Code (VSC): the left pane for file and Git actions, in the Terminal, or in the text editor. To connect to CS, see [instructions to come]. It is assumed that you will run this from your laptop, but it can be run from any internet-connected computer. The particular template you use for CS already has the Bash setup done for you. However, you need to have followed the CS setup instructions.

  • Access Github Codespaces, see appendix for more details.

  • Use the Terminal built into VSC (Menu -> Terminal -> New Terminal). By default, the Terminal should run a bash shell.

  • Populate CS with the Bitbucket repo (yes, this is a bit weird)

    • Use the LDI short-cut command

      aeagit 123 http
      

      to clone the Bitbucket repository for aearep-123 onto CS (this executes git clone (URL)/aearep-123 behind the scenes).

      • If this is the first repository you run on this CS instance, you may need to configure authentication. Follow the instructions from the aeagit command.

      • Once aeagit 123 has completed, it will open a new VSC window. All subsequent actions should be done in that window.

Verify that the code is present#

With the first step, you obtained a copy of the Bitbucket repository. You should now have a folder called aearep-123 (or whatever the repository number is). All subsequent steps should be done from there.

Important

If you do not see a folder like 123456 or dropbox-xyz in the repository, then the Bitbucket Pipeline likely did not work. You will then need to populate the code (and data) manually. You will do this AFTER downloading the data, in the next step.
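
The check described in this note can also be done from the terminal. The following is a minimal sketch, assuming that the deposit folder is either named with digits only (like 123456) or starts with dropbox-, as in the examples above. The function name has_deposit_folder is hypothetical and not part of the LDI tooling.

```shell
# Hypothetical helper: return 0 if the repository folder ($1) contains a
# deposit folder, i.e. a folder with an all-digit name (such as 123456)
# or a name starting with "dropbox-". Return 1 otherwise.
has_deposit_folder() {
  for path in "$1"/*/; do
    [ -d "$path" ] || continue   # skip the literal pattern when nothing matches
    name=$(basename "$path")
    case "$name" in
      dropbox-*) return 0 ;;     # e.g. dropbox-xyz
      *[!0-9]*)  ;;              # contains a non-digit: not a deposit number
      *)         return 0 ;;     # digits only, e.g. 123456
    esac
  done
  return 1
}

# Example: check the (placeholder) repository folder aearep-123.
if has_deposit_folder aearep-123; then
  echo "Deposit folder found."
else
  echo "No deposit folder: populate the code (and data) manually."
fi
```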

Verify that you can actually run the code#

Before you spend time obtaining the data, now is a good time to assess (again) if you have everything needed to run the code.

Have a look at the README again.

  • Is there information about Requirements?

    • Is there information about the software?

    • How long does the author say the code will run? Is it a reasonable time, or do we maybe need to run it on a more powerful computer?

    • How much memory, or processors, does the code need? Again, is the computer you intended to choose sufficient, or do we need to get access to a more powerful computer, or even a cluster of computers?
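
To find the relevant passages quickly, you can search the README for typical keywords. This is only a sketch: the function name scan_readme and the keyword list are assumptions, the path in the usage comment is a placeholder, and a README that uses other wording will not be matched, so still read the README itself.

```shell
# Hypothetical helper: print the README lines (with line numbers) that
# mention typical requirement keywords. The keyword list is an assumption.
scan_readme() {
  grep -n -i -E 'requirement|software|runtime|memory|processor|cores|hours|days' "$1"
}

# Example (adjust the path to wherever the authors' README is located):
#   scan_readme 111234/README.md
```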

Now fill out the Stated Requirements section of Part B of the report.

Prepare the code-check#

Now is a good time to understand the code in a bit more detail:

  • In the template, you will find code-check.xlsx.

    • Use this to create a list of all Tables and Figures in the paper

    • You will use this later as a guide to tabulate your findings!

  • Fill out the “Code Description” section of the REPLICATION.md

    • Provide some information about the program files (are there 3 Stata files? Are there 5 Matlab programs?). You will use this information to fill out the Software Used (in the main task) later as well, but provide details here.

      • You can use the file “generated/programs-list.txt” to help you here.

    • Did you have difficulty aligning the README with the files? Does the sequence suggested by the programs differ from what’s written in the README?

    • Are there files in the archive not explained in the README?

    • Copy and paste the contents of code-check.xlsx into the code description part, listing the programs. Omit the “Reproduced?” column when doing so. Use the Excel-to-Markdown plugin for VSCode.

      • This table will be pasted in under “Findings” again, with “Reproduced?” column, once code has been run.
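
To answer the "how many program files" question above, you can also count files by extension directly. This is a minimal sketch: the function name count_programs is hypothetical, and it searches the deposit folder itself rather than relying on the format of generated/programs-list.txt, which is not described here. The extensions .do (Stata) and .m (Matlab) are examples only.

```shell
# Hypothetical helper: count the files with extension $2 below folder $1.
count_programs() {
  find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# Example (111234 is the placeholder deposit folder):
#   count_programs 111234 do    # number of Stata do-files
#   count_programs 111234 m     # number of Matlab programs
```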

Verify that you can actually run the code#

Do you think you know how to run the code in the software mentioned?

If you do not have the right experience, talk to your supervisor!

If both of you agree that nobody will run the code, then

  • Move the subtask to “Part B is complete” via No code was run

  • Go straight to Part C, writing the Findings section

Describe the provided data#

The automated scripts should have filled out the “All data files provided” section, but if not, please do so here.

If the list is VERY long, put it into an appendix, but make a note in this section that there is an appendix with this info.

What if there are no data provided at all?

This can happen, for instance, when data are confidential and only available through some computer system at the Census Bureau or in Sweden.

If there are no data at all, you will skip getting the data and running code, and you are done with Part B!

  • Move the subtask to “Part B is complete” via No code was run

  • Go straight to Part C, writing the Findings section

If there is ANY data at all present, then continue to get the data.

Get the Data#

If you think that you are ready to run the code, you need to get the data. When getting the data, please take care to distinguish

  • data that is part of the openICPSR deposit

  • data that the README tells you to download or otherwise access

  • data that you are provided on the L-Drive, which is typically provided under an agreement with the authors, and cannot be redistributed.

Here, we will describe the most likely first step: getting the data from openICPSR. Any data you download should also be stored on this computer. We do not explicitly describe this here. CCSS is the most likely place where you do this, but double-check with your supervisor.

  • Access the CCSS environment: shortcut to Cloud. For other access, see the Appendix.

  • In some cases, you may be asked to use (restricted) data on the S: drive or L: drive. Follow instructions as you receive them.

  • Download the openICPSR data (if available).

    • Try to do this first using scripts. See the details in the appendix.

      python tools/download_openicpsr-private.py 111234 
      
      unzip -n 111234.zip -d 111234
      

      which should unpack the data files only, not overwriting anything else. If this fails, do the “Manual steps.”

  • Attempt to download data from the various sources indicated by the authors, but ONLY if no sign-up or application process is involved.

  • Download the project from openICPSR, using the script. See the details in the appendix:

    python3 tools/download_openicpsr-private.py 111234 
    
    • The downloaded ZIP file (111234.zip) needs to be unzipped (the terminal output will tell you):

      unzip -n 111234.zip -d 111234
      
  • Upload data that you obtained from other sources to CS. This is an advanced activity not described here.

  • Access Github Codespaces, see appendix for more details.

  • Use the Terminal built into VSC (Menu -> Terminal -> New Terminal). By default, the Terminal should run a bash shell.

  • Make sure you still have the clone of the Bitbucket repository; if not, follow the instructions under Code again.

  • Download the openICPSR deposit using the LDI short-cut command. See the details in the appendix.

    python3 tools/download_openicpsr-private.py 111234 
    

    If this fails, do the “Manual steps.”

    • The downloaded ZIP file (111234.zip) needs to be unzipped (the terminal output will tell you):

      unzip -n 111234.zip -d 111234
      
  • Upload data that you obtained from other sources to CS. There are two ways to do this:

    • Drag-and-drop the downloaded data file into the file pane of VSC, into the appropriate location.

    • Use the gh command line tool from a non-VSC terminal (on your local computer):

      gh cs cp datafile.dat remote:/workspaces/aearep-123/111234/data/location
      

      (adjust accordingly as per the author’s instructions)
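
The unzip step used above can be wrapped in a small check so that it fails with a clear message when the ZIP file is missing. This is a sketch under assumptions: the function name unpack_deposit is hypothetical, and it assumes the ZIP file is in the current directory and is named after the deposit number (111234.zip in the examples).

```shell
# Hypothetical helper: unpack <id>.zip into a folder named <id>.
unpack_deposit() {
  if [ ! -f "$1.zip" ]; then
    echo "Missing $1.zip: download it first." >&2
    return 1
  fi
  # -n: never overwrite existing files
  # -d: extract into the folder named after the deposit number
  unzip -n "$1.zip" -d "$1"
}

# Example, run from inside the cloned repository:
#   unpack_deposit 111234
```

Because of -n, running it a second time leaves files that are already present untouched.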

If the automated population of the author’s code directory did not work, you will need to manually download the replication package. Try to do this first using scripts.

Downloading using scripts

See the details in the appendix. You can do this on CCSS, BioHPC, or on Github CS.

Downloading using a browser

If you are still unsuccessful at this point, try this manual method (typically, on CCSS).

  • Download the code (and data) from openICPSR (this is the usual case). See the appendix for instructions on downloading these materials. The download is typically called 111234.zip.

    • Copy/paste the downloaded openICPSR ZIP file into the local copy of the aearep-123 repository

      • The ZIP file should be called something like 111234.zip. Note: on Windows, it might look like a folder, but it is not!

      • The ZIP file will be wherever your browser downloads materials - probably your Download folder.

Next manual steps

The local repository should now have the relevant LDI replication template materials and the openICPSR ZIP file containing the replication materials provided by the authors.

  • If uploading to CS, copy the downloaded ZIP file into the CS. There are two ways to do this:

    • Drag-and-drop the downloaded openICPSR ZIP file into the file pane of VSC.

    • Use the gh command line tool from a non-VSC terminal (on your local computer):

      gh cs cp 111234.zip remote:/workspaces/aearep-123
      

      (adjust accordingly)

  • Unzip the openICPSR ZIP file into a folder named for the openICPSR repository number.

    • From bash:

      unzip -n 111234.zip -d 111234
      
    • On Windows, right-click and select “Extract all”. When asked, do not overwrite files.

    • On OSX, double-click. When asked, do not overwrite files.

    • The individual files that are part of the replication package should now be in a subdirectory (e.g., 111234, the openICPSR repository number).

    • Perform a git add: git add 111234 should do the right thing.

    Warning

    Do not try to git add -f! Unexpected (typically bad) things will happen.
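
If you want to see what git add will do before running it, you can preview it. The sketch below uses two standard git options, git add --dry-run and git check-ignore -v. The function names are hypothetical, and it is an assumption that the files kept out of the repository (such as data files) are excluded through ignore rules.

```shell
# Hypothetical helper: list the files that "git add" would stage for the
# path $1, without staging anything.
preview_add() {
  git add --dry-run -- "$1"
}

# Hypothetical helper: show which ignore rule (if any) excludes the file $1.
why_ignored() {
  git check-ignore -v -- "$1"
}

# Example, run from inside the cloned repository:
#   preview_add 111234
#   why_ignored 111234/data/somefile.dta   # placeholder file name
```

Files that do not appear in the preview are being skipped on purpose, which is why forcing them in with -f is discouraged.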

Did the Bitbucket Pipeline scripts work?#

In cases where the package downloaded from openICPSR is too big, or where the data do not come from openICPSR, the automated Bitbucket Pipeline scripts will not work. In this case, you will need to run the automated scripts yourself, using the additional steps below. This is best done on BioHPC or CS.

If the Bitbucket Pipeline scripts worked, you don’t have to do anything additional and can move on.

Running the automated steps on CCSS has not been fully tested yet.

  • Access BioHPC, see Access Computer for details.

  • Change directory to the place where you downloaded the repository and the data:

    cd /home2/ecco_lv39/Workspace/aearep-123

  • Then run the ingest scripts:

    export PATH=$PATH:/usr/local/stata16/
    ./tools/pipeline-steps1-4.sh 123456
    git push
    
  • Change directory to the place where you downloaded the repository and the data, typically:

    cd /workspaces/aearep-123

  • Then run the ingest scripts:

    ./tools/pipeline-steps1-4.sh 123456
    git push
    

Ready!#

You are now ready to run code.