HowTo Pre-Process Input Data

We begin in the 01_runcg directory, which contains three main SAS programs and a configuration program named config_param.sas. The goal of this part of the process is to create a binary input file for CG2 and to calculate the size of the problem.

Testing Your Installation

The package is set up by default to use a small 49-observation dataset created by the 01_create_pik_sein_records.sas program. The data are purely fictional but have the structure CG2 expects, so they can be used to test your installation. If you follow the complete HowTo set of guides, you will be able to compare your CG2 output with the results from our reference system.

Pre-Processing Guide

Set up the values in config_param.sas. The binvars macro variable should contain a list beginning with the dependent variable, followed by each right-hand-side variable. Do not include a constant; it is absorbed in the person effect. Take care when specifying your model to ensure that no linear combinations of your right-hand-side variables are included. Because CG2 never inverts X'X, it makes no check that your specification is identified: if it is not, CG2 will still produce output, but that output will be incorrect. If you are unsure of your specification, run a regression using PROC GLM with the absorb pik sein option. The inputs variable is not required but can be used in your programs when creating the input file. The macro variable cellout should point to the directory where the binary file will be created. This directory can be located anywhere on the system. If you are solving a large problem, use the fastest disk available on your system.
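As a rough illustration of this step, the sketch below shows how the settings and the identification check might look. The variable names lnwage, exper, and exper2 and the directory path are hypothetical placeholders. The exact layout of config_param.sas in your copy of the package may differ, and the libref cg is assumed to be assigned already.

```sas
/* Hypothetical values for config_param.sas. Edit the existing statements   */
/* in your copy rather than pasting these; names and path are placeholders. */
%let binvars = lnwage exper exper2;   /* dependent variable first, no constant */
%let cellout = /scratch/fast/cg2;     /* directory for the binary file */

/* Identification check suggested in the text. ABSORB requires the data */
/* to be sorted by the absorbed variables (pik, then sein).             */
proc glm data=cg.wage_history_01;
  absorb pik sein;
  model lnwage = exper exper2 / solution;
run;
quit;
```

If PROC GLM notes that the X'X matrix is singular, or marks estimates with the letter B, at least one right-hand-side variable is a linear combination of the others or of the absorbed effects. Drop or redefine it before running CG2.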
Run the 01_create_pik_sein_records.sas program to create cg.wage_history_01.sas7bdat, or prepare your own file with the same name. The file must be sorted by person ID (pik), firm ID (sein), and time (year, quarter, week, etc.), with no duplicates: a person should have only one employer in each period.
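If you prepare your own file, a check along the following lines may help confirm that it meets these requirements. This is only a sketch. It assumes that the period variable is named year, that the libref cg is already assigned, and that work.multi_employer is a free name for a scratch table.

```sas
/* Put the file in the required order: person, firm, then period. */
/* Assumes the period variable is called year.                    */
proc sort data=cg.wage_history_01;
  by pik sein year;
run;

/* List every person-period with more than one record, that is, */
/* a person with more than one employer in the same period.     */
proc sql;
  create table work.multi_employer as
  select pik, year, count(*) as n_records
  from cg.wage_history_01
  group by pik, year
  having count(*) > 1;
quit;
```

An empty work.multi_employer table means that no person has more than one employer in any period. Any rows it does contain identify the person-periods to resolve before continuing.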
Run the 02_size_calc.sas program to determine the size of the problem (the program requires a dataset named cg.wage_history_01.sas7bdat). Check the log file to ensure the program completed successfully. Search for the following line:
```
cg.cgin:           49         11           8            5.
```
This line provides the configuration information for CG2: the number of observations, cells, persons, and firms, in that order (here 49 observations, 11 cells, 8 persons, and 5 firms). The program also creates sequential person and firm identifiers that are used by CG2 and the groups program. The cellsout file lists the person-firm matches, or cells, that will be used by the grouping program. The _lookup file contains the crosswalk from pik and sein to the sequential person and firm identifiers used by CG2. The strip02 file is a pik sein year file that contains the sequential person and firm identifiers; it is used to attach the sequential identifiers to your input data in the next step.
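To cross-check the four numbers in the log, you could count them directly from the input dataset. The query below is a minimal sketch, not part of the package. It assumes the libref cg is assigned and that no additional sample selection has been added to 02_size_calc.sas; if it has, the counts will differ unless the same selection is applied here.

```sas
/* Independent count of observations, person-firm cells, persons, and firms. */
proc sql;
  select count(*)                            as n_obs,
         count(distinct catx('|', pik, sein)) as n_cells,   /* pik-sein matches */
         count(distinct pik)                 as n_persons,
         count(distinct sein)                as n_firms
  from cg.wage_history_01;
quit;
```

The catx call joins pik and sein with a separator so that each distinct person-firm pair is counted once, whether the identifiers are stored as character or numeric values.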
Run the 04_get_binary4.sas program to create the binary input file for CG2. The program attaches the sequential identifiers to your data and creates the binary file cgout4. This file is specific to each run of CG2: if you add a variable to your model, change the record selection criteria, or make any similar change, the whole process must be repeated.

Additional Information

You are likely to run CG2 multiple times while conducting your data analysis. It is a good idea to create a SAS dataset, sorted by pik, sein, and year, containing all of the variables and records you expect to use. Ideally this dataset is created only once. For each subsequent analysis, you can select the sample for that run in the 02_size_calc.sas and 04_get_binary4.sas programs. If you discover that you need a new variable, you can create it in the 04_get_binary4.sas program.
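One way to organize this workflow is sketched below. The dataset names work.all_records and cg.wage_master, the macro variable sample, and the year cutoff are all hypothetical. The package itself expects only cg.wage_history_01, and how you apply the selection inside 02_size_calc.sas and 04_get_binary4.sas depends on how those programs read their input.

```sas
/* Build the master file once, in pik sein year order.      */
/* work.all_records and cg.wage_master are placeholder names. */
proc sort data=work.all_records out=cg.wage_master;
  by pik sein year;
run;

/* Define the record selection for this run in one place, so the */
/* same condition can be reused wherever the data are read.      */
%let sample = year >= 1995;   /* hypothetical selection criterion */

/* Subset the master file; a WHERE= subset keeps the sort order. */
data cg.wage_history_01;
  set cg.wage_master(where=(&sample));
run;
```

Keeping the selection in a single macro variable reduces the risk that the size calculation and the binary file are built from different samples.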
