HowTo Pre-Process Input Data
We begin in the 01_runcg directory. This directory contains three main
SAS programs and a configuration program named config_param.sas. The goal of this
part of the process is to create a binary input file for CG2 and to
calculate the size of the problem.
Testing Your Installation
By default, the package is set up to use a small 49-observation dataset
created by the 01_create_pik_sein_records.sas program. The data is
purely fictional, but it has the structure expected by CG2 and can
be used to test your installation. If you follow the complete
set of HowTo guides, you will be able to compare your CG2 output with
the results from our reference system.
Pre-Processing Guide
- Set up the values in config_param.sas. The binvars macro
variable should contain a list beginning with the dependent variable,
followed by each right-hand-side variable. Do not include a constant;
it is absorbed in the person effect. Take care when specifying your
model to ensure that no right-hand-side variable is a linear
combination of the others. Because CG2 never inverts X'X, it makes no
check that your specification is identified. If it is not, CG2 will
still produce output, but the output will be incorrect. If you are
unsure of your specification, run a regression using PROC GLM with the
ABSORB pik sein statement.
The inputs variable is not required but can be used in your
programs when creating the input file. The macro variable cellout
should point to the directory where you would like the binary file to
be created. This directory can be located anywhere on the system.
If you are solving a large problem it would be a good idea to use
the fastest disk structure available on your system.
- Run the 01_create_pik_sein_records.sas program to create
cg.wage_history_01.sas7bdat, or prepare your own file with
the same name. The file must be sorted by person ID (pik), firm ID
(sein), and time (year, quarter, week, etc.), with no duplicates (a
person should have only one employer in each period).
- Run the 02_size_calc.sas program to determine the size of the
problem (the program requires a dataset named cg.wage_history_01.sas7bdat).
Check the log file to ensure the program completed successfully.
Search the log for the following line:
cg.cgin: 49 11 8 5.
This line provides the configuration information for CG2: the number
of observations, cells, persons, and firms, in that order. The program also
creates sequential person and firm identifiers that are used by CG2 and
the groups program. The cellsout file contains a list of the
person-firm matches, or cells, that will be used by the grouping program.
The _lookup file contains the crosswalk from pik and sein to the
sequential person and firm identifiers used by CG2. The strip02
file is a pik sein year file that contains the sequential person and
firm identifiers. This file is used to attach the sequential
identifiers to your input data in the next step.
- Run the 04_get_binary4.sas program to create the binary input
file for CG2. The sequential identifiers are attached to your
data and the binary file cgout4 is created. This file is
specific to each run of CG2. If you would like to add a variable
to your model, change the record selection criteria, etc. then the
whole process must be repeated.
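The checks described in the steps above can be sketched in SAS as follows. This is only an illustration, not code taken from the package: the %let syntax for config_param.sas, the variable names lnearn, exper, exper2, and year, the path /scratch/cg2/run01, and the work dataset name are all assumptions, and the cg libref is assumed to be assigned already.

```sas
/* Hypothetical config_param.sas values; the variable names and path are made up. */
%let binvars = lnearn exper exper2;   /* dependent variable first, no constant */
%let cellout = /scratch/cg2/run01;    /* directory for the binary file */

/* Sort-order check: a DATA step with a BY statement stops with an error
   if the file is not sorted by pik, sein, and the time variable. */
data _null_;
  set cg.wage_history_01;
  by pik sein year;
run;

/* Duplicate check: list any person with more than one record in a period. */
proc sql;
  create table work.dup_person_periods as
  select pik, year, count(*) as n_records
  from cg.wage_history_01
  group by pik, year
  having count(*) > 1;
quit;

/* Problem size: observations, persons, and firms, then person-firm cells. */
proc sql;
  select count(*)             as n_obs,
         count(distinct pik)  as n_persons,
         count(distinct sein) as n_firms
  from cg.wage_history_01;

  select count(*) as n_cells
  from (select distinct pik, sein from cg.wage_history_01);
quit;

/* Identification check: the model is typed out to match the hypothetical
   binvars list above. ABSORB needs the data sorted by pik and sein. */
proc glm data=cg.wage_history_01;
  absorb pik sein;
  model lnearn = exper exper2 / solution;
run;
quit;
```

If the file is correctly prepared, the DATA step runs without a sort error and work.dup_person_periods has no rows. The four counts would be expected to match the numbers on the cg.cgin line written by 02_size_calc.sas. In the PROC GLM output, estimates flagged as not uniquely estimable point to a linear dependency among the right-hand-side variables.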
Additional Information
You are likely to run CG2 multiple times while conducting your data
analysis. It is a good idea to create a SAS dataset in pik sein
year sort order containing all of the variables and records you would
like to use. This dataset will ideally be created only once.
For each succeeding analysis you can then select the sample you would
like to use in the 02_size_calc.sas and 04_get_binary4.sas
programs. If you discover you need a new variable, you can create
it in the 04_get_binary4.sas program.
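One way to organize this workflow is sketched below, under stated assumptions. The names work.all_records, cg.master, work.sample, and sample_where, the year range, and the variables exper and exper2 are hypothetical. How 02_size_calc.sas and 04_get_binary4.sas actually read their input is not shown here, so treat this as a pattern to adapt rather than as the package's own code.

```sas
/* Build the master dataset once, in pik sein year sort order.
   work.all_records is a hypothetical source with every variable and record. */
proc sort data=work.all_records out=cg.master;
  by pik sein year;
run;

/* Keep the record selection in one macro variable so that the same
   condition can be reused wherever a run selects its sample. */
%let sample_where = 1990 <= year <= 1999;   /* hypothetical criterion */

/* Select the sample for this run and add a new variable when needed. */
data work.sample;
  set cg.master(where=(&sample_where));
  exper2 = exper**2;   /* hypothetical new right-hand-side variable */
run;
```

Defining the selection once is intended to keep the size calculation and the binary file in agreement, since the sequential identifiers created in one step are attached to the data in the other.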