Lab 7: Probabilistic Cross Walks and Industry Data

Data are on the Virtual RDC at /space/courses/info747/lab7.

Input data: naicsmiss.sas7bdat and naicsmiss.dta.

Cross walk: sic_naics.sas7bdat and/or sic_naics.dta.

1. The variables in sic_naics.sas7bdat are:
   es_sic = the 4-digit 1987 SIC code;
   naics_impute = the 6-digit 2002 NAICS code;
   emp = employment in the indicated (SIC, NAICS) pair;
   sum_sic = employment in the indicated SIC;
   pct_emp = emp/sum_sic;
   low_limit = lower limit for random comparison to pct_emp in imputation;
   up_limit = upper limit for random comparison to pct_emp in imputation.
   (Note: incomplete employment data in the cross walk are indicated by a value of 1 for sum_sic and fractional values for emp. Do not worry about this.)

2. The variables in naicsmiss are sic (sometimes incomplete; i.e., expressed to only 2 or 3 digits) and naics (always missing).

3. Write a SAS program to do a single probabilistic imputation of naics from the data in sic_naics. For complete (4-digit) SIC codes, this is a straightforward application of the information in the sic_naics cross walk. Be careful with the incomplete SIC codes: for these cases you will have to build the correct conditional probability model for the imputation. You may do this exercise in Stata if you prefer; the input data files in .dta format are also in the lab7 space.

4. If you run your program a second time, you should not get the same answer. Explain why not.
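
For orientation only, the logic in items 1 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the expected SAS or Stata solution. The cross-walk rows are made up. An incomplete SIC is assumed to be stored as a string prefix of the 4-digit code. The conditional probability of each NAICS code is assumed to be its share of employment pooled over all 4-digit SICs that share that prefix. The function names build_intervals and impute_naics are hypothetical. The sketch also ignores the sum_sic = 1 rows mentioned in item 1.

```python
import random
from collections import defaultdict

# Hypothetical cross-walk rows (es_sic, naics_impute, emp); values are made up.
CROSSWALK = [
    ("2041", "311211", 80.0),
    ("2041", "311212", 20.0),
    ("2047", "311111", 50.0),
    ("2048", "311119", 50.0),
]


def build_intervals(sic, table):
    """Return [(naics, low_limit, up_limit)] for all 4-digit SICs starting with `sic`.

    Assumption: an incomplete SIC is a string prefix of the full code, and each
    NAICS code's probability is its share of the pooled employment.
    """
    emp_by_naics = defaultdict(float)
    for es_sic, naics, emp in table:
        if es_sic.startswith(sic):
            emp_by_naics[naics] += emp
    total = sum(emp_by_naics.values())

    intervals = []
    low = 0.0
    for naics in sorted(emp_by_naics):
        pct_emp = emp_by_naics[naics] / total
        # Consecutive intervals partition [0, 1) in proportion to pct_emp.
        intervals.append((naics, low, low + pct_emp))
        low += pct_emp
    return intervals


def impute_naics(sic, table, u):
    """Pick the NAICS whose [low_limit, up_limit) interval contains u in [0, 1)."""
    intervals = build_intervals(sic, table)
    if not intervals:
        return None  # no cross-walk row matches this SIC
    for naics, low, up in intervals:
        if low <= u < up:
            return naics
    return intervals[-1][0]  # guard against rounding just below 1.0


u = random.random()  # one uniform draw per record to be imputed
print(impute_naics("204", CROSSWALK, u))
```

With a complete code such as "2041", only that SIC's rows contribute, so the intervals reduce to the within-SIC employment shares. With the 3-digit code "204", employment is pooled across 2041, 2047, and 2048 before the shares are computed.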