Sometimes when data is combined from different sources it is possible to end up with undesirable missing values. Run the following code against the attached CSV file to generate a dataset demonstrating this issue:
data miss_val ; infile 'miss_val.csv' dsd missover ; input name :$30. email :$30. first :$15. last :$15. age ; run ;
The dataset looks like:
As can be seen there are missing values of the
age variables in each name BY-GROUP. The objective is to populate the missing fields, based on the non-missing fields for the same BY-GROUP. The RETAIN statement could be used in conjunction with FIRST. and LAST. variables for each of the columns in turn, re-initializing the value only as the BY-GROUP changes.
PROC SQL re-merging
Alternatively the re-merge facility within PROC SQL can produce the same result in a more succint form. The re-merge facility causes a summary statistic (calculated down a column) to be re-merged back on to the detail records, with a message to that effect written to the LOG. e.g. consider the dataset:
and the code:
proc sql noprint ; create table two as select cvar ,nvar ,mean(nvar) as avg from one ; quit
will generate the output:
It might be assumed that the 'summary statistic' column can only be a numeric variable, however certain of the Statistical Functions can be used with character variables, and it is this feature which allows us to generate our desired outcome.
Use the code:
proc sql noprint ; create table no_miss as select name /* GROUPing variable */ ,email /* Un-Summarized data to ensure data lines are not collapsed to a single entry */ ,max(first) as first /* Summarized variable - re-merged back against detail rows */ ,max(last) as last /* Summarized variable - re-merged back against detail rows */ ,max(age) as age /* Summarized variable - re-merged back against detail rows */ from miss_val group by name ; quit ;
In this example, the
name column is the BY-GROUP variable, the
Attached files: miss_val.csv