Sometimes when data is combined from different sources it is possible to end up with undesirable missing values. Run the following code against the attached CSV file to generate a dataset demonstrating this issue:
data miss_val ; infile 'miss_val.csv' dsd missover ; input name :$30. email :$30. first :$15. last :$15. age ; run ;
The dataset looks like:
name | first | last | age | |
John Doe | john_doe@acme.com | John | . | |
John Doe | john_doe@acme.com | Doe | . | |
John Doe | john_doe@acme.com | John | 18 | |
John Doe | john_doe@acme.com | John | Doe | . |
Joe Smith | joe_smith@acme.com | Joe | Smith | . |
Joe Smith | joe_smith@acme.com | Joe | Smith | . |
Becky Smith | becky_smith@acme.com | Becky | Smith | . |
Becky Smith | becky_smith@acme.com | Smith | 34 | |
Charlie Thomas | charlie_thomas@acme.com | Charlie | Thomas | 9 |
Charlie Thomas | charlie_thomas@acme.com | Charlie | Thomas | . |
Charlie Thomas | charlie_thomas@acme.com | Thomas | 9 |
As can be seen there are missing values of the first
, last
and age
variables in each name BY-GROUP. The objective is to populate the missing fields, based on the non-missing fields for the same BY-GROUP. The RETAIN statement could be used in conjunction with FIRST. and LAST. variables for each of the columns in turn, re-initializing the value only as the BY-GROUP changes.
PROC SQL re-merging
Alternatively the re-merge facility within PROC SQL can produce the same result in a more succint form. The re-merge facility causes a summary statistic (calculated down a column) to be re-merged back on to the detail records, with a message to that effect written to the LOG. e.g. consider the dataset:
cvar | nvar |
A | 100 |
B | 500 |
C | 300 |
and the code:
proc sql noprint ; create table two as select cvar ,nvar ,mean(nvar) as avg from one ; quit
will generate the output:
cvar | nvar | avg |
A | 100 | 300 |
B | 500 | 300 |
C | 300 | 300 |
It might be assumed that the 'summary statistic' column can only be a numeric variable, however certain of the Statistical Functions can be used with character variables, and it is this feature which allows us to generate our desired outcome.
Use the code:
proc sql noprint ; create table no_miss as select name /* GROUPing variable */ ,email /* Un-Summarized data to ensure data lines are not collapsed to a single entry */ ,max(first) as first /* Summarized variable - re-merged back against detail rows */ ,max(last) as last /* Summarized variable - re-merged back against detail rows */ ,max(age) as age /* Summarized variable - re-merged back against detail rows */ from miss_val group by name ; quit ;
In this example, the name
column is the BY-GROUP variable, the email
variable is neither in the BY-GROUP, nor is it a summarized column, and therefore causes the re-merge feature to operate. The MAX function returns the largest non-missing value, and because it lists only a single variable, the search is performed DOWN the column (ANSI Standard default action) according to the GROUP BY clause. The returned value is then re-merged back against the detail records for the corresponding BY-GROUP in a new computed column which shares a name with the original incomplete column.
Attached files: miss_val.csv