0 users online | 0 Guests and 0 Registered

How can I build sample datasets to perform fuzzy matches?


Sometimes data coming from different sources can be subtly different, resulting in a mis-match when performing a straight character comparison, but a visual comparison would suggest that the data should in fact be matched.  The sort of issues which can arise are non-alphabetic characters, or issues regarding case-sensitivity.

To demonstrate how to perform the 'fuzzy' matches, we must first generate some 'dirty' data.  For this exercise I have used open-source data of companies listed on the NY Stock Exchange.  Reading in this data, and generating a dummy stock holding figure:

*** Use open-source data with details of companies listed on the NY Stock Exchange *** ;

filename usco url 'https://raw.githubusercontent.com/datasets/nyse-listings/master/data/nyse-listed.csv' ;

data nyse (drop = d1) ;
  infile usco dsd firstobs = 2 missover ;
  input d1    : $1.
        stock : $100.
	;
  *** Remove some of the additional characters which can cause issues identifying word boundaries *** ;
  stock   = compress(stock,"',)(.-/")   ;
  *** Generate a stock holding *** ;
  holding = round(ranuni(0)*10000,0.01) ;
run ;

The stock name has also had a number of non-alphabetic characters removed to assist in clarifying word-boundaries.

Using this master list, the text can be manipulated using PERL Regular Expressions to extract a random number of 'words' from the stock name.  The stock holding is also randomly adjusted for a proportion of the records.

data list1 (keep = stock stock1 hold1)  ;
  set nyse ;
    *** Count the number of 'words' in the stock name *** ;
	wc = countw(stock) ;
	*** Build a PERL Regular Expression in the form /(\w+ ){04}/ i.e. four words separated by spaces *** ;
	*** Select a randon number of words between 3 and all available words *** ;
	regexp = '/(\w+ ){'!!put(max(3,floor(ranuni(0) * wc)),z2.)!!'}/' ;

	/*** 
	     /             Start the PERL RegExp
	     (             Open Content Buffer 1
             \w            Match a 'word' character
             +             ...one or more times
                           ...followed by a space
             )             Close Content Buffer 1
             {             Quantifier Multiple of the previous 'thing' (in this case 'a word followed by a space')
	     !!            Concatenate         
             put(max(3,floor(ranuni(0) * wc)),z2.)    
	                   Generate a random number between 0-1 and multiply by the number of words           : RANUNI
	                   Take the integer less than or equal to the result                                  : FLOOR
	                   Return the greater of 3 or this integer (amend the 3 to return more / fewer words) : MAX
	                   As a zero-filled two-digit text string e.g. 04                                     : PUT
	     !!            Concatenate
	     }             Finish Quantifier Multiple
	     /             End the PERL RegExp
	***/

	*** Parse the PERL RegExp and assign it a unique numeric identifier *** ;
        words = prxparse(regexp) ;
	*** Endeavour to match the desired RegExp and return the Starting point (s) and Length (e) *** ;
        call prxsubstr(words,stock,s,e) ;
	*** If the RegExp was found within the target string, extract it and convert to upper case *** ;
	if s and e then stock1 = upcase(substrn(stock,s,e)) ;
	else stock1 = stock ;
	*** Generate a random number, add 1 and round to 2 decimal places *** ;
	rn = 1 + round(ranuni(0),.01) ;
	*** Adjust the holding for a proportion of the records *** ;
	if rn < 1.3 then hold1 = round(holding * rn, .01) ; 
	else hold1 = holding ;
run ;

A second list is generated which results in subtly different stock names and holdings.

*** Repeat the process to generate a second list *** ;
data list2 (keep = stock2 hold2)  ;
  set nyse ;
	wc = countw(stock) ;
	regexp = '/(\w+ ){'!!put(max(3,floor(ranuni(0) * wc)),z2.)!!'}/' ;
        words = prxparse(regexp) ;
        call prxsubstr(words,stock,s,e) ;
	if s and e then stock2 = upcase(substrn(stock,s,e)) ;
	else stock2 = stock ;
	rn = 1 + round(ranuni(0),.01) ;
	if rn < 1.3 then hold2 = round(holding * rn, .01) ; 
	else hold2 = holding ;
run ;

The two lists can then be aligned to show the original stock name and the two sets of variants.

*** Align the two lists to show how the values have been altered *** ;
data paired ;
  set list1 ;
  set list2 ;
run ;

Partial output:

stock stock1 hold1 stock2 hold2
Agilent Technologies Inc Common Stock AGILENT TECHNOLOGIES INC 5671.37 AGILENT TECHNOLOGIES INC 5671.37
Alcoa Inc Common Stock ALCOA INC COMMON 3770.28 ALCOA INC COMMON 3770.28
Alcoa Inc Depository Shares Representing 110th Preferred Convertilble Class B Series 1 ALCOA INC DEPOSITORY 9229.79 ALCOA INC DEPOSITORY SHARES REPRESENTING 110TH PREFERRED CONVERTILBLE 9229.79
AAC Holdings Inc Common Stock AAC HOLDINGS INC 2324.52 AAC HOLDINGS INC COMMON 2324.52
Aarons Inc Common Stock AARONS INC COMMON 2140.73 AARONS INC COMMON 2140.73
Advance Auto Parts Inc Advance Auto Parts Inc WI ADVANCE AUTO PARTS INC ADVANCE AUTO 5653.82 ADVANCE AUTO PARTS 5653.82
American Assets Trust Inc Common Stock AMERICAN ASSETS TRUST 7538.1 AMERICAN ASSETS TRUST 7538.1
Advantage Oil & Gas Ltd Ordinary Shares Advantage Oil & Gas Ltd Ordinary Shares 5445.6 Advantage Oil & Gas Ltd Ordinary Shares 5445.6
Allianceberstein Holding LP Units ALLIANCEBERSTEIN HOLDING LP 278.76 ALLIANCEBERSTEIN HOLDING LP 239.6
ABB Ltd Common Stock ABB LTD COMMON 6559.09 ABB LTD COMMON 7870.91
AbbVie Inc Common Stock ABBVIE INC COMMON 3115.37 ABBVIE INC COMMON 3115.37
AmerisourceBergen Corporation Holding Co Common Stock AMERISOURCEBERGEN CORPORATION HOLDING 4909.42 AMERISOURCEBERGEN CORPORATION HOLDING CO 4958.51
Ambev SA American Depositary Shares Each representing 1 Common Share AMBEV SA AMERICAN DEPOSITARY SHARES EACH 4082.79 AMBEV SA AMERICAN 4776.86
Asbury Automotive Group Inc Common Stock ASBURY AUTOMOTIVE GROUP INC 4532.2 ASBURY AUTOMOTIVE GROUP 4532.2

 

Fuzzy matches can then be performed against the two lists - matching the character variable left -> right and stopping when the end of the shorter string is reached.  The holdings can be rounded to a desired level of accuracy before comparing.

*** Perform a 'fuzzy' match on the two lists *** ;
*** Read the text strings from left to right and stop when the end of the shorter one is reached *** ;
*** Compare holdings rounded to the nearest integer-multiple of 500 (amend this for different margin) *** ;
proc sql noprint ;
  create table matched (drop = len) as
  select  stock1
         ,hold1
         ,stock2
	 ,hold2
	 ,min(length(stock1),length(stock2)) as len
  from  list1
       ,list2
  where upcase(substr(stock1,1,calculated len)) = upcase(substr(stock2,1,calculated len))
  and   round(hold1,500) = round(hold2,500)
  ;
quit ;
SPEDIS Function

Another useful function for comparing character strings is the SPEDIS function which generates an integer value representing the SPElling DIStance between two strings.  The SPEDIS function is case-sensitive - it may be necessary to transform the data prior to comparison.

The function reads the two character variables from left to right and assigns a score for each character difference.  Differences at the beginning of the string count more highly than differences at the end of the string.  The length of the string also impacts on the score, as a one letter spelling difference is more significant in a short string e.g.

data _null_ ;
length w1 w2 $ 200 ;

w1 = 'cat' ;
w2 = 'vat' ;
sped = spedis(w1,w2) ;
put _all_ ;

w1 = 'car' ;
w2 = 'cat' ;
sped = spedis(w1,w2) ;
put _all_ ;

w1 = 'catastrophic' ;
w2 = 'vatastrophic' ;
sped = spedis(w1,w2) ;
put _all_ ;

w1 = 'catastrophic' ;
w2 = 'catastrophiv' ;
sped = spedis(w1,w2) ;
put _all_ ;

run ;


w1=cat w2=vat sped=66 _ERROR_=0 _N_=1
w1=car w2=cat sped=33 _ERROR_=0 _N_=1
w1=catastrophic w2=vatastrophic sped=16 _ERROR_=0 _N_=1
w1=catastrophic w2=catastrophiv sped=8 _ERROR_=0 _N_=1

Using the SPEDIS function on the paired dataset created earlier to generate a measure of the distance between the two stock name variants, would allow a filter to be applied to select records which are deemed to be sufficiently similar.

*** For variables which have been joined by some other means a 'reasonability' check can be performed *** ;
*** Using the PAIRED dataset the SPEDIS function assigns numeric value to the Spelling Distance       *** ;

data paired ;
  set paired ;
    sped = spedis(stock1,stock2) ;
run ;

Partial Output:

 

stock stock1 hold1 stock2 hold2 sped
Agilent Technologies Inc Common Stock AGILENT TECHNOLOGIES INC 5671.37 AGILENT TECHNOLOGIES INC 5671.37 0
Alcoa Inc Common Stock ALCOA INC COMMON 3770.28 ALCOA INC COMMON 3770.28 0
Alcoa Inc Depository Shares Representing 110th Preferred Convertilble Class B Series 1 ALCOA INC DEPOSITORY 9229.79 ALCOA INC DEPOSITORY SHARES REPRESENTING 110TH PREFERRED CONVERTILBLE 9229.79 122
AAC Holdings Inc Common Stock AAC HOLDINGS INC 2324.52 AAC HOLDINGS INC COMMON 2324.52 21
Aarons Inc Common Stock AARONS INC COMMON 2140.73 AARONS INC COMMON 2140.73 0
Advance Auto Parts Inc Advance Auto Parts Inc WI ADVANCE AUTO PARTS INC ADVANCE AUTO 5653.82 ADVANCE AUTO PARTS 5653.82 17
American Assets Trust Inc Common Stock AMERICAN ASSETS TRUST 7538.1 AMERICAN ASSETS TRUST 7538.1 0

Attached files: us_companies.csv, fuzzy-matches.txt

Author:
Alan D Rudland
Revision:
1.8
Average rating:0 (0 Votes)

You cannot comment on this entry

Chuck Norris has counted to infinity. Twice.

Records in this category

Tags