Data preparation and fuzzy matching techniques for improved statistical modeling

Sloan, Stephen; Lafler, Kirk Paul

doi:10.3233/MAS-180447

Issue title: Special Issue – SAS Global Forum 2018

Guest editors: Jennifer Waller and Tyler Smith

Article type: Research Article

Authors: Sloan, Stephen^{a; *} | Lafler, Kirk Paul^b

Affiliations: [a] Cream Ridge, NJ 08514, USA | [b] Spring Valley, CA 91978, USA

Correspondence: [*] Corresponding author: Stephen Sloan, 42 Tower Road, Cream Ridge, NJ 08514, USA. Tel.: +1 917 375 2937; Fax: +1 609 758 5240; E-mail: [email protected].

Abstract: Data comes in all forms, shapes, sizes and complexities. SAS® users know all too well that data stored in files and data sets can be, and often is, plagued with a variety of problems. Although today’s statistical software is extremely powerful, it is typically not designed to overcome poor-quality data. This paper describes and recommends a comprehensive data preparation and fuzzy matching process that enables improved statistical modeling; statistical techniques are also available for comparing the results of the process. Most statistical software users are aware that two or more data files can be joined, or combined, without a problem when the files have identifiers with unique and reliable values. However, many files do not have unique identifiers, or “keys”, and need to be joined using character values such as names or e-mail addresses. To add to the difficulty, these identifiers might be spelled differently or follow different abbreviation or capitalization conventions. The paper describes a versatile 6-step approach to handling these data preparation and fuzzy matching issues. The steps are: identifying and understanding potential matching scenarios; exploring data values and data types; data cleaning and validation; data transformation; traditional merge and join techniques; and an assortment of techniques to successfully merge, join and match less than perfect, or “messy”, data, using special-purpose character-handling functions such as the SOUNDEX algorithm for phonetic matching and the SPEDIS, COMPLEV, and COMPGED functions for fuzzy matching. Although the programming techniques in this paper are illustrated using SAS code, many, if not most, of them can be applied to any software platform that supports character-handling capabilities.
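As an illustration of the abstract's closing point that the techniques carry over to other platforms, the following is a minimal Python sketch of the cleaning and fuzzy matching steps. It is not the authors' SAS code. The helper names `normalize`, `soundex`, `levenshtein`, and `fuzzy_match`, the sample names, and the distance threshold of 2 are all assumptions made for this example. `soundex` here is the classic four-character American Soundex and may not reproduce the output of the SAS SOUNDEX function; `levenshtein` computes the plain Levenshtein edit distance, the same kind of measure the abstract associates with COMPLEV.

```python
import re

# Classic Soundex digit groups (illustrative; not taken from the paper).
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def normalize(value):
    """Cleaning step: uppercase, delete punctuation, collapse whitespace."""
    value = re.sub(r"[^A-Za-z0-9 ]+", "", value.upper())
    return " ".join(value.split())


def soundex(name):
    """Four-character American Soundex code (letter plus three digits)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(left, right, max_distance=2):
    """Pair values whose normalized forms are within max_distance edits."""
    pairs = []
    for l in left:
        for r in right:
            dist = levenshtein(normalize(l), normalize(r))
            if dist <= max_distance:
                pairs.append((l, r, dist))
    return pairs


# Hypothetical sample data with spelling and capitalization differences.
matches = fuzzy_match(["Jon Smith", "Katherine O'Brien"],
                      ["JOHN SMITH", "Catherine OBrien", "Mary Jones"])
print(matches)
print(soundex("Robert"), soundex("Rupert"))  # both R163
```

Normalizing before comparing keeps capitalization and punctuation differences from inflating the edit distance. A phonetic code such as Soundex can then serve as a coarse grouping key, with the edit distance ranking candidates within each group.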

Keywords: SAS, fuzzy matching, character-handling functions, phonetic matching, SOUNDEX, SPEDIS, edit distance, Levenshtein, COMPLEV, COMPGED

Journal: Model Assisted Statistics and Applications, vol. 13, no. 4, pp. 367-375, 2018

Published: 31 October 2018
