A Guide to Basic Pattern Analysis in R

What is basic pattern analysis?

Basic pattern analysis, as implemented in the R package bpa, is a data pre-processing tool and is designed to help reduce the time spent doing various pre-processing tasks. It takes inspiration from some of the functionality of SAS/DataFlux Data Management Studio. More specifically, the functions in bpa help standardize the data so that multiple formatting issues, typos, and other unexpected outcomes can more easily be identified in unfamiliar and/or large amounts of data.

What is the bpa package?

Package bpa, through the function basic_pattern_analysis, allows the analyst to run basic pattern analyses on character sets, digits, or combined input containing both characters and numeric digits. It is useful for data cleaning and for identifying columns containing multiple or nonstandard formats.

For illustration, this package comes with a simulated data set containing 1000 observations on three variables:

  1. Gender - Gender of subject.
  2. Date - Date of observation.
  3. Phone - Subject’s phone number.

The following snipped of code loads the package, sample data, and prints out the first six observations.

# Load the package
library(bpa)  # pipe operator %>% included for convenience

# Load the data
data(messy, package = "bpa")
head(messy)
##   Gender             Date         Phone
## 1 Female        16Aug2001   571 972 510
## 2   Male       2009-08-21   090 875 285
## 3      M       01/22/2004 (079)-200-150
## 4   male       09/02/2008 (129)-006-808
## 5   Male February 19 2006   016 651 851
## 6 Female        16Nov2015 (492)-455-108

Immediately we can see that all of the variables have mixed format. For example, in the Gender column, a male is represented as a Male, male, or M. This often happens when data are input manually, or are the result from merging multiple data sources. Of course we could also just print out the unique of a column, but a variable like Date or Phone in a large data base would likely have too many unique values for this approach to be useful. Instead, the bpa package tries to standardize each column in a way that produces the least amount of unique value so that issues like this become more apparent.

Basic usage

The core function in bpa is get_pattern. This function is used to extract patterns from a vector of data. This function will transform the values so that all numeric digits are represented by the character 9. Similarly, all lowercase and uppercase letters are represented by the characters a and A, respectively. Everything else (e.g., special characters like -, @, #, etc.) remains the same. The only exception are whitespace characters which are represented as w if show_ws = TRUE (by default). This can be changed to any character string using the ws_char option. For example,

get_pattern("(123)  456-789")
## [1] "(999)ww999-999"
get_pattern("(123)  456-789", show_ws = FALSE)  # same as ws_char = " "
## [1] "(999)  999-999"
get_pattern("(123)  456-789", ws_char = "<ws>")
## [1] "(999)<ws><ws>999-999"

Getting back to the example data, consider the Date column. We can easily extract the unique patterns and their corresponding frequencies:

messy$Date %>%
  get_pattern %>%  # extract patterns
  table %>%        # tabulate frequencies
  as.data.frame    # display as a data frame
##                    . Freq
## 1         99/99/9999  259
## 2         9999-99-99  262
## 3          99Aaa9999  241
## 4  Aaaaaaaaaw99w9999   19
## 5   Aaaaaaaaw99w9999   56
## 6    Aaaaaaaw99w9999   45
## 7     Aaaaaaw99w9999   24
## 8      Aaaaaw99w9999   36
## 9       Aaaaw99w9999   42
## 10       Aaaw99w9999   16

It appears as though the Date column contains four different date formats – which is a problem since R requires dates to have a standard unambiguous format. Perhaps the data were entered in by different people, or the data are the result of a merge from multiple sources? Nonetheless, now that we have identified a problem, we can easily correct it by converting to a single standard date format. On the other hand, if we just looked at the unique values of Date without first standardizing the data, it would have been more difficult to identify all of the formatting problems.

messy$Date %>%
  unique %>%    # extract unique values
  head(50)      # look at first 50 observations
##  [1] "16Aug2001"         "2009-08-21"        "01/22/2004"       
##  [4] "09/02/2008"        "February 19 2006"  "16Nov2015"        
##  [7] "2004-07-12"        "24Sep2013"         "November 01 2007" 
## [10] "2007-08-15"        "August 22 2009"    "2003-05-29"       
## [13] "2011-04-02"        "October 29 2004"   "2004-09-13"       
## [16] "25Nov2009"         "02/12/2010"        "23Jan2008"        
## [19] "06/30/2006"        "2009-07-24"        "December 02 2013" 
## [22] "14Jun2014"         "2009-04-19"        "09/03/2009"       
## [25] "November 25 2007"  "18Jan2015"         "2009-03-05"       
## [28] "10/24/2006"        "09/09/2008"        "September 08 2012"
## [31] "12/31/2011"        "11/12/2009"        "09/23/2002"       
## [34] "10/09/2002"        "2004-12-11"        "2001-08-23"       
## [37] "02/09/2004"        "November 02 2010"  "03/14/2007"       
## [40] "24Dec2004"         "March 10 2004"     "November 05 2012" 
## [43] "2010-02-11"        "12/25/2009"        "October 25 2007"  
## [46] "2011-07-22"        "April 01 2014"     "18Mar2009"        
## [49] "06Feb2015"         "2015-12-10"

Standardizing the data via basic pattern analysis provides a much cleaner representation of the data that is often more useful during the pre-processing step.

The function basic_pattern_analysis applies get_pattern to each column of a data frame and, by default, returns a "dataframe" object of the same size. This function is especially useful when working with big, messy, and unfamiliar data sets. The following snippet of code exemplifies this by highlighting potential issues in the entire messy data set.

# Standardize the entire data set (returns a data frame)
messy %>%
  basic_pattern_analysis %>%  # note: you can also use bpa for short
  head(10)                    # only look at first 10 observations
##    Gender             Date         Phone
## 1  Aaaaaa        99Aaa9999   999w999w999
## 2    Aaaa       9999-99-99   999w999w999
## 3       A       99/99/9999 (999)-999-999
## 4    aaaa       99/99/9999 (999)-999-999
## 5    Aaaa Aaaaaaaaw99w9999   999w999w999
## 6  Aaaaaa        99Aaa9999 (999)-999-999
## 7  Aaaaaa       9999-99-99 (999)w999w999
## 8  Aaaaaa        99Aaa9999   999-999-999
## 9    aaaa Aaaaaaaaw99w9999   999w999w999
## 10 Aaaaaa       9999-99-99   999-999-999

Also, to save typing, we have included bpa as an alias for basic_pattern_analysis.

With lots of data, it will often be more useful to view a list containing only the unique patterns for each column of a data frame. This can be accomplished by setting unique_only = TRUE.

# Only return unique patterns (returns a list)
bpa(messy, unique_only = TRUE)
## $Gender
## 
##      A   Aaaa Aaaaaa   aaaa aaaaaa 
##    189    312    291    110     98 
## 
## $Date
## 
##        99/99/9999        9999-99-99         99Aaa9999 Aaaaaaaaaw99w9999 
##               259               262               241                19 
##  Aaaaaaaaw99w9999   Aaaaaaaw99w9999    Aaaaaaw99w9999     Aaaaaw99w9999 
##                56                45                24                36 
##      Aaaaw99w9999       Aaaw99w9999 
##                42                16 
## 
## $Phone
## 
## (999)-999-999 (999)w999w999   999-999-999   999w999w999 
##           242           250           276           232

Finally, we have included the function match_pattern for extracting the values of a vector that match a specified pattern. For example, the following code chunk will extract the unique values of Gender that match the standardized pattern Aaaa.

# Extract Gender values matching the pattern "Aaaa"
match_pattern(messy$Gender, pattern = "Aaaa", unique_only = TRUE)
## [1] "Male"