Basic pattern analysis, as implemented in the R package
bpa
, is a data pre-processing tool and is designed to help
reduce the time spent doing various pre-processing tasks. It takes
inspiration from some of the functionality of SAS/DataFlux
Data Management Studio. More specifically, the functions in
bpa
help standardize the data so that multiple formatting
issues, typos, and other unexpected outcomes can more easily be
identified in unfamiliar and/or large amounts of data.
bpa
package?Package bpa
, through the function
basic_pattern_analysis
, allows the analyst to run basic
pattern analyses on character sets, digits, or combined input containing
both characters and numeric digits. It is useful for data cleaning and
for identifying columns containing multiple or nonstandard formats.
For illustration, this package comes with a simulated data set containing 1000 observations on three variables:
Gender
- Gender of subject.Date
- Date of observation.Phone
- Subject’s phone number.The following snipped of code loads the package, sample data, and prints out the first six observations.
# Load the package
library(bpa) # pipe operator %>% included for convenience
# Load the data
data(messy, package = "bpa")
head(messy)
## Gender Date Phone
## 1 Female 16Aug2001 571 972 510
## 2 Male 2009-08-21 090 875 285
## 3 M 01/22/2004 (079)-200-150
## 4 male 09/02/2008 (129)-006-808
## 5 Male February 19 2006 016 651 851
## 6 Female 16Nov2015 (492)-455-108
Immediately we can see that all of the variables have mixed format.
For example, in the Gender
column, a male is represented as
a Male
, male
, or M
. This often
happens when data are input manually, or are the result from merging
multiple data sources. Of course we could also just print out the unique
of a column, but a variable like Date
or Phone
in a large data base would likely have too many unique values for this
approach to be useful. Instead, the bpa
package tries to
standardize each column in a way that produces the least amount of
unique value so that issues like this become more apparent.
The core function in bpa
is get_pattern
.
This function is used to extract patterns from a vector of data. This
function will transform the values so that all numeric digits are
represented by the character 9
. Similarly, all lowercase
and uppercase letters are represented by the characters a
and A
, respectively. Everything else (e.g., special
characters like -, @, #, etc.) remains the same. The only exception are
whitespace characters which are represented as w
if
show_ws = TRUE
(by default). This can be changed to any
character string using the ws_char
option. For example,
## [1] "(999)ww999-999"
## [1] "(999) 999-999"
## [1] "(999)<ws><ws>999-999"
Getting back to the example data, consider the Date
column. We can easily extract the unique patterns and their
corresponding frequencies:
messy$Date %>%
get_pattern %>% # extract patterns
table %>% # tabulate frequencies
as.data.frame # display as a data frame
## . Freq
## 1 99/99/9999 259
## 2 9999-99-99 262
## 3 99Aaa9999 241
## 4 Aaaaaaaaaw99w9999 19
## 5 Aaaaaaaaw99w9999 56
## 6 Aaaaaaaw99w9999 45
## 7 Aaaaaaw99w9999 24
## 8 Aaaaaw99w9999 36
## 9 Aaaaw99w9999 42
## 10 Aaaw99w9999 16
It appears as though the Date
column contains four
different date formats – which is a problem since R
requires dates to have a standard unambiguous format. Perhaps the data
were entered in by different people, or the data are the result of a
merge from multiple sources? Nonetheless, now that we have identified a
problem, we can easily correct it by converting to a single standard
date format. On the other hand, if we just looked at the unique values
of Date
without first standardizing the data, it would have
been more difficult to identify all of the formatting problems.
## [1] "16Aug2001" "2009-08-21" "01/22/2004"
## [4] "09/02/2008" "February 19 2006" "16Nov2015"
## [7] "2004-07-12" "24Sep2013" "November 01 2007"
## [10] "2007-08-15" "August 22 2009" "2003-05-29"
## [13] "2011-04-02" "October 29 2004" "2004-09-13"
## [16] "25Nov2009" "02/12/2010" "23Jan2008"
## [19] "06/30/2006" "2009-07-24" "December 02 2013"
## [22] "14Jun2014" "2009-04-19" "09/03/2009"
## [25] "November 25 2007" "18Jan2015" "2009-03-05"
## [28] "10/24/2006" "09/09/2008" "September 08 2012"
## [31] "12/31/2011" "11/12/2009" "09/23/2002"
## [34] "10/09/2002" "2004-12-11" "2001-08-23"
## [37] "02/09/2004" "November 02 2010" "03/14/2007"
## [40] "24Dec2004" "March 10 2004" "November 05 2012"
## [43] "2010-02-11" "12/25/2009" "October 25 2007"
## [46] "2011-07-22" "April 01 2014" "18Mar2009"
## [49] "06Feb2015" "2015-12-10"
Standardizing the data via basic pattern analysis provides a much cleaner representation of the data that is often more useful during the pre-processing step.
The function basic_pattern_analysis
applies
get_pattern
to each column of a data frame and, by default,
returns a "dataframe"
object of the same size. This
function is especially useful when working with big, messy, and
unfamiliar data sets. The following snippet of code exemplifies this by
highlighting potential issues in the entire messy
data
set.
# Standardize the entire data set (returns a data frame)
messy %>%
basic_pattern_analysis %>% # note: you can also use bpa for short
head(10) # only look at first 10 observations
## Gender Date Phone
## 1 Aaaaaa 99Aaa9999 999w999w999
## 2 Aaaa 9999-99-99 999w999w999
## 3 A 99/99/9999 (999)-999-999
## 4 aaaa 99/99/9999 (999)-999-999
## 5 Aaaa Aaaaaaaaw99w9999 999w999w999
## 6 Aaaaaa 99Aaa9999 (999)-999-999
## 7 Aaaaaa 9999-99-99 (999)w999w999
## 8 Aaaaaa 99Aaa9999 999-999-999
## 9 aaaa Aaaaaaaaw99w9999 999w999w999
## 10 Aaaaaa 9999-99-99 999-999-999
Also, to save typing, we have included bpa
as an alias
for basic_pattern_analysis
.
With lots of data, it will often be more useful to view a list
containing only the unique patterns for each column of a data frame.
This can be accomplished by setting unique_only = TRUE
.
## $Gender
##
## A Aaaa Aaaaaa aaaa aaaaaa
## 189 312 291 110 98
##
## $Date
##
## 99/99/9999 9999-99-99 99Aaa9999 Aaaaaaaaaw99w9999
## 259 262 241 19
## Aaaaaaaaw99w9999 Aaaaaaaw99w9999 Aaaaaaw99w9999 Aaaaaw99w9999
## 56 45 24 36
## Aaaaw99w9999 Aaaw99w9999
## 42 16
##
## $Phone
##
## (999)-999-999 (999)w999w999 999-999-999 999w999w999
## 242 250 276 232
Finally, we have included the function match_pattern
for
extracting the values of a vector that match a specified pattern. For
example, the following code chunk will extract the unique values of
Gender
that match the standardized pattern
Aaaa
.
# Extract Gender values matching the pattern "Aaaa"
match_pattern(messy$Gender, pattern = "Aaaa", unique_only = TRUE)
## [1] "Male"