Comparing Data
Techniques and tools to compare data in R
By Chi Kit Yeung in R Data Analysis Notes
August 5, 2022
Operators
Custom ‘NOT IN’ Operator
R has a built-in %in% operator that tests whether each value appears in another vector, similar to SQL's IN operator. However, it has no built-in negated version like SQL's NOT IN operator, which is useful in some cases.
# Defining the operator
`%!in%` <- Negate(`%in%`)
Example:
fav_fruits <- c('apple', 'oranges', 'bananas', 'grape')
shopping_list <- c('bananas', 'persimmon', 'peach', 'apple', 'custard apple')
# Normal `%in%` operator use
fav_fruits[fav_fruits %in% shopping_list]
## [1] "apple" "bananas"
Above, we can see that two of our favorite fruits are on the shopping list.
Next, using the custom-defined %!in% operator:
# Favorite fruits not being bought
fav_fruits[fav_fruits %!in% shopping_list]
## [1] "oranges" "grape"
# Not so favorite fruits being bought :(
shopping_list[shopping_list %!in% fav_fruits]
## [1] "persimmon" "peach" "custard apple"
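As a quick sanity check, the custom operator should give the same result as negating %in% directly. The sketch below redefines the vectors so it runs on its own; the names not_bought and not_bought_direct are only illustrative and not part of the original example.

```r
# Defined the same way as earlier in the post
`%!in%` <- Negate(`%in%`)

fav_fruits <- c('apple', 'oranges', 'bananas', 'grape')
shopping_list <- c('bananas', 'persimmon', 'peach', 'apple', 'custard apple')

# Using the custom operator
not_bought <- fav_fruits[fav_fruits %!in% shopping_list]

# Negating %in% directly; the parentheses just make the intent explicit
not_bought_direct <- fav_fruits[!(fav_fruits %in% shopping_list)]

identical(not_bought, not_bought_direct)
## [1] TRUE

# %in% never returns NA, so a missing value is simply reported as "not in"
NA %!in% shopping_list
## [1] TRUE
```

Either form works; the custom operator mainly helps readability when the condition sits inside a longer expression.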
Intersect and SetDiff
intersect(fav_fruits, shopping_list)
## [1] "apple" "bananas"
setdiff(fav_fruits, shopping_list)
## [1] "oranges" "grape"
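For these vectors, intersect() and setdiff() return the same values as the %in% subsetting shown earlier. One difference worth knowing is that the set functions drop duplicate values, while subsetting with %in% keeps them. A minimal sketch, using a made-up basket vector that is not part of the original example:

```r
# Hypothetical vector with a repeated item, for illustration only
basket <- c('apple', 'apple', 'peach', 'grape')
fav_fruits <- c('apple', 'oranges', 'bananas', 'grape')

# Subsetting with %in% keeps every matching element, including repeats
basket[basket %in% fav_fruits]
## [1] "apple" "apple" "grape"

# intersect() treats its inputs as sets, so each value appears once
intersect(basket, fav_fruits)
## [1] "apple" "grape"

# setdiff() depends on argument order: values in the first vector but not the second
setdiff(basket, fav_fruits)
## [1] "peach"
setdiff(fav_fruits, basket)
## [1] "oranges" "bananas"
```

If repeated values carry meaning (for example, quantities), the %in% form is probably the safer choice.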
Uncategorized
Regex
Regular expressions can be matched against a character vector using the str_detect() function from the stringr package, which is loaded as part of the tidyverse.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
str_detect(fav_fruits, "a..le")
## [1] TRUE FALSE FALSE FALSE
# Matching
fav_fruits[str_detect(fav_fruits, "a..le")]
## [1] "apple"
# Not matching
fav_fruits[!str_detect(fav_fruits, "a..le")]
## [1] "oranges" "bananas" "grape"
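When loading the tidyverse is not an option, base R's grepl() offers similar matching. Note also that an unanchored pattern such as a..le matches anywhere in the string, so it would also match custard apple from the shopping list; the anchors ^ and $ restrict the match to the whole string. A sketch under those assumptions, with a made-up fruits vector:

```r
# Illustrative vector, not from the original example
fruits <- c('apple', 'oranges', 'custard apple', 'grape')

# grepl() takes the pattern first, then the vector (the reverse of str_detect())
grepl("a..le", fruits)
## [1]  TRUE FALSE  TRUE FALSE

# Anchoring with ^ and $ requires the whole string to match
grepl("^a..le$", fruits)
## [1]  TRUE FALSE FALSE FALSE

# Negate with ! to keep only the non-matching elements
non_matching <- fruits[!grepl("^a..le$", fruits)]
non_matching
```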
Unique
Get the unique values from a vector using the unique() function.
# A lot of dupes here, I just want the unique values!
locales <- c("ar_SA", "ar_SA", "de_DE", "de_DE", "de_DE", "en_AU", "es_ES", "es_ES", "es_ES", "es_ES", "es_MX", "es_MX", "es_MX", "es_MX", "es_US", "es_US", "es_US", "es_US", "es_US", "es_US", "fr_FR", "he_IL", "he_IL", "it_IT", "it_IT", "it_IT", "nb_NO", "nb_NO", "ru_RU", "ru_RU", "ru_RU", "sv_SE", "sv_SE", "sv_SE", "tr_TR", "tr_TR", "tr_TR")
locales
## [1] "ar_SA" "ar_SA" "de_DE" "de_DE" "de_DE" "en_AU" "es_ES" "es_ES" "es_ES"
## [10] "es_ES" "es_MX" "es_MX" "es_MX" "es_MX" "es_US" "es_US" "es_US" "es_US"
## [19] "es_US" "es_US" "fr_FR" "he_IL" "he_IL" "it_IT" "it_IT" "it_IT" "nb_NO"
## [28] "nb_NO" "ru_RU" "ru_RU" "ru_RU" "sv_SE" "sv_SE" "sv_SE" "tr_TR" "tr_TR"
## [37] "tr_TR"
unique(locales)
## [1] "ar_SA" "de_DE" "en_AU" "es_ES" "es_MX" "es_US" "fr_FR" "he_IL" "it_IT"
## [10] "nb_NO" "ru_RU" "sv_SE" "tr_TR"
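Related base R helpers can answer follow-up questions about the duplicates themselves: duplicated() flags repeated values, and table() counts how often each value occurs. A small sketch using a shortened, made-up sample rather than the full locales vector:

```r
# Shortened sample for illustration only
locales <- c("ar_SA", "ar_SA", "de_DE", "en_AU", "de_DE")

# duplicated() marks every occurrence after the first
duplicated(locales)
## [1] FALSE  TRUE FALSE FALSE  TRUE

# Dropping the flagged elements gives the same result as unique(locales)
locales[!duplicated(locales)]
## [1] "ar_SA" "de_DE" "en_AU"

# Number of distinct values
length(unique(locales))
## [1] 3

# table() returns a count for each distinct value
locale_counts <- table(locales)
locale_counts
```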