Data Wrangling with {janitor}

The janitor package in the R programming language is a popular and powerful tool for data cleaning and manipulation. It offers a variety of functions that help to clean up messy data sets quickly and easily and turn them into a more organized format. The package includes functions that remove duplicate data, handle missing values, and format column names. Additionally, it provides features to handle whitespace, convert data types, and generate summary statistics for data sets. Overall, the janitor package is an essential resource for anyone working with data in R, as it can significantly save time and effort in data cleaning and preparation. The janitor package is designed to be user-friendly and is optimized for those who are new to R. It includes simple functions for examining and cleaning up messy data, and the package can help advanced R users to perform these tasks more quickly and efficiently. The main functions of the janitor package include formatting data frame column names, creating and formatting frequency tables of one, two, or three variables, and providing other tools for cleaning and examining data frames.

Some Important functions

Here are some common functionalists provided by the {janitor} package:

Cleaning Functions

  1. clean_names(): Renames columns by converting them to snake_case or lower_case and removing special characters.

  2. remove_empty(): Removes rows and columns that are entirely empty.

  3. remove_constant(): Removes columns that have constant values throughout.

Data Frame Manipulation:

  1. get_dupes(): Finds duplicate rows in a data frame.

  2. tabyl(): Generates frequency tables (similar to table() but returns a data frame).

  3. adorn_totals(): Adds total rows or columns to a data frame.

Column Operations:

  1. add_columns(): Adds new columns to a data frame.

  2. recode_factor(): Modifies levels of a factor variable.

Missing Values Handling:

  1. remove_missing(): Removes rows or columns containing missing values.

  2. replace_na(): Replaces missing values with specified values.

Factor Variables:

  1. factorize(): Converts columns to factor variables.

Other Useful Functions:

  1. row_to_names(): Converts a row to column names.

  2. crossing(): Creates all combinations of rows from multiple data frames.

Using these functions, you can perform various data cleaning and transformation tasks

Check and Install Required R Packages

Code
packages <- c(
          'tidyverse',
          'janitor'
) 
#| warning: false
#| error: false

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

Verify Installation

Code
# Verify installation
cat("Installed packages:\n")
Installed packages:
Code
print(sapply(packages, requireNamespace, quietly = TRUE))
tidyverse   janitor 
     TRUE      TRUE 

Load Packages

Code
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

Check Loaded Packages

Code
# Check loaded packages
cat("Successfully loaded packages:\n")
Successfully loaded packages:
Code
print(search()[grepl("package:", search())])
 [1] "package:janitor"   "package:lubridate" "package:forcats"  
 [4] "package:stringr"   "package:dplyr"     "package:purrr"    
 [7] "package:readr"     "package:tidyr"     "package:tibble"   
[10] "package:ggplot2"   "package:tidyverse" "package:stats"    
[13] "package:graphics"  "package:grDevices" "package:utils"    
[16] "package:datasets"  "package:methods"   "package:base"     

Data Wrangling

We will create some “bad” data and clean them with janitor. We will apply following functions:

  • clean_names()

  • remove_empty()

  • trim_ws()

  • get_dupes()

  • remove_constant()

clean_names()

The clean_names() function is used to clean column names in a data frame. It converts the column names to lowercase and replaces all spaces and special characters with underscores.

Code
# Create a data frame with messy column names
df <- data.frame("Column One" = 1:5, 
                 "Column Two!!" = 6:10, 
                 "Column Three $" = 11:15,
                 "%Column four" = 11:15)
head(df)
  Column.One Column.Two.. Column.Three.. X.Column.four
1          1            6             11            11
2          2            7             12            12
3          3            8             13            13
4          4            9             14            14
5          5           10             15            15
Code
df  |> 
  janitor::clean_names() |> 
  glimpse()
Rows: 5
Columns: 4
$ column_one    <int> 1, 2, 3, 4, 5
$ column_two    <int> 6, 7, 8, 9, 10
$ column_three  <int> 11, 12, 13, 14, 15
$ x_column_four <int> 11, 12, 13, 14, 15

remove_empty()

The remove_empty() function is used to remove rows or columns that contain only missing or empty values.

Code
# Create a data frame with empty rows and columns
df <-  data.frame(x = c(1,NA,4),
                    y = c(NA,NA,3),
                    z = c(NA, NA, NA))

head(df)
   x  y  z
1  1 NA NA
2 NA NA NA
3  4  3 NA
Code
df %>% 
  janitor::remove_empty(c("rows","cols")) |> 
  glimpse()
Rows: 2
Columns: 2
$ x <dbl> 1, 4
$ y <dbl> NA, 3

get_dupes()

The get_dupes() function is used to find duplicate rows in a data frame.

Code
# Create a dataframe with duplicates rows
df <- data.frame("Column One" = c(1, 2, 3, 1), "Column Two" = c("A", "B", "C", "A"))
head(df)
  Column.One Column.Two
1          1          A
2          2          B
3          3          C
4          1          A
Code
get_dupes(df)
No variable names specified - using all columns.
  Column.One Column.Two dupe_count
1          1          A          2
2          1          A          2

Cleaning a bad data

Now, we will clean up messy data using some functions of janitor packages. We will use Lung Cancer Mortality data from the USA. All data set use in this exercise can be downloaded from here.

Code
mf = read_csv("https://raw.githubusercontent.com/zia207/Data/main/CSV_files/USA_LBC_Data_raw.csv")
Code
glimpse(df)
Rows: 4
Columns: 2
$ Column.One <dbl> 1, 2, 3, 1
$ Column.Two <chr> "A", "B", "C", "A"

You may have received data files that contain some text at the top of the spreadsheet before the actual data begins. In this data-frame, the column headings briefly describe the data. However, we want the first row to be the column heading. To achieve this, we will use the row_to_names() function. This function requires the following arguments: the data source, the row number from which the column names should come, whether that row should be deleted from the data, and whether the rows above it should be deleted from the data.

Code
mf.01 = mf |> 
  janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE)  |> 
  glimpse()
Rows: 3,117
Columns: 26
$ REGION_ID            <chr> "3", "3", "3", "3", NA, NA, "3", "3", "3", "3", "…
$ STATE                <chr> "Alabama", "Alabama", "Alabama", "Alabama", NA, N…
$ County               <chr> "Baldwin County", "Butler County", "Butler County…
$ `Empty Column 1`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ X                    <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y                    <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips                 <chr> "1003", "1013", "1013", "1017", NA, NA, "1023", "…
$ `Empty_Column 2`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", NA, NA, "31.8", "…
$ Smoking              <chr> "20.8", "26", "26", "25.1", NA, NA, "21.8", "22.6…
$ `PM  25`             <chr> "7.89", "8.46", "8.46", "8.87", NA, NA, "8.58", "…
$ NO2                  <chr> "0.7939", "0.6344", "0.6344", "0.8442", NA, NA, "…
$ SO2                  <chr> "0.035343", "0.0135", "0.0135", "0.048177", NA, N…
$ Ozone                <chr> "39.79", "38.31", "38.31", "40.1", NA, NA, "37.07…
$ `Pop 65`             <chr> "19.5", "19", "19", "18.9", NA, NA, "22.1", "19",…
$ `Pop Black`          <chr> "9.24", "43.94", "43.94", "39.24", NA, NA, "41.94…
$ `Pop Hipanic`        <chr> "4.54", "1.26", "1.26", "2.14", NA, NA, "0.86", "…
$ `Pop White`          <chr> "83.06", "52.64", "52.64", "56.42", NA, NA, "56.2…
$ Education            <chr> "66", "38", "38", "47", NA, NA, "55", "39", "60",…
$ `Poverty %`          <chr> "13.14", "26.14", "26.14", "21.52", NA, NA, "23.0…
$ `Income Equality`    <chr> "4.5", "5.1", "5.1", "4.7", NA, NA, "5.8", "8.2",…
$ Uninsured            <chr> "13.34", "12.74", "12.74", "13.34", NA, NA, "12.8…
$ DEM                  <chr> "36.78", "111.70", "111.70", "227.03", NA, NA, "6…
$ `Radon Zone Class`   <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", NA, NA, "…
$ `Urban Rural`        <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production`    <chr> "No", "No", "No", "No", NA, NA, "No", "No", "No",…

Still data has some empty columns and and empty rows, we are going to remove these empty columns and rows using remove_empty() function:

Code
mf.02 = mf.01 |> 
  janitor::remove_empty()  |> 
  glimpse()
Rows: 3,110
Columns: 24
$ REGION_ID            <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",…
$ STATE                <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ County               <chr> "Baldwin County", "Butler County", "Butler County…
$ X                    <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y                    <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips                 <chr> "1003", "1013", "1013", "1017", "1023", "1025", "…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53…
$ Smoking              <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2…
$ `PM  25`             <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "…
$ NO2                  <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934",…
$ SO2                  <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.02…
$ Ozone                <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.6…
$ `Pop 65`             <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3",…
$ `Pop Black`          <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.9…
$ `Pop Hipanic`        <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "…
$ `Pop White`          <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.…
$ Education            <chr> "66", "38", "38", "47", "55", "39", "60", "35", "…
$ `Poverty %`          <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.…
$ `Income Equality`    <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", …
$ Uninsured            <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.…
$ DEM                  <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "…
$ `Radon Zone Class`   <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3",…
$ `Urban Rural`        <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production`    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…

Now, we are going fix column headings using clean_names(). It converts the column names to lowercase and replaces all spaces and special characters with underscores.

Code
mf.03 = mf.02  |> 
  janitor::clean_names()  |> 
  glimpse()
Rows: 3,110
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Butler County",…
$ x                  <chr> "789777.5039", "877731.5725", "877731.5725", "98421…
$ y                  <chr> "884557.0795", "1007285.71", "1007285.71", "1148648…
$ fips               <chr> "1003", "1013", "1013", "1017", "1023", "1025", "10…
$ lcb_mortality_rate <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53.7…
$ smoking            <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2",…
$ pm_25              <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "8.…
$ no2                <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934", "…
$ so2                <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.0239…
$ ozone              <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.68"…
$ pop_65             <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3", "…
$ pop_black          <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.96"…
$ pop_hipanic        <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "6.…
$ pop_white          <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.98…
$ education          <chr> "66", "38", "38", "47", "55", "39", "60", "35", "53…
$ poverty_percent    <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.6"…
$ income_equality    <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", "4…
$ uninsured          <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.28…
$ dem                <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "69…
$ radon_zone_class   <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3", "…
$ urban_rural        <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "Nonm…
$ coal_production    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…

All data are exported in R as chr. We are going to convert column from 4 to 21as.numeric and 22 to 23 as.factor. We will use dplyr::mutate_at() function:

Code
mf.04= mf.03  |> 
     dplyr::mutate_at(4:21, as.numeric)  |> 
     dplyr::mutate_at(22:24, as.factor)  |> 
     glimpse()
Rows: 3,110
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Butler County",…
$ x                  <dbl> 789777.5, 877731.6, 877731.6, 984214.7, 726606.5, 7…
$ y                  <dbl> 884557.1, 1007285.7, 1007285.7, 1148648.7, 1023615.…
$ fips               <dbl> 1003, 1013, 1013, 1017, 1023, 1025, 1031, 1035, 103…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.…
$ smoking            <dbl> 20.8, 26.0, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.…
$ pm_25              <dbl> 7.89, 8.46, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.2…
$ no2                <dbl> 0.7939, 0.6344, 0.6344, 0.8442, 0.5934, 0.6432, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.013500, 0.048177, 0.023989, 0…
$ ozone              <dbl> 39.79, 38.31, 38.31, 40.10, 37.07, 37.68, 38.46, 37…
$ pop_65             <dbl> 19.5, 19.0, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.…
$ pop_black          <dbl> 9.24, 43.94, 43.94, 39.24, 41.94, 43.96, 17.26, 45.…
$ pop_hipanic        <dbl> 4.54, 1.26, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.6…
$ pop_white          <dbl> 83.06, 52.64, 52.64, 56.42, 56.28, 52.98, 70.90, 50…
$ education          <dbl> 66, 38, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 38,…
$ poverty_percent    <dbl> 13.14, 26.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29…
$ income_equality    <dbl> 4.5, 5.1, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5…
$ uninsured          <dbl> 13.34, 12.74, 12.74, 13.34, 12.86, 13.28, 13.16, 15…
$ dem                <dbl> 36.78, 111.70, 111.70, 227.03, 68.24, 69.29, 99.32,…
$ radon_zone_class   <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural        <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…

Now will check the duplicates record in the this data:

Code
mf.04 |> janitor::get_dupes(fips)
# A tibble: 6 × 25
   fips dupe_count region_id state    county         x      y lcb_mortality_rate
  <dbl>      <int> <chr>     <chr>    <chr>      <dbl>  <dbl>              <dbl>
1  1013          2 3         Alabama  Butler C… 8.78e5 1.01e6               38.3
2  1013          2 3         Alabama  Butler C… 8.78e5 1.01e6               38.3
3  1053          2 3         Alabama  Escambia… 8.39e5 9.34e5               58.3
4  1053          2 3         Alabama  Escambia… 8.39e5 9.34e5               58.3
5  5011          2 3         Arkansas Bradley … 3.54e5 1.16e6               69.9
6  5011          2 3         Arkansas Bradley … 3.54e5 1.16e6               69.9
# ℹ 17 more variables: smoking <dbl>, pm_25 <dbl>, no2 <dbl>, so2 <dbl>,
#   ozone <dbl>, pop_65 <dbl>, pop_black <dbl>, pop_hipanic <dbl>,
#   pop_white <dbl>, education <dbl>, poverty_percent <dbl>,
#   income_equality <dbl>, uninsured <dbl>, dem <dbl>, radon_zone_class <fct>,
#   urban_rural <fct>, coal_production <fct>

As shown above, the data frame is filtered down to those rows with duplicate values in the Fips column. For removing these duplicate rows, we have to use dplyr::distinct(.keep_all = TRUE)

Now will check the duplicates record in the this dat with get_dupes() function:

Code
mf.05= mf.04 |>   
     dplyr::distinct(fips,.keep_all = TRUE) |> 
     janitor::get_dupes(fips) |> 
     glimpse()
Rows: 0
Columns: 25
$ fips               <dbl> 
$ dupe_count         <int> 
$ region_id          <chr> 
$ state              <chr> 
$ county             <chr> 
$ x                  <dbl> 
$ y                  <dbl> 
$ lcb_mortality_rate <dbl> 
$ smoking            <dbl> 
$ pm_25              <dbl> 
$ no2                <dbl> 
$ so2                <dbl> 
$ ozone              <dbl> 
$ pop_65             <dbl> 
$ pop_black          <dbl> 
$ pop_hipanic        <dbl> 
$ pop_white          <dbl> 
$ education          <dbl> 
$ poverty_percent    <dbl> 
$ income_equality    <dbl> 
$ uninsured          <dbl> 
$ dem                <dbl> 
$ radon_zone_class   <fct> 
$ urban_rural        <fct> 
$ coal_production    <fct> 

Now we run all above function with Pipe ( |> ):

Code
mf_clean = mf |> 
  janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE)  |> 
  janitor::remove_empty()  |> 
  janitor::clean_names()  |> 
  dplyr::mutate_at(4:21, as.numeric)  |> 
  dplyr::mutate_at(22:24, as.factor)  |> 
  dplyr::distinct(fips,.keep_all = TRUE)  |> 
     glimpse()
Rows: 3,107
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Chambers County…
$ x                  <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y                  <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips               <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking            <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25              <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2                <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone              <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65             <dbl> 19.5, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.5, 18.…
$ pop_black          <dbl> 9.24, 43.94, 39.24, 41.94, 43.96, 17.26, 45.94, 12.…
$ pop_hipanic        <dbl> 4.54, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.62, 1.8…
$ pop_white          <dbl> 83.06, 52.64, 56.42, 56.28, 52.98, 70.90, 50.56, 83…
$ education          <dbl> 66, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 45, 58,…
$ poverty_percent    <dbl> 13.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29.76, 20…
$ income_equality    <dbl> 4.5, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5.2, 5…
$ uninsured          <dbl> 13.34, 12.74, 13.34, 12.86, 13.28, 13.16, 15.16, 13…
$ dem                <dbl> 36.78, 111.70, 227.03, 68.24, 69.29, 99.32, 96.03, …
$ radon_zone_class   <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural        <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…

Summary and Conclusion

This guide explains cleaning up datasets using the R package janitor. It emphasizes the importance of data cleanliness and offers an easy-to-use approach to data wrangling. The janitor package provides functions that simplify everyday data-cleaning tasks. It can handle missing data and convert data types, making it a valuable tool for preparing datasets for analysis. Remember, data cleaning is essential, and the janitor simplifies this task. With the skills gained in this guide, you can efficiently clean and tidy datasets for more meaningful analyses.

References

  1. Overview of janitor functions

  2. Cleaning and Exploring Data with the “janitor” Package