Data Wrangling with Janitor

The R package janitor provides a set of tools for cleaning and organizing data in R. The package is designed to help make data cleaning tasks easier and more efficient, with functions that handle common data cleaning tasks.

Some of the functions provided by the janitor package include:

clean_names: This function cleans column names by removing special characters and converting them to lowercase.

remove_empty: This function removes rows or columns that are entirely empty from a data frame.

tabyl: This function creates frequency tables with ease.

get_dupes: This function identifies duplicate rows in a data frame.

factorize: This function converts columns in a data frame to factors.

The janitor package can be installed using the following command in R:

install.packages(“janitor”)

Load Packages

Code
library(janitor)

Some Important functions

We will create some “bad” data and clean them with janitor. We will apply following functions:

  • clean_names()
  • remove_empty()
  • trim_ws()
  • get_dupes()
  • remove_constant()

clean_names()

The clean_names() function is used to clean column names in a data frame. It converts the column names to lowercase and replaces all spaces and special characters with underscores.

Code
# Create a data frame with messy column names
df <- data.frame("Column One" = 1:5, 
                 "Column Two!!" = 6:10, 
                 "Column Three $" = 11:15,
                 "%Column four" = 11:15)
head(df)
  Column.One Column.Two.. Column.Three.. X.Column.four
1          1            6             11            11
2          2            7             12            12
3          3            8             13            13
4          4            9             14            14
5          5           10             15            15
Code
df %>%
  janitor::clean_names()
  column_one column_two column_three x_column_four
1          1          6           11            11
2          2          7           12            12
3          3          8           13            13
4          4          9           14            14
5          5         10           15            15

remove_empty()

The remove_empty() function is used to remove rows or columns that contain only missing or empty values.

Code
# Create a data frame with empty rows and columns
df <-  data.frame(x = c(1,NA,4),
                    y = c(NA,NA,3),
                    z = c(NA, NA, NA))

head(df)
   x  y  z
1  1 NA NA
2 NA NA NA
3  4  3 NA
Code
df %>% 
  janitor::remove_empty(c("rows","cols"))
  x  y
1 1 NA
3 4  3

get_dupes()

The get_dupes() function is used to find duplicate rows in a data frame.

Code
df <- data.frame("Column One" = c(1, 2, 3, 1), "Column Two" = c("A", "B", "C", "A"))
get_dupes(df)
  Column.One Column.Two dupe_count
1          1          A          2
2          1          A          2

Cleanigd a bad data

Now we will clean on very messy data using some functions of janitor packages. We will use Lung Cancer Mortality data.

Code
# Load tidyverse
library(tidyverse)
urlfile = "https://github.com//zia207/r-colab/raw/main/Data/USA/USA_LBC_Data.csv"
df<-read_csv(url(urlfile))
Code
glimpse(df)
Rows: 3,118
Columns: 26
$ `Lung Cancer Moratlity Rates and Risk in USA, Data Provider: Zia Ahmed` <chr> …
$ ...2                                                                    <chr> …
$ ...3                                                                    <chr> …
$ ...4                                                                    <chr> …
$ ...5                                                                    <chr> …
$ ...6                                                                    <chr> …
$ ...7                                                                    <chr> …
$ ...8                                                                    <chr> …
$ ...9                                                                    <chr> …
$ ...10                                                                   <chr> …
$ ...11                                                                   <chr> …
$ ...12                                                                   <chr> …
$ ...13                                                                   <chr> …
$ ...14                                                                   <chr> …
$ ...15                                                                   <chr> …
$ ...16                                                                   <chr> …
$ ...17                                                                   <chr> …
$ ...18                                                                   <chr> …
$ ...19                                                                   <chr> …
$ ...20                                                                   <chr> …
$ ...21                                                                   <chr> …
$ ...22                                                                   <chr> …
$ ...23                                                                   <chr> …
$ ...24                                                                   <chr> …
$ ...25                                                                   <chr> …
$ ...26                                                                   <chr> …

You’ve probably received plenty of data files like this, that have some text at the top of the spreadsheet before the actual data begins.

In this data-frame, the column heading describe briefly the data. But we want 1st row as column heading. So we apply we will apply row_to_names(). The row_to_names() function takes the following arguments: the data source, the row number that column names should come from, whether that row should be deleted from the data, and whether the rows above should be deleted from the data:

Code
df.01 = df %>% 
  janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE) %>%
  glimpse()
Rows: 3,117
Columns: 26
$ REGION_ID            <chr> "3", "3", "3", "3", NA, NA, "3", "3", "3", "3", "…
$ STATE                <chr> "Alabama", "Alabama", "Alabama", "Alabama", NA, N…
$ County               <chr> "Baldwin County", "Butler County", "Butler County…
$ `Empty Column 1`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ X                    <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y                    <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips                 <chr> "1003", "1013", "1013", "1017", NA, NA, "1023", "…
$ `Empty_Column 2`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", NA, NA, "31.8", "…
$ Smoking              <chr> "20.8", "26", "26", "25.1", NA, NA, "21.8", "22.6…
$ `PM  25`             <chr> "7.89", "8.46", "8.46", "8.87", NA, NA, "8.58", "…
$ NO2                  <chr> "0.7939", "0.6344", "0.6344", "0.8442", NA, NA, "…
$ SO2                  <chr> "0.035343", "0.0135", "0.0135", "0.048177", NA, N…
$ Ozone                <chr> "39.79", "38.31", "38.31", "40.1", NA, NA, "37.07…
$ `Pop 65`             <chr> "19.5", "19", "19", "18.9", NA, NA, "22.1", "19",…
$ `Pop Black`          <chr> "9.24", "43.94", "43.94", "39.24", NA, NA, "41.94…
$ `Pop Hipanic`        <chr> "4.54", "1.26", "1.26", "2.14", NA, NA, "0.86", "…
$ `Pop White`          <chr> "83.06", "52.64", "52.64", "56.42", NA, NA, "56.2…
$ Education            <chr> "66", "38", "38", "47", NA, NA, "55", "39", "60",…
$ `Poverty %`          <chr> "13.14", "26.14", "26.14", "21.52", NA, NA, "23.0…
$ `Income Equality`    <chr> "4.5", "5.1", "5.1", "4.7", NA, NA, "5.8", "8.2",…
$ Uninsured            <chr> "13.34", "12.74", "12.74", "13.34", NA, NA, "12.8…
$ DEM                  <chr> "36.78", "111.70", "111.70", "227.03", NA, NA, "6…
$ `Radon Zone Class`   <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", NA, NA, "…
$ `Urban Rural`        <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production`    <chr> "No", "No", "No", "No", NA, NA, "No", "No", "No",…

Still data has some empty columns and and empty rows, we are going to remove these empty columns and rows using remove_empty() function:

Code
df.02 = df.01 %>% 
  janitor::remove_empty() %>%
  glimpse()
Rows: 3,110
Columns: 24
$ REGION_ID            <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",…
$ STATE                <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ County               <chr> "Baldwin County", "Butler County", "Butler County…
$ X                    <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y                    <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips                 <chr> "1003", "1013", "1013", "1017", "1023", "1025", "…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53…
$ Smoking              <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2…
$ `PM  25`             <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "…
$ NO2                  <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934",…
$ SO2                  <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.02…
$ Ozone                <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.6…
$ `Pop 65`             <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3",…
$ `Pop Black`          <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.9…
$ `Pop Hipanic`        <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "…
$ `Pop White`          <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.…
$ Education            <chr> "66", "38", "38", "47", "55", "39", "60", "35", "…
$ `Poverty %`          <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.…
$ `Income Equality`    <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", …
$ Uninsured            <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.…
$ DEM                  <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "…
$ `Radon Zone Class`   <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3",…
$ `Urban Rural`        <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production`    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…

Now, we are going fix column headings using clean_names(). It converts the column names to lowercase and replaces all spaces and special characters with underscores.

Code
df.03 = df.02 %>%
  janitor::clean_names() %>%
  glimpse()
Rows: 3,110
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Butler County",…
$ x                  <chr> "789777.5039", "877731.5725", "877731.5725", "98421…
$ y                  <chr> "884557.0795", "1007285.71", "1007285.71", "1148648…
$ fips               <chr> "1003", "1013", "1013", "1017", "1023", "1025", "10…
$ lcb_mortality_rate <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53.7…
$ smoking            <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2",…
$ pm_25              <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "8.…
$ no2                <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934", "…
$ so2                <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.0239…
$ ozone              <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.68"…
$ pop_65             <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3", "…
$ pop_black          <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.96"…
$ pop_hipanic        <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "6.…
$ pop_white          <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.98…
$ education          <chr> "66", "38", "38", "47", "55", "39", "60", "35", "53…
$ poverty_percent    <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.6"…
$ income_equality    <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", "4…
$ uninsured          <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.28…
$ dem                <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "69…
$ radon_zone_class   <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3", "…
$ urban_rural        <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "Nonm…
$ coal_production    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…

All data are exported in R as chr. We are going to convert column from 4 to 21 as.numeric and 22 to 23 as.factor. We use dplyr::mutate_at() function:

Code
df.04= df.03 %>%
     dplyr::mutate_at(4:21, as.numeric) %>%
     dplyr::mutate_at(22:24, as.factor) %>%
     glimpse()
Rows: 3,110
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Butler County",…
$ x                  <dbl> 789777.5, 877731.6, 877731.6, 984214.7, 726606.5, 7…
$ y                  <dbl> 884557.1, 1007285.7, 1007285.7, 1148648.7, 1023615.…
$ fips               <dbl> 1003, 1013, 1013, 1017, 1023, 1025, 1031, 1035, 103…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.…
$ smoking            <dbl> 20.8, 26.0, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.…
$ pm_25              <dbl> 7.89, 8.46, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.2…
$ no2                <dbl> 0.7939, 0.6344, 0.6344, 0.8442, 0.5934, 0.6432, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.013500, 0.048177, 0.023989, 0…
$ ozone              <dbl> 39.79, 38.31, 38.31, 40.10, 37.07, 37.68, 38.46, 37…
$ pop_65             <dbl> 19.5, 19.0, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.…
$ pop_black          <dbl> 9.24, 43.94, 43.94, 39.24, 41.94, 43.96, 17.26, 45.…
$ pop_hipanic        <dbl> 4.54, 1.26, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.6…
$ pop_white          <dbl> 83.06, 52.64, 52.64, 56.42, 56.28, 52.98, 70.90, 50…
$ education          <dbl> 66, 38, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 38,…
$ poverty_percent    <dbl> 13.14, 26.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29…
$ income_equality    <dbl> 4.5, 5.1, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5…
$ uninsured          <dbl> 13.34, 12.74, 12.74, 13.34, 12.86, 13.28, 13.16, 15…
$ dem                <dbl> 36.78, 111.70, 111.70, 227.03, 68.24, 69.29, 99.32,…
$ radon_zone_class   <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural        <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…

Now will check the duplicates record in the this data:

Code
df.04 %>% janitor::get_dupes(fips)
# A tibble: 6 × 25
   fips dupe_count region_id state    county         x      y lcb_mortality_rate
  <dbl>      <int> <chr>     <chr>    <chr>      <dbl>  <dbl>              <dbl>
1  1013          2 3         Alabama  Butler C… 8.78e5 1.01e6               38.3
2  1013          2 3         Alabama  Butler C… 8.78e5 1.01e6               38.3
3  1053          2 3         Alabama  Escambia… 8.39e5 9.34e5               58.3
4  1053          2 3         Alabama  Escambia… 8.39e5 9.34e5               58.3
5  5011          2 3         Arkansas Bradley … 3.54e5 1.16e6               69.9
6  5011          2 3         Arkansas Bradley … 3.54e5 1.16e6               69.9
# ℹ 17 more variables: smoking <dbl>, pm_25 <dbl>, no2 <dbl>, so2 <dbl>,
#   ozone <dbl>, pop_65 <dbl>, pop_black <dbl>, pop_hipanic <dbl>,
#   pop_white <dbl>, education <dbl>, poverty_percent <dbl>,
#   income_equality <dbl>, uninsured <dbl>, dem <dbl>, radon_zone_class <fct>,
#   urban_rural <fct>, coal_production <fct>

As shown above, the data frame is filtered down to those rows with duplicate values in the Fips column. For removing these duplicate rows, we have to use dplyr::distinct(.keep_all = TRUE)

Now will check the duplicates record in the this data:

Code
df.05= df.04 %>% 
     dplyr::distinct(fips,.keep_all = TRUE) %>%
     glimpse()
Rows: 3,107
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Chambers County…
$ x                  <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y                  <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips               <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking            <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25              <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2                <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone              <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65             <dbl> 19.5, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.5, 18.…
$ pop_black          <dbl> 9.24, 43.94, 39.24, 41.94, 43.96, 17.26, 45.94, 12.…
$ pop_hipanic        <dbl> 4.54, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.62, 1.8…
$ pop_white          <dbl> 83.06, 52.64, 56.42, 56.28, 52.98, 70.90, 50.56, 83…
$ education          <dbl> 66, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 45, 58,…
$ poverty_percent    <dbl> 13.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29.76, 20…
$ income_equality    <dbl> 4.5, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5.2, 5…
$ uninsured          <dbl> 13.34, 12.74, 13.34, 12.86, 13.28, 13.16, 15.16, 13…
$ dem                <dbl> 36.78, 111.70, 227.03, 68.24, 69.29, 99.32, 96.03, …
$ radon_zone_class   <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural        <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…

Now we run all above function with Pipe (%>%):

Code
df_clean = df %>% 
  janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE) %>%
  janitor::remove_empty() %>%
  janitor::clean_names() %>%
  dplyr::mutate_at(4:21, as.numeric) %>%
  dplyr::mutate_at(22:24, as.factor) %>%
  dplyr::distinct(fips,.keep_all = TRUE) %>%
     glimpse()
Rows: 3,107
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Chambers County…
$ x                  <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y                  <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips               <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking            <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25              <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2                <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone              <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65             <dbl> 19.5, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.5, 18.…
$ pop_black          <dbl> 9.24, 43.94, 39.24, 41.94, 43.96, 17.26, 45.94, 12.…
$ pop_hipanic        <dbl> 4.54, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.62, 1.8…
$ pop_white          <dbl> 83.06, 52.64, 56.42, 56.28, 52.98, 70.90, 50.56, 83…
$ education          <dbl> 66, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 45, 58,…
$ poverty_percent    <dbl> 13.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29.76, 20…
$ income_equality    <dbl> 4.5, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5.2, 5…
$ uninsured          <dbl> 13.34, 12.74, 13.34, 12.86, 13.28, 13.16, 15.16, 13…
$ dem                <dbl> 36.78, 111.70, 227.03, 68.24, 69.29, 99.32, 96.03, …
$ radon_zone_class   <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural        <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…

Exercise

  1. Create a R-Markdown Project on exiting Homework directory

  2. Create a R-Markdown documents (name homework_02.rmd) in this project directory and do all Tasks (1 to 6) using the data shown below.

  3. Submit all codes and output as a HTML document (homework_02.html) before class of next week.

Required R-Package

tidyverse and janitor

Data

bd_arsenic_data_raw.csv

Download the data and save in your project directory. Use read_csv to load the data in your R-session. For example:

bf<-read_csv(“bd_arsenic_data_raw.csv”))

Tasks

  1. Use janitor::row_to_names() remove text from column heading

  2. Remove empty rows and columns using janitor::remove_empty()

  3. Clean column names using janitor::clean_names()

  4. Use as.numeric and as.factor arguments in dplyr::mutate_at() function to convert ‘chr’ columns to numeric and factors accordingly

  5. Find duplicate records and remove them

  6. Run all above functions with pipe.

Further Reading

  1. Overview of janitor functions

  2. Cleaning and Exploring Data with the “janitor” Package