Code
packages <- c(
'tidyverse',
'janitor'
) The janitor package in the R programming language is a popular and powerful tool for data cleaning and manipulation. It offers a variety of functions that help to clean up messy data sets quickly and easily and turn them into a more organized format. The package includes functions that remove duplicate data, handle missing values, and format column names. Additionally, it provides features to handle whitespace, convert data types, and generate summary statistics for data sets. Overall, the janitor package is an essential resource for anyone working with data in R, as it can significantly save time and effort in data cleaning and preparation. The janitor package is designed to be user-friendly and is optimized for those who are new to R. It includes simple functions for examining and cleaning up messy data, and the package can help advanced R users to perform these tasks more quickly and efficiently. The main functions of the janitor package include formatting data frame column names, creating and formatting frequency tables of one, two, or three variables, and providing other tools for cleaning and examining data frames.

Here are some common functionalists provided by the {janitor} package:
clean_names(): Renames columns by converting them to snake_case or lower_case and removing special characters.
remove_empty(): Removes rows and columns that are entirely empty.
remove_constant(): Removes columns that have constant values throughout.
get_dupes(): Finds duplicate rows in a data frame.
tabyl(): Generates frequency tables (similar to table() but returns a data frame).
adorn_totals(): Adds total rows or columns to a data frame.
add_columns(): Adds new columns to a data frame.
recode_factor(): Modifies levels of a factor variable.
remove_missing(): Removes rows or columns containing missing values.
replace_na(): Replaces missing values with specified values.
factorize(): Converts columns to factor variables.row_to_names(): Converts a row to column names.
crossing(): Creates all combinations of rows from multiple data frames.
Using these functions, you can perform various data cleaning and transformation tasks
#| warning: false
#| error: false
# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)Successfully loaded packages:
[1] "package:janitor" "package:lubridate" "package:forcats"
[4] "package:stringr" "package:dplyr" "package:purrr"
[7] "package:readr" "package:tidyr" "package:tibble"
[10] "package:ggplot2" "package:tidyverse" "package:stats"
[13] "package:graphics" "package:grDevices" "package:utils"
[16] "package:datasets" "package:methods" "package:base"
We will create some “bad” data and clean them with janitor. We will apply following functions:
clean_names()
remove_empty()
trim_ws()
get_dupes()
remove_constant()
The clean_names() function is used to clean column names in a data frame. It converts the column names to lowercase and replaces all spaces and special characters with underscores.
Column.One Column.Two.. Column.Three.. X.Column.four
1 1 6 11 11
2 2 7 12 12
3 3 8 13 13
4 4 9 14 14
5 5 10 15 15
The remove_empty() function is used to remove rows or columns that contain only missing or empty values.
x y z
1 1 NA NA
2 NA NA NA
3 4 3 NA
The get_dupes() function is used to find duplicate rows in a data frame.
Column.One Column.Two
1 1 A
2 2 B
3 3 C
4 1 A
Now, we will clean up messy data using some functions of janitor packages. We will use Lung Cancer Mortality data from the USA. All data set use in this exercise can be downloaded from here.
Rows: 4
Columns: 2
$ Column.One <dbl> 1, 2, 3, 1
$ Column.Two <chr> "A", "B", "C", "A"
You may have received data files that contain some text at the top of the spreadsheet before the actual data begins. In this data-frame, the column headings briefly describe the data. However, we want the first row to be the column heading. To achieve this, we will use the row_to_names() function. This function requires the following arguments: the data source, the row number from which the column names should come, whether that row should be deleted from the data, and whether the rows above it should be deleted from the data.
Rows: 3,117
Columns: 26
$ REGION_ID <chr> "3", "3", "3", "3", NA, NA, "3", "3", "3", "3", "…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", NA, N…
$ County <chr> "Baldwin County", "Butler County", "Butler County…
$ `Empty Column 1` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ X <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips <chr> "1003", "1013", "1013", "1017", NA, NA, "1023", "…
$ `Empty_Column 2` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", NA, NA, "31.8", "…
$ Smoking <chr> "20.8", "26", "26", "25.1", NA, NA, "21.8", "22.6…
$ `PM 25` <chr> "7.89", "8.46", "8.46", "8.87", NA, NA, "8.58", "…
$ NO2 <chr> "0.7939", "0.6344", "0.6344", "0.8442", NA, NA, "…
$ SO2 <chr> "0.035343", "0.0135", "0.0135", "0.048177", NA, N…
$ Ozone <chr> "39.79", "38.31", "38.31", "40.1", NA, NA, "37.07…
$ `Pop 65` <chr> "19.5", "19", "19", "18.9", NA, NA, "22.1", "19",…
$ `Pop Black` <chr> "9.24", "43.94", "43.94", "39.24", NA, NA, "41.94…
$ `Pop Hipanic` <chr> "4.54", "1.26", "1.26", "2.14", NA, NA, "0.86", "…
$ `Pop White` <chr> "83.06", "52.64", "52.64", "56.42", NA, NA, "56.2…
$ Education <chr> "66", "38", "38", "47", NA, NA, "55", "39", "60",…
$ `Poverty %` <chr> "13.14", "26.14", "26.14", "21.52", NA, NA, "23.0…
$ `Income Equality` <chr> "4.5", "5.1", "5.1", "4.7", NA, NA, "5.8", "8.2",…
$ Uninsured <chr> "13.34", "12.74", "12.74", "13.34", NA, NA, "12.8…
$ DEM <chr> "36.78", "111.70", "111.70", "227.03", NA, NA, "6…
$ `Radon Zone Class` <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", NA, NA, "…
$ `Urban Rural` <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production` <chr> "No", "No", "No", "No", NA, NA, "No", "No", "No",…
Still data has some empty columns and and empty rows, we are going to remove these empty columns and rows using remove_empty() function:
Rows: 3,110
Columns: 24
$ REGION_ID <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ County <chr> "Baldwin County", "Butler County", "Butler County…
$ X <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips <chr> "1003", "1013", "1013", "1017", "1023", "1025", "…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53…
$ Smoking <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2…
$ `PM 25` <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "…
$ NO2 <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934",…
$ SO2 <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.02…
$ Ozone <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.6…
$ `Pop 65` <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3",…
$ `Pop Black` <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.9…
$ `Pop Hipanic` <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "…
$ `Pop White` <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.…
$ Education <chr> "66", "38", "38", "47", "55", "39", "60", "35", "…
$ `Poverty %` <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.…
$ `Income Equality` <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", …
$ Uninsured <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.…
$ DEM <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "…
$ `Radon Zone Class` <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3",…
$ `Urban Rural` <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production` <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
Now, we are going fix column headings using clean_names(). It converts the column names to lowercase and replaces all spaces and special characters with underscores.
Rows: 3,110
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Butler County",…
$ x <chr> "789777.5039", "877731.5725", "877731.5725", "98421…
$ y <chr> "884557.0795", "1007285.71", "1007285.71", "1148648…
$ fips <chr> "1003", "1013", "1013", "1017", "1023", "1025", "10…
$ lcb_mortality_rate <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53.7…
$ smoking <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2",…
$ pm_25 <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "8.…
$ no2 <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934", "…
$ so2 <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.0239…
$ ozone <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.68"…
$ pop_65 <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3", "…
$ pop_black <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.96"…
$ pop_hipanic <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "6.…
$ pop_white <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.98…
$ education <chr> "66", "38", "38", "47", "55", "39", "60", "35", "53…
$ poverty_percent <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.6"…
$ income_equality <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", "4…
$ uninsured <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.28…
$ dem <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "69…
$ radon_zone_class <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3", "…
$ urban_rural <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "Nonm…
$ coal_production <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
All data are exported in R as chr. We are going to convert column from 4 to 21as.numeric and 22 to 23 as.factor. We will use dplyr::mutate_at() function:
Rows: 3,110
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Butler County",…
$ x <dbl> 789777.5, 877731.6, 877731.6, 984214.7, 726606.5, 7…
$ y <dbl> 884557.1, 1007285.7, 1007285.7, 1148648.7, 1023615.…
$ fips <dbl> 1003, 1013, 1013, 1017, 1023, 1025, 1031, 1035, 103…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.…
$ smoking <dbl> 20.8, 26.0, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.…
$ pm_25 <dbl> 7.89, 8.46, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.2…
$ no2 <dbl> 0.7939, 0.6344, 0.6344, 0.8442, 0.5934, 0.6432, 0.5…
$ so2 <dbl> 0.035343, 0.013500, 0.013500, 0.048177, 0.023989, 0…
$ ozone <dbl> 39.79, 38.31, 38.31, 40.10, 37.07, 37.68, 38.46, 37…
$ pop_65 <dbl> 19.5, 19.0, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.…
$ pop_black <dbl> 9.24, 43.94, 43.94, 39.24, 41.94, 43.96, 17.26, 45.…
$ pop_hipanic <dbl> 4.54, 1.26, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.6…
$ pop_white <dbl> 83.06, 52.64, 52.64, 56.42, 56.28, 52.98, 70.90, 50…
$ education <dbl> 66, 38, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 38,…
$ poverty_percent <dbl> 13.14, 26.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29…
$ income_equality <dbl> 4.5, 5.1, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5…
$ uninsured <dbl> 13.34, 12.74, 12.74, 13.34, 12.86, 13.28, 13.16, 15…
$ dem <dbl> 36.78, 111.70, 111.70, 227.03, 68.24, 69.29, 99.32,…
$ radon_zone_class <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…
Now will check the duplicates record in the this data:
# A tibble: 6 × 25
fips dupe_count region_id state county x y lcb_mortality_rate
<dbl> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1013 2 3 Alabama Butler C… 8.78e5 1.01e6 38.3
2 1013 2 3 Alabama Butler C… 8.78e5 1.01e6 38.3
3 1053 2 3 Alabama Escambia… 8.39e5 9.34e5 58.3
4 1053 2 3 Alabama Escambia… 8.39e5 9.34e5 58.3
5 5011 2 3 Arkansas Bradley … 3.54e5 1.16e6 69.9
6 5011 2 3 Arkansas Bradley … 3.54e5 1.16e6 69.9
# ℹ 17 more variables: smoking <dbl>, pm_25 <dbl>, no2 <dbl>, so2 <dbl>,
# ozone <dbl>, pop_65 <dbl>, pop_black <dbl>, pop_hipanic <dbl>,
# pop_white <dbl>, education <dbl>, poverty_percent <dbl>,
# income_equality <dbl>, uninsured <dbl>, dem <dbl>, radon_zone_class <fct>,
# urban_rural <fct>, coal_production <fct>
As shown above, the data frame is filtered down to those rows with duplicate values in the Fips column. For removing these duplicate rows, we have to use dplyr::distinct(.keep_all = TRUE)
Now will check the duplicates record in the this dat with get_dupes() function:
Rows: 0
Columns: 25
$ fips <dbl>
$ dupe_count <int>
$ region_id <chr>
$ state <chr>
$ county <chr>
$ x <dbl>
$ y <dbl>
$ lcb_mortality_rate <dbl>
$ smoking <dbl>
$ pm_25 <dbl>
$ no2 <dbl>
$ so2 <dbl>
$ ozone <dbl>
$ pop_65 <dbl>
$ pop_black <dbl>
$ pop_hipanic <dbl>
$ pop_white <dbl>
$ education <dbl>
$ poverty_percent <dbl>
$ income_equality <dbl>
$ uninsured <dbl>
$ dem <dbl>
$ radon_zone_class <fct>
$ urban_rural <fct>
$ coal_production <fct>
Now we run all above function with Pipe ( |> ):
Rows: 3,107
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Chambers County…
$ x <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25 <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2 <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2 <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65 <dbl> 19.5, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.5, 18.…
$ pop_black <dbl> 9.24, 43.94, 39.24, 41.94, 43.96, 17.26, 45.94, 12.…
$ pop_hipanic <dbl> 4.54, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.62, 1.8…
$ pop_white <dbl> 83.06, 52.64, 56.42, 56.28, 52.98, 70.90, 50.56, 83…
$ education <dbl> 66, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 45, 58,…
$ poverty_percent <dbl> 13.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29.76, 20…
$ income_equality <dbl> 4.5, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5.2, 5…
$ uninsured <dbl> 13.34, 12.74, 13.34, 12.86, 13.28, 13.16, 15.16, 13…
$ dem <dbl> 36.78, 111.70, 227.03, 68.24, 69.29, 99.32, 96.03, …
$ radon_zone_class <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…
This guide explains cleaning up datasets using the R package janitor. It emphasizes the importance of data cleanliness and offers an easy-to-use approach to data wrangling. The janitor package provides functions that simplify everyday data-cleaning tasks. It can handle missing data and convert data types, making it a valuable tool for preparing datasets for analysis. Remember, data cleaning is essential, and the janitor simplifies this task. With the skills gained in this guide, you can efficiently clean and tidy datasets for more meaningful analyses.