packages <- c(
  'tidyverse',
  'data.table',
  'feather',
  'arrow'
)
When working with large and complex datasets in R, effective techniques for importing and exporting data are essential. Because these datasets can be enormous, standard data-transfer approaches are often insufficient and can make the workflow slow and inefficient. Efficient data-management methods are therefore crucial for handling the size and complexity involved, and they help keep your analysis accurate, reliable, and fast. In this tutorial we will explore three packages: data.table, feather, and arrow. These packages are designed to handle large datasets efficiently, making them ideal for big data analysis in R.
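The chunk that attaches these packages is hidden in the rendered page; below is a minimal sketch of how it might look. The exact mechanism is an assumption, suggested by the "Successfully loaded packages:" message that follows.

invisible(lapply(packages, library, character.only = TRUE))  # attach each package by name
cat("Successfully loaded packages:\n")
search()[grepl("^package:", search())]                       # list everything currently attached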
Successfully loaded packages:
[1] "package:arrow" "package:feather" "package:data.table"
[4] "package:lubridate" "package:forcats" "package:stringr"
[7] "package:dplyr" "package:purrr" "package:readr"
[10] "package:tidyr" "package:tibble" "package:ggplot2"
[13] "package:tidyverse" "package:stats" "package:graphics"
[16] "package:grDevices" "package:utils" "package:datasets"
[19] "package:methods" "package:base"
All datasets used in this exercise can be downloaded from my Dropbox or my GitHub account.
We will use the read_csv() function to import nepal_df_balance.csv as a data.frame:
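A minimal sketch of this step (the object name mf is an assumption, carried through the rest of the tutorial):

mf <- read_csv("nepal_df_balance.csv")  # read_csv() comes from readr, loaded via tidyverse
glimpse(mf)                             # compact column-by-column summary (dplyr)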
Rows: 17,865
Columns: 20
$ Foodstatus <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0…
$ Schooling_year <dbl> 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Age <dbl> 55, 25, 25, 84, 84, 16, 65, 49, 60, 74, 69, 45, 6…
$ Household_size <dbl> 3, 3, 3, 2, 2, 2, 3, 3, 3, 1, 1, 2, 3, 3, 1, 3, 3…
$ Rainfed_area <dbl> 0.175, 0.258, 0.257, 0.334, 0.334, 0.127, 0.000, …
$ Irrigated_area <dbl> 0.076, 0.000, 0.000, 0.000, 0.000, 0.051, 0.000, …
$ Remittance <dbl> 0.000, 0.000, 0.000, 30.600, 30.600, 0.000, 0.000…
$ No_livestock <dbl> 3.210, 1.960, 1.961, 2.830, 2.830, 1.420, 1.480, …
$ Infrastructure_Index <dbl> 0.381, 0.726, 0.727, 0.765, 0.765, 0.773, 0.785, …
$ Region <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ Sex <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0…
$ Caste <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1…
$ Livelihood <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ School_Class <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Household_Class <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Remitance_Class <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Foodstatus_ID <chr> "Food inadquate", "Food inadquate", "Food inadqua…
$ Sex_ID <chr> "Male", "Male", "Male", "Female", "Female", "Male…
$ Region_ID <chr> "central", "central", "central", "central", "cent…
$ Livelihood_ID <chr> "Agriculture Household", "Agriculture Household",…
The data.table package offers a high-performance alternative to the standard data.frame object in base R. With a range of syntax and feature enhancements, it provides ease of use, convenience, and programming speed. Whether you are working with large datasets or complex queries, data.table is a versatile and efficient solution for data manipulation. Its intuitive syntax, powerful indexing capabilities, and seamless integration with other R packages make it a valuable tool for any data scientist or analyst looking to optimize their workflow and get the most out of their data.
install.packages("data.table")
The latest development version (only if a newer one is available):
data.table::update_dev_pkg()
The latest development version (force install):
install.packages("data.table", repos = "https://rdatatable.gitlab.io/data.table")
Important Features of data.table
fast and friendly delimited file reader: ?fread, see also convenience features for small data
fast and feature rich delimited file writer: ?fwrite
low-level parallelism: many common operations are internally parallelized to use multiple CPU threads
fast and scalable aggregations; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
fast and feature rich joins: ordered joins (e.g. rolling forwards, backwards, nearest and limited staleness), overlapping range joins (similar to IRanges::findOverlaps), non-equi joins (i.e. joins using operators >, >=, <, <=), aggregate on join (by=.EACHI), update on join
fast add/update/delete columns by reference by group using no copies at all
fast and feature rich reshaping of data: ?dcast (pivot/wider/spread) and ?melt (unpivot/longer/gather)
any R function from any R package can be used in queries, not just the subset of functions made available by a database backend; columns of type list are supported too
has no dependencies at all other than base R itself, for simpler production/maintenance
the R dependency is as old as possible for as long as possible, dated April 2014, and we continuously test against that version; e.g. v1.11.0 released on 5 May 2018 bumped the dependency up from 5 year old R 3.0.0 to 4 year old R 3.1.0
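A brief illustration of a few of these features; the toy data and column names here are invented purely for this sketch:

library(data.table)
DT <- data.table(grp = c("a", "a", "b"), val = c(1, 2, 3))
DT[, .(total = sum(val)), by = grp]  # fast aggregation by group
DT[, val2 := val * 2]                # add a column by reference, no copy made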
We can create a data.table object using the data.table() function. Here is an example:
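A minimal sketch (the columns are invented for illustration):

dt <- data.table(
  id    = 1:4,
  score = c(10.5, 8.2, 9.7, 11.3)
)
class(dt)  # "data.table" "data.frame"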
You can also convert existing objects to a data.table using as.data.table(), or setDT() to convert a data.frame by reference:
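For example (df here is a throwaway data.frame created for the sketch):

df <- data.frame(a = 1:3, b = letters[1:3])
dt1 <- as.data.table(df)  # returns a copy as a data.table
setDT(df)                 # converts df itself, by reference, without copying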
fread() and fwrite()
If you are dealing with large datasets and looking for an efficient way to read files into R as data tables, the data.table package provides the highly efficient fread() function. It outperforms alternatives like read.csv() or read.table() and is specifically designed to handle large datasets, so using fread() for your file reading needs can save time and increase productivity.
The fread() function offers great versatility when reading various types of delimited files: you can specify delimiters, select specific columns, and set particular data types while reading to optimize memory usage. It is especially powerful for large datasets thanks to its exceptional speed and memory efficiency.
fread(input, file, ...)
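A sketch of reading the same CSV with fread() (the object name mf_dt is an assumption; str() produces the summary shown below):

mf_dt <- fread("nepal_df_balance.csv")
str(mf_dt)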
Classes 'data.table' and 'data.frame': 17865 obs. of 20 variables:
$ Foodstatus : int 1 1 1 0 0 1 0 1 1 1 ...
$ Schooling_year : int 0 5 5 0 0 0 0 0 0 0 ...
$ Age : int 55 25 25 84 84 16 65 49 60 74 ...
$ Household_size : int 3 3 3 2 2 2 3 3 3 1 ...
$ Rainfed_area : num 0.175 0.258 0.257 0.334 0.334 0.127 0 0.052 0.267 0 ...
$ Irrigated_area : num 0.076 0 0 0 0 0.051 0 0.016 0 0 ...
$ Remittance : num 0 0 0 30.6 30.6 0 0 0 0 0 ...
$ No_livestock : num 3.21 1.96 1.96 2.83 2.83 ...
$ Infrastructure_Index: num 0.381 0.726 0.727 0.765 0.765 0.773 0.785 0.809 0.82 0.823 ...
$ Region : int 1 1 1 1 1 1 1 1 1 1 ...
$ Sex : int 0 0 0 1 1 0 0 0 0 0 ...
$ Caste : int 0 0 0 1 1 1 1 1 1 1 ...
$ Livelihood : int 1 1 1 0 0 1 1 1 1 1 ...
$ School_Class : int 0 0 0 0 0 0 0 0 0 0 ...
$ Household_Class : int 0 0 0 0 0 0 0 0 0 0 ...
$ Remitance_Class : int 0 0 0 0 0 0 0 0 0 0 ...
$ Foodstatus_ID : chr "Food inadquate" "Food inadquate" "Food inadquate" "Food adquate" ...
$ Sex_ID : chr "Male" "Male" "Male" "Female" ...
$ Region_ID : chr "central" "central" "central" "central" ...
$ Livelihood_ID : chr "Agriculture Household" "Agriculture Household" "Agriculture Household" "Non Agriculture Household" ...
- attr(*, ".internal.selfref")=<externalptr>
In the data.table package, fwrite() is the counterpart to fread(). It is used to write data tables to files, usually in CSV or other delimited formats. With a focus on speed and efficiency, fwrite() is optimized to handle large datasets effectively, making it an excellent option for saving them.
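A minimal sketch (the output file name is an assumption):

fwrite(mf_dt, "nepal_df_balance_dt.csv")  # fast CSV export of the data.table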
Feather is a binary columnar serialization tool specifically designed to make reading and writing data frames highly efficient, and to make it easy to share data across data analysis languages. It offers bindings for both Python (written by Wes McKinney) and R (written by Hadley Wickham) and uses the Apache Arrow columnar memory specification to represent binary data on disk, which results in fast read and write operations. This is particularly useful for encoding null/NA values and variable-length types like UTF-8 strings. Feather is an integral part of the Apache Arrow project and defines its own simplified schemas and metadata for on-disk representation.
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:
Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
Feather is extremely fast. Since Feather does not currently use any compression internally, it works best with solid-state drives, as found in most of today's laptop computers. For this first release, the developers prioritized a simple implementation and thus write unmodified Arrow memory straight to disk.
Feather currently supports the following column types:
A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double).
Logical/boolean values.
Dates, times, and timestamps.
Factors/categorical variables that have a fixed set of possible values.
UTF-8 encoded strings.
Arbitrary binary data.
All column types support NA/null values.
install.packages("feather")
The feather package in R provides functions to read and write data in the Feather file format. Feather is a fast, lightweight, and cross-language columnar storage file format designed for efficient data interchange between programming languages.
First, we create a Feather file using the write_feather() function:
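A sketch, reusing the mf data frame imported earlier (the file name is an assumption):

write_feather(mf, "nepal_df_balance.feather")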
Then we use the read_feather() function, which reads data from a Feather file into an R data frame:
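A sketch of the read step (str() produces the summary shown below):

mf_feather <- read_feather("nepal_df_balance.feather")
str(mf_feather)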
tibble [17,865 × 20] (S3: tbl_df/tbl/data.frame)
$ Foodstatus : num [1:17865] 1 1 1 0 0 1 0 1 1 1 ...
$ Schooling_year : num [1:17865] 0 5 5 0 0 0 0 0 0 0 ...
$ Age : num [1:17865] 55 25 25 84 84 16 65 49 60 74 ...
$ Household_size : num [1:17865] 3 3 3 2 2 2 3 3 3 1 ...
$ Rainfed_area : num [1:17865] 0.175 0.258 0.257 0.334 0.334 0.127 0 0.052 0.267 0 ...
$ Irrigated_area : num [1:17865] 0.076 0 0 0 0 0.051 0 0.016 0 0 ...
$ Remittance : num [1:17865] 0 0 0 30.6 30.6 0 0 0 0 0 ...
$ No_livestock : num [1:17865] 3.21 1.96 1.96 2.83 2.83 ...
$ Infrastructure_Index: num [1:17865] 0.381 0.726 0.727 0.765 0.765 0.773 0.785 0.809 0.82 0.823 ...
$ Region : num [1:17865] 1 1 1 1 1 1 1 1 1 1 ...
$ Sex : num [1:17865] 0 0 0 1 1 0 0 0 0 0 ...
$ Caste : num [1:17865] 0 0 0 1 1 1 1 1 1 1 ...
$ Livelihood : num [1:17865] 1 1 1 0 0 1 1 1 1 1 ...
$ School_Class : num [1:17865] 0 0 0 0 0 0 0 0 0 0 ...
$ Household_Class : num [1:17865] 0 0 0 0 0 0 0 0 0 0 ...
$ Remitance_Class : num [1:17865] 0 0 0 0 0 0 0 0 0 0 ...
$ Foodstatus_ID : chr [1:17865] "Food inadquate" "Food inadquate" "Food inadquate" "Food adquate" ...
$ Sex_ID : chr [1:17865] "Male" "Male" "Male" "Female" ...
$ Region_ID : chr [1:17865] "central" "central" "central" "central" ...
$ Livelihood_ID : chr [1:17865] "Agriculture Household" "Agriculture Household" "Agriculture Household" "Non Agriculture Household" ...
Apache Arrow is a cross-language development platform for processing data, both in-memory and larger-than-memory. It provides a standardized, language-independent columnar memory format for flat and hierarchical data, organized to support fast analytic operations on modern hardware. Additionally, it offers computational libraries and zero-copy streaming, messaging, and interprocess communication.
The arrow R package exposes an interface to the Arrow C++ library, allowing access to many of its features in R. It provides not only low-level access to the Arrow C++ library API but also higher-level access through a dplyr backend and familiar R functions.
The arrow package boasts several key features, including interoperability, columnar data representation, and high performance. Arrow offers seamless communication between different systems and languages, making it easy to exchange data between R and other programming languages such as Python, Julia, and C++. Arrow uses a columnar memory layout, which can be more efficient for many analytical tasks than traditional row-based formats. Arrow is designed for high-performance data processing, making it suitable for big data and parallel computing environments.
The arrow package also provides several functionalities. It allows importing data from various sources into R and exporting R data to Arrow files. Arrow data can be manipulated in R for various tasks such as filtering, sorting, and aggregating. Arrow can be integrated with other R packages for advanced data analysis and visualization tasks.
Apache Arrow relies on its in-memory columnar format, a standardized, programming-language-independent definition for representing structured, table-like datasets in memory. The arrow R package uses the Table class to store these objects, which behave like data frames. You can use the arrow_table() function to create new Arrow Tables, much like data.frame() is used to produce new data frames.
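A minimal sketch; the columns x and y are hypothetical, chosen to match the printed schema below:

library(arrow)
tbl <- arrow_table(x = 1:4, y = c("a", "b", "c", "d"))
tbl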
Table
4 rows x 2 columns
$x <int32>
$y <string>
We can also convert an existing data.frame to an Arrow Table:
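One way to do this is with as_arrow_table() from the arrow package (the object name mf_arrow is an assumption):

mf_arrow <- as_arrow_table(mf)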
You can use [ to specify subsets of an Arrow Table in the same way you would for a data frame:
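For example (the selected columns are arbitrary):

mf_arrow[1:5, c("Age", "Sex_ID")]  # first five rows, two columns, as an Arrow Table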
Along the same lines, the $ operator can be used to extract named columns:
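For example:

mf_arrow$Age  # returns the column as an Arrow ChunkedArray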
One of the critical features of Arrow is its ability to handle data in different formats, including CSV, Parquet, and Arrow (also called Feather). While many packages support CSV, Arrow's high-speed CSV reading and writing capabilities make it stand out. Additionally, Arrow supports formats like Parquet and Arrow/Feather, which are not widely supported in other packages, making it an excellent choice for handling complex data structures.
Another unique feature of Arrow is its support for multi-file datasets. It can store a single rectangular dataset across multiple files, thus making it possible to work with large datasets that cannot fit into memory. This feature is handy for data scientists and analysts who work with big data and must process large datasets efficiently.
When the goal is to read a single data file into memory, there are several functions you can use:
read_parquet(): read a file in Parquet format
read_feather(): read a file in Arrow/Feather format
read_delim_arrow(): read a delimited text file
read_csv_arrow(): read a comma-separated values (CSV) file
read_tsv_arrow(): read a tab-separated values (TSV) file
read_json_arrow(): read a JSON data file
For writing data to single files, the arrow package provides the following functions, which can be used with both R data frames and Arrow Tables:
write_parquet(): write a file in Parquet format
write_feather(): write a file in Arrow IPC format
write_csv_arrow(): write a file in CSV format
We will write it to a Parquet file using the write_parquet() function:
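A sketch, again reusing mf (the file name is an assumption):

write_parquet(mf, "nepal_df_balance.parquet")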
We can then use read_parquet() to load the data from this file. As shown below, the default behavior is to return a data frame, but when we set as_data_frame = FALSE the data are read in as an Arrow Table:
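A sketch of both variants (the object names are assumptions):

mf_parquet_df  <- read_parquet("nepal_df_balance.parquet")                        # data frame (default)
mf_parquet_tbl <- read_parquet("nepal_df_balance.parquet", as_data_frame = FALSE) # Arrow Table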
Comparing file sizes among different file formats (data frame, Parquet, Feather, and data table) can be insightful in understanding their efficiency in storage. However, please note that the actual file size depends on various factors such as the data type, compression settings, and the nature of the data itself.
Now, let's check the disk space used by these three formats:
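A sketch, assuming the file names used in the earlier steps; the elapsed-time output below suggests a system.time() call was also run:

file.size("nepal_df_balance_dt.csv")   # CSV written by fwrite()
file.size("nepal_df_balance.feather")  # Feather file
file.size("nepal_df_balance.parquet")  # Parquet file
system.time(read_parquet("nepal_df_balance.parquet"))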
   user  system elapsed 
  0.165   0.004   0.170
Dealing with big data in R requires efficient import and export methods to ensure performance and scalability. Utilizing columnar storage formats like Parquet and Feather, along with database connections and distributed computing frameworks, can help you effectively handle and analyze large datasets in R. Additionally, compression can further optimize storage and transfer of big data files.
This tutorial covers efficient data export and import using the R packages data.table, arrow, and feather, which handle large datasets with speed and ease. We explore data.table's syntax for importing and exporting data and feather's binary columnar format for seamless data exchange between R and other programming languages. Using these packages, data scientists can handle large datasets efficiently, ensuring efficient storage, speed, and readability in data operations.
To optimize data manipulation workflows, consider exploring advanced features of data.table, and experimenting with feather’s compatibility with various data science ecosystems. On the other hand, the arrow package provides a powerful platform for efficient analytic operations on data, with its standardized columnar memory format, computational libraries, and zero-copy streaming capabilities. Its interoperability, columnar data representation, and high performance make it a valuable tool for big data and parallel computing environments.
Compared to other formats like CSV and Feather, Parquet files are significantly smaller on disk, making them an excellent option for handling big data. Although Feather files read and write faster than Parquet, they take up more space on disk. The Parquet format, however, supports compression, which helps reduce file sizes significantly further. The actual size of files depends on factors such as the compression codec used (e.g., Snappy, Gzip) and the nature of the data itself. With all these advantages, Parquet is an excellent choice for storing large datasets efficiently while keeping storage costs low.