Apache Arrow: Enabling Data Engineering Tasks in R

class: center, middle

# Apache Arrow: Enabling Data Engineering Tasks in R

### Ian Cook &nbsp;[@ianmcook](http://twitter.com/ianmcook)
### Ursa Computing

#### Video of this talk: [youtu.be/SXbq4OYtsFA](https://youtu.be/SXbq4OYtsFA)

---

# Apache Arrow

- Arrow is a cross-language toolkit for in-memory analytics
 - Defines a language-independent columnar memory format for tabular data
 - Provides libraries for working with tabular data in many languages
 - Including C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust
 - Emphasizes performance, efficiency, standardization, and interoperability
 
- The Arrow project started in 2016 under the Apache Software Foundation
 - A collaboration of developers of the Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, pandas, Parquet, Phoenix, Spark, and Storm projects
 - Version 1.0 was released in July 2020

---

# Ursa Computing

- Wes McKinney founded **Ursa Labs** in 2018
 - An independent development lab
 - In partnership with RStudio and Two Sigma
 - Sponsored by NVIDIA, Intel, Bloomberg, G-Research, OneSixtyTwo Tech, et al.
 - Primary goal: Advance Apache Arrow
 
- Wes founded **Ursa Computing** in 2020
 - Goals: Sustain Arrow; build enterprise products and services for data teams
 - Raised $4.9 million in seed funding led by GV
 - Continues to maintain a Labs team and accept Labs sponsorships
 - We're hiring: [jobs.lever.co/Ursa](https://jobs.lever.co/Ursa)

---

# The arrow R Package

- The **arrow** R package exposes an interface to the Arrow C++ library
  - Provides low-level access to the C++ library API
  - Provides higher-level access through a **dplyr** backend and familiar functions
  - Facilitates many common data engineering and ETL tasks in R
  - More details at [arrow.apache.org/docs/r](https://arrow.apache.org/docs/r/)

Install the latest CRAN release:

```r
install.packages("arrow")
```
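
A minimal sketch of the **dplyr** backend, assuming **arrow** and **dplyr** are installed. The built-in `mtcars` data and the chosen columns are for illustration only:

```r
library(arrow)
library(dplyr)

# Copy a built-in data frame into an Arrow Table (columnar, in memory)
cars_tbl <- Table$create(mtcars)

# dplyr verbs build a query against the Table; collect() runs it
# and returns the result to R as a data frame
four_cyl <- cars_tbl %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt) %>%
  collect()
```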
  
  
---

# Data Engineering

- Data engineering has emerged as a discipline distinct from data science
 - Data engineers typically **build, manage, and optimize systems for transforming data into forms that facilitate analysis**
 - What’s important in data engineering is very different from what’s important in statistics and data science
 
- For example, a data engineer might need to:
 - Choose **file formats** and **compression algorithms** based on user requirements, data longevity, performance needs, storage costs, and more
 - Carefully control the **data types** of columns to avoid truncation, loss of precision, floating point errors, and inefficiencies in storage and computation
 - Ensure **interoperability** of data files with multiple languages and big data tools such as Spark, Impala, Hive, Presto, Athena, and BigQuery
 - Move data files between **local filesystems** and **cloud object storage**
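
As a rough illustration of the first two tasks, this sketch sets column types explicitly and writes a compressed Parquet file with **arrow**. The data frame, column names, and temporary path are hypothetical, and zstd is used only if the installed Arrow build supports it:

```r
library(arrow)

# Hypothetical example data
orders <- data.frame(id = c(1L, 2L, 3L), amount = c(10.5, 20.25, 30))

# Declare the column types instead of relying on the inferred defaults
target <- schema(id = int64(), amount = float64())
orders_tbl <- Table$create(orders)$cast(target)

# Pick a compression codec that this Arrow build actually provides
codec <- if (codec_is_available("zstd")) "zstd" else "uncompressed"

path <- tempfile(fileext = ".parquet")
write_parquet(orders_tbl, path, compression = codec)

# Read the file back as an Arrow Table to inspect the stored types
orders_back <- read_parquet(path, as_data_frame = FALSE)
orders_back$schema
```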

---

# Data Engineering in R

- To date, R has taken a back seat to other languages for data engineering
 - R has lacked the tooling to perform some essential tasks
 
- Apache Arrow and the R package **arrow** are helping to change this
 - **arrow** brings a powerful suite of data engineering capabilities to R

---
class: center, middle

# Examples

### [github.com/ianmcook/indy-user-may-2021](https://github.com/ianmcook/indy-user-may-2021)

---
class: center, middle

# Q&A

### Ian Cook &nbsp;[@ianmcook](http://twitter.com/ianmcook)