class: center, middle

# Apache Arrow: Enabling<br/>Data Engineering Tasks in R

### Ian Cook <small> [@ianmcook](http://twitter.com/ianmcook)</small>

### <small>Ursa Computing</small>

#### Video of this talk: [youtu.be/SXbq4OYtsFA](https://youtu.be/SXbq4OYtsFA)

---

# Apache Arrow

- Arrow is a cross-language toolkit for in-memory analytics
  - Defines a language-independent columnar memory format for tabular data
  - Provides libraries for working with tabular data in many languages
      - C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust
  - Emphasizes performance, efficiency, standardization, and interoperability

<br /><br />

- The Arrow project started in 2016 under the Apache Software Foundation
  - A collaboration of developers of the Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, pandas, Parquet, Phoenix, Spark, and Storm projects
  - Version 1.0 was released July 2020

---

# Ursa Computing

- Wes McKinney founded **Ursa Labs** in 2018
  - An independent development lab
  - In partnership with RStudio and Two Sigma
  - Sponsored by NVIDIA, Intel, Bloomberg, G-Research, OneSixtyTwo Tech, et al.
  - Primary goal: Advance Apache Arrow

<br /><br />

- Wes founded **Ursa Computing** in 2020
  - Goals: Sustain Arrow; build enterprise products and services for data teams
  - Raised $4.9 million in seed funding led by GV
  - Continues to maintain a Labs team and accept Labs sponsorships
  - We're hiring: [jobs.lever.co/Ursa](https://jobs.lever.co/Ursa)

---

# The arrow R Package

- The **arrow** R package exposes an interface to the Arrow C++ library
  - Provides low-level access to the C++ library API
  - Provides higher-level access through a **dplyr** backend and familiar functions
- Facilitates many common data engineering and ETL tasks in R
- More details at [arrow.apache.org/docs/r](https://arrow.apache.org/docs/r/)

<br />

Install the latest CRAN release:

```r
install.packages("arrow")
```

---

# Data Engineering

- Data engineering has emerged as a discipline distinct from data science
- Data engineers typically **build, manage, and optimize systems for transforming data into forms that facilitate analysis**
- What’s important in data engineering is very different from what’s important in statistics and data science

<br /><br />

- For example, a data engineer might need to:
  - Choose **file formats** and **compression algorithms** based on user requirements, data longevity, performance needs, storage costs, and more
  - Carefully control the **data types** of columns to avoid truncation, loss of precision, floating point errors, and inefficiencies in storage and computation
  - Ensure **interoperability** of data files with multiple languages and big data tools such as Spark, Impala, Hive, Presto, Athena, and BigQuery
  - Move data files between **local filesystems** and **cloud object storage**

---

# Data Engineering in R

- To date, R has taken a back seat to other languages for data engineering
- R has lacked the tooling to perform some essential tasks

<br /><br />

- Apache Arrow and the R package **arrow** are helping to change this
- **arrow** brings a powerful suite of data engineering capabilities to R

---

class: center, middle

# Examples

### [github.com/ianmcook/indy-user-may-2021](https://github.com/ianmcook/indy-user-may-2021)

---

class: center, middle

# Q&A

### Ian Cook <small> [@ianmcook](http://twitter.com/ianmcook)</small>