Document Type

Article

Publication Date

4-3-2019

Publication Title

Journal of Computational and Graphical Statistics

Volume

28

Issue

2

Abstract

Many interesting datasets available on the Internet are of a medium size—too big to fit into a personal computer’s memory, but not so large that they would not fit comfortably on its hard disk. In the coming years, datasets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Supplementary material for this article is available online.

Comments

Peer reviewed accepted manuscript.

First Page

256

Last Page

264

Digital Object Identifier (DOI)

10.1080/10618600.2018.1512867

Rights

“Licensed to Smith College and distributed CC-BY under the Smith College Faculty Open Access Policy.”

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
