Document Type
Article
Publication Date
4-3-2019
Publication Title
Journal of Computational and Graphical Statistics
Volume
28
Issue
2
Abstract
Many interesting datasets available on the Internet are of a medium size—too big to fit into a personal computer’s memory, but not so large that they would not fit comfortably on its hard disk. In the coming years, datasets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Supplementary material for this article is available online.
First Page
256
Last Page
264
Recommended Citation
Baumer, Benjamin S., "A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data" (2019). Statistical and Data Sciences: Faculty Publications, Smith College, Northampton, MA.
https://scholarworks.smith.edu/sds_facpubs/34
Digital Object Identifier (DOI)
10.1080/10618600.2018.1512867
Rights
“Licensed to Smith College and distributed CC-BY under the Smith College Faculty Open Access Policy.”
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Included in
Data Science Commons, Other Computer Sciences Commons, Statistics and Probability Commons
Comments
Peer reviewed accepted manuscript.