From 933ddb130b4c1cfdb5ca7b5fd369f0053750b4bf Mon Sep 17 00:00:00 2001 From: jd-otero Date: Wed, 5 Feb 2025 23:08:25 +0100 Subject: [PATCH 1/2] Include design principles vignette --- _pkgdown.yml | 4 ++ vignettes/design_principles.Rmd | 73 +++++++++++++++++++++++++++++++++ 2 files changed, 77 insertions(+) create mode 100644 vignettes/design_principles.Rmd diff --git a/_pkgdown.yml b/_pkgdown.yml index e4b9513..58422d3 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -18,6 +18,10 @@ articles: navbar: Merge data contents: - merge_geo_demographic +- title: Design Principles + navbar: Design Principles + contents: + - design_principles reference: - title: Demographic Data diff --git a/vignettes/design_principles.Rmd b/vignettes/design_principles.Rmd new file mode 100644 index 0000000..b3add65 --- /dev/null +++ b/vignettes/design_principles.Rmd @@ -0,0 +1,73 @@ +--- +title: "Design Principles for {ColOpenData}" +output: + rmarkdown::html_vignette: + df_print: tibble +vignette: > + %\VignetteIndexEntry{Population Projections with ColOpenData} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true") +) +``` + +::: {style="text-align: justify;"} +This vignette outlines the design decisions that have been taken during the development of the ColOpenData R package, and provides some of the reasoning, and possible pros and cons of each decision. + +This document is primarily intended to be read by those interested in understanding the code within the package and for potential package contributors. +::: + +## Scope +::: {style="text-align: justify;"} +ColOpenData provides an easy access to Colombian Open Government Data from two main data sources: Departamento Administrativo Nacional de Estadística (DANE), and Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM). These include Demographic, Geospatial, Climate, and Population Projections' datasets. The package centralizes information an provides harmonized and cleaned datasets for easier data analysis. + +In addition, the package includes a series of documentation and dictionaries to help the user navigate the open datasets available. Further information regarding included documentation, naming conventions, and uses can be found here: [Documentation and Datasets](https://epiverse-trace.github.io/ColOpenData/articles/documentation_and_dictionaries.html) +::: + +## Output + +::: {style="text-align: justify;"} +The output for each function is a `data.frame` with the requested data. When downloading geospatial data, which includes maps, an `sf` object is created. +::: + +## Design Decisions + +::: {style="text-align: justify;"} +A manual cleaning process was necessary to address issues like inconsistent headers in the original datasets, while harmonization ensured alignment with tidy data standards and national naming conventions. To avoid repeating these steps at each download, and ensure stable and independent access to open data, the already processed datasets were stored on a private server for later redistribution. The datasets list and dictionaries are stored inside the package data, to avoid unnecessary downloads of frequently consulted datasets. + +Additional particularities include: + +- The datasets are only available in the original language (Spanish). However, dictionaries and dataset lists are available in English and Spanish. +- Geospatial datasets include very detailed and segregated structures (down to block level). To avoid extremely long download times, a simplified version was included for all the maps, allowing the users to select either of them according to their needs. +- Climate datasets are stored as they were acquired from the source, so filtering and aggregation occurs during each user's request. +- Dictionaries are only provided for geospatial data, since they were the only ones originally provided from the source. +::: + +## Dependencies + +::: {style="text-align: justify;"} +The package dependencies are: + +- `{checkmate}` +- `{config}` +- `{dplyr}` +- `{magrittr}` +- `{rlang}` +- `{sf}` +- `{stringdist}` +- `{tidyr}` +- `{utils}` + +::: + +## Additional Considerations + +::: {style="text-align: justify;"} +The datasets included in the package contain changes which may alter the structure, format, or content, meaning the data does not reflect the official dataset. The package is developed independently, with no endorsement or involvement from these institutions or any Colombian government body. The authors of ColOpenData are not liable for how users utilize the data, and users are responsible for any outcomes from their use or analysis of the data. +::: From e60ac17147ea34d354673247c40560c1b71b143c Mon Sep 17 00:00:00 2001 From: jd-otero Date: Wed, 5 Feb 2025 23:09:51 +0100 Subject: [PATCH 2/2] Include warning for heavy dataset (#100) Only needed for geospatial datasets --- R/download_geospatial.R | 3 +++ tests/testthat/_snaps/download_geospatial.md | 2 ++ 2 files changed, 5 insertions(+) diff --git a/R/download_geospatial.R b/R/download_geospatial.R index 71d3509..8a9f36e 100644 --- a/R/download_geospatial.R +++ b/R/download_geospatial.R @@ -54,6 +54,9 @@ download_geospatial <- function(spatial_level, simplified = TRUE, if (simplified) { dataset_path <- sub("\\.gpkg$", "_SIM.gpkg", dataset_path) } + + message("This might take a while") + geospatial_data <- sf::st_read(dataset_path, quiet = TRUE) geospatial_vars <- c("area", "latitud", "longitud") shape_vars <- c("shape_length", "shape_area") diff --git a/tests/testthat/_snaps/download_geospatial.md b/tests/testthat/_snaps/download_geospatial.md index 5dc7cf6..15f3ae0 100644 --- a/tests/testthat/_snaps/download_geospatial.md +++ b/tests/testthat/_snaps/download_geospatial.md @@ -4,6 +4,7 @@ download_geospatial(spatial_level = "department", simplified = TRUE, include_geom = TRUE, include_cnpv = FALSE) Message + This might take a while ColOpenData provides open data derived from Departamento Administrativo Nacional de Estadística (DANE), and Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM) but with modifications for @@ -51,6 +52,7 @@ download_geospatial(spatial_level = "dpto", simplified = TRUE, include_geom = FALSE, include_cnpv = TRUE) Message + This might take a while ColOpenData provides open data derived from Departamento Administrativo Nacional de Estadística (DANE), and Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM) but with modifications for