diff --git a/DESCRIPTION b/DESCRIPTION index bb6a373..ed43a27 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -64,7 +64,7 @@ Imports: lifecycle, httr2, desc -RoxygenNote: 7.3.2 +RoxygenNote: 7.3.3 Suggests: knitr, rmarkdown, diff --git a/NEWS.md b/NEWS.md index 9b5efb0..5882a1f 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,4 +1,9 @@ # QCkit v1.2.1 +## 2026-02-26 + * update `get_elevation` to handle unpredictable API responses for invalid lat/long values. + * update unit tests to pass using DataStore v8 API + * update unit tests for `summarize_qc_flags` to reflect changes made in the function + ## 2025-01-09 * `create_datastore_script` now has a new parameter, lib_type. If "R" is specified it will pull as much info as it can from the DESCRIPTION file of an R package and use that info to fill out the Core Bibliography tab on the reference landing page. diff --git a/R/datastore_interactions.R b/R/datastore_interactions.R index c7b3954..449d9a9 100644 --- a/R/datastore_interactions.R +++ b/R/datastore_interactions.R @@ -289,7 +289,11 @@ create_datastore_script <- function(owner, if (force == FALSE) { cat("Your file, ", crayon::blue$bold(file_name), ", has been uploaded to:\n", sep = "") - cat(ds_resource_url, "\n\n", sep = "") + + + #### ds_resource_url is not a good url; needs fixing + + cat(ds_resource_url, "\n\n", sep = "") } #add a web link: #release url: diff --git a/R/elevation.R b/R/elevation.R index 741e0fd..0063e7b 100644 --- a/R/elevation.R +++ b/R/elevation.R @@ -78,14 +78,15 @@ get_elevation <- function(df, if (req$status_code == 200) { gh_req_json <- suppressMessages(httr::content(req, "text")) # if something else went wrong - likely coordinates outside USA. - if (gh_req_json == "Invalid or missing input parameters."){ + if (gh_req_json == "Invalid or missing input parameters." | + grepl("Call failed", gh_req_json)){ if (force == FALSE){ cat("Invalid input. NAs generated. 
Are your coordinates inside the US?") } elev <- append(elev, NA) } else { # everything checks out, add elevation to df - elevation <- httr::content(req)$value + elevation <- as.numeric(httr::content(req)$value) elev <- append(elev, elevation) } } else { diff --git a/docs/404.html b/docs/404.html deleted file mode 100644 index 1c7926f..0000000 --- a/docs/404.html +++ /dev/null @@ -1,83 +0,0 @@ - - -
- - - - -Site built with pkgdown 2.1.1.
+Site built with pkgdown 2.2.0.
diff --git a/docs/LICENSE-text.html b/docs/LICENSE-text.html index 2482b59..8e98b91 100644 --- a/docs/LICENSE-text.html +++ b/docs/LICENSE-text.html @@ -50,7 +50,7 @@vignettes/articles/DRR_Purpose_and_Scope.Rmd
- DRR_Purpose_and_Scope.RmdThe Data Release Report (DRR) is aimed at fulfilling requirements and -expectations of Open Science at the National Park Service. This -includes:
-Broad adoption of open-data and open-by-default -practices.
A move in the scientific disciplines toward considering and -publishing data sets as independently-citable scientific works.
Routine assignment of digital object identifiers (DOIs) to -datasets to facilitate location, reuse, and citation of specific -data
Increased transparency and reproducibility in the processing and -analysis of data.
Establishment of peer reviewed “data journals” dedicated to -publishing data sets and associated documentation designed to facilitate -their reuse.
Expectation that science-based decisions are based on -peer-reviewed, reproducible, and open science by default.
Data Release Reports are designed to parallel external peer-reviewed -scientific journals dedicated to facilitate reuse of reproducible -scientific data, in recognition that the primary reason IMD data are -collected is to support science-based decisions.
-Note that publication in a Data Release Report Series (not mandated) -is distinct from requirements to document data collection, processing, -and quality evaluation (mandated; see below). The establishment of a -Data Release Report Series is intended to facilitate and encourage this -type of reporting in a standard format, and in a manner commensurate -with current scientific norms.
-Reproducibility. The degree to which scientific -information, modeling, and methods of analysis could be evaluated by an -independent third party to arrive at the same, or substantially similar, -conclusion as the original study or information, and that the scientific -assessment can be repeated to obtain similar results (Plesser 2017). A -study is reproducible if you can take the original data and the computer -code used to analyze the data and reproduce all of the numerical -findings from the study. This may initially sound like a trivial task -but experience has shown that it’s not always easy to achieve this -seemingly minimal standard (ASA 2017, Plesser 2017).
-Transparency. Full disclosure of the methods used to -obtain, process, analyze, and review scientific data and other -information products, the availability of the data that went into and -came out of the analysis, and the computer code used to conduct the -analysis. Documenting this information is crucial to ensure -reproducibility and requires, at minimum, the sharing of analytical data -sets, relevant metadata, analytical code, and related software.
-Fitness for Use. The utility of scientific -information (in this case a dataset) for its intended users and its -intended purposes. Agencies must review and communicate the fitness of a -dataset for its intended purpose, and should provide the public -sufficient documentation about each dataset to allow data users to -determine the fitness of the data for the purpose for which third -parties may consider using it.
-Decisions. The type of decisions that must be based -on publicly-available, reproducible, and peer-reviewed science has not -been defined. At a minimum it includes any influential decisions, but it -may also include any decisions subject to public review and comment.
-Descriptive Reporting. The policies listed above are -consistent in the requirement to provide documentation that describes -the methods used to collect, process, and evaluate science products, -including data. Note that this is distinct from (and in practice may -significantly differ from) prescriptive documents such as protocols, -procedures, and study plans. Descriptive reporting should cite or -describe relevant science planning documents, methods used, deviations, -and mitigations. In total, descriptive reporting provides a clear “line -of sight” on precisely how data were collected, processed, and -evaluated. Although deviations may warrant revisions to prescriptive -documents, changes in prescriptive documents after the fact do not meet -reproducibility and transparency requirements.
-DO11B-a, -DO -11B-b, OMB -M-05-03 (Peer review and information quality):
-Scientific information must be appropriately reviewed prior to -use in decision-making, regulatory processes, or dissemination to the -public, regardless of media.
As per OMB M-05-03 “scientific information” includes factual -inputs, data, models, analyses, technical information, or scientific -assessments related to such disciplines as the behavioral and social -sciences, public health and medical sciences, life and earth sciences, -engineering, or physical sciences.
Methods for producing information will be made transparent, to -the maximum extent practicable, through accurate documentation, use of -appropriate review, and verification of information quality.
OMB -M-19-15 (Updates to Implementing the Information -Quality Act):
-Federal agencies must collect, use, and disseminate information -that is fit for its intended purpose.
Agencies must conduct pre-dissemination review of quality [of -scientific information] based on the likely use of that information. -Quality encompasses utility, integrity, and objectivity, defined as -follows: a) Utility – utility for its intended users and its intended -purposes, b) Integrity – refers to security, and c) Objectivity – -accurate, reliable, and unbiased as a matter of presentation and -substance.
Agencies should provide the public with sufficient documentation -about each dataset released to allow data users to determine the fitness -of the data for the purpose for which third parties may consider using -it. Potential users must be provided with sufficient -information to understand… the data’s strengths, weaknesses, analytical -limitations, security requirements, and processing options.
Reproducibility requirements for Influential Information. Note -that because this may not be determined at the time of collection, -processing, or dissemination this should be the default for NPS -scientific activities:
-Analyses must be disseminated with sufficient descriptions of -data and methods to allow them to be reproduced by qualified third -parties who may want to test the sensitivity of agency analyses. This is -a higher standard than simply documenting the characteristics of the -underlying data, which is required for all information.
Computer code used to process data should be made available to -the public for further analysis. In the context of results generated, -for example, a statistical model or machine augmented learning and -decision support, reproducibility requires, at a minimum transparency -about the specific methods, design parameters, equations or algorithms, -parameters, and assumptions used.
Reports, data, and computer code used, developed, or cited in the -analysis and reporting of findings must be made publicly available -except where prohibited by law.
-Multiple policy and guidance documents require the use of best -available science in decision-making at the National Park Service (NPS). -Additional requirements include:
-SO -3369 (Promoting Open Science):
-DO -11B (Ensuring Objectivity, Utility, and Integrity -of Information Used and Disseminated by the National Park -Service):
-NPS-75 -(Inventory and Monitoring Guidelines):
-An annual summary report documenting the condition of park -resources should be developed as part of the annual revision of the -parks Resource Management Plan.
An annual report provides a mechanism for reviewing and making -recommendations for revisions in the [Protocol/SOPs].
[Inventory] data obtained should be archived in park records and, -when appropriate, a report should be written summarizing -findings.
Reporting requirements as per IMD directive
IMD -Reporting and Analysis Guidance
-Because all of the data the NPS IMD collects is intended for use in -supporting science-based decisions as per our program’s five goals, and -is intended for use in planning (the decisions of which are subject to -public comment as per NEPA requirements), this means that by -default:
-All analytical work we do should be reproducible to the extent -possible. Analytical work includes both statistical analysis and -reporting of data as well as quality control procedures where data are -tested against quality standards and qualified or corrected as -appropriate.
Full reproducibility may not be possible in all cases, -particularly where analytical methods involve subject matter expertise -to make informed judgments on how to proceed with analyses. In such -cases, decisions should be documented to ensure transparency.
All IMD data should be published with supporting documentation to -allow for reproduction of results.
All IMD data should be evaluated to determine whether they are -suitable for their intended use.
All IMD data should be published with information fully -describing how data were collected, processed, and evaluated.
All data should be published in open formats that support the FAIR principles -(Findable, Accessible, Interoperable, and Reusable).
(for the NPS Inventory & Monitoring Program)
-Any project that involves the collection of scientific data for use -in supporting decisions to be made by NPS personnel. General study data -may or may not be collected based on documented or peer-reviewed study -plans or defined quality standards, but are in most cases purpose-driven -and resultant information should be evaluated for suitability -for, and prior to, their use in decision support. These data may be reused -for secondary purposes including similar decisions at other locations or -times and/or portions of general study data may be reused or contribute -to other scientific work (observations from a deer browsing study may -contribute to an inventory or may be used as ancillary data to explain -monitoring observations).
-
-Workflow for data collection, processing, dissemination, and use for -general studies. Teal-colored boxes are subject to reproducibility -requirements. -
-Vital signs monitoring data are collected by IMD and park staff to -address specific monitoring objectives following methods designed to -ensure long-term comparability of data. Procedures are established to -ensure that data quality standards are maintained in perpetuity. -However, because monitoring data are collected over long periods of time -in dynamic systems, the methods employed may differ from those -prescribed in monitoring protocols, procedures, or sampling plans, and -any deviations (and resultant mitigations to the data) must be -documented. Data should be evaluated to ensure that they meet prescribed -standards and are suitable for analyses designed to test whether -monitoring objectives have been met. Monitoring data may be reused for -secondary purposes including synthesis reports and condition -assessments, and portions of monitoring data may contribute to -inventories.
-
-Workflow for data collection, processing, dissemination, and use for -vital sign monitoring efforts. Teal-colored boxes are subject to -reproducibility requirements. -
-Inventory study data are similar to general study data in that they -are time- and area-specific efforts designed to answer specific -management needs as well as broader inventory objectives outlined in -project-specific study plans and inventory science plans. Inventory -studies typically follow well-documented data collection methods or -procedures, and resultant data should be evaluated for whether they are -suitable for use in supporting study-specific and broader -inventory-level objectives. Inventory study data are expected to be -reused to meet broader inventory level goals, but may also support other -scientific work and decision support.
-
-Workflow for data collection, processing, dissemination, and use for -inventory studies. Teal-colored boxes are subject to reproducibility -requirements. -
-American Statistical Association (ASA). 2017. Recommendations to -funding agencies for supporting reproducible research. https://www.amstat.org/asa/files/pdfs/POL-ReproducibleResearchRecommendations.pdf.
-Plesser, H. E. 2017. Reproducibility vs. Replicability: A brief -history of a confused terminology. Front. Neuroinform. 11:76. https://doi.org/10.3389/fninf.2017.00076.
-Purpose and Scope of Data -Release Reports
-DRRs are created by the National Park Service and provide detailed -descriptions of valuable research datasets, including the methods used -to collect the data and technical analyses supporting the quality of the -measurements. Data Release Reports focus on helping others reuse data -rather than presenting results, testing hypotheses, or presenting new -interpretations and in-depth analyses.
-Opening a new NPS DRR Template will write a folder to the current -working directory that contains an rmarkdown (.rmd) file that is the -DRR Template, a references.bib file for bibtex references, a -national-park-service-DRR.csl file for formatting references, and a -sub-folder, BICY_Example, with an example data package that can be used -to knit an example DRR to .docx.
-Upon submission for publication, the .docx file will be ingested by -EXstyles, converted to an .xml file and fully formatted according to NPS -branding and in compliance with 508 accessibility requirements upon -final publication. The goal of this process is to relieve data -producers, managers, and scientists from the burden of formatting and -allow them to focus primarily on content. Consequently, the .docx -generated for the publication process may not be visually appealing. The -content, however, should focus on the production, quality, and utility -of NPS data packages.
-To start your DRR you will need all of your data in flat -.csv files. All quality assurance, quality control, and quality -flagging should be completed. Ideally you have already created or are in -the process of creating a data package (see the documentation -associated with the R package EMLeditor -for data package creation). All of the .csv files you want to describe -in the DRR should be in a single folder with no additional .csv -files (other files such as .txt and .xml will be ignored). This -folder can be the same folder you used/are using to create a data -package.
Using Rstudio, open an R project (Select: File -> New Project…) in the same folder as your .csv files. If you already -have an R project (.Rproj) initiated from creating a data package, you -can use that same R project.
Install, update (if necessary), and load the QCkit R -package. QCkit can be installed either as a component of the NPSdataverse -or on its own. The benefit of installing the entire NPSdataverse is -that upon loading the NPSdataverse, you will automatically be informed -if there are any updates to QCkit (or any of the constituent packages). -The downside to installing and loading the NPSdataverse is that the -first time you install it the process can be lengthy (there are many -dependencies) and you may hit the GitHub.com API rate limit. Either -installation is from GitHub.com and requires the devtools package to -install.
-# Install the devtools package, if you don't already have it:
-install.packages("devtools")
-# Install and load QCkit via NPSdataverse:
-devtools::install_github("nationalparkservice/NPSdataverse")
-library(NPSdataverse)
-# Alternatively, install and load just QCkit:
-devtools::install_github("nationalparkservice/QCkit")
-library(QCkit)

After selecting “OK” two things will happen: First, the DRR -Template file will open up. It is called “Untitled.Rmd” by default. -Second, a new folder will be created called “Untitled” (unless you opted -to change the default “Name:” in the “New R Markdown” pop up, in which case these -will have whatever name you gave them).
Edit the DRR Template to reflect the data you -would like to describe, according to the instructions in the “Using the DRR Template” -guide.
“knit” the .rmd file to Word when you are done -editing it. Submit the resulting .docx file for publication (via a -yet-to-be-determined process).
Knit your own example DRR: Assuming you left the -“Name:” as the default “Untitled”, you should be able to knit the DRR -template into an example .docx that could be submitted for publication. -If you opted to change the Name, you will need to update the file -paths before knitting.
-vignettes/articles/Using-the-DRR-Template.Rmd
- Using-the-DRR-Template.RmdData Release Reports (DRRs) are created by the National Park Service -and provide detailed descriptions of valuable research datasets, -including the methods used to collect the data and technical analyses -supporting the quality of the measurements. Data Release Reports focus -on helping others reuse data, rather than presenting results, testing -hypotheses, or presenting new interpretations, methods or in-depth -analyses.
-DRRs are intended to document the processing of fully-Quality-Assured -data to their final (Quality Controlled) form in a reproducible and -transparent manner. DRRs document the data collection methods and -quality standards used to prepare and review data prior to release. DRRs -present the quality of resultant data in the context of fitness for -their intended use.
-Each DRR cites source and resultant data packages that are published -concurrently and cross-referenced. Associated data packages are made -publicly available with the exception of data that must be protected -from release as per NPS and park-specific policies.
-Data packages that are published concurrently with DRRs are intended -to be independently citable scientific works that can serve as the basis -for subsequent analysis and reporting by NPS or third parties.
-To set up a project, follow the instructions in the Article, “Starting a DRR”.
-The DRR Template takes advantage of rmarkdown code chunks to help -generate a reproducible report. The template includes all of the -required code chunks. Some of these code chunks need to be edited to -generate the report; others should not be edited. Below is a description -of each code chunk in the DRR Template and instructions on how to (and -when not to) edit them.
-In addition to the report outline and a description of content for -each section, the template includes four standard code chunks.
-YAML Header:
-The YAML header helps format the DRR. You should not need to edit any -of the YAML header.
-R code chunks:
-user_edited_parameters. A series of parameters that
-are used in the creation of the DRR and may be re-used in metadata and
-associated data package construction. You will need to edit these
-parameters for each DRR.
title. The title of the Data Release Report.report_number. This is optional, and should
-only be included if publishing in the semi-official DRR series.
-Set to NULL if there is no reportNumber.drr_ds_ref_id. This is the DataStore reference ID for
-the report. It should be 7 digits long.authorNames. A list of the author’s names. If an author
-has multiple institutional affiliations, the author should be listed
-once for each affiliation.author_affiliations. A list of the author’s
-affiliations. The order of author affiliations must match the order of
-the authors in the authorNames list. Note that the entirety
-of each affiliation is enclosed in a single set of quotations. Do not
-worry about indentation or word wrapping. If two authors have the same
-affiliation, list the affiliation twice.author_orcid. A list of ORCID iDs for each author in
-the format “(xxxx-xxxx-xxxx-xxxx)”. If an author does not have an ORCID
-iD, specify NA (no quotes). The order of ORCID iDs (and NAs) must
-correspond to the order of authors in the authorNames list.
-If an author was listed more than once in the authorNames
-list, the corresponding ORCID (or NA) should also be listed more than
-once. Future iterations of the DRR Template will pull ORCID iDs from
-metadata and eventually from Active Directory. See ORCID for more information about ORCID
-iDs or to register an ORCID iD.drr_abstract. The abstract for the DRR (which may be
-distinct from the data package abstract). Pay careful attention to
-non-standard characters, line breaks, carriage returns, and
-curly-quotes. You may find it useful to write the abstract in NotePad or
-some other text editor and NOT a word processor (such as Microsoft
-Word). Indicate line breaks with and a space between paragraphs - should
-you want them - using . The Abstract should succinctly describe the
-study, the assay(s) performed, the resulting data, and their reuse
-potential, but should not make any claims regarding new scientific
-findings. No references are allowed in this section. A good suggested
-length for abstracts is less than 250 words.data_package_ref_id. DataStore reference ID for the
-data package associated with this report. You must have at least one
-data package. Eventually, we will automate importing much of this
-information from metadata and include the ability to describe multiple
-data packages in a single DRR.data_P_package_T_title. The title of the data package.
-Must match the title on DataStore (and metadata).data_package_description. A short title/subtitle or
-short description for the data package. Must match the data package
-metadata.data_package_DOI_doi. Auto-generated, no need to edit
-or update. This is the data package DOI. It is based on the DataStore
-reference number.data_package_file_names. List the file names in your
-data package. Do NOT include metadata files. For example, include
-“my_data.csv” but do NOT include “my_metadata.xml”. Note: Because data
-packages contain only .csv and .xml files, all data files should be
-.csv.data_package_file_sizes. List the approximate size of
-each data file. Make sure the order of the file sizes corresponds to the
-order of file names in dataPackage_fileNames.data_package_file_descript. A short description of the
-corresponding data file that helps distinguish it from other data files.
-A good guideline is 10 words or less. This will be used in a
-summary table, so brevity is a priority. If you have already created
-metadata for your data package in EML format, this should be the same
-text as found in the “entityDescription” element for each data
-file.setup_do_not_edit. Most users will not need to edit
-this code chunk. There is one code snippet for loading packages; the
-r_packages section is a suite of packages that are used to
-assist with reproducible reporting. You may not need these for your
-report, but we have included them as part of the base recommended
-packages. If you plan to perform your QC as part of the DRR construction
-process, you can add a second code snippet to import necessary packages
-for your QC process here.
title_do_not_edit. These parameters are
-auto-generated based on either the EML you supplied (when that becomes
-an option) or the information you’ve already supplied under
-“user-edited-parameters”. You really should not need to edit these
-parameters.
authors_do_not_edit. There is no need to edit this
-chunk. This writes the author names, ORCID iDs, and affiliations to the
-.docx document based on information supplied in
-user-edited-parameters.
file_table. Do not edit. Generates a table of file
-names, sizes, and descriptions in the data package being described by
-the DRR.
data_acceptance_criteria. If you did not use the
-standard data quality assurance flags (A = accepted, AE = Accepted
-(estimated), R = Rejected, P = Provisional), set this code chunk to
-eval = FALSE and generate your own custom code chunk to
-summarize your custom data flagging procedures. If you did use the
-standard QA flags, indicate which fields in your data files contain
-flagged data. Assuming your column names are unique, you do not need to
-specify which file the columns are in. If your column names are not
-unique, you will need to design your own summary table. Briefly describe
-the acceptance criteria for each data quality flagged column in the same
-order as you specified the columns.
data_column_flagging. Uses the input from
-data_acceptance_criteria to generate a table summarizing
-the data quality flagging in the data package. If you set
-data_acceptance_criteria parameter
-eval = FALSE, also set data_column_flagging
-parameter to eval = FALSE. Update the first line (which in
-the example points to BICY_Example) to point to the directory where your
-data are.
data_package_flagging. If you used standard QA flags
-in your data package, leave the parameter eval = TRUE. If
-you did not use standard QA flags, set eval = FALSE and
-design your own custom summary table to handle your custom flagging
-protocols. If you set eval = TRUE, update the file path to
-point to your data files (in the example, the path points to the
-directory “BICY_Example”).
figure1. This is an example code chunk for inserting
-figures. Edit and re-deploy as necessary to include as many or as few
-figures as you require.
listing. Appendix A, by default is the code listing.
-This will generate all code used in generating the report and data
-packages. In most cases, code listing is not required. If all QA/QC
-processes and data manipulations were performed elsewhere, you should
-cite that code (in the methods and references) and leave the “listing”
-code chunk with the default settings of eval=FALSE and echo=FALSE. If
-you have developed custom scripts, you can add those to DataStore with
-the reference “Script” and cite them in the DRR.
session_info is the information about the versions
-of R and packages used in generating the report. In most cases, you do
-not need to report session info (leave the session-info code chunk
-parameters in their default state: eval=FALSE). Session and version
-information is only necessary if you have set the “Listing” code chunk
-in appendix A to eval=TRUE. In that case, change the “session info” code
-chunk parameters to eval=TRUE.
The following text in the body of the DRR template will need to be -edited to customize it to each data package.
-This is a required section and consists of two subheadings:
Data inputs - an optional subsection used to -describe datasets that the data package is based on if it is a -re-analysis, reorganization, or re-integration of previously existing -data sets.
Summary of datasets created - this is a required -section used to explain each data record associated with the work (for -instance, a data package), including the DOI indicating where this -information is stored. It should also provide an overview of the data -files and their formats. Each external data record should be -cited.
Sample text is included that uses r code to incorporate previously -specified parameters such as the data package title, file names, and -DOI.
-Code for a sample table summarizing the contents of the data -package (except the metadata) is provided.
-This is a required section, and the text includes multiple suggested -text elements and code for an example table defining data flagging -codes. Near future development here will incorporate additional optional -tables to summarize the data quality based on the flags in the data -sets.
-This is a required section that should contain brief instructions to -assist other researchers with reuse of the data. This may include -discussion of software packages (with appropriate citations) that are -suitable for analysing the assay data files, suggested downstream -processing steps (e.g. normalization, etc.), or tips for integrating or -comparing the data records with other datasets. Authors are encouraged -to provide code, programs or data-processing workflows if they may help -others understand or use the data.
-This is a required section that cites previous methods used but -should also be detailed enough in describing data production including -the experimental design, data acquisition assays, and any computational -processing (e.g. normalization, QA, QC) such that others can understand -the methods without referring to associated publications.
-Optional sub-sections within the methods include:
-This required section includes full bibliographic references for each -paper, chapter, book, data package, dataset, protocol, etc cited within -the DRR. Each item in the Reference section should be specifically cited -in-text as well.
-To automate citations, add the citation in bibtex format to the file -“references.bib”. You can manually copy and paste the bibtex for each -reference in, or you can search for it from within Rstudio. From within -Rstudio, make sure you are editing the DRR rmarkdown template using the -“Visual” view (as opposed to “Source”). From the “Insert” drop-down -menu, select “@ Citation…” (shortcut: Ctrl-Shift-F8). This will open a -Graphical User Interface (GUI) tool where you can view all the citations -in your references.bib file as well as search multiple databases for -references, automatically insert the bibtex for the reference into your -references.bib file (and customize the unique identifier if you’d like) -and insert the in-text citation into the DRR template.
-
-Adding Citations - Source vs. Visual editing of the Template and how to -access the citation manager. -
-
-Adding Citations - Using the citation manager. -
-Once a reference is in your references.bib file, using the Visual -view of the template you can simply type the ‘@’ symbol and select which -reference to insert in the text.
-If you need to edit how the citation is displayed after inserting it -into the text, switch back to the “Source” view. Each bibtex citation -should start with a unique identifier; the example reference in the -supplied references.bib file has the unique identifier “@article{Scott1994,”. Using the “Source” view in -Rstudio, insert the reference in your text, by combining the “at” symbol -with the portion of the unique identifier after the curly bracket: @Scott1994 .
-| Syntax | -Result | -
|---|---|
-@Scott1994 concludes that … |
-Scott et al., 1994 concludes that … | -
-@Scott1994[p.33] concludes that … |
-Scott (1994, p.33) concludes that … | -
… end of sentence [@Scott1994]. |
-… end of sentence (Scott et al., 1994). | -
… end of sentence [see @Scott1994,p.33]. |
-… end of sentence (see Scott et al. 1994,p.33). | -
delineate multiple authors with colon:
-[@Scott1994; @aberdeen1958]
- |
-delineate multiple authors with colon: (Scott et al., 1994; -Aberdeen, 1958) | -
| Scott et al. conclude that …. [-@Scott1994] - | -Scott et al. conclude that . . . (1994) | -
The full citation, properly formatted, will be included in a -“References” section at the end of the rendered MS Word document. . . -though it is also worth visually inspecting the .docx for citation -completeness and formatting.
-If you would like to format your citations manually, please feel free -to do so. Make sure to look at the References section of the DRR -Template for how to properly format each citation type.
-There are numerous examples of proper formatting for each of these. -Future versions of the DRR will enable automatic reference formatting -given a correctly formatted bibtex file with the references (.bib).
-Figures should be inserted using code chunks in all cases so that -figure settings can be set in the chunk header. The chunk header should -at a minimum set the fig.align parameter to “center” and the specify the -figure caption (fig.cap parameter). Inserting figures this way will -ensure that the caption is properly formatted and it will apply copy the -caption to the figure’s “alt text” tag, making it 508-compliant.
-For example:
-```{r fig2, echo=FALSE, out.width="70%", fig.align="center", fig.cap="Example general workflow to incude in the methods section."}
-knitr::include_graphics("ProcessingWorkflow.png")
-```Results in:
-Tables should be created using the kable function. Specifying the -caption in the kable function call (as opposed to inline markdown text) -will ensure that the caption is appropriately formatted.
-For example:
-```{r Table2, echo=FALSE}
-c1<-c("Protocol1","Protocol2","Protocol3")
-c2<-c("Park Unit 1","Park Unit 2","Park Unit 3")
-c3<-c("Site 1","Site 2","Site 3")
-c4<-c("Date 1","Date 2","Date 3")
-c5<-c("GEOXXXXX","GEOXXXXX","GEOXXXXX")
-Table2<-data.frame(c1,c2,c3,c4,c5)
-
-kable(Table2,
- col.names=c("Subjects","Park Units","Locations","Sampling Dates","Data"),
- caption="**Table 1.** Monitoring study example Data Records table.") %>%
- kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width=F)
-```Results in:
-| -Subjects - | --Park Units - | --Locations - | --Sampling Dates - | --Data - | -
|---|---|---|---|---|
| -Protocol1 - | --Park Unit 1 - | --Site 1 - | --Date 1 - | --GEOXXXXX - | -
| -Protocol2 - | --Park Unit 2 - | --Site 2 - | --Date 2 - | --GEOXXXXX - | -
| -Protocol3 - | --Park Unit 3 - | --Site 3 - | --Date 3 - | --GEOXXXXX - | -
You can generate a .docx document manually without ussing the DRR -Template. The .docx, if properly formatted, can be ingested by the -publication software. Assuming the manually created .docx also has all -the required components and information, it can pass the review process -and be published. The final product should be indistinguishable from one -generated using the DRR Template.
-Manually generating a .docx for DRR publication is not suggested and -not supported.
-Because data release reports and associated data packages are -cross-referential, report numbers are typically assigned early in data -processing and quality evaluation.
-DataStore Reference Numbers. When developing a -report and data packages, DataStore references should be created as -early in the process as practicable. While the report and data packages -are in development, these should not be activated.
Report Numbers. If you are planning to publish a -Data Release Report with an official DRR number, please contact the IMD -Deputy Chief with the DataStore reference number associated with the -DRR.
Persistent Identifiers. Digital object -identifiers (DOIs) will be assigned to all DRRs and concurrently -published data packages. DOIs will resolve to a DataStore Reference; -DOIs are reserved when a draft reference is initiated in DataStore. They -are not activated until the publication process, including relevant -review, is complete.
DRR DOIs have the format: https://doi.org/10.36967/xxxxxxx
-Data package DOIs have the format: https://doi.org/10.57830/xxxxxxx
-Where the “xxxxxx” is the 7-digit DataStore reference number.
-Under no circumstances should reports and associated data packages or -metadata published in the DRR series contain disclaimers or text that -suggests that the work does not meet scientific integrity or information -quality standards of the National Park Service. The following -disclaimers are suitable for use, depending on whether the data are -provisional or final (or approved or certified).
---For approved & published data sets: “Unless -otherwise stated, all data, metadata and related materials are -considered to satisfy the quality standards relative to the purpose for -which the data were collected. Although these data and associated -metadata have been reviewed for accuracy and completeness and approved -for release by the National Park Service Inventory and Monitoring -Division, no warranty expressed or implied is made regarding the display -or utility of the data for other purposes, nor on all computer systems, -nor shall the act of distribution constitute any such warranty.”
-
--For provisional data: “The data you have secured -from the National Park Service (NPS) database identified as [database -name] have not received approval for release by the NPS Inventory and -Monitoring Division, and as such are provisional and subject to -revision. The data are released on the condition that neither the NPS -nor the U.S. Government shall be held liable for any damages resulting -from its authorized or unauthorized use.”
-
NEWS.md
- create_datastore_script now has a new parameter, lib_type. If “R” is specified it will pull as much info as it can from the DESCRIPTION file of an R package and use that info to fill out the Core Bibliography tab on the reference landing page.create_datastore_script function. Defaults to TRUE.2025-03-12 * Updated license to MIT which is JOSS, NPS, and R compatible!
-2025-03-10 * Updated license to OSI-approved “Zero-Clause BSD” in support of JOSS submission.
-2025-03-06 * Begin development of unit_codes_to_names function to translate NPS unit codes into full unit (park) names * Add basic unit test for unit_codes_to_names
2025-02-25 * Updated CONTRIBUTING.md.
2025-02-21 * Add CONTRIBUTING.md file
2024-07-16 * Added experimental function document_missing_values(), which searches a file for multiple missing value codes, replaces them all with NA, and generates a new column with the missing value codes so that they can be properly documented in EML. This is a work-around for the fact that there is currently not a good way to get multiple missing value codes in a single column via EMLassemblyline. This function is still under development; expect substantial changes an improvements up to and including removing the function entirely.
2024-07-09 * Added function get_user_email(), which accesses NPS active directory via a powershell function to return the user’s email address. Probably won’t work for non-NPS users and probably won’t work for non-windows users. * Updated rest API from legacy v6 to current v7.
2024-06-28 * Updated get_park_polygon() to use the new API (had been using a legacy API). Added documentation to specify that the function is getting the convexhull for the park, which may not work particularly well for some parks. 2024-06-27 * bug fixes for generate_ll_from_utm() * add function remove_empty_tables() (and associated unit tests) * update documentation for replace blanks() to indicate it can replace blanks with more than just NA
2024-05-08 * Updated the replace_blanks() function to accept any missing value code a user inputs (but it still defaults to NA). 2024-04-18 * Added the function generate_ll_from_utm() which supersedes convert_utm_to_ll() and improves upon it in several ways, included accepting a column of UTMs and also returns a column of CRS along with the decimal degrees latitude and longitude. 2024-04-17 * Major updates to the DRR template including: using snake case instead of camel case for variables; updating Table 3 to only display filenames only when there are multiple files, fixed multiple issues with footnotes, added citations to NPSdataverse packages, added a section that prints the R code needed to download the data package and load it in to R. * Updated the DRR documentation to account for new variable names.
2024-03-07 * Update error warning in check_te() to not reference VPN since NPS no longer uses VPN. * add private function .get_unit_bondaries(): hits ArcGIS API to pull more precise park unit boundaries than get_park_polgyon() * add validate_coord_list() function that takes advantage of improved precision of .get_unit_boundaries() and is vectorized, enabling users to input multiple coordinate combinations and park units directly from a data frame.
2024-02-09 * This version adds the DRR template, example files, and associated documentation to the QCkit package. * Bugfix in get_custom_flag(): it was counting both A (accepted) and AE (Accepted, estimated) as Accepted. Fixed the regex such that it Accepted will include all cells that start with A followed by nothing or by any character except AE such that flags can have explanation codes added to them (e.g. A_jenkins if “Jenkins” flagged the data as accepted)
2024-01-23 * Maintenance on get_custom_flag() to align with updated DRR requirements * Added function replace_blanks() to ingest a directory of .csvs and write them back out to .csv (overwriting the original files) with blanks converted to NA (except if a file has NO data - then it remains blank and needs to be dealt with manually)
2023-11-20 * Added the function create_datastore_scipt(), which given a username and repo for GitHub will generate a draft Script Reference on DataStore based on the information found in the latest Release on GitHub.
24 April 2023 * fixed bug in get_custom_flags().
17 April 2023
-get_elevation() new function for getting elevation from GPS coordinates via USGS API.21 March 2023
-order_cols new function for ordering columns added16 March 2023
-get_taxon_rank() which takes a column of scientific names and generates a new column with the most specific scientific name rank listed. It does this purely based on recognizing patterns in the scientific naming scheme and not by matching a list of known genera, families, etc.get_taxon_rank() and te_check() into a single file, taxonomy.R.te_check() in favor of check_te()
-DC_col_check() in favor of check_dc_cols()
-utm_to_ll() in favor of convert_utm_to_ll()
-long2UTM() in favor of convert_long_to_utm()
-28 February 2023
-te_check() bug fix - exact column name filtering allows for multiple columns with similar names in the input data column. Improved documentation for transparency.23 February 2023
-te_check(). It now supports searching multiple park units.22 February 2023
-te_check(). Now prints the source of the federal match list data and the date it was accessed to the console. Made the output format prettier. Added an “expansion” option to the function. Defaults to expansion = FALSE, which checks for exact matches between the scientific binomial supplied by the user and the full scientific binomial in the matchlist. When expansion = TRUE, the genera in the data supplied will be checked against the matchlist and all species from a given genera will be returned, regardless of whether a given species is actually in the supplied data set. A new column “InData” will tell the user whether a given species is actually in their data or has been expanded to.02 February 2023
-te_check() that was causing the function return species that were not threatened or endangered. The function now returns a tibble containing all species that are threatened, endangered, or considered for listing, specifies the status code of each species, and then give a brief explanation of the federal endangered species act status code returned.get_dp_flags(), get_df_flags(), and get_dc_flags in favor of get_custom_flags(). The new get_custom_flags() function returns 1-3 data frames, depending on user input that contain the output of the 3 previous functions. It also allows the user to specify additional non-flagged columns to be included in the QC summary.
-get_custom_flags() as experimental.get_dp_flags() returns counts of each data flag (A, AE, R, P) across the whole data package (as well as all cells in the data package).get_df_flags() returns counts of data flags within each data file of the data package (as well as counts for all cells within the data package).get_dc_flags() returns the name of each flagging column within each data package and the count of each flag within each column as well as the total number of cells across all the data flagging columns.force option that defaults to force = FALSE and prints the results to the screen. setting force = TRUE will suppress the on-screen output.lifecycle::badge("deprecated")
DC_col_check() was deprecated in favor of check_dc_cols() to enforce consistency in function naming throughout the package and to be consistent with tidyverse style guides.
DC_col_check checks to see if the column names in your dataframe match the standardized simple Darwin Core names established by the Taxonomic Databases Working Group
-The function returns a list of the column names you should fix (not fitting with simple Darwin Core terms, custom name formatting, data quality flagging formatting). Additionally, a small summary table is printed with the counts of the columns falling under each category (DarwinCore, Custom, DQ, Fix Me).
A dataframe is created with all the simple DarwinCore terms, drawn from Darwin Core reference guide: https://dwc.tdwg.org/terms/ last updated 07-15-2021. We have chosen to align ourselves mostly with the simple Darwin Core rules: https://dwc.tdwg.org/simple/. The function runs through each of the column names in your working dataframe to see if they match 1. A standard simple DarwinCore name 2. A name with a pattern of strings matching "custom_", indicating a custom made column or 3. A name with a pattern of strings matching "_DQ", indicating a data quality flag. If the column name does not fit within any of the three categories, a "Fix me" statement is printed alongside the column name. The function then counts all of the names fitting within each category and prints a summary table.
-R/QCkit-package.R
- QCkit-package.RdThis package contains a set of useful functions for data munging, quality control, and data flagging. Functions are contributed by multiple U.S. National Park Service staff, contractors, partners and others. These functions will likely be most useful for quality control of NPS data but may have utility beyond their intended functions.
-Maintainer: Robert Baker robert_baker@nps.gov (ORCID)
-Authors:
Judd Patterson judd_patterson@nps.gov (ORCID)
Joe DeVivo
Issac Quevedo (ORCID)
Sarah Wright (ORCID)
Other contributors:
check_dc_cols() checks to see if the column names in your dataframe match the standardized simple Darwin Core names established by the Taxonomic Databases Working Group
The function returns a list of the column names you should fix (not fitting with simple Darwin Core terms, custom name formatting, data quality flagging formatting). Additionally, a small summary table is printed with the counts of the columns falling under each category (DarwinCore, Custom, DQ, Fix Me).
A dataframe is created with all the simple DarwinCore terms, drawn from Darwin Core reference guide: https://dwc.tdwg.org/terms/ last updated 07-15-2021. We have chosen to align ourselves mostly with the simple Darwin Core rules: https://dwc.tdwg.org/simple/. The function runs through each of the column names in your working dataframe to see if they match 1. A standard simple DarwinCore name 2. A name with a pattern of strings matching "custom_", indicating a custom made column or 3. A name with a pattern of strings matching "_DQ", indicating a data quality flag. If the column name does not fit within any of the three categories, a "Fix me" statement is printed alongside the column name. The function then counts all of the names fitting within each category and prints a summary table.
-check_te() generates a list of species you should consider removing from your dataset before making it public by matching the scientific names within your data set to the Federal Conservation List. check_te() should be considered a helpful tool for identifying federally listed endangered and threatened species in your data. Each National Park has a park-specific Protected Data Memo that outlines which data should be restricted. Threatened and endangered species are often - although not always - listed on these Memos. Additional species (from state conservation lists) or non-threatened and non-endangered species of concern or other biological or non-biological resources may be listed on Memos. Consult the relevant park-specific Protected Data Memo prior to making decisions on restricting or releasing data.
The name of your data frame containing species observations
The name of the column within your data frame containing the scientific names of the species (genus and specific epithet).
A four letter park code. Or a list of park codes.
Logical. Defaults to FALSE. The default setting will return only exact matches between your the scientific binomial (genera and specific epithet) in your data set and the federal match list. Setting expansion = TRUE will expand the list of matches to return all species (and subspecies) that from the match list that match any genera listed in your data set, regardless of whether a given species is actually in your data set. An additional column indicating whether the species returned is in your data set ("In your Data") or has been expanded to ("Expansion") is generated.
The function returns a (modified) data frame with the names of all the species that fall under the federal conservation list. The resulting data frame may have multiple instances of a given species if it is listed in multiple parks (park codes for each listing are supplied). Technically it is a huxtable, but it should function identically to a data frame for downstream purposes.
-Define your species data set name, column name with the scientific names of your species, and your four letter park code.
-The check_te() function downloads the Federal Conservation list using the IRMA odata API service and matches this species list to the list of scientific names in your data frame. Keep in mind that this is a Federal list, not a state list. Changes in taxa names may also cause some species to be missed. Because the odata API service is not publicly available, you must be logged in to the NPS VPN or in the office to use this function.
For the default, expansion = FALSE, the function will perform an exact match between the taxa in your scientificName column and the federal Conservation List and then filter the results to keep only species that are listed as endangered, threatened, or considered for listing. If your scientificName column contains information other than the binomial (genus and species), no matches will be returned. For instance, if you have an Order or just a genus listed, these will not be matched to the Federal Conservation List.
-If you set expansion = TRUE, the function will truncate each item in your scientificName column to the first word in an attempt to extract a genus name. If you only have genera listed, these will be retained. If you have have higher-order taxa listed such as Family, Order, or Phyla again the first word will be retained. This first word (typically a genus) will be matched to just the generic name of species from the Federal Conservation List. All matches, regardless of listing status, are retained. The result is that for a given species in your scientificName column, all species within that genus that are on the Federal Conservation List will be returned (along with their federal conservation listing codes and a column indicating whether the species is actually in your data or is part of the expanded search).
-if (FALSE) { # \dontrun{
-#for individual parks:
-check_te(x = my_species_dataframe, species_col = "scientificName", park_code = "BICY")
-list<-check_te(data, "scientificName", "ROMO", expansion=TRUE)
-# for a list of parks:
-park_code<-c("ROMO", "YELL", "SAGU")
-list<-check_te(data, "scientificName", park_code, expansion=TRUE)
-} # }
-
-R/dates_and_times.R
- convert_datetime_format.RdConvert EML date/time format string to one that R can parse
-A character vector of EML date/time format strings. This function understands the following codes: YYYY = four digit year, YY = two digit year, MMM = three letter month abbrev., MM = two digit month, DD = two digit day, hh or HH = 24 hour time, mm = minutes, ss or SS = seconds, +/-hhmm, +/-hh:mm, or +/-hh = UTC offset.
Should a "Z" at the end of the format string (indicating UTC) be replaced by a "%z"? Only set to TRUE if you plan to use fix_utc_offset to change "Z" in datetime strings to "+0000".
convert_datetime_format() is not a sophisticated function. If the EML format string is not valid, it will happily and without complaint return an R format string that will break your code. You have been warned. Note that UTC offset formats using a colon or only two digits will be parsed by this function, but if parsing datetime values from strings, you will also need to use fix_utc_offset to change the UTC offsets to the +/-hhmm format that R can read.
convert_datetime_format("MM/DD/YYYY")
-#> [1] "%m/%d/%Y"
-convert_datetime_format(c("MM/DD/YYYY", "YY-MM-DD"))
-#> [1] "%m/%d/%Y" "%y-%m-%d"
-
-
-
convert_long_2_utm() was deprecated in favor of get_utm_zone() as the
-new function name more accurately reflects what the function does.</p>
-<p><code>convert_long_to_utm()</code> takes a longitude coordinate and returns the
-corresponding UTM zone.
R/geography.R
- convert_utm_to_ll.Rd
-
convert_utm_to_ll() was superseded in favor of generate_ll_from_utm() to
-support and encourage including zone and datum columns in datasets. generate_ll_from_utm()
-also adds the ability to specify the coordinate reference system for lat/long coordinates,
-and accepts column names either quoted or unquoted for better compatibility with
-tidyverse piping.
-convert_utm_to_ll() takes your dataframe with UTM coordinates
-in separate Easting and Northing columns, and adds on an additional two
-columns with the converted decimalLatitude and decimalLongitude coordinates
-using the reference coordinate system WGS84. You may need to turn the VPN OFF
-for this function to work properly.
The dataframe with UTM coordinates you would like to convert. -Input the name of your dataframe.
The name of your Easting UTM column. Input the name in -quotations, ie. "EastingCol".
The name of your Northing UTM column. Input the name in -quotations, ie. "NorthingCol".
The UTM Zone. Input the zone number in quotations, ie. "17".
The datum used in the coordinate reference system of your -coordinates. Input in quotations, ie. "WGS84"
The function returns your dataframe, mutated with an additional two -columns of decimal Longitude and decimal Latitude.
-Define the name of your dataframe, the easting and northing columns -within it, the UTM zone within which those coordinates are located, and the -reference coordinate system (datum). UTM Northing and Easting columns must be -in separate columns prior to running the function. If a datum is not defined, -the function will default to "WGS84". If there are missing coordinates in -your dataframe they will be preserved, however they will be moved to the end -of your dataframe. Note that some parameter names are not in snake_case but -instead reflect DarwinCore naming conventions.
-R/datastore_interactions.R
- create_datastore_script.RdGiven a GitHub owner ("nationalparkservice") and public repo ("EMLeditor"), the function uses the GitHub API to access the latest release version on GitHub and generate a corresponding draft public Script reference on DataStore.
-WARNING: if you are not an author of the repo on GitHub, you should probably NOT be the one adding it to DataStore unless you have very good reason. If you want to cite a GitHub release/repo and need a DOI, contact the repo maintainer and suggest they use this function to put it on DataStore for you.
-The function searches DataStore for references with a similar title (where the title is repo + release tag). If force = FALSE and there are similarly titled references, the function will return a list of them and ask if the user really wants a new DataStore reference generated. Assuming yes (or if there are no existing DataStore references with a similar title or if force = TRUE), the function will:
download the .zip of the latest GitHub release for the repo,
initiate a draft reference on DataStore,
give the draft reference a title (repo + release tag),
upload the .zip from GitHub
add a web link to the release on GitHub.
add the items listed under GitHub repo "Topics" as keywords to the DataStore Script reference
Set for or by NPS flag
Set the issued date
If you indicate it is an R package, the authors, steward, description, and other fields will be filled out on the Core tab -#' -The user will still need to go access the draft Script reference on DataStore to fill in the remaining fields (which are not accessible via API and so cannot be automated through this function) and activate the reference (thereby generating and registering a citeable DOI).
If the Reference is a version of an older reference, the user will have to access the older version and indicate that it is an older version of the current Reference. The user will also have to manually add the new Reference to a Project for the repo, if desired.
-String. The owner of the account where the GitHub repo resides. For example, "nationalparkservice"
String. The repo with a release that should be turned into a DataStore Script reference. For example, "EMLeditor"
String. Can be one of three values: generic_script, R, or python. Defaults to "generic_script".
String. The location where the release .zip from GitHub should be downloaded to (and uploaded from). Defaults to the working directory of the R Project (i.e. here::here()).
Logical. Defaults to FALSE. In the default status the function has a number of interactive components, such as searching DataStore for similarly titled References and asking if a new Reference is really what the user wants. When set to TRUE, all interactive components are turned off and the function will proceed unless it hits an error. Setting force = TRUE may be useful for scripting purposes.
Logical. Defaults to FALSE. In the default status, the function generates and populates a new draft Script reference on the DataStore production server. If set to TRUE, the draft Script reference will be generated and populated on the DataStore development server. Setting dev = TRUE may be useful for testing the function without generating excessive references on the DataStore production server.
Logical. Was the code, script, or software created either for or by NPS? Defaults to TRUE.
The "chunk" size to break the file into for upload. If your network is slow and your uploads are failing, try decreasing this number (e.g. 0.5 or 0.25).
How many times to retry uploading a file chunk if it fails on the first try.
-
-Given a file name (.csv only) and path, the function will search the
-columns for any that contain multiple user-specified missing value codes.
-For any column with multiple missing value codes, all the missing values
-(including blanks) will be replaced with NA. A new column will be generated
-and populated with the given missing value code from the original column.
-Values that were not missing will be populated with "not_missing". The
-newly generated column of categorical variables can be used to describe
-the various reasons why data are absent in the original column.</p>
The function will then write the new dataframe to a file, overwriting the -original file. If it is important to keep a copy of the original file, make -a copy prior to running the function.
-WARNING: this function will replace any blank cells in your data with NA!
-document_missing_values(
- file_name,
- directory = here::here(),
- colname = NA,
- missing_val_codes = NA,
- replace_value = NA
-)String. The name of the file to inspect
String. Location of file to read/write. Defaults to the current working directory.
String. The columns to inspect. CURRENTLY ONLY WORKS AS SET TO DEFAULT "NA".
List. A list of strings containing the missing value code or codes to search for.
String. The value (singular) to replace multiple missing values with. Defaults to NA.
UTC offsets can be formatted in multiple ways (e.g. -07, -07:00, -0700) and R often struggles to parse these offsets. This function takes date/time strings with valid UTC offsets, and formats them so that they are consistent and readable by R. Here, you can supply a vector of dates in ISO 8601 format and they will be returned in a consistent format compatible with R. Date strings with missing or invalid UTC offsets will result in a warning.
-datetime_strings with UTC offsets consistently formatted to four digits (e.g. "2023-11-16T03:32:49-0700").
-datetimes <- c("2023-11-16T03:32:49+07:00", "2023-11-16T03:32:49-07",
-"2023-11-16T03:32:49","2023-11-16T03:32:49Z")
-fix_utc_offset(datetimes)
-#> Warning: Date strings contain missing or invalid UTC offsets
-#> [1] "2023-11-16T03:32:49+0700" "2023-11-16T03:32:49-0700"
-#> [3] "2023-11-16T03:32:49" "2023-11-16T03:32:49+0000"
-
-R/geography.R
- fuzz_location.Rdfuzz_location() "fuzzes" a specific location to something less
-precise prior to public release of information about sensitive resources for
-which data are not to be released to the public. This function takes
-coordinates in either UTM or decimal degrees, converts to UTM (if in decimal
-degrees), creates a bounding box based on rounding of UTM coordinates, and
-then creates a polygon from the resultant points. The function returns a
-string in Well-Known-Text format.
The latitude in either UTMs or decimal degrees.
The longitude in either UTMs or decimal degrees
The EPSG coordinate system of the latitude and -longitude coordinates. Either 4326 for decimal degrees/WGS84 datum, 4269 for -decimal degrees/NAD83, or 326xx for UTM/WGS84 datum (where the xx is the -northern UTM zone). For example 32616 is for UTM zone 16N.
Use "Fuzzed - 10km", "Fuzzed - 1km", or "Fuzzed - 100m"
R/geography.R
- generate_ll_from_utm.Rdgenerate_ll_from_utm() takes your dataframe with UTM coordinates
-in separate Easting and Northing columns, and adds on an additional two
-columns with the converted decimalLatitude and decimalLongitude coordinates
-using the reference coordinate system NAD83. Your data must also contain columns
-specifying the zone and datum of your UTM coordinates.
-In contrast to convert_utm_to_ll() (superseded), generate_ll_from_utm() requires
-zone and datum columns. It supports quoted or unquoted column names and a user-specified datum for lat/long
-coordinates. It also adds an extra column to the output data table that documents the
-lat/long coordinate reference system.
generate_ll_from_utm(
- df,
- EastingCol,
- NorthingCol,
- ZoneCol,
- DatumCol,
- latlong_datum = "NAD83"
-)The dataframe with UTM coordinates you would like to convert. -Input the name of your dataframe.
The name of your Easting UTM column. You may input the name -with or without quotations, ie. EastingCol and "EastingCol" are both valid.
The name of your Northing UTM column. You may input the name -with or without quotations, ie. NorthingCol and "NorthingCol" are both valid.
The column containing the UTM zone, with or without quotations.
The column containing the datum for your UTM coordinates, -with or without quotations.
The datum to use for lat/long coordinates. Defaults to NAD83.
The function returns your dataframe, mutated with an additional two -columns of decimalLongitude and decimalLatitude plus a column LatLong_CRS containing -a PROJ string that specifies the coordinate reference system for these data.
-Define the name of your dataframe, the easting and northing columns -within it, the UTM zone within which those coordinates are located, and the -reference coordinate system (datum). UTM Northing and Easting columns must be -in separate columns prior to running the function. If a datum for the lat/long output -is not defined, the function will default to "NAD83". If there are missing coordinates in -your dataframe they will be preserved, however they will be moved to the end -of your dataframe. Note that some parameter names are not in snake_case but -instead reflect DarwinCore naming conventions.
-This function uses tidy evaluation (i.e. you can provide column name arguments -as strings or you can leave them unquoted). If you wish to store column names -as strings in variables, you must enclose the variables in double curly braces -when you pass them into the function. See code examples below.
-if (FALSE) { # \dontrun{
-
-# Using magrittr pipe (%>%) and unquoted column names
-my_dataframe %>%
-generate_ll_from_utm(
- EastingCol = UTM_X,
- NorthingCol = UTM_Y,
- ZoneCol = Zone,
- DatumCol = Datum
-)
-
-# Providing column names as strings (in quotes)
-generate_ll_from_utm(
- df = mydataframe,
- EastingCol = "EastingCoords",
- NorthingCol = "NorthingCoords",
- ZoneCol = "zone",
- DatumCol = "datum",
- latlong_datum = "WGS84"
-)
-
-# Column names stored as strings in separate variables
-easting <- "EastingCoords"
-northing <- "NorthingCoords"
-zonecol <- "zone"
-datumcol <- "datum"
-latlong_dat <- "WGS84"
-
-generate_ll_from_utm(
- df = mydataframe,
- EastingCol = {{easting}}, # enclose variables that store column names in {{}}
- NorthingCol = {{northing}},
- ZoneCol = {{zonecol}},
- DatumCol = {{datumcol}},
- latlong_datum = latlong_dat # this isn't a column name so it doesn't need {{}}
-)
-
-} # }
-R/summarize_qc_flags.R
- get_custom_flags.Rd
-get_custom_flags returns data frames that summarize data
-quality control flags (one that summarizes at the data file level and one for each column). The summaries include all data
-with quality control flagging (a column name that ends in "_flag") and
-optionally any additional custom columns the user specifies, either by column
-name or number.
The user can specify which of the 2 data frames (or all as a list of -dataframes) should be returned.
-The number of each flag type for each column (A, AE, R, P) is reported. -Unflagged columns are assumed to have only accepted (or missing) data. The -total number of data points in the specified columns (and data flagging -columns) for each .csv is also reported. NAs are considered missing data. An -Unweighted Relative Response (RRU) is calculated as the total number of -accepted data points (A, AE, and data that are not flagged) divided by the -total number of data points (excluding missing values) in all specified -columns (and the flagged columns).
-is the path to the data package .csv files (defaults to the -current working directory).
A comma delimited list of column names. If left unspecified, -defaults to just flagged columns.
A string indicating what output should be provided. "columns" -returns a summary table of QC flags and RRU values in each specified column -for every data file. "files" returns a summary table of total QC flags and -mean across each data file. "all" will return all three -data frames in a single list.
a dataframe with quality control summary information summarized at -the specified level(s).
-Flagged columns must have names ending in "_flag". Missing values -must be specified as NA. The function counts cells within "*_flag" columns -that start with one of the flagging characters (A, AE, R, P) and ignores -trailing characters and white spaces. For custom columns that do not include -a specific flagging column, all non-missing (non-NA) values are considered -Accepted (A).
-The intent of get_custom_flags is for integration into reports on data -quality, such as Data Release Reports (DRRs).
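The flag counting and RRU arithmetic described above can be sketched for a single flag column. This is a minimal illustration under stated assumptions, not the package code: `count_flags()` and `rru()` are hypothetical helpers, the input is a character vector, and every non-missing value is counted in the denominator.

```r
# Hypothetical helpers (not part of QCkit) for one "*_flag" column.
count_flags <- function(flags) {
  f <- trimws(flags[!is.na(flags)])   # NAs are treated as missing data
  c(
    A  = sum(startsWith(f, "A") & !startsWith(f, "AE")),
    AE = sum(startsWith(f, "AE")),
    R  = sum(startsWith(f, "R")),
    P  = sum(startsWith(f, "P"))
  )
}

# Unweighted Relative Response: accepted (A + AE) over all non-missing values
rru <- function(flags) {
  n <- count_flags(flags)
  unname((n["A"] + n["AE"]) / sum(!is.na(flags)))
}

flags <- c("A", "AE", "R", "P", "A_note", NA, "A ")
count_flags(flags)
rru(flags)
```

Matching on the leading characters is what lets values such as "A_note" count as accepted while trailing text is ignored.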
-R/summarize_qc_flags.R
- get_dc_flags.Rd
-get_dc_flags (dc=data columns) returns a data frame that, for
-each data file in a data package lists the name of each data flagging column
-and the number of each flag type within that column (A, AE, R, P) as well as
-the total number of data points in the data flagging columns for each .csv,
-excluding NAs. Unweighted Relative Response (RRU) is calculated as the total
-number of accepted data points (A, AE, and data that are not flagged).
get_dc_flags(directory = here::here())a dataframe named dc_flag that contains a row for each .csv file in -the directory with the file name, the count of each flag and total number of -data points in each .csv (including data flagging columns).
-The function can be run from within the working directory where the -data package is, or the directory can be specified. The function only -supports .csv files and assumes that all data flagging columns have column -names ending in "_flag". It counts cells within those columns that start with -one of the flagging characters (A, AE, R, P) and ignores trailing characters -and whitespaces.
-if (FALSE) { # \dontrun{
-get_dc_flags("~/my_data_package_directory")
-get_dc_flags() # if your current working directory IS the data package
-# directory.
-# ->
-get_custom_flags(output="columns")
-} # }
-
-R/summarize_qc_flags.R
- get_df_flags.Rd
-get_df_flags (df = data files) returns a data frame that lists
-the number of cells in each data file in the entire data package (excluding
-NAs) with relevant flags (A, AE, R, P) as well as the total number of data
-points in each .csv (including data flagging columns, but excluding NAs).
-Unweighted Relative Response (RRU) is calculated as the total number of
-accepted data points (A, AE, and data that are not flagged).
get_df_flags(directory = here::here())a dataframe named df_flag that contains a row for each .csv file in -the directory with the file name, the count of each flag and total number of -data points in each .csv (including data flagging columns).
-The function can be run from within the working directory where the -data package is, or the directory can be specified. The function only -supports .csv files and assumes that all .csv files in the folder are part -of the data package. It also assumes that the values A, AE, R, and P have -only been used for flagging. It assumes that there are no additional -characters in the flagging cells (such as leading or trailing white spaces).
-if (FALSE) { # \dontrun{
-get_df_flags("~/my_data_package_directory")
-get_df_flags() # if your current working directory IS the data package
-# directory.
-# ->
-get_custom_flags(output="files")
-} # }
-
-R/summarize_qc_flags.R
- get_dp_flags.Rd
-get_dp_flags (dp=data package) returns a data frame that list
-the number of cells in the entire data package with relevant flags (A, AE,
-R, P) as well as the total number of non-NA cells in the data package
-(including data flagging columns). Unweighted Relative Response (RRU) is
-calculated as the total number of accepted data points (A, AE, and data that
-are not flagged).
get_dp_flags(directory = here::here())a dataframe named dp_flag that contains the four flags, the count of -each flag and total number of data points in the entire data package.
-The function can be run from within the working directory where the -data package is, or the directory can be specified. The function only -supports .csv files and assumes that all .csv files in the folder are part -of the data package. The function counts cells within "*_flag" -columns that start with one of the flagging characters (A, AE, R, P) and -ignores trailing characters and whitespaces. NAs are assumed to be empty -cells or missing data.
-if (FALSE) { # \dontrun{
-get_dp_flags("~/my_data_package_directory")
-get_dp_flags() # if your current working directory IS the data package
-# directory.
-# ->
-get_custom_flags(output="package")
-} # }
-
-get_elevation() takes a dataframe that includes GPS coordinates (in decimal degrees) and returns a dataframe with two new columns added to it, minimumElevationInMeters and maximumElevationInMeters. The function requires that the data supplied are numeric and that missing values are specified with NA.
get_elevation(
- df,
- decimal_lat,
- decimal_long,
- spatial_ref = c(4326, 102100),
- force = FALSE
-)a data frame containing GPS decimal coordinates for individual points with latitude and longitude in separate columns.
String. The name of the column containing latitudes
String. The name of the column containing longitudes
Categorical. Defaults to 4326. Can also be set to 102100.
Logical. Defaults to FALSE. Returns verbose comments, interactions, and information. Set to TRUE to remove all interactive components and reduce/remove all comments and informative print statements.
a data frame with two new columns, minimumElevationInMeters and maximumElevationInMeters
-get_elevation() uses the USGS API for The National Map to identify the elevation for a given set of GPS coordinates. To reduce API queries (and time to completion), the function will only search for unique GPS coordinates in your dataframe. This could take some time. If you have lots of GPS coordinates, you can also perform a manual bulk upload (maximum = 500 points).
Note that both new columns (minimumElevationInMeters and maximumElevationInMeters) contain the same elevation; this is expected behavior as a single GPS coordinate should have the same maximum and minimum elevations. The column names are generated in accordance with the simple Darwin Core Standards.
-Points outside of the US may return NA values as they are not part of The National Map.
-R/geography.R
- get_park_polygon.Rd
-get_park_polygon() retrieves a geoJSON string for a polygon of
-a park unit. This is not the official boundary. Note that the REST API call returns the default "convexHull". This will work better or worse for some parks, depending on the park shape/geography/number of disjunct areas.
-#'
get_taxon_rank() generates a new column in your selected data set called taxonRank that will show the taxonomic rank of the most specific name in the given scientific name column. This is a required column in the Simple Darwin Core rule set and guidelines. This function will be useful in creating and auto populating a required Simple Darwin Core field.
The function returns a new column in the given data frame named taxonRank with the taxonomic rank of the corresponding scientific name in each column. If there is no name in a row, then it returns as NA for that row.
-Define your species data set name and the column name with the scientific names of your species (if you are following a Simple Darwin Core naming format, this column should be scientificName, but any column name is fine).
-The function will read the various strings in your species name column and identify them as either a family, genus, species, or subspecies. This function only works with cleaned and parsed scientific names. If the scientific name is higher than family, the function will not work correctly. Subfamily and Tribe names (which, similar to family names end in "ae*") will be designated Family.
-get_utm_zone() replaces convert_long_2_utm() as this
-function name is more descriptive. get_utm_zone() takes a longitude
-coordinate and returns the corresponding UTM zone.
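The longitude-to-zone mapping follows the standard 6-degree UTM grid, and a minimal sketch of it looks like this. `utm_zone_sketch()` is a hypothetical name rather than the package function, and the sketch ignores the special zone boundaries around Norway and Svalbard.

```r
# Hypothetical sketch (not the QCkit implementation): standard UTM zone
# number for a longitude in decimal degrees.
utm_zone_sketch <- function(long) {
  # zones are 6 degrees wide and numbered eastward from 180 degrees west;
  # the modulo wraps a longitude of exactly 180 back to zone 1
  (floor((long + 180) / 6) %% 60) + 1
}

utm_zone_sketch(-105.5)        # a longitude in Colorado falls in zone 13
utm_zone_sketch(c(-177, 0))    # vectorized over longitudes
```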
check_dc_cols()
-
- check_te()
-
- convert_datetime_format()
-
- convert_long_to_utm()
- deprecated
- convert_utm_to_ll()
- superseded
- create_datastore_script()
-
- document_missing_values()
- experimental
- fix_utc_offset()
-
- fuzz_location()
-
- generate_ll_from_utm()
-
- get_custom_flags()
- experimental
- get_elevation()
-
- get_park_polygon()
-
- get_taxon_rank()
-
- get_user_orcid()
-
- get_utm_zone()
-
- order_cols()
-
- remove_empty_tables()
-
- replace_blanks()
-
- unit_codes_to_names()
-
- validate_coord()
-
- validate_coord_list()
-
-
-
long2UTM was deprecated in favor of convert_long_to_utm() to enforce a
-consistent function naming pattern across the package and to conform to the
-tidyverse style guide.
long2UTM() takes a longitude coordinate and returns the corresponding UTM
-zone.
order_cols() checks and orders columns with TDWG Darwin Core naming standards and custom names in a dataset.
The function returns a list of required and suggested columns to include in your dataset. When assigning to an object, the object contains your new dataset with all columns ordered properly.
Check to see if you have three (highly) recommended columns (locality, type, basisOfRecord) and various suggested columns present in your dataset. Print a list of which columns are present and which are not. Then, order all the columns in your dataset in the following order: (highly) recommended columns, suggested columns, the rest of the Darwin Core columns, "custom_" (non-Darwin Core) columns, and finally sensitive species data columns.
-Any columns that are not darwinCore term names, do not start with "custom_" or are not "scientificName_flag" will be placed after the darwinCore columns and before the "custom_" columns.
-One exception is if your dataset includes the column custom_TaxonomicNotes, it will be placed directly after namePublishedIn, if that column exists.
-Suggested darwinCore column names (plus scientificName_flag) include (in the order they will be placed): eventDate, eventDate_flag, scientificName, scientificName_flag, taxonRank, verbatimIdentification, vernacularName, namePublishedIn, recordedBy, individualCount, decimalLongitude, decimalLatitude, coordinate_flag, geodeticDatum, verbatimCoordinates, verbatimCoordinateSystem, verbatimSRS, coordinateUncertaintyInMeters. Note that suggested names include some custom, non-Darwin Core names such as "scientificName_flag".
-Sensitive species data columns are defined as: informationWithheld, dataGeneralizations, and footprintWKT.
-Remove empty tables from a list
-R/replace_blanks.R
- replace_blanks.Rd
-replace_blanks() is particularly useful for exporting data
-from a database (such as Access) and converting it to a data package with
-metadata.
replace_blanks() will import all .csv files in the specified working
-directory. The files are then written back out to the same directory,
-overwriting the old .csv files. Any blank cells (or cells with "NA" in the
-original .csv files) will be replaced with the specified string or integer.
-If no missing value is specified, the function defaults to replacing all
-blanks with "NA".
Please keep in mind that "missing" is a general term for all data -not present in the data file or data package. Although you may have a very -good reason for not providing data and that data may not, from the data -package creator's perspective, be "missing" (maybe you never intended to -collect it), from a data package user's perspective any data that is not in -the data package is effectively "missing" from the data package. Therefore, -it is critical to document in metadata any data that are absent with an -appropriate "missingValueCode" and "missingValueDefinition". These terms are -defined by the metadata schema and are broadly used to apply to any data not -present.
-This function will replace all empty cells and all cells with NA with a -"missingValueCode" of your choice (although it defaults to NA).
-replace_blanks(directory = here::here(), missing_val_code = NA)One exception is if a .csv contains NO data (i.e. just column names -and no data in any of the cells). In this case, the blanks will not be -replaced with NA (as the function cannot determine how many NAs to include).
- if (FALSE) { # \dontrun{
-#replaces all blank cells in .csvs in the current directory with NA:
- replace_blanks()
-
-#replace all blank cells in .csvs in the directory ./test_data with "NODATA"
- dir <- here::here("test_data")
- replace_blanks(directory = dir, missing_val_code = "NODATA")
-
-#replace all blank cells in .csvs in the current directory with -99999
-replace_blanks(missing_val_code = -99999)
-} # }
-This function has been deprecated in favor of check_te(). The function name was changed to promote consistency in function naming across the package and to conform with tidyverse style guides. te_check() is no longer updated and may not reference the latest version of the federal endangered and threatened species listings.
te_check() generates a list of species you should consider removing from your dataset before making it public by matching the scientific names within your data set to the Federal Conservation List. te_check() should be considered a helpful tool for identifying federally listed endangered and threatened species in your data. Each National Park has a park-specific Protected Data Memo that outlines which data should be restricted. Threatened and endangered species are often - although not always - listed on these Memos. Additional species (from state conservation lists) or non-threatened and non-endangered species of concern or other biological or non-biological resources may be listed on Memos. Consult the relevant park-specific Protected Data Memo prior to making decisions on restricting or releasing data.
The name of your data frame containing species observations
The name of the column within your data frame containing the scientific names of the species (genus and specific epithet).
A four letter park code. Or a list of park codes.
Logical. Defaults to FALSE. The default setting will return only exact matches between the scientific binomial (genus and specific epithet) in your data set and the federal match list. Setting expansion = TRUE will expand the list of matches to return all species (and subspecies) from the match list that match any genera listed in your data set, regardless of whether a given species is actually in your data set. An additional column indicating whether the species returned is in your data set ("In your Data") or has been expanded to ("Expansion") is generated.
The function returns a (modified) data frame with the names of all the species that fall under the federal conservation list. The resulting data frame may have multiple instances of a given species if it is listed in multiple parks (park codes for each listing are supplied). Technically it is a huxtable, but it should function identically to a data frame for downstream purposes.
-Define your species data set name, column name with the scientific names of your species, and your four letter park code.
-The te_check() function downloads the Federal Conservation list using the IRMA odata API service and matches this species list to the list of scientific names in your data frame. Keep in mind that this is a Federal list, not a state list. Changes in taxa names may also cause some species to be missed. Because the odata API service is not publicly available, you must be logged in to the NPS VPN or in the office to use this function.
For the default, expansion = FALSE, the function will perform an exact match between the taxa in your scientificName column and the federal Conservation List and then filter the results to keep only species that are listed as endangered, threatened, or considered for listing. If your scientificName column contains information other than the binomial (genus and species), no matches will be returned. For instance, if you have an Order or just a genus listed, these will not be matched to the Federal Conservation List.
-If you set expansion = TRUE, the function will truncate each item in your scientificName column to the first word in an attempt to extract a genus name. If you only have genera listed, these will be retained. If you have higher-order taxa listed, such as Family, Order, or Phyla, again the first word will be retained. This first word (typically a genus) will be matched to just the generic name of species from the Federal Conservation List. All matches, regardless of listing status, are retained. The result is that for a given species in your scientificName column, all species within that genus that are on the Federal Conservation List will be returned (along with their federal conservation listing codes and a column indicating whether the species is actually in your data or is part of the expanded search).
-if (FALSE) { # \dontrun{
-#for individual parks:
-te_check(x = my_species_dataframe, species_col = "scientificName", park_code = "BICY")
-list<-te_check(data, "scientificName", "ROMO", expansion=TRUE)
-# for a list of parks:
-park_code<-c("ROMO", "YELL", "SAGU")
-list<-te_check(data, "scientificName", park_code, expansion=TRUE)
-} # }
-
-R/unit_code_names.R
- unit_codes_to_names.Rd
-unit_codes_to_names() takes a single unit code or vector of unit codes and returns a data frame of full unit names using a public IRMA API. For example, if given the code "ROMO" the function will return "Rocky Mountain National Park".
if (FALSE) { # \dontrun{
- unit_codes_to_names("ROMO")
- unit_codes_to_names(c("ROMO", "GRYN"))
- } # }
-R/geography.R
- utm_to_ll.Rd
-
utm_to_ll() was deprecated in favor of convert_utm_to_ll() to enforce a consistent naming scheme for functions across the package and to conform with the tidyverse style guide.
utm_to_ll takes your dataframe with UTM coordinates in separate Easting and Northing columns, and adds on an additional two columns with the converted decimalLatitude and decimalLongitude coordinates using the reference coordinate system WGS84.
-The dataframe with UTM coordinates you would like to convert. Input the name of your dataframe.
The name of your Easting UTM column. Input the name in quotations, ie. "EastingCol".
The name of your Northing UTM column. Input the name in quotations, ie. "NorthingCol".
The UTM Zone. Input the zone number in quotations, ie. "17".
The datum used in the coordinate reference system of your coordinates. Input in quotations, ie. "WGS84"
The function returns your dataframe, mutated with an additional two columns of decimal Longitude and decimal Latitude.
-Define the name of your dataframe, the easting and northing columns within it, the UTM zone within which those coordinates are located, and the reference coordinate system (datum). UTM Northing and Easting columns must be in separate columns prior to running the function. If a datum is not defined, the function will default to "WGS84". If there are missing coordinates in your dataframe they will be preserved, however they will be moved to the end of your dataframe. Note that some parameter names are not in snake_case but instead reflect DarwinCore naming conventions.
-R/geography.R
- validate_coord.Rd
-validate_coord() compares a coordinate pair (in decimal
-degrees) to the polygon for a park unit as provided through the NPS
-Units rest services. The function returns a value of TRUE or FALSE.
R/geography.R
- validate_coord_list.Rd
-This function can take a list of coordinates and park units as input. In
-addition to being vectorized, depending on the park borders, it can be a
-major improvement on validate_coord().
numeric. An individual or vector of numeric values representing -the decimal degree latitude of a coordinate
numeric. An individual or vector of numeric values representing -the decimal degree longitude of a coordinate
String. Or list of strings each containing the four letter -park unit designation
if (FALSE) { # \dontrun{
-x <- validate_coord_list(lat = 105.555, long = -47.4332, park_units = "DRTO")
-
-# or a dataframe with many coordinates and potentially many park units:
-x <- validate_coord_list(lat = df$decimalLatitude,
- lon = df$decimalLongitude,
- park_units = df$park_units)
-# you can then merge it back in to the original dataframe:
-df$test_GPS_coord <- x
-} # }
-